ARM plans upgrades as it marks 30 years of atmospheric data collection
Newswise – As the Department of Energy’s Atmospheric Radiation Measurement (ARM) user facility celebrates 30 years of continuous measurements of the Earth’s atmosphere this year, Oak Ridge National Laboratory’s ARM Data Center is leading changes in its operations to make that treasure trove of data accessible and useful to scientists studying the Earth’s climate around the world.
The observations, comprising more than 3.3 petabytes of data so far, begin with raw data from more than 460 instruments around the world. Observational measurements include daily records of temperature, wind speed, humidity, cloud cover, atmospheric particles called aerosols, and dozens of other atmospheric processes that are critically important to weather and climate.
The ARM data center team refines the data to make it more useful to researchers and ensures its quality. In some cases, experts use this processed data to create high-end data products that enhance high-resolution models.
Over the past 30 years, the multi-lab ARM facility has amassed more than 11,000 data products. Those 3.3 petabytes represent the capacity of about 50,000 smartphones, at 64 gigabytes per phone. With so much data available, ARM is taking steps over the next decade to upgrade its field measurements, data analytics, data-model interoperability, and data services. The upgrades and aspirations are outlined in a 31-page 10-year vision document released last year.
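The smartphone comparison checks out with quick back-of-the-envelope arithmetic (using the article’s figures and decimal storage units):

```python
# Sanity check of the smartphone comparison, using figures from the article.
total_pb = 3.3                   # ARM archive size so far, in petabytes
phone_gb = 64                    # assumed smartphone capacity, in gigabytes
total_gb = total_pb * 1_000_000  # 1 PB = 1,000,000 GB (decimal units)
phones = total_gb / phone_gb
print(f"{phones:,.0f} phones")   # a little over 51,000 -- "about 50,000 smartphones"
```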
ARM’s head of data services, Giri Prakash, said when he started at ORNL in 2002, ARM had about 16 terabytes of observational data stored.
“I considered it big data,” he said.
In 2010, the total was 200 terabytes. In 2016, ARM reached one petabyte of data.
Collecting those first 16 terabytes took almost 10 years. Today, ARM, a DOE Office of Science user facility supported by nine national laboratories, collects that much data about every six days. Its data trove is growing at a rate of one petabyte per year.
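The six-day figure follows directly from the growth rate the article cites:

```python
# At one petabyte per year, how long does ARM take to collect 16 terabytes --
# the total from its entire first decade?
pb_per_year = 1.0                      # current growth rate, petabytes/year
tb_per_day = pb_per_year * 1000 / 365  # 1 PB = 1,000 TB
days_for_16_tb = 16 / tb_per_day
print(f"{days_for_16_tb:.1f} days")    # about 5.8 days, i.e. "about every six days"
```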
Prakash attributes this meteoric rise to more complex data, more sophisticated instruments, higher-resolution measurements (mostly from radar), more field campaigns and more high-resolution models.
Rethinking data management
How to manage all this data?
“We had to completely rethink our approach to data management and redesign a lot of it from scratch,” Prakash said. “We need an end-to-end data services competency to further streamline and automate the data process. We’ve refreshed nearly 70 data processing tools and workflows over the past four years.”
This effort brought recognition. Since 2020, the ARM data center has been recognized as a CoreTrustSeal repository, has been named a DOE Office of Science PuRe (Public Reusable Research) data resource, and has become a member of the World Data System.
All of these important professional credentials require a rigorous review process.
“ARM is special,” said Prakash, who represents the United States on the Data Committee of the International Science Council. “We have a robust and operationally mature data service that allows us to process quality data and deliver it to users.”
ARM measurements, free to researchers around the world, come continuously from field instruments at six fixed and mobile observatories. The instruments operate in climate-critical regions around the world.
Jim Mather, ARM technical director at Pacific Northwest National Laboratory, said that as part of the 10-year vision, increasingly complex ARM data will be supported by increasingly sophisticated data management practices, hardware and software.
Data services, “as the name suggests,” Mather said, “are in direct service to enable data analysis.”
This service includes different types of ARM assets, he said, including physical infrastructure, software tools, and new policies and frameworks for software development.
Meanwhile, Prakash adds, ARM uses FAIR guidelines for its data management and stewardship. FAIR stands for Findability, Accessibility, Interoperability and Reuse. Adhering to FAIR principles helps ensure that data is findable and useful for reproducible research, as scientists increasingly rely on digitization of data and artificial intelligence.
One of the steps in ARM’s decade-long metamorphosis will be to improve its operational and research computing infrastructure. Larger compute, memory and storage assets will make it easier to couple large datasets – from scanning radars, for example – with high-resolution models. Greater computing power and new software tools will also support machine learning and other techniques required by big data science.
The ARM data center already supports the computing and data access needs of the user facility. But the data center is being expanded to bolster its current mix of high-performance and cloud-computing resources by providing seamless access to data and computing.
Mather described the challenge: ARM has more than 2,500 active data streams from its hundreds of instruments. Add the pressure of those streams to the task of managing petabytes of stored information, and processing bottlenecks become possible. Volumes like this could make it harder to make scientific progress with ARM data.
To circumvent this, in the area of hardware, Mather said, ARM will provide “more powerful computing services” for data processed and stored at the ARM data center.
The need continues to grow
Some of this increased computing power has come online over the past few years to support a new ARM modeling framework in which large-eddy simulations, or LES, require a lot of computing power.
So far, the LES ARM Symbiotic Simulation and Observation activity, or LASSO, has created an extensive library of simulations informed by ARM data. For atmospheric researchers, these carefully filtered and streamlined datasets serve as proxies for the atmosphere. For example, they make it easier to test the accuracy of climate models.
Designed in 2015, LASSO initially focused on shallow cumulus clouds. Data bundles are now being developed for a deep-convection scenario; some of this data was made available through a beta release in May 2022.
Still, “the need continues to grow” for more computing power, Mather said. “Looking forward, we need to continuously assess the scale and nature of IT needs.”
ARM has a new Cumulus high-performance computing cluster at the Oak Ridge Leadership Computing Facility, which provides more than 16,000 processing cores to ARM users. The average laptop has four to six cores.
If needed, ARM users can request more computing power from other DOE facilities, such as the National Energy Research Scientific Computing Center. Access to external cloud computing resources is also available through the DOE.
Prakash envisions a menu of user-friendly tools, including Jupyter Notebook, available for ARM users to work with ARM data. The tools are designed so users can work from a laptop or workstation while accessing petabytes of ARM data.
Prakash said, “Our goal is to deliver ARM data wherever computing power is available.”
Developing a data workbench
“Software tools are also essential,” Mather said. “We expect single-case deep convection (LASSO) simulations to be on the order of 100 terabytes each. Exploiting this data will require sophisticated tools to visualize, filter and manipulate it.”
Imagine, for example, he said, trying to visualize LASSO’s convective cloud fields in three dimensions. It’s a big software challenge.
Such challenges require more engagement than ever with the atmospheric research community to identify the right software tools.
Greater engagement helped shape the 10-year vision document. Mather relied on workshops and direct contact with users and staff to gather insights on increasing ARM’s scientific impact.
Given the growth in data volume, there was a clear need to give a wider audience of data users even more seamless access to ARM data center resources. Users already have access to ARM’s data, analytics, computing resources and databases, and can select data by date range or by conditional statements.
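As an illustration of that kind of selection (this is not ARM’s actual query interface; the dataset, column names and threshold below are hypothetical, and ARM data are typically distributed as netCDF rather than tables):

```python
import pandas as pd

# Hypothetical hourly observations, indexed by timestamp.
obs = pd.DataFrame(
    {"temperature_c": [21.5, 23.1, 19.8, 25.0],
     "wind_speed_ms": [3.2, 7.5, 5.1, 9.8]},
    index=pd.to_datetime(
        ["2022-06-01 00:00", "2022-06-01 01:00",
         "2022-06-02 00:00", "2022-06-02 01:00"]),
)

# Select by date range: all rows from June 1.
june_1 = obs.loc["2022-06-01"]

# Then apply a conditional statement: keep only the windy hours.
windy = june_1[june_1["wind_speed_ms"] > 5.0]
print(windy)
```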
For deeper access, ARM is developing an ARM Data Workbench.
Prakash envisions the workbench as an extension of the current data discovery interface, one that will “deliver transformative knowledge discovery” by offering an integrated ecosystem of data and computing. This would allow users to discover data of interest using advanced data queries, then perform advanced analysis using ARM’s vast data holdings along with software tools and computing resources.
The workbench will allow users to exploit open-source visualization and analysis tools. Open-source code, free to everyone, can also be redistributed or modified. Users could also draw on technologies such as Apache Cassandra or Apache Spark for large-scale data analysis.
A preliminary version of the workbench will be online in early 2023, Prakash said. Getting there will require more hours of consultation with ARM data users to define their workbench needs.
From there, he added, the workbench will be “continuously developed” through the end of fiscal year 2023.
Prakash calls the workbench, with its improved access and open source tools, “a revolutionary way to interact with ARM data.”
ARM recently revamped its open source code capabilities and added data services organizations to the GitHub software sharing site.
“Within ARM, we have limited capacity to develop the necessary processing and analysis codes,” Mather said. “But these open source software practices give us a way to pool our development resources to implement the best ideas and minimize any duplication of effort.”
Ultimately, he added, “it’s about improving the impact of ARM data.”
UT-Battelle manages ORNL for the Department of Energy’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. The Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit energy.gov/science.
Editor’s note: Adapted from an article by Corydon Ireland of Pacific Northwest National Laboratory, where ARM is headquartered.