The Earth Observer, May/June, 1995 Issue


Mining Data from Climate Models and Observations

Jarrett S. Cohen (jcohen@jacks.gsfc.nasa.gov), Hughes STX Corp., High Performance Computing Branch, Goddard Space Flight Center

Today's climatologists have a variety of tools at their disposal. Two of the most powerful are supercomputer models, which may be descriptive, predictive, or a combination of both, and observations, for which EOS will be the most comprehensive source. Beyond understanding the Earth system as a whole, scientists also want to probe the data for specific weather patterns and other phenomena. With many gigabytes of spatially and temporally complex data often on hand, this task is feasible only with a query processing environment that addresses both conceptual abstraction and exploratory analysis.

Researchers at the UCLA Data Mining Laboratory1 and the NASA/Jet Propulsion Laboratory (JPL)2 are building such a system in CONQUEST (CONcurrent QUEries over Space and Time), a computational environment for content-based searching. "To us [data mining] is being able to express easily and execute efficiently complex queries that allow you to get at phenomena," says Director Richard Muntz, professor of computer science. "The scientific user expresses what he or she is looking for; machine learning has algorithms that look for patterns in data."

Figure 1. Data mining graphic: the CONQUEST environment.

"The goal is to make this painless," adds Edmond Mesrobian, laboratory co-director and postdoctoral researcher. Designed to provide automated data exploration and analysis, CONQUEST (see Figure 1) consists of a Scientist Workbench, a Visualization Manager, Information Repositories, and the CONQUEST Parallel Query Processing System. CONQUEST resembles popular interactive packages such as the Application Visualization System (AVS), although its data model and execution paradigm are quite different. The Scientist Workbench acts as the top-level, graphical interface. From the Workbench, the user chooses either to visualize the retrieved data using the Visualization Manager or to index and store them in the Information Repositories for later use.

Made up of the Query Manager and the Execution Server, the CONQUEST Parallel Query Processing System executes the queries delivered from the Workbench. A library of operators, akin to modules in AVS, performs both generic algebraic functions and application-specific processes. Many are built in, but Mesrobian emphasizes that it is an open system to which researchers can add their own operators. This design is based on the University of Colorado's Volcano extensible query processing system. Another important aspect is a data model into which geoscientific datasets in multiple formats (e.g., HDF, GRIB, DRS, netCDF) from a variety of sources (e.g., database management systems, geographic information systems, and general circulation models) can be mapped. "We define a global data model and representation for the heterogeneous datasets to minimize what the user has to know," explains Eddie Shek, graduate student in computer science.
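
The idea of an open operator library can be illustrated with a minimal Python sketch. It is not CONQUEST code: the registry, the operator names (to_hpa, low_pressure), and the record layout are all hypothetical, and Python generators stand in for the stream-of-records style of processing associated with Volcano-like systems.

```python
from typing import Callable, Dict, Iterable, Iterator

# Hypothetical registry: operator name -> function that turns one
# stream of records into another stream of records.
Operator = Callable[[Iterable[dict]], Iterator[dict]]
OPERATORS: Dict[str, Operator] = {}


def register(name: str) -> Callable[[Operator], Operator]:
    """Decorator that adds a user-written operator to the library."""
    def wrap(fn: Operator) -> Operator:
        OPERATORS[name] = fn
        return fn
    return wrap


@register("to_hpa")
def to_hpa(records):
    # Generic algebraic operator: convert pressure from Pa to hPa.
    for r in records:
        yield {**r, "slp": r["slp"] / 100.0}


@register("low_pressure")
def low_pressure(records, threshold_hpa=1000.0):
    # Application-specific operator: keep only low-pressure points.
    for r in records:
        if r["slp"] < threshold_hpa:
            yield r


def run_query(plan, source):
    """Chain the named operators; records flow through one at a time."""
    stream = source
    for name in plan:
        stream = OPERATORS[name](stream)
    return list(stream)


grid = [
    {"lat": 40.0, "lon": -30.0, "slp": 98500.0},
    {"lat": 10.0, "lon": 20.0, "slp": 101300.0},
]
result = run_query(["to_hpa", "low_pressure"], grid)
```

Adding a new operator in this sketch means writing one more decorated function; the query plan itself is just a list of names.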

Applications

The UCLA researchers have been collaborating with several climate modeling teams in analyzing model output and have just initiated similar work with observational datasets. Datasets studied include output from the UCLA Atmospheric General Circulation Model (AGCM) and the European Centre for Medium-Range Weather Forecasts (ECMWF) AGCM, as well as the ECMWF Global Basic Surface and Upper Air Advanced Analyses (observational data that have been assimilated using an AGCM). A data broker allows CONQUEST to do both post-processing and "live processing" of data from a climate model as it is running. The data broker can also work with several models, as it converts varying grid sizes to a common grid that is more convenient for the user.
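
The article does not say how the data broker converts grids. As one simple illustration of mapping a finer grid onto a coarser common grid, the sketch below averages square blocks of cells. The function name block_average is hypothetical, and the sketch ignores area weighting by latitude, which a real regridding step would need to consider.

```python
def block_average(field, factor):
    """Coarsen a 2-D grid (a list of rows) by averaging factor x factor blocks."""
    rows, cols = len(field), len(field[0])
    if rows % factor or cols % factor:
        raise ValueError("grid dimensions must be multiples of factor")
    coarse = []
    for i in range(0, rows, factor):
        out_row = []
        for j in range(0, cols, factor):
            block = [field[i + di][j + dj]
                     for di in range(factor)
                     for dj in range(factor)]
            out_row.append(sum(block) / len(block))
        coarse.append(out_row)
    return coarse


# A 2 x 4 "fine" grid reduced to a 1 x 2 "common" grid.
fine = [[1.0, 3.0, 5.0, 7.0],
        [1.0, 3.0, 5.0, 7.0]]
common = block_average(fine, 2)
```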

The two principal phenomena under investigation thus far are cyclones and blocking events. "We are extracting cyclones and blocking events not only because they are important phenomena but also because they interact in ways that are imperfectly understood," says Paul Stolorz, a physicist in JPL's Robotic Systems and Advanced Computing Technology group. He describes one goal of the process as automating detection of features and then seeing if it is possible to extract and predict correlations and irregularities over a large time scale.

Cyclones

Cyclones are areas of minimum sea level pressure (SLP), hundreds of kilometers in size, that are the generators of most of the weather. "With cyclones, the most difficult thing is tracking," Muntz says. In this several-step procedure (see Figure 2), distinct operators read the SLP values, extract minima from them, and use upper-altitude winds to determine the cyclone centers' most likely direction of movement. The final operator combines the minima and wind values to track the cyclone.

Figure 2. Several-step procedure for extracting SLP minima and tracking cyclones.
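
The tracking steps described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the CONQUEST operators: the pressure threshold, the eight-neighbor test, the steering displacement given in grid cells per time step, and the matching distance are all hypothetical choices, and the sketch ignores the wrap-around of a global grid.

```python
import math


def local_minima(slp, threshold=1000.0):
    """Return (row, col) of interior points that are below `threshold`
    and strictly lower than all eight neighbours."""
    found = []
    for i in range(1, len(slp) - 1):
        for j in range(1, len(slp[0]) - 1):
            centre = slp[i][j]
            if centre >= threshold:
                continue
            neighbours = [slp[i + di][j + dj]
                          for di in (-1, 0, 1) for dj in (-1, 0, 1)
                          if (di, dj) != (0, 0)]
            if all(centre < n for n in neighbours):
                found.append((i, j))
    return found


def extend_track(track, steering, candidates, max_dist=1.5):
    """Append the candidate minimum nearest to the wind-predicted
    position, provided it lies within `max_dist` grid cells."""
    last_i, last_j = track[-1]
    predicted = (last_i + steering[0], last_j + steering[1])
    best = min(candidates, key=lambda c: math.dist(c, predicted), default=None)
    if best is not None and math.dist(best, predicted) <= max_dist:
        track.append(best)
    return track


def flat_grid_with_low(size, centre, depth):
    """Test helper: uniform 1010 hPa field with a single low."""
    grid = [[1010.0] * size for _ in range(size)]
    grid[centre[0]][centre[1]] = depth
    return grid


slp_t0 = flat_grid_with_low(6, (2, 2), 985.0)   # time step 0
slp_t1 = flat_grid_with_low(6, (3, 3), 982.0)   # time step 1
steering = (1, 1)  # assumed displacement implied by upper-level winds

track = [local_minima(slp_t0)[0]]
track = extend_track(track, steering, local_minima(slp_t1))
```

Each function corresponds to one stage of the procedure: minima extraction, then combination of minima with the wind-based prediction to extend a track.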

Cyclone trajectories, or "tracks," are put into the information repositories. "From 100 gigabytes, you can get down to 1 megabyte of salient periods of cyclones; you can then search these cyclones and do analysis," Muntz says. He explains that over 10 years of the model run, the user might obtain a few thousand cyclones. Searches can be basic and about the cyclones themselves, such as "find all the cyclones in the winter months." They can also be much more complicated, involving extraction of data related to the cyclone, such as "record the temperature in a 100 km region from the cyclone center," which in turn can be tracked as well.
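
A basic search such as "find all the cyclones in the winter months" might look like the following once tracks are stored as records. The CycloneTrack layout and the choice of December through February as winter are assumptions made for this sketch; the article does not describe the repository schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass
class CycloneTrack:
    track_id: int
    start: date
    positions: List[Tuple[float, float]]  # (lat, lon) per time step


# Assumption: Northern Hemisphere meteorological winter.
WINTER_MONTHS = {12, 1, 2}


def winter_cyclones(tracks):
    """Select tracks whose first observation falls in a winter month."""
    return [t for t in tracks if t.start.month in WINTER_MONTHS]


tracks = [
    CycloneTrack(1, date(1990, 1, 14), [(45.0, -40.0), (47.5, -35.0)]),
    CycloneTrack(2, date(1990, 7, 2), [(50.0, 160.0)]),
    CycloneTrack(3, date(1990, 12, 28), [(55.0, -20.0)]),
]
winter_ids = [t.track_id for t in winter_cyclones(tracks)]
```

Because the tracks are a tiny fraction of the raw data, a scan like this is cheap compared with re-reading the model output.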

The Muntz team has produced several telling visualizations of cyclone tracks. In one, a world map outlined in white against a black background serves as a backdrop for eruptions of red slivers, which mark the cyclones' paths. Visualization is particularly useful for comparing different climate models. Mesrobian says that since scientists know where cyclones occur from observations, they can test the relative accuracies of the models. Density maps of cyclopresence (see Figure 3) for two ECMWF models show several differences; notably, the ECMWF AGCM (Figure 3b) generates significantly more cyclones than observed.

Figure 3a. World map of cyclopresence.

Figure 3b. World map of cyclopresence (ECMWF AGCM).
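
A density map of this kind can be approximated by counting cyclone positions per latitude-longitude cell. The sketch below assumes that reading of "cyclopresence"; the 10-degree cell size and the function name density_map are illustrative only.

```python
from collections import Counter


def density_map(tracks, cell_deg=10.0):
    """Count track positions falling in each cell_deg x cell_deg cell.
    Cells are keyed by (lat_index, lon_index) using floor division."""
    counts = Counter()
    for positions in tracks:
        for lat, lon in positions:
            cell = (int(lat // cell_deg), int(lon // cell_deg))
            counts[cell] += 1
    return counts


tracks = [
    [(45.0, -40.0), (47.5, -35.0)],   # both fall in cell (4, -4)
    [(52.0, -38.0)],                  # falls in cell (5, -4)
]
counts = density_map(tracks)
```

Comparing two such maps cell by cell, one per model, is one way to see where a model generates more cyclones than another dataset.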

Blocking Events

A blocking event is a class of persistent anomaly in which the westerly jet stream in mid-latitudes splits in two and remains in this condition for 10 days or so. "It blocks the passage of normal wind flow, which affects the climate in that region; for example, storms flow around it," Stolorz explains. He says that scientists want to predict or understand the dynamics of typical patterns as well as why there would be deviations from the usual blocking pattern.
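
The persistence part of this definition lends itself to a short sketch. Assume, hypothetically, that some earlier operator has already produced one True/False flag per day at a grid point indicating a split-flow signal; the code below then keeps only runs lasting at least a minimum number of days. The 10-day default follows the "10 days or so" in the text and is not a documented CONQUEST parameter.

```python
def persistent_runs(flags, min_days=10):
    """Return (start, end) index pairs, end exclusive, for runs of
    True values that last at least `min_days`."""
    runs = []
    start = None
    for day, flagged in enumerate(flags):
        if flagged and start is None:
            start = day                      # a run begins
        elif not flagged and start is not None:
            if day - start >= min_days:      # the run just ended
                runs.append((start, day))
            start = None
    # Handle a run that continues to the end of the record.
    if start is not None and len(flags) - start >= min_days:
        runs.append((start, len(flags)))
    return runs


# 12 flagged days (kept) followed later by 4 flagged days (dropped).
daily = [False] * 3 + [True] * 12 + [False] * 5 + [True] * 4
events = persistent_runs(daily)
```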

With operators for blocking event detection plugged in, the team carried out a study of a 5-year dataset using the ECMWF Analyses and the UCLA AGCM (see Figure 4). "The results show where blocking events occur regularly, and we can see how they occur globally, both spatially and temporally," Stolorz says. From an initial 1 gigabyte of raw information, they found between 175,000 and 620,000 grid points with a strong blocking signal, which represent approximately 50 blocking events.

Figure 4a. World map of blocking density.

Figure 4b. World map, UCLA AGCM.

"We can use information theory to look at this behavior in more detail," Stolorz says. For example, scientists might want to know if the occurrence of 10 events in a certain place over 5 years is correlated with 20 events occurring in another location. Such an understanding leads to a more general comprehension of the climate system. "This is hard, outside the scope of the model," Stolorz stresses. "The computer has to put together information from a lot of different points. This step requires massively parallel computing resources, but it is do-able provided the search space is kept to a reasonable size."

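The article does not say which information-theoretic measure the team uses. One standard candidate for asking whether events at two locations are related is mutual information, sketched below for two equal-length sequences of event indicators (for example, 1 if a blocking event was present in a given period, else 0). This covers a single pair of locations only; searching over many pairs is where the computational cost Stolorz describes would arise.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information, in bits, between two equal-length
    sequences of discrete values."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi


site_a = [1, 0, 1, 0]
same = mutual_information(site_a, site_a)                  # fully dependent
unrelated = mutual_information([0, 0, 1, 1], [0, 1, 0, 1]) # independent
```

A value near zero suggests the two indicator series carry no information about each other; larger values suggest a relationship worth examining.
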
Parallel Computing and Extensions

The sheer size and heterogeneity of the datasets together with the complexity of searches make data mining a "Grand Challenge" problem and thus a candidate for parallel computing. Parallel computers achieve multi-gigaFLOP (floating-point operations per second) speeds by dividing a problem across a large number of microprocessors, often the same ones as today's workstations. The NASA High Performance Computing and Communications Program is aimed at furthering the use of these machines, and Muntz's team receives primary funding from the Earth and Space Sciences (ESS) Project managed by the Goddard Space Flight Center. ESS also funds two Grand Challenge teams in global climate modeling and one in data assimilation, in addition to four astrophysics teams.

"The data model of CONQUEST allows you to view a large dataset as a sequence of data 'chunks' so that you don't have to have all the data in the machine at one time," Muntz explains. "Even the biggest massively parallel machines can't handle all the data . . . , not even a significant fraction." First developed for SUN workstations, CONQUEST now runs on the IBM SP-1/SP-2 family and the Intel Paragon.

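The "sequence of chunks" view can be illustrated with a Python generator that hands an operator one fixed-size piece of a dataset at a time. The names chunks and streaming_min are hypothetical, and the example data are synthetic; the point is only that a result can be computed without holding the whole dataset in memory.

```python
from typing import Iterable, Iterator, List


def chunks(values: Iterable[float], size: int) -> Iterator[List[float]]:
    """Yield successive lists of at most `size` values."""
    chunk: List[float] = []
    for v in values:
        chunk.append(v)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:            # final, possibly shorter, chunk
        yield chunk


def streaming_min(values: Iterable[float], size: int) -> float:
    """Minimum over a stream, keeping only one chunk in memory."""
    best = float("inf")
    for chunk in chunks(values, size):
        best = min(best, min(chunk))
    return best


# A generator expression stands in for a dataset too large to load.
readings = (1013.0 - (i % 7) * 3.0 for i in range(20))
lowest = streaming_min(readings, size=6)
```
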
A parallel computing technique known as dataflow processing manages the data and allows computation in different processes to overlap. CONQUEST supports several types of parallelism.

All these methods result in considerable speed-up of a content-based query. Postprocessing of 10 years of cyclone tracking data, for instance, used to take 20 minutes on a Sun Sparc 10 workstation but now takes less than 3 minutes on four Sun Sparc workstations and 23 seconds on an eight-node IBM SP-2. For now, the user specifies the optimization, but Muntz says they are aiming for automatic optimization. "Optimizations would be based on the manner in which the datasets are organized and on the computational complexity of the individual operators," he says. One type of decision that would be made automatically is whether CONQUEST should replicate an operator several times to perform the same function simultaneously on different data.
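
Operator replication, the last idea in the paragraph above, can be sketched with Python's standard thread pool: several copies of the same operator run concurrently, each on a different chunk of data. The operator count_deep_lows and the threshold are hypothetical. Note that threads in CPython illustrate the structure only; they do not speed up CPU-bound work, whereas a parallel machine would run the replicas on separate processors.

```python
from concurrent.futures import ThreadPoolExecutor


def count_deep_lows(chunk, threshold=990.0):
    """Operator: count readings below `threshold` in one chunk."""
    return sum(1 for v in chunk if v < threshold)


def replicated(operator, data_chunks, replicas=4):
    """Apply `operator` to every chunk using up to `replicas`
    concurrent workers; results keep the order of the chunks."""
    with ThreadPoolExecutor(max_workers=replicas) as pool:
        return list(pool.map(operator, data_chunks))


data_chunks = [[1001.0, 985.0], [979.5, 988.0], [1012.0, 1003.0]]
partial = replicated(count_deep_lows, data_chunks)
total = sum(partial)   # combine the per-chunk results
```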

The EOS Data and Information System (EOSDIS) program is supporting extension of CONQUEST to a distributed object management environment, in which phenomena attributes and associated operators are combined into "objects" that contain all the information necessary to retrieve the desired data products. This capability will be crucial with data residing in different repositories, Shek stresses. Connection to a testbed Distributed Active Archive Center (DAAC) for advanced prototyping is planned. Stolorz points out that the team is also working on automating what it means for something to be a blocking feature or a cyclone. "To describe them, it will decide what the factors are," he says. The difficulty here is that there is often no consensus on the definitions. Thus, "an individual researcher needs a way to quickly and efficiently iterate on the definitions until the result is judged to be acceptable," Muntz says. "He or she then can check the results by visualizing subsets of the raw data and modifying the definitions as needed."

Two other improvements underway to make the system more user-friendly are a simpler method for adding new operators and an interface that will create its own query forms for additional phenomena.

The first release of CONQUEST to the research community is planned for early 1996. By then, its designers' aim is to have "a robust language, like an erector set," Muntz says. "We want to make extracting cyclones or any other phenomena over a certain time as easy as searching for employees making over $30,000 in a company."
