The importance of subsetting large data sets has been widely discussed by people concerned with the implementation of EOSDIS. During the EOSDIS Release A Critical Design Review (CDR) in August 1995, Bruce Barkstrom, Bill Emery, and Marti Szczur (GSFC) discussed the idea of holding a workshop on subsetting, where interested people would get together to discuss their present experience with subsetting and the future needs for subsetting within EOSDIS. After many delays due to scheduling conflicts, the Subsetting Special Interest Group (SSIG) held a workshop at NASA's Langley Research Center on November 8-9, 1995. About 40 people from the academic community, NASA centers, and the ESDIS Project participated. Papers presented each day dealt with user experiences or expected EOS capabilities. Discussions were also held, leading to an improved understanding of subsetting requirements for EOSDIS. The SSIG Web page (http://ecsinfo.hitc.com/ssig/ssig.html) provides abstracts for the papers presented at the meeting, as well as a current listing of the planned EOSDIS Core System (ECS) data-type services and some references regarding HDF-EOS. We intend to publish the papers electronically and to refine the requirements that emerged at the meeting.
As stated at the outset of the meeting, the purposes of the workshop were:
The first day started with presentations on: Interactive User Subsetting in the Colorado EOSDIS Testbed (W. Emery/D. Baldwin [http://jester.colorado.edu/EOSDIS.html]); Coincidence and Subsetting with OTCS/LIS at the MSFC DAAC (P. Meyer/S. Graves [http://wwwghcc.msfc.nasa.gov/ghcc_home.html]); EDC DAAC Experience with 1 km AVHRR (J. Eidenschink [http://edcwww.cr.usgs.gov/landdaac/1KM/1kmhomepage.html]); Correlative Browse Studies (M. Kafatos/R. Yang); AVHRR Subsetting (A. C. Sundar/R. Welch); AVHRR-SST Subsetting at JPL (A. Tran [http://podaac-www.jpl.nasa.gov/sst/subset.html]); MODIS Subsetting (L. Fishtahler); and finally a series of presentations from the ECS project: ECS Overview (A. Endal), Data Server Subsetting/Sampling/Averaging (C. Horgan), and EOSHDF (D. Wynne, L. Klein).
The afternoon was devoted to a general group discussion of working group definitions. Some substantial changes were made, and the participants were separated into four working groups as listed below.
The second day began with three papers: ERBE Subsetting and Content-Based Searching (B. Barkstrom), Subsetting of Assimilated Level 4 Products (J. Stobie), and the LaRC DAAC Experience (T. Feltman). The rest of the day was devoted to the working group meetings and a summary of their results.
WORKSHOP WORKING GROUPS
While the papers will provide the formal record of the experiences and expectations that went into the meeting, some of the most important work was conducted in four working groups, where participants could more fully exchange views than was possible in the plenary sessions. The groups formed towards the end of the first day of the workshop were:
WORKING GROUP RECOMMENDATIONS
Users should be able to order data subsets based on geographic limits (x, y, z), temporal limits, and variable. This group felt that subsetting that alters data values should not be included in the requirements. That position rules out averaging, smoothing, and interpolating, but still permits subsampling. The cost of the subsetting service was viewed as a useful way to allow for a rational selection of the service. The simplicity of a subsetting service should also provide a way to prioritize it: simpler services have higher priority.
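The kind of request described above can be illustrated with a minimal Python sketch. It assumes a hypothetical in-memory granule (a dictionary with ascending `time`, `lat`, and `lon` coordinate lists and variables indexed as `[time][lat][lon]`); none of these names come from ECS or HDF-EOS, and the vertical (z) limit is omitted for brevity. The point is only that cutting by bounds, selecting variables, and subsampling with a stride all copy values out unchanged.

```python
from bisect import bisect_left, bisect_right


def index_range(coords, lo, hi):
    """Return (start, stop) slice bounds covering lo <= coord <= hi.

    Assumes `coords` is sorted in ascending order.
    """
    return bisect_left(coords, lo), bisect_right(coords, hi)


def subset_grid(granule, lat_bounds, lon_bounds, time_bounds, variables, stride=1):
    """Cut a sub-block out of a gridded granule without altering any value.

    `stride` > 1 subsamples every stride-th row and column; no averaging,
    smoothing, or interpolation is performed.
    """
    t0, t1 = index_range(granule["time"], *time_bounds)
    y0, y1 = index_range(granule["lat"], *lat_bounds)
    x0, x1 = index_range(granule["lon"], *lon_bounds)
    out = {
        "time": granule["time"][t0:t1],
        "lat": granule["lat"][y0:y1:stride],
        "lon": granule["lon"][x0:x1:stride],
        "data": {},
    }
    for name in variables:
        cube = granule["data"][name]  # indexed [time][lat][lon]
        out["data"][name] = [
            [row[x0:x1:stride] for row in plane[y0:y1:stride]]
            for plane in cube[t0:t1]
        ]
    return out


# Hypothetical granule: each value encodes its own indices as 100*t + 10*y + x.
granule = {
    "time": [0, 1, 2],
    "lat": [-10, 0, 10, 20],
    "lon": [0, 10, 20, 30, 40],
    "data": {
        "sst": [[[100 * t + 10 * y + x for x in range(5)] for y in range(4)]
                for t in range(3)],
    },
}

sub = subset_grid(granule, (0, 20), (10, 30), (1, 2), ["sst"])
coarse = subset_grid(granule, (0, 20), (10, 30), (1, 2), ["sst"], stride=2)
```

Here `sub` keeps every grid point inside the requested limits, while `coarse` keeps every second row and column of the same region. In both cases each retained value is identical to the value in the original granule.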
This group felt that EOS View may provide a good basis for subsetting, but that more-exotic tools should be left to the scientific community. Once data had been subsetted from an original granule, this group did feel that it was useful to allow further subsetting by value, which could be done through a tool at the scientist's workstation.
On subsetting swath data, the group felt that it was important that the subsetting function not change the data values. Otherwise, the subsetting functionality appeared to be the same as that for gridded data. We note that during discussions in the "plenary" session, the consensus that emerged was that subsetting swath data by scan line, as suggested by Hughes, was an appropriate level of service for users.
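One possible reading of scan-line subsetting is sketched below in Python: a whole scan line is kept if any of its pixels falls inside the requested geographic limits, and kept lines are returned untouched. The swath layout (a list of dictionaries with per-pixel `lat` and `lon` lists) and the "any pixel inside" rule are assumptions made for illustration; the workshop did not specify either.

```python
def subset_swath_by_scan_line(scan_lines, lat_bounds, lon_bounds):
    """Return the scan lines that touch the bounding box, unmodified.

    A scan line is selected as a unit: if at least one of its pixels lies
    inside the box, the entire line is kept so no data value is changed.
    """
    lat_lo, lat_hi = lat_bounds
    lon_lo, lon_hi = lon_bounds

    def touches(line):
        return any(
            lat_lo <= lat <= lat_hi and lon_lo <= lon <= lon_hi
            for lat, lon in zip(line["lat"], line["lon"])
        )

    return [line for line in scan_lines if touches(line)]


# Hypothetical four-line swath with three pixels per scan line.
swath = [
    {"line": 0, "lat": [10.0, 10.5, 11.0], "lon": [100.0, 101.0, 102.0], "radiance": [1, 2, 3]},
    {"line": 1, "lat": [20.0, 20.5, 21.0], "lon": [100.5, 101.5, 102.5], "radiance": [4, 5, 6]},
    {"line": 2, "lat": [30.0, 30.5, 31.0], "lon": [101.0, 102.0, 103.0], "radiance": [7, 8, 9]},
    {"line": 3, "lat": [40.0, 40.5, 41.0], "lon": [101.5, 102.5, 103.5], "radiance": [10, 11, 12]},
]

selected = subset_swath_by_scan_line(swath, (20.4, 30.2), (101.0, 103.0))
```

Selecting by whole scan line means the result can include pixels outside the requested box, which trades some extra volume for a simple service that leaves the instrument's native sampling intact.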
This WG developed the following list of functions that the storage system could do in order to facilitate subsetting. These items are listed in more-or-less increasing level of difficulty:
The WG on Production Subsetting felt it useful to try to provide a simple model of the connection between data production and the data server part of EOSDIS:
In this model, both secondary and tertiary storage are much slower than the RAID disks that make up the Production and Data Server Working Storage. This WG felt it was important to note that there are external subsets that go out through the "Exgest" service, but that subsetting can also influence how the system partitions data between Working Storage and tape storage. As a matter of philosophy, this WG felt that all production subsetting for standard data products involves jobs. Jobs done in the data server are called "Queries"; jobs done in production are called Product Generation Executives (PGEs). PGEs are characterized by a scheduled, standing-order approach, with subsetting done on the upstream side of a flow to reduce volume. Queries are characterized by a dynamic, unscheduled response to anomalies (QA) or validation. It appeared most reasonable to expect the Investigation Teams responsible for receiving the data subset to define the services needed.
The Production WG felt that subsetting for data production could be spatial, temporal, or parameter based. The group also identified two sources of difficulty that need to be accommodated in production subsetting: multigranule subsetting where there is a spatial discontinuity between two subsets, and multigranule subsetting involving different parameters in different granules.
The WG on Subsetting Methods developed a list of functions that might be part of a subsetting service and prioritized them. The functions fall into three general classes: subsetting (e.g., subsetting, sampling); reduction (e.g., averaging, compositing); and transformation (e.g., reprojection, re-interleaving). The prioritized list is:
| Function | Priority |
|---|---|
| By Geography | 1 |
| By Time Interval | 1 |
| By Parameter/Variable | 1 |
| With a Mask | 1- |
| Land/Sea/Other | 1 |
| User Defined | ? |
| Compress/Decompress | 1 |
| Select by Content | 1 |
| Subset by Selection | 1 |
| Create Mask | 1 |
| Calculate (+, -, X, /, other simple functions) | 2 |
| Statistics | 2 |
| Ceiling | 2 |
| Floor | 2 |
| Average | 1 |
| Spatial Subsample | 3 |
| Spatial Neighbor | 3 |
| 3D | 3 |
| Temporal | 2- |
| Region/Mask/Selection | 2 |
| Differentials | 3 |
| Interpolate | 3 |
| Reprojection | 4 |
| Compositing | 4 |
| Masking | 1 |
| Geotransform | 2 |
| Data Type Transform | 4 |
| Re-Interleaving | 3 |
| Subsampling | 1 |
| Sequence Process for Optimal Processing | 5 |
| Journaling | 1 |
| Undo, Redo based on history | 3 |
This WG also felt that much further work needed to be done to explore making the definitions of these phrases more precise, and to evaluate the proposed functions against data types. Thus, we need to develop a description for each function, and to note the dependencies of one function on others.
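As a hedged illustration of how such descriptions and dependencies might be recorded, the Python sketch below stores a few functions from the table with their listed priorities and checks them against a dependency map. The dependencies shown are hypothetical examples invented for this sketch (the WG did not define any), and the helper names are likewise assumptions.

```python
from graphlib import TopologicalSorter

# Priorities copied from the table above (1 = highest priority).
PRIORITY = {
    "Select by Content": 1,
    "Subset by Selection": 1,
    "Create Mask": 1,
    "Masking": 1,
    "Journaling": 1,
    "Undo, Redo based on history": 3,
}

# HYPOTHETICAL dependencies, for illustration only: function -> prerequisites.
DEPENDS_ON = {
    "Subset by Selection": {"Select by Content"},
    "Masking": {"Create Mask"},
    "Undo, Redo based on history": {"Journaling"},
}


def priority_conflicts(priority, depends_on):
    """List (function, prerequisite) pairs where the prerequisite is
    ranked as less urgent than the function that needs it."""
    return sorted(
        (func, pre)
        for func, prereqs in depends_on.items()
        for pre in prereqs
        if priority[pre] > priority[func]
    )


def implementation_order(priority, depends_on):
    """Order functions so prerequisites come first; among functions that
    become ready together, higher priority (lower number) goes first."""
    graph = {name: depends_on.get(name, set()) for name in priority}
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    order = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready(), key=lambda n: (priority[n], n))
        order.extend(ready)
        sorter.done(*ready)
    return order


conflicts = priority_conflicts(PRIORITY, DEPENDS_ON)
order = implementation_order(PRIORITY, DEPENDS_ON)
```

A non-empty `conflicts` list would signal that a high-priority function relies on one ranked lower, which is the kind of inconsistency that a more precise set of definitions could expose.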