The Earth Observer, November/December, 1995


Subsetting Special Interest Group (SSIG) Workshop

Bill Emery (emery@orbit.colorado.edu), University of Colorado
Bruce Barkstrom (brb@ceres.larc.nasa.gov), NASA Langley Research Center

The importance of subsetting large data sets has been widely discussed by people concerned with the implementation of EOSDIS. During the EOSDIS Release A Critical Design Review (CDR) in August 1995, Bruce Barkstrom, Bill Emery, and Marti Szczur (GSFC) discussed the idea of a workshop where people interested in subsetting would get together, describe their present experience with it, and discuss future needs for subsetting within EOSDIS. After many delays due to scheduling conflicts, the Subsetting Special Interest Group (SSIG) held a workshop at NASA's Langley Research Center on November 8-9, 1995. About 40 people from the academic community, NASA centers, and the ESDIS Project participated. Papers presented each day dealt with user experiences or expected EOS capabilities. Discussions were also held, leading to an improved understanding of subsetting requirements for EOSDIS.

The SSIG Web page (http://ecsinfo.hitc.com/ssig/ssig.html) provides abstracts for the papers presented at the meeting, as well as a current listing of the planned EOSDIS Core System (ECS) data-type services, and some references regarding HDF-EOS. We intend to publish the presented papers electronically and to refine the requirements that emerged at the meeting.

As stated at the outset of the meeting, the purposes of the workshop were:

  1. To obtain good summaries of subsetting experiences by users as well as "theoretical views" of subsetting.

  2. To obtain a summary of the current capabilities and plans of EOSDIS.

  3. To set up a priority list of data sets for which subsetting seems to be important, define methods of implementing subsetting, and select tests of subsetting that ESDIS and the user community can use to evaluate the EOSDIS implementation of subsetting.

The first day started with presentations on: Interactive User Subsetting in the Colorado EOSDIS Testbed (W. Emery/D. Baldwin [http://jester.colorado.edu/EOSDIS.html]); Coincidence and Subsetting with OTCS/LIS at the MSFC DAAC (P. Meyer/S. Graves [http://wwwghcc.msfc.nasa.gov/ghcc_home.html]); EDC DAAC Experience with 1 km AVHRR (J. Eidenschink [http://edcwww.cr.usgs.gov/landdaac/1KM/1kmhomepage.html]); Correlative Browse Studies (M. Kafatos/R. Yang); AVHRR Subsetting (A. C. Sundar/R. Welch); AVHRR-SST Subsetting at JPL (A. Tran [http://podaac-www.jpl.nasa.gov/sst/subset.html]); MODIS Subsetting (L. Fishtahler); and finally a series of presentations from the ECS project: ECS Overview (A. Endal), Data Server Subsetting/Sampling/Averaging (C. Horgan), and EOSHDF (D. Wynne, L. Klein).

The afternoon was devoted to a general group discussion of working group definitions. Some substantial changes were made, and the participants were separated into four working groups as listed below.

The second day began with three papers: ERBE Subsetting and Content-Based Searching (B. Barkstrom), Subsetting of Assimilated Level 4 Products (J. Stobie), and the LaRC DAAC Experience (T. Feltman). The rest of the day was devoted to the working group meetings and a summary of their results.

WORKSHOP WORKING GROUPS

While the papers will provide the formal record of the experiences and expectations that went into the meeting, some of the most important work was conducted in four working groups, where participants could more fully exchange views than was possible in the plenary sessions. The groups formed towards the end of the first day of the workshop were:

  1. WG on User Needs and Requirements -- Jim Stobie. This group was to consider particularly the following issues: what subsetting services users need and want; how cost influences the selection of subsetting services; what kind of subsetting tools users need; where does subsetting stop and analysis begin?

  2. WG on Subsetting for Storage and Network Transmission -- Ben Kobler. This group dealt with these issues: should the results of a subsetting service be archived or discarded; how should compression, filing, and interleaving of results be handled in the subset product?

  3. WG on Production Subsetting -- Bruce Barkstrom. This group considered production issues, such as what subsetting services are particular to data production and who should be responsible for them; what metadata is required for subsetting, and what metadata should subsetting produce?

  4. WG on Subsetting Methods -- Ted Meyer. This group dealt with such issues as: what processing steps should be included for such subsetting processes as sampling, averaging, and other volume reduction methods; what user support and documentation are needed; what mechanisms and processes are needed to review and validate data types for particular subsetting methods?

WORKING GROUP RECOMMENDATIONS

  1. WG on User Needs and Requirements

    Users should be able to order data subsets based on geographic limits (x, y, z), on temporal limits, and by variable. This group felt that subsetting that alters data values should not be included in the requirements. That restriction would rule out averaging, smoothing, or interpolating, but would still allow subsampling. The cost of a subsetting service was viewed as a useful way to allow for a rational selection of the service. The simplicity of a subsetting service should also provide a way to prioritize it -- simpler has higher priority.
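    As a rough illustration of this requirement, the sketch below subsets a toy gridded granule by geographic limits, time limits, and variable, with an optional stride for subsampling. The `Grid` class, the `subset_grid` function, and the nested-list layout are all hypothetical, the vertical (z) limit is omitted for brevity, and nothing here represents an actual EOSDIS or HDF-EOS interface. Values are only copied, never averaged or interpolated.

```python
from dataclasses import dataclass


@dataclass
class Grid:
    """A toy gridded granule (hypothetical layout, not an EOSDIS format)."""
    times: list       # one entry per time step, e.g. day numbers
    lats: list        # latitude of each grid row
    lons: list        # longitude of each grid column
    variables: dict   # name -> nested list indexed [time][lat][lon]


def subset_grid(grid, lat_range, lon_range, time_range, names, stride=1):
    """Return a new Grid limited to the given bounds and variables.

    Data values are copied unchanged. `stride` keeps every n-th row and
    column inside the box (subsampling), so nothing is averaged,
    smoothed, or interpolated.
    """
    t_idx = [k for k, t in enumerate(grid.times)
             if time_range[0] <= t <= time_range[1]]
    j_idx = [j for j, y in enumerate(grid.lats)
             if lat_range[0] <= y <= lat_range[1]][::stride]
    i_idx = [i for i, x in enumerate(grid.lons)
             if lon_range[0] <= x <= lon_range[1]][::stride]
    selected = {
        name: [[[grid.variables[name][k][j][i] for i in i_idx]
                for j in j_idx]
               for k in t_idx]
        for name in names
    }
    return Grid(
        times=[grid.times[k] for k in t_idx],
        lats=[grid.lats[j] for j in j_idx],
        lons=[grid.lons[i] for i in i_idx],
        variables=selected,
    )
```

    A request is then just a set of limits plus a list of variable names, which is also roughly the information a cost estimate for the service would need.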

    This group felt that EOS View may provide a good basis for subsetting, but that more-exotic tools should be left to the scientific community. Once data have been subsetted from an original granule, the group felt that it would be useful to allow further subsetting by value, which could be done through a tool at the scientist's workstation.

    On subsetting swath data, the group felt that it was important that the subsetting function not change the data values. Otherwise, the subsetting functionality appeared to be the same as that for gridded data. We note that during discussions in the "plenary" session, the consensus that emerged was that subsetting swath data by scan line, which was suggested by Hughes, was an appropriate level of service for users.
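    One way to picture scan-line subsetting is sketched below, assuming a swath is held as a list of scan lines and each scan line as a list of (lat, lon, value) pixels. The function name and data layout are hypothetical. A scan line is kept whole if any of its pixels falls inside the requested box, so no data value is altered.

```python
def subset_swath_by_scan_line(scan_lines, lat_range, lon_range):
    """Keep whole scan lines that have at least one pixel inside the box.

    `scan_lines` is a list of scan lines; each scan line is a list of
    (lat, lon, value) tuples. Selected lines are returned intact together
    with their original index, so no value is changed or resampled.
    """
    def inside(pixel):
        lat, lon, _value = pixel
        return (lat_range[0] <= lat <= lat_range[1]
                and lon_range[0] <= lon <= lon_range[1])

    return [(number, line)
            for number, line in enumerate(scan_lines)
            if any(inside(pixel) for pixel in line)]
```

    Returning the original scan-line index alongside each line keeps the subset traceable to its source granule.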

  2. WG on Subsetting for Storage and Network Transmission

    This WG developed the following list of functions that the storage system could perform to facilitate subsetting. The items are listed in roughly increasing order of difficulty:

    a.) Monitor file access patterns to identify when reorganization of tape files appears justified.

    b.) Identify subsets that have high usage, both to improve service to users and to provide justification for reorganizing the data.

    c.) Provide flexibility in system design to allow the usage and access patterns to determine the tradeoff between storage and transmission.

    d.) Allow subsetted data to be stored and reused by a wider community.

    e.) Allow content-based subsets to be developed and stored.

    f.) Allow different portions of a file to be stored on different media.
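    Items a.) and b.) above could be prototyped with a simple request counter, as in the sketch below. The `AccessMonitor` class, its threshold, and the idea of keying requests by a granule identifier plus a subset description are assumptions made for illustration only.

```python
from collections import Counter


class AccessMonitor:
    """Count subset requests and flag the ones that are used heavily."""

    def __init__(self, threshold):
        # Minimum number of requests before a subset counts as high usage.
        self.threshold = threshold
        self.counts = Counter()

    def record(self, granule_id, subset_key):
        """Note one request for a given subset of a given granule."""
        self.counts[(granule_id, subset_key)] += 1

    def high_usage(self):
        """Return heavily requested subsets, most requested first."""
        return [key for key, n in self.counts.most_common()
                if n >= self.threshold]
```

    Subsets returned by `high_usage()` would be candidates for storing and reusing (item d.) or for motivating a reorganization of the underlying files.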

  3. WG on Production Subsetting

    The WG on Production Subsetting felt it useful to try to provide a simple model of the connection between data production and the data server part of EOSDIS:

    In this model, both secondary and tertiary storage are much slower than the RAID disks that make up the Production and Data Server Working Storage. This WG felt it was important to note that there are external subsets that go out through the "Exgest" service, but that there is also the potential for subsetting that influences the way in which the system handles the partitioning between Working Storage and the tape storage. As a matter of philosophy, this WG felt that all production subsetting for standard data products involves jobs. If jobs are done in the data server, we call them "Queries"; if they are done in production, we call them Product Generation Executives (PGEs). PGEs are characterized by a scheduled, standing-order approach and by subsetting done on the upstream side of a flow to reduce volume. Queries are characterized by a dynamic, unscheduled response to anomalies (QA) or validation. It appeared most reasonable to expect the Investigation Teams responsible for receiving the data subset to define the services needed.

    The Production WG felt that subsetting for data production could be spatial, temporal, or parameter based. The group also identified two sources of difficulty that need to be accommodated in production subsetting: multigranule subsetting where there is a spatial discontinuity between two subsets, and multigranule subsetting that involves different parameters in different granules.
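    The two difficulties can be made concrete with a small check that runs before granules are combined. In the sketch below, the `Granule` record, its one-dimensional coverage interval, and the `check_multigranule` function are hypothetical simplifications, not part of any ECS design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Granule:
    """Hypothetical granule summary used only for this illustration."""
    name: str
    start: float            # start of coverage along one axis (e.g. latitude)
    end: float              # end of coverage along the same axis
    parameters: frozenset   # names of the parameters stored in the granule


def check_multigranule(granules):
    """Report coverage gaps and parameter mismatches across granules.

    Returns (gaps, missing): `gaps` lists pairs of consecutive granules
    whose coverage does not touch, and `missing` maps a granule name to
    the parameters that other granules have but it lacks.
    Consecutive granules are compared after sorting by `start`.
    """
    ordered = sorted(granules, key=lambda g: g.start)
    gaps = [(a.name, b.name)
            for a, b in zip(ordered, ordered[1:])
            if b.start > a.end]
    all_params = set()
    for g in ordered:
        all_params |= g.parameters
    missing = {g.name: sorted(all_params - g.parameters)
               for g in ordered
               if all_params - g.parameters}
    return gaps, missing
```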

  4. WG on Subsetting Methods

    The WG on Subsetting Methods developed a list of functions that might be part of a subsetting service and prioritized them. The functions fall into three general classes: subsetting (e.g., subsetting, sampling); reduction (e.g., averaging, compositing); and transformation (e.g., reprojection, re-interleaving). The prioritized list is:

    Function                                            Priority
    --------------------------------------------------  --------
    By Geography                                        1
    By Time Interval                                    1
    By Parameter/Variable                               1
    With a Mask                                         1-
    Land/Sea/Other                                      1
    User Defined                                        ?
    Compress/Decompress                                 1
    Select by Content                                   1
    Subset by Selection                                 1
    Create Mask                                         1
    Calculate (+, -, X, /, other simple functions)      2
    Statistics                                          2
    Ceiling                                             2
    Floor                                               2
    Average                                             1
    Spatial Subsample                                   3
    Spatial Neighbor                                    3
    3D                                                  3
    Temporal                                            2-
    Region/Mask/Selection                               2
    Differentials                                       3
    Interpolate                                         3
    Reprojection                                        4
    Compositing                                         4
    Masking                                             1
    Geotransform                                        2
    Data Type Transform                                 4
    Re-Interleaving                                     3
    Subsampling                                         1
    Sequence Process for Optimal Processing             5
    Journaling                                          1
    Undo, Redo based on history                         3

    This WG also felt that much further work was needed to make the definitions of these terms more precise, and to evaluate the proposed functions against data types. Thus, we need to develop a description for each function, and to note the dependencies of one function on others.
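    A function catalog of the kind called for here might record, for each function, its priority, a description, and the functions it depends on. The sketch below is one hypothetical way to hold that information and to list functions in dependency batches, sorted by priority within each batch. The `SubsetFunction` record and any dependency you enter are illustrative assumptions, since the WG did not specify dependencies. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass
class SubsetFunction:
    """One catalog entry (hypothetical structure for illustration)."""
    name: str
    priority: int               # 1 = highest priority
    description: str = ""
    depends_on: tuple = ()      # names of other catalog entries


def implementation_order(functions):
    """List function names so that dependencies always come first.

    Functions become available in batches once everything they depend on
    has been listed; inside a batch they are sorted by priority, then by
    name. Every name used in `depends_on` must itself be in the catalog.
    """
    by_name = {f.name: f for f in functions}
    sorter = TopologicalSorter({f.name: set(f.depends_on) for f in functions})
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    order = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready(),
                       key=lambda n: (by_name[n].priority, n))
        order.extend(ready)
        sorter.done(*ready)
    return order
```

    With such a catalog, evaluating the proposed functions against data types could start from the head of the returned list.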
