flowchart TD
    A[Science question]
    B[Data structures]
    B1[Raster]
    B2[Swath]
    B3[Vector]
    B4[Tabular]
    C[File formats]
    C1[NetCDF / HDF5]
    C2[GeoTIFF]
    C3[Shapefile]
    C4[CSV]
    D((Analysis tools))
    D1((Python))
    D2((R))
    D3((GIS software))
    D4((Web tools))
    A -.-> B
    B -.-> A
    B --> B1
    B --> B2
    B --> B3
    B --> B4
    B1 --> C
    B2 --> C
    B3 --> C
    B4 --> C
    C --> C1
    C --> C2
    C --> C3
    C --> C4
    C -.-> D
    D --> D1
    D --> D2
    D --> D3
    D --> D4
Data structures and formats
Earth observation data are stored in a variety of structures and file formats, each shaped by how the data were collected, processed, distributed, and historically used, along with the science questions that guided their creation. At NSIDC, formats reflect satellite mission design, modeling needs, interoperability standards, and long-term archiving requirements.
This page describes the formats and structures used to store NSIDC data so that users can better understand how to work with these data.
It is helpful to distinguish between:
Data structure — how data observations are organized conceptually (e.g. as a grid, rows/columns, discrete features, etc.), and
File format — how those structured data are stored and delivered.
Understanding the basics (and differences between the two) can help you:
- Interpret results correctly
- Recognize limitations
- Avoid common analysis errors, and
- Do science more efficiently!
In practice, the relationship between a science question, available data, and the tools used for analysis is often not linear. Science questions typically guide data collection and product design, but they are also shaped by what data already exist. As a result, data structures, file formats, and tools are interconnected, with multiple valid pathways depending on the data set and workflow. The flowchart accompanying this page highlights these conceptual relationships between science questions, data structures, file formats, and analysis tools.
How Earth science data are structured
Before working with specific file formats, it helps to understand the underlying data structures they represent. At NSIDC, most data sets fall into one or more of the structures below.
Raster / gridded data
Common applications: Continuous spatial fields
Common formats: GeoTIFF, NetCDF, HDF5, COG, ASCII grid
Raster data, the structure used by most remotely sensed imagery products, represent space as a regular grid of cells. Each cell corresponds to a defined area on Earth and stores a value or category. For example, consider a grid in which each cell (also called a pixel) contains a numeric value between 1 and 5.
In this simplified example, each cell represents a geographic footprint, not a point. Any location within a cell, whether in the middle or near the outer edge, is represented by the same numeric value. This interpretation can vary by product and methodology: output from models is typically valid only for a point, and techniques for resampling remote sensing data can change the assumptions around grid cell interpretation.
Key concepts
- Extent — the geographic area covered
- Cell size — the size of each cell (e.g., 25 km × 25 km)
- Coordinate Reference System (CRS) — how the grid maps to Earth
- Grid definition — origin, orientation, and spacing
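As a rough illustration of how these concepts fit together, the sketch below builds a small grid of values between 1 and 5 and looks up the cell whose footprint contains a given coordinate. The origin, cell size, and the helper name `value_at` are hypothetical and chosen only for illustration; real products define their grids in their own metadata and documentation.

```python
import numpy as np

# A 3 x 4 raster: each cell holds a value between 1 and 5.
grid = np.array([
    [1, 2, 3, 4],
    [5, 4, 3, 2],
    [1, 3, 5, 2],
])

# Hypothetical grid definition (projected coordinates in km).
x_origin = 0.0     # x of the upper-left corner of the grid
y_origin = 75.0    # y of the upper-left corner of the grid
cell_size = 25.0   # each cell covers 25 km x 25 km


def value_at(x, y):
    """Return the value of the cell whose footprint contains (x, y)."""
    col = int((x - x_origin) // cell_size)
    row = int((y_origin - y) // cell_size)  # rows count downward from the top
    if not (0 <= row < grid.shape[0] and 0 <= col < grid.shape[1]):
        raise ValueError("coordinate is outside the grid extent")
    return int(grid[row, col])


# Two locations inside the same cell footprint return the same value.
center = value_at(37.5, 62.5)  # middle of row 0, column 1
edge = value_at(49.9, 50.1)    # near a corner of the same cell
```

Here the extent follows from the grid definition: 4 columns and 3 rows of 25 km cells cover 100 km by 75 km.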
Many raster products are derived, not directly observed. Examples include:
- Individual satellite observations aggregated to a grid
- Statistics from multiple observations (a mean, max, or min, etc.) for each grid cell
- Point data interpolated onto a grid
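The second case, per-cell statistics from multiple observations, can be sketched with NumPy. The values below are made up, and NaN is assumed to mark cells with no valid observation; this is an illustration rather than how any particular product is generated.

```python
import numpy as np

# Three hypothetical 2 x 2 observation grids covering the same area.
# NaN marks cells with no valid observation in that pass.
observations = np.array([
    [[1.0, 2.0], [np.nan, 4.0]],
    [[3.0, 2.0], [5.0, np.nan]],
    [[5.0, 2.0], [1.0, 6.0]],
])

# Collapse the first axis (the individual observations) to get one
# derived value per grid cell, ignoring missing observations.
cell_mean = np.nanmean(observations, axis=0)
cell_max = np.nanmax(observations, axis=0)

# Number of valid observations that contributed to each cell.
n_valid = np.sum(~np.isnan(observations), axis=0)
```

Keeping a count such as `n_valid` alongside the statistic is one way to remember that a derived cell value may rest on few observations.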
Raster data can be stored in a variety of file formats (such as GeoTIFF, NetCDF, HDF5, COG, ASCII grid as stated above), but they all share the same underlying structure of a regular grid with values assigned to each cell. The specific format may affect how you read and interpret the data, but the conceptual model of a raster grid remains consistent across formats.
GeoTIFF extends standard image files by embedding geographic referencing information directly in the file. Cloud Optimized GeoTIFF (COG) is an extension of GeoTIFF designed for cloud storage and efficient access. COG files are structured to allow for efficient reading of data over the internet, making them ideal for web mapping and analysis.
The NetCDF data format is commonly used to store raster data (although it can represent other structures as well).
Many NSIDC data sets are best understood as named, labeled arrays, not as tables or spreadsheets.
NetCDF allows data to be organized by named dimensions such as space, time, and height.
variable(time, y, x)
This means:
- The data are stored as a multidimensional block
- Each axis has a name, units, and meaning
- Position along each axis matters
NetCDF is a scientific data container designed to store multidimensional arrays with metadata. At NSIDC it is often used to store gridded data sets, time series stacks, or model outputs, but it can represent many structures (not just raster or grid) depending on how it is written.
HDF5 was designed as a flexible container for storing large scientific data sets with complex internal structure. It is the underlying storage layer (the "backbone") of NetCDF4, and it supports hierarchical grouping, mixed data types, and metadata. NetCDF, in turn, was created to store large, multidimensional scientific data sets while keeping data and metadata together. NetCDF allows variables to be tied to named dimensions (time, latitude, longitude, height), making files self-describing and machine-interpretable.
NetCDF = dimensions + variables + attributes
A NetCDF file typically contains three key components.
Dimensions define the shape of the data. Below is an example of a 3-dimensional (3-D) data set.
Dimensions:
time = 365
y = 720
x = 720
Each dimension represents an independent axis:
- time → when the observation occurred
- y, x → the coordinates of each grid cell (usually the center of the grid cell)
- sometimes z or level → vertical structure (this would be a 4th dimension)
Variables are arrays of values defined over one or more dimensions.
Variables:
sea_ice_concentration(time, y, x)
latitude(y, x)
longitude(y, x)
Visually, you can think of a 3-dimensional variable (like sea_ice_concentration above) as a stack of 2-D maps through time that form a data cube.
Each slice shares:
- the same grid
- the same projection
- the same variable definition
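The data-cube idea can be sketched with a plain NumPy array. The dimension sizes here are deliberately small and the values are placeholders, both assumptions for illustration. Libraries such as xarray attach the dimension names to the array itself; in this sketch a simple tuple keeps track of them.

```python
import numpy as np

# Hypothetical, deliberately small dimensions (a real daily product
# might be time=365, y=720, x=720).
dims = ("time", "y", "x")
shape = (4, 3, 5)

# One value per (time, y, x) position, filled with placeholder numbers.
sea_ice_concentration = np.arange(np.prod(shape), dtype=float).reshape(shape)

# Selecting one index along "time" yields a single 2-D map.
first_map = sea_ice_concentration[0]          # shape (y, x)

# Selecting one (y, x) cell yields a 1-D time series for that location.
cell_series = sea_ice_concentration[:, 1, 2]  # shape (time,)

# Axis order matters: look up an axis by name rather than guessing.
time_axis = dims.index("time")
time_mean = sea_ice_concentration.mean(axis=time_axis)  # shape (y, x)
```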
Attributes store metadata that describe how to interpret the values and may be attached to:
- the file (global attributes)
- individual variables
- spatial coordinate variables
Attributes:
units = "%"
valid_range = 0 to 100
missing_value = -9999
long_name = "Sea Ice Concentration"
Attributes explain:
- what the numbers mean
- how they were produced
- how they should (and should not) be used
Having well-defined attributes is something that sets NetCDF apart from simpler formats. A great deal of scientific information can be embedded within these attributes, and much of it is essential for proper data interpretation.
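To show why attributes matter in practice, the sketch below applies the example attributes listed earlier to a small array of made-up raw values. The dictionary and variable names are hypothetical; high-level readers often apply some of these rules automatically, but the behavior depends on the tool and on how the file was written.

```python
import numpy as np

# Hypothetical attributes, mirroring the example listed earlier.
attrs = {
    "units": "%",
    "valid_range": (0, 100),
    "missing_value": -9999,
    "long_name": "Sea Ice Concentration",
}

# Raw values as they might be stored in the file.
raw = np.array([[15.0, -9999.0, 80.0],
                [100.0, 120.0, 0.0]])

# Replace the missing-value flag and out-of-range values with NaN
# so they cannot leak into statistics.
low, high = attrs["valid_range"]
is_missing = raw == attrs["missing_value"]
out_of_range = (raw < low) | (raw > high)
clean = np.where(is_missing | out_of_range, np.nan, raw)

naive_mean = raw.mean()          # contaminated by -9999 and 120
masked_mean = np.nanmean(clean)  # mean of 15, 80, 100, and 0
```

Ignoring the attributes gives a strongly negative "mean concentration", which is physically meaningless; honoring them gives the mean of the four valid cells.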
Common NetCDF data shapes
3-D (for example, a variable over time and horizontal space)
variable(time, y, x)
Examples:
- Sea ice concentration
- Snow cover
- Surface temperature
4-D (for example, a variable over time, vertical levels, and horizontal space)
variable(time, level, y, x)
Examples:
- Air temperature at atmospheric pressure levels (e.g., 1000 hPa → 100 hPa)
- Ocean salinity at different depths
- Soil temperature/moisture at different depths
Swath data
Common applications: Preserving original satellite observation geometry
Common formats: HDF-EOS, NetCDF with geolocation arrays
Swath data preserve measurements close to their native satellite measurement geometry along an orbit.
Satellite flight direction →
────────────────────────────────────
◯ ◯ ◯ ◯ ◯
◯ ◯ ◯ ◯
◯ ◯ ◯ ◯ ◯
◯ ◯ ◯ ◯
Each ◯ = one observation footprint
The above is a highly idealized example, as swath geometry varies by mission, but notice that:
- Spacing can vary across the swath (although this is not the case for all sensors)
- Footprints overlap differently near edges
- There is no fixed row/column alignment
Swath data typically preserve:
- Native resolution
- Viewing angle
- Exact observation timing
This makes them important for:
- Calibration and validation
- Algorithm development
- Applications sensitive to observation geometry
Common challenges
- Harder to visualize
- Not directly comparable across time or space
- Often must be transformed to a regular geospatial grid before analysis
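One simple way to place swath observations on a regular grid is "drop-in-the-bucket" averaging: each footprint is assigned to the grid cell that contains its center, and the values falling in each cell are averaged. The sketch below uses made-up coordinates and a hypothetical 2 x 2 target grid in projected units. It is only an illustration of the idea; gridded products may use more careful resampling that accounts for footprint size and shape.

```python
import numpy as np

# Hypothetical swath: one (x, y) location and one value per footprint center.
obs_x = np.array([5.0, 12.0, 18.0, 3.0, 14.0])
obs_y = np.array([4.0, 6.0, 17.0, 2.0, 8.0])
obs_value = np.array([10.0, 20.0, 30.0, 14.0, 40.0])

# Hypothetical target grid: 2 x 2 cells of size 10, lower-left corner at (0, 0).
# Row 0 is the bottom row here because y increases upward.
cell_size = 10.0
n_rows, n_cols = 2, 2

# Index of the grid cell that contains each footprint center.
col = (obs_x // cell_size).astype(int)
row = (obs_y // cell_size).astype(int)

# Accumulate sums and counts per cell, then divide to get the mean.
total = np.zeros((n_rows, n_cols))
count = np.zeros((n_rows, n_cols))
np.add.at(total, (row, col), obs_value)
np.add.at(count, (row, col), 1)

# Cells that received no observations stay NaN instead of zero.
gridded = np.full((n_rows, n_cols), np.nan)
np.divide(total, count, out=gridded, where=count > 0)
```

Note that the gridded result no longer records the viewing angle or exact timing of each footprint, which is the trade-off described above.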
HDF-EOS adds Earth Observing System conventions on top of HDF to represent satellite data structures such as swaths, grids, and point measurements. It was created so NASA mission products could store observation geometry and geolocation alongside measurements in a consistent way.
Tabular data
Common applications: Discrete observations, station data, or summaries
Common formats: CSV, TSV, Parquet, NetCDF tables
For many users, tabular or delimited data feel familiar because they resemble spreadsheets used in tools like Excel or Google Sheets.
| date | latitude | longitude | variable |
|------------|----------|-----------|----------|
| 2022-01-01 | 70.2 | -150.1 | 0.82 |
| 2022-01-02 | 70.2 | -150.1 | 0.79 |
| 2022-01-03 | 70.2 | -150.1 | 0.85 |
Each row represents a record, and each column represents a field.
Tabular formats are widely used because the record–field structure organizes data into simple rows and columns. They are human-readable, easy to edit and inspect, and supported by nearly all software environments.
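The record and field structure can be sketched using only the Python standard library. The CSV text below reproduces the example table above; the column name `variable` is the placeholder from that table, not a real product field.

```python
import csv
import io
from datetime import date

# The example table above, written as CSV text.
csv_text = """date,latitude,longitude,variable
2022-01-01,70.2,-150.1,0.82
2022-01-02,70.2,-150.1,0.79
2022-01-03,70.2,-150.1,0.85
"""

# Each row becomes one record (a dict); each column name becomes a field.
records = []
for row in csv.DictReader(io.StringIO(csv_text)):
    records.append({
        "date": date.fromisoformat(row["date"]),
        "latitude": float(row["latitude"]),
        "longitude": float(row["longitude"]),
        "variable": float(row["variable"]),
    })

# CSV stores everything as text, so types must be converted explicitly.
mean_value = sum(r["variable"] for r in records) / len(records)
```

The explicit type conversion is a reminder that a delimited text file carries no information about data types or units; that knowledge has to come from documentation.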
At NSIDC, tabular data are commonly used for:
- Time series for a location or region
- Summary statistics derived from spatial products
- Point-based observations or extractions
Several file formats exist specifically to store tabular information, ranging from simple text tables to structured exchange formats.
What tabular data do not standardize
Unlike gridded or vector formats, tabular files typically do not embed:
- Coordinate reference system (CRS)
- Spatial resolution or footprint
- Relationships between neighboring observations
- Processing history
Instead, this information is often:
- Documented externally
- Assumed based on context
- Non-existent
For example, a table like this:
| latitude | longitude | value |
|----------|-----------|-------|
| 70.2 | -150.1 | 0.82 |
This table does not tell you:
- Whether the value represents a point measurement
- Whether it is an average over an area
- How large that area might be
That meaning must come from metadata or documentation.
A spreadsheet may show what the values are, but not where or how they were produced.
In general, tabular structures are well-suited for:
- Single-location time series
- Small collections of points
- Summaries and diagnostics
They are less suitable for:
- Continuous spatial fields
- Large-scale gridded products
- Analyses requiring detailed spatial context
Delimited text formats like CSV were created to provide a simple, universal way to store tabular data that can be opened in nearly any software environment. Because they are human-readable and easy to generate, they remain widely used for time series, summaries, and data exchange.
Vector spatial data
Common applications: Boundaries, tracks, and discrete geographic features
Common formats: Shapefile, GeoJSON, GeoPackage, KML
Vector data typically represent discrete geographic features rather than continuous spatial fields.
Examples:
- Glacier outlines
- Ice sheet boundaries
- Grounding lines
Geometry types
Points
Points represent a single location with no area or length.
Common uses:
- Observation stations
- Sample locations
- Reference points
Lines
Lines represent linear features with length but no area.
Common uses:
- Grounding lines
- Ice front positions
- Transects
Polygons
Polygons represent areas with defined boundaries.
Common uses:
- Glacier outlines
- Ice sheet masks
- Drainage basins
Attributes
Each feature carries attribute data that describe the properties of that individual feature:
| Feature ID | Name | Area_km2 | Year |
|------------|-------------|----------|------|
| 001 | Glacier A | 12.4 | 2020 |
| 002 | Glacier B | 8.9 | 2020 |
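A vector feature pairs a geometry with attributes like those in the table above. The sketch below represents one polygon feature as a plain dictionary, loosely following the GeoJSON layout, and derives an area attribute from its boundary with the shoelace formula. The coordinates are made up and assumed to be in a projected system with units of kilometers; the helper `ring_area` is hypothetical, and geospatial libraries provide more robust equivalents.

```python
# A hypothetical polygon feature, loosely following the GeoJSON layout.
# Coordinates are assumed to be projected, in kilometers.
feature = {
    "type": "Feature",
    "geometry": {
        "type": "Polygon",
        # One exterior ring; the first and last vertices are identical.
        "coordinates": [
            [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0), (0.0, 0.0)]
        ],
    },
    "properties": {"feature_id": "001", "name": "Glacier A", "year": 2020},
}


def ring_area(ring):
    """Planar area of a closed ring using the shoelace formula."""
    twice_area = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


# Derive an attribute from the geometry and store it with the feature.
exterior = feature["geometry"]["coordinates"][0]
feature["properties"]["area_km2"] = ring_area(exterior)
```

This planar calculation is only meaningful in a projected coordinate system; applying it directly to latitude and longitude values would not give an area in square kilometers.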
Vector formats exist because many spatial features are:
- Discrete rather than continuous
- Better described by boundaries than averages
- Interpreted as objects, not fields
The shapefile format, actually composed of multiple files, was introduced as an early GIS standard for storing vector features with attributes. Although newer formats now exist, shapefiles remain common because of their compatibility across tools and legacy data sets.
How vector data differ from rasters
Raster: value everywhere
Vector: value only where features exist
Key differences:
- Geometry is explicit, not implied by grid cells
- Precision depends on digitization, not grid spacing
- Attributes are attached to features, not cells
Vector data represent objects and boundaries, not continuous fields.
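The contrast can be made concrete by converting a vector polygon into a raster mask. In the sketch below, the vector form stores four vertices, while the raster form stores a value for every cell in the grid. The polygon, the grid size, and the rule of testing each cell center are assumptions for illustration; the helper `contains` is a basic ray-casting test, not a library function.

```python
# Vector view: a hypothetical rectangular polygon, stored as four vertices.
polygon = [(1.0, 1.0), (4.0, 1.0), (4.0, 3.0), (1.0, 3.0)]


def contains(polygon, x, y):
    """Ray-casting test: is the point (x, y) inside the polygon?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Raster view: a 4 x 5 grid of unit cells; every cell gets a value,
# based on whether its center falls inside the polygon.
n_rows, n_cols = 4, 5
mask = [
    [1 if contains(polygon, c + 0.5, r + 0.5) else 0 for c in range(n_cols)]
    for r in range(n_rows)
]
```

In the raster form the boundary is only as precise as the cell size, whereas the vector form keeps the vertex coordinates exactly as they were digitized.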
Common file formats at NSIDC
The table below lists common file formats encountered in NSIDC data products and some practical considerations when working with them.
| File format | Common uses | Quirks | NSIDC example products | Common tools (programmatic + GUI) |
|---|---|---|---|---|
| NetCDF4 / NetCDF classic | Climate records, reanalysis products, gridded time series, model output. | Time encoding confusion; multidimensional indexing; large file sizes; steep learning curve without high-level tools. | Sea Ice Concentration CDR; Antarctic Ice Velocity | Python: xarray, netCDF4 (low-level); R: terra, stars, ncdf4; GUI: Panoply, QGIS, ArcGIS Pro, HDFView |
| HDF5 | Mission-level products, intermediate processing outputs, complex hierarchical data sets. | Deep hierarchies; unclear variable naming; inconsistent conventions between products. | SMAP L3 Soil Moisture; ICESat-2 ATL03 | Python: rioxarray, h5py; R: rhdf5; GUI: Panoply, HDFView |
| HDF-EOS2 / HDF-EOS5 | Satellite swath and gridded mission products (e.g., MODIS). | Specialized structure; harder to read without EOS-aware tools; metadata spread across groups. | MODIS Snow Cover; AMSR-E/AMSR2 SWE | Python: xarray, h5netcdf, pyhdf; R: terra; GUI: Panoply, HDFView |
| GeoTIFF / TIFF / COG | Gridded maps, spatial subsets, visualization-ready products. | Limited metadata compared to NetCDF; alignment issues across data sets; large files at high resolution. | Ice Sheet Velocity; Land Freeze/Thaw Status | Python: rasterio, rioxarray, xarray; R: terra, raster; GUI: QGIS, ArcGIS Pro |
| Shapefile | Ice sheet boundaries, grounding lines, glacier outlines. | Multi-file dependency; field name limits; aging format. | Glacier Database; Glacier Terminus Positions | Python: geopandas, shapely; R: sf; GUI: QGIS, ArcGIS Pro |
| CSV / TSV / Other Delimited Text | Time series, summaries, point extractions. | Often lacks spatial context or a coordinate reference system; metadata easily lost; scaling issues. | Ice Thickness | Python: pandas; R: readr, data.table, tibble; GUI: Excel, LibreOffice |
| ASCII grid / text | Historical products, simple model outputs. | No embedded CRS; manual parsing required. | Sea Ice Trends | Python: numpy; R: base R, terra; GUI: Text editors, QGIS (limited) |
| PNG / GIF / JPEG | Quicklook imagery, browse products, outreach. | Not analysis-ready; no quantitative metadata. | Arctic Change | Python: numpy, OpenCV, Pillow; GUI: Web browser, image viewers, NASA Worldview |
| JSON | Metadata responses, data access services. | Not suitable for large arrays; not analysis-ready. | SMAP Ancillary Files | Python: json; R: jsonlite; GUI: Web browser |
Wrapping up
A single NSIDC data set may involve multiple structures (and formats!) through stages of its development and use. For example, a satellite mission may produce swath data in HDF-EOS, which are then resampled to a regular grid and stored in NetCDF for distribution, and users may extract tabular time series for analysis…
Swath observations
↓
Modeled / aggregated grid
↓
NetCDF storage
↓
Tabular extraction for analysis
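The last step of this pathway, extracting a tabular time series from a gridded variable, can be sketched as follows. The array values, the cell-center coordinates, and the start date are all made up for illustration, and the column name `variable` is a placeholder.

```python
import csv
import io
from datetime import date, timedelta

import numpy as np

# Hypothetical gridded variable with dimensions (time, y, x).
start = date(2022, 1, 1)
cube = np.array([
    [[0.80, 0.82], [0.90, 0.95]],
    [[0.78, 0.79], [0.88, 0.93]],
    [[0.83, 0.85], [0.91, 0.96]],
])

# Hypothetical cell-center coordinates for each (y, x) position.
latitude = np.array([[70.2, 70.2], [70.0, 70.0]])
longitude = np.array([[-150.4, -150.1], [-150.4, -150.1]])

# Pick one grid cell and turn its values through time into table rows.
row, col = 0, 1
rows = [
    {
        "date": (start + timedelta(days=t)).isoformat(),
        "latitude": float(latitude[row, col]),
        "longitude": float(longitude[row, col]),
        "variable": float(cube[t, row, col]),
    }
    for t in range(cube.shape[0])
]

# Write the rows as CSV text.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["date", "latitude", "longitude", "variable"]
)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

The resulting table keeps the values but drops the grid definition, the coordinate reference system, and the cell footprint, so that context needs to be recorded separately.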
Understanding data structure (and how that relates to the data format) can help you interpret values correctly, use appropriate tools, and avoid common issues or wrong assumptions.