Data structures and formats

Earth observation data are stored in a variety of structures and file formats, each shaped by how the data were collected, processed, distributed, and historically used, along with the science questions that guided their creation. At NSIDC, formats reflect satellite mission design, modeling needs, interoperability standards, and long-term archiving requirements.

This page describes the formats and structures used to store NSIDC data so that users can better understand how to work with them.

It is helpful to distinguish between:

Data structure — how data observations are organized conceptually (e.g. as a grid, rows/columns, discrete features, etc.), and

File format — how those structured data are stored and delivered.

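Understanding the basics (and the differences between the two) can help you interpret values correctly, use appropriate tools, and avoid common issues or wrong assumptions.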

In practice, the relationship between a science question, available data, and the tools used for analysis is often not linear. Science questions often guide data collection and product design, but they are also shaped by what data already exist. As a result, data structures, file formats, and tools are interconnected, with multiple valid pathways depending on the data set and workflow. The figure below highlights these conceptual relationships between science questions, data structures, file formats, and analysis tools.

flowchart TD

A[Science question]

B[Data structures]
B1[Raster]
B2[Swath]
B3[Vector]
B4[Tabular]

C[File formats]
C1[NetCDF / HDF5]
C2[GeoTIFF]
C3[Shapefile]
C4[CSV]

D((Analysis tools))
D1((Python))
D2((R))
D3((GIS software))
D4((Web tools))

A -.-> B
B -.-> A

B --> B1
B --> B2
B --> B3
B --> B4

B1 --> C
B2 --> C
B3 --> C
B4 --> C

C --> C1
C --> C2
C --> C3
C --> C4

C -.-> D

D --> D1
D --> D2
D --> D3
D --> D4

How Earth science data are structured

Before working with specific file formats, it helps to understand the underlying data structures they represent. At NSIDC, most data sets fall into one or more of the structures below.

Raster / gridded data

Tip: At a glance…

Common applications: Continuous spatial fields
Common formats: GeoTIFF, NetCDF, HDF5, COG, ASCII grid

Raster data represent space as a regular grid of cells, and most remotely sensed imagery products use this structure. Each cell corresponds to a defined area on Earth and stores a value or category. For example, below is a grid where each grid cell (also called a pixel) contains a numeric value between 1 and 5.

[Figure: Raster grid where each pixel (cell) stores a value. A 3 by 3 raster grid with each cell containing a numeric value between 1 and 5; one cell is highlighted. Cell values: 3 4 5 / 2 3 4 / 1 2 3.]

In this simplified example, each cell represents a geographic footprint, not a point. This means that any location within a cell (whether in the middle or near the outer edge) is represented by the same numeric value. Note that this interpretation can vary by product and methodology: model output is typically valid only for a point, and techniques for resampling remote sensing data can change the assumptions around grid cell interpretation.

Key concepts

  • Extent — the geographic area covered
  • Cell size — the size of each cell (e.g., 25 km × 25 km)
  • Coordinate Reference System (CRS) — how the grid maps to Earth
  • Grid definition — origin, orientation, and spacing
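As a rough sketch of how these concepts fit together, the snippet below models a grid definition and uses it to look up the cell that contains a coordinate. The `GridDefinition` class, its field names, and the 25,000 m cell size in a projected CRS are illustrative assumptions, not part of any NSIDC product or library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GridDefinition:
    """Hypothetical container for extent, cell size, and grid origin."""
    x_min: float      # x coordinate of the grid's left edge (CRS units)
    y_max: float      # y coordinate of the grid's top edge (CRS units)
    cell_size: float  # width and height of one square cell (CRS units)
    n_rows: int
    n_cols: int

    def extent(self):
        """Return (x_min, y_min, x_max, y_max) covered by the grid."""
        x_max = self.x_min + self.n_cols * self.cell_size
        y_min = self.y_max - self.n_rows * self.cell_size
        return (self.x_min, y_min, x_max, self.y_max)

    def cell_center(self, row, col):
        """Return the (x, y) coordinate at the center of a cell."""
        x = self.x_min + (col + 0.5) * self.cell_size
        y = self.y_max - (row + 0.5) * self.cell_size
        return (x, y)

    def cell_index(self, x, y):
        """Return (row, col) of the cell whose footprint contains (x, y)."""
        col = int((x - self.x_min) // self.cell_size)
        row = int((self.y_max - y) // self.cell_size)
        if not (0 <= row < self.n_rows and 0 <= col < self.n_cols):
            raise ValueError("point lies outside the grid extent")
        return (row, col)


# Assumed example: a 3 x 3 grid of 25 km cells in a projected CRS (meters).
grid = GridDefinition(x_min=0.0, y_max=75_000.0, cell_size=25_000.0,
                      n_rows=3, n_cols=3)
values = [[3, 4, 5], [2, 3, 4], [1, 2, 3]]  # illustrative cell values

row, col = grid.cell_index(30_000.0, 70_000.0)
print(grid.extent(), (row, col), values[row][col])
```

Any coordinate that falls inside the same footprint returns the same `(row, col)`, which is the "cell as an area, not a point" interpretation described above.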

Many raster products are derived, not directly observed. Examples include:

  • Individual satellite observations being aggregated to a grid
  • Statistics from multiple observations (a mean, max, or min, etc.) for each grid cell
  • Point data interpolated onto a grid
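One way to picture a derived raster is to group individual observations by the grid cell they fall in and then summarize each cell. The sketch below does this with the standard library only; the function name, the coordinates, and the values are made up for illustration, and real products typically use more careful resampling.

```python
from collections import defaultdict
from statistics import mean


def grid_statistics(observations, x_min, y_max, cell_size):
    """Group (x, y, value) observations by grid cell and summarize each cell."""
    cells = defaultdict(list)
    for x, y, value in observations:
        col = int((x - x_min) // cell_size)
        row = int((y_max - y) // cell_size)
        cells[(row, col)].append(value)
    return {
        cell: {"count": len(v), "mean": mean(v), "min": min(v), "max": max(v)}
        for cell, v in cells.items()
    }


# Hypothetical observations as (x, y, value) in projected units.
observations = [
    (5.0, 95.0, 0.80),
    (8.0, 92.0, 0.90),
    (30.0, 95.0, 0.50),
]

stats = grid_statistics(observations, x_min=0.0, y_max=100.0, cell_size=25.0)
for cell, summary in sorted(stats.items()):
    print(cell, summary)
```

Cells with no observations simply do not appear in the result, which is one reason gridded products need an explicit missing-data value.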

Raster data can be stored in a variety of file formats (such as GeoTIFF, NetCDF, HDF5, COG, ASCII grid as stated above), but they all share the same underlying structure of a regular grid with values assigned to each cell. The specific format may affect how you read and interpret the data, but the conceptual model of a raster grid remains consistent across formats.

Note: More about GeoTIFF and COG data formats…

GeoTIFF extends standard image files by embedding geographic referencing information directly in the file. Cloud Optimized GeoTIFF (COG) is an extension of GeoTIFF designed for cloud storage and efficient access. COG files are structured to allow for efficient reading of data over the internet, making them ideal for web mapping and analysis.

The NetCDF data format is commonly used to store raster data (although it can represent other structures as well).

Many NSIDC data sets are best understood as named, labeled arrays, not as tables or spreadsheets.

NetCDF allows data to be organized by named dimensions such as space, time, and height.

variable(time, y, x)

This means:

  • The data are stored as a multidimensional block
  • Each axis has a name, units, and meaning
  • Position along each axis matters

NetCDF is a scientific data container designed to store multidimensional arrays with metadata. At NSIDC it is often used to store gridded data sets, time series stacks, or model outputs, but it can represent many structures (not just raster or grid) depending on how it is written.

Note: More about HDF5 and NetCDF data formats…

HDF5 was designed as a flexible container for storing large scientific data sets with complex internal structure. It supports hierarchical grouping, mixed data types, and metadata, and it is the underlying storage layer (the “backbone”) of NetCDF4. NetCDF was created to store large, multidimensional scientific data sets while keeping data and metadata together. It allows variables to be tied to named dimensions (time, latitude, longitude, height), making files self-describing and machine-interpretable.

NetCDF = dimensions + variables + attributes

A NetCDF file typically contains three key components.

Dimensions define the shape of the data. Below is an example of a 3-dimensional (3-D) data set.
Dimensions:
  time = 365
  y    = 720
  x    = 720

Each dimension represents an independent axis:

  • time → when the observation occurred
  • y, x → the coordinates of each grid cell (usually the center of the grid cell).
  • sometimes z or level → vertical structure (this would be a 4th dimension)
Variables are arrays of values defined over one or more dimensions.
Variables:
  sea_ice_concentration(time, y, x)
  latitude(y, x)
  longitude(y, x)

Visually, you can think of a 3-dimensional variable (like sea_ice_concentration above) as a stack of 2-D maps through time that form a data-cube:

[Figure: "Stacked" 2D grid slices forming a 3-dimensional NetCDF datacube. Sea ice concentration at three time steps, showing that a 3-dimensional NetCDF variable(time, y, x) is a stack of 2D arrays. Axes: y, x, time.
t = t₀: 0.82 0.74 0.61 / 0.78 0.70 0.52 / 0.36 0.43 0.29
t = t₁: 0.91 0.85 0.72 / 0.88 0.79 0.63 / 0.44 0.51 0.38
t = t₂: 0.87 0.80 0.69 / 0.83 0.75 0.58 / 0.40 0.47 0.33]

Each slice shares:

  • the same grid
  • the same projection
  • the same variable definition
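The stacking idea can be sketched with NumPy, assuming it is installed. The three small arrays below are made-up values on a shared 2 × 3 grid; they only stand in for real slices of a variable(time, y, x).

```python
import numpy as np

# Three illustrative 2-D maps on the same 2 x 3 grid (values are made up).
slice_t0 = np.array([[0.8, 0.7, 0.6], [0.4, 0.3, 0.2]])
slice_t1 = np.array([[0.9, 0.8, 0.7], [0.5, 0.4, 0.3]])
slice_t2 = np.array([[0.7, 0.6, 0.5], [0.3, 0.2, 0.1]])

# Stack along a new leading axis so the dimensions are (time, y, x).
cube = np.stack([slice_t0, slice_t1, slice_t2], axis=0)

map_at_t1 = cube[1, :, :]        # one 2-D map at a single time step
series_at_cell = cube[:, 0, 2]   # time series for one grid cell
time_mean = cube.mean(axis=0)    # average over time, leaving a (y, x) map

print(cube.shape, series_at_cell, time_mean.shape)
```

Because position along each axis matters, swapping the order of indices (for example `cube[0, 2, 1]` instead of `cube[1, 0, 2]`) selects a different value or fails outright, so it helps to check the dimension order before indexing.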
Attributes store metadata that describe how to interpret the values and may be attached to:
  • the file (global attributes)
  • individual variables
  • spatial coordinate variables
Attributes:
  units = "%"
  valid_range = 0 to 100
  missing_value = -9999
  long_name = "Sea Ice Concentration"

Attributes explain:

  • what the numbers mean
  • how they were produced
  • how they should (and should not) be used

Well-defined attributes set NetCDF apart from simpler formats. A great deal of scientific information can be embedded in these attributes, and much of it is essential for proper data interpretation.
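To illustrate why attributes matter, the sketch below applies the example attributes to a small array of raw values before computing a mean. It assumes NumPy is installed; the `apply_attributes` helper is hypothetical, and treating everything outside valid_range as missing is an assumption, since some products store flag values outside that range that deserve separate handling.

```python
import numpy as np

# Attributes copied from the example above, held in a plain dictionary.
attributes = {
    "units": "%",
    "valid_range": (0, 100),
    "missing_value": -9999,
    "long_name": "Sea Ice Concentration",
}


def apply_attributes(raw, attrs):
    """Replace missing or out-of-range values with NaN (hypothetical helper)."""
    values = np.asarray(raw, dtype=float)
    low, high = attrs["valid_range"]
    invalid = (
        (values == attrs["missing_value"]) | (values < low) | (values > high)
    )
    return np.where(invalid, np.nan, values)


raw = np.array([[15, -9999], [100, 250]])  # made-up raw values
clean = apply_attributes(raw, attributes)

print("naive mean: ", raw.mean())          # polluted by -9999 and 250
print("masked mean:", np.nanmean(clean), attributes["units"])
```

The naive mean is dominated by the missing-value sentinel, while the masked mean uses only the two valid cells.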

Common NetCDF data shapes
3-D (for example, a variable over time and horizontal space)
variable(time, y, x)

Examples:

  • Sea ice concentration
  • Snow cover
  • Surface temperature
4-D (perhaps a variable over time + horizontal space + vertical space)
variable(time, level, y, x)

Examples:

  • Air temperature at atmospheric pressure levels (e.g., 1000 hPa → 100 hPa)
  • Ocean salinity at different depths
  • Soil temperature/moisture at different depths
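For a 4-D variable, a common first step is to pick one vertical level by its coordinate value and then work with the remaining (time, y, x) block. The sketch below assumes NumPy is installed; the array contents, the level values, and the `dims` tuple are placeholders for illustration.

```python
import numpy as np

# Dimension names in storage order, plus coordinate values for "level".
dims = ("time", "level", "y", "x")
levels_hpa = np.array([1000, 850, 500, 100])

# Placeholder data with shape (time=2, level=4, y=3, x=3).
air_temperature = np.arange(2 * 4 * 3 * 3, dtype=float).reshape(2, 4, 3, 3)

# Find the axis by name and the index by coordinate value, not by guesswork.
level_axis = dims.index("level")
level_index = int(np.flatnonzero(levels_hpa == 500)[0])

# Selecting one level removes that axis, leaving (time, y, x).
t500 = np.take(air_temperature, level_index, axis=level_axis)
t500_time_mean = t500.mean(axis=0)

print(t500.shape, t500_time_mean.shape)
```

Higher-level tools such as xarray do this lookup by name for you; the manual version is shown only to make the role of named dimensions explicit.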

Swath data

Tip: At a glance…

Common applications: Preserving original satellite observation geometry
Common formats: HDF-EOS, NetCDF with geolocation arrays

Swath data preserve measurements close to their native satellite measurement geometry along an orbit.

Satellite flight direction →
────────────────────────────────────

   ◯     ◯      ◯      ◯      ◯
     ◯      ◯      ◯      ◯
   ◯     ◯      ◯      ◯      ◯
     ◯      ◯      ◯      ◯

Each ◯ = one observation footprint

The above is a highly idealized example as swath geometry varies by mission, but notice how:

  • Spacing can vary across the swath (although this is not the case for all sensors)
  • Footprints overlap differently near edges
  • No fixed row/column alignment

Swath data typically preserve:

  • Native resolution
  • Viewing angle
  • Exact observation timing

This makes them important for:

  • Calibration and validation
  • Algorithm development
  • Applications sensitive to observation geometry
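In a validation setting, a typical task is finding the swath footprint closest to a ground station. The sketch below assumes NumPy is installed and that each footprint carries its own latitude and longitude in 2-D geolocation arrays indexed by (scan, pixel). All coordinates and measurements are invented, and the spherical-Earth haversine distance is a simplification.

```python
import numpy as np


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km; inputs in degrees, arrays are broadcast."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(a))


# Hypothetical swath: 2 scans x 3 footprints, each with its own location.
lat = np.array([[70.00, 70.10, 70.30],
                [70.15, 70.25, 70.45]])
lon = np.array([[-151.0, -150.0, -148.5],
                [-151.2, -150.2, -148.7]])
measurement = np.array([[0.81, 0.79, 0.75],
                        [0.84, 0.80, 0.77]])

station_lat, station_lon = 70.2, -150.1  # made-up station location

# Distance from every footprint to the station, then the closest one.
distance = haversine_km(lat, lon, station_lat, station_lon)
scan, pixel = np.unravel_index(np.argmin(distance), distance.shape)

print((int(scan), int(pixel)), round(float(distance[scan, pixel]), 1), "km")
```

Note that the footprints have no fixed spacing, so the lookup has to go through the geolocation arrays rather than a simple row/column formula.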

Common challenges

  • Harder to visualize
  • Not directly comparable across time or space
  • Often need to be transformed to a regular geospatial grid before analysis
Note: More about the HDF-EOS data format…

HDF-EOS adds Earth Observing System conventions on top of HDF to represent satellite data structures such as swaths, grids, and point measurements. It was created so NASA mission products could store observation geometry and geolocation alongside measurements in a consistent way.

Tabular data

Tip: At a glance…

Common applications: Discrete observations, station data, or summaries
Common formats: CSV, TSV, Parquet, NetCDF tables

For many users, tabular or delimited data feel familiar because they resemble spreadsheets used in tools like Excel or Google Sheets.

| date       | latitude | longitude | variable |
|------------|----------|-----------|----------|
| 2022-01-01 |  70.2    | -150.1    | 0.82     |
| 2022-01-02 |  70.2    | -150.1    | 0.79     |
| 2022-01-03 |  70.2    | -150.1    | 0.85     |

Each row represents a record, and each column represents a field.
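The record and field structure maps directly onto how delimited text is read in code. The sketch below parses the example table as CSV text using only the Python standard library; holding the text in a string (rather than a file on disk) is just a convenience for the example.

```python
import csv
import io
from statistics import mean

# The example table above, written as comma-separated text.
csv_text = """date,latitude,longitude,variable
2022-01-01,70.2,-150.1,0.82
2022-01-02,70.2,-150.1,0.79
2022-01-03,70.2,-150.1,0.85
"""

# Each row becomes one record; each column name becomes a field.
reader = csv.DictReader(io.StringIO(csv_text))
records = [
    {
        "date": row["date"],
        "latitude": float(row["latitude"]),
        "longitude": float(row["longitude"]),
        "variable": float(row["variable"]),
    }
    for row in reader
]

fields = list(records[0].keys())
mean_value = mean(r["variable"] for r in records)

print(fields, len(records), round(mean_value, 2))
```

Everything arrives as text, so numeric fields have to be converted explicitly; the file itself carries no units or data types.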

Tabular formats are widely used because the record–field structure organizes data into simple rows and columns. They are human-readable, easy to edit and inspect, and supported by nearly all software environments.

At NSIDC, tabular data are commonly used for:

  • Time series for a location or region
  • Summary statistics derived from spatial products
  • Point-based observations or extractions

Several file formats exist specifically to store tabular information, ranging from simple text tables to structured exchange formats.

What tabular data do not standardize

Unlike gridded or vector formats, tabular files typically do not embed:

  • Coordinate reference system (CRS)
  • Spatial resolution or footprint
  • Relationships between neighboring observations
  • Processing history

Instead, this information is often:

  • Documented externally
  • Assumed based on context
  • Non-existent

For example, a table like this:

| latitude | longitude | value |
|----------|-----------|-------|
| 70.2     | -150.1    | 0.82  |

Does not tell you:

  • Whether the value represents a point measurement
  • Whether it is an average over an area
  • How large that area might be

That meaning must come from metadata or documentation.

A spreadsheet may show what the values are, but not where or how they were produced.

In general, tabular structures are well-suited for:

  • Single-location time series
  • Small collections of points
  • Summaries and diagnostics

But are less suitable for:

  • Continuous spatial fields
  • Large-scale gridded products
  • Analyses requiring detailed spatial context
Note: More about delimited text formats…

Delimited text formats like CSV were created to provide a simple, universal way to store tabular data that can be opened in nearly any software environment. Because they are human-readable and easy to generate, they remain widely used for time series, summaries, and data exchange.

Vector spatial data

Tip: At a glance…

Common applications: Boundaries, tracks, and discrete geographic features
Common formats: Shapefile, GeoJSON, GeoPackage, KML

Vector data typically represent discrete geographic features rather than continuous spatial fields.

[Figure: the three vector geometry types: point, line, and polygon.]

Examples:

  • Glacier outlines
  • Ice sheet boundaries
  • Grounding lines

Geometry types

Points

Points represent a single location with no area or length.

Common uses:

  • Observation stations
  • Sample locations
  • Reference points
Lines

Lines represent linear features with length but no area.

Common uses:

  • Grounding lines
  • Ice front positions
  • Transects
Polygons

Polygons represent areas with defined boundaries.

Common uses:

  • Glacier outlines
  • Ice sheet masks
  • Drainage basins
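The three geometry types can be sketched as plain coordinate lists. The example below uses planar (projected) coordinates with arbitrary units and only the standard library; the helper names are illustrative, and geographic (latitude/longitude) coordinates would need a geodesic calculation instead.

```python
import math

# Illustrative geometries in planar coordinates (arbitrary units).
point = (3.0, 4.0)                                    # one location
line = [(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)]          # ordered vertices
polygon = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]  # ring, closure implied


def line_length(vertices):
    """Sum of straight-line distances between consecutive vertices."""
    return sum(math.dist(a, b) for a, b in zip(vertices, vertices[1:]))


def polygon_area(ring):
    """Planar area of a simple polygon using the shoelace formula."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


print(line_length(line), polygon_area(polygon))
```

A point has neither length nor area, a line has length only, and a polygon has both a perimeter and an enclosed area, which matches the definitions above.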

Attributes

Each feature carries attribute data that describe properties of that feature:

| Feature ID | Name        | Area_km2 | Year |
|------------|-------------|----------|------|
| 001        | Glacier A   | 12.4     | 2020 |
| 002        | Glacier B   |  8.9     | 2020 |
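One common way to keep geometry and attributes together is a GeoJSON-style feature, where attributes live under "properties". The sketch below builds two features from the table above using only the standard library; the polygon coordinates are invented placeholders, not real glacier outlines.

```python
import json

# Attribute values come from the table above; coordinates are placeholders.
feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "id": "001",
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[-150.0, 61.0], [-149.9, 61.0],
                                 [-149.9, 61.1], [-150.0, 61.0]]],
            },
            "properties": {"Name": "Glacier A", "Area_km2": 12.4, "Year": 2020},
        },
        {
            "type": "Feature",
            "id": "002",
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[-149.5, 61.2], [-149.4, 61.2],
                                 [-149.4, 61.3], [-149.5, 61.2]]],
            },
            "properties": {"Name": "Glacier B", "Area_km2": 8.9, "Year": 2020},
        },
    ],
}

# Select features by attribute, independent of their geometry.
large = [
    f["properties"]["Name"]
    for f in feature_collection["features"]
    if f["properties"]["Area_km2"] > 10
]

# Serialize to text and read it back to confirm nothing is lost.
round_trip = json.loads(json.dumps(feature_collection))
print(large, round_trip == feature_collection)
```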

Vector formats exist because many spatial features are:

  • Discrete rather than continuous
  • Better described by boundaries than averages
  • Interpreted as objects, not fields
Note: More about Shapefile data format…

The shapefile format, actually composed of multiple files, was introduced as an early GIS standard for storing vector features with attributes. Although newer formats now exist, shapefiles remain common because of their compatibility across tools and legacy data sets.

How vector data differ from rasters

Raster:  value everywhere
Vector:  value only where features exist

Key differences:

  • Geometry is explicit, not implied by grid cells
  • Precision depends on digitization, not grid spacing
  • Attributes are attached to features, not cells

Vector data represent objects and boundaries, not continuous fields.
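The contrast can be made concrete by converting a polygon to a raster mask: a cell is marked 1 when its center falls inside the polygon. This is a minimal sketch using a ray-casting point-in-polygon test and the standard library only; the function names and the triangle are illustrative, and production tools handle edge cases (points exactly on a boundary, holes, multipolygons) that this version ignores.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test: True if (x, y) lies inside the polygon ring."""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def rasterize(ring, x_min, y_max, cell_size, n_rows, n_cols):
    """Mark each cell 1 if its center is inside the polygon, else 0."""
    mask = []
    for row in range(n_rows):
        y = y_max - (row + 0.5) * cell_size
        mask.append([
            1 if point_in_polygon(x_min + (col + 0.5) * cell_size, y, ring) else 0
            for col in range(n_cols)
        ])
    return mask


# Illustrative triangle in planar coordinates, rasterized to a 4 x 4 grid.
triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
mask = rasterize(triangle, x_min=0.0, y_max=4.0, cell_size=1.0,
                 n_rows=4, n_cols=4)

for row in mask:
    print(row)
```

The vector triangle has an exact sloped edge, while the raster version approximates it with a staircase whose fidelity depends entirely on the cell size.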

Common file formats at NSIDC

The table below lists common file formats encountered in NSIDC data products and some practical considerations when working with them.

| File format | Common uses | Quirks | NSIDC example products | Common tools (programmatic + GUI) |
|-------------|-------------|--------|------------------------|-----------------------------------|
| NetCDF4 / NetCDF classic | Climate records, reanalysis products, gridded time series, model output. | Time encoding confusion; multidimensional indexing; large file sizes; steep learning curve without high-level tools. | Sea Ice Concentration CDR; Antarctic Ice Velocity | Python: xarray, netCDF4 (low-level). R: terra, stars, ncdf4. GUI: Panoply, QGIS, ArcGIS Pro, HDFView |
| HDF5 | Mission-level products, intermediate processing outputs, complex hierarchical data sets. | Deep hierarchies; unclear variable naming; inconsistent conventions between products. | SMAP L3 Soil Moisture; ICESat-2 ATL03 | Python: rioxarray, h5py. R: rhdf5. GUI: Panoply, HDFView |
| HDF-EOS2 / HDF-EOS5 | Satellite swath and gridded mission products (e.g., MODIS). | Specialized structure; harder to read without EOS-aware tools; metadata spread across groups. | MODIS Snow Cover; AMSR-E/AMSR2 SWE | Python: xarray, h5netcdf, pyhdf. R: terra. GUI: Panoply, HDFView |
| GeoTIFF / TIFF / COG | Gridded maps, spatial subsets, visualization-ready products. | Limited metadata compared to NetCDF; alignment issues across data sets; large files at high resolution. | Ice Sheet Velocity; Land Freeze/Thaw Status | Python: rasterio, rioxarray, xarray. R: terra, raster. GUI: QGIS, ArcGIS Pro |
| Shapefile | Ice sheet boundaries, grounding lines, glacier outlines. | Multi-file dependency; field name limits; aging format. | Glacier Database; Glacier Terminus Positions | Python: geopandas, shapely. R: sf. GUI: QGIS, ArcGIS Pro |
| CSV / TSV / Other Delimited Text | Time series, summaries, point extractions. | Often lacks spatial context or a coordinate reference system; metadata easily lost; scaling issues. | Ice Thickness | Python: pandas. R: readr, data.table, tibble. GUI: Excel, LibreOffice |
| ASCII grid / text | Historical products, simple model outputs. | No embedded CRS; manual parsing required. | Sea Ice Trends | Python: numpy. R: base R, terra. GUI: Text editors, QGIS (limited) |
| PNG / GIF / JPEG | Quicklook imagery, browse products, outreach. | Not analysis-ready; no quantitative metadata. | Arctic Change | Python: numpy, OpenCV, Pillow. GUI: Web browser, image viewers, NASA Worldview |
| JSON | Metadata responses, data access services. | Not suitable for large arrays; not analysis-ready. | SMAP Ancillary Files | Python: json. R: jsonlite. GUI: Web browser |

Wrapping up

A single NSIDC data set may involve multiple structures (and formats!) through stages of its development and use. For example, a satellite mission may produce swath data in HDF-EOS, which are then resampled to a regular grid and stored in NetCDF for distribution, and users may extract tabular time series for analysis…

Swath observations
        ↓
Modeled / aggregated grid
        ↓
NetCDF storage
        ↓
Tabular extraction for analysis
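The last step of this chain, pulling a tabular time series out of a gridded cube, can be sketched as follows. The example assumes NumPy is installed and uses a small made-up (time, y, x) array in place of a real NetCDF variable; the column names and the chosen cell are illustrative.

```python
import csv
import io

import numpy as np

# Made-up sea ice concentration cube with dimensions (time, y, x).
dates = ["2022-01-01", "2022-01-02", "2022-01-03"]
cube = np.array([
    [[0.82, 0.74], [0.78, 0.70]],
    [[0.91, 0.85], [0.88, 0.79]],
    [[0.87, 0.80], [0.83, 0.75]],
])

# Pick one grid cell and take its values along the time axis.
row, col = 0, 1
series = cube[:, row, col]

# Write the series as delimited text, one record per time step.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["date", "row", "col", "sea_ice_concentration"])
for date, value in zip(dates, series):
    writer.writerow([date, row, col, f"{value:.2f}"])

csv_text = buffer.getvalue()
print(csv_text)
```

The resulting table is easy to share, but the grid definition, CRS, and units stay behind in the source file, so they need to be documented alongside the extract.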

Understanding data structure (and how that relates to the data format) can help you interpret values correctly, use appropriate tools, and avoid common issues or wrong assumptions.