Data Strategies for Future Us

Andy Barrett, National Snow and Ice Data Center

2024-04-19

What are Data Strategies?

Data strategies are practices that support collaboration and reproducible science:

  • Workflows
  • Data management best practices
  • Documentation

Good to start from the beginning of a project, great to start from where you are now.

Who is future us?

You.

Your team.

The scientific community.

A simple data workflow

http://r4ds.hadley.nz/

When to cloud?

  • What is the data volume?
  • How long will it take to download?
  • Can you store all that data (cost and space)?
  • Do you have the computing power for processing?
  • Does your team need a common computing environment?
  • Do you need to share data at each step or just an end product?
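The first three questions can be answered roughly with simple arithmetic. The following is a minimal sketch, not a sizing tool: the function names (`download_hours`, `fits_on_disk`), the 2 TB volume, the 100 Mbit/s connection and the factor-of-two headroom are all illustrative assumptions, and real throughput is often lower than the nominal link speed.

```python
def download_hours(volume_gb: float, bandwidth_mbit_s: float) -> float:
    """Estimate the hours needed to download volume_gb gigabytes.

    Assumes decimal units (1 GB = 8000 megabits) and a sustained rate.
    """
    if bandwidth_mbit_s <= 0:
        raise ValueError("bandwidth must be positive")
    megabits = volume_gb * 8000
    seconds = megabits / bandwidth_mbit_s
    return seconds / 3600


def fits_on_disk(volume_gb: float, free_gb: float, headroom: float = 2.0) -> bool:
    """Check that the data fit, leaving room for intermediate products.

    headroom=2.0 is an assumption: the raw data plus roughly the same
    volume again for intermediate and final files.
    """
    return volume_gb * headroom <= free_gb


if __name__ == "__main__":
    volume_gb = 2000.0  # hypothetical dataset size
    hours = download_hours(volume_gb, bandwidth_mbit_s=100.0)
    print(f"Estimated download time: {hours:.1f} hours")  # 44.4 hours
    print("Fits locally:", fits_on_disk(volume_gb, free_gb=1000.0))  # False
```

If the download takes days, or the data do not fit on local disk, that is a hint that a hybrid or all-in-cloud workflow may be worth considering.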

Workflow Solutions

Local

[Diagram: local workflow. Groups: In Cloud, Local Machine. Earthdata Cloud -> Tidy; DAAC -> Tidy; Tidy -> Transform -> Visualize -> Model -> Communicate; Model -> Transform.]

Workflow Solutions

Hybrid

[Diagram: hybrid workflow. Groups: In Cloud, Local Machine. Earthdata Cloud -> Tidy (cloud); DAAC -> Tidy (local); both Tidy steps -> Transform -> Visualize -> Model -> Communicate; Model -> Transform.]

Workflow Solutions

All in cloud

[Diagram: all-in-cloud workflow. Group: In Cloud. Earthdata Cloud -> Tidy; DAAC -> Tidy; Tidy -> Transform -> Visualize -> Model -> Communicate; Model -> Transform.]

Workflow Solutions

Use Cloud-based Services

  • Cloud services are infrastructure, platforms and software hosted in the cloud and made available to users via an API, often accessed via a web interface.

  • NASA’s Harmony (https://harmony.earthdata.nasa.gov/) can subset, reproject, reformat, and serve data.

  • This might save processing steps.

What does this look like in the cloud?

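import earthaccess
import xarray as xr

# Log in to NASA Earthdata using credentials stored in a .netrc file
auth = earthaccess.login(strategy='netrc')

# Search one collection by concept ID, date range, and bounding box
# (west, south, east, north)
query = earthaccess.granule_query().concept_id(
    'C2153572614-NSIDC_CPRD'
).temporal(
    "2020-03-01", "2020-03-30"
).bounding_box(
    -134.7, 58.9, -133.9, 59.2)

# Retrieve at most four matching granules
granules = query.get(4)

# Open the granules as file-like objects that xarray can read
files = earthaccess.open(granules)

# Load the land-ice segments group for beam gt1l from the second granule
ds = xr.open_dataset(files[1], group='/gt1l/land_ice_segments')
ds

# Start to do awesome science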

How to future-proof workflows and make them reproducible

FAIR

  • Findable,
  • Accessible,
  • Interoperable,
  • Reusable

Applies to the future you and your team as well.

Make sure data are Findable and Accessible

Does everyone on your team know where the data are?

Can they access it?

Helpful to document this somewhere.

Data Management

Keep raw data raw!
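One way to protect raw files from accidental edits is to make them read-only once they have been downloaded. This is a sketch under assumptions: `make_read_only` is a hypothetical helper, and it relies on file permission bits, which behave differently across operating systems and do not stop a user with sufficient privileges.

```python
import stat
from pathlib import Path


def make_read_only(raw_dir: Path) -> list[Path]:
    """Remove write permission from every file under raw_dir.

    Returns the files that were processed, in sorted order.
    """
    protected = []
    for path in sorted(raw_dir.rglob("*")):
        if path.is_file():
            mode = path.stat().st_mode
            # Clear the write bits for user, group and others.
            path.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
            protected.append(path)
    return protected
```

Calling `make_read_only(Path("Data/raw"))` after a download means later processing steps have to write their output somewhere else, which keeps the originals intact.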

Save intermediate data, not just final versions.

Use consistent and descriptive folder and file name patterns.

$ tree Data
Data
├── calibrated
├── cleaned
├── figures
├── final
├── monthly_averages
├── raw
└── results

7 directories, 0 files
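A layout like the directory tree above can be created from a script, so every team member starts from the same structure. The sketch below is one possible approach: `create_layout` and `build_filename` are hypothetical helpers, and the `variable_region_YYYYMMDD_stage.ext` naming pattern is an assumed convention, not a standard.

```python
from datetime import date
from pathlib import Path

# Processing stages, mirroring the directory tree shown above.
STAGES = ["raw", "cleaned", "calibrated", "monthly_averages",
          "final", "figures", "results"]


def create_layout(root: Path) -> None:
    """Create one sub-directory per processing stage under root."""
    for stage in STAGES:
        (root / stage).mkdir(parents=True, exist_ok=True)


def build_filename(variable: str, region: str, day: date,
                   stage: str, ext: str = "nc") -> str:
    """Build a descriptive file name that sorts chronologically.

    The pattern variable_region_YYYYMMDD_stage.ext is an assumption
    made for this example.
    """
    return f"{variable}_{region}_{day:%Y%m%d}_{stage}.{ext}"


if __name__ == "__main__":
    create_layout(Path("Data"))
    # Prints: height_region1_20200301_cleaned.nc
    print(build_filename("height", "region1", date(2020, 3, 1), "cleaned"))
```

Putting the date in year-month-day order means an alphabetical listing is also a chronological one.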

Standard file formats make data Interoperable

  • GeoTIFF for imagery or 2D raster data
  • NetCDF for multi-dimensional data (3D+)
  • Shapefiles or GeoJSON for vector data
  • CSV for tabular data

Avoid Excel and other proprietary formats.
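Tabular data in CSV can be written and read back with nothing more than the Python standard library. A minimal sketch: the column names and values are made up, and putting units in the column names is one common habit rather than a requirement.

```python
import csv
import io

# Made-up example rows; column names carry their units.
FIELDS = ["date", "latitude_deg", "longitude_deg", "height_m"]
ROWS = [
    {"date": "2020-03-01", "latitude_deg": 59.0,
     "longitude_deg": -134.2, "height_m": 1250.5},
    {"date": "2020-03-02", "latitude_deg": 59.1,
     "longitude_deg": -134.3, "height_m": 1248.0},
]


def write_csv(handle, rows) -> None:
    """Write rows (a list of dicts) as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def read_csv(handle) -> list[dict]:
    """Read CSV back into a list of dicts; every value is a string."""
    return list(csv.DictReader(handle))


if __name__ == "__main__":
    buffer = io.StringIO()  # stands in for a file opened with newline=""
    write_csv(buffer, ROWS)
    buffer.seek(0)
    for row in read_csv(buffer):
        print(row["date"], float(row["height_m"]))
```

Note that CSV stores no data types, so numbers come back as strings and must be converted on reading.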

Metadata makes data Interoperable and Reusable.

Metadata standards and conventions ensure that standard tools can read/interpret the data.

Standards also define the meaning of metadata attributes.

  • What is the Coordinate Reference System (projection, grid mapping)?
  • What are the units?
  • What is the variable name?
  • What is the source of the data?
  • What script produced the data?
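The questions above can be turned into a simple automated check. In this sketch the attribute names `units`, `long_name`, `grid_mapping`, `Conventions`, `source` and `history` are taken from the CF conventions, but treating all of them as required is an assumption made for this example, as are the function name `missing_metadata` and the sample values.

```python
# Treating these attributes as mandatory is an assumption for this sketch.
REQUIRED_VARIABLE_ATTRS = ("units", "long_name", "grid_mapping")
REQUIRED_GLOBAL_ATTRS = ("Conventions", "source", "history")


def missing_metadata(global_attrs: dict, variables: dict) -> list[str]:
    """List required attributes that are absent or empty."""
    problems = []
    for name in REQUIRED_GLOBAL_ATTRS:
        if not global_attrs.get(name):
            problems.append(f"global:{name}")
    for var_name, attrs in variables.items():
        for name in REQUIRED_VARIABLE_ATTRS:
            if not attrs.get(name):
                problems.append(f"{var_name}:{name}")
    return problems


# Hypothetical metadata for a file with two variables.
example_global = {
    "Conventions": "CF-1.8",
    "source": "hypothetical altimetry product, version 1",
    # history records which script produced the data
    "history": "2024-04-19: created by make_monthly_means.py",
}
example_variables = {
    "height": {"units": "m", "long_name": "surface height",
               "grid_mapping": "crs"},
    "temperature": {"long_name": "surface temperature"},
}

if __name__ == "__main__":
    # Prints: ['temperature:units', 'temperature:grid_mapping']
    print(missing_metadata(example_global, example_variables))
```

A check like this can run at the end of each processing step, before a file is shared.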

Document the Analysis

Document each step.

  • Where did you get the data, which files, which version?
  • Write it down. Anywhere is good but using a script is better.
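"Writing it down" in a script could look like the sketch below, which appends one record per download to a log file. The helper names (`sha256sum`, `record_provenance`), the log format (one JSON object per line) and the fields recorded are assumptions, not an established convention.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256sum(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_provenance(files, source: str, version: str, log_path) -> dict:
    """Append where the data came from, which files, and which version."""
    entry = {
        "recorded": datetime.now(timezone.utc).isoformat(),
        "source": source,
        "version": version,
        "files": [
            {"name": Path(f).name, "sha256": sha256sum(Path(f))}
            for f in files
        ],
    }
    # One JSON object per line, so the log can grow without rewriting it.
    with Path(log_path).open("a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry
```

The checksums let anyone on the team confirm later that they are working with exactly the same files.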

Can you (or anyone else) easily reproduce your processing pipeline?

With GUI tools (e.g. ArcGIS, QGIS, Excel), take screenshots and keep a journal of the commands you run.