Data Strategies for Future Us
What are Data Strategies?
Data strategies are data management best practices that enhance collaboration and reproducible science.
It is good to start from the beginning of a project, and great to start from wherever you are now.
Who is future us?
The scientific community.
A simple data workflow
When to cloud?
- What is the data volume?
- How long will it take to download?
- Can you store all that data (cost and space)?
- Do you have the computing power for processing?
- Does your team need a common computing environment?
- Do you need to share data at each step or just an end product?
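The volume and download-time questions above can be answered with simple arithmetic before committing to a workflow. A minimal sketch (the 2 TB volume and 100 Mbps bandwidth are hypothetical numbers):

```python
def download_time_hours(volume_gb: float, bandwidth_mbps: float) -> float:
    """Estimate download time in hours for a data volume at a given bandwidth."""
    volume_megabits = volume_gb * 1000 * 8      # GB -> megabits (decimal units)
    seconds = volume_megabits / bandwidth_mbps  # transfer time at line rate
    return seconds / 3600

# Hypothetical example: 2 TB of granules over a 100 Mbps connection
hours = download_time_hours(2000, 100)
print(f"{hours:.1f} hours")  # → 44.4 hours
```

If the estimate runs to days rather than hours, that is a strong signal to move the compute to the data instead of the data to the compute.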
All in cloud
Use Cloud-based Services
Cloud services are infrastructure, platforms and software hosted in the cloud and made available to users via an API, often accessed via a web interface.
NASA’s Harmony (https://harmony.earthdata.nasa.gov/) can subset, reproject, reformat, and serve data, which may save you processing steps.
What does this look like in the cloud?
import earthaccess
import xarray as xr

# Authenticate with Earthdata Login credentials stored in ~/.netrc
auth = earthaccess.login(strategy='netrc')

# Search by collection concept ID (placeholder ID shown; use your dataset's)
query = earthaccess.granule_query().concept_id('C1234567890-NSIDC_ECS')
granules = query.get(4)  # first 4 matching granules

# Open granules as file-like objects streamed from the cloud
files = earthaccess.open(granules)

# Read the land-ice segments group from the first file into an xarray Dataset
ds = xr.open_dataset(files[0], group='/gt1l/land_ice_segments')
# Start to do awesome science
How to future-proof workflows and make them reproducible
Applies to the future you and your team as well.
Make sure data are Findable and Accessible
Does everyone on your team know where the data is?
Can they access it?
Helpful to document this somewhere.
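One lightweight way to document where the data live is a small machine-readable inventory kept with the project. A sketch, where the dataset names, paths, and access notes are all hypothetical:

```python
import json

# Hypothetical inventory mapping each dataset to where it lives and how to access it
data_inventory = {
    "ATL06_v006": {
        "location": "s3://example-bucket/ATLAS/ATL06/006/",
        "access": "NASA Earthdata Login via earthaccess",
    },
    "basin_outlines": {
        "location": "Data/vector/basins.geojson",
        "access": "in this repository",
    },
}

# Write it next to the data so teammates (and future you) can find it
with open("DATA_INVENTORY.json", "w") as f:
    json.dump(data_inventory, f, indent=2)
```

A plain README works too; the point is that the answer to "where is the data?" is written down, not remembered.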
Keep raw data, raw!
Save intermediate data not just final versions.
Use consistent and descriptive folder and file name patterns.
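Consistent names are easiest to maintain when generated by code rather than typed by hand. A minimal sketch of one possible pattern (the dataset, region, and date are made up):

```python
from datetime import date

def make_filename(dataset: str, region: str, day: date, version: int, ext: str) -> str:
    """Build a descriptive, sortable file name: dataset_region_YYYYMMDD_vNN.ext"""
    return f"{dataset}_{region}_{day:%Y%m%d}_v{version:02d}.{ext}"

name = make_filename("atl06", "jakobshavn", date(2020, 3, 15), 2, "nc")
print(name)  # → atl06_jakobshavn_20200315_v02.nc
```

Zero-padded dates and version numbers mean files sort correctly in any listing.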
Example: running `tree Data` in the project root shows the project layout (7 directories, 0 files).
Standard file formats make data Interoperable
- GeoTIFF for imagery or 2D raster data
- NetCDF for multi-dimensional data (3D+)
- Shapefiles or GeoJSON for vector data
- CSV for tabular data
Avoid Excel and other proprietary formats.
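As an illustration of how open these formats are, a vector feature can be written as GeoJSON with nothing but the standard library (the coordinates and properties here are made up):

```python
import json

# A single hypothetical point feature, following the GeoJSON (RFC 7946) structure
feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [-49.5, 69.2]},
            "properties": {"site": "glacier_front", "elevation_m": 312.0},
        }
    ],
}

with open("sites.geojson", "w") as f:
    json.dump(feature_collection, f, indent=2)
```

Any GIS tool, web map, or teammate's script can read this file with no proprietary software.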
Document the Analysis
Document each step.
- Where did you get the data, which files, which version?
- Write it down. Anywhere is good but using a script is better.
Can you (or anyone else) easily reproduce your processing pipeline?
With GUI tools - e.g. ArcGIS, QGIS, Excel - use screenshots or journaled commands to record each step.
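Provenance notes can also be captured by the processing script itself. A minimal sketch that records data source, version, inputs, and run time alongside each output (all file and dataset names are hypothetical):

```python
import json
from datetime import datetime, timezone

def write_provenance(output_file: str, source: str, version: str, input_files: list):
    """Record where the data came from and when this processing step ran."""
    record = {
        "output": output_file,
        "source": source,
        "version": version,
        "inputs": input_files,
        "processed_at": datetime.now(timezone.utc).isoformat(),
    }
    # Sidecar file lives next to the output it describes
    with open(output_file + ".provenance.json", "w") as f:
        json.dump(record, f, indent=2)

write_provenance(
    "ice_velocity.nc",
    source="NSIDC ATL06",
    version="006",
    input_files=["ATL06_20200315120000_12220601_006_01.h5"],
)
```

Calling this at the end of each pipeline step means the answer to "which files, which version?" is always written down automatically.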