Data Strategies for a Future Us

Author

Andy P. Barrett

Published

2024-04-19

Introduction

We are all data wranglers or data scientists. It is our job to collect data, process data, analyse data, and generate results in the form of plots, tables and, eventually, new knowledge. In the past, many of us have done this work in isolation: we store data on our laptops and share files by email, Dropbox, Google Drive, etc. as needed. This approach has mostly worked. However, more collaborative, open and reproducible ways of working, together with growing data volumes and new ways of accessing data, require us to find new, or at least modified, data strategies.

What are Data Strategies?

Collaborating with a research team or the wider scientific community requires us to solve problems of data access and to share knowledge about how data was analyzed. How do we make data available to others? How do we document the processing steps? Can you remember how a dataset you processed last week was cleaned? If your computer crashes or gets stolen, could you easily reproduce your processing workflow? When it comes time to submit a paper, finish a project, hand off to someone else on the team, or pick up a project after vacation, how easy is it to do so? Data strategies, as used in this context, is a catch-all term that covers where to process data (locally or on a remote system), how to document and manage data processing for a future you or others, how to make data accessible to others, and how to make processing and analysis workflows reproducible.

Who is a future us?

A future us is, first and foremost, you: tomorrow or next week when you pick up a project again, next month when you discover a problem with the data, or six months to a year from now when “reviewer 2” requests major changes. It is other members of your team who might want to work with the data and want to know where it came from and how it was processed. And it is the wider scientific community, who may want to reproduce your work.

A simple data workflow

flowchart TD
    A["Collect Data<br/>Field, Remote Sensing, Model"] --> B["Clean Data<br/>Calibration, QC, Reprojection"]
    B --> C["Aggregate<br/>Time or spatial means"]
    C --> D[Analysis]
    D --> E[Make Plots and Tables]
    E --> F[Write Manuscript]
    F -- Review --> B
    E -- Iterate --> C

Collaborative and reproducible workflows

Where to process data?

  • Download data to your local machine and process locally
  • Do some processing in the cloud to reduce data size
  • Do all of your processing in the cloud

Where you process data depends on data volume and resources:
- How long will it take to download (network speed, Mbps)?
- Can you store all that data (cost and space)?
- Do you have the computing power for processing?
- Does your team need a common computing environment?
- Do you need to share data at each step, or just an end product?

What does this look like in the cloud?

Placeholder for examples of hybrid and cloud workflows
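
As one minimal sketch of an all-in-the-cloud workflow, the example below searches for data with earthaccess (mentioned again below) and opens the files directly from cloud storage with xarray, without downloading them first. The dataset short name and date range are placeholders, not recommendations.

import earthaccess
import xarray as xr

earthaccess.login()  # uses your NASA Earthdata Login credentials

# Search for granules; short_name and dates are placeholders
results = earthaccess.search_data(
    short_name="DATASET_SHORT_NAME",
    temporal=("2021-01-01", "2021-01-31"),
)

# Open the files directly in the cloud instead of downloading them
files = earthaccess.open(results)
ds = xr.open_mfdataset(files)
print(ds)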

How to future-proof workflows and make them reproducible

FAIR - Findable, Accessible, Interoperable, Reusable - applies to the future you and your team as well.

Document the Analysis

Document each step. Where did you get the data? Which files? Which version? Write it down. Anywhere is good, but capturing the steps in a script is better, for example with earthaccess, wget or curl (see the sketch below).
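
A minimal sketch using earthaccess, assuming the data live in NASA Earthdata; the dataset short name and date range are placeholders, and the download directory follows the folder layout shown later.

import earthaccess

earthaccess.login()  # NASA Earthdata Login credentials

# Record exactly which dataset and time range were retrieved
results = earthaccess.search_data(
    short_name="DATASET_SHORT_NAME",   # placeholder dataset
    temporal=("2020-01-01", "2020-12-31"),
)

# Download the granules into the raw data directory
earthaccess.download(results, "Data/raw")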

Can you (or anyone else) easily reproduce your processing pipeline? Having it scripted, so that just a few commands set the whole process going, is really helpful.
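
For example, a small driver script can run the whole workflow in order. The step functions below are hypothetical stand-ins for your own processing code.

# run_pipeline.py -- hypothetical driver that runs the workflow end to end
# Each function is a stand-in for one of your real processing steps.

def clean_raw_files():
    ...  # calibration, QC, reprojection: raw -> cleaned

def make_monthly_means():
    ...  # aggregation: cleaned -> monthly_averages

def run_analysis():
    ...  # analysis: monthly_averages -> results

def make_figures():
    ...  # plots and tables: results -> figures

def main():
    clean_raw_files()
    make_monthly_means()
    run_analysis()
    make_figures()

if __name__ == "__main__":
    main()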

How can this be done with GUI tools such as ArcGIS, QGIS or Excel? Take screenshots and keep a journal of the commands and menu steps you used.

Make sure data are Findable and Accessible

Does everyone know where your data is, and can they access it? E.g. a shared machine, Dropbox, Google Drive, or cloud storage.

Data Management

Raw data is immutable1!

Save intermediate data, not just the final product; this saves processing time if you have to redo one or a few steps.

Use consistent folder and file name patterns.

(base) nsidc-442-abarrett:data_strategies_for_a_future_us$ tree Data
Data
├── calibrated
├── cleaned
├── figures
├── final
├── monthly_averages
├── raw
└── results

7 directories, 0 files
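
A short snippet can create this layout so every copy of the project starts with the same structure. This sketch assumes the Data directory sits at the project root.

from pathlib import Path

DATAPATH = Path("Data")
for name in ("raw", "cleaned", "calibrated", "monthly_averages",
             "results", "figures", "final"):
    (DATAPATH / name).mkdir(parents=True, exist_ok=True)  # safe to re-run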

Set up config files both to document file paths and naming patterns and so that your scripts can use them without hardcoding paths into code.

from pathlib import Path

# "path_to_project" is a placeholder for your project directory
DATAPATH = Path.home() / "path_to_project" / "Data"
RAW_DATA_PATH = DATAPATH / "raw"
CLEANED_DATA_PATH = DATAPATH / "cleaned"
FINAL_DATA_PATH = DATAPATH / "final"
MONTHLY_PATH = DATAPATH / "monthly_averages"
RESULTS_PATH = DATAPATH / "results"   # Path for final results tables and files
FIGURES_PATH = DATAPATH / "figures"   # Path for plots etc.

In any script you can then do something like the following (assuming the config above is saved as filepaths.py)…

from filepaths import FINAL_DATA_PATH

files = FINAL_DATA_PATH.glob('*')   # iterate over files in the final data directory

Use standard file formats and include metadata

It can take time to set up but it will save time in the long run.
Use metadata standards.

Common File Formats

  • GeoTIFF for imagery or 2D raster data
  • NetCDF for multi-dimensional data (3D+)
  • Shapefiles or GeoJSON for vector data
  • CSV for tabular data (make it tidy2; see the sketch after this list)
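
As a minimal sketch of what “tidy” means in practice, using pandas and made-up station values, a wide table with one column per station becomes a long table with one row per observation:

import pandas as pd

# Wide table: one column per station (made-up values)
wide = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02"],
    "station_a": [1.2, 1.5],
    "station_b": [0.9, 1.1],
})

# Tidy (long) table: one row per observation
tidy = wide.melt(id_vars="date", var_name="station", value_name="snow_depth_m")
tidy.to_csv("Data/cleaned/snow_depth_tidy.csv", index=False)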

If you are working in the cloud, cloud-optimized formats (e.g. Cloud Optimized GeoTIFF, Zarr) can help. Avoid Excel and other proprietary formats.

Metadata, metadata, metadata

Metadata makes your data interoperable and reusable.

Metadata standards and conventions ensure that standard tools can read and interpret the data, and that there is a common reference or ontology3. They are not always perfect, but they are worth following. At a minimum, metadata should answer questions like these (a sketch of recording them follows the list):

  • What is the CRS4 (projection, grid mapping) of the data?
  • What are the units?
  • What is the variable name?
  • What is the source of the data?
  • What script produced the data?
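
One way to record answers to these questions is as attributes in a NetCDF file. The sketch below uses xarray and made-up values, and writes the CRS as a simple global attribute rather than a full CF grid-mapping variable; the source, script name and output path are placeholders.

import numpy as np
import xarray as xr

# Made-up monthly mean temperature grid, for illustration only
temperature = xr.DataArray(
    np.zeros((12, 90, 180)),
    dims=("time", "lat", "lon"),
    name="air_temperature",
    attrs={"units": "kelvin", "long_name": "Monthly mean air temperature"},
)

ds = temperature.to_dataset()
ds.attrs["source"] = "placeholder: name the original dataset here"
ds.attrs["history"] = "created by make_monthly_means.py"   # hypothetical script name
ds.attrs["crs"] = "EPSG:4326"                              # coordinate reference system

ds.to_netcdf("Data/monthly_averages/air_temperature_monthly.nc")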

Footnotes

  1. The file cannot be changed after it is created.

  2. A framework for organising tabular data. See Wickham (2014), Tidy Data.

  3. A formal naming convention.

  4. Coordinate Reference System.