On-demand Zarr Stores for NASA datasets with earthaccess and Kerchunk

The idea behind this PR from James Bourbeau on earthaccess is that we can combine earthaccess, the power of Dask, and kerchunk to create consolidated reference files (Zarr-compatible) from NASA datasets. This method works best with gridded data, since granules that share the same grid can be concatenated along the time dimension.

Notes:

* The resulting consolidated store has coordinate encoding issues for some datasets; as this study from the HDF Group notes, kerchunk is still at an early stage and doesn't support all of the features of HDF5.
* Lucas Sterzinger notes that further optimizations are possible for big datasets.
* Having a distributed cluster means that we could scale this approach and create on-demand Zarr views of NASA data.

A more detailed description of what kerchunk buys us can be found in this notebook from Lucas.
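
For context, here is a rough sketch (not necessarily the PR's actual implementation) of what this kerchunk-based consolidation does under the hood: each granule is scanned once to record the byte ranges of its chunks, and the per-file references are then merged along the time dimension into a single reference file that the Zarr engine can read. The consolidate helper and the urls/opts names below are illustrative, not part of the earthaccess API.

import json

import fsspec
from kerchunk.combine import MultiZarrToZarr
from kerchunk.hdf import SingleHdf5ToZarr


def consolidate(urls, opts, concat_dim="Time", outfile="combined.json"):
    """Sketch: build a consolidated kerchunk reference file for a list of granule URLs."""
    single_refs = []
    for url in urls:
        # Scan each NetCDF4/HDF5 granule and record the byte range of every chunk
        with fsspec.open(url, "rb", **opts) as f:
            single_refs.append(SingleHdf5ToZarr(f, url).translate())

    # Merge the per-file references along the time dimension
    combined = MultiZarrToZarr(
        single_refs,
        remote_protocol="https",
        remote_options=opts,
        concat_dims=[concat_dim],
    ).translate()

    with open(outfile, "w") as f:
        json.dump(combined, f)
    return outfile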

%%capture
!pip uninstall -y earthaccess
!pip install git+https://github.com/jrbourbeau/earthaccess.git@kerchunk

Example with SSH (sea surface height), a gridded global NetCDF dataset

import earthaccess
import xarray as xr
from dask.distributed import LocalCluster

if __name__ == "__main__":

    # Authenticate my machine with `earthaccess`
    earthaccess.login()

    # Retrieve data files for the dataset I'm interested in
    short_name = "SEA_SURFACE_HEIGHT_ALT_GRIDS_L4_2SATS_5DAY_6THDEG_V_JPL2205"
    granules = earthaccess.search_data(
        short_name=short_name,
        cloud_hosted=True,
        temporal=("1990", "2019"),
        count=10,  # For demo purposes
    )

    # Create a local Dask cluster for parallel metadata consolidation
    # (but works with any Dask cluster)
    cluster = LocalCluster()
    client = cluster.get_client()

    # Save consolidated metadata file
    outfile = earthaccess.consolidate_metadata(
        granules,
        outfile=f"./{short_name}-metadata.json",    # Writing to a local file for demo purposes
        # outfile=f"s3://my-bucket/{short_name}-metadata.json",   # We could also write to a remote file
        access="indirect",
        kerchunk_options={"concat_dims": "Time"}
    )
    print(f"Consolidated metadata written to {outfile}")

    # Load the dataset using the consolidated metadata file
    fs = earthaccess.get_fsspec_https_session()
    ds = xr.open_dataset(
        "reference://",
        engine="zarr",
        chunks={},
        backend_kwargs={
            "consolidated": False,
            "storage_options": {
                "fo": outfile,
                "remote_protocol": "https",
                "remote_options": fs.storage_options,
            }
        },
    )

    result = ds.SLA.mean(["Latitude", "Longitude"]).compute()
    print(f"{result = }")

Using Chelle’s Tutorial for MUR SST on AWS as a reference, let's build a Zarr store from 10 years of monthly data from MUR.

if __name__ == "__main__":

    # Authenticate my machine with `earthaccess`
    earthaccess.login()
 
    doi = "10.5067/GHGMR-4FJ04"
    short_name = "MUR"
    month = 7
    
    results = []
    
    for year in range(2012, 2022):
    
        params = {
            "doi": doi,
            "cloud_hosted": True,
            "temporal": (f"{str(year)}-{str(month)}-01", f"{str(year)}-{str(month)}-31"),
            "count": 31
        }

        # Retrieve data files for the dataset I'm interested in
        print(f"Searching for granules on {year}")
        granules = earthaccess.search_data(**params)
        results.extend(granules)
    print(f"Total granules to process: {len(results)}")

    # Create a local Dask cluster for parallel metadata consolidation
    # (but works with any Dask cluster)
    cluster = LocalCluster()
    client = cluster.get_client()

    # Save consolidated metadata file
    outfile = earthaccess.consolidate_metadata(
        results,
        outfile=f"./direct-{short_name}-metadata.json",    # Writing to a local file for demo purposes
        # outfile=f"s3://my-bucket/{short_name}-metadata.json",   # We could also write to a remote file
        access="direct",
        # kerchunk_options={"coo_map": []}
        kerchunk_options={"concat_dims": "time"}
    )
    print(f"Consolidated metadata written to {outfile}")
EARTHDATA_USERNAME and EARTHDATA_PASSWORD are not set in the current environment, try setting them or use a different strategy (netrc, interactive)
You're now authenticated with NASA Earthdata Login
Using token with expiration date: 10/23/2023
Using .netrc file for EDL
Searching for granules on 2012
Granules found: 31
Searching for granules on 2013
Granules found: 31
Searching for granules on 2014
Granules found: 31
Searching for granules on 2015
Granules found: 31
Searching for granules on 2016
Granules found: 31
Searching for granules on 2017
Granules found: 31
Searching for granules on 2018
Granules found: 31
Searching for granules on 2019
Granules found: 31
Searching for granules on 2020
Granules found: 31
Searching for granules on 2021
Granules found: 31
Total granules to process: 310
Consolidated metadata written to ./direct-MUR-metadata.json
earthaccess.login()

fs = earthaccess.get_s3fs_session("GES_DISC")

ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    chunks={},
    decode_coords=False,  # tricky: the coords are there, but encoded in a way xarray can't decode; similar to https://github.com/fsspec/kerchunk/issues/177
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "direct-MUR-metadata.json",
            "remote_protocol": "s3",
            "remote_options": fs.storage_options,
        }
    },
)
ds
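
Since the coordinates come back undecoded (or garbled, see the issue linked above), one possible workaround (a sketch, not verified against this exact store) is to borrow the coordinate variables from a single original granule and attach them to the kerchunked dataset. This assumes the MUR coordinate variables are named lat and lon and that results from the search cell is still in scope.

# Borrow lat/lon from one original granule (the grid is identical across granules)
# and attach them to the kerchunked dataset
files = earthaccess.open(results[:1])   # open the first MUR granule directly
first = xr.open_dataset(files[0])

ds = ds.assign_coords(lat=first["lat"], lon=first["lon"])
ds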