%%capture
!pip uninstall -y earthaccess
!pip install git+https://github.com/jrbourbeau/earthaccess.git@kerchunk
On-demand Zarr Stores for NASA datasets with earthaccess and Kerchunk
The idea behind this PR from James Bourbeau on earthaccess is that we can combine earthaccess, the power of Dask, and kerchunk to create consolidated reference files (Zarr-compatible) from NASA datasets. This method works best with gridded data, since granules that share the same grid can be combined along the time dimension.
Notes:

* The resulting consolidated store appears to have coordinate encoding issues for some datasets; as this study from the HDF Group notes, Kerchunk is still at an early stage and doesn't support all the features of HDF5.
* Lucas Sterzinger notes that further optimizations are possible for big datasets.
* Having a distributed cluster means that we could scale this approach and create on-demand Zarr views of NASA data. A more detailed description of what Kerchunk buys us can be found in this notebook from Lucas.
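For context, here is a minimal sketch of roughly what such a consolidation step does using kerchunk directly (the granule URLs are placeholders, and the actual `earthaccess.consolidate_metadata` implementation in the PR may differ in its details):

import json
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

# Placeholder granule URLs; in practice these would come from an earthaccess search
urls = [
    "https://example.com/granule_2012.nc",
    "https://example.com/granule_2013.nc",
]

# Scan each NetCDF/HDF5 file and record the byte ranges of its chunks
single_refs = []
for url in urls:
    with fsspec.open(url, "rb") as f:
        single_refs.append(SingleHdf5ToZarr(f, url).translate())

# Combine the per-file references into a single Zarr-compatible reference set
mzz = MultiZarrToZarr(single_refs, concat_dims=["Time"], remote_protocol="https")
combined = mzz.translate()

with open("combined-metadata.json", "w") as out:
    json.dump(combined, out)

Each granule is only read for its metadata and chunk offsets; the actual data bytes stay in the original NetCDF files.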
Example with SSH, gridded global NetCDF
import earthaccess
import xarray as xr
from dask.distributed import LocalCluster

if __name__ == "__main__":
    # Authenticate my machine with `earthaccess`
    earthaccess.login()

    # Retrieve data files for the dataset I'm interested in
    short_name = "SEA_SURFACE_HEIGHT_ALT_GRIDS_L4_2SATS_5DAY_6THDEG_V_JPL2205"
    granules = earthaccess.search_data(
        short_name=short_name,
        cloud_hosted=True,
        temporal=("1990", "2019"),
        count=10,  # For demo purposes
    )

    # Create a local Dask cluster for parallel metadata consolidation
    # (but works with any Dask cluster)
    cluster = LocalCluster()
    client = cluster.get_client()

    # Save consolidated metadata file
    outfile = earthaccess.consolidate_metadata(
        granules,
        outfile=f"./{short_name}-metadata.json",  # Writing to a local file for demo purposes
        # outfile=f"s3://my-bucket/{short_name}-metadata.json",  # We could also write to a remote file
        access="indirect",
        kerchunk_options={"concat_dims": "Time"},
    )
    print(f"Consolidated metadata written to {outfile}")

    # Load the dataset using the consolidated metadata file
    fs = earthaccess.get_fsspec_https_session()
    ds = xr.open_dataset(
        "reference://",
        engine="zarr",
        chunks={},
        backend_kwargs={
            "consolidated": False,
            "storage_options": {
                "fo": outfile,
                "remote_protocol": "https",
                "remote_options": fs.storage_options,
            },
        },
    )
    result = ds.SLA.mean({"Latitude", "Longitude"}).compute()
    print(f"{result = }")
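The "reference://" URL above is just a convenience; the same consolidated file can be opened through fsspec's reference filesystem explicitly. A small sketch, reusing the `outfile` and `fs` objects from the block above:

import fsspec
import xarray as xr

# Build a reference filesystem backed by the consolidated metadata file
ref_fs = fsspec.filesystem(
    "reference",
    fo=outfile,
    remote_protocol="https",
    remote_options=fs.storage_options,
)
# The reference filesystem exposes a Zarr-store view of the original NetCDF files
ds = xr.open_zarr(ref_fs.get_mapper(""), consolidated=False)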
Next, using Chelle's tutorial for MUR SST on AWS as a reference, we build a Zarr store from 10 years of MUR data (one month of daily granules per year).
if __name__ == "__main__":
    # Authenticate my machine with `earthaccess`
    earthaccess.login()

    doi = "10.5067/GHGMR-4FJ04"
    short_name = "MUR"
    month = 7

    results = []

    for year in range(2012, 2022):
        params = {
            "doi": doi,
            "cloud_hosted": True,
            "temporal": (f"{year}-{month}-01", f"{year}-{month}-31"),
            "count": 31,
        }
        # Retrieve data files for the dataset I'm interested in
        print(f"Searching for granules on {year}")
        granules = earthaccess.search_data(**params)
        results.extend(granules)

    print(f"Total granules to process: {len(results)}")

    # Create a local Dask cluster for parallel metadata consolidation
    # (but works with any Dask cluster)
    cluster = LocalCluster()
    client = cluster.get_client()

    # Save consolidated metadata file
    outfile = earthaccess.consolidate_metadata(
        results,
        outfile=f"./direct-{short_name}-metadata.json",  # Writing to a local file for demo purposes
        # outfile=f"s3://my-bucket/{short_name}-metadata.json",  # We could also write to a remote file
        access="direct",
        # kerchunk_options={"coo_map": []}
        kerchunk_options={"concat_dims": "time"},
    )
    print(f"Consolidated metadata written to {outfile}")
EARTHDATA_USERNAME and EARTHDATA_PASSWORD are not set in the current environment, try setting them or use a different strategy (netrc, interactive)
You're now authenticated with NASA Earthdata Login
Using token with expiration date: 10/23/2023
Using .netrc file for EDL
Searching for granules on 2012
Granules found: 31
Searching for granules on 2013
Granules found: 31
Searching for granules on 2014
Granules found: 31
Searching for granules on 2015
Granules found: 31
Searching for granules on 2016
Granules found: 31
Searching for granules on 2017
Granules found: 31
Searching for granules on 2018
Granules found: 31
Searching for granules on 2019
Granules found: 31
Searching for granules on 2020
Granules found: 31
Searching for granules on 2021
Granules found: 31
Total granules to process: 310
Consolidated metadata written to ./direct-MUR-metadata.json
earthaccess.login()

fs = earthaccess.get_s3fs_session("PODAAC")  # MUR SST is distributed by PO.DAAC
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    chunks={},
    decode_coords=False,  # tricky, the coords are there but encoded in a way xarray can't decode for some reason. Similar to https://github.com/fsspec/kerchunk/issues/177
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "direct-MUR-metadata.json",
            "remote_protocol": "s3",
            "remote_options": fs.storage_options,
        },
    },
)
ds
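Since the coordinates could not be decoded at open time, one untested option is to attempt CF decoding after the fact and check whether the kerchunk-encoded coordinate arrays survive it:

# Experimental: whether this works depends on how kerchunk encoded the coordinate
# arrays (see the kerchunk issue linked above); treat it as a probe, not a fix.
try:
    ds_decoded = xr.decode_cf(ds)
    print(ds_decoded.coords)
except Exception as err:
    print(f"CF decoding still fails: {err}")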