Accessing and working with ICESat-2 data in the cloud

1. Tutorial Overview

Note: This is an updated version of the notebook that was presented to the NSIDC DAAC User Working Group in May 2022

This notebook demonstrates searching for cloud-hosted ICESat-2 data and directly accessing Land Ice Height (ATL06) granules from an Amazon Elastic Compute Cloud (EC2) instance using the earthaccess package. NASA data “in the cloud” are stored in Amazon Web Services (AWS) Simple Storage Service (S3) buckets. Direct access is an efficient way to work with data stored in an S3 bucket when you are working in the cloud: cloud-hosted granules can be opened and loaded into memory without the need to download them first, which allows you to take advantage of the scalability and power of cloud computing.

The AWS Global Cloud is divided into geographical regions. To have direct access to data stored in a region, our compute instance - a virtual computer that we create to perform processing operations in place of using our own desktop or laptop - must be in the same region as the data. This is a fundamental concept of analysis in place. NASA cloud-hosted data are in AWS region us-west-2, so your compute instance must also be in us-west-2. If we wanted direct access to data stored in another region, we would start a compute instance in that region.

As an example data collection, we use ICESat-2 Land Ice Height (ATL06) over the Juneau Icefield, AK, for March and April 2020. ICESat-2 data granules, including ATL06, are stored in HDF5 format. We demonstrate how to open an HDF5 granule and access data variables using xarray. Land ice heights are then plotted using hvplot.

earthaccess is a package developed by Luis Lopez (NSIDC developer) to allow easy search of the NASA Common Metadata Repository (CMR) and download of NASA data collections. It can be used for programmatic search and access of both DAAC-hosted and cloud-hosted data. It manages authentication with Earthdata Login credentials, which are then used to obtain the S3 tokens needed for S3 direct access. https://github.com/nsidc/earthaccess

Credits

The notebook was created by Andy Barrett, NSIDC, updated by Jennifer Roebuck, NSIDC, and is based on notebooks developed by Luis Lopez and Mikala Beig, NSIDC.

For questions regarding the notebook, or to report problems, please create a new issue in the NSIDC-Data-Tutorials repo.

Learning Objectives

By the end of this demonstration you will be able to:
1. use earthaccess to search for ICESat-2 data using spatial and temporal filters and explore search results;
2. open data granules using direct access to the ICESat-2 S3 bucket;
3. load an HDF5 group into an xarray.Dataset;
4. visualize the land ice heights using hvplot.

Prerequisites

  1. An EC2 instance in the us-west-2 region. NASA cloud-hosted data are in AWS region us-west-2, so your EC2 instance must be in that region too. An EC2 instance is a virtual computer that you create to perform processing operations in place of using your own desktop or laptop. Details on how to set up an instance can be found here.
  2. An Earthdata Login is required for data access. If you don’t have one, you can register for one here.
  3. A .netrc file containing your Earthdata Login credentials in your home directory. The current recommended practice for authentication is to create a .netrc file in your home directory following these instructions (Step 1) and to use it for authentication when required for data access during the tutorial.
  4. The nsidc-tutorials environment is set up and activated. This README has setup instructions.

Example of end product

At the end of this tutorial, the following figure will be generated:

ATL06 land ice heights

Time requirement

Allow approximately 20 minutes to complete this tutorial.

2. Tutorial steps

Import Packages

The first step in any Python script or notebook is to import packages. This tutorial requires the following packages:
- earthaccess, which enables Earthdata Login authentication, retrieves AWS credentials, and provides collection and granule search and S3 access;
- xarray, used to load the data;
- hvplot, used to visualize the land ice height data.

We are going to import the whole earthaccess package.

We will also import the whole xarray package but use a standard short name xr, using the import <package> as <short_name> syntax. We could use anything for a short name but xr is an accepted standard that most xarray users are familiar with.

We only need the xarray module from hvplot so we import that using the import <package>.<module> syntax.

# For searching NASA data
import earthaccess

# For reading data, analysis and plotting
import xarray as xr
import hvplot.xarray
import pprint

Authenticate

The first step is to get the correct authentication that will allow us to get cloud-hosted ICESat-2 data. This is all done through Earthdata Login. The login method also gets the correct AWS credentials.

Login requires your Earthdata Login username and password. The login method will automatically search for these credentials as environment variables or in a .netrc file, and if those aren’t available it will prompt us to enter our username and password. We use a .netrc strategy. A .netrc file is a text file located in our home directory that contains login information for remote machines. If we don’t have a .netrc file, login can create one for us.
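For reference, the Earthdata Login entry in a .netrc file takes the following form (replace the placeholder username and password with your own credentials):

```
machine urs.earthdata.nasa.gov
    login your_username
    password your_password
```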

# If you don't yet have a .netrc file, uncomment the next line once to
# create one interactively and persist your credentials:
# earthaccess.login(strategy='interactive', persist=True)
auth = earthaccess.login()

Search for ICESat-2 Collections

earthaccess leverages the Common Metadata Repository (CMR) API to search for collections and granules. Earthdata Search also uses the CMR API.

We can use the search_datasets method to search for ICESat-2 collections by setting keyword='ICESat-2'.

This will display the number of data collections (data sets) that meet this search criteria.

Query = earthaccess.search_datasets(keyword = 'ICESat-2')

In this case there are 65 collections that have the keyword ICESat-2.

The search_datasets method returns a Python list of DataCollection objects. We can view the metadata for each collection in long form by passing a DataCollection object to print, or as a summary using the summary method. We can also use the pprint function to pretty-print each object.

We will do this for the first 10 results (objects).

for collection in Query[:10]:
    pprint.pprint(collection.summary(), sort_dicts=True, indent=4)
    print('')
    

For each collection, summary returns a subset of fields from the collection metadata and the Unified Metadata Model (UMM):
- concept-id is a unique identifier for the collection. It consists of an alphanumeric code and the provider-id specific to the DAAC (Distributed Active Archive Center). You can use the concept-id to search for data granules.
- short_name is a quick way of referring to a collection (instead of using the full title). It can be found on the collection landing page underneath the collection title after ‘DATA SET ID’. See the table below for a list of the short names for ICESat-2 collections.
- version is the version of each collection.
- file-type gives information about the file format of the collection granules.
- get-data is a collection of URLs that can be used to access the data, collection landing pages, and data tools.
- cloud-info is present for cloud-hosted data and provides additional information about the S3 bucket that holds the data and where to get temporary AWS S3 credentials to access it. earthaccess handles these credentials and the links to the S3 buckets, so in general you won’t need to worry about this information.

For the ICESat-2 search results, the concept-id contains one of two provider-ids: NSIDC_ECS, which is for the on-premises collections, and NSIDC_CPRD, which is for the cloud-hosted collections.

For ICESat-2, ShortNames are generally how different products are referred to.

| ShortName | Product Description |
| --- | --- |
| ATL03 | ATLAS/ICESat-2 L2A Global Geolocated Photon Data |
| ATL06 | ATLAS/ICESat-2 L3A Land Ice Height |
| ATL07 | ATLAS/ICESat-2 L3A Sea Ice Height |
| ATL08 | ATLAS/ICESat-2 L3A Land and Vegetation Height |
| ATL09 | ATLAS/ICESat-2 L3A Calibrated Backscatter Profiles and Atmospheric Layer Characteristics |
| ATL10 | ATLAS/ICESat-2 L3A Sea Ice Freeboard |
| ATL11 | ATLAS/ICESat-2 L3B Slope-Corrected Land Ice Height Time Series |
| ATL12 | ATLAS/ICESat-2 L3A Ocean Surface Height |
| ATL13 | ATLAS/ICESat-2 L3A Along Track Inland Surface Water Data |
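The summary fields described above are plain Python dictionaries, so search results can be filtered with ordinary dictionary operations. The sketch below is illustrative only: the summary dicts and concept-id values are placeholders, not real metadata, but the field names follow those listed above.

```python
# Illustrative only: hypothetical summary dicts shaped like those
# printed above (the concept-id values are placeholders).
summaries = [
    {"short-name": "ATL06", "version": "006",
     "concept-id": "C0000000001-NSIDC_CPRD",
     "cloud-info": {"Region": "us-west-2"}},
    {"short-name": "ATL06", "version": "006",
     "concept-id": "C0000000002-NSIDC_ECS"},
]

# Cloud-hosted collections carry a cloud-info field; on-prem ones do not.
cloud_ids = [s["concept-id"] for s in summaries if "cloud-info" in s]
print(cloud_ids)
```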

Search for cloud-hosted data

For most collections, to search for only data in the cloud, the cloud_hosted parameter can be set to True.

Query = earthaccess.search_datasets(
    keyword = 'ICESat-2',
    cloud_hosted = True
)

Search a data set using spatial and temporal filters

We can use the search_data method to search for granules within a data set by location and time using spatial and temporal filters. In this example, we will search for data granules from the ATL06 version 006 cloud-hosted data set over the Juneau Icefield, AK, for March and April 2020.

The temporal range is identified with standard date strings, and the spatial extent with the latitude-longitude corners of a bounding box. Polygons, points, and shapefiles can also be used.
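As a sanity check before searching, the corner ordering of the bounding box can be verified with a small helper. check_bbox below is a hypothetical function, not part of earthaccess; it only assumes the (lower-left longitude, lower-left latitude, upper-right longitude, upper-right latitude) ordering used in the search below.

```python
def check_bbox(bbox):
    """Validate a (min_lon, min_lat, max_lon, max_lat) bounding box."""
    min_lon, min_lat, max_lon, max_lat = bbox
    if not -180 <= min_lon <= max_lon <= 180:
        raise ValueError("longitudes must satisfy -180 <= min <= max <= 180")
    if not -90 <= min_lat <= max_lat <= 90:
        raise ValueError("latitudes must satisfy -90 <= min <= max <= 90")
    return bbox

# The Juneau Icefield box used in this tutorial passes the check.
check_bbox((-134.7, 58.9, -133.9, 59.2))
```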

This will display the number of granules that match our search.

results = earthaccess.search_data(
    short_name = 'ATL06',
    version = '006',
    cloud_hosted = True,
    bounding_box = (-134.7,58.9,-133.9,59.2),
    temporal = ('2020-03-01','2020-04-30'),
)

To display the rendered metadata, including the download link, granule size and two images, we will use display. In the example below, all 4 results are shown.

The download link is https and can be used to download the granule to your local machine. This is similar to downloading DAAC-hosted data, but in this case the data are coming from the Earthdata Cloud. For NASA data in the Earthdata Cloud, there is no charge to the user for egress from AWS Cloud servers. This is not the case for other data in the cloud.

Note: the [None, None, None, None] displayed at the end can be ignored; it is simply the list of return values from display and has no meaning in relation to the metadata.

[display(r) for r in results]

Use Direct-Access to open, load and display data stored on S3

Direct access to data from an S3 bucket is a two-step process. First, the files are opened using the open method. The auth object created at the start of the notebook provides the Earthdata Login authentication and AWS credentials.

The next step is to load the data. In this case, data are loaded into an xarray.Dataset. Data could also be read into numpy arrays or a pandas.DataFrame. However, each granule would then have to be read using a package that reads HDF5 granules, such as h5py. xarray does all of this under the hood in a single line, but only for a single group in the HDF5 granule*.

*ICESat-2 measures photon returns from 3 beam pairs numbered 1, 2 and 3 that each consist of a left and a right beam. In this case, we are interested in the left ground track (gt) of beam pair 1.

files = earthaccess.open(results)
ds = xr.open_dataset(files[1], group='/gt1l/land_ice_segments')
ds
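The gt1l group opened above is one of six per granule. If you wanted to iterate over every beam, the group paths can be generated from the beam-pair numbering described above (a sketch; group_paths is just a local variable name):

```python
# Build the six land-ice-segment group paths: beam pairs 1-3, each
# with a left ("l") and right ("r") ground track.
group_paths = [
    f"/gt{pair}{side}/land_ice_segments"
    for pair in (1, 2, 3)
    for side in ("l", "r")
]
print(group_paths)
```

Each of these paths could then be passed in turn as the group argument to xr.open_dataset.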

hvplot is an interactive plotting tool that is useful for exploring data.

ds['h_li'].hvplot(kind='scatter', s=2)

3. Learning outcomes recap

We have learned how to:
1. use earthaccess to search for ICESat-2 data using spatial and temporal filters and explore search results;
2. open data granules using direct access to the ICESat-2 S3 bucket;
3. load an HDF5 group into an xarray.Dataset;
4. visualize the land ice heights using hvplot.

4. Additional resources

For general information about NSIDC DAAC data in the Earthdata Cloud:

FAQs About NSIDC DAAC’s Earthdata Cloud Migration

NASA Earthdata Cloud Data Access Guide

Additional tutorials and How Tos:

NASA Earthdata Cloud Cookbook