clarify datasets on hovenweep accessibility
amsnyder committed Aug 30, 2024
1 parent ecd942b commit 72daccb
Showing 3 changed files with 19 additions and 16 deletions.
14 changes: 8 additions & 6 deletions _sources/dataset_catalog/README.md
# HyTEST Data Catalog (Intake)
This section describes how to use HyTEST's [intake catalog](https://intake.readthedocs.io/en/latest/catalog.html). Intake catalogs help reduce or remove the burden of handling different file formats and storage locations, making it easier to read data into your workflow. They also allow data providers to update the filepath/storage location of a dataset without breaking the workflows that were built on top of the intake catalog.

Our catalog facilitates this access for HyTEST's key data offerings and is used to read the data into the notebooks contained in this repository. While intake catalogs are Python-centric, they are stored as a YAML file, which should also be easy to parse in other programming languages, even if no equivalent package exists in that language.

Please note that this catalog is a temporary solution for reading data into our workflows. By the end of 2024, we hope to replace this catalog with a [STAC](https://stacspec.org/en). We plan to update all notebooks to read from our STAC at that time as well.

## Storage Locations
Before getting into the details of how to use the intake catalog, it will be helpful to have some background on the various data storage systems HyTEST uses. Many of the datasets in our intake catalog are duplicated across multiple storage locations, so you will need a basic understanding of these systems to navigate the data catalog. For datasets that are duplicated in multiple locations, the data on all storage systems will be identical; however, the details and costs associated with accessing them may differ. Duplicated datasets will have identical names up until the last hyphenated part of the name, which indicates the storage location; for example, `conus404-hourly-cloud`, `conus404-hourly-osn`, and `conus404-hourly-onprem` are all identical datasets stored in different places. We currently store data in four locations: **AWS S3 buckets**, **Open Storage Network (OSN) pods**, and [USGS on-premises supercomputer](https://hpcportal.cr.usgs.gov/) storage systems (one storage system for the **Tallgrass/Denali** supercomputers and another for the **Hovenweep** supercomputer). Each of these locations is described in more detail below.
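The suffix convention described above can be sketched as a small helper. This is purely illustrative — the mapping and function are not part of the HyTEST catalog itself, just a restatement of the naming rule:

```python
# Map the hyphenated suffix of a catalog entry name to its storage location.
# Illustrative only: the suffixes follow the naming convention described above,
# but this helper is not part of the HyTEST catalog or its tooling.
SUFFIX_TO_LOCATION = {
    "-cloud": "AWS S3 (us-west-2, requester pays)",
    "-osn": "Open Storage Network (OSN) pod",
    "-onprem": "Caldera (Tallgrass/Denali filesystem)",
    "-onprem-hw": "Caldera (Hovenweep filesystem)",
}

def storage_location(dataset_name: str) -> str:
    """Return the storage location implied by a dataset name's suffix."""
    # Check longer suffixes first so "-onprem-hw" is matched before "-onprem".
    for suffix in sorted(SUFFIX_TO_LOCATION, key=len, reverse=True):
        if dataset_name.endswith(suffix):
            return SUFFIX_TO_LOCATION[suffix]
    raise ValueError(f"unrecognized storage suffix in {dataset_name!r}")

print(storage_location("conus404-hourly-osn"))
# → Open Storage Network (OSN) pod
```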

### AWS S3
This location provides object storage through an Amazon Web Services (AWS) Simple Storage Service (S3) bucket. This data is free to access for workflows that are running in the AWS us-west-2 region. However, if you would like to pull the data out of the AWS cloud (to your local computer, a supercomputer, or another cloud provider) or into another [AWS cloud region](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.RegionsAndAvailabilityZones.html), you will incur fees. This is because the bucket storing the data is a **“requester pays”** bucket. The costs associated with reading the data to other computing environments or AWS regions is documented [here](https://aws.amazon.com/s3/pricing/) (on the “Requests and Data Retrievals” tab). If you do need to read this data into a computing environment outside the AWS us-west-2 region, you will need to make sure you have an [AWS account](https://aws.amazon.com/account/) set up. You will need credentials from this account to read in the data, and your account will be billed. Please refer to the [AWS Credentials](../environment_set_up/Help_AWS_Credentials.ipynb) section of this book for more details on handling AWS credentials.
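When reading from a "requester pays" bucket with the common fsspec/s3fs stack, the request must explicitly opt in to paying for the transfer. A minimal sketch, assuming s3fs is the access layer; the profile name is a placeholder, not a real account:

```python
# Reading from a "requester pays" S3 bucket requires both AWS credentials and
# an explicit requester-pays flag; the reads are billed to your account.
# "my-aws-profile" is a placeholder for an AWS profile with valid credentials.
storage_options = {
    "requester_pays": True,       # acknowledge that your account pays for reads
    "profile": "my-aws-profile",  # placeholder AWS credentials profile
}

# With s3fs installed, these options are passed straight through, e.g.:
#   import s3fs
#   fs = s3fs.S3FileSystem(**storage_options)
#   fs.ls("s3://<bucket>/<prefix>")
print(storage_options)
```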
### Open Storage Network (OSN) Pods
The OSN pod storage can be accessed through an API that is compatible with the b

**Datasets in the intake catalog that are stored on the OSN pod have a name ending in "-osn".**

### USGS On-premises Supercomputer Storage (Caldera for Tallgrass/Denali and Hovenweep)
The last storage location is the USGS on-premises disk storage attached to the USGS supercomputers (often referred to as Caldera). This location is **only accessible to USGS employees or collaborators who have been granted access to [USGS supercomputers](https://hpcportal.cr.usgs.gov/)**. This is the preferred data storage to use if you are working on the USGS supercomputers, as it will give you the fastest data reads.

The Tallgrass and Denali supercomputers share one filesystem, while the Hovenweep supercomputer has its own. Each supercomputer can only read data from its own filesystem (you *cannot* read data from the filesystem attached to Denali/Tallgrass into Hovenweep, and vice versa). You also *cannot* read data from an on-premises storage system into any computing environment outside of the USGS supercomputers (such as your local computer or the cloud). More information about this storage system can be found in the [HPC User Docs](https://hpcportal.cr.usgs.gov/hpc-user-docs/supercomputers/caldera.html) (which are also only accessible through the internal USGS network).

**Datasets in the intake catalog that are stored on the filesystem attached to Denali/Tallgrass have a name ending in "-onprem", while datasets stored on the filesystem attached to Hovenweep have a name ending in "-onprem-hw".**
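Putting the on-premises rules together, a workflow could pick the catalog suffix that matches where it is running. The environment labels and helper below are illustrative — detect your own environment however is appropriate (hostname, an environment variable, etc.):

```python
# Pick the catalog-entry suffix that matches the compute environment.
# Illustrative only: the environment labels are placeholders, not values
# exposed by the supercomputers themselves.
def catalog_suffix(environment: str) -> str:
    suffixes = {
        "denali": "-onprem",        # Denali/Tallgrass share one filesystem
        "tallgrass": "-onprem",
        "hovenweep": "-onprem-hw",  # Hovenweep has its own filesystem
    }
    # Anywhere else (laptop, cloud, etc.) cannot reach Caldera, so fall back
    # to the openly accessible OSN copy of the dataset.
    return suffixes.get(environment, "-osn")

print("conus404-hourly" + catalog_suffix("hovenweep"))
# → conus404-hourly-onprem-hw
```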

## Example Intake Catalog Usage
Now that you have an understanding of the different storage systems HyTEST uses, you will be able to navigate the HyTEST intake catalog and make a selection that is appropriate for your computing environment. Below is a demonstration of how to use HyTEST's intake catalog to select and open a dataset in your Python workflow.
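A minimal sketch of what such a workflow looks like with the intake package. The catalog URL below is an assumption based on the HyTEST repository layout at the time of writing and may change; check the repository for the current location:

```python
# A minimal sketch of opening the HyTEST intake catalog with intake.
# The URL is an assumption based on the hytest-org repository layout and
# may change; see the repository for the authoritative location.
catalog_url = (
    "https://raw.githubusercontent.com/hytest-org/hytest/"
    "main/dataset_catalog/hytest_intake_catalog.yml"
)

# With intake installed and network access available:
#   import intake
#   cat = intake.open_catalog(catalog_url)
#   print(list(cat))                           # names of all catalog entries
#   ds = cat["conus404-hourly-osn"].to_dask()  # lazily open one dataset
print(catalog_url)
```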