---
Hi @mgrover1. It is something we have discussed, but we don't have a definite plan or timeline. Tagging @aradhakrishnanGFDL in this thread.
---
I'm responding to this thread to document the current and future use of Intake catalogs by the package in a public place, as requested by @aradhakrishnanGFDL. This provides a roundabout answer to @mgrover1's initial question: we currently use a partial implementation of intake-esm for all data queries, but more work (detailed below) will be needed to adapt it to querying general external catalogs. We should prioritize this work relatively highly, since as @aradhakrishnanGFDL pointed out, it's needed for accessing cloud data stores.
---
## Current use of intake-esm

*This is technical information about the package's internal APIs, and is not necessary reading for POD developers or package users.*

### Background

One of the goals of the package is to allow PODs to analyze data from a wide range of sources and formats without being rewritten. This is done by having PODs specify their data requirements in a model-agnostic way, and by providing separate data source "plug-ins" that handle the needed data query and conversion for every source of model data we support. All current data sources operate in the following distinct stages (a schematic sketch follows the list):

- Query: search a catalog for data satisfying each POD's requirements;
- Fetch: transfer the queried data to local storage;
- Preprocess: open the downloaded files and convert them into the form the POD expects.
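To make the staged design concrete, here is a minimal, hypothetical sketch of a data source plug-in. The class and method names below are illustrative only, and are simplified relative to the package's actual DataSourceBase API.

```python
from abc import ABC, abstractmethod

class MinimalDataSource(ABC):
    """Illustrative skeleton: a data source runs requests through fixed stages."""

    @abstractmethod
    def query_dataset(self, var):
        """Query stage: locate catalog entries satisfying a POD's request."""

    @abstractmethod
    def fetch_dataset(self, var):
        """Fetch stage: transfer the queried data to local storage."""

    def request_data(self, variables):
        # Run every requested variable through the stages in order; PODs never
        # see where, or in what format, the data originally lived.
        for var in variables:
            self.query_dataset(var)
        for var in variables:
            self.fetch_dataset(var)
```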
The skeleton for this process is provided by the DataSourceBase class. Currently all data sources implement the Query stage by querying an intake-esm catalog (in a nonstandard way); this is implemented by DataframeQueryDataSourceBase. In addition, all current data sources assemble the intake catalog on the fly, by crawling data files in a regular directory hierarchy and parsing metadata from the file naming convention. This is provided by OnTheFlyDirectoryHierarchyQueryMixin, which inherits from OnTheFlyFilesystemQueryMixin. Specific data sources, which correspond to different directory hierarchy naming conventions, inherit from these classes and provide the logic describing the file naming convention. We split this logic up over a hierarchy of multiple classes to make future customization possible without code duplication, as outlined in the next comment in the thread.

### Catalog construction

Before queries are executed, the catalog is constructed by the setup_query method of OnTheFlyFilesystemQueryMixin, which is called once, before any queries take place, as part of the hooks offered by the AbstractDataSource base class. setup_query calls generate_catalog, as implemented by OnTheFlyDirectoryHierarchyQueryMixin, to crawl the directory and assemble a Pandas DataFrame, which is converted to an intake-esm catalog.

Child classes of OnTheFlyDirectoryHierarchyQueryMixin must supply two classes, _FileRegexClass and _DirectoryRegex. _DirectoryRegex is a RegexPattern -- a wrapper around a Python regular expression -- which selects the subdirectories to be included in the catalog, based on whether they match the regex. For concreteness, we'll describe how the CMIP6 directory hierarchy (DRS) is implemented by CMIP6LocalFileDataSource. In this case _DirectoryRegex is the drs_directory_regex, matching directories in the CMIP6 DRS, and _FileRegexClass is CMIP6_DRSPath, which parses CMIP6 filenames and paths.

Individual fields of a regex_dataclass can themselves be regex_dataclasses (under inheritance), in which case they apply their regexes and populate the fields of all parent classes as well. This is used in CMIP6_DRSPath, which simply concatenates the fields from CMIP6_DRSDirectory and CMIP6_DRSFilename, and so on. This is part of a more general mechanism in which the strings matched by the regex groups are used to instantiate objects of the type given in the corresponding field's type annotation. The regex_dataclass mechanism is intended to streamline the common aspects of parsing metadata from a string: in addition to the conditions imposed by the regex, arbitrary validation and checking logic can be implemented by the class. A simplified sketch of this parsing idea follows.
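Here is a simplified, self-contained sketch of the idea (not the package's regex_dataclass implementation): named regex groups populate the fields of a dataclass, and each matched string is cast to the type given by the field's annotation. The regex below covers only a toy subset of the CMIP6 filename convention.

```python
import dataclasses
import re

# Toy subset of the CMIP6 DRS filename convention; the real pattern is richer.
_FILENAME_REGEX = re.compile(
    r"(?P<variable_id>[A-Za-z0-9]+)_(?P<table_id>[A-Za-z0-9]+)_"
    r"(?P<source_id>[A-Za-z0-9-]+)_(?P<experiment_id>[A-Za-z0-9-]+)_"
    r"(?P<variant_label>r\d+i\d+p\d+f\d+)_(?P<grid_label>[A-Za-z0-9]+)_"
    r"(?P<start>\d+)-(?P<end>\d+)\.nc"
)

@dataclasses.dataclass
class ToyCMIP6Filename:
    variable_id: str
    table_id: str
    source_id: str
    experiment_id: str
    variant_label: str
    grid_label: str
    start: int  # annotations determine the type each matched string is cast to
    end: int

    @classmethod
    def from_string(cls, name):
        match = _FILENAME_REGEX.fullmatch(name)
        if match is None:
            raise ValueError(f"{name!r} doesn't follow the naming convention")
        # Cast each named group to the type in the corresponding annotation.
        return cls(**{f.name: f.type(match.group(f.name))
                      for f in dataclasses.fields(cls)})

entry = ToyCMIP6Filename.from_string(
    "tas_Amon_GFDL-CM4_historical_r1i1p1f1_gr1_185001-194912.nc")
```

In the package, each successfully parsed file becomes a row of the DataFrame that backs the on-the-fly catalog.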
### Catalog column specifications

Each field of the _FileRegexClass defines a column of the DataFrame that serves as the catalog, and each parseable file encountered in the directory crawl is added to it as a row. Metadata about the columns for a specific data source is provided by a "column specification" object, which inherits from DataframeQueryColumnSpec and is assigned as a class attribute on the data source. Among other things, this designates the columns whose values define an "experiment key": rows that agree in these columns are taken to come from the same model run.

### Catalog querying

The overarching method for the Query stage is the query_data method of DataSourceBase, which performs a query for all active PODs at once. This calls query_dataset on DataframeQueryDataSourceBase, which queries a single variable requested by a POD. The catalog query itself is done in _query_catalog.

Individual conditions of the query are assembled by _query_clause, except for the clause specifying that the data cover the analysis period, which is done first for technical reasons involving the use of comparison operators in object-valued columns (see comment in code). By default, _query_clause assumes the names of columns in the catalog are the same as the corresponding attributes on the variable object defining the query. This can be changed by defining a _query_attrs_synonyms dict as a class attribute that maps attributes on the variable to the correct column names. (Translating the values in those columns between the naming conventions of the POD's settings file and the naming convention used by the data source is done by VariableTranslator, which is beyond the scope of this description.)

The query is executed by Pandas' query method, which returns a DataFrame containing a subset of the catalog's rows. There is no good reason for this, and it should be reimplemented in terms of Intake's search method, which is closely equivalent -- see remarks in the following comment. The query results are then grouped by values of the "experiment key" (defined above). If a group is not eliminated by check_group_daterange or by custom logic in _query_group_hook, it's considered a successful query, and a "data key" (an object of a class specified by the data source) is created for it.

"Data keys" inherit from DataKeyBase and are used to associate remote files (or URLs, etc.) with local paths to downloaded data during the Fetch stage. All data sources based on DataframeQueryDataSourceBase use DataFrameDataKey, which identifies files by their row index in the catalog; the path to the remote file is stored in a column of the catalog. A toy illustration of the query assembly follows.
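As a toy illustration of this flow (again hypothetical, not the actual _query_clause code), the conditions can be assembled into a string and evaluated with pandas' DataFrame.query, with a synonyms dict mapping variable attributes to column names:

```python
import pandas as pd

# A miniature stand-in for the on-the-fly catalog.
catalog = pd.DataFrame({
    "variable_id": ["tas", "pr", "tas"],
    "experiment_id": ["historical", "historical", "ssp585"],
    "path": ["/data/tas_hist.nc", "/data/pr_hist.nc", "/data/tas_ssp585.nc"],
})

# In the spirit of _query_attrs_synonyms: attribute name -> column name.
synonyms = {"name": "variable_id", "experiment": "experiment_id"}
request = {"name": "tas", "experiment": "historical"}

# Build one clause per requested attribute and conjoin them.
query_string = " and ".join(
    f"{synonyms.get(attr, attr)} == {value!r}" for attr, value in request.items()
)
hits = catalog.query(query_string)
# hits.index holds the row indices that DataFrameDataKey-style "data keys"
# record under the current design.
```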
---
## Work needed for general Intake compatibility

As much as possible, we should use the same logic for querying pre-existing and "on the fly" catalogs, to minimize the amount of testing required. I therefore propose that DataframeQueryDataSourceBase be altered to be compatible with pre-existing catalogs, rather than creating a new base class for that use case. Once the changes below are implemented, writing a data source that uses a pre-existing Intake catalog would be a matter of defining a child class of DataframeQueryDataSourceBase, by analogy with the existing data sources.

### Changes to DataframeQueryDataSourceBase

#### Split up metadata parsing

Recall from the previous comment that parsing metadata from filenames is done by regex_dataclasses, which use a regex with named matching groups to populate metadata fields, which are then cast to the appropriate classes given by the fields' type annotations. Columns in pre-existing data catalogs are typically string-valued (SQL and the like do have date-format support, etc.), so the "on the fly" catalog should match this. This will require splitting the regex_dataclass logic into separate "string matching" and "object instantiation" steps. The former would be done at its current time (catalog creation, during the directory crawl), but the latter needs to be postponed until after the catalog query. One way to do this would be to give regex_dataclass to_string/from_string methods, which would convert between instances of the class and tuples of string-valued representations of the class's fields.

#### Use intake-esm search method

Existing logic accesses catalogs as Pandas DataFrames. There's no real reason for this, and it should be changed to use more of intake-esm's API. In particular, we should use intake-esm's search method (which returns results as a sub-catalog) instead of Pandas' query (which returns results as a DataFrame). At a minimum, we can use Intake's .df property to manipulate the underlying DataFrame directly. A sketch of this usage follows.
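For reference, a minimal sketch of the proposed usage, with the Pangeo CMIP6 catalog mentioned at the start of the thread (this assumes the intake-esm package is installed; exact keyword arguments may vary):

```python
import intake

# Open a pre-existing intake-esm catalog from its JSON descriptor.
cat = intake.open_esm_datastore(
    "https://storage.googleapis.com/cmip6/pangeo-cmip6.json")

# search() returns a sub-catalog rather than a bare DataFrame...
subset = cat.search(variable_id="tas", table_id="Amon",
                    experiment_id="historical")

# ...but the underlying DataFrame remains available via .df when downstream
# logic still needs it.
print(subset.df.head())

# to_dataset_dict() opens the query results as xarray Datasets (see the
# fetch discussion below).
datasets = subset.to_dataset_dict()
```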
#### Replace "data keys" with query results

Successful query results are stored as a map from "experiment keys" to "data keys," the latter currently being row indices of the on-the-fly catalog. This property isn't used in an essential way, so DataFrameDataKey should instead be rewritten to store the returned query result itself (see next item).

### Other changes

#### Implement data fetch based on fsspec

See related discussion in #102. Intake-esm provides the to_dataset_dict method, which simultaneously downloads/transfers query results and opens them with xarray. This is done separately in the package (separate Fetch and Preprocess stages), because we have non-trivial error-handling logic and bookkeeping that need to run if a file transfer fails. The package attempts to transfer the data needed by all PODs as a batch, so this could result in a large number of open files. This shouldn't be a problem, because xarray loads data lazily; still, we'd like to split up the work of preprocessing downloaded files in a more concurrent fashion.

The logic for transferring files lives in mixin classes separate from the query logic (e.g. LocalFetchMixin, GCPFetchMixin), so this could be done by another mixin that simply wraps to_dataset_dict. This would be simplified by having the "data keys" be the full catalog objects returned as query results. At a bare minimum, we could use fsspec's built-in local caching via filesystem chaining (see the sketch at the end of this comment) to explicitly generate a local copy of each downloaded file when to_dataset_dict is called. We could then close all the open Dataset objects it returns, figure out how to distribute the local files for concurrent preprocessing, and pass the local paths to the existing preprocessor logic.

#### Other catalog formats: STAC, THREDDS, ...

Once the tasks above are implemented, it shouldn't be too difficult to generalize this work to catalog APIs other than Intake. These would be implemented as mixin classes that build on DataframeQueryDataSourceBase, by analogy with OnTheFlyFilesystemQueryMixin. Connections to the catalog would be opened/closed in the setup_query/tear_down_query methods. The mixin would hopefully only need to re-implement the logic in _query_catalog, according to the API provided for querying the catalog, and would need to parse the query result into a DataFrame for downstream logic. As noted above, logic for supported file transfer protocols is handled separately; those connections can be opened/closed in setup_fetch/tear_down_fetch.
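A sketch of the filesystem-chaining idea mentioned in the fetch section above (the URL and cache directory are illustrative; this assumes fsspec with the gcsfs backend installed):

```python
import fsspec

# Chaining "filecache::" in front of a remote protocol makes fsspec download
# the file to a local cache; open_local returns the cached local path, which
# could then be handed to the existing preprocessor logic.
local_path = fsspec.open_local(
    "filecache::gs://cmip6/path/to/some_file.nc",  # illustrative URL
    filecache={"cache_storage": "/tmp/mdtf_cache"},
    gs={"token": "anon"},  # anonymous access to a public bucket
)
```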
---
Are there current plans to implement the ability to use intake-esm catalogs (e.g. https://storage.googleapis.com/cmip6/pangeo-cmip6.json) within this framework? Could this be specified as a parameter within the .jsonrc file (essentially replacing the MODEL_DATA_ROOT when you are interested in remote data)?