---
Hi @mgrover1. It is something we have discussed, but we don't have a definite plan or timeline. Tagging @aradhakrishnanGFDL in this thread.
---
I'm responding to this thread to document the current and future use of Intake catalogs by the package in a public place, as requested by @aradhakrishnanGFDL. This provides a roundabout answer to @mgrover1's initial question: we currently use a partial implementation of intake-esm for all data queries, but more work (detailed below) will be needed to adapt it to querying general external catalogs. We should prioritize this work relatively highly, since as @aradhakrishnanGFDL pointed out, it's needed for accessing cloud data stores.
---
## Current use of intake-esm

*This is technical information about the package's internal APIs, and is not necessary reading for POD developers or package users.*

### Background

One of the goals of the package is to allow PODs to analyze data from a wide range of sources and formats without being rewritten. This is done by having PODs specify their data requirements in a model-agnostic way, and by providing separate data source "plug-ins" that handle the needed data query and conversion for every source of model data we support. All current data sources operate in the following distinct stages (a schematic sketch follows the list):

- Query: search a catalog for data satisfying each POD's requirements;
- Fetch: transfer the queried data to local storage;
- Preprocess: open the downloaded files and convert them into the form the POD expects.
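To make the staged design concrete, here is a minimal, hypothetical sketch of a data source plug-in. The class and method names below are illustrative only, and are simplified relative to the package's actual DataSourceBase API.

```python
from abc import ABC, abstractmethod

class MinimalDataSource(ABC):
    """Illustrative skeleton: a data source runs requests through fixed stages."""

    @abstractmethod
    def query_dataset(self, var):
        """Query stage: locate catalog entries satisfying a POD's request."""

    @abstractmethod
    def fetch_dataset(self, var):
        """Fetch stage: transfer the queried data to local storage."""

    def request_data(self, variables):
        # Run every requested variable through the stages in order; PODs never
        # see where, or in what format, the data originally lived.
        for var in variables:
            self.query_dataset(var)
        for var in variables:
            self.fetch_dataset(var)
```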
The skeleton for this process is provided by the DataSourceBase class. Currently all data sources implement the Query stage by querying an intake-esm catalog (in a nonstandard way); this is implemented by DataframeQueryDataSourceBase. In addition, all current data sources assemble the intake catalog on the fly, by crawling data files in a regular directory hierarchy and parsing metadata from the file naming convention. This is provided by OnTheFlyDirectoryHierarchyQueryMixin, which inherits from OnTheFlyFilesystemQueryMixin. Specific data sources, which correspond to different directory hierarchy naming conventions, inherit from these classes and provide the logic describing the file naming convention. We split this logic up over a hierarchy of multiple classes to make future customization possible without code duplication, as outlined in the next comment in the thread.

### Catalog construction

Before queries are executed, the catalog is constructed by the setup_query method of OnTheFlyFilesystemQueryMixin, which is called once, before any queries take place, as part of the hooks offered by the AbstractDataSource base class. setup_query calls generate_catalog, as implemented by OnTheFlyDirectoryHierarchyQueryMixin, to crawl the directory and assemble a Pandas DataFrame, which is converted to an intake-esm catalog.

Child classes of OnTheFlyDirectoryHierarchyQueryMixin must supply two classes, _FileRegexClass and _DirectoryRegex. _DirectoryRegex is a RegexPattern -- a wrapper around a Python regular expression -- which selects the subdirectories to be included in the catalog, based on whether they match the regex. For concreteness, we'll describe how the CMIP6 directory hierarchy (DRS) is implemented by CMIP6LocalFileDataSource. In this case _DirectoryRegex is the drs_directory_regex, matching directories in the CMIP6 DRS, and _FileRegexClass is CMIP6_DRSPath, which parses CMIP6 filenames and paths.

Individual fields of a regex_dataclass can themselves be regex_dataclasses (under inheritance), in which case they apply their regexes and populate the fields of all parent classes as well. This is used in CMIP6_DRSPath, which simply concatenates the fields from CMIP6_DRSDirectory and CMIP6_DRSFilename, and so on. This is part of a more general mechanism in which the strings matched by the regex groups are used to instantiate objects of the type given in the corresponding field's type annotation. The regex_dataclass mechanism is intended to streamline the common aspects of parsing metadata from a string: in addition to the conditions imposed by the regex, arbitrary validation and checking logic can be implemented by the class. A simplified sketch of this parsing idea follows.
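Here is a simplified, self-contained sketch of the idea (not the package's regex_dataclass implementation): named regex groups populate the fields of a dataclass, and each matched string is cast to the type given by the field's annotation. The regex below covers only a toy subset of the CMIP6 filename convention.

```python
import dataclasses
import re

# Toy subset of the CMIP6 DRS filename convention; the real pattern is richer.
_FILENAME_REGEX = re.compile(
    r"(?P<variable_id>[A-Za-z0-9]+)_(?P<table_id>[A-Za-z0-9]+)_"
    r"(?P<source_id>[A-Za-z0-9-]+)_(?P<experiment_id>[A-Za-z0-9-]+)_"
    r"(?P<variant_label>r\d+i\d+p\d+f\d+)_(?P<grid_label>[A-Za-z0-9]+)_"
    r"(?P<start>\d+)-(?P<end>\d+)\.nc"
)

@dataclasses.dataclass
class ToyCMIP6Filename:
    variable_id: str
    table_id: str
    source_id: str
    experiment_id: str
    variant_label: str
    grid_label: str
    start: int  # annotations determine the type each matched string is cast to
    end: int

    @classmethod
    def from_string(cls, name):
        match = _FILENAME_REGEX.fullmatch(name)
        if match is None:
            raise ValueError(f"{name!r} doesn't follow the naming convention")
        # Cast each named group to the type in the corresponding annotation.
        return cls(**{f.name: f.type(match.group(f.name))
                      for f in dataclasses.fields(cls)})

entry = ToyCMIP6Filename.from_string(
    "tas_Amon_GFDL-CM4_historical_r1i1p1f1_gr1_185001-194912.nc")
```

In the package, each successfully parsed file becomes a row of the DataFrame that backs the on-the-fly catalog.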
### Catalog column specifications

Each field of the _FileRegexClass defines a column of the DataFrame that serves as the catalog, and each parseable file encountered in the directory crawl is added to it as a row. Metadata about the columns for a specific data source is provided by a "column specification" object, which inherits from DataframeQueryColumnSpec and is assigned as a class attribute on the data source. Among other things, this designates the columns whose values define an "experiment key": rows that agree in these columns are taken to come from the same model run.

### Catalog querying

The overarching method for the Query stage is the query_data method of DataSourceBase, which performs a query for all active PODs at once. This calls query_dataset on DataframeQueryDataSourceBase, which queries a single variable requested by a POD. The catalog query itself is done in _query_catalog.

Individual conditions of the query are assembled by _query_clause, except for the clause specifying that the data cover the analysis period, which is done first for technical reasons involving the use of comparison operators in object-valued columns (see comment in code). By default, _query_clause assumes the names of columns in the catalog are the same as the corresponding attributes on the variable object defining the query. This can be changed by defining a _query_attrs_synonyms dict as a class attribute that maps attributes on the variable to the correct column names. (Translating the values in those columns between the naming conventions of the POD's settings file and the naming convention used by the data source is done by VariableTranslator, which is beyond the scope of this description.)

The query is executed by Pandas' query method, which returns a DataFrame containing a subset of the catalog's rows. There is no good reason for this, and it should be reimplemented in terms of Intake's search method, which is closely equivalent -- see remarks in the following comment. The query results are then grouped by values of the "experiment key" (defined above). If a group is not eliminated by check_group_daterange or by custom logic in _query_group_hook, it's considered a successful query, and a "data key" (an object of a class specified by the data source) is created for it.

"Data keys" inherit from DataKeyBase and are used to associate remote files (or URLs, etc.) with local paths to downloaded data during the Fetch stage. All data sources based on DataframeQueryDataSourceBase use DataFrameDataKey, which identifies files by their row index in the catalog; the path to the remote file is stored in a column of the catalog. A toy illustration of the query assembly follows.
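As a toy illustration of this flow (again hypothetical, not the actual _query_clause code), the conditions can be assembled into a string and evaluated with pandas' DataFrame.query, with a synonyms dict mapping variable attributes to column names:

```python
import pandas as pd

# A miniature stand-in for the on-the-fly catalog.
catalog = pd.DataFrame({
    "variable_id": ["tas", "pr", "tas"],
    "experiment_id": ["historical", "historical", "ssp585"],
    "path": ["/data/tas_hist.nc", "/data/pr_hist.nc", "/data/tas_ssp585.nc"],
})

# In the spirit of _query_attrs_synonyms: attribute name -> column name.
synonyms = {"name": "variable_id", "experiment": "experiment_id"}
request = {"name": "tas", "experiment": "historical"}

# Build one clause per requested attribute and conjoin them.
query_string = " and ".join(
    f"{synonyms.get(attr, attr)} == {value!r}" for attr, value in request.items()
)
hits = catalog.query(query_string)
# hits.index holds the row indices that DataFrameDataKey-style "data keys"
# record under the current design.
```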
---
## Work needed for general Intake compatibility

As much as possible, we should use the same logic for querying pre-existing and "on the fly" catalogs, to minimize the amount of testing required. I therefore propose that DataframeQueryDataSourceBase be altered to be compatible with pre-existing catalogs, rather than creating a new base class for that use case. Once the changes below are implemented, writing a data source that uses a pre-existing Intake catalog would be a matter of defining a child class of DataframeQueryDataSourceBase, by analogy with the existing data sources.

### Changes to DataframeQueryDataSourceBase

#### Split up metadata parsing

Recall from the previous comment that parsing metadata from filenames is done by regex_dataclasses, which use a regex with named matching groups to populate metadata fields, which are then cast to the appropriate classes given by the fields' type annotations. Columns in pre-existing data catalogs are typically string-valued (SQL and the like do have date-format support, etc.), so the "on the fly" catalog should match this. This will require splitting the regex_dataclass logic into separate "string matching" and "object instantiation" steps. The former would be done at its current time (catalog creation, during the directory crawl), but the latter needs to be postponed until after the catalog query. One way to do this would be to give regex_dataclass to_string/from_string methods, which would convert between instances of the class and tuples of string-valued representations of the class's fields.

#### Use intake-esm search method

Existing logic accesses catalogs as Pandas DataFrames. There's no real reason for this, and it should be changed to use more of intake-esm's API. In particular, we should use intake-esm's search method (which returns results as a sub-catalog) instead of Pandas' query (which returns results as a DataFrame). At a minimum, we can use Intake's .df property to manipulate the underlying DataFrame directly. A sketch of this usage follows.
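For reference, a minimal sketch of the proposed usage, with the Pangeo CMIP6 catalog mentioned at the start of the thread (this assumes the intake-esm package is installed; exact keyword arguments may vary):

```python
import intake

# Open a pre-existing intake-esm catalog from its JSON descriptor.
cat = intake.open_esm_datastore(
    "https://storage.googleapis.com/cmip6/pangeo-cmip6.json")

# search() returns a sub-catalog rather than a bare DataFrame...
subset = cat.search(variable_id="tas", table_id="Amon",
                    experiment_id="historical")

# ...but the underlying DataFrame remains available via .df when downstream
# logic still needs it.
print(subset.df.head())

# to_dataset_dict() opens the query results as xarray Datasets (see the
# fetch discussion below).
datasets = subset.to_dataset_dict()
```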
#### Replace "data keys" with query results

Successful query results are stored as a map from "experiment keys" to "data keys," the latter currently being row indices of the on-the-fly catalog. This property isn't used in an essential way, so DataFrameDataKey should instead be rewritten to store the returned query result itself (see next item).

### Other changes

#### Implement data fetch based on fsspec

See related discussion in #102. Intake-esm provides the to_dataset_dict method, which simultaneously downloads/transfers query results and opens them with xarray. This is done separately in the package (separate Fetch and Preprocess stages), because we have non-trivial error-handling logic and bookkeeping that need to run if a file transfer fails. The package attempts to transfer the data needed by all PODs as a batch, so this could result in a large number of open files. This shouldn't be a problem, because xarray loads data lazily; still, we'd like to split up the work of preprocessing downloaded files in a more concurrent fashion.

The logic for transferring files lives in mixin classes separate from the query logic (e.g. LocalFetchMixin, GCPFetchMixin), so this could be done by another mixin that simply wraps to_dataset_dict. This would be simplified by having the "data keys" be the full catalog objects returned as query results. At a bare minimum, we could use fsspec's built-in local caching via filesystem chaining (see the sketch at the end of this comment) to explicitly generate a local copy of each downloaded file when to_dataset_dict is called. We could then close all the open Dataset objects it returns, figure out how to distribute the local files for concurrent preprocessing, and pass the local paths to the existing preprocessor logic.

#### Other catalog formats: STAC, THREDDS, ...

Once the tasks above are implemented, it shouldn't be too difficult to generalize this work to catalog APIs other than Intake. These would be implemented as mixin classes that build on DataframeQueryDataSourceBase, by analogy with OnTheFlyFilesystemQueryMixin. Connections to the catalog would be opened/closed in the setup_query/tear_down_query methods. The mixin would hopefully only need to re-implement the logic in _query_catalog, according to the API provided for querying the catalog, and would need to parse the query result into a DataFrame for downstream logic. As noted above, logic for supported file transfer protocols is handled separately; those connections can be opened/closed in setup_fetch/tear_down_fetch.
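A sketch of the filesystem-chaining idea mentioned in the fetch section above (the URL and cache directory are illustrative; this assumes fsspec with the gcsfs backend installed):

```python
import fsspec

# Chaining "filecache::" in front of a remote protocol makes fsspec download
# the file to a local cache; open_local returns the cached local path, which
# could then be handed to the existing preprocessor logic.
local_path = fsspec.open_local(
    "filecache::gs://cmip6/path/to/some_file.nc",  # illustrative URL
    filecache={"cache_storage": "/tmp/mdtf_cache"},
    gs={"token": "anon"},  # anonymous access to a public bucket
)
```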
---
Are there current plans to implement the ability to use intake-esm catalogs (e.g. https://storage.googleapis.com/cmip6/pangeo-cmip6.json) within this framework? Could this be specified as a parameter within the .jsonrc file (essentially replacing the MODEL_DATA_ROOT when you are interested in remote data)?