Make data fetching scalable with Dask #392
Conversation
- Fix bug whereby the load would fail if fetching is done without concat
- Blackify
- Improve docstrings
- Refactor `_mfprocessor_json` to be more consistent with `_mfprocessor_dataset`
- Add Dask client option to `open_mfjson`
- Allow for casting timestamps from the argovis API
- Make it serializable
- Update `_add_attributes` to handle coordinates
- Also update option validation
- Expose new `options.VALIDATE` function
Some first testing on my side:
Glad this works! Note @quai20 that performance with argovis is not as good as expected and could be improved; see argovis/argovis_api#345
Test on Datarmor with the dask-hpcconfig 'datarmor-local' cluster, and the dataref GDAC:
As fast as the erddap!
Hi @quai20
- `httpstore._mfprocessor_json` no longer raises DataNotFound
❌ 1 test failed.
Motivation
This has been under the hood for a very long time, somehow hidden in our file stores, waiting for prime time.
With the advance of Argo data availability in the cloud, it's now time to support a new parallelization method with Dask.
Existing methods are documented here: https://argopy.readthedocs.io/en/latest/performances.html#parallel-data-fetching
Support for Dask already works, but only down to the very low level of fetching the raw data.
Making the high-level argopy API fully work with a Dask client requires more work.
Existing feature
Today, we can fetch data in parallel with a Dask client, but only at the low level.
First, let's create a fetcher without downloading data, just to get the list of resources to load and process:
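For instance (a minimal sketch; the data source, region and chunking arguments are illustrative, not necessarily those used in the tests above):

```python
from argopy import DataFetcher

# A fetcher for a large space/time domain, with parallel chunking enabled;
# nothing is downloaded at this point:
f = DataFetcher(src="erddap", parallel=True).region(
    [-75, -45, 20, 30, 0, 1000, "2010-01", "2012-01"]
)

# The list of resources (chunk URLs) that would be loaded and processed:
print(len(f.uri))
```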
With this fetcher, we need to load and process 50 chunks of data.
If we now get a Dask client:
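For instance, a local cluster via dask.distributed (a minimal sketch):

```python
from distributed import Client

# Start a local Dask cluster and connect a client to it:
client = Client()
print(client)
```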
we can fetch data with our file store method `open_mfdataset`. The trick is to give the Dask client to the `method` argument and to ignore all errors (because if one URL throws an error, the entire process fails).
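A hedged sketch of this low-level call, reusing `f` and `client` from above and assuming `fs` is an `argopy.stores.httpstore` instance (the exact keyword names follow the file-store API discussed in this PR):

```python
from argopy.stores import httpstore

fs = httpstore()

# Pass the Dask client as the parallelization "method", and ignore
# per-URL failures so one bad chunk does not abort the whole load:
ds = fs.open_mfdataset(f.uri, method=client, errors="ignore")
```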
What needs to be done
Although our `open_mfdataset` is not fundamentally different from using a catalog or xarray's ability to open multiple files, using our low-level argopy file store method `open_mfdataset` allows for more Argo-related control of the data processing.
That's why, in practice, the high-level API calls `f.load()` or `f.to_xarray()` are not yet working with a Dask client: under the hood, the data fetcher performs complex processing of the data, and this processing is not yet serializable by Dask.
This is illustrated by the following snippet, where we now try to apply the data fetcher processing function to each chunk of data:
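A sketch of what such a call looks like (here `f.fetcher.post_process` stands for the fetcher's internal processing function; the exact attribute path is indicative):

```python
# Try to run the fetcher's post-processing on each chunk, on the Dask workers:
ds = fs.open_mfdataset(
    f.uri,
    method=client,
    preprocess=f.fetcher.post_process,  # bound method: not serializable by Dask
    errors="ignore",
)
```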
This raises a pickle error like `Could not serialize object ...`
So that's where most of the work in this PR should go: making the argopy data processing chain serializable.
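To see why, note that Dask ships tasks to workers with (cloud)pickle: a module-level function is pickled by reference, while a bound method drags its whole instance along, which fails as soon as the instance holds an unpicklable resource. A minimal, argopy-independent illustration:

```python
import pickle

# A module-level processing function: pickled by reference, safe to ship to workers.
def pre_process(ds):
    return ds

print(len(pickle.dumps(pre_process)))  # works

# A bound method of an object holding an unpicklable resource (a stand-in for
# an open network session): pickling it pickles the whole instance, and fails.
class Fetcher:
    def __init__(self):
        self.session = open(__file__)  # unpicklable attribute

    def post_process(self, ds):
        return ds

try:
    pickle.dumps(Fetcher().post_process)
except TypeError as e:
    print(e)  # cannot pickle '_io.TextIOWrapper' object
```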
What has been done
Note: this PR is based on the upcoming v1.0.0 release.
- `argopy.data_fetchers.gdac_data.GDACArgoDataFetcher.post_process` refactored to `argopy.data_fetchers.gdac_data_processors.pre_process`
- `argopy.data_fetchers.erddap_data.ErddapArgoDataFetcher.post_process` refactored to `argopy.data_fetchers.erddap_data_processors.pre_process`
- `argopy.data_fetchers.argovis_data.ArgovisArgoDataFetcher.post_process` refactored to `argopy.data_fetchers.argovis_data_processors.pre_process`
- `DataSet.argo.datamode` extension
- Allow errors raised by the lowest-level call (`httpstore.download_url`) to be ignored from higher-level calls (by `httpstore.open_dataset`, `httpstore.open_mfdataset`, `httpstore.open_json`, `httpstore.open_mfjson`)
- Merged the `parallel` and `parallel_default_method` options into a single option, `parallel`, which can take False as a value for no parallelization, or the name of the method to use. If set to True, we fall back on using the default method.

And also: