Make data fetching scalable with Dask #392

Merged — gmaze merged 60 commits into master from dask-ok on Oct 15, 2024

Conversation

gmaze
Member

@gmaze gmaze commented Sep 18, 2024

Motivation

This has been under the hood for a very long time and somehow hidden in our file stores, waiting for prime time.

But with the advance of Argo data availability in the cloud, it's now time to support a new parallelization method with Dask.

Existing methods are documented here: https://argopy.readthedocs.io/en/latest/performances.html#parallel-data-fetching

Support for Dask already works, but only at the very low level of fetching the raw data.
Fully making the high-level argopy API work with a Dask client requires more work.

Existing feature

Today, we can fetch data in parallel with a Dask client only at the low level.

First, let's create a fetcher without downloading data, just to get the list of resources to load and process:

import argopy

f = argopy.DataFetcher(ds='bgc', src='erddap',
                       chunks={'lon': 1, 'lat': 1, 'dpt': 'auto', 'time': 1},
                       chunks_maxsize={'dpt': 100})
f = f.region([-78, -50, 34, 80, 0, 5000, '2020-01-01', '2021-01-01'])
len(f.uri)  # Returns 50 erddap urls

With this fetcher, we need to load and process 50 chunks of data: the 0-5000 m depth range is split into chunks of at most 100 m (hence 50 depth chunks), while longitude, latitude and time are kept whole.

If we now get a Dask client:

from dask.distributed import Client
client = Client(processes=True)

we can fetch data with our file store method open_mfdataset:

ds = f.fetcher.fs.open_mfdataset(
    f.uri,
    method=client,
    open_dataset_opts={'errors': 'ignore', 'download_url_opts': {'errors': 'ignore'}},
)

The trick is to pass the Dask client to the method argument and to ignore all errors (because if one URL throws an error, the entire process fails).
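For comparison, here is a minimal sketch of the same call without a Dask client, simply omitting the method argument so that the file store falls back on its default (non-Dask) behaviour; this is only an illustration based on the snippet above:

# Same call as above, but without handing over the Dask client:
# the file store processes the 50 URIs with its default method.
ds_sequential = f.fetcher.fs.open_mfdataset(
    f.uri,
    open_dataset_opts={'errors': 'ignore', 'download_url_opts': {'errors': 'ignore'}},
)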

What needs to be done

Although our open_mfdataset is not fundamentally different from using a catalog or xarray's ability to open multiple files, using our low-level argopy file store method open_mfdataset allows for more Argo-related control of the data processing.

That's why, in practice, the high-level API calls f.load() or f.to_xarray() do not yet work with a Dask client: under the hood, the data fetcher performs complex processing of the data, and this processing is not yet serialisable by Dask.

This is illustrated by the following snippet, where we now try to apply the data fetcher processing function to each chunk of data:

ds = f.fetcher.fs.open_mfdataset(
    f.uri,
    method=client,
    open_dataset_opts={'errors': 'ignore', 'download_url_opts': {'errors': 'ignore'}},
    preprocess=f.fetcher.post_process,
    preprocess_opts={"add_dm": False, "URI": f.uri},
)

This raises a pickle error like Could not serialize object ...

So that's where most of the work in this PR should go: make the argopy data processing chain serialisable.
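A quick way to reproduce this kind of failure outside of a full fetch is to check whether the preprocessing callable can be serialised the way Dask ships tasks to its workers. The following minimal sketch reuses the fetcher f from the snippets above and cloudpickle (which Dask relies on); it is an illustration, not part of the argopy API:

import cloudpickle

try:
    # If this succeeds, the callable can be shipped to Dask workers
    cloudpickle.dumps(f.fetcher.post_process)
    print("post_process can be shipped to Dask workers")
except Exception as err:
    print("Could not serialize object:", err)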

What has been done

Note: this PR is based on the upcoming v1.0.0 release

  • Clarify what we call pre- and post-processing within argopy (see schematic below)
  • Refactor the pre-processing chain out of the GDAC data fetcher: argopy.data_fetchers.gdac_data.GDACArgoDataFetcher.post_process refactored to argopy.data_fetchers.gdac_data_processors.pre_process
  • Refactor the pre-processing chain out of the Erddap data fetcher: argopy.data_fetchers.erddap_data.ErddapArgoDataFetcher.post_process refactored to argopy.data_fetchers.erddap_data_processors.pre_process
  • Refactor the pre-processing chain out of the Argovis data fetcher: argopy.data_fetchers.argovis_data.ArgovisArgoDataFetcher.post_process refactored to argopy.data_fetchers.argovis_data_processors.pre_process
  • Refactor any methods related to data modes into a new DataSet.argo.datamode extension
  • Allow low-level data download errors (raised by httpstore.download_url) to be ignored from higher-level calls (httpstore.open_dataset, httpstore.open_mfdataset, httpstore.open_json, httpstore.open_mfjson)
  • Add new options parallel and parallel_default_method
  • Make it easier to use the option parallel: it can take False for no parallelization, the name of the method to use, or True to fall back on the default method (see the code sketch below)
  • Update the documentation:
  • Refactor methods to adhere to the pre- and post-processing steps described in the schematic below
  • TBC

[Schematic: argopy data processing chain]
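As a sketch of how the new parallel option described above could be used (values are illustrative and based on this PR description; 'thread' stands for one of the existing parallelization methods documented in the performances page linked above):

import argopy
from dask.distributed import Client

argopy.set_options(parallel=False)      # no parallelization
argopy.set_options(parallel=True)       # fall back on the default method
argopy.set_options(parallel='thread')   # name of a method to use (assumed value)
argopy.set_options(parallel=Client())   # or hand over a Dask client directly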


@gmaze gmaze added enhancement New feature or request backends performance labels Sep 18, 2024
@gmaze gmaze marked this pull request as draft September 18, 2024 13:04
@gmaze gmaze self-assigned this Sep 30, 2024
@gmaze gmaze marked this pull request as ready for review September 30, 2024 14:03
- also update option validation
- expose new options.VALIDATE function
@gmaze gmaze requested a review from quai20 October 3, 2024 13:01
@quai20
Member

quai20 commented Oct 8, 2024

Some first testing on my side:

%%time
with argopy.set_options(parallel=client):
    f = DataFetcher(src='argovis').region(box)
    print("%i chunks to process" % len(f.uri))
    print(f)
    ds = f.load().data
    print(ds)
117 chunks to process
<datafetcher.argovis>
👁 Name: Argovis Argo data fetcher for a space/time region
🗺  Domain: [x=-60.00/0.00; y=20.00/60.08; z=0.0/500.0; t=2007-01-01/2009-01-01]
🔗 API: https://argovis-api.colorado.edu/
🗝 API KEY: 'guest' (get a free key at https://argovis-keygen.colorado.edu/)
🏊 User mode: standard
🟡+🔵 Dataset: phy
🌤  Performances: cache=False, parallel=True [<Client: 'tcp://127.0.0.1:44031' processes=4 threads=4, memory=15.53 GiB>]
<xarray.Dataset>
    [...]
CPU times: user 3min 21s, sys: 12.9 s, total: 3min 34s
Wall time: 10min 49s

@quai20
Member

quai20 commented Oct 8, 2024

%%time
with argopy.set_options(parallel=client):
    f = DataFetcher(src='erddap').region(box)
    print("%i chunks to process" % len(f.uri))
    print(f)
    ds = f.load().data
    print(ds)
81 chunks to process
<datafetcher.erddap>
⭐ Name: Ifremer erddap Argo data fetcher for a space/time region
🗺  Domain: [x=-60.00/0.00; y=20.00/60.09; z=0.0/500.0; t=2007-01-01/2009-01-01]
🔗 API: https://erddap.ifremer.fr/erddap
🏊 User mode: standard
🟡+🔵 Dataset: phy
🌤  Performances: cache=False, parallel=True [<Client: 'tcp://127.0.0.1:44031' processes=4 threads=4, memory=15.53 GiB>]
<xarray.Dataset>
[...]
CPU times: user 1min 45s, sys: 6.01 s, total: 1min 51s
Wall time: 4min 15s

@gmaze
Member Author

gmaze commented Oct 8, 2024

Glad this works !

Note @quai20 that performances with argovis are not as good as expected and could be improved, see argovis/argovis_api#345

@quai20
Member

quai20 commented Oct 8, 2024

Test on datarmor with dask-hpcconfig 'datarmor-local' cluster, and dataref gdac

%%time
with argopy.set_options(parallel=client):
    f = DataFetcher(src='gdac',gdac='/home/ref-argo/gdac').region(box)
    print("%i chunks to process" % len(f.uri))
    print(f)
    ds = f.load().data
    print(ds)
410 chunks to process
<datafetcher.gdac>
🌐 Name: Ifremer GDAC Argo data fetcher for a space/time region
🗺  Domain: [x=-60.00/0.00; y=20.00/60.01; z=0.0/500.0; t=2007-01-01/2009-01-01]
🔗 API: /home/ref-argo/gdac
📗 Index: ar_index_global_prof.txt.gz (3042123 records)
📸 Index searched: True (410 matches, 0.0135%)
🏊 User mode: standard
🟡+🔵 Dataset: phy
🌤  Performances: cache=False, parallel=True [<Client: 'tcp://127.0.0.1:45831' processes=7 threads=14, memory=100.00 GiB>]
Oops! <class 'UnicodeDecodeError'> occurred.
Fail to cast SCIENTIFIC_CALIB_COEFFICIENT[('N_PROF', 'N_CALIB', 'N_PARAM')] from 'object' to <class 'str'>
<xarray.Dataset> Size: 73MB
Dimensions:          (N_POINTS: 611401)
[...]
CPU times: user 1min 25s, sys: 5.73 s, total: 1min 31s
Wall time: 4min 48s

@gmaze
Member Author

gmaze commented Oct 8, 2024

Test on datarmor with dask-hpcconfig 'datarmor-local' cluster, and dataref gdac

as fast as the erddap !

@gmaze
Member Author

gmaze commented Oct 9, 2024

Hi @quai20
Thanks for the review.
I'll double check the doc and CI tests and merge this
g


codecov bot commented Oct 11, 2024

❌ 1 Tests Failed:

Tests completed: 1 | Failed: 1 | Passed: 0 | Skipped: 0
View the top 1 failed tests by shortest run time
 test_fetchers_dask_cluster
Stack Traces | 0s run time
No failure message available

To view individual test run time comparison to the main branch, go to the Test Analytics Dashboard

@gmaze gmaze merged commit f405296 into master Oct 15, 2024
43 checks passed
@gmaze gmaze deleted the dask-ok branch October 15, 2024 09:33

Successfully merging this pull request may close these issues.

Asynchronous, parallel/concurrent data fetching from a single Argo data server ?