feat(datasets): Add option to async load and save in PartitionedDatasets #696

puneeter · 2024-05-23T17:02:50Z

Description

This PR lets the user load and save the partitions of a PartitionedDataset asynchronously.
PartitionedDataset already supports lazy loading, which keeps memory usage down. With this PR, load/save time can also be reduced: setting the use_async argument loads/saves the partitions concurrently.

Development notes

A new use_async argument on the PartitionedDataset constructor controls whether load/save run asynchronously.
Depending on this argument, the _save and _load methods call different private functions.
The existing PartitionedDataset tests are reused by parameterizing use_async with @pytest.mark.parametrize("use_async", [True, False]).
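
To make the notes above concrete, here is a minimal sketch of the dispatch pattern they describe. It is not the code in this PR: TinyPartitionedStore and its helper methods are hypothetical stand-ins, and running each write in a worker thread via asyncio.to_thread is an assumption about how blocking writes could be overlapped.

```python
import asyncio
from typing import Any


class TinyPartitionedStore:
    """Hypothetical stand-in for a partitioned dataset; not Kedro's class."""

    def __init__(self, use_async: bool = False) -> None:
        self._use_async = use_async
        self.saved: dict[str, Any] = {}

    def _save_partition(self, partition_id: str, partition: Any) -> None:
        # Lazy partitions are callables that produce their data on demand.
        if callable(partition):
            partition = partition()
        self.saved[partition_id] = partition

    def _sync_save(self, data: dict[str, Any]) -> None:
        # Write partitions one after another.
        for partition_id, partition in data.items():
            self._save_partition(partition_id, partition)

    async def _async_save(self, data: dict[str, Any]) -> None:
        # Run each blocking write in a worker thread and wait for all of them.
        await asyncio.gather(
            *(
                asyncio.to_thread(self._save_partition, partition_id, partition)
                for partition_id, partition in data.items()
            )
        )

    def save(self, data: dict[str, Any]) -> None:
        # The constructor flag decides which private path is taken.
        if self._use_async:
            asyncio.run(self._async_save(data))
        else:
            self._sync_save(data)
```

Both paths should leave the store in the same state, which is what makes it possible to reuse one test suite for both values of the flag.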

Checklist

Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the relevant RELEASE.md file
Added tests to cover my changes

Signed-off-by: puneeter <puneet.saini@quantumblack.com>

merelcht · 2024-05-24T11:01:58Z

Hi @puneeter, can you please provide a description and any relevant development notes on the PR? This will make it easier for the team to review.

puneeter · 2024-05-24T11:09:28Z

> Hi @puneeter, can you please provide a description and any relevant development notes on the PR? This will make it easier for the team to review.

I updated the description. Please let me know if it needs any refactoring.

kedro-datasets/kedro_datasets/partitions/partitioned_dataset.py

Signed-off-by: puneeter <puneet.saini@quantumblack.com>

puneeter · 2024-10-18T11:07:34Z

I would need the team's help to identify which documentation should be updated for this change. Maybe docs/source/data/partitioned_and_incremental_datasets.md?

astrojuanlu · 2025-01-14T15:31:48Z

Hey @puneeter, sorry for the long delay. Indeed, partitioned_and_incremental_datasets.md corresponds to https://docs.kedro.org/en/0.19.10/data/partitioned_and_incremental_datasets.html

In the end, is the usage similar to what I wrote here #696 (comment) or is it different?

Aside from that, I'll leave one more comment

astrojuanlu · 2025-01-14T15:36:51Z

kedro-datasets/kedro_datasets/partitions/partitioned_dataset.py

    def save(self, data: dict[str, Any]) -> None:
+        if self._use_async:
+            asyncio.run(self._async_save(data))


If I understand correctly, asyncio.run creates a new event loop, so if there's already an event loop running (for example, in a Jupyter notebook), calling this will raise an error.

This is essentially the red/blue function problem... Most of Kedro is synchronous anyway AFAIK, but I think this might set an API expectation that could be difficult to satisfy cleanly.

@merelcht @ElenaKhaustova do you have more thoughts?
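
The failure mode described in this comment can be reproduced with the standard library alone. The sketch below is illustrative: write_partitions, save, and notebook_cell are hypothetical names, and notebook_cell only simulates an environment where a loop is already running.

```python
import asyncio


async def write_partitions() -> str:
    # Stand-in for the real asynchronous save work.
    await asyncio.sleep(0)
    return "saved"


def save() -> str:
    # Same pattern as the diff: a synchronous method that starts its own loop.
    coro = write_partitions()
    try:
        return asyncio.run(coro)
    except RuntimeError:
        # Close the coroutine so it does not trigger a "never awaited" warning.
        coro.close()
        raise


async def notebook_cell() -> str:
    # Simulates calling save() while an event loop is already running.
    try:
        return save()
    except RuntimeError as exc:
        return f"RuntimeError: {exc}"
```

Calling save() from plain synchronous code works, while asyncio.run(notebook_cell()) takes the RuntimeError branch because save() is then invoked from inside a running loop.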

merelcht · 2025-01-16

> If I understand correctly, asyncio.run creates a new event loop, so if there's already an event loop running (for example, in a Jupyter notebook), calling this will raise an error.

That is indeed correct. Maybe it's alright though to say async saving doesn't work in interactive envs?

> but I think this might set an API expectation that could be difficult to satisfy cleanly

@astrojuanlu can you elaborate on what you mean by this?
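
For reference, one possible alternative to declaring interactive environments unsupported is to detect a running loop and fall back to a helper thread. This is only a sketch under that assumption, not something proposed in this PR; run_blocking is a hypothetical helper name.

```python
import asyncio
import threading
from typing import Any, Coroutine, TypeVar

T = TypeVar("T")


def run_blocking(coro: Coroutine[Any, Any, T]) -> T:
    """Run a coroutine to completion from sync code, with or without a running loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run is safe.
        return asyncio.run(coro)

    # A loop is already running here (e.g. a notebook): use a fresh loop in a helper thread.
    outcome: dict[str, Any] = {}

    def worker() -> None:
        try:
            outcome["value"] = asyncio.run(coro)
        except BaseException as exc:  # re-raised in the calling thread below
            outcome["error"] = exc

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

The trade-off is that the calling thread, and therefore the already running loop, stays blocked until the helper thread finishes, so this avoids the error without making the save cooperative.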

merelcht · 2025-01-16T09:48:48Z

@puneeter I see all the tests have been modified to take the use_async argument, but is there a way to also check that the async functionality is working?
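
One way such a check could look is to count how many saves are in flight at the same time. The sketch below is an assumption about a possible test helper, not part of this PR; ConcurrencyProbe and save_all are hypothetical names, and asyncio.sleep stands in for real I/O.

```python
import asyncio
from typing import Iterable


class ConcurrencyProbe:
    """Hypothetical test helper that records how many saves overlap."""

    def __init__(self) -> None:
        self.in_flight = 0
        self.peak = 0

    async def save_one(self, partition_id: str) -> None:
        self.in_flight += 1
        self.peak = max(self.peak, self.in_flight)
        await asyncio.sleep(0.01)  # stand-in for I/O; yields to other tasks
        self.in_flight -= 1


async def save_all(
    probe: ConcurrencyProbe, partition_ids: Iterable[str], concurrent: bool
) -> None:
    if concurrent:
        # All saves are scheduled together, so they overlap while waiting.
        await asyncio.gather(*(probe.save_one(pid) for pid in partition_ids))
    else:
        # Each save finishes before the next one starts.
        for pid in partition_ids:
            await probe.save_one(pid)
```

A test can then assert that the peak exceeds 1 on the concurrent path and stays at 1 on the sequential path, which checks overlap directly instead of relying on wall-clock timing.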

Add async load and save methods

4aaf152

Signed-off-by: puneeter <puneet.saini@quantumblack.com>

puneeter changed the title ~~Add async load and save methods to PartitionedDatasets~~ feat(datasets): Add option to async load and save in PartitionedDatasets May 23, 2024

puneeter and others added 5 commits May 23, 2024 22:35

Merge branch 'main' into feature/async-partitioned-dataset

1e73033

Update lint

0943068

Signed-off-by: puneeter <puneet.saini@quantumblack.com>

Fix mypy

7427bce

Signed-off-by: puneeter <puneet.saini@quantumblack.com>

Update tests

5a23f44

Signed-off-by: puneeter <puneet.saini@quantumblack.com>

Update formatting

760fa88

Signed-off-by: puneeter <puneet.saini@quantumblack.com>

puneeter marked this pull request as ready for review May 23, 2024 18:20

Update RELEASE.md

d174779

Signed-off-by: puneeter <puneet.saini@quantumblack.com>

astrojuanlu reviewed May 28, 2024

kedro-datasets/kedro_datasets/partitions/partitioned_dataset.py (outdated)

merelcht mentioned this pull request Aug 19, 2024

Close/merge as many PRs as possible on kedro-plugins #809

Closed

16 tasks

puneeter added 4 commits October 18, 2024 13:12

Merge with latest main

a63e70b

Signed-off-by: puneeter <puneet.saini@quantumblack.com>

Revert load

2bfd17f

Signed-off-by: puneeter <puneet.saini@quantumblack.com>

Update docstring

14b5c5f

Signed-off-by: puneeter <puneet.saini@quantumblack.com>

Update release

4c513d4

Signed-off-by: puneeter <puneet.saini@quantumblack.com>

puneeter requested review from astrojuanlu and noklam and removed request for astrojuanlu October 18, 2024 11:06

astrojuanlu reviewed Jan 14, 2025
