Make IncrementalDataset's confirms "namespaced" #4039

Open
gtauzin opened this issue Jul 29, 2024 · 2 comments
Labels
Community Issue/PR opened by the open-source community

Comments


gtauzin commented Jul 29, 2024

Description

I have a namespace-based incremental dataset and wish to use the confirms attribute to trigger a CHECKPOINT update further down my pipeline. However, based on discussions on Slack, it seems that incremental datasets are not meant to be used within namespaces, and so confirms is not "namespaced" by design.

Following discussion with @noklam on Slack, it seems that my use case could justify having "namespaced" confirms.

Context

I have many devices that regularly record event files and push them to an S3 bucket. I would like to run a preprocessing pipeline that is different for each device and that would, for each of them:

  1. Load all new files as dataframes, preprocess them, concatenate the preprocessed recorded events, and save the result to another S3 bucket
  2. Load all preprocessed recorded files computed so far and concatenate them

Then, I use the concatenation of all preprocessed recorded events seen so far for data science purposes.

The way I achieve this with Kedro is:

  • For step 1, I use IncrementalDataset, and the concatenated dataframe is saved using a versioned ParquetDataset
  • For step 2, I use a PartitionedDataset that is able to find all preprocessed recorded events computed so far (with load_args withdirs and max_depth set accordingly)

Those steps are done for each device, so I use namespaces to reuse the same logic for all of them, varying only the S3 bucket path. I need the confirms to be at step 2 because only then can I consider the new files to have been processed.
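To illustrate why the confirmation has to wait until step 2, here is a minimal toy model of checkpointed incremental loading. It is not Kedro's implementation: the class ToyIncrementalStore, its methods, and the file names are all hypothetical, and the only assumption carried over is that a load returns partitions newer than the checkpoint and that a confirm advances the checkpoint.

```python
class ToyIncrementalStore:
    """Toy stand-in for an incremental dataset (hypothetical, not Kedro code)."""

    def __init__(self, partitions):
        self.partitions = dict(partitions)  # partition id -> payload
        self.checkpoint = None              # last confirmed partition id
        self._last_loaded = None            # newest id returned by the latest load

    def add(self, partition_id, payload):
        self.partitions[partition_id] = payload

    def load(self):
        """Return the partitions whose id sorts after the checkpoint."""
        new = {
            pid: data
            for pid, data in sorted(self.partitions.items())
            if self.checkpoint is None or pid > self.checkpoint
        }
        if new:
            self._last_loaded = max(new)
        return new

    def confirm(self):
        """Advance the checkpoint to the newest partition seen by load()."""
        if self._last_loaded is not None:
            self.checkpoint = self._last_loaded


store = ToyIncrementalStore({"2024-07-01.csv": "a", "2024-07-02.csv": "b"})

first = store.load()   # step 1 sees both files
again = store.load()   # nothing confirmed yet, so both files still count as new
store.confirm()        # step 2 succeeded: mark the files as processed
store.add("2024-07-03.csv", "c")
after = store.load()   # only the file that arrived after the checkpoint
```

If the confirm ran right after step 1 and step 2 then failed, the files would be marked as processed without ever reaching the concatenated output, which is the situation the deferred confirms is meant to avoid.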

Workaround

@noklam suggested putting the namespace in the argument, e.g. confirms=namespace.data, as a workaround, and I can confirm this worked.

gtauzin added the "Issue: Feature Request" label Jul 29, 2024

gtauzin commented Sep 10, 2024

I believe this is also hiding a bug. If the incremental dataset is namespaced and the confirms argument is not explicitly set as per the workaround, no checkpoint file is created. My guess is that when confirms is not provided, it is set to the incremental dataset name without the namespace, and a dataset with that name does not actually exist.
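The guess above can be sketched as a small name-resolution model. This is only an illustration of the hypothesis, not Kedro's actual code: build_node, run_confirms, the dictionary layout, and the names device_a and raw_events are all made up for the example.

```python
def apply_namespace(name, namespace):
    """Prefix a dataset name with its namespace."""
    return f"{namespace}.{name}"


def build_node(namespace, inputs, confirms=None):
    """Toy node: inputs get the namespace prefix, confirms is kept as given.

    Hypothesized behavior: when confirms is omitted, it falls back to the
    input names WITHOUT the namespace prefix.
    """
    return {
        "inputs": [apply_namespace(name, namespace) for name in inputs],
        "confirms": list(confirms) if confirms is not None else list(inputs),
    }


def run_confirms(node, catalog):
    """Return the catalog entries that would have their checkpoint written."""
    return [name for name in node["confirms"] if name in catalog]


# The catalog only knows the namespaced dataset name.
catalog = {"device_a.raw_events"}

default_node = build_node("device_a", ["raw_events"])
fixed_node = build_node(
    "device_a", ["raw_events"], confirms=["device_a.raw_events"]
)

missed = run_confirms(default_node, catalog)   # un-namespaced name is not found
written = run_confirms(fixed_node, catalog)    # explicit namespaced name is found
```

Under this model, the default case confirms nothing because raw_events is not in the catalog, while the workaround's explicit namespaced name matches the catalog entry.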

astrojuanlu added the "Community" label and removed the "Issue: Feature Request" label Dec 2, 2024
astrojuanlu (Member) commented

Thanks @gtauzin and sorry for the slow response. We will investigate the issue you mention first.
