Cannot apply the Prefect deployment boilerplate to a versioned dataset #1427

alexfurnica · 2022-04-08T13:58:46Z


Hi there!

I've been going through tutorials for Kedro, Prefect and Great Expectations to try and make a proof-of-concept of an end-to-end ML pipeline that includes data quality monitoring. At this stage, most of the spaceflights tutorial has worked (except the experiment tracking bug on Windows).

I'm currently attempting to run the pipeline as a Prefect flow using the boilerplate script provided in the Kedro docs. The pipeline registers just fine and it runs successfully the first time. The issue is that it fails during subsequent runs if I don't manually restart the agent. The reason it fails is that the save attribute of the Version object does not get updated in subsequent runs of the same pipeline so it throws a DataSetError:

The initial run was at 13:17 UTC, but this run was at 13:28 UTC as seen here (+2 hrs due to time difference):

Is this a known issue? Is there something I'm doing wrong? I've tried digging through the code to better understand but it is above my level of experience. Best I could find is that the tracking.JSONDataSet is not reinitialized with the correct date, or the cache is never cleared within the save() method.
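A minimal sketch of the suspected failure mode, using only the standard library rather than Kedro's real classes. Every name here (CachedVersionDataset, VersionCollisionError, release) is hypothetical and the timestamp format only mimics a version string. It shows how a save version that is resolved once and then cached makes a second save collide with the first, and how clearing the cache between runs avoids that:

```python
from datetime import datetime, timezone
from typing import Dict, Optional


class VersionCollisionError(Exception):
    """Stand-in for the DataSetError mentioned above (hypothetical name)."""


class CachedVersionDataset:
    """Toy versioned dataset that resolves its save version once and caches it."""

    def __init__(self) -> None:
        self._cached_save_version: Optional[str] = None
        # Maps version string -> data; stands in for versioned files on disk.
        self.saved: Dict[str, dict] = {}

    def _resolve_save_version(self, now: datetime) -> str:
        # The timestamp is computed on first use only; later calls reuse it.
        if self._cached_save_version is None:
            self._cached_save_version = now.strftime("%Y-%m-%dT%H.%M.%SZ")
        return self._cached_save_version

    def save(self, data: dict, now: datetime) -> str:
        version = self._resolve_save_version(now)
        if version in self.saved:
            raise VersionCollisionError(f"version {version} already exists")
        self.saved[version] = data
        return version

    def release(self) -> None:
        # Clearing the cache lets the next save pick up a fresh timestamp.
        self._cached_save_version = None


# Run times taken from the post: 13:17 UTC and 13:28 UTC.
first_run = datetime(2022, 4, 8, 13, 17, tzinfo=timezone.utc)
second_run = datetime(2022, 4, 8, 13, 28, tzinfo=timezone.utc)

dataset = CachedVersionDataset()  # lives as long as the agent process
dataset.save({"run": 1}, now=first_run)

try:
    dataset.save({"run": 2}, now=second_run)  # stale cached version is reused
    second_run_failed = False
except VersionCollisionError:
    second_run_failed = True

dataset.release()  # comparable to restarting the agent
fresh_version = dataset.save({"run": 2}, now=second_run)
```

In this toy model, restarting the agent corresponds to building a new dataset object, which would explain why a manual restart makes the next run succeed.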

Would really appreciate any suggestions!

Python version: 3.9.7
Kedro version: 0.17.7
Prefect version: 1.1.0

noklam · 2022-04-08T14:00:24Z

noklam
Apr 8, 2022
Collaborator

Potentially related to this issue
#1388


alexfurnica Apr 11, 2022
Author

Just gave the suggested code change a try and I get an error regarding the session ID:

Any suggestions for how to go about this?

alexfurnica Apr 11, 2022
Author

Ok, the above was fixed by switching to ShelveStore instead of SQLiteStore. The pipeline now crashes at a different section but I'm not sure if this is a mistake in another part of the code. Will try first to fix it myself and update here.
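For reference, a session-store swap like the one described above is normally made in the project's settings.py. The fragment below is a sketch assuming the Kedro 0.17.x settings template; the "./sessions" path is an illustrative choice, not something taken from this thread. A project that configured SQLiteStore for Kedro-Viz experiment tracking may lose that feature after the swap.

```python
# src/<package_name>/settings.py (sketch, assuming the Kedro 0.17.x template)
from kedro.framework.session.store import ShelveStore

# Store session data with ShelveStore instead of SQLiteStore.
SESSION_STORE_CLASS = ShelveStore

# Keyword arguments passed to the SESSION_STORE_CLASS constructor.
# The path below is an assumed location for the shelve files.
SESSION_STORE_ARGS = {"path": "./sessions"}
```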

Additionally: there is strange behavior now where every single node is treated as a separate pipeline by Prefect. After every task this gets printed:

Is it inherent to the fix where the session is re-used for each task that solved the original problem for this discussion?
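One plausible reading of that output, sketched with the standard library only: if each Prefect task hands a single node to a runner, the runner reports a completed one-node pipeline every time. The run_pipeline function below is a hypothetical stand-in for a runner, not Kedro's SequentialRunner, and the message wording is modelled on typical runner logs:

```python
from typing import Callable, Dict, List


def run_pipeline(nodes: Dict[str, Callable[[], None]], log: List[str]) -> None:
    """Toy runner: runs every node it is given, then reports completion."""
    total = len(nodes)
    for done, func in enumerate(nodes.values(), start=1):
        func()
        log.append(f"Completed {done} out of {total} tasks")
    log.append("Pipeline execution completed successfully.")


# Node names borrowed from the spaceflights tutorial; the bodies are no-ops.
nodes: Dict[str, Callable[[], None]] = {
    "preprocess_shuttles_node": lambda: None,
    "preprocess_companies_node": lambda: None,
    "create_model_input_table_node": lambda: None,
}

# One orchestrator task per node: each task runs its own one-node pipeline.
per_task_log: List[str] = []
for name, func in nodes.items():
    run_pipeline({name: func}, per_task_log)

# The whole pipeline in a single runner call, for comparison.
single_run_log: List[str] = []
run_pipeline(nodes, single_run_log)
```

Under this assumption the repeated messages are a side effect of running one node per task rather than a sign that the nodes ran more than once.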

alexfurnica Apr 11, 2022
Author

Coming back with an update: I cannot get the pipeline working (using the spaceflights tutorial code). Before, it raised a DataSetError saying it could not find the datasets (X_train, X_test etc.) in memory. After seeing that the proposed solution code stores them as parquet files instead, I switched to that just to test, and this also fails:

To double-check I've just cloned the suggested solution project and that runs fine, albeit with each node treated as its own pipeline as I mentioned previously. The only difference that I can see now from the shared solution is that my project is using different namespaces for the pipelines. This is exactly as described in the tutorial page for namespacing

I should mention that just doing a standard kedro run and skipping Prefect works fine, so it seems unlikely that the issue is outside the prefect_flow.py file.

avan-sh Apr 13, 2022

@alexfurnica, does the sample project work as is on your machine? Also, which version of Kedro are you using? If you're using 0.18.0, I'm not confident that it would work, given all the changes that were made to the framework.

The solution in the repo requires all the intermediate datasets to be persisted to disk, so cross-check that all the dataset entries are present in your catalog.
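The cross-check can be sketched as a small helper. This is an illustration using plain Python data, not a Kedro API: find_unpersisted is a hypothetical name, and the node inputs and outputs are typed in by hand following the spaceflights tutorial. It lists every dataset that one node produces and another consumes but that has no catalog entry, and so would only ever exist in memory:

```python
from typing import Iterable, List, Set, Tuple


def find_unpersisted(
    node_io: Iterable[Tuple[List[str], List[str]]],
    catalog_entries: Set[str],
) -> List[str]:
    """Return datasets passed between nodes that have no catalog entry."""
    produced: Set[str] = set()
    consumed: Set[str] = set()
    for inputs, outputs in node_io:
        consumed.update(inputs)
        produced.update(outputs)
    # Intermediate datasets are written by one node and read by another.
    intermediate = produced & consumed
    return sorted(intermediate - catalog_entries)


# (inputs, outputs) per node, written out by hand for this sketch.
node_io = [
    (["model_input_table"], ["X_train", "X_test", "y_train", "y_test"]),
    (["X_train", "y_train"], ["regressor"]),
    (["regressor", "X_test", "y_test"], []),
]

# Assumed catalog keys; a real check would read them from catalog.yml.
catalog_entries = {"model_input_table", "regressor"}

missing = find_unpersisted(node_io, catalog_entries)
print(missing)
```

If namespacing renames datasets in a project, the catalog keys would presumably need to match the namespaced names for a check like this to come back empty.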

I tried running the example again and this is the run log that was generated for one Prefect run.

14 April 2022,12:52:18 	agent	INFO	Submitted for execution: PID: 97578
14 April 2022,12:52:19 	prefect.CloudFlowRunner	INFO	Beginning Flow run for '__default__'
14 April 2022,12:52:19 	prefect.CloudTaskRunner	INFO	Task 'sf_prefect': Starting task run...
14 April 2022,12:52:19 	prefect.CloudTaskRunner	INFO	Task 'sf_prefect': Finished task run for task with final state: 'Success'
14 April 2022,12:52:20 	prefect.CloudTaskRunner	INFO	Task 'preprocess_shuttles_node': Starting task run...
14 April 2022,12:52:30 	prefect.CloudTaskRunner	INFO	Task 'preprocess_shuttles_node': Finished task run for task with final state: 'Success'
14 April 2022,12:52:30 	prefect.CloudTaskRunner	INFO	Task 'preprocess_companies_node': Starting task run...
14 April 2022,12:52:31 	prefect.CloudTaskRunner	INFO	Task 'preprocess_companies_node': Finished task run for task with final state: 'Success'
14 April 2022,12:52:31 	prefect.CloudTaskRunner	INFO	Task 'create_model_input_table_node': Starting task run...
14 April 2022,12:52:40 	prefect.CloudTaskRunner	INFO	Task 'create_model_input_table_node': Finished task run for task with final state: 'Success'
14 April 2022,12:52:40 	prefect.CloudTaskRunner	INFO	Task 'split_data_node': Starting task run...
14 April 2022,12:52:43 	prefect.CloudTaskRunner	INFO	Task 'split_data_node': Finished task run for task with final state: 'Success'
14 April 2022,12:52:43 	prefect.CloudTaskRunner	INFO	Task 'train_model_node': Starting task run...
14 April 2022,12:52:45 	prefect.CloudTaskRunner	INFO	Task 'train_model_node': Finished task run for task with final state: 'Success'
14 April 2022,12:52:45 	prefect.CloudTaskRunner	INFO	Task 'evaluate_model_node': Starting task run...
14 April 2022,12:52:46 	prefect.CloudTaskRunner	INFO	Task 'evaluate_model_node': Finished task run for task with final state: 'Success'
14 April 2022,12:52:46 	prefect.CloudFlowRunner	INFO	Flow run SUCCESS: all reference tasks succeeded

alexfurnica Apr 14, 2022
Author

@avan-sh thanks for getting back to me! As I've mentioned in my last post, the sample project works, albeit with a bug that treats every node as a separate run, causing multiple redundant prints:

14 April 2022,01:50:42 	agent	INFO	Submitted for execution: PID: 23264
14 April 2022,01:50:43 	prefect.CloudFlowRunner	INFO	Beginning Flow run for '__default__'
14 April 2022,01:50:54 	prefect.CloudTaskRunner	INFO	Task 'sf_prefect': Starting task run...
14 April 2022,01:50:56 	kedro.framework.session.session	WARNING	Unable to git describe C:\Users\310252617\Downloads\kedro-prefect
14 April 2022,01:50:59 	prefect.CloudTaskRunner	INFO	Task 'sf_prefect': Finished task run for task with final state: 'Success'
14 April 2022,01:51:01 	prefect.CloudTaskRunner	INFO	Task 'preprocess_shuttles_node': Starting task run...
14 April 2022,01:51:03 	kedro.framework.session.session	INFO	** Kedro project kedro-prefect
14 April 2022,01:51:15 	kedro.runner.sequential_runner	INFO	Pipeline execution completed successfully.
14 April 2022,01:51:15 	kedro.runner.sequential_runner	INFO	Completed 1 out of 1 tasks
14 April 2022,01:51:17 	prefect.CloudTaskRunner	INFO	Task 'preprocess_shuttles_node': Finished task run for task with final state: 'Success'
14 April 2022,01:51:19 	prefect.CloudTaskRunner	INFO	Task 'preprocess_companies_node': Starting task run...
14 April 2022,01:51:22 	kedro.framework.session.session	INFO	** Kedro project kedro-prefect
14 April 2022,01:51:22 	kedro.runner.sequential_runner	INFO	Completed 1 out of 1 tasks
14 April 2022,01:51:22 	kedro.runner.sequential_runner	INFO	Pipeline execution completed successfully.
14 April 2022,01:51:24 	prefect.CloudTaskRunner	INFO	Task 'preprocess_companies_node': Finished task run for task with final state: 'Success'
14 April 2022,01:51:26 	prefect.CloudTaskRunner	INFO	Task 'create_model_input_table_node': Starting task run...
14 April 2022,01:51:29 	kedro.framework.session.session	INFO	** Kedro project kedro-prefect
14 April 2022,01:51:40 	kedro.runner.sequential_runner	INFO	Completed 1 out of 1 tasks
14 April 2022,01:51:40 	kedro.runner.sequential_runner	INFO	Pipeline execution completed successfully.
14 April 2022,01:51:42 	prefect.CloudTaskRunner	INFO	Task 'create_model_input_table_node': Finished task run for task with final state: 'Success'
14 April 2022,01:51:44 	prefect.CloudTaskRunner	INFO	Task 'split_data_node': Starting task run...
14 April 2022,01:51:47 	kedro.framework.session.session	INFO	** Kedro project kedro-prefect
14 April 2022,01:51:51 	kedro.runner.sequential_runner	INFO	Completed 1 out of 1 tasks
14 April 2022,01:51:51 	kedro.runner.sequential_runner	INFO	Pipeline execution completed successfully.
14 April 2022,01:51:53 	prefect.CloudTaskRunner	INFO	Task 'split_data_node': Finished task run for task with final state: 'Success'
14 April 2022,01:51:55 	prefect.CloudTaskRunner	INFO	Task 'train_model_node': Starting task run...
14 April 2022,01:51:57 	kedro.framework.session.session	INFO	** Kedro project kedro-prefect
14 April 2022,01:51:58 	kedro.runner.sequential_runner	INFO	Completed 1 out of 1 tasks
14 April 2022,01:51:58 	kedro.runner.sequential_runner	INFO	Pipeline execution completed successfully.
14 April 2022,01:52:00 	prefect.CloudTaskRunner	INFO	Task 'train_model_node': Finished task run for task with final state: 'Success'
14 April 2022,01:52:02 	prefect.CloudTaskRunner	INFO	Task 'evaluate_model_node': Starting task run...
14 April 2022,01:52:05 	kedro.framework.session.session	INFO	** Kedro project kedro-prefect
14 April 2022,01:52:05 	kedro.runner.sequential_runner	INFO	Completed 1 out of 1 tasks
14 April 2022,01:52:05 	kedro.runner.sequential_runner	INFO	Pipeline execution completed successfully.
14 April 2022,01:52:07 	prefect.CloudTaskRunner	INFO	Task 'evaluate_model_node': Finished task run for task with final state: 'Success'
14 April 2022,01:52:09 	prefect.CloudFlowRunner	INFO	Flow run SUCCESS: all reference tasks succeeded

Because my deadlines are a little tight, I had to drop Kedro and focus on getting Prefect + Great Expectations to work. That being said, I'll try to adapt this sample project with modular, namespaced pipelines to see if that's the issue. Will write back here if I've got an update on that 👍
