Allow returning modified dataset in after_dataset_loaded and before_dataset_saved hook
#4306
Replies: 8 comments 1 reply
-
Hi @mjspier, thanks for creating this issue. Just to understand this a bit better, couldn't you just create another node to do validation or conversion? From my point of view, it seems that modifying data through a hook instead would make it harder to track what's happened and reproduce the behaviour.
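For illustration, the extra-node approach suggested here could look roughly like the sketch below. The function name `convert_types`, the `SCHEMA` mapping and the row format (a list of dicts) are assumptions made for this example; none of them come from Kedro or from the original issue.

```python
from typing import Any, Callable

Row = dict[str, Any]

# Hypothetical schema: column name -> callable that casts a raw value.
SCHEMA: dict[str, Callable[[Any], Any]] = {"id": int, "price": float}


def convert_types(rows: list[Row]) -> list[Row]:
    """Return new rows with every column listed in SCHEMA cast to its target type.

    Columns without a schema entry are passed through unchanged. The input
    rows are not mutated, so the conversion stays an explicit, traceable step.
    """
    return [
        {key: SCHEMA[key](value) if key in SCHEMA else value for key, value in row.items()}
        for row in rows
    ]


raw = [{"id": "1", "price": "9.50", "name": "widget"}]
converted = convert_types(raw)
```

In a Kedro project, a function like this would typically be wired into a pipeline with `node(convert_types, inputs=..., outputs=...)`, so the conversion shows up as its own step in the pipeline.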
-
Hi @merelcht, thanks for the reply.
And the nodes which are loading the data would receive the dataset converted according to the schema types. I think for this case it would be nice to have the ability to return the converted dataset directly from the hook, without the need to create a node in a pipeline.
-
@mjspier have you seen https://github.com/Galileo-Galilei/kedro-pandera by any chance? Also tagging @noklam who might have some thoughts on this.
-
It looks like @mjspier is already using
Can you give an example of this?
I am not sure if this statement is true. Here is how the hooks are called:
    hook_manager.hook.before_dataset_loaded(dataset_name=dataset_name, node=node)
    return_ds = catalog.load(dataset_name)
    hook_manager.hook.after_dataset_loaded(
        dataset_name=dataset_name, data=return_ds, node=node
    )
It doesn't return any data, so how would you modify data and inject this into the pipeline? (This is something Kedro tries very hard to avoid; immutability is important.)
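The point about the call sequence can be sketched without Kedro at all. The snippet below is a simplified stand-in for the load step quoted above, not Kedro's actual runner code; `ConvertingHook`, `load_input` and the dict-based catalog are hypothetical names used only to show that a value returned from `after_dataset_loaded` is never picked up by the caller.

```python
from typing import Any


class ConvertingHook:
    """Hypothetical hook that tries to hand back converted data."""

    def after_dataset_loaded(self, dataset_name: str, data: Any, node: Any) -> Any:
        # Builds a converted copy and returns it, but the caller below
        # never looks at the return value.
        return [int(value) for value in data]


def load_input(catalog: dict[str, Any], hook: ConvertingHook, dataset_name: str, node: Any) -> Any:
    """Stand-in for the load step: load, notify the hook, return what was loaded."""
    return_ds = catalog[dataset_name]
    # The hook's result is discarded, mirroring the quoted call order.
    hook.after_dataset_loaded(dataset_name=dataset_name, data=return_ds, node=node)
    return return_ds


catalog = {"numbers": ["1", "2", "3"]}
loaded = load_input(catalog, ConvertingHook(), "numbers", node=None)
```

Here `loaded` is still the original list of strings: the converted copy created inside the hook has nowhere to go.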
-
@merelcht Yes indeed, I was actually working on a PR for the
@noklam I fully understand your point. I will think of another way to incorporate data type conversion. Somehow it would be nice if that feature could be enabled with a plugin and a configuration in the catalog, without the need to create nodes in the pipeline. If you have any ideas, let me know.
About the possibility of changing the input data in the hook, it is possible with the
There the hook_response is later used to update the input data.
-
Hi @mjspier do you need more help with this, or is it okay if I close the issue?
-
Should this be a Discussion?
-
Hmm yes actually it probably should! I'll move it.
-
Description
It would be nice if we could modify the data in the after_dataset_loaded and before_dataset_saved hooks and return it, so that the modified data is what arrives in the node or gets saved.
This is already possible with the before_node_run hook, where we can return a dict with dataset name and modified data. (It is not possible to modify the output data in the after_node_run hook.)
Context
An example use would be to validate and convert the data according to a schema in the hook, with the converted dataframe then propagated to the node step or to the save step.
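As a rough sketch of the before_node_run route mentioned in the description, a hook could return a dict of converted inputs. The class name `SchemaConversionHooks`, the `SCHEMAS` mapping and the list-of-dicts row format are assumptions for this example, and the import fallback exists only so the snippet runs where Kedro is not installed.

```python
from typing import Any, Optional

try:
    from kedro.framework.hooks import hook_impl
except ImportError:  # Fallback so the sketch runs without Kedro installed.
    def hook_impl(func):
        return func

# Hypothetical mapping: dataset name -> {column name: cast function}.
SCHEMAS = {"orders": {"id": int, "price": float}}


class SchemaConversionHooks:
    @hook_impl
    def before_node_run(self, inputs: dict[str, Any]) -> Optional[dict[str, Any]]:
        """Return converted replacements for every input that has a schema.

        Inputs without a schema entry are left out of the returned dict,
        so they are not overridden.
        """
        converted: dict[str, Any] = {}
        for name, rows in inputs.items():
            schema = SCHEMAS.get(name)
            if schema is None:
                continue
            converted[name] = [
                {col: schema[col](val) if col in schema else val for col, val in row.items()}
                for row in rows
            ]
        return converted or None


hooks = SchemaConversionHooks()
overrides = hooks.before_node_run(
    inputs={"orders": [{"id": "7", "price": "2.5"}], "params:threshold": 1}
)
```

Calling the method directly, as above, only demonstrates the shape of the return value. In a real project the hook instance would be registered through `HOOKS` in `settings.py`, and the dict keyed by dataset name is what the description refers to as being returned from before_node_run.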
Possible Implementation
Possible Alternatives