Enhanced parallel experimentation and required changes in kedro code #4251
Replies: 25 comments
-
@Vincent-Liagre-QB To what extent would this be covered by #1303? Also, just to clarify, is your goal to be able to run all experiments with a single command, to only run one experiment at a time, or to do either? I think I understand your requirement as running one experiment at a time, but just wanted to make sure. Finally, since you're from QB, you can also consider an internal project called Multi-Runner, but I 100% think these issues should be resolved in the open-source Kedro ecosystem in the long run!
-
To your questions: …
-
@Vincent-Liagre-QB Was just taking a closer look at this, including the code. To confirm my understanding of the requirements: …

I think modifying the filepath based on some param or other variable isn't too bad with Hooks; a sketch follows below. Storing config for each experiment requires something extra, if not using envs (and I get your reservations about using envs).
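For the filepath part, a minimal sketch of what such a hook could look like (hedged: `experiment_name` is an assumed runtime parameter, the `experiment_` dataset-name prefix is invented for illustration, and `_datasets`/`_filepath` are private Kedro attributes that may differ across versions):

```python
from kedro.framework.hooks import hook_impl


class ExperimentPathHooks:
    """Redirect matching datasets into a per-experiment output folder (sketch)."""

    def __init__(self):
        self._experiment_name = "default"

    @hook_impl
    def after_context_created(self, context):
        # e.g. kedro run --params "experiment_name=experiment_2"
        self._experiment_name = context.params.get("experiment_name", "default")

    @hook_impl
    def after_catalog_created(self, catalog):
        # Relies on private attributes -- a sketch, not a stable API.
        for name, dataset in getattr(catalog, "_datasets", {}).items():
            filepath = getattr(dataset, "_filepath", None)
            if filepath is not None and name.startswith("experiment_"):
                # prefix the path with the experiment name (illustrative)
                dataset._filepath = filepath.parent / self._experiment_name / filepath.name
```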
-
@deepyaman to your points:

Regarding hooks: in my understanding, the limitation is that once you have implemented them, you cannot easily choose whether to apply them or not, i.e. hooks are not programmatically manageable. Also, I prefer to think in terms of (1) feature needs and (2) possible code implementations (which I called "requirements"), and to think about them separately. So, to summarise, the feature needs: …

Requirements for a possible implementation (note that in this case there is a 1-to-1 mapping with the feature needs, but that's not always the case): …

(See the 1st message for more details.)
-
Also, for the sake of enriching the discussion, I was told to look into this: https://kedro-mlflow.readthedocs.io/en/stable/index.html ; not sure it covers the need, but worth looking into; will do.
-
My inclination is to recommend that you return them explicitly from a node. I think it lends itself well to the idea that pipelines have an interface of inputs and outputs.

This is doable as long as you design the hooks accordingly (e.g. parse flags that determine when and where to apply the hook logic); see the sketch below.
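A sketch of that flag-parsing idea (the `apply_experiment_hooks` flag name is invented for illustration, not an established convention):

```python
from kedro.framework.hooks import hook_impl


class ConditionalHooks:
    @hook_impl
    def after_context_created(self, context):
        # Opt in per run, e.g. kedro run --params "apply_experiment_hooks=true"
        if not context.params.get("apply_experiment_hooks", False):
            return  # hook stays registered, but is inert for this run
        print("Experiment hook logic enabled for this run")
```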
-
@Vincent-Liagre-QB I'll first try to summarize the requirements to confirm my understanding is right: …

Assuming my understanding is correct, I feel like hooks, as suggested by @deepyaman, might be the right way to go. As the only difference between experiments is the inputs and outputs, and not the pipeline being run, you can choose which files to load at run time using some pattern recognition. This might be TemplatedConfig in the latest versions, though.

On integration with MLflow: it fits perfectly for running different experiments. Ideally, all of the parameters of the experiment (especially things that differentiate it) should be logged to the experiment, and your models can be registered in MLflow. I think the kedro-mlflow plugin might have this capability.

Edit: A workflow could be this: …
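On the MLflow side, a minimal sketch of the logging step (raw MLflow API for illustration; kedro-mlflow can wire this up automatically, and the parameter names are just spaceflights-style examples):

```python
import mlflow

mlflow.set_experiment("test_experiment")
with mlflow.start_run():
    # log whatever differentiates this experiment from the others
    mlflow.log_params({"test_size": 0.2, "random_state": 3})
    mlflow.log_metric("r2_score", 0.87)
```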
-
@deepyaman, on nodes: my frustration is that it would prevent using the full capabilities of pipelines. @avan-sh: yes, that's exactly what I have in mind. @deepyaman @avan-sh, on hooks: I'll try to look more into this, but I am a bit skeptical about the possibility of programmatically managing hooks; if you have examples, I am curious to look into them. On integration with MLflow: I was just sharing this as it had been suggested it might cover my need; but that's not the main topic :)
-
Re-opening this now that I have a bit of time to look into it again: @avan-sh the workflow you shared looks promising to me; the only thing I have difficulty understanding is how to make sure to use the version of the params corresponding to the specified … EDIT: my previous implementation of … I can access the …

Implem: in hooks.py:

```python
from kedro.framework.hooks import hook_impl


class ExperimentRunHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        print("Inside ExperimentRunHooks")
        # Trying to modify the dict of params
        context.params["test_hook_param"] = 5


class VerificationHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        print("Inside hook: VerificationHook")
        print(context.params)
```

In settings.py:

```python
from pathlib import Path

from kedro_tutorial.hooks import ExperimentRunHooks, VerificationHooks
from kedro_viz.integrations.kedro.sqlite_store import SQLiteStore

SESSION_STORE_CLASS = SQLiteStore
SESSION_STORE_ARGS = {"path": str(Path(__file__).parents[2] / "data")}
HOOKS = (VerificationHooks(), ExperimentRunHooks())  # LIFO order
```
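A caveat on the ExperimentRunHooks snippet above: in recent Kedro versions, `context.params` is a computed property (it rebuilds the dict from config plus extra params on access), so the mutation shown may not actually reach the run. Passing the value in at session creation is the supported route; a sketch assuming a recent KedroSession API:

```python
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

bootstrap_project(Path.cwd())
# roughly equivalent to: kedro run --params "test_hook_param=5"
with KedroSession.create(extra_params={"test_hook_param": 5}) as session:
    session.run()
```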
-
EDIT: my previous implementation of …
-
Also, as pointed out by @avan-sh, we need a hook to inject the extra param:

```python
from typing import Any, Dict, Iterable

from kedro.config import ConfigLoader, TemplatedConfigLoader
from kedro.framework.hooks import hook_impl


class ProjectHooks:
    @hook_impl
    def register_config_loader(
        self, conf_paths: Iterable[str], env: str, extra_params: Dict[str, Any]
    ) -> ConfigLoader:
        globals_dict = {}
        if extra_params:
            globals_dict = {"experiment_name": extra_params["experiment_name"]}
        return TemplatedConfigLoader(
            conf_paths,
            globals_pattern="*globals.yml",
            globals_dict=globals_dict,
        )
```

but I am not sure this …
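For what it's worth, the `register_config_loader` hook was removed in later Kedro versions; there, config loading is declared in settings.py instead. A rough 0.18-era equivalent sketch (the static `experiment_name` default is an assumption; per-run values would come from `kedro run --params` rather than globals):

```python
# settings.py -- sketch for Kedro 0.18.x, where register_config_loader is gone.
# CONFIG_LOADER_ARGS are passed to the config loader's constructor.
from kedro.config import TemplatedConfigLoader

CONFIG_LOADER_CLASS = TemplatedConfigLoader
CONFIG_LOADER_ARGS = {
    "globals_pattern": "*globals.yml",
    # Static default only; per-experiment values are better injected at run
    # time, e.g. kedro run --params "experiment_name=test_experiment".
    "globals_dict": {"experiment_name": "default_experiment"},
}
```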
-
Hello! Has the suggestion of @Vincent-Liagre-QB been taken into account? It would greatly help me if so :)
-
@cosasha, …
-
Similar request from @ofir-insait from a month ago. As stated by @Vincent-Liagre-QB in option (1) at the beginning of the thread, …
-
Similar request from @andrko1 today: …
-
A similar request from @quantumtrope: #2958 (and also https://linen-slack.kedro.org/t/14164549/i-have-a-question-about-using-kedro-in-a-non-ml-setting-spec#a956426e-30d3-4a01-98b5-a582e3082da6)

Which is similar to this one from @ChristopherRabotin a while back: https://linen-slack.kedro.org/t/14162145/hi-there-what-s-the-best-way-to-run-a-monte-carlo-simulation#48ef7630-854f-4e98-b698-3534f80a05b7

And this one from @bpmeek even earlier: https://linen-slack.kedro.org/t/9703489/hey-everyone-i-m-looking-for-the-kedro-way-of-doing-a-monte-#80277f3a-95a8-4578-ae24-f101dc0244f9
-
To all people subscribed to this issue, notice that @marrrcin has published an interesting approach using …

Please give it a read, https://getindata.com/blog/kedro-dynamic-pipelines/, and let us know what you think.
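As I read it, the core trick in that post is namespaced modular pipelines combined with dataset factories; a stripped-down sketch of the namespacing half (node and experiment names here are placeholders, not the blog's actual code):

```python
from kedro.pipeline import Pipeline, node, pipeline


def train(features):  # placeholder node function
    return {"trained_on": features}


base = pipeline([node(train, inputs="features", outputs="model", name="train")])

# One namespaced copy per experiment; entries like "experiment_1.model" can
# then be matched by a single dataset factory pattern in the catalog.
experiments = ["experiment_1", "experiment_2"]
full_pipeline = sum(
    (pipeline(base, namespace=name) for name in experiments), Pipeline([])
)
```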
-
```python
search_space = {
    "a": tune.grid_search([0.001, 0.01, 0.1, 1.0]),
    "b": tune.choice([1, 2, 3]),
}
tuner = tune.Tuner(objective, param_space=search_space)
```

Originally posted by @astrojuanlu in #2627 (comment)

No, but it does provide a budget version of it; this is what I'm saying about the lack of integration with dedicated "sweepers" in this comment.

Originally posted by @datajoely in #2627 (comment)

Let's continue the conversation about "parameter sweeping"/experimentation here.
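For a "budget version" without a dedicated sweeper, one could loop over a grid and launch one Kedro run per combination (a sketch; the parameter names mirror the Ray example above, and it assumes a recent KedroSession API):

```python
from itertools import product
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

grid = {"a": [0.001, 0.01, 0.1, 1.0], "b": [1, 2, 3]}

bootstrap_project(Path.cwd())
for a, b in product(grid["a"], grid["b"]):
    # one session per combination -- Kedro sessions are single-use
    with KedroSession.create(extra_params={"a": a, "b": b}) as session:
        session.run()
```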
-
@astrojuanlu thanks for sharing this, and for the overall work on connecting everything going on around this feature request. The solution you are sharing seems very promising, although also a bit complex. I'll try to take a deeper look into it asap.
-
Originally posted by @datajoely in #2627 (comment): …

Originally posted by @datajoely in #2627 (comment): …
-
A user that uses different environments: https://linen-slack.kedro.org/t/16041288/question-on-environments-and-credentials-we-are-currently-us#49927057-9256-455d-9213-94b898fcb699

Essentially option (1) of the original @Vincent-Liagre-QB ticket. In my opinion this is an abuse of environments, but it's what users want: add a new config file, change a CLI flag, and done.
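Concretely, that pattern presumably amounts to dropping per-experiment parameter overrides into `conf/experiment_2/` and running `kedro run --env=experiment_2`; no code changes needed.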
-
I am that user! Indeed, we have repurposed envs to act as parameter groups. It works fairly well for us, and it's been easy to train new team members on how we use them. Would love a kedro-native solution though!

PS: For most functionality that is not out of the box for kedro, the community tends to recommend hooks. My experience is that large projects can end up with dozens of hooks, with each team using different ones, making onboarding difficult. Also, logic that is applied there might appear as side effects to someone not familiar with them, so my preference is to use them sparingly. Just one person's opinion :)
-
"Live replay" of a user attempting the current approach #3308 useful for future iterations |
-
When showing dataset factories to some users internally: …
-
Description & context
When working outside kedro, I often have several parallel configs for the same script (in kedro terms, "pipeline"), e.g. different model configs for a regression model, or specific start/end dates and exclusion patterns for an analysis. The tree could look like: …

And within model_1.py, I'd usually do something like: …

So that I can then easily run different experiments independently with, for instance:

python src/model_1.py --conf=experiment_2

And I'd usually organize results like this (but that's personal; the point is to make it easily configurable): …
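Since the tree and script blocks above were lost in rendering, here is a hypothetical reconstruction of the pattern being described (file layout, config format, and function names are all guesses):

```python
# model_1.py -- hypothetical reconstruction, not the author's original script
import argparse
import json
from pathlib import Path


def run_model(params: dict) -> dict:
    # placeholder for the real modelling code
    return {"params_used": params}


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--conf", default="experiment_1")
    args = parser.parse_args()

    # one config file per experiment, e.g. conf/experiment_2.json
    with open(f"conf/{args.conf}.json") as f:
        params = json.load(f)

    results = run_model(params)

    # one results folder per experiment keeps runs independent
    out_dir = Path("results") / args.conf
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "output.json").write_text(json.dumps(results))


if __name__ == "__main__":
    main()
```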
Note that the confs are kept separate from model_1.py so that experiments can be run independently and the workflow of adding a conf is seamless.

Now I am wondering: how can I easily have a similar workflow in kedro? What I have thought about so far: …
Before deep-diving into 5: do you have any other ideas? Am I missing something (might very well be the case since I am quite a beginner here)? Am I too biased by my outside-kedro workflow, which might not be that straightforward after all?
Possible Implementation
Using the example case of spaceflights' data_science pipeline, simply run:

python src/kedro_tutorial/pipelines/data_science/experiment_run.py --experiment-name="test_experiment"

where src/kedro_tutorial/pipelines/data_science/experiment_run.py is as below. (Remarks and required changes follow.)
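The script itself did not survive rendering; based on the surrounding discussion (an `experiment_name` extra param feeding a config-loader global), a sketch of what it plausibly looked like:

```python
# experiment_run.py -- hypothetical sketch; assumes `experiment_name` is
# consumed elsewhere (e.g. as a TemplatedConfigLoader global) to select
# per-experiment config and output filepaths.
import argparse
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--experiment-name", required=True)
    args = parser.parse_args()

    bootstrap_project(Path.cwd())
    with KedroSession.create(
        extra_params={"experiment_name": args.experiment_name}
    ) as session:
        session.run(pipeline_name="data_science")


if __name__ == "__main__":
    main()
```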
Remarks: …

Required changes in kedro code:

- kedro/kedro/framework/session/session.py, line 350 in 8f4b81a
- Include unregistered_ds in AbstractRunner.run (vs. only free_outputs): kedro/kedro/runner/runner.py, line 91 in 8f4b81a
Possible Alternatives
See points 1/2/3/4 above