-
How would you imagine this working?
-
Feature branch: @datajoely Don't focus on the API. It's hacky, and there are many scattered `Pipeline` calls in the current process, so I have to patch them everywhere. Comment or uncomment this line to play with it. There are many ways to build a nicer API, for example:
It's not too important to decide this now; it's more of a PoC to prove this is possible. A benefit of this approach is that Kedro-Viz won't see it at all (I think). We can also assume it always follows the order of declaration, or tries to stick with it as much as possible. Those are just design decisions, so I will delay that discussion.
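To make "stick with the order of declaration as much as possible" concrete, here is a minimal sketch in plain Python rather than Kedro code. It uses Kahn's algorithm with a heap so that, among the nodes that are ready to run, the earliest-declared one always goes first. The function name and the shape of its arguments are assumptions made for illustration, and it assumes every dependency also appears in the declared list.

```python
import heapq


def declaration_order_toposort(declared, dependencies):
    """Topologically sort nodes, preferring the earliest-declared ready node.

    declared: node names in the order the user wrote them.
    dependencies: maps a node to the set of nodes that must run before it.
    """
    rank = {name: index for index, name in enumerate(declared)}
    remaining = {name: set(dependencies.get(name, ())) for name in declared}
    children = {name: [] for name in declared}
    for child, parents in remaining.items():
        for parent in parents:
            children[parent].append(child)

    # Heap of declaration ranks for nodes whose prerequisites are all done.
    ready = [rank[name] for name in declared if not remaining[name]]
    heapq.heapify(ready)
    order = []
    while ready:
        name = declared[heapq.heappop(ready)]
        order.append(name)
        for child in children[name]:
            remaining[child].discard(name)
            if not remaining[child]:
                heapq.heappush(ready, rank[child])
    if len(order) != len(declared):
        raise ValueError("dependencies contain a cycle")
    return order


deps = {"B": {"A"}, "C": {"B"}, "D": {"B"}}
print(declaration_order_toposort(["A", "B", "D", "C"], deps))  # ['A', 'B', 'D', 'C']
print(declaration_order_toposort(["A", "B", "C", "D"], deps))  # ['A', 'B', 'C', 'D']
```

When the declaration order contradicts the data dependencies, the dependencies win: declaring `["C", "A", "B"]` with C depending on B and B on A still yields A, B, C.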
-
Okay, so I understand what you're trying to do, but I have some questions.
-
If you run with `SequentialRunner` now, it's already deterministic (#1604). Not sure if this is what you are talking about.
Which recommendation are you referring to? There are many mentioned.
-
@datajoely 2. is one of the things that annoyed me before I joined, so it was the first thing I fixed :P (it was also requested by a former colleague). `CachedDataset` is the other one that I still haven't fixed.
-
We'll get there 🚀
-
Commenting as recommended by Nok. About the Kedro project and the goal: I have 9 pipelines, and I want to run them in an explicitly specified order, with the nodes in each pipeline also executed in an explicitly specified order. An explicit order of execution would be super helpful because I want tight control over the default execution order: some nodes and pipelines take longer to execute than others, and I want to push the time- and memory-intensive pipelines to the end.
-
So far I see two main themes around the need:
-
Hi @noklam, what's the reason for specifying the dependencies at the pipeline level? I would propose organizing it as follows:
As far as I know, the node topology is constructed by examining the inputs and outputs of each node. Since we're already working at this granularity, I would propose exposing the dependencies at the same level.
-
Any update on this? I noticed the corresponding issue was closed.
-
Description
Support a custom / partial node execution order where the user wants one.
Background
Kedro offers 3 runners out of the box (`SequentialRunner`, `ParallelRunner` and `ThreadRunner`). There is also `SoftFailRunner`, which I have implemented and which can be installed from https://pypi.org/project/kedro-softfail-runner/.
In the past we have focused on consistent support across runners and have been reluctant to break feature parity between them. In fact, the runners have been mostly unchanged for years. "Improve resume pipeline suggestion for SequentialRunner" introduced a concept similar to "Change Data Capture" (CDC), which fixed the broken suggestion and started to take persisted data and the closest checkpoint into account when recovering a failed pipeline. This is the most obvious gap in feature parity among runners, as it only supports `SequentialRunner` due to the non-deterministic nature of parallel computing.
Context
TBD
`SequentialRunner` is the most important one. Kedro doesn't play an important role in helping users execute code in parallel. This problem is usually solved by third-party libraries; for example, Polars supports multi-core computing out of the box.
This proposal will only support `SequentialRunner`, which I am increasingly comfortable with; details are discussed in #3713.
Design
Toposort gives ONE feasible solution, while there can be multiple valid ones. In some cases:
In terms of computation, both pipelines will produce identical results if C and D don't depend on each other. However, for business logic or just ease of understanding, some orderings may be preferred. (Think about a large pipeline like https://demo.kedro.org/?pipeline_id=__default__: we usually perceive the execution order as top-to-bottom, but this is not necessarily how the nodes are executed.)
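To illustrate why a single toposort result is only one of several valid answers, the sketch below brute-forces every valid execution order for a small hypothetical pipeline in which A feeds B, and B feeds both C and D. The node names and graph shape are assumptions chosen to match the C and D case described above; this is not how Kedro computes its order.

```python
from itertools import permutations

# Hypothetical pipeline: A feeds B, and B feeds both C and D.
# Each key maps a node to the set of nodes that must run before it.
dependencies = {"A": set(), "B": {"A"}, "C": {"B"}, "D": {"B"}}


def is_valid_order(order, dependencies):
    """Return True if every node appears after all of its prerequisites."""
    position = {name: index for index, name in enumerate(order)}
    return all(
        position[parent] < position[child]
        for child, parents in dependencies.items()
        for parent in parents
    )


valid_orders = [
    order for order in permutations(dependencies) if is_valid_order(order, dependencies)
]
for order in valid_orders:
    print(" -> ".join(order))
# A -> B -> C -> D
# A -> B -> D -> C
```

Both orders satisfy the data dependencies, so a toposort is free to return either one. Brute force is only practical for tiny graphs; it is used here purely to make the ambiguity visible.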
Requirements:
Nice to have:
Feature:
The next step is providing a user-friendly API, as this is likely still too low-level for the end user.
High level Proposal
Add new constraints via "node_dependencies", in addition to the existing inputs/outputs pairs.
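One way to picture this is to treat the extra constraints as additional edges merged into the graph that the inputs/outputs already imply, and then toposort the result. The sketch below uses Python's standard `graphlib` module; the shape of the `node_dependencies` mapping and the `merge_dependencies` helper are assumptions for illustration, not an existing Kedro API.

```python
from graphlib import CycleError, TopologicalSorter

# Dependencies already implied by dataset inputs/outputs (hypothetical pipeline).
data_dependencies = {"A": set(), "B": {"A"}, "C": {"B"}, "D": {"B"}}

# Extra, user-declared constraint: run D before C even though no dataset links them.
# The name and shape of this mapping are assumptions, not an existing Kedro API.
node_dependencies = {"C": {"D"}}


def merge_dependencies(data_dependencies, node_dependencies):
    """Combine data-driven edges with user-declared ordering edges."""
    merged = {name: set(parents) for name, parents in data_dependencies.items()}
    for name, parents in node_dependencies.items():
        merged.setdefault(name, set()).update(parents)
    return merged


merged = merge_dependencies(data_dependencies, node_dependencies)
order = list(TopologicalSorter(merged).static_order())
print(order)  # ['A', 'B', 'D', 'C']

# A constraint that contradicts the data flow shows up as a cycle.
try:
    list(TopologicalSorter(merge_dependencies(data_dependencies, {"A": {"C"}})).static_order())
except CycleError:
    print("contradictory constraints form a cycle")
```

Because the extra edges never reach the dataset catalog, this kind of constraint would not need any dummy datasets.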
Possible Implementation
Can't think of anything, thus dummy outputs have been the workaround for years.
Possible Alternatives
The current workaround involves dummy inputs and outputs, which becomes tedious quickly.
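For readers who have not seen the workaround, here is a minimal sketch of it using a plain-Python stand-in for node declarations rather than Kedro's own API. The order is derived only from each node's inputs and outputs, so forcing D to run before C means inventing a dataset (`d_done`, a hypothetical name) that D emits and C consumes without using it.

```python
from graphlib import TopologicalSorter


def execution_order(nodes):
    """Derive a run order purely from each node's declared inputs and outputs."""
    producer = {out: name for name, (_, outputs) in nodes.items() for out in outputs}
    graph = {
        name: {producer[inp] for inp in inputs if inp in producer}
        for name, (inputs, _) in nodes.items()
    }
    return list(TopologicalSorter(graph).static_order())


# name: (inputs, outputs). C and D are independent, so either may run first.
nodes = {
    "A": (["raw"], ["a_out"]),
    "B": (["a_out"], ["b_out"]),
    "C": (["b_out"], ["c_out"]),
    "D": (["b_out"], ["d_out"]),
}

# Workaround: D emits a dummy dataset that C consumes but never uses.
nodes_with_dummy = {
    **nodes,
    "C": (["b_out", "d_done"], ["c_out"]),
    "D": (["b_out"], ["d_out", "d_done"]),
}

print(execution_order(nodes_with_dummy))  # ['A', 'B', 'D', 'C']
```

In a real project, the functions behind C and D would also have to accept and return the extra value, and every additional ordering constraint would need another dummy dataset, which is where the tedium comes from.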