Proposal for Partial/Custom node ordering for Sequential Runner to avoid dummy inputs #3758

noklam · 2024-03-15T11:37:37Z

noklam
Mar 15, 2024
Collaborator

Description

Support a custom or partial node execution order where the user wants one.

Background

Kedro offers three runners out of the box (SequentialRunner, ParallelRunner and ThreadRunner). There is also a SoftFailRunner, which I implemented and which can be installed from https://pypi.org/project/kedro-softfail-runner/.

In the past we have focused on keeping support consistent across runners and have been reluctant to introduce features that only some runners support. In fact, the runners have been mostly unchanged for years. "Improve resume pipeline suggestion for SequentialRunner" introduced a concept similar to "Change Data Capture" (CDC): it fixed the broken suggestion and started to take into account persisted data and the closest checkpoint from which a failed pipeline can be recovered. This is the most obvious difference in feature support among runners, as it only supports SequentialRunner due to the non-deterministic nature of parallel computing.

Improve resume suggestions for SequentialRunner #3026 attempts to fix this feature by considering "parameters" as persisted data.

Context

TBD

Rethink how Kedro can play a role in multiprocessing / performance boost #3713: I have started to think more about runners recently, and my feeling is that SequentialRunner is the most important one. Kedro doesn't play an important role in helping users execute code in parallel. That problem is usually solved by third-party libraries; for example, polars supports multi-core computing out of the box.

This proposal will only support SequentialRunner, which I am increasingly comfortable with. Details are discussed in #3713.

Design

Toposort gives ONE feasible solution, while there can be multiple valid solutions. For example:

A -> B -> C -> D
A -> B -> D -> C
In terms of computation, both pipelines produce identical results if C and D don't depend on each other. However, for business logic or simply for ease of understanding, one ordering may be preferred. (Think about a large pipeline like https://demo.kedro.org/?pipeline_id=__default__: we usually perceive the execution order as top-to-bottom, but this is not necessarily how the nodes are executed.)
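As a minimal sketch of the point above, using only Python's standard library rather than Kedro's own resolver: the dependency map `deps` and the helper `is_valid_order` are hypothetical names for illustration. Both orderings satisfy the same constraints, while a sorter returns just one of them.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: node name -> names of nodes it depends on.
# C and D both depend on B, but not on each other.
deps = {"A": set(), "B": {"A"}, "C": {"B"}, "D": {"B"}}


def is_valid_order(order, deps):
    """Return True if `order` lists every node once, after all its dependencies."""
    if len(order) != len(deps) or set(order) != set(deps):
        return False
    position = {name: i for i, name in enumerate(order)}
    return all(
        position[parent] < position[name]
        for name, parents in deps.items()
        for parent in parents
    )


# The sorter returns ONE feasible solution; which one is an implementation detail.
one_solution = list(TopologicalSorter(deps).static_order())

print(is_valid_order(["A", "B", "C", "D"], deps))
print(is_valid_order(["A", "B", "D", "C"], deps))
print(is_valid_order(one_solution, deps))
```

A check like this is also one way to meet the "validation of ordering" requirement: a user-supplied order is accepted only if it passes.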

Requirements:

Non-breaking
Validation of ordering: it cannot violate the toposort result.

Nice to have:

No need to introduce a ton of new API
No need to fundamentally change how Kedro resolves execution order (keep toposort)

Feature:

Provide an argument to support custom ordering.

The next step is providing a user-friendly API, as this is likely still too low-level for the end user.

High level Proposal

Add new constraints to "node_dependencies" in addition to the existing inputs/outputs pairs.
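One way this could look, sketched with the standard library only: `resolve_order`, `dataset_deps` and `extra_constraints` are hypothetical names, not Kedro API. Extra ordering constraints are merged into the dataset-derived dependencies before sorting, and a constraint that contradicts the data dependencies surfaces as a cycle.

```python
from graphlib import CycleError, TopologicalSorter


def resolve_order(dataset_deps, extra_constraints):
    """Merge user constraints into dataset-derived dependencies, then sort.

    dataset_deps: node name -> set of node names it depends on (from inputs/outputs).
    extra_constraints: iterable of (before, after) pairs supplied by the user.
    """
    merged = {name: set(parents) for name, parents in dataset_deps.items()}
    for before, after in extra_constraints:
        if before not in merged or after not in merged:
            raise ValueError(f"Unknown node in constraint: {before!r} -> {after!r}")
        merged[after].add(before)
    try:
        return list(TopologicalSorter(merged).static_order())
    except CycleError as exc:
        # args[1] holds the list of nodes that form the detected cycle.
        raise ValueError(
            f"Constraints contradict the data dependencies: {exc.args[1]}"
        ) from exc


dataset_deps = {"A": set(), "B": {"A"}, "C": {"B"}, "D": {"B"}}
print(resolve_order(dataset_deps, [("D", "C")]))
```

Because the constraints only add edges, the result is always one of the orders the original toposort would have allowed, which keeps the change non-breaking.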

Possible Implementation

Can't think of anything, which is why dummy outputs have been the workaround for years.

Possible Alternatives

The current workaround involves dummy inputs and outputs, which becomes tedious quickly.

datajoely · 2024-03-15T12:06:01Z

datajoely
Mar 15, 2024
Collaborator

How would you imagine this working?

0 replies

noklam · 2024-03-15T16:10:01Z

noklam
Mar 15, 2024
Collaborator Author

feature branch: @datajoely
main...noklam/node-ordering-proposal

Don't focus on the API. It's hacky, and there are many scattered Pipeline calls in the current process, so I had to patch that everywhere.

Demo Project:

comment or uncomment this line to play with it.

https://github.com/noklam/kedro-partial-node-order/blob/75b2febb45ab3a44f93928d9d0796bb9d9765ef7/src/ls/pipelines/data_processing/pipeline.py#L40

There are many ways to build a nicer API, for example:

we can assume the order follows the list.
or we can use node_name
or we can introduce syntax like Airflow's, i.e. node_a >> node_e

It's not too important to decide this now; this is more of a PoC to prove it is possible. A benefit of this approach is that Kedro-Viz won't see it at all (I think).

We can also assume it always follows the order of declaration, or tries to stick with it as much as possible. Those are just design decisions, so I will delay that discussion.
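To illustrate the "stick with declaration order as much as possible" idea, here is a small sketch of Kahn's algorithm with a priority queue keyed by declaration position. It is not the PoC branch's implementation; `toposort_by_declaration`, `declared` and `deps` are hypothetical names.

```python
import heapq


def toposort_by_declaration(declared, deps):
    """Topological sort that always picks the earliest-declared ready node.

    declared: node names in the order the user wrote them.
    deps: node name -> set of node names it depends on (all must be in `declared`).
    """
    index = {name: i for i, name in enumerate(declared)}
    remaining = {name: len(deps.get(name, ())) for name in declared}
    children = {name: [] for name in declared}
    for name in declared:
        for parent in deps.get(name, ()):
            children[parent].append(name)

    # Heap of declaration indices for nodes whose dependencies are all satisfied.
    ready = [index[name] for name in declared if remaining[name] == 0]
    heapq.heapify(ready)

    order = []
    while ready:
        name = declared[heapq.heappop(ready)]
        order.append(name)
        for child in children[name]:
            remaining[child] -= 1
            if remaining[child] == 0:
                heapq.heappush(ready, index[child])

    if len(order) != len(declared):
        raise ValueError("Dependencies contain a cycle")
    return order


deps = {"B": {"A"}, "C": {"B"}, "D": {"B"}}
print(toposort_by_declaration(["A", "B", "D", "C"], deps))
```

With this tie-breaking rule, reordering the node list is enough to choose between A -> B -> C -> D and A -> B -> D -> C, and no new argument is needed.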

0 replies

datajoely · 2024-03-15T17:37:29Z

datajoely
Mar 15, 2024
Collaborator

Okay so I understand what you're trying to do. But I have some questions.

1. Why is asking users to pass dummy datasets between nodes insufficient?
2. Is a deterministic sort (something like a seed) order a more useful enhancement?
3. This feels like something that should happen at an orchestrator / namespace granularity level first, so would address some of the recommendations in Synthesis of research related to deployment of Kedro to modern MLOps platforms #3094 first.

1 reply

inigohidalgo May 24, 2024

Why is asking users to pass dummy datasets between nodes insufficient

I haven't read the rest of the discussion, but I am currently encountering this issue. In my current pipeline it would be very unergonomic to add a dummy input to the main pipeline, as it is a modular pipeline imported from another library which I'd rather not modify.

noklam · 2024-03-15T17:47:27Z

noklam
Mar 15, 2024
Collaborator Author

Why is asking users to pass dummy datasets between nodes insufficient?
A few reasons:

It contaminates the DAG flowchart (kedro-viz).
It's hard to edit or read a custom execution order: users need to go through the nodes and do the matching in their head, or edit multiple files.
It feels very hacky.
It's annoying that you cannot make Kedro run sequentially (or at least follow the declaration order when possible).

Is a deterministic sort (something like a seed) order a more useful enhancement?

If you run with SequentialRunner now, it's already deterministic (#1604). Not sure if this is what you are talking about.

This feels like something that should happen at an orchestrator / namespace granularity level first, so would address some of the recommendations in Synthesis of research related to deployment of Kedro to modern MLOps platforms #3094 first.

Which recommendation are you referring to? There are many mentioned.

0 replies

datajoely · 2024-03-16T14:40:32Z

datajoely
Mar 16, 2024
Collaborator

1. You've sold me
2. Wasn't aware this was fixed, so yeah your prioritization is right.
3. I think the user provided session ID is the only one I'd put ahead of this.

1 reply

inigohidalgo May 24, 2024

I should've read ahead lol

noklam · 2024-03-18T09:44:53Z

noklam
Mar 18, 2024
Collaborator Author

@datajoely 2. is one of the things that annoyed me before I joined, thus the first thing I fixed :P (it was also requested by a former colleague). CacheDataset is the other one that I still haven't fixed.

0 replies

datajoely · 2024-03-18T09:46:56Z

datajoely
Mar 18, 2024
Collaborator

We'll get there 🚀

0 replies

Skhurana136 · 2024-04-18T12:41:14Z

Skhurana136
Apr 18, 2024

Commenting as recommended by Nok:

About the kedro project and the goal: I have 9 pipelines; I want to run these pipelines in an explicitly specified order with the nodes in each pipeline also executed in an explicitly specified order.

Explicit order of execution will be super helpful: I want to have tight control over the default execution order. Some nodes and pipelines take longer to execute than others, and I want to push the execution of the time/memory intensive pipelines to the end.

0 replies

noklam · 2024-05-23T09:06:37Z

noklam
May 23, 2024
Collaborator Author

So far I see two main themes around the need:

1. Performance optimisation - currently there is no way to control the order in Kedro, and there are opportunities for fine control to minimize memory footprint etc.
2. Mimic "task-based" DAG tools like GitHub Actions etc.
2.1. Sometimes it's desirable to run in a certain order first (for semantic or other reasons). I'm less clear about this one, so it would be great to have some examples.

4 replies

inigohidalgo May 24, 2024

For 2: environment setup. I have to set up some environment variables to configure parallelism based on some parameter config. I need to run this at the very start of my pipeline.

There are simple workarounds, like just running the pipelines sequentially, or I/O tricks, but this feels like something which would be nicely resolved in a kedro pipeline-only world.

noklam May 24, 2024
Collaborator Author

Do I understand correctly that this is something that should be passed as a variable instead? An environment variable is kind of a global variable.

inigohidalgo May 25, 2024

No. We are training some models on K8s and we need to explicitly configure multithreading by setting various environment variables at the start, but for the sake of this discussion the environment variables aren't important, I just want to ensure I execute a node at the very start of the pipeline execution. That same code could also include something like torch.set_num_threads.

I have a kedro param which indicates how many threads I want to run the training on, so I have a node which loads those params and sets up the environment as required. Only after the environment has been set up can I run the training nodes. I have solved this by having my setup node return True, and passing that True as a dummy input into my training pipeline.

noklam May 25, 2024
Collaborator Author

Could it be done as a before pipeline hook instead?

lvijnck · 2024-07-17T07:24:31Z

lvijnck
Jul 17, 2024

Hi @noklam, what's the reason for specifying the dependencies at the pipeline level?

I would propose organizing it as follows:

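    from kedro.pipeline import Pipeline, node, pipeline

    # write_nodes and write_edges are the project's own functions, defined elsewhere.


    def create_pipeline(**kwargs) -> Pipeline:
        """Create embeddings pipeline."""
        return pipeline(
            [
                node(
                    func=write_nodes,
                    inputs=["int.nodes"],
                    outputs="prm.nodes",
                    name="write_nodes",
                ),
                node(
                    func=write_edges,
                    inputs=["int.edges"],
                    outputs="prm.edges",
                    name="write_edges",
                    # Proposed argument; node() does not accept it today.
                    dependencies=["write_nodes"],
                ),
            ]
        )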

As far as I know, the node topology is constructed by examining the inputs and outputs of each node. Since we're already working at this granularity, I would propose exposing the dependencies at the same level.
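A rough sketch of how node-level dependencies could sit next to the dataset-derived ones, written against the standard library rather than Kedro's classes: `NodeSpec`, its `dependencies` field and `build_graph` are all hypothetical stand-ins for illustration.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class NodeSpec:
    """Hypothetical stand-in for a pipeline node (not Kedro's Node class)."""

    name: str
    inputs: tuple = ()
    outputs: tuple = ()
    dependencies: tuple = ()  # hypothetical: names of nodes that must run first


def build_graph(nodes):
    """Return node name -> set of parent names, from datasets plus explicit names."""
    producer = {out: n.name for n in nodes for out in n.outputs}
    names = {n.name for n in nodes}
    graph = {}
    for n in nodes:
        unknown = set(n.dependencies) - names
        if unknown:
            raise ValueError(f"{n.name} depends on unknown nodes: {sorted(unknown)}")
        # Datasets with no producer are treated as external inputs.
        from_datasets = {producer[i] for i in n.inputs if i in producer}
        graph[n.name] = from_datasets | set(n.dependencies)
    return graph


nodes = [
    NodeSpec("write_edges", ("int.edges",), ("prm.edges",), ("write_nodes",)),
    NodeSpec("write_nodes", ("int.nodes",), ("prm.nodes",)),
]
graph = build_graph(nodes)
print(list(TopologicalSorter(graph).static_order()))
```

The two nodes share no datasets, so without the explicit dependency either could run first; with it, write_nodes is forced ahead of write_edges.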

2 replies

noklam Jul 17, 2024
Collaborator Author

Thanks for the interest, I see your point.

As far as I know, the node topology is constructed by examining the inputs and outputs of each node,

A collection of nodes is the definition of a pipeline, so I'd say the original granularity was already the pipeline, because at node level the dependency does not exist.

This is more of a PoC to prove that this is doable; the API design is not final. The original motivation serves a different use case, and it could be that both node-level and pipeline-level dependencies are needed.

I can see that node-level dependencies are familiar, like how one defines relationships between Airflow tasks. The downside is that once your pipeline is big, it's hard to even comprehend what's actually going on.

When I created the PoC, the use case in mind was simple. There are multiple execution paths that generate the same result. Some are preferable because when I do kedro run I want it to start with the nodes that I care more about (or that are more likely to fail). So I can give an instruction like "do A, B, C first, then I don't care if you execute DEF or FED".

lvijnck Jul 17, 2024

I would actually argue that once the pipeline is big, specifying the dependencies at the pipeline level is rather cumbersome. My main assumption here is that node dependencies would be rare, and very localized between nodes. Hence my proposal of adding them to the node itself, i.e., run this node only once the other nodes have completed, similar to the original dataset-based dependencies, which say: run this node only if the upstream task that produces this dataset has run successfully.

lvijnck · 2025-01-06T08:51:30Z

lvijnck
Jan 6, 2025

Any update on this? I noticed the corresponding issue was closed.

0 replies
