Enhance Kedro Namespaces adoption #4343
Adding a bit of context - deep integration with Kedro-Viz was the first attempt to drive adoption and improve explainability:
We have spent nearly 5 years trying to explain this to users in various ways - We must pivot strategy. |
Thanks @DimedS for opening this issue. First, I would like to agree that tags do not guarantee non-overlapping pipeline partitioning. This has been said time and time again. But I am going to push back against the idea that namespaces are the right solution for that problem. The main reason is that they were probably never designed to solve it in the first place! Namespaces were born as "prefixes" and were introduced in Kedro 0.15.4 in October 2019: (https://github.com/McK-Private/private-kedro/pull/286, private link) And then in 0.16.0 the modern concept of "modular pipelines" with Therefore a bit less or a bit more than 5 years have passed, depending on how you look at it. The original context and discussion have been lost to time https://jira.quantumblack.com/browse/KED-1105 (broken internal link) but we can get a glimpse of what the intent of the feature was from this comment:
(https://github.com/McK-Private/private-kedro/pull/286#issuecomment-542717548, private link) In addition, this is what the documentation of prefixes, and later namespaces, looked like:
The docs have always described namespaces (prefixes) as a way to reuse pipelines. There were zero review comments in those two PRs raising concerns about that. To note, nobody from the current team participated in the original 0.15 discussion. Therefore, I can only conclude that namespaces were always designed with pipeline reuse in mind. Implying that namespaces have always been the solution for non-overlapping pipeline partitioning is, in my view, a big unqualified opinion that has no backing in historical written evidence. And as such, saying that "the docs are wrong" is a misrepresentation of what those docs were supposed to describe. If anything, we're now retrofitting namespaces to solve a problem they weren't intended to solve in the first place. I am going to push back against doing incremental improvements on a feature that nobody has dared to touch in 5 years, that's difficult to understand even for Kedro engineers, let alone for our users (regardless of their intended use case), and that we're probably retrofitting to solve a problem it wasn't designed for. My recommendation is that we look at the problem of non-overlapping pipeline partitioning with fresh eyes, go back to the drawing board, and prototype. |
I would also say from users
|
Thank you for your comments, @datajoely and @astrojuanlu. I see that there isn’t a consensus within the team about the future of namespaces, so I’ve updated the header of this issue to reflect your perspectives. I propose that we continue the discussion about deployment node grouping in the next Tech Design meeting with an open mind to all grouping possibilities - not limited to namespaces. If, during that discussion, we determine that namespaces are essential for deployment, we can revisit this conversation and make a decision on their future. |
Great - I'll also link to this write up from last year: |
Modular pipelines have long been the solution for non-overlapping pipeline composition. They are the original subunit that you can create with a simple command, and they came with provisions for defining their own set of requirements, packaging, documentation, etc. Namespaces simply provided a way to enable reuse of modular pipelines without overlapping. Between modular pipelines and namespaces, you have sufficient power for deployment purposes. You can deploy a full pipeline, or a set of subpipelines. Tags are OK for running subsets of pipelines, but provide no guarantees around overlapping; it's probably fine if they exist. They don't necessarily have to be a recommended deployment solution.
IMO more effectively automatically mapping to containers for orchestration was a minimum bar a couple years ago. This one-way translation was perhaps a good start then. Realistically, modern solutions now provide more aspects of a data platform than just a way to deploy pieces of logic in isolation--they provide lineage, data quality, partial/incremental materialization, etc. These desires are all repeatedly echoed by users, especially data engineers; namespaces are useful, but will go 2% of the way towards providing the full value users expect from data pipelines today. |
This! Why emulate the concept of hierarchical directories with namespaces when modular pipelines are laid out in actual directories? |
Hard truths
If we'd like to group parts of a bigger pipeline into separate units of execution, there's no alternative to exclusive grouping, hence namespaces. You might want to change the name however you like, but it's the hard truth of mathematics and we cannot pretend that this is not the case. A point to discuss could be whether we need deep (nested) namespaces or not, but not the hard truth that exclusive grouping is necessary and different from inclusive grouping (tags).
History
As probably the only person in the team who has been here since the beginning of the discussions around the feature called namespaces, I can confirm that the initial introduction of namespaces was indeed mostly related to reusability, and the automatic prefixing was crucial due to the requirement of reusing the same pipeline twice in a bigger pipeline. However, I can also assure you that the discussions about exclusive / inclusive grouping started very shortly after, namely with the conception of the Kedro Viz feature of visualising "modular pipelines". Meanwhile, at that time there was similar confusion and very strong opinions on how difficult the concept of registered pipelines is, how we shouldn't have such a thing, and so on. A few years later, people seem to understand it quite well, once we managed to explain it well and provide a good example and a good entry point for it. So it would not be the first time that we failed to explain something for a long time, then got it right, and it turned out to be a very simple and easy concept.
False premises
Some digging is needed for the data and evidence thrown around, namely:
I haven't seen any hard data on that yet, only a few anecdotal quotes amplified by a couple of people who use it to support their opinion mainly. And even if this is right, I haven't seen a double-click on it - why is tagging used for deployment? Why are namespaces not used? As for this:
That is not an accurate statement: we made exactly two (at most three, if I am generous) attempts to explain it in our docs, each with less than 1 week of effort put into it by a single person. Also, every time the explanation was tightly coupled with the ambiguous concept of reusability, and not the higher concept of grouping. I am not sure how we measure adoption of this to state that it isn't an adopted feature, but even more so, you only need to adopt namespaces when you need to adopt namespaces. So pure adoption metrics are not a great measurement. I am pretty sure if we start counting the nodes in pipelines, we'll find out that only a single digit percentage of users have more than 100 nodes, but I would never use that as an argument that we need to restrict the number of nodes to never exceed 100, just because adoption of bigger pipelines is low.
These are irrelevant concerns for this discussion; we are talking about grouping. Dependency isolation is a choice of how you structure your project. Is it one Python package or several Python packages? Kedro focuses on making one Python package, thus assuming all nodes share the same set of dependencies. If you want a few packages, you can create a few Kedro projects.
Focus the discussion
So in order to limit digressions, could we narrow down the discussion and start from first principles:
Compare tags with namespaces
I will be repurposing @DimedS's example to have a fair comparison and accurately highlight the difference between namespaces and tags. Single node tagging is not relevant for the discussion, as grouping one node is a meaningless operation (you need at least 2 entities to call something a group).
b. Add a namespace:
part1_ns = pipeline(part1, namespace="part1_ns")  # pipeline name most likely repeats namespace name
This prefixes all inputs, outputs, and parameters with the namespace name.
part1_ns = pipeline(part1, namespace="part1_ns", inputs={"a", "b"}, outputs={"e", "f"})
# I need to specify my inputs and outputs twice
As you can see, if we look at the problem dispassionately, we'll see that there's minimal difference in the API for both, so it's very unlikely that the "problem of adoption" hides there. The only difference is semantics, which is the automatic renaming of inputs / outputs. This can lead us to a way more productive discussion on how to go forward and what tradeoffs to make.
Questions
Bear in mind that the discussion about automated renaming happened a couple of times a long time ago, was informed by "user research", and the behaviour changed twice to reflect the findings. Then again, we followed the opinion of our users and not logic when making the decisions, and we ended up here. This should serve as a cautionary tale that feature development is not a popularity contest, but an exercise in logic, vision and foreseeing future problems before they arise. |
Another issue we spotted: #4039 |
Digging on a recent comment by @noklam :
|
^ On this, I would love for tags / registered pipelines that do meet the requirements of a namespace to be treated as one and collapsed accordingly |
I don't think this is a good idea. It's a recipe for more confusion - it's super weird that something works only sometimes in a non-obvious way. Again, let's refocus the discussion to answer the question, "what exactly makes I've shown in my earlier comment that the difference in API is minimal and clearly cannot be the culprit of the issues. There's only one difference there, namely auto-renaming of datasets behind the scenes. Shall we not dig in there first to see if we can improve that before actually suggesting we go ahead with some other random ideas like repurposing tags, registered pipelines and what-not? |
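To make the auto-renaming mentioned above concrete, here is a toy model in plain Python. It is not Kedro's implementation. The function name `apply_namespace`, its `keep` argument, and the intermediate dataset names `c` and `d` are hypothetical and only used for illustration. The sketch assumes that every dataset name is prefixed with `<namespace>.` unless it is explicitly listed as a boundary input or output.

```python
def apply_namespace(datasets, namespace, keep=()):
    """Toy model: prefix each dataset name with "<namespace>." unless it is in `keep`."""
    keep = set(keep)
    return {
        name: name if name in keep else f"{namespace}.{name}"
        for name in datasets
    }


# Hypothetical dataset names: a, b, e, f come from the example in this thread;
# c and d are assumed intermediates.
names = ["a", "b", "c", "d", "e", "f"]

# Without declaring a boundary, every name is rewritten.
fully_renamed = apply_namespace(names, "part1_ns")
print(fully_renamed["a"])  # part1_ns.a

# Declaring inputs/outputs keeps the boundary names and renames only intermediates.
renamed = apply_namespace(names, "part1_ns", keep={"a", "b", "e", "f"})
print(renamed["a"], renamed["c"])  # a part1_ns.c
```

In this model the only extra work compared to tagging is listing the boundary names in `keep`, which mirrors the "specify my inputs and outputs twice" remark earlier in the thread.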
In this whole discussion about tags and namespaces, I never really understood why we need 3 concepts to group nodes (tags, pipelines and namespaces). To me, it feels weird to have to "force" a user to create a bunch of pipelines. But when it comes to deployment, you completely forget about this concept and you would need another concept (a namespace or a tag) to make it work. Wouldn't a better alternative be to think about a pipeline as a "main" function calling several other functions (nodes in this case)? That pipeline could then have an input and output defined that are used in the nodes. When designed right, it should be enough to only have those inputs and outputs defined in the catalog. Again, following the analogy with coding, the objects you return in your main function are the only variables you have access to later on (references to all other objects are lost). So when you then want to add two pipelines, you have the following scenarios depending on what the graphs look like:
In this scenario, as @astrojuanlu and @noklam mentioned, the keys of the pipelines in the registry are then the ones suitable to collapse (and also used in deployment). As a result, in the deployments, you will see very simple commands like And while I understand this is a big and breaking change to what we currently have, I think it simplifies a whole lot. |
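The "pipeline as a main function" analogy from this comment can be sketched in plain Python. This is only an illustration of the idea, not a proposal for Kedro's API. `run_pipeline`, the node functions, and the dataset names are all hypothetical.

```python
def clean(raw):
    """Drop missing values."""
    return [x for x in raw if x is not None]


def total(values):
    """Sum the cleaned values."""
    return sum(values)


def run_pipeline(nodes, inputs, outputs):
    """Run nodes in order and return only the declared outputs.

    Intermediate datasets live in a local dict and are dropped on return,
    like local variables of a function.
    """
    data = dict(inputs)
    for func, in_name, out_name in nodes:
        data[out_name] = func(data[in_name])
    return {name: data[name] for name in outputs}


# Each node is (function, input name, output name).
nodes = [
    (clean, "raw", "cleaned"),
    (total, "cleaned", "total"),
]

result = run_pipeline(nodes, {"raw": [1, None, 2]}, outputs=["total"])
print(result)  # {'total': 3}
```

Only `raw` and `total` would need catalog entries under this view. `cleaned` never leaves the pipeline, so it cannot collide with a dataset of the same name in another pipeline.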
Kedro namespaces are currently not widely used. The team is divided on the reasons for this:
This parent issue aims to facilitate an agreed-upon decision regarding the points above and address these concerns. It is also tied to the goal of improving deployment functionality, where namespaces should play a pivotal role in node grouping.
History
Improving docs
The current documentation focuses primarily on how namespaces enhance pipeline reusability (see docs). However, this ticket proposes updating the docs to include a clear definition of namespaces, highlighting that they are similar to node tagging but do not allow overlaps. This makes namespaces an excellent choice for creating groups of nodes that can be executed together without conflicts.
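The difference between overlapping and non-overlapping grouping can be shown with a small plain-Python sketch. It does not use Kedro. The node names, tag names, and helper functions are hypothetical. It assumes a node may carry several tags but at most one namespace.

```python
from itertools import combinations


def groups_from_labels(node_labels):
    """Invert a node -> labels mapping into label -> set of nodes."""
    groups = {}
    for node, labels in node_labels.items():
        for label in labels:
            groups.setdefault(label, set()).add(node)
    return groups


def overlapping_nodes(groups):
    """Return the nodes that appear in more than one group."""
    shared = set()
    for (_, first), (_, second) in combinations(groups.items(), 2):
        shared |= first & second
    return shared


# Tags: a node may carry several, so groups can overlap.
node_tags = {
    "clean": {"ingest"},
    "join": {"ingest", "features"},
    "train": {"modelling"},
}
tag_groups = groups_from_labels(node_tags)
print(overlapping_nodes(tag_groups))  # {'join'}

# Namespaces: one label per node, so groups are disjoint by construction.
node_namespace = {"clean": "ingest", "join": "features", "train": "modelling"}
ns_groups = groups_from_labels({n: {ns} for n, ns in node_namespace.items()})
print(overlapping_nodes(ns_groups))  # set()
```

If each group were run as a separate unit, `join` would run twice under the tag grouping and exactly once under the namespace grouping.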
Suggested docs example:
- Create pipelines without namespaces: Show how to build basic pipelines.
- Create namespaced pipelines: Use the initial pipelines to create namespaced versions.
- Combine pipelines: Build a final pipeline by combining the namespaced ones.
- Visualise: Include a visualisation using Kedro-Viz
(link to ticket in progress).
Decide new name for "reused pipelines with namespaces" #4016
Clarifying Modularity. The term "modularity" currently appears to relate to creating pipelines in separate folders, not namespaces. If this interpretation is correct, we should explicitly clarify this distinction in the docs.
Technical issues
Several technical issues were highlighted by @idanov during the last TD. These will be moved here for tracking (details in progress).
User interface
There is a potential user interface concern affecting namespace adoption, which might benefit from design attention (@stephkaiser, @iamelijahko).
Tagging Example: Tags are added directly during node or pipeline creation:
Alternatively, for pipelines:
Namespace Example: Namespaces are applied at the pipeline creation level and involve multiple steps:
This prefixes all inputs, outputs, and parameters with part1., which is most likely not desired. To preserve naming:
Tags are applied directly to nodes, whereas namespaces require changes at the pipeline level. Simplifying the namespace UI or aligning it more closely with tagging might also improve adoption.
A few other UI gaps reported by users:
Hackathon
Namespaces in deployment
We aim to unify and implement node grouping functionality for deployment purposes in #4319. Namespaces appear to be a great fit for this purpose. However, the ongoing work to increase namespace adoption from the current ticket must be completed at the same time.
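One way to picture node grouping for deployment is to collapse node-level dependencies into group-level dependencies, so that each group becomes one deployable unit. This is a minimal plain-Python sketch under that assumption. It is not the design tracked in #4319. The function `group_dependencies` and all node and group names are hypothetical.

```python
def group_dependencies(node_deps, node_group):
    """Collapse node-level dependencies into group-level dependencies.

    node_deps:  node -> set of upstream nodes
    node_group: node -> exactly one group (exclusive grouping)
    """
    group_deps = {group: set() for group in node_group.values()}
    for node, upstream in node_deps.items():
        for parent in upstream:
            src, dst = node_group[parent], node_group[node]
            if src != dst:  # edges inside a group disappear when collapsed
                group_deps[dst].add(src)
    return group_deps


node_deps = {
    "clean": set(),
    "join": {"clean"},
    "featurise": {"join"},
    "train": {"featurise"},
}
node_group = {
    "clean": "ingest",
    "join": "ingest",
    "featurise": "features",
    "train": "modelling",
}

print(group_dependencies(node_deps, node_group))
# {'ingest': set(), 'features': {'ingest'}, 'modelling': {'features'}}
```

The mapping from node to a single group is what makes the collapse well defined. Note that an arbitrary exclusive grouping can still produce cycles between groups, which this sketch does not check for.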