TEP-0060: Propose remote resolution experimental controller #493

Merged (1 commit, Aug 31, 2021)
124 changes: 116 additions & 8 deletions teps/0060-remote-resource-resolution.md
@@ -2,7 +2,7 @@
status: proposed
title: Remote Resource Resolution
creation-date: '2021-03-23'
last-updated: '2021-06-15'
last-updated: '2021-08-23'
authors:
- '@sbwsg'
- '@pierretasci'
@@ -20,6 +20,9 @@ authors:
- [Use Cases (optional)](#use-cases-optional)
- [Requirements](#requirements)
- [Proposal](#proposal)
- [Proposal 1: Create an Experimental Resolution Project](#proposal-1-create-an-experimental-resolution-project)
- [Proposal 2: Disable Tekton Pipelines' "In-Tree" Resolution Logic With a Deployment Arg](#proposal-2-disable-tekton-pipelines-in-tree-resolution-logic-with-a-deployment-arg)
- [Proposal 3: Add <code>status.taskYaml</code> to <code>TaskRuns</code> and <code>status.pipelineYaml</code> to <code>PipelineRuns</code>](#proposal-3-add--to--and--to-)
- [Notes/Caveats (optional)](#notescaveats-optional)
- [Risks and Mitigations](#risks-and-mitigations)
- [User Experience (optional)](#user-experience-optional)
@@ -239,13 +242,118 @@ users can leverage those shared pipelines.

## Proposal

<!--
This is where we get down to the specifics of what the proposal actually is.
This should have enough detail that reviewers can understand exactly what
you're proposing, but should not include things like API designs or
implementation. The "Design Details" section below is for the real
nitty-gritty.
-->
Prior to moving this TEP to `implementable` there will be a considerable amount
of experimentation required to gauge the suitability of different approaches. There
are a lot of open questions with respect to syntax, approach, and architecture that
cannot be easily answered without trying ideas with running code.

There are several additional changes that would aid experimentation in the
pre-`implementable` phase of this TEP:

### Proposal 1: Create an Experimental Resolution Project

This proposal adds a new controller in the [Experimental
Repo](https://github.com/tektoncd/experimental/) that implements some of
the ideas described in the [Alternatives section](#alternatives) of this TEP.

This will need a new directory added with a controller and an OWNERS configuration.
If we want to try multiple alternatives (e.g. exploring both interceptor-style
and reconciler-style approaches) then we can have either separate subdirectories
or branches for each implementation.
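
As a rough illustration, the new directory's OWNERS file might look like the
sketch below, assuming the standard tektoncd OWNERS format. The directory name
and the approver/reviewer lists are placeholders, not a proposal:

```yaml
# experimental/remote-resolution/OWNERS (hypothetical path)
# Illustrative only: the actual approvers/reviewers would be decided
# when the directory is added.
approvers:
- sbwsg
- pierretasci
reviewers:
- sbwsg
- pierretasci
```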

### Proposal 2: Disable Tekton Pipelines' "In-Tree" Resolution Logic With a Deployment Arg

A problem we face with introducing an experimental resolver controller is that
any in-cluster process watching for `TaskRun` and `PipelineRun` resources will
race with Tekton Pipelines to process them as they're submitted to the kube
API.

In order to support an experimental controller independently performing
resolution, we could add a behavioural flag that completely disables the
built-in resolution logic. For example: `experimental-disable-ref-resolution`.
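
A minimal sketch of how such a flag might be wired up, assuming it is exposed
as an argument on the Tekton Pipelines controller binary (the flag name comes
from this proposal; the plumbing shown is an assumption for illustration):

```yaml
# Hypothetical excerpt of the tekton-pipelines-controller Deployment.
spec:
  template:
    spec:
      containers:
        - name: tekton-pipelines-controller
          args:
            # Disables built-in ref resolution so an external resolver
            # can process TaskRuns/PipelineRuns without racing the
            # Pipelines reconcilers.
            - "-experimental-disable-ref-resolution=true"
```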

This is explicitly _not_ being proposed as a feature flag because 1) it is not
intended to migrate through to production-readiness and 2) a behavioural flag
with its own name underlines the "experimental mode" that the user is electing
to put their Tekton Pipelines controller into.

#### Alternative

A different approach that would work for PipelineRuns today would be to require
them to be created with a [`Pending`
status](https://github.com/tektoncd/pipeline/blob/main/docs/pipelineruns.md#pending-pipelineruns).
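
For reference, a pending `PipelineRun` is created with its `spec.status` field
set, per the linked docs. A sketch (resource names are illustrative):

```yaml
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: resolve-before-running  # illustrative name
spec:
  # The Pipelines reconciler will not start this run until another
  # controller removes the pending status.
  status: PipelineRunPending
  pipelineRef:
    name: some-shared-pipeline  # illustrative ref
```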

This proposal doesn't take that approach because:
1. The same support does not exist for TaskRuns and so would require its own
independent TEP and review cycle to implement.
2. The pending status can't be used by multiple controllers that all need to signal
readiness. This means a Resolution Reconciler and (for example) a custom
PipelineRun Scheduler could each clear this status before the other controller is
ready for the run to start.

### Proposal 3: Add `status.taskYaml` to `TaskRuns` and `status.pipelineYaml` to `PipelineRuns`

When the proposed experimental resolution controller sees an unresolved
`TaskRun` or `PipelineRun` it will read the resource and then resolve whatever
ref, bundle or inline document is included. The resolved `spec` will then be
embedded in the `status.taskYaml.spec` field for `TaskRuns` or the
`status.pipelineYaml.spec` field for `PipelineRuns`. Once those fields are populated,
Tekton Pipelines' reconcilers will be able to pick them up and begin working
with the specs.
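
A sketch of what a resolved `TaskRun` might look like under this scheme; the
exact shape of `status.taskYaml` is one of the things this experimental phase
should settle, so treat the field layout below as illustrative:

```yaml
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: example-run
spec:
  taskRef:
    name: some-remote-task  # resolved by the experimental controller
status:
  # Populated by the resolver; the Pipelines reconciler picks up the
  # spec from here instead of resolving the ref itself.
  taskYaml:
    spec:
      steps:
        - name: hello
          image: alpine
          script: echo "resolved and running"
```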

We will take this approach because it's entirely backwards-compatible with our
existing resolution and reconcilers: today, the reconcilers already copy
resolved resources into the `status` block of `TaskRuns` and `PipelineRuns` so
that changes in the underlying resources don't affect runs that are already
in-flight. By observing the same protocol, the logic changes in
Tekton Pipelines' reconcilers are minimized and the same "audit trail"
property is maintained.

Unfortunately there's a wrinkle: our existing reconcilers copy labels and
annotations from the resolved `Task` or `Pipeline` to the `TaskRun` or
`PipelineRun`. By splitting the resolution logic into its own external
machinery we are introducing a barrier between resolution and metadata
duplication: the "actuation reconcilers" never see the fully-resolved resource,
only the `spec` passed via the `status`, so they're unable to perform the
metadata copy. The end result is that the "resolver reconcilers" are also
responsible for performing this metadata copy.

A lot of the conversation around this TEP so far has revolved around the fact
that resolution might be something we want to expose not just for Pipelines but
also for other Tekton projects. Having the metadata copy performed by an external
controller doesn't make much sense in that case: metadata copying is a
feature/requirement of Tekton Pipelines, not a general requirement for
resolution.
Contributor:

Could you explain a bit more about this? I'm not totally following because:

  1. AFAIK the experimental resolution controllers know enough about PipelineRuns + TaskRuns to modify them directly, so in that sense they're not generic enough to use for anything else (maybe I'm missing something, though).
  2. Metadata like labels and annotations is pretty universal across k8s resources, so in that sense having some logic around this even in a very generic tool seems reasonable.
Author:

The resolution controller I'm proposing for the experimental repo would aim to deliver the functionality that we've been discussing through TEP-0060, which has now largely boiled down to: (1) it should be a separate project, (2) it should process the resources before Pipelines sees them, and (3) it shouldn't be Pipelines-specific.

It's true that metadata is a universal feature of objects in k8s. But the behaviour where metadata is copied verbatim from Tasks to TaskRuns and Pipelines to PipelineRuns is quite specific to Tekton Pipelines, I think? Let's say we open the system up in such a way that anyone can write a Resolver. Should we require their resolver to also perform this Pipelines-specific metadata-copying action? I guess I'm not convinced that this is an action we want left to external resolvers to perform. Those labels / annotations can trip up the pipelines controller when they're missing and are part of our backwards-compat story, I think.

Contributor:

> But the behaviour where metadata is copied verbatim from Tasks to TaskRuns and Pipelines to PipelineRuns is quite specific to Tekton Pipelines, I think?

I wonder if one of the reasons we're copying around metadata like this is because we needed a better way to query PipelineRuns and TaskRuns - which maybe is the role that Tekton Results fills now? Might be something we want to consider for v1.

> I guess I'm not convinced that this is an action we want left to external resolvers to perform.

In that case I'm leaning more toward the approach where we embed the entire resolved YAML document (or as much of it as we know the controller needs) 🤔

Author:

Yeah, possibly the biggest reason for the metadata duplication is querying the cluster for PipelineRuns associated with Pipelines (and TaskRuns with Tasks). I definitely agree that Results might be a better host for this kind of relational data.

Just to throw in an additional wrinkle (because I'm also generally in favor of embedding the entire YAML doc): if we go the route of embedding the entire resolved YAML, I imagine there will be some amount of time where we need to store both pipelineSpec and (hypothetical) pipelineYAML in a single PipelineRun's status for compatibility's sake. In turn I wonder if this would push PipelineRuns much more quickly towards etcd's size limits when very large Pipelines are involved. WDYT?

Author (Aug 17, 2021):

Sorry, didn't quite finish my thought re: metadata duplication. While querying associations is one very big use case, I think there are probably others, and it would behoove us to enumerate those and make sure they're accounted for before we decide to drop the copying behaviour.

Contributor:

> I imagine there will be some amount of time where we need to store both pipelineSpec and (hypothetical) pipelineYAML in a single PipelineRun's status for compatibility's sake.

Hmm, I'm trying to think through the scenario where we'd need to do this - all I can think of is if the controller is downgraded during PipelineRun execution?

The current controller expects to find a pipeline spec in the PipelineRun status - the updated version would expect to find the entire pipelineYAML but could also support the pipeline spec if it found it (we could wait until we're outside the compatibility window to remove that support) - but most cases wouldn't require actually storing both, I don't think?

Scenarios I can think of:

  1. Upgrade from "pipeline spec" controller to new version:
    • completed PipelineRuns would contain "pipeline spec" in status; doesn't matter because they're done
    • in-progress PipelineRuns would be using "pipeline spec", which would be okay as long as the new version still supported it
    • new PipelineRuns would start using "pipelineYAML"
  2. Downgrade from new version of the controller to "pipeline spec" controller:
    • completed PipelineRuns contain "pipelineYAML"; doesn't matter because they're done (possible that on controller startup it gets upset about unexpected fields?)
    • in-progress PipelineRuns would be in trouble - the old controller wouldn't know how to read "pipelineYAML" out of their specs
    • new PipelineRuns would use "pipeline spec"

Do we need to support the downgrade case in (2)? (Specifically: support runs which execute during the update?)

Author:

I was specifically thinking of users who have written code that expects the pipelineSpec to be embedded in the status rather than scenarios where the system itself might be caught in an intermediate state. But it sounds like you don't think this is a situation we need to worry about?

Author:

OK, I've updated the proposal to pass the entire resolved YAML via a `taskYaml` field for TaskRuns and a `pipelineYaml` field for PipelineRuns.

FYI @vdemeester and @dibyom - this is a change to the proposal so probably worth giving it one more look to confirm you're happy to keep approving.

Member:

Resolving both metadata and spec under a taskYAML or pipelineYAML field sounds good to me.

Contributor:

> users who have written code that expects the pipelineSpec to be embedded in the status

Ah, that's a good point - I was only thinking about this from the controller's perspective; I always forget about status as part of the API 😅

You're probably right, it's probably best to support both - maybe we can use (even more) config options to control whether we do that or not?


There are two approaches that I've considered for solutions to this:

1. Change the contract such that the _entire_ resolved YAML document is passed
Contributor:

At the risk of making this too complicated and unreliable, I wonder if we should identify some exceptions to this right off the bat.

I was just looking at a Task in the dogfooding cluster and the metadata looks like this:

[screenshot: metadata of a Task from the dogfooding cluster]

I'm wondering if we want to identify a list of metadata fields and specific annotations we don't copy, e.g.:

  • the annotation kubectl.kubernetes.io/last-applied-configuration
  • managedFields

@_@ OR we start with only labels? We could pave the way for potentially embedding the entire thing by making the field able to store the entire YAML, but initially only store a certain subset that we know we need?

Author:

Updated with a note to decide precisely how we filter the copied metadata as part of this experimental phase.

Member:

Indeed, we should exclude storing managedFields and last-applied-configuration.

Author:

Updated with mention of managedFields in addition to last-applied-configuration.

via the `TaskRun` or `PipelineRun` `status`. This would give the Tekton
Pipelines reconcilers total flexibility to process the resolved document as
they need, but would be backwards-incompatible.

2. Add a `status.resolvedMeta` field to `TaskRuns` and `PipelineRuns` that
allows the experimental controller to pass the `metav1.ObjectMeta`
information from a `Task` or `Pipeline` alongside the resolved spec. This
would give enough data for Tekton Pipelines' reconcilers to keep metadata
copying as their responsibility, without being a disruptive change to the way
resolved `specs` are stored on runs.

Given the above two options we are proposing that we employ strategy (1) and
add a `taskYaml` field to the `status` of `TaskRuns` and a `pipelineYaml` field
to the `status` of `PipelineRuns`.

There are some exceptions to the metadata fields that should be copied across.
For example, the annotation `kubectl.kubernetes.io/last-applied-configuration`
can be extremely large and would bloat a TaskRun or PipelineRun
during the resolution phase. The same goes for the `metadata.managedFields` field.
The precise set of metadata we include/exclude should be figured out as part
of this experimental period, but a few possible approaches (see the sketch
after this list for an example of the second) would be:

- Limit to just metadata fields prefixed with `[a-zA-Z].tekton.dev/`.
- Exclude known edge-case fields (the `last-applied-configuration` annotation,
`metadata.managedFields`).
- Limit to just labels.
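
As a sketch of the "exclude known edge-case fields" approach, the metadata
copied into a run's `status` might be filtered like this. The field names are
real Kubernetes metadata, but the specific filtering rules shown are
illustrative only:

```yaml
# Metadata on a resolved Task, annotated with what a resolver might
# copy into status.taskYaml.metadata under this approach.
metadata:
  labels:
    app.kubernetes.io/version: "0.1"   # copied
  annotations:
    tekton.dev/categories: "Git"       # copied
    # dropped: can be extremely large and would bloat the run
    kubectl.kubernetes.io/last-applied-configuration: '{"apiVersion": "..."}'
  managedFields: []                    # dropped: server-side apply bookkeeping
```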

### Notes/Caveats (optional)

2 changes: 1 addition & 1 deletion teps/README.md
@@ -210,7 +210,7 @@ This is the complete list of Tekton teps:
|[TEP-0057](0057-windows-support.md) | Windows support | proposed | 2021-03-18 |
|[TEP-0058](0058-graceful-pipeline-run-termination.md) | Graceful Pipeline Run Termination | implementable | 2021-04-27 |
|[TEP-0059](0059-skipping-strategies.md) | Skipping Strategies | implemented | 2021-08-23 |
|[TEP-0060](0060-remote-resource-resolution.md) | Remote Resource Resolution | proposed | 2021-06-15 |
|[TEP-0060](0060-remote-resource-resolution.md) | Remote Resource Resolution | proposed | 2021-08-23 |
|[TEP-0061](0061-allow-custom-task-to-be-embedded-in-pipeline.md) | Allow custom task to be embedded in pipeline | implemented | 2021-05-26 |
|[TEP-0062](0062-catalog-tags-and-hub-categories-management.md) | Catalog Tags and Hub Categories Management | implementable | 2021-03-30 |
|[TEP-0063](0063-workspace-dependencies.md) | Workspace Dependencies | proposed | 2021-04-23 |