Rollout automation #17

Open · wants to merge 18 commits into base: main

Commits on Apr 12, 2022

  1. RFC on rollout automation

    This adds a "work in progress" RFC on changing the layering to make
    rollouts possible. It's an RFC, rather than a design, because it
    proposes changes to the design -- if I can get it into a good shape
    (including, possibly, a reference implementation), then I'll update
    the design docs as a follow-up.
    
    Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>
    squaremo committed Apr 12, 2022 (commit 4e16dca)
  2. Record questions and suggestions

    Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>
    squaremo committed Apr 12, 2022 (commit 3602288)

Commits on Apr 13, 2022

  1. Explain atomicity requirements

    Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>
    squaremo committed Apr 13, 2022 (commit 1353834)
  2. Remove Partition type in favour of direct relation

    The partition type was the locus for the module->cluster assignment,
    with the idea being that a rollout controller can change which
    clusters belong to which partitions to effect a rollout.
    
    However, it's not necessary to have an intermediate object for this
    relation -- clusters can simply be owned by modules directly, without
    losing generality. Since partitions don't pay their way, I've removed
    them from the design.
    squaremo committed Apr 13, 2022 (commit 0e4db2e)
  3. Rework how assignment is represented

    Using owner references will interact with Kubernetes' garbage
    collection. Even if it's possible to work around that with, e.g.,
    finalisers, it seems like a bad idea to subvert GC. It also doesn't
    help with atomicity, for which other workarounds are needed.
    
    Instead, use the object that already represents the assignment of
    modules to a cluster -- the RemoteAssemblage -- and change its meaning
    to suit. More shuffling may be necessary to make BootstrapModules fit
    too.
    
    Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>
    squaremo committed Apr 13, 2022 (commit 2455623)
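    For illustration only (not code from this PR), here is the rejected
    approach in controller-runtime terms: setting a controller reference is
    what couples the owned object's lifetime to its owner via garbage
    collection. The function and argument names below are invented for the
    sketch.

        // Sketch only: the approach this commit argues against.
        package sketch

        import (
            metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
            "k8s.io/apimachinery/pkg/runtime"
            "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
        )

        // assignByOwnerRef records an assignment as an ownerReference on the
        // assigned object. The Kubernetes garbage collector then deletes the
        // owned object whenever its owner is deleted, entangling the
        // assignment relation with the object's lifetime.
        func assignByOwnerRef(owner, assigned metav1.Object, scheme *runtime.Scheme) error {
            return controllerutil.SetControllerReference(owner, assigned, scheme)
        }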
  4. Summarise proposed changes in RFC001

    Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>
    squaremo committed Apr 13, 2022 (commit e300b4c)
  5. Remove the extra layer in rollout RFC

    After realising that assemblages represent the assignment of modules
    to clusters -- thus no need for another object -- a further
    realisation is that building a Rollout layer on top of modules makes
    modules essentially passive objects. Another way to look at it is that
    Modules may as well be the locus for rollout logic, since otherwise
    they do not have a job to do.
    
    Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>
    squaremo committed Apr 13, 2022 (commit d2f04be)
  6. Show user in RFC diagrams

    This is just to indicate where the user is expected to interact with
    the system. It doesn't affect the technical bits of the design, but
    might help orient readers.
    
    Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>
    squaremo committed Apr 13, 2022 (commit b8e5236)
  7. Add ProxyAssemblage type

        kubebuilder create api --kind ProxyAssemblage ...
    
    One change needed for rollouts is to have an "assemblage" object in
    which all the bootstrap modules assigned to each cluster (syncs which
    get applied remotely from the control plane) are recorded. A good
    name for that would be `RemoteAssemblage` (i.e., an assemblage which
    is applied remotely) -- but that name is already used for the type
    that proxies to a remote cluster, which would be better named
    `ProxyAssemblage`. To create myself some bother, I'm adding the new
    type and the rename at the same time.
    
    Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>
    squaremo committed Apr 13, 2022 (commit 98ee78d)
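    For orientation, here is a sketch of the kind of type the command above
    scaffolds. The spec and status fields are placeholders, not the
    repository's actual definitions.

        // Sketch of a kubebuilder-scaffolded API type; fields are placeholders.
        package v1alpha1

        import (
            metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        )

        // ProxyAssemblageSpec holds the syncs to proxy to a downstream
        // cluster (placeholder).
        type ProxyAssemblageSpec struct{}

        // ProxyAssemblageStatus reports the state of the proxied syncs
        // (placeholder).
        type ProxyAssemblageStatus struct{}

        // +kubebuilder:object:root=true
        // +kubebuilder:subresource:status

        // ProxyAssemblage is an assemblage applied by proxying to a remote
        // cluster, per the naming rationale above.
        type ProxyAssemblage struct {
            metav1.TypeMeta   `json:",inline"`
            metav1.ObjectMeta `json:"metadata,omitempty"`

            Spec   ProxyAssemblageSpec   `json:"spec,omitempty"`
            Status ProxyAssemblageStatus `json:"status,omitempty"`
        }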
  8. Copy implementation for remote to proxy

    As part of the rename of RemoteAssemblage to ProxyAssemblage, this
    commit copies the implementation of the remote assemblage controller
    to the proxy assemblage controller, ports the tests for the former to
    the latter, and likewise swaps the type used in module_controller.go.
    
    The implementation for (repurposed) remote assemblages will be new,
    and need new tests.
    
    Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>
    squaremo committed Apr 13, 2022 (commit c335b54)
  9. Expand bootstrap modules via remote assemblages

    At present, bootstrap modules get expanded directly into Flux
    primitives (one GitRepository, and a Kustomization per
    cluster). Incremental rollouts could be implemented within this
    scheme; however, it would not help third-party integrations much since
    they'd have to implement all of it themselves. Instead, I'm
    repurposing the RemoteAssemblage type -- the behaviour of which was
    moved to the better-named ProxyAssemblage -- to represent a set of
    syncs to be applied remotely to a cluster.
    
    In this commit:
    
     - change the definition of the RemoteAssemblage type so it contains
       syncs to apply remotely, rather than syncs to proxy (those were
       moved to ProxyAssemblage)
    
     - move the "expand to Flux primitives" code from the bootstrap module
       controller to the remote assemblage controller
    
     - implement the construction of remote assemblages in the bootstrap
       module controller
    
     - fix up type names (remote vs. proxy) in RBAC annotations and comments
     - adapt test code to the above changes
    
    The aim is to eventually put the commonly useful bits -- expansion to
    Flux primitives, and binding evaluation -- in the assemblage code, and
    the rollout logic in the module code. An integration or extension can
    then replace the module part by building on the assemblage part.
    
    Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>
    squaremo committed Apr 13, 2022 (commit 030983c)
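    To make the repurposing concrete, here is an illustrative sketch of the
    shape described above: a kubeconfig reference plus the syncs to apply
    remotely. The field and type names (KubeconfigRef, RemoteSync, and so
    on) are invented for illustration, not taken from the repository.

        // Sketch only: an illustrative shape for the repurposed RemoteAssemblage.
        package v1alpha1

        import (
            corev1 "k8s.io/api/core/v1"
        )

        // RemoteAssemblageSpec lists the syncs to apply remotely to one
        // cluster; the remote assemblage controller expands these into Flux
        // primitives (GitRepository and Kustomization objects).
        type RemoteAssemblageSpec struct {
            // KubeconfigRef names the secret holding the kubeconfig used to
            // apply the syncs to the downstream cluster.
            KubeconfigRef corev1.LocalObjectReference `json:"kubeconfigRef"`

            // Syncs are the sources and paths to apply remotely.
            Syncs []RemoteSync `json:"syncs,omitempty"`
        }

        // RemoteSync names one source to sync remotely.
        type RemoteSync struct {
            Name string `json:"name"`
            URL  string `json:"url"`
            Path string `json:"path"`
        }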
  10. Avoid a sync vs cluster ready race

    When you create a cluster with Cluster API, it will typically take
    some minutes between the Cluster object (and its companions) appearing
    and the provider creating a secret containing a kubeconfig for the
    newly-provisioned cluster. If you have a bootstrap module already, the
    Flux primitives targeting the new cluster will fail in that period,
    possibly leading to a long back-off.
    
    There are at least a couple of places where this can be mitigated:
    
     - only create a RemoteAssemblage when a cluster is marked ready;
     - only create Flux primitives when the required secret exists;
    
    Here, I've implemented the latter, since it is at a lower level and
    therefore will apply more broadly.
    
    Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>
    squaremo committed Apr 13, 2022 (commit 1c9c728)
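    A sketch of that mitigation, assuming the Cluster API convention that
    the kubeconfig lives in a secret named "<cluster>-kubeconfig"; the
    controller logic in this PR may differ in its details.

        // Sketch only: don't create Flux primitives until the cluster's
        // kubeconfig secret exists; requeue instead of letting syncs fail
        // and back off.
        package sketch

        import (
            "context"
            "time"

            corev1 "k8s.io/api/core/v1"
            apierrors "k8s.io/apimachinery/pkg/api/errors"
            "k8s.io/apimachinery/pkg/types"
            ctrl "sigs.k8s.io/controller-runtime"
            "sigs.k8s.io/controller-runtime/pkg/client"
        )

        // kubeconfigReady returns (result, ok): when the secret is missing,
        // ok is false and the result asks for a later retry.
        func kubeconfigReady(ctx context.Context, c client.Client, clusterName, namespace string) (ctrl.Result, bool, error) {
            var secret corev1.Secret
            key := types.NamespacedName{Name: clusterName + "-kubeconfig", Namespace: namespace}
            if err := c.Get(ctx, key, &secret); err != nil {
                if apierrors.IsNotFound(err) {
                    // Not provisioned yet: come back later rather than
                    // creating syncs that would fail against a missing
                    // kubeconfig.
                    return ctrl.Result{RequeueAfter: 30 * time.Second}, false, nil
                }
                return ctrl.Result{}, false, err
            }
            // Secret exists; the caller can create the GitRepository and
            // Kustomization objects targeting this cluster.
            return ctrl.Result{}, true, nil
        }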
  11. Factor out Kustomize ready-state func

    The assemblage controller has a func for calculating the ready state
    of the Kustomization objects it created. This is useful for the remote
    assemblage controller too, which is in a different layer (module
    rather than assemblage) -- so, recreate it in pkg/sync to be available
    to both layers.
    
    I changed the argument type; it seems like a calculation specific to
    the Flux Kustomize type. But if I need it for other things, perhaps it
    can be generalised, or just replaced with kstatus.
    
    Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>
    squaremo committed Apr 13, 2022 (commit c577fe6)
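    A rough sketch of such a ready-state calculation, assuming the Flux
    Kustomization API (v1beta2 here) and its standard Ready condition; the
    real helper in pkg/sync may compute a richer state.

        // Sketch only: derive a ready state from a Flux Kustomization's
        // conditions.
        package sync

        import (
            kustomv1 "github.com/fluxcd/kustomize-controller/api/v1beta2"
            apimeta "k8s.io/apimachinery/pkg/api/meta"
            metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        )

        type ReadyState string

        const (
            ReadyStateReady    ReadyState = "Ready"
            ReadyStateNotReady ReadyState = "NotReady"
            ReadyStateUnknown  ReadyState = "Unknown"
        )

        // kustomizationReadyState reads the "Ready" condition that the Flux
        // kustomize-controller maintains on each Kustomization.
        func kustomizationReadyState(k *kustomv1.Kustomization) ReadyState {
            cond := apimeta.FindStatusCondition(k.Status.Conditions, "Ready")
            if cond == nil {
                return ReadyStateUnknown
            }
            switch cond.Status {
            case metav1.ConditionTrue:
                return ReadyStateReady
            case metav1.ConditionFalse:
                return ReadyStateNotReady
            default:
                return ReadyStateUnknown
            }
        }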
  12. Do more to record bootstrap module status

     - adapt the code for tracking assemblages and recording sync states
       and module summaries, from module_controller to
       bootstrapmodule_controller
     - use KustomizeReadyState from pkg/api/, consistently
     - try to get the RBAC annotations more accurate (again)
    
    Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>
    squaremo committed Apr 13, 2022 (commit 2af9d21)
  13. Watch owned objects to react to OOB changes

    It's expected (and useful) behaviour to recreate a controller-created
    object that is deleted out of band (e.g., by a pesky human). To make
    it work this way, it's usually only necessary for the controller to
    watch the objects in question, since a deletion will then trigger a
    reconciliation of the owner object.
    
    Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>
    squaremo committed Apr 13, 2022 (commit 8e7416e)
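    The usual controller-runtime wiring for this looks like the sketch
    below; the owner type and its import path are hypothetical
    placeholders, and the Flux API versions are assumed.

        // Sketch only: watch the owned Flux objects so an out-of-band
        // deletion triggers a reconcile of the owner, which recreates them.
        package sketch

        import (
            kustomv1 "github.com/fluxcd/kustomize-controller/api/v1beta2"
            sourcev1 "github.com/fluxcd/source-controller/api/v1beta2"
            ctrl "sigs.k8s.io/controller-runtime"
            "sigs.k8s.io/controller-runtime/pkg/reconcile"

            // Hypothetical import path standing in for this repository's
            // API package.
            fleetv1 "example.com/fleeet/api/v1alpha1"
        )

        func setupWithManager(mgr ctrl.Manager, r reconcile.Reconciler) error {
            return ctrl.NewControllerManagedBy(mgr).
                For(&fleetv1.RemoteAssemblage{}).
                // Deleting one of these out of band enqueues the owning
                // RemoteAssemblage, whose reconcile recreates it.
                Owns(&sourcev1.GitRepository{}).
                Owns(&kustomv1.Kustomization{}).
                Complete(r)
        }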
  14. Wait for clusters to be marked as ready

    Previously, the remote assemblage controller was made to skip (and
    come back to) a cluster when its kubeconfig secret is missing. This
    accounts broadly for races in provisioning clusters, since ultimately
    it's the secret that matters.
    
    However, with Cluster API providers it's possible for a kubeconfig
    secret to exist before the cluster is ready to be used. This situation
    _also_ leads to long backoffs and apparent lack of responsiveness. To
    mitigate that, this commit adds a similar mechanism to the above, at
    another layer: clusters that are not marked as ready (in
    `.status.controlPlaneReady`) are skipped. This is at the module (and
    bootstrapmodule) level, because it's modules that react to clusters --
    assemblages just refer to kubeconfig secrets.
    
    Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>
    squaremo committed Apr 13, 2022 (commit 10e23fe)
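    A minimal sketch of the readiness filter, assuming Cluster API's
    v1beta1 types, which expose `.status.controlPlaneReady` as
    `Status.ControlPlaneReady`.

        // Sketch only: skip clusters whose control plane is not yet ready.
        package sketch

        import (
            clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
        )

        // readyClusters filters out clusters that are not marked ready, so
        // modules don't create syncs that would back off against an unusable
        // cluster; skipped clusters are picked up on a later reconcile.
        func readyClusters(clusters []clusterv1.Cluster) []clusterv1.Cluster {
            var ready []clusterv1.Cluster
            for _, c := range clusters {
                if c.Status.ControlPlaneReady {
                    ready = append(ready, c)
                }
            }
            return ready
        }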
  15. Poll for assemblage updates in the downstream

    The ProxyAssemblage controller does not have the opportunity to be
    notified of changes in a downstream cluster (or at least, this would
    be tricky to arrange, especially at scale), so it won't in general see
    when its Assemblage counterpart changes status.
    
    A dumb but effective way to make sure updates are seen is to simply
    requeue each ProxyAssemblage to be re-examined after a small
    interval. This will be OK at moderate scale (I would guess O(100)
    clusters), since more worker threads can be configured to cope with
    polling. Larger scales may need a less brute-force approach to be
    figured out.
    
    Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>
    squaremo committed Apr 13, 2022 (commit 77bddb3)
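    In controller-runtime terms the requeue pattern is the sketch below;
    the interval and the reconciler skeleton are illustrative.

        // Sketch only: requeue after a fixed interval so status changes in
        // the downstream cluster are eventually noticed without a watch.
        package sketch

        import (
            "context"
            "time"

            ctrl "sigs.k8s.io/controller-runtime"
        )

        const downstreamPollInterval = 20 * time.Second // illustrative value

        // ProxyAssemblageReconciler stands in for the real reconciler type.
        type ProxyAssemblageReconciler struct{}

        func (r *ProxyAssemblageReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
            // ... fetch the ProxyAssemblage, connect to the downstream
            // cluster, and mirror the counterpart Assemblage's status
            // (elided) ...

            // Ask to be re-examined after a small interval; with enough
            // worker threads this polling is fine at moderate scale.
            return ctrl.Result{RequeueAfter: downstreamPollInterval}, nil
        }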
  16. Tidy up demo scripts and instructions

     - fix a dependency loop in the demo with CRDs and their use
     - fix a typo in the enrol script
     - give some clues about manually cleaning up
    
    Signed-off-by: Michael Bridgen <mikeb@squaremobius.net>
    squaremo committed Apr 13, 2022 (commit 7755045)