
common update #89

Merged: 190 commits into v2.5 on Aug 1, 2023
Conversation

day0hero
Collaborator

Updated common to fix CI issues with ESO caused by the SECCOMP profile.

claudiol and others added 30 commits March 21, 2023 10:06
mbaldessari and others added 29 commits July 16, 2023 09:15
* Updated namespaces template to include labels and annotations functionality

* Added schema validation to support an additional format for labels and annotations

* Updated the values-example.yaml to include new format for namespaces

* Updated Changes.md to include new namespaces functionality.

* Updating CI tests

* Fixed Markdown errors

* Add an experimental letsencrypt chart

This change adds an experimental letsencrypt chart that allows a pattern
user/developer to have all routes and the API endpoint use signed
certificates by letsencrypt.

At this stage only AWS is supported. The full documentation is contained
in the chart's README.md file

* Do not run kubeconform on the certificate stuff just yet

* Fix up kustomize example

In the same vein as Industrial Edge 57f41dc135f72011d3796fe42d9cbf05d2b82052
we call kustomize build.

Newer gitops versions dropped the openshift-clients rpm by default, which
contained kubectl. Let's just invoke "kustomize" directly, as the binary
is present in both old and new gitops versions.

Since "kubectl kustomize" builds the set of resources by default, we
need to switch to "kustomize build" by default.

We also use the same naming conventions used in Industrial Edge while
we're at it.
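
For reference, a minimal sketch of a config-management plugin that shells out to kustomize directly (the plugin name and arguments here are illustrative, not necessarily what the example ships):

  configManagementPlugins: |
    - name: kustomize-build
      generate:
        command: ["sh", "-c"]
        args: ["kustomize build ."]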

* Upgrade vault-helm to v0.24.0

Tested on MCG with hub and spoke

* Add a hello-world ansible playbook example

Just a simple example that reads a helm value and puts it in a configmap

* Inject ANSIBLE_CONFIG in make ansible-lint

* Use new ansible-lint action

* Fix some ansible-lint warnings

* Fix up python versions

* Skip cannot find role error

Avoid checking those two playbooks; the action seems to be too limited
to understand where the ansible.cfg is.

* Added health check for pvc resource in argocd.yaml

This allows argo to continue rolling out the rest of the applications.
Without the health check the application is stuck in a progressing state
and will not continue, thus preventing any downstream application from
deploying.
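
For illustration, such a health check in the ArgoCD custom resource might look roughly like this (field names from the GitOps operator API; the exact Lua logic shipped in argocd.yaml may differ):

  resourceHealthChecks:
    - kind: PersistentVolumeClaim
      check: |
        hs = {}
        hs.status = "Healthy"
        hs.message = "Considering PVC healthy"
        if obj.status ~= nil and obj.status.phase ~= nil then
          hs.message = "PVC is " .. obj.status.phase
        end
        return hs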

* adding tests

* Update super-linter image to latest

* Update super-linter image to latest

* Update CI workflows

* Updated template with a comment explaining why it was implemented

* Add dependabot settings for github actions

* adding tests

* - Added functionality to support the following format for labels and annotations:
      labels:
        openshift.io/node-selector: ""
      annotations:
        openshift.io/cluster-monitoring: "true"
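
A values sketch of the resulting namespaces format might look like this (namespace names are examples; the authoritative format is in values-example.yaml and the chart's schema):

  namespaces:
    - open-cluster-management
    - config-demo:
        labels:
          openshift.io/node-selector: ""
        annotations:
          openshift.io/cluster-monitoring: "true"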

* Fixed CI Issues

* Applying @claudiol recommendation

* make test

* Avoid exited containers proliferation

When running the `pattern.sh` script multiple times, a lot of exited
podman containers are left on the machine. Adding the `--rm` parameter
to `podman run` makes podman automatically delete the exited containers,
leaving the machine cleaner.

* Handling of pre-release builds is too complex for a helm chart

Generating the ICSP and allowing insecure registries is best done prior
to helm upgrade, and requires VPN access to registry-proxy.engineering.redhat.com

* Fixing issues with operator groups

* Adding CI test

* Updated operator group template

* Updating CI issues

* Removed duplicate code for operatorgroup by using multiple conditions

* Allow overriding the pattern's name

This is especially useful when multiple people are working on a pattern
and have been using different names:

    $ make help |grep Pattern:
    Pattern: multicloud-gitops
    $ make NAME=foobar help |grep Pattern:
    Pattern: foobar

* Add precise instruction to upgrade the vault subchart

* Upgrade vault-helm to v0.24.1

* Add an item to README.md

* Fix up common/ tests

* Fix super linter

* Set gitOpsSpec.operatorSource

After merging validatedpatterns/patterns-operator@235b303
it is now effectively possible to pick a different catalogSource for
the gitops operator. This is needed in order to allow CI to install
the gitops operator from an IIB

* Introduce EXTRA_HELM_OPTS

This variable can be set in order to pass additional helm arguments from
the command line, i.e. we can set things without having to tweak values
files. So it is now possible to run something like the following:

  ./pattern.sh make install \
  EXTRA_HELM_OPTS="--set main.gitops.operatorSource=iib-49232"

* Disable var-naming[no-role-prefix] in ansible lint

* Add new ansible role to deal with IIBs

* Simplify load-iib target

* Add templates folder

* Fix a couple of linting warnings

* Fix some super-linter complaints

* Skip the iib-ci playbook

* Drop var-naming[no-role-prefix] linter

* Allow for multiple images when calling load-iib

* Add help for load-iib

* Output index_image in make

* Output index_image in make (2)

* Set facts later in the playbook not in defaults/

* Fix how we export vars in make load-iib

* Fix how we export vars in make load-iib (2)

* Use machineCount to register the number of nodes that need to be ready

* Add helpful debug messages

* Add | on shell now that we call pipefail

* Test dropping nevercontact source

* Skip insecure tls when logging in

* Also allow ghcr.io

* Revert "Test dropping nevercontact source"

This reverts commit d8746a37fce2663018f52203c892f00b825e32a7.

* Fix typo

* Clarify instructions in the README file

* Automate the channel example

* Find out KUBEADMINAPI programmatically

* Use command instead of shell

* Do not grep for operator bundle unless it is the gitops operator

* Also whitelist ghcr.io

* Fetch the operator bundle itself in a more robust way

It seems that the operator bundle image itself is nowhere to be found
inside any OCP cluster object (it's not in packagemanifests nor in any
catalogsource), so we resort to parsing the IIB via opm alpha commands
to fetch the exact image.

* Add more mirrors

* Some more work to support MCE

* Cleanup spacing

* Fix super-linter

* Move task in right folder

* Drop last mention of operator instead of item

* Improve the grepping for the operator bundle

Without also grepping for the default_channel we can end up getting
multiple results, which breaks everything.

Tested this and it fixed the issue I was seeing with the
openshift-gitops-operator this morning

* Drop display_skipped_hosts

display_skipped_hosts=False has a horrible side-effect:
when a task takes a long time, the task actually running is the *next*
one, not the one printed on the screen/log. That is because ansible has
to wait for the task to finish before printing it, as it does not know
beforehand whether the host will be skipped and hence whether the task
should be displayed at all.

* Be more specific about the steps in the README

* Upgrade ESO to v0.8.2

* Update README.md

* Update tests after eso 0.8.2 upgrade

* Move to new spec format for dex/sso

Via https://issues.redhat.com/browse/GITOPS-2761 we are told that the
dex configuration has a new format.
Old format:

    spec:
      dex:
        openShiftOAuth: true
        resources:
        ...

New format:

    spec:
      sso:
        provider: dex
        dex:
          openShiftOAuth: true
          resources:
          ...

This format is only supported starting with gitops-1.8.0, so we should
merge this only when we are absolutely sure that no pattern in any
situation needs an older gitops version.

Tested on MCG with gitops-1.8.2

Note: with this change gitops < 1.8 is not supported. Starting with
gitops-1.9 the old format will be unsupported.

* Disable ArgoCD from kubeconform

The reason is that most of the tools we used to generate the json
schemas seem to be unmaintained, so it is getting hard to update
our schemas in our GH org. We'll need to revisit this in the future.

* Add a short line about username/token for the iib role on OCP <= 4.12

* Drop https:// from podman login

Seems we hit https://www.github.com/containers/podman/issues/13691 at
least with older podman versions.

If this turns out to break podman 4.5.0 I will special case it later

* Set the mce-subscription-spec annotation

We set it to "redhat-operators" by default, or to .Values.clusterGroup.subscriptions.acm.source if that is defined.
The reason we do this is the following:
1. In a default deployment scenario MCE has to be deployed as normal
   from the redhat-operators catalogSource just as ACM is
2. When we deploy gitops-operator from an IIB instead, MCE would be
   installed trying to get it from the IIB because
   https://www.github.com/stolostron/multiclusterhub-operator/pull/975
   made it so that it picks the latest version looking at all catalog
   sources. But since we only mirrored the gitops operator in the
   cluster, this breaks as the images for MCE from the IIB are not there
   By setting the default to "redhat-operators" we fix this case
3. Now in the case where we want to install ACM from an IIB we need to
   be able to override this and we will pick whatever value is set in
   .Values.clusterGroup.subscriptions.acm.source, which will need to be
   defined for this to work when testing ACM+MCE from an IIB

Note: Currently point 3. works only if you set it in a values file.
Setting .Values.clusterGroup.subscriptions.acm.source via extraParams
won't work, as it is not passed down from the clusterGroup app to the
applications. It's a bug that we need to fix.

Note(2): We surround this with an 'if kindIs "map" .Values.clusterGroup.subscriptions'
because we do not want to break things if subscriptions is a list and not
a map. If we ever manage to drop the list form of subscriptions, we can
remove that conditional.
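
Roughly, the templating described above could look like the following sketch (the annotation key and JSON value shape are assumed from the MultiClusterHub installer convention; the real template lives in the acm chart):

  {{- if kindIs "map" .Values.clusterGroup.subscriptions }}
  annotations:
    installer.open-cluster-management.io/mce-subscription-spec: '{"source": "{{ dig "acm" "source" "redhat-operators" .Values.clusterGroup.subscriptions }}"}'
  {{- end }}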

* Fix typo in README for iib

* Simplify the README a bit

* Add support for extraParams being passed down to all applications

Via validatedpatterns/patterns-operator#74
we add the extraParams in an extraParametersNested dictionary that holds
the extraParams key/value pairs. If they exist, let's add them as
parameters.

This allows them to end up in the applications.
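
A minimal sketch of how the nested parameters could be emitted in an Application template (value name from the commit above; exact placement and indentation in the chart may differ):

  source:
    helm:
      parameters:
      {{- range $key, $value := $.Values.extraParametersNested }}
        - name: {{ $key }}
          value: {{ $value | quote }}
      {{- end }}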

* Add a lookup playbook to figure out IIB numbers

* Allow overriding channel and source when installing the patterns-operator

This will allow us to test the patterns-operator using a different
catalogsource (potentially installed via an IIB). So we can run:

make EXTRA_HELM_OPTS="\
  --set main.extraParameters[0].name=main.patternsOperator.channel --set main.extraParameters[0].value=slow \
  --set main.extraParameters[1].name=main.patternsOperator.source --set main.extraParameters[1].value=patten-index" install

* Fix small typo in iib instructions

* Drop a redirect and up retries when pushing the IIB to the internal registry

* Update ESO to v0.8.3

* WIP add presync for eso that waits for vault to be up

* Add tests

* Fix image and comment

* Adding rbac to support the vault sa checking on the vault-0 pod status.

* Make Test

* Revert "Make Test"

This reverts commit 64e9dc7.

* Revert "Adding rbac to support the vault sa checking on the vault-0 pod status."

This reverts commit 598bc74.

* Revert "Fix image and comment"

This reverts commit d4d3fe1.

* Revert "Add tests"

This reverts commit ab5532a.

* Revert "WIP add presync for eso that waits for vault to be up"

This reverts commit 2797699.

* Increase the default retry limit when syncing

ArgoCD will retry 5 times by default to sync an application in case of
errors and then will give up. So if an application contains a reference
to a CRD that has not been installed yet (say because it will be
installed by another application), it will error out and retry later.
This happens by default for a maximum of 5 times [1]. After those 5 times
the application will give up and will stay in Degraded mode and
eventually move to Failed. In this case a manual sync will usually fix
the application just fine (i.e. as long as the missing CRD has been
installed in the meantime).

Now to solve this issue we can add complex preSync Jobs that wait for
the needed resources, but this fundamentally breaks the simplicity of
things and introduces unneeded dependencies. In this change we just
increase the default retry limit to something larger (20) that should
cover most cases. The retry limit functionality is rather undocumented
currently in the docs but is defined at [2] and also shown at [3].

In our patterns' case the concrete issue happened as follows:
1. ESO ClusterSecrets were often not synced/degraded
2. We introduced a Job in a preSync hook for the ESO chart that would
   wait on vault to be ready before applying the rest of ESO
3. MCG started failing because the config-demo app had already tried to
   sync 5 times and failed every time because the ESO CRDs were not
   installed yet (due to ESO waiting on vault)

So instead of adding yet another job, let's just try a lot more often.
We picked 20 as a sane default because that should have argo try for
about 60 minutes (3min is the default maximum backoff limit)

Tested with two MCG installations (with the ESO Job hook included) and
both worked out of the box. Whereas before I managed to get three
failures out of three installs.

[1] https://github.com/argoproj/argo-cd/blob/master/controller/appcontroller.go#L1680
[2] https://github.com/argoproj/argo-cd/blob/master/manifests/crds/application-crd.yaml#L1476
[3] https://github.com/argoproj/argo-cd/blob/master/docs/operator-manual/application.yaml#L202C18-L202C100
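
In Application terms this boils down to the syncPolicy retry block, roughly (how the chart plumbs the limit through its values is assumed):

  syncPolicy:
    automated: {}
    retry:
      limit: 20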

* Add Changes.md entry

* Split off global helm variables to a helper definition

We can only split out bits of yaml that reference $.* variables. This is
because these snippets in _helpers.tpl are passed a single context,
either $ or ., and cannot use both the way top-level templates can.
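
As a sketch of the idea (helper and parameter names are illustrative), a snippet that only references $.* variables can be defined once and then included with the $ context:

  {{- define "clustergroup.app.globalvalues.helmparameters" -}}
  - name: global.repoURL
    value: {{ $.Values.global.repoURL }}
  - name: global.targetRevision
    value: {{ $.Values.global.targetRevision }}
  {{- end }}

  # later, inside an Application template:
  parameters:
    {{- include "clustergroup.app.globalvalues.helmparameters" $ | nindent 4 }}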

* Switch ApplicationSets to use the newly-introduced helpers

I only remove the variables that are defined identically in the
ApplicationSet and in the helper, leaving the other ones as-is,
as their presence is not entirely clear to me and I do not want to
risk breaking things.

* Split off valueFiles to _helpers.tpl

* Switch applicationsets to use the new helper

* Drop some older comments

* Tweak the load secret debug message to be clearer

When HOME is set we replace it with '~' in this debug message
because when run from inside the container the HOME is /pattern-home
which is confusing for users. Printing out '~' when it is at the start of
the string is less confusing.

Before:
ok: [localhost] => {
    "msg": "/home/michele/.config/hybrid-cloud-patterns/values-secret-multicloud-gitops.yaml"
}

After:
ok: [localhost] => {
    "msg": "~/.config/hybrid-cloud-patterns/values-secret-multicloud-gitops.yaml"
}

* Check if the KUBECONFIG file is pointing outside of the HOME folder

If it is somewhere under /tmp or out of the HOME folder, bail out
explaining why. This has caused a few silly situations where the user
would save the KUBECONFIG file under /tmp. Since bind-mounting /tmp
seems like a wrong thing to do in general, we at least bail out with a
clear error message. To do this we rely on bash functionality, so let's
just switch the script to use that.

Tested as follows:
export KUBECONFIG=/tmp/kubeconfig
./scripts/pattern-util.sh make help
/tmp/kubeconfig is pointing outside of the HOME folder, this will make it unavailable from the container.
Please move it somewhere inside your /home/michele folder, as that is what gets bind-mounted inside the container

export KUBECONFIG=~/kubeconfig
./scripts/pattern-util.sh make help
Pattern: common

Usage:
  make <target>
...

* Include an example SNO cluster pool in the tests

* Enforce lowercase names for cluster claims

* Avoid mixing yaml and json in the OCP install-config

* Update provisioning tests

* Sanely handle cluster pools with no clusters (yet)

* Clustergroup Chart.yaml name change

We currently have a small inconsistency where we use common/clustergroup
in order to point Argo CD to this chart, but the name inside the chart
is 'pattern-clustergroup'.

This inconsistency is currently irrelevant, but in the future when
migrating to helm charts inside proper helm repos, this becomes
problematic. So let's fix the name to be the same as the folder.

Tested on MCG successfully.
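
The change itself is essentially a one-liner in the chart metadata, along these lines:

  # common/clustergroup/Chart.yaml (sketch)
  apiVersion: v2
  name: clustergroup   # was: pattern-clustergroup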

* Fix the clusterPoolName in clusterClaims

Currently with the following values snippet:

  managedClusterGroups:
    exampleRegion:
      name: group-one
      acmlabels:
      - name: clusterGroup
        value: group-one
      helmOverrides:
      - name: clusterGroup.isHubCluster
        value: false
      clusterPools:
        exampleAWSPool:
          size: 1
          name: aws-ap-bandini
          openshiftVersion: 4.12.24
          baseDomain: blueprints.rhecoeng.com
          controlPlane:
            count: 1
            platform:
              aws:
                type: m5.2xlarge
          workers:
            count: 0
          platform:
            aws:
              region: ap-southeast-2
          clusters:
          - One

You will get a clusterClaim that is pointing to the wrong Pool:
NAMESPACE                 NAME                       POOL
open-cluster-management   one-group-one              aws-ap-bandini

This is wrong because the clusterPool name will be generated using the
pool name + "-" group name:

  {{- $pool := . }}
  {{- $poolName := print .name "-" $group.name }}

But the clusterPoolName inside the clusterName is only using the
"$pool.name" which will make the clusterClaim ineffective as the pool
does not exist.

Switch to using the same poolName that is being used when creating the
clusterPool.
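
A simplified sketch of the fix in the ClusterClaim template (the real template carries more fields):

  {{- $poolName := print $pool.name "-" $group.name }}
  apiVersion: hive.openshift.io/v1
  kind: ClusterClaim
  spec:
    clusterPoolName: {{ $poolName }}  # previously only $pool.name was used here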

* Add some comments to make if/else and loops clearer

Let's improve readability by adding some comments to point out which
flow constructs are being ended.
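
For example, a closing construct gains a trailing comment along these lines (the comment wording is illustrative):

  {{- range .Values.clusterGroup.applications }}
  # ... application definition ...
  {{- end }} {{- /* range .Values.clusterGroup.applications */}}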

* Add some more comments in applications.yaml

* Add a default for options applicationRetryLimit

* Split out values files to a helper for the acm chart

Just like we did for the clustergroup chart, let's split the values
file list into a dedicated helper. This time since there are no global
variables we include it with the current context and not with the '$'
context.

Tested with MCG: hub and spoke. Correctly observed all the applications
running on the spoke.
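
Schematically (helper and file names are illustrative), the value-file list moves into a define that is included with the current context rather than $:

  {{- define "acm.app.policies.helmvaluefiles" -}}
  - "/values-global.yaml"
  - "/values-{{ .name }}.yaml"
  {{- end }}

  valueFiles:
    {{- include "acm.app.policies.helmvaluefiles" . | nindent 2 }}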

* Fix up tests

They changed because we made the list indentation more correct (two
extra spaces to the left)

* Fix sa/namespace mixup in vault_spokes_init

* Update local patch

Also set seccompProfile to null to make things work on OCP 4.10
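
The gist of the patch is to null out the profile that newer ESO releases set, roughly (the exact path inside the ESO values is assumed):

  securityContext:
    seccompProfile: null  # newer ESO defaults would otherwise break on OCP 4.10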

* Update ESO to 0.8.5

* Tweak ESO UBI images

Tested the ESO upgrade on MCG on both 4.10 and 4.13

* Removed previous version of common to convert to subtree from https://github.com/hybrid-cloud-patterns/common.git main

* make test

---------

Co-authored-by: Lester Claudio <claudiol@redhat.com>
Co-authored-by: Michele Baldessari <michele@acksyn.org>
Co-authored-by: Lorenzo Dalrio <ldalrio@redhat.com>
Co-authored-by: Andrew Beekhof <andrew@beekhof.net>
Co-authored-by: Martin Jackson <mhjacks@redhat.com>
Co-authored-by: Tom Stockwell <2060486+stocky37@users.noreply.github.com>
Error out from load-iib when INDEX_IMAGES is undefined

If you call the load-iib target you *must* set INDEX_IMAGES, so
let's error out properly if you do not.

Tested as:

        $ unset INDEX_IMAGES
        $ make load-iib
        make -f common/Makefile load-iib
        make[1]: Entering directory '/home/michele/Engineering/cloud-patterns/multicloud-gitops'
        No INDEX_IMAGES defined. Bailing out

        $ export INDEX_IMAGES=foo
        make load-iib
        make -f common/Makefile load-iib
        make[1]: Entering directory '/home/michele/Engineering/cloud-patterns/multicloud-gitops'

        PLAY [IIB CI playbook] ***
git-subtree-dir: common
git-subtree-mainline: eb45d81
git-subtree-split: 35e64a1
day0hero merged commit a246609 into v2.5 on Aug 1, 2023
4 checks passed