Troubleshooting

This page collects the troubleshooting hints we gathered while working with our long-living cluster, which hosts an RHTAP deployment from our infra-deployments repo fork.

Bootstrap job failing

If the bootstrap job is failing, check the status of all applications in ArgoCD. You can also check the Home -> Overview page on the cluster and look at the Cluster inventory -> Pods section; any crashing pods should show up there.
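
To get a quick overview from the command line, you can list the applications that are not healthy or synced. A minimal sketch, assuming the ArgoCD applications live in the openshift-gitops namespace and jq is available:

# List ArgoCD applications that are not both Healthy and Synced
oc get applications -n openshift-gitops -o json \
  | jq -r '.items[] | select(.status.health.status != "Healthy" or .status.sync.status != "Synced") | .metadata.name'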

Subscription issues

This can happen when an operator update fails. It results in two ClusterServiceVersions, one old and one new, conflicting over the subscription as both reference it.

Symptoms

constraints not satisfiable:
...
originate from package <package>, clusterserviceversion
<operator> exists and is not referenced by a subscription, subscription <subscription> requires at least one of ...

Solution

Go to the openshift-operator-lifecycle-manager namespace and restart both the catalog-operator and olm-operator pods, as they sometimes keep cached data that prevents proper updates.
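
Restarting here just means deleting the pods and letting their deployments recreate them. A sketch, assuming the default app labels on the OLM pods:

oc delete pod -n openshift-operator-lifecycle-manager -l app=catalog-operator
oc delete pod -n openshift-operator-lifecycle-manager -l app=olm-operator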

Private repo authentication

The shared-configuration-file has to include a secret with the actual client ID and client secret of the GitHub OAuth application. The OAuth application used by tests is created under the hac-test GitHub test user. The password of this user is stored in Vault.
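
The exact layout of the secret is dictated by whatever consumes shared-configuration-file, so the following is only a hypothetical sketch; the key names clientId and clientSecret and the target namespace are assumptions to verify against the actual consumer:

# Hypothetical: create the OAuth secret (key names are assumptions)
oc create secret generic shared-configuration-file \
  --from-literal=clientId=<github-oauth-client-id> \
  --from-literal=clientSecret=<github-oauth-client-secret> \
  -n <target-namespace>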

Toolchain stops provisioning users

This may be caused by reaching the resource limit for auto-approval, which is set to 80% by default. You can check the resource consumption by running:

oc get toolchainstatus -n toolchain-host-operator -o yaml

Keycloak stops provisioning users

Keycloak stops serving new users for the PR check jobs. The PostgreSQL database is outdated and unable to upgrade (10 -> 13). It is not clear why this keeps happening after it was already fixed once.

Symptoms

In the PR check, you can see this instead of user provisioning:

 # Call the keycloak API and add a user
 B64_USER=$(oc get secret ${ENV_NAME}-keycloak -o json | jq '.data.username'| tr -d '"')
 B64_PASS=$(oc get secret ${ENV_NAME}-keycloak -o json | jq '.data.password' | tr -d '"')
 # These ENVs are populated in the Jenkins job by Vault secrets
 python tmp/keycloak.py $HAC_KC_SSO_URL $HAC_KC_USERNAME $HAC_KC_PASSWORD $B64_USER $B64_PASS $HAC_KC_REGISTRATION
 Traceback (most recent call last):
   File "/var/lib/jenkins/workspace/openshift-hac-dev-pr-check/build/container_workspace/.bonfire_venv/lib64/python3.6/site-packages/requests/models.py", line 910, in json
     return complexjson.loads(self.text, **kwargs)
   File "/usr/lib64/python3.6/json/__init__.py", line 354, in loads
     return _default_decoder.decode(s)
   File "/usr/lib64/python3.6/json/decoder.py", line 339, in decode
     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
   File "/usr/lib64/python3.6/json/decoder.py", line 357, in raw_decode
     raise JSONDecodeError("Expecting value", s, err.value) from None
 json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Solution

Check the dev-sso namespace in the cluster or in ArgoCD; there should be failing pods. The PostgreSQL pod should be in CrashLoopBackOff and have logs like:

Incompatible data directory.  This container image provides
PostgreSQL '13', but data directory is of
version '10'.
This image supports automatic data directory upgrade from
'12', please _carefully_ consult image documentation
about how to use the '$POSTGRESQL_UPGRADE' startup option.
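
To find the failing pod and confirm the message, something like this is enough (the pod name is a placeholder):

oc get pods -n dev-sso
oc logs <keycloak-postgresql-pod> -n dev-sso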

The $POSTGRESQL_UPGRADE option is not suitable for us, as ArgoCD prevents us from changing the relevant files. The only fix is to remove the PersistentVolumeClaim and PersistentVolume hosting the PostgreSQL database. You can find both in the Storage section. There is only one PVC in the dev-sso namespace, but make sure to remove the PV with the keycloak-postgresql-claim claim. After removing them, you have to manually remove the finalizers from their YAMLs; otherwise, they stay in a Terminating state forever.
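
On the command line, the deletion and finalizer cleanup look roughly like this. A sketch: the PV name has to be looked up first, and the Keycloak database contents are lost.

# Find the PV bound to the claim
oc get pv | grep keycloak-postgresql-claim
# Delete the PVC and the PV (they will hang in Terminating)
oc delete pvc keycloak-postgresql-claim -n dev-sso --wait=false
oc delete pv <pv-name> --wait=false
# Remove the finalizers so the objects actually go away
oc patch pvc keycloak-postgresql-claim -n dev-sso --type=merge -p '{"metadata":{"finalizers":null}}'
oc patch pv <pv-name> --type=merge -p '{"metadata":{"finalizers":null}}'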

SPI update fails

From time to time during the SPI update, Vault is sealed and prevents the update from finishing. The bootstrap job contains a step that unseals Vault in case the bootstrap job fails. To do it manually, go to the spi-vault namespace and, in the pod terminal, run the /vault/userconfig/scripts/poststart.sh script.
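
The same can be done without opening the pod terminal. A sketch, assuming the Vault pod is named vault-0, as is usual for the Vault StatefulSet:

oc exec -n spi-vault vault-0 -- /vault/userconfig/scripts/poststart.sh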

Components are not being deployed after build

Components are created and built successfully, but their pods never get spun up afterwards.

Symptoms

Component builds finish, but their respective SnapshotEnvironmentBinding shows something like

componentDeploymentConditions:
    - lastTransitionTime: '2023-07-17T12:23:46Z'
      message: 0 of 1 components deployed
      reason: CommitsUnsynced
      status: 'False'
      type: AllComponentsDeployed

No pod running the component gets deployed.

Solution

The GitOps service is blocked by one of the tenant namespaces being inaccessible. So far, this has only happened when the namespace in question was stuck in the Terminating state. Check the application controller in the gitops-service-argocd namespace and look for errors like:

{\"lastTransitionTime\":\"2023-07-17T14:48:16Z\",\"message\":\"error synchronizing cache state : failed to sync cluster https://172.30.0.1:443: failed to load initial state of resource RoleBinding.rbac.authorization.k8s.io: rolebindings.rbac.authorization.k8s.io is forbidden: User \\\"system:serviceaccount:gitops-service-argocd:gitops-service-argocd-argocd-application-controller\\\" cannot list resource \\\"rolebindings\\\" in API group \\\"rbac.authorization.k8s.io\\\" in the namespace \\\"50y1wy7c-tenant\\\"\",\"type\":\"UnknownError\"} 

Delete the mentioned namespace. More generally, delete any namespace stuck in Terminating, just to be sure.
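
Namespaces stuck in Terminating can be listed in one go (assumes jq is available):

oc get namespaces -o json \
  | jq -r '.items[] | select(.status.phase == "Terminating") | .metadata.name'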

Enterprise Contract keeps failing on image signing

The compulsory Enterprise Contract integration test keeps failing regardless of the source or runtime image.

Symptoms

EC consistently fails with the following

violations:
  - metadata:
      code: builtin.attestation.signature_check
      description: The attestation signature matches available signing materials.
      title: Attestation signature check passed
    msg: No image attestations found matching the given public key. Verify the correct
      public key was provided, and one or more attestations were created.
  - metadata:
      code: builtin.image.signature_check
      description: The image signature matches available signing materials.
      title: Image signature check passed
    msg: 'Image signature check failed: no signatures found for image'

Solution

Re-run the tekton-chains-secrets-migration Job to propagate the latest signing-secrets into the tekton-chains and openshift-pipelines namespaces.

Check the secret called signing-secrets in the openshift-pipelines namespace. If the secret doesn't exist, or doesn't match the secret of the same name in the tekton-chains namespace, copy the secret from tekton-chains to openshift-pipelines.
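
The comparison and the copy can be scripted. A sketch, assuming bash and jq:

# Compare the secret data in both namespaces
diff <(oc get secret signing-secrets -n tekton-chains -o json | jq .data) \
     <(oc get secret signing-secrets -n openshift-pipelines -o json | jq .data)
# Copy the secret over, stripping namespace-bound metadata
oc get secret signing-secrets -n tekton-chains -o json \
  | jq 'del(.metadata.namespace, .metadata.resourceVersion, .metadata.uid, .metadata.creationTimestamp, .metadata.ownerReferences)' \
  | oc apply -n openshift-pipelines -f -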

PostgreSQL pod is in CrashLoopBackOff state

It happened in the gitops namespace but may happen for any PostgreSQL deployment.

Symptoms

The pod running PostgreSQL is in CrashLoopBackOff (along with all the other pods that want to connect to it). In the logs, there is just one line saying:

chmod: changing permissions of '/var/lib/pgsql/data/userdata': Operation not permitted

A restart of the pod did not solve it.

Solution

Inspired by a solution on the RH Customer Portal, I tried changing the permissions of the /var/lib/pgsql/data/userdata folder. I had issues executing chmod through the oc debug command. What worked for me was removing the folder and creating it again (which meant losing the data stored there).
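
The whole workaround looks roughly like this; <postgres-deployment> is a placeholder, and the second step deletes all database contents:

# Start a debug pod with the same volumes mounted
oc debug deployment/<postgres-deployment> -n gitops
# Inside the debug pod: recreate the data directory (all data is lost!)
rm -rf /var/lib/pgsql/data/userdata
mkdir /var/lib/pgsql/data/userdata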

Environments not provisioning/deleting

New user environments are not being provisioned or deleted, which blocks new UserSignups from completing.

Symptoms

The GitOps service manager keeps reporting TLS errors. The toolchain host reports something like:

Error from server (InternalError): error when replacing "STDIN": Internal error occurred: failed calling webhook "venvironment.kb.io": failed to call webhook: Post "https://gitops-appstudio-service-webhook-service.gitops.svc:443/validate-appstudio-redhat-com-v1alpha1-environment?timeout=10s": x509: certificate signed by unknown authority

Solution

Go to API -> ValidatingWebhookConfiguration and find the gitops-appstudio-service-validating-webhook-configuration instance. Look for the venvironment.kb.io webhook. Its clientConfig should contain a caBundle item identical to the one in the other webhooks. If it is empty, copy the value over from another webhook config.
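
To inspect the bundles quickly (assumes jq is available):

oc get validatingwebhookconfiguration gitops-appstudio-service-validating-webhook-configuration \
  -o json | jq '.webhooks[] | {name: .name, caBundle: .clientConfig.caBundle}'

If venvironment.kb.io shows an empty or null caBundle, oc edit the configuration and paste in the value from one of the other webhooks.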

GH hac-test backup recovery codes

GH hac-test recovery code