Troubleshooting
This page collects troubleshooting hints gathered while working with our long-living cluster, which hosts an RHTAP deployment from our infra-deployments repo fork.
If the bootstrap job is failing, check the status of all applications in ArgoCD. You can also check the Home -> Overview page on the cluster and look at the Cluster inventory -> Pods section; any crashing pods should show up there.
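The same pod check can be done from the CLI; a minimal sketch, where the cluster command is shown as a comment and the filter is demonstrated on illustrative sample output (not from a real cluster):

```shell
# On the cluster you would list non-running pods with:
#   oc get pods -A | awk 'NR > 1 && $4 !~ /Running|Completed/'
# Offline demonstration of the same filter on captured sample output:
sample='NAMESPACE   NAME         READY   STATUS             RESTARTS
dev-sso     keycloak-0   0/1     CrashLoopBackOff   12
gitops      repo-srv-1   1/1     Running            0'
printf '%s\n' "$sample" | awk 'NR > 1 && $4 !~ /Running|Completed/ {print $1, $2, $4}'
```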
This sometimes happens when an operator update fails. It leaves two operators in place, one old and one new, conflicting over the subscription that cross-references both:
```
constraints not satisfiable:
...
originate from package <package>, clusterserviceversion <operator> exists and is not referenced by a subscription, subscription <subscription> requires at least one of ...
```
Go to the `openshift-operator-lifecycle-manager` namespace and restart both the `catalog-operator` and `olm-operator` pods, as they sometimes keep cached data that may prevent proper updates.
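The restart can be scripted; a minimal sketch, assuming the default OLM labels `app=catalog-operator` and `app=olm-operator` (verify with `oc get pods -n openshift-operator-lifecycle-manager --show-labels`). The `echo` keeps it a dry run:

```shell
NS=openshift-operator-lifecycle-manager
for app in catalog-operator olm-operator; do
  # Deleting the pod makes its Deployment recreate it with a clean cache.
  echo oc delete pod -n "$NS" -l app="$app"   # drop `echo` to actually delete
done
```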
The `shared-configuration-file` has to include a secret with the actual client ID and client secret of the GitHub OAuth application. The OAuth application used by tests is created under the `hac-test` GitHub test user; the password of this user is stored in Vault.
May be caused by reaching the resource limit for auto-approval, which is set to 80% by default. You can check resource consumption by running:
```shell
oc get toolchainstatus -n toolchain-host-operator -o yaml
```
Keycloak stops serving new users for the PR check jobs. The PostgreSQL database is outdated and unable to upgrade (10 -> 13). It is not clear why this happens again after it was already fixed once.
In the PR check, you can see this instead of user provisioning:
```shell
# Call the keycloak API and add a user
B64_USER=$(oc get secret ${ENV_NAME}-keycloak -o json | jq '.data.username'| tr -d '"')
B64_PASS=$(oc get secret ${ENV_NAME}-keycloak -o json | jq '.data.password' | tr -d '"')
# These ENVs are populated in the Jenkins job by Vault secrets
python tmp/keycloak.py $HAC_KC_SSO_URL $HAC_KC_USERNAME $HAC_KC_PASSWORD $B64_USER $B64_PASS $HAC_KC_REGISTRATION
```
```
Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/openshift-hac-dev-pr-check/build/container_workspace/.bonfire_venv/lib64/python3.6/site-packages/requests/models.py", line 910, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib64/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
```
Check the `dev-sso` namespace in the cluster or in ArgoCD; there should be failing pods. The pod should be in `CrashLoopBackOff` and its logs should contain:
```
Incompatible data directory. This container image provides
PostgreSQL '13', but data directory is of
version '10'.
This image supports automatic data directory upgrade from
'12', please _carefully_ consult image documentation
about how to use the '$POSTGRESQL_UPGRADE' startup option.
```
The `$POSTGRESQL_UPGRADE` option is not suitable for us, as ArgoCD prevents us from changing the relevant files. The only fix is to remove the `PersistentVolumeClaim` and `PersistentVolume` hosting the PostgreSQL database. You can find both in the Storage section. There is one and only one PVC in the `dev-sso` namespace, but make sure to remove the PV with the `keycloak-postgresql-claim` claim. After removing them, you have to manually remove the `finalizers` from their YAMLs; otherwise, they stay in a Terminating state forever.
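The finalizer removal can be done with `oc patch`; a minimal sketch with `echo` as a dry run. `<pv-name>` is a placeholder — find the actual PV with `oc get pv | grep keycloak-postgresql-claim`:

```shell
# The PVC is namespaced; the PV is cluster-scoped.
echo oc patch pvc keycloak-postgresql-claim -n dev-sso \
  --type=merge -p '{"metadata":{"finalizers":null}}'
echo oc patch pv '<pv-name>' \
  --type=merge -p '{"metadata":{"finalizers":null}}'
```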
From time to time, during an SPI update, Vault gets sealed and prevents the update from finishing. The Bootstrap job contains a step that unseals Vault in case of a Bootstrap job failure. To do it manually, go to the `spi-vault` namespace and run the `/vault/userconfig/scripts/poststart.sh` script in the pod terminal.
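Sketched from the CLI, assuming the Vault pod is named `spi-vault-0` (check with `oc get pods -n spi-vault`); the `echo` keeps it a dry run:

```shell
# Run the unseal script inside the Vault pod (drop `echo` to execute).
echo oc exec -n spi-vault spi-vault-0 -- /vault/userconfig/scripts/poststart.sh
```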
Components are created and built successfully, but their pods never get spun up afterwards.
Component builds finish, but their respective `SnapshotEnvironmentBinding` shows something like:
```yaml
componentDeploymentConditions:
  - lastTransitionTime: '2023-07-17T12:23:46Z'
    message: 0 of 1 components deployed
    reason: CommitsUnsynced
    status: 'False'
    type: AllComponentsDeployed
```
No pod running the component gets deployed.
The GitOps service is blocked by one of the tenant namespaces being inaccessible. So far, this has only happened when the namespace in question was stuck in the `Terminating` state. Check the application controller in the `gitops-service-argocd` namespace and look for errors like:
```
{\"lastTransitionTime\":\"2023-07-17T14:48:16Z\",\"message\":\"error synchronizing cache state : failed to sync cluster https://172.30.0.1:443: failed to load initial state of resource RoleBinding.rbac.authorization.k8s.io: rolebindings.rbac.authorization.k8s.io is forbidden: User \\\"system:serviceaccount:gitops-service-argocd:gitops-service-argocd-argocd-application-controller\\\" cannot list resource \\\"rolebindings\\\" in API group \\\"rbac.authorization.k8s.io\\\" in the namespace \\\"50y1wy7c-tenant\\\"\",\"type\":\"UnknownError\"}
```
Delete the mentioned namespace. More generally, delete any namespaces stuck in terminating to be sure.
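Namespaces stuck in Terminating can be listed with a simple filter; the cluster command is shown as a comment and the filter is demonstrated on illustrative sample output:

```shell
# On the cluster: oc get ns | awk 'NR > 1 && $2 == "Terminating" {print $1}'
sample='NAME              STATUS        AGE
50y1wy7c-tenant   Terminating   3d
gitops            Active        40d'
printf '%s\n' "$sample" | awk 'NR > 1 && $2 == "Terminating" {print $1}'
```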
The compulsory Enterprise Contract integration test keeps failing no matter the source or runtime image. EC consistently fails with the following:
```yaml
violations:
  - metadata:
      code: builtin.attestation.signature_check
      description: The attestation signature matches available signing materials.
      title: Attestation signature check passed
    msg: No image attestations found matching the given public key. Verify the correct
      public key was provided, and one or more attestations were created.
  - metadata:
      code: builtin.image.signature_check
      description: The image signature matches available signing materials.
      title: Image signature check passed
    msg: 'Image signature check failed: no signatures found for image'
```
Re-run the `tekton-chains-secrets-migration` Job to propagate the latest `signing-secrets` into the `tekton-chains` and `openshift-pipelines` namespaces.
Check the secret called `signing-secrets` in the `openshift-pipelines` namespace. If the secret doesn't exist, or is not equal to the same secret in the `tekton-chains` namespace, copy the secret from `tekton-chains` to `openshift-pipelines`.
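The copy can be sketched as a pipeline; the `sed` rewrite of the namespace field is a simplification (a real copy should also drop server-set fields like `uid` and `resourceVersion`), and the sample manifest below is illustrative:

```shell
# On the cluster, a rough one-liner:
#   oc get secret signing-secrets -n tekton-chains -o yaml \
#     | sed 's/namespace: tekton-chains/namespace: openshift-pipelines/' \
#     | oc apply -f -
# Offline demonstration of the namespace rewrite:
manifest='apiVersion: v1
kind: Secret
metadata:
  name: signing-secrets
  namespace: tekton-chains'
printf '%s\n' "$manifest" | sed 's/namespace: tekton-chains/namespace: openshift-pipelines/'
```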
This happened in the `gitops` namespace but may happen to any PostgreSQL deployment. The pod running PostgreSQL is in `CrashLoopBackOff` (as are all other pods that want to connect to it). The logs contain just one line:
```
chmod: changing permissions of '/var/lib/pgsql/data/userdata': Operation not permitted
```
A restart of the pod did not solve it.
Inspired by a solution on the RH Customer Portal, I tried changing permissions on the `/var/lib/pgsql/data/userdata` folder. I had issues executing `chmod` during the `oc debug` command. What worked for me was removing the folder and creating it again (which meant losing the data stored there).
New user environments are not being provisioned or deleted, blocking new UserSignups from completing.
The GitOps service manager keeps reporting TLS errors. The toolchain host reports something like:
```
Error from server (InternalError): error when replacing "STDIN": Internal error occurred: failed calling webhook "venvironment.kb.io": failed to call webhook: Post "https://gitops-appstudio-service-webhook-service.gitops.svc:443/validate-appstudio-redhat-com-v1alpha1-environment?timeout=10s": x509: certificate signed by unknown authority
```
Go to API -> `ValidatingWebhookConfiguration` and find the `gitops-appstudio-service-validating-webhook-configuration` instance. Look for the `venvironment.kb.io` webhook. In its `clientConfig` there should be a `caBundle` item identical to the other webhooks'. If it is empty, copy it over from another webhook config.
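Reading the working `caBundle` and patching the empty entry can be sketched from the CLI; the webhook indices (`0` for a working entry, `1` for `venvironment.kb.io`) and the `<caBundle>` placeholder are assumptions — check the order in the actual object. The `echo` keeps both commands a dry run:

```shell
CFG=gitops-appstudio-service-validating-webhook-configuration
# Read the caBundle of a webhook that works (index 0 assumed here):
echo oc get validatingwebhookconfiguration "$CFG" \
  -o jsonpath='{.webhooks[0].clientConfig.caBundle}'
# Then patch it into the empty venvironment.kb.io entry (index 1 assumed):
echo oc patch validatingwebhookconfiguration "$CFG" --type=json \
  -p '[{"op":"replace","path":"/webhooks/1/clientConfig/caBundle","value":"<caBundle>"}]'
```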