Ability to set custom ccp_monitoring pass #3719
Conversation
I know you wanted to get the passfile work in, so this is just an initial look at this step.
commands:
  # Check that all containers in the instance pod are ready
  - script: |
      retry() { bash -ceu 'printf "$1\nSleeping...\n" && sleep 5' - "$@"; }
Is this retrying when you have
[ "$pod" = "" ] && retry "Pod not found" && exit 1
? We don't need to try to get the pod again?
"retry" here is basically sleep then exit with a message (maybe there is a better name for the function 🤔 ). If the pod isn't found, the script fails and Kuttl runs it again.
So the whole script will be run every time
Ah, right, because what we discovered in testing was that the commands script in a TestAssert gets rerun from the beginning if any element fails.
exit_with_message()
wait_and_exit()
🤔
How many times would it retry?
The retry function doesn't do anything special - it just waits and expects an exit 1 to follow. Kuttl will loop through the script until it exits 0 or hits a timeout.
Yeah, I guess this is a feature of TestAssert steps that has an unexpected side-effect when using commands:
- TestAsserts wait until their conditions are true -- so if you have a TestAssert looking for a pod with particular labels, KUTTL will keep checking until it finds that pod or until it times out.
- So if a commands section in a TestAssert fails, it just gets restarted.
Put that way, I get it, but I was still surprised by that behavior the first few times I saw it. Maybe worth a comment in the README?
Yeah, I imagine a KUTTL how-to doc somewhere?
One thing of note - the script will loop until 1) it passes AND 2) the assert files are met. So even if a script exits 0, it will continue to run until the assert files (e.g., YAML asserts on a postgres cluster) also pass.
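For later reference, a minimal sketch of that shape (the resource names, labels, timeout, and status fields below are illustrative, not copied from this PR's test files):

    # NN-assert.yaml (illustrative)
    apiVersion: kuttl.dev/v1beta1
    kind: TestAssert
    timeout: 120
    commands:
      # If this script exits non-zero, KUTTL reruns it from the beginning
      # until it exits 0 or the step times out.
      - script: |
          pod=$(kubectl get pods -o name -n "${NAMESPACE}" \
            -l postgres-operator.crunchydata.com/cluster=example-cluster)
          [ -n "$pod" ] || { sleep 5; exit 1; }
    ---
    # Any additional documents in the assert file must also match the live
    # objects before the step passes, even after the script exits 0.
    apiVersion: postgres-operator.crunchydata.com/v1beta1
    kind: PostgresCluster
    metadata:
      name: example-cluster
    status:
      instances:
        - name: instance1
          readyReplicas: 1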
I didn't end up changing the name of the retry function - will add to backlog.
Force-pushed from 5e4cc4e to beea11d
Just a few comments/questions, mostly re comments/naming things
# Then, check the contents of the queries to ensure queries.yml was generated correctly | ||
# Finally, store the current exporter pid as an annotation | ||
- script: | | ||
retry() { bash -ceu 'printf "$1\nSleeping...\n" && sleep 5' - "$@"; } |
I almost want to move the exit 1 into retry, so then we can say something like, "when this func is called, we sleep, then fail, to signal to KUTTL to restart this test step".
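A quick sketch of that idea (this anticipates the set -e discussion below; the test output quoted later in the thread shows essentially this shape):

    # Sketch: with set -e in effect, retry's non-zero return aborts the whole
    # script, which signals KUTTL to rerun this assert step.
    set -e

    retry() { bash -ceu 'printf "$1\nSleeping...\n" && sleep 5 && exit 1' - "$@"; }

    # Callers then read as "this must hold, or sleep and fail":
    [ -n "$pod" ] || retry "Pod not found"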
I like that
I'm not sure if this would work 🤔 would that just exit the retry function?
Hmmm, what if we set -e at the top of the script? Not sure about that, just a thought to maybe try.
I'm curious: if you take retry out and just put in a sleep, does anything get printed if the script fails?
I did a quick test and set -e does return as expected.
"I'm curious: if you take retry out and just put in a sleep, does anything get printed if the script fails?"
I'm not tracking... like this?
{ contains x y; } || { sleep 5 && exit 1; }
With that, there is no error message - I'm not sure what would happen if the error continued until the timeout.
In the output below, the first command just sleeps and exits, but the second uses retry:
logger.go:42: 09:58:34 | exporter-no-tls/0-00--create-cluster | running command: [sh -c set -e
retry() { bash -ceu 'printf "$1\nSleeping...\n" && sleep 5 && exit 1' - "$@"; }
...
pod=$(kubectl get pods -o name -n "${NAMESPACE}" \
-l postgres-operator.crunchydata.com/cluster=exporter-no-tls \
-l postgres-operator.crunchydata.com/crunchy-postgres-exporter=true)
[ "$pod" = "" ] && sleep 5 && exit 1
...
logger.go:42: 09:58:40 | exporter-no-tls/0-00--create-cluster | running command: [sh -c set -e
...
condition_json=$(kubectl get "${pod}" -n "${NAMESPACE}" -o jsonpath="{.status.conditions}")
[ "$condition_json" = "" ] && retry "conditions not found"
{ check_containers_ready "$condition_json"; } || {
retry "containers not ready"
}
...
logger.go:42: 09:58:40 | exporter-no-tls/0-00--create-cluster | false
logger.go:42: 09:58:40 | exporter-no-tls/0-00--create-cluster | containers not ready
logger.go:42: 09:58:40 | exporter-no-tls/0-00--create-cluster | Sleeping...
logger.go:42: 09:58:46 | exporter-no-tls/0-00--create-cluster | running command: [sh -c set -e
Yeah, this
{ contains x y; } || { sleep 5 && exit 1; }
is what I was thinking. I was wondering if KUTTL would report something when it gets to that exit, but no, thinking about it now, I don't see why it would.
That said, do you know where the false is coming from in
logger.go:42: 09:58:40 | exporter-no-tls/0-00--create-cluster | false
check_containers_ready ... it takes the argument, the .status.conditions of the pod, uses jq and whatnot to get the ContainersReady status, compares it to True, and returns the result, which in this case is false.
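For readers following along, a rough sketch of that helper (the real function in the test files may differ; the jq expression and --exit-status flag here are illustrative):

    # Given the pod's .status.conditions JSON, print the result of comparing
    # the ContainersReady condition's status to "True". With --exit-status,
    # jq also returns non-zero when that result is false, so callers can use
    # it directly in a || check. Redirecting jq's stdout to /dev/null (see
    # below) would suppress the stray "false" line in the KUTTL log.
    check_containers_ready() {
      echo "$1" | jq --exit-status '.[] | select(.type == "ContainersReady") | .status == "True"'
    }

    condition_json=$(kubectl get "${pod}" -n "${NAMESPACE}" -o jsonpath="{.status.conditions}")
    { check_containers_ready "$condition_json"; } || retry "containers not ready"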
Locally I have sent the output of jq > /dev/null so it doesn't have that line. I'll push that up shortly.
I didn't end up pushing this change. Will add to backlog
testing/kuttl/e2e/exporter-password-change/files/update-monitoring-password-checks.yaml (outdated; conversation resolved)
testing/kuttl/e2e-other/exporter-append-custom-queries/README.md (outdated; conversation resolved)
commands:
  # Check that all containers in the instance pod are ready
  - script: |
      retry() { bash -ceu 'printf "$1\nSleeping...\n" && sleep 5' - "$@"; }
How many times would it retry?
testing/kuttl/e2e/exporter-standby/02--set-primary-password.yaml (outdated; conversation resolved)
testing/kuttl/e2e/exporter-standby/04--set-standby-password.yaml (outdated; conversation resolved)
LGTM
LGTM
Force-pushed from e9ce842 to 72ad15f
With this change users can update the <cluster>-monitoring secret with a password, in either the `stringData` or `data` secret fields, and remove the verifier to update the ccp_monitoring password in postgres. After this, users will need to restart the exporter process by deleting the instance pods (a solution that doesn't require full pod restarts is coming). This change is to support monitoring for standby clusters. Before this, a standby cluster would be created without having access to the ccp_monitoring user password that was replicated from the primary cluster.

Tests ensure that postgres_exporter can scrape postgres using a custom ccp_monitoring password. The tests will:
1. Create a cluster with exporter enabled and ensure metrics can be collected
2. Update the password and restart the pod
3. Ensure that metrics can still be collected with the new password
Tests now require jq to run.

- Refactor existing exporter tests
  - Split out the tls and no-tls tests into separate directories.
  - Update the tests to check the containers ready conditions.
  - Add collectors for test failures.
  - Include a test where we deploy a postgres cluster with monitoring enabled on a replica. It will then check that the exporter on the replica can query the database.
- Update exporter to use pass file
  - The exporter container now provides the ccp_monitoring password to postgres_exporter using a password file instead of an environment variable. With this, the password can be updated without requiring a container restart. The path to the password file has also been added to the exporter watcher logic, meaning that the postgres_exporter process will be restarted when either the queries directory or the password file change.
  - The password change test is updated to check that the postgres_exporter pid has been updated before trying to re-connect.
- Update pgMonitor to 4.9
  - Update to pull pgMonitor 4.9 queries. The new version has a specific file for the global_dbsize metric that needs to be included when generating the default queries.
- Standby metrics testing
  - Now that the password for the monitoring user is configurable, users can configure a standby cluster to allow the exporter to query postgres using the ccp_monitoring user. This change implements testing to validate this use case. This test is included in e2e-other because it requires more work: we need to ensure a backup is complete before attempting to curl metrics. See note below.

Note: Move standby and replica tests to e2e-other. These two tests can fail because of a scrape_error if a backup has not completed. They need to be updated to check that a backup is complete before attempting to collect metrics. There is a related story in our backlog. Due to the race condition (backup not being complete), they could pass or fail. After a backup check is in place they should be able to move back into the e2e directory.
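As a hedged illustration of test step 3 above (not the actual assertion from the test files; the exporter container name, port 9187, and the pg_up metric are assumptions based on postgres_exporter defaults):

    # Spot-check from the exporter sidecar that postgres_exporter can still
    # reach Postgres with the new ccp_monitoring password: pg_up is 1 only
    # when the exporter's database connection succeeds.
    kubectl exec -n "${NAMESPACE}" "${pod}" -c exporter -- \
      curl -s http://localhost:9187/metrics | grep '^pg_up 1'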
Force-pushed from 72ad15f to be108de
With this change, users can update the <cluster>-monitoring secret with a password, in either the stringData or data secret fields, and remove the verifier to update the ccp_monitoring password in Postgres. After changing the secret, the watcher logic will notice the change and restart the postgres_exporter process with the updated password. This change is to support monitoring for standby clusters. Before this, a standby cluster would be created without having access to the ccp_monitoring user password that was replicated from the primary cluster.

In addition to allowing a custom ccp_monitoring password, this change updates the exporter to use queries from pgMonitor 4.9 and refactors our e2e tests.
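To make that flow concrete, a hedged sketch (the cluster name, namespace, and the password/verifier key names here are assumptions based on the description above, not commands taken from this PR):

    # Set a new ccp_monitoring password via stringData and drop the stored
    # verifier so it is regenerated from the new password; with this change
    # the exporter's watcher notices the update and restarts postgres_exporter.
    kubectl patch secret hippo-monitoring -n postgres-operator --type merge \
      --patch '{"stringData": {"password": "new-monitoring-password"}, "data": {"verifier": null}}'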