Ability to set custom ccp_monitoring pass #3719
Conversation
I know you wanted to get the passfile work in, so this is just an initial look at this step.
commands:
  # Check that all containers in the instance pod are ready
  - script: |
      retry() { bash -ceu 'printf "$1\nSleeping...\n" && sleep 5' - "$@"; }
Is this retrying when you have
[ "$pod" = "" ] && retry "Pod not found" && exit 1
? We don't need to try to get the pod again?
"retry" here is basically sleep then exit with a message (maybe there is a better name for the function 🤔 ). If the pod isn't found, the script fails and Kuttl runs it again.
So the whole script will be run every time
Ah, right, because what we discovered in testing was that the commands script in a TestAssert gets rerun from the beginning if any element fails.
exit_with_message()
wait_and_exit()
🤔
How many times would it retry?
The retry function doesn't do anything special - it just waits and expects an exit 1 to follow. Kuttl will loop through the script until it exits 0 or hits a timeout.
Yeah, I guess this is a feature of TestAssert steps that has an unexpected side-effect when using commands:
- TestAsserts wait until their conditions are true -- so if you have a TestAssert looking for a pod with particular labels, KUTTL will keep checking until it finds that pod or until it times out.
- So if a commands section in a TestAssert fails, it just gets restarted.
Put that way, I get it, but I was still surprised by that behavior the first few times I saw it. Maybe worth a comment in the README?
Yeah, I imagine a KUTTL how-to doc somewhere?
One thing of note - the script will loop until 1) it passes AND 2) the assert files are met. So even if a script exits 0, it will continue to run until the assert files (e.g., YAML asserts on a postgres cluster) also pass.
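For later reference, a minimal sketch of that shape (the resource names, labels, timeout, and status fields below are illustrative, not copied from this PR's test files):

    # NN-assert.yaml (illustrative)
    apiVersion: kuttl.dev/v1beta1
    kind: TestAssert
    timeout: 120
    commands:
      # If this script exits non-zero, KUTTL reruns it from the beginning
      # until it exits 0 or the step times out.
      - script: |
          pod=$(kubectl get pods -o name -n "${NAMESPACE}" \
            -l postgres-operator.crunchydata.com/cluster=example-cluster)
          [ -n "$pod" ] || { sleep 5; exit 1; }
    ---
    # Any additional documents in the assert file must also match the live
    # objects before the step passes, even after the script exits 0.
    apiVersion: postgres-operator.crunchydata.com/v1beta1
    kind: PostgresCluster
    metadata:
      name: example-cluster
    status:
      instances:
        - name: instance1
          readyReplicas: 1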
I didn't end up changing the name of the retry function - will add to backlog.
Force-pushed from 5e4cc4e to beea11d
Just a few comments/questions, mostly re comments/naming things
# Then, check the contents of the queries to ensure queries.yml was generated correctly | ||
# Finally, store the current exporter pid as an annotation | ||
- script: | | ||
retry() { bash -ceu 'printf "$1\nSleeping...\n" && sleep 5' - "$@"; } |
I almost want to move the exit 1 into retry, so then we can say something like, "when this func is called, we sleep, then fail, to signal to KUTTL to restart this test step".
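A quick sketch of that idea (this anticipates the set -e discussion below; the test output quoted later in the thread shows essentially this shape):

    # Sketch: with set -e in effect, retry's non-zero return aborts the whole
    # script, which signals KUTTL to rerun this assert step.
    set -e

    retry() { bash -ceu 'printf "$1\nSleeping...\n" && sleep 5 && exit 1' - "$@"; }

    # Callers then read as "this must hold, or sleep and fail":
    [ -n "$pod" ] || retry "Pod not found"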
I like that
I'm not sure if this would work 🤔 would that just exit the retry function?
Hmmm, what if we set -e at the top of the script? Not sure about that, just a thought to maybe try.
I'm curious: if you take retry out and just put in a sleep, does anything get printed if the script fails?
I did a quick test and set -e does return as expected.
"I'm curious: if you take retry out and just put in a sleep, does anything get printed if the script fails?"
I'm not tracking... like this?
{ contains x y; } || { sleep 5 && exit 1; }
With that, there is no error message - I'm not sure what would happen if the error continued until the timeout.
In the output below, the first command just sleeps and exits, but the second uses retry:
logger.go:42: 09:58:34 | exporter-no-tls/0-00--create-cluster | running command: [sh -c set -e
retry() { bash -ceu 'printf "$1\nSleeping...\n" && sleep 5 && exit 1' - "$@"; }
...
pod=$(kubectl get pods -o name -n "${NAMESPACE}" \
-l postgres-operator.crunchydata.com/cluster=exporter-no-tls \
-l postgres-operator.crunchydata.com/crunchy-postgres-exporter=true)
[ "$pod" = "" ] && sleep 5 && exit 1
...
logger.go:42: 09:58:40 | exporter-no-tls/0-00--create-cluster | running command: [sh -c set -e
...
condition_json=$(kubectl get "${pod}" -n "${NAMESPACE}" -o jsonpath="{.status.conditions}")
[ "$condition_json" = "" ] && retry "conditions not found"
{ check_containers_ready "$condition_json"; } || {
retry "containers not ready"
}
...
logger.go:42: 09:58:40 | exporter-no-tls/0-00--create-cluster | false
logger.go:42: 09:58:40 | exporter-no-tls/0-00--create-cluster | containers not ready
logger.go:42: 09:58:40 | exporter-no-tls/0-00--create-cluster | Sleeping...
logger.go:42: 09:58:46 | exporter-no-tls/0-00--create-cluster | running command: [sh -c set -e
Yeah, this
{ contains x y; } || { sleep 5 && exit 1; }
is what I was thinking. I was wondering if KUTTL would report something when it gets to that exit, but no, thinking about it now, I don't see why it would.
That said, do you know where the false is coming from in
logger.go:42: 09:58:40 | exporter-no-tls/0-00--create-cluster | false
check_containers_ready ... it takes the argument, the .status.conditions of the pod, uses jq and whatnot to get the ContainersReady status, compares it to True, and returns the result, which in this case is false.
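For readers following along, a rough sketch of that helper (the real function in the test files may differ; the jq expression and --exit-status flag here are illustrative):

    # Given the pod's .status.conditions JSON, print the result of comparing
    # the ContainersReady condition's status to "True". With --exit-status,
    # jq also returns non-zero when that result is false, so callers can use
    # it directly in a || check. Redirecting jq's stdout to /dev/null (see
    # below) would suppress the stray "false" line in the KUTTL log.
    check_containers_ready() {
      echo "$1" | jq --exit-status '.[] | select(.type == "ContainersReady") | .status == "True"'
    }

    condition_json=$(kubectl get "${pod}" -n "${NAMESPACE}" -o jsonpath="{.status.conditions}")
    { check_containers_ready "$condition_json"; } || retry "containers not ready"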
Locally I have sent the output of jq > /dev/null so it doesn't have that line. I'll push that up shortly.
I didn't end up pushing this change. Will add to backlog
testing/kuttl/e2e/exporter-password-change/files/update-monitoring-password-checks.yaml (outdated; conversation resolved)
testing/kuttl/e2e-other/exporter-append-custom-queries/README.md (outdated; conversation resolved)
commands:
  # Check that all containers in the instance pod are ready
  - script: |
      retry() { bash -ceu 'printf "$1\nSleeping...\n" && sleep 5' - "$@"; }
How many times would it retry?
testing/kuttl/e2e/exporter-standby/02--set-primary-password.yaml (outdated; conversation resolved)
testing/kuttl/e2e/exporter-standby/04--set-standby-password.yaml (outdated; conversation resolved)
LGTM
LGTM
Force-pushed from e9ce842 to 72ad15f
With this change users can update the <cluster>-monitoring secret with a password, in either the `stringData` or `data` secret fields, and remove the verifier to update the ccp_monitoring password in postgres. After this, users will need to restart the exporter process by deleting the instance pods (a solution that doesn't require full pod restarts is coming). This change is to support monitoring for standby clusters. Before this, a standby cluster would be created without having access to the ccp_monitoring user password that was replicated from the primary cluster.

Tests ensure that postgres_exporter can scrape postgres using a custom ccp_monitoring password. The tests will:
1. Create a cluster with exporter enabled and ensure metrics can be collected
2. Update the password and restart the pod
3. Ensure that metrics can still be collected with the new password
Tests now require jq to run.

- Refactor existing exporter tests
  - Split out the tls and no-tls tests into separate directories.
  - Update the tests to check the containers ready conditions.
  - Add collectors for test failures.
  - Include a test where we deploy a postgres cluster with monitoring enabled on a replica. It will then check that the exporter on the replica can query the database.
- Update exporter to use pass file
  - The exporter container now provides the ccp_monitoring password to postgres_exporter using a password file instead of an environment variable. With this, the password can be updated without requiring a container restart. The path to the password file has also been added to the exporter watcher logic, meaning that the postgres_exporter process will be restarted when either the queries directory or the password file change.
  - The password change test is updated to check that the postgres_exporter pid has been updated before trying to re-connect.
- Update pgMonitor to 4.9
  - Update to pull pgMonitor 4.9 queries. The new version has a specific file for the global_dbsize metric that needs to be included when generating the default queries.
- Standby metrics testing
  - Now that the password for the monitoring user is configurable, users can configure a standby cluster to allow the exporter to query postgres using the ccp_monitoring user. This change implements testing to validate this use case. This test is included in e2e-other because it requires more work: we need to ensure a backup is complete before attempting to curl metrics. See note below.

Note: Move standby and replica tests to e2e-other. These two tests can fail because of a scrape_error if a backup has not completed. They need to be updated to check that a backup is complete before attempting to collect metrics. There is a related story in our backlog. Due to the race condition (backup not being complete), they could pass or fail. After a backup check is in place they should be able to move back into the e2e directory.
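As a hedged illustration of test step 3 above (not the actual assertion from the test files; the exporter container name, port 9187, and the pg_up metric are assumptions based on postgres_exporter defaults):

    # Spot-check from the exporter sidecar that postgres_exporter can still
    # reach Postgres with the new ccp_monitoring password: pg_up is 1 only
    # when the exporter's database connection succeeds.
    kubectl exec -n "${NAMESPACE}" "${pod}" -c exporter -- \
      curl -s http://localhost:9187/metrics | grep '^pg_up 1'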
Force-pushed from 72ad15f to be108de
With this change, users can update the <cluster>-monitoring secret with a password, in either the stringData or data secret fields, and remove the verifier to update the ccp_monitoring password in Postgres. After changing the secret, the watcher logic will notice the change and restart the postgres_exporter process with the updated password. This change is to support monitoring for standby clusters. Before this, a standby cluster would be created without having access to the ccp_monitoring user password that was replicated from the primary cluster.

In addition to allowing a custom ccp_monitoring password, this change updates the exporter to use queries from pgMonitor 4.9 and refactors our e2e tests.
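To make that flow concrete, a hedged sketch (the cluster name, namespace, and the password/verifier key names here are assumptions based on the description above, not commands taken from this PR):

    # Set a new ccp_monitoring password via stringData and drop the stored
    # verifier so it is regenerated from the new password; with this change
    # the exporter's watcher notices the update and restarts postgres_exporter.
    kubectl patch secret hippo-monitoring -n postgres-operator --type merge \
      --patch '{"stringData": {"password": "new-monitoring-password"}, "data": {"verifier": null}}'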