
Ability to set custom ccp_monitoring pass #3719

Merged: 1 commit into CrunchyData:master on Sep 22, 2023

Conversation

@jmckulk (Collaborator) commented Sep 13, 2023

With this change, users can update the <cluster>-monitoring secret with a password, in either the stringData or data secret fields, and remove the verifier to update the ccp_monitoring password in Postgres. After the secret changes, the watcher logic notices the change and restarts the postgres_exporter process with the updated password. This change supports monitoring for standby clusters: previously, a standby cluster was created without access to the ccp_monitoring user password replicated from the primary cluster.

In addition to allowing a custom ccp_monitoring password, this change updates the exporter to use queries from pgMonitor 4.9 and refactors our e2e tests.
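For illustration, updating the secret by hand might look something like the following. The password and verifier key names are assumptions based on the description above, not confirmed by this PR, so check the generated <cluster>-monitoring secret for the exact keys:

    # Hypothetical sketch: set a new plaintext password via stringData and
    # remove the existing verifier so the operator updates the ccp_monitoring
    # password in Postgres from the new value. Key names are assumptions.
    kubectl patch secret <cluster>-monitoring -n <namespace> --type merge \
      -p '{"stringData":{"password":"<new-password>"},"data":{"verifier":null}}'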

@benjaminjb (Contributor) left a comment

I know you wanted to get the passfile work in, so this is just an initial look at this step.

commands:
  # Check that all containers in the instance pod are ready
  - script: |
      retry() { bash -ceu 'printf "$1\nSleeping...\n" && sleep 5' - "$@"; }
Contributor:

Is this retrying when you have

[ "$pod" = "" ] && retry "Pod not found" && exit 1

? We don't need to try to get the pod again?

Collaborator Author:

"retry" here is basically sleep then exit with a message (maybe there is a better name for the function 🤔 ). If the pod isn't found, the script fails and Kuttl runs it again.

So the whole script will be run every time
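Pieced together from the snippets quoted in this thread, the pattern looks roughly like this (a simplified sketch, not the exact test script):

    retry() { bash -ceu 'printf "$1\nSleeping...\n" && sleep 5' - "$@"; }

    pod=$(kubectl get pods -o name -n "${NAMESPACE}" \
      -l postgres-operator.crunchydata.com/cluster=exporter-no-tls \
      -l postgres-operator.crunchydata.com/crunchy-postgres-exporter=true)
    # No pod yet: print a message, sleep, and exit non-zero so that KUTTL
    # reruns the whole script on its next attempt.
    [ "$pod" = "" ] && retry "Pod not found" && exit 1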

Contributor:

Ah, right, because what we discovered in testing was that the commands script in a TestAssert gets rerun from the beginning if any element fails.

Collaborator Author:

exit_with_message()
wait_and_exit()

🤔

Contributor:

How many times would it retry?

Collaborator Author:

The retry doesn't do anything special; it just waits and expects an exit 1 to follow. KUTTL will loop through the script until it exits 0 or hits a timeout.

Contributor:

Yeah, I guess this is a feature of TestAssert steps that has an unexpected side effect when using commands:

  • TestAsserts wait until their conditions are true -- so if you have a TestAssert looking for a pod with particular labels, KUTTL will keep checking until it finds that pod or until it times out.
  • So if a commands section in a TestAssert fails, it just gets restarted.

Put that way, I get it, but I was still surprised by that behavior the first few times I saw it. Maybe worth a comment in the README?

Collaborator Author:

Yeah, I imagine a KUTTL how-to doc somewhere?

One thing of note: the script will loop until 1) it passes AND 2) the assert files are met. So even if a script exits 0, the step will continue to run until the assert files (e.g., YAML asserts on a postgres cluster) also pass.
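For illustration, a minimal sketch of what that looks like in a single assert file; the resource and label values are illustrative, and the project's actual assert files differ:

    apiVersion: kuttl.dev/v1beta1
    kind: TestAssert
    commands:
    # KUTTL reruns this script from the top whenever it exits non-zero.
    - script: |
        pod=$(kubectl get pods -o name -n "${NAMESPACE}" \
          -l postgres-operator.crunchydata.com/cluster=exporter-no-tls)
        [ "$pod" = "" ] && sleep 5 && exit 1
        exit 0
    ---
    # The step only completes when the script exits 0 AND this object
    # assert is also satisfied (or the step times out).
    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        postgres-operator.crunchydata.com/cluster: exporter-no-tls
    status:
      phase: Running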

Collaborator Author:

I didn't end up changing the name of the retry function - Will add to backlog

testing/kuttl/e2e/exporter-password-change/04-assert.yaml (outdated; resolved review threads)
@jmckulk force-pushed the custom_exporter_password branch from 5e4cc4e to beea11d on September 20, 2023 15:45
@jmckulk marked this pull request as ready for review on September 20, 2023 15:46
@benjaminjb (Contributor) left a comment

Just a few comments/questions, mostly re comments/naming things

internal/pgmonitor/exporter.go (outdated; resolved)
# Then, check the contents of the queries to ensure queries.yml was generated correctly
# Finally, store the current exporter pid as an annotation
- script: |
    retry() { bash -ceu 'printf "$1\nSleeping...\n" && sleep 5' - "$@"; }
Contributor:

I almost want to move the exit 1 into retry so then we can say something like, "when this func is called, we sleep, then fail, to signal to KUTTL to restart this test step"
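For illustration, that would presumably look something like the following, which is also the shape that shows up in the log output later in this thread:

    retry() { bash -ceu 'printf "$1\nSleeping...\n" && sleep 5 && exit 1' - "$@"; }
    ...
    [ "$condition_json" = "" ] && retry "conditions not found"

Whether the inner exit 1 stops the outer script, or only the inner bash -c, is exactly what gets worked out below.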

Collaborator Author:

I like that

Collaborator Author:

I'm not sure if this would work 🤔 Would that just exit the retry function?

Contributor:

Hmm, what if we set -e at the top of the script? Not sure about that, just a thought to maybe try.

I'm curious: if you take retry out and just put in a sleep, does anything get printed if the script fails?

Collaborator Author:

I did a quick test, and with set -e it does return as expected.

> I'm curious: if you take retry out and just put in a sleep, does anything get printed if the script fails?

I'm not tracking... like this?

{ contains x y; } || { sleep 5 && exit 1; }

Collaborator Author:

With that, there is no error message; I'm not sure what would happen if the error continued until the timeout.

In the log below, the first check just sleeps and exits, but the second uses retry:

    logger.go:42: 09:58:34 | exporter-no-tls/0-00--create-cluster | running command: [sh -c set -e
        retry() { bash -ceu 'printf "$1\nSleeping...\n" && sleep 5 && exit 1' - "$@"; }
...
        pod=$(kubectl get pods -o name -n "${NAMESPACE}" \
          -l postgres-operator.crunchydata.com/cluster=exporter-no-tls \
          -l postgres-operator.crunchydata.com/crunchy-postgres-exporter=true)
        [ "$pod" = "" ] && sleep 5 && exit 1
...
    logger.go:42: 09:58:40 | exporter-no-tls/0-00--create-cluster | running command: [sh -c set -e
...
        condition_json=$(kubectl get "${pod}" -n "${NAMESPACE}" -o jsonpath="{.status.conditions}")
        [ "$condition_json" = "" ] && retry "conditions not found"
        { check_containers_ready "$condition_json"; } || {
          retry "containers not ready"
        }
...
    logger.go:42: 09:58:40 | exporter-no-tls/0-00--create-cluster | false
    logger.go:42: 09:58:40 | exporter-no-tls/0-00--create-cluster | containers not ready
    logger.go:42: 09:58:40 | exporter-no-tls/0-00--create-cluster | Sleeping...
    logger.go:42: 09:58:46 | exporter-no-tls/0-00--create-cluster | running command: [sh -c set -e

Contributor:

Yeah, this

{ contains x y; } || { sleep 5 && exit 1; }

is what I was thinking. I was wondering if KUTTL would report something when it gets to that exit, but no, thinking about it now, I don't see why it would.

That said, do you know where the false is coming from in

logger.go:42: 09:58:40 | exporter-no-tls/0-00--create-cluster | false

Contributor:

check_containers_ready... it takes the argument (the .status.conditions of the pod), uses jq and whatnot to get the ContainersReady status, compares it to True, and returns the result, which in this case is false.
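A rough reconstruction of that helper from this description alone; the real test script may differ:

    check_containers_ready() {
      # $1 is the pod's .status.conditions JSON. jq prints true/false (the
      # stray "false" seen in the log above) and, with --exit-status,
      # returns non-zero when the comparison evaluates to false.
      echo "$1" | jq --exit-status \
        '.[] | select(.type == "ContainersReady") | .status == "True"'
    }

Redirecting that jq output to /dev/null, as mentioned in the next comment, would suppress the stray line.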

Collaborator Author:

Locally I have sent the output of jq to /dev/null so it doesn't print that line. I'll push that up shortly.

Collaborator Author:

I didn't end up pushing this change. Will add to backlog

testing/kuttl/e2e/exporter-password-change/02-assert.yaml (outdated; resolved)
testing/kuttl/e2e/exporter-replica/00-assert.yaml (outdated; resolved)
internal/pgmonitor/exporter.go (resolved)
internal/pgmonitor/exporter.go (outdated; resolved)
commands:
  # Check that all containers in the instance pod are ready
  - script: |
      retry() { bash -ceu 'printf "$1\nSleeping...\n" && sleep 5' - "$@"; }
Contributor:

How many times would it retry?

testing/kuttl/e2e/exporter-password-change/README.md (outdated; resolved)
testing/kuttl/e2e/exporter-replica/00--create-cluster.yaml (outdated; resolved)
@dsessler7 (Contributor) left a comment

LGTM

@benjaminjb (Contributor) left a comment

LGTM

@jmckulk force-pushed the custom_exporter_password branch from e9ce842 to 72ad15f on September 22, 2023 21:14
Commit message:

With this change users can update the <cluster>-monitoring secret with a password, in either the `stringData` or `data` secret fields, and remove the verifier to update the ccp_monitoring password in Postgres. After this, users will need to restart the exporter process by deleting the instance pods (a solution that doesn't require full pod restarts is coming).
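For illustration, that stop-gap restart might look like this, reusing the pod labels that appear in this PR's test scripts (cluster name and namespace are placeholders):

    kubectl delete pods -n <namespace> \
      -l postgres-operator.crunchydata.com/cluster=<cluster> \
      -l postgres-operator.crunchydata.com/crunchy-postgres-exporter=true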

This change is to support monitoring for standby clusters. Before this, a standby cluster would be created without having access to the ccp_monitoring user password that was replicated from the primary cluster.

Add a test to ensure that postgres_exporter can scrape Postgres using a custom ccp_monitoring password.

The test will:
1. Create a cluster with the exporter enabled and ensure metrics can be collected
2. Update the password and restart the pod
3. Ensure that metrics can still be collected with the new password

The tests now require jq to run.

- Refactor existing exporter tests

  - Split out the TLS and non-TLS tests into separate directories.
  - Update the tests to check the pods' ContainersReady conditions.
  - Add collectors for test failures.
  - Include a test where we deploy a postgres cluster with monitoring
    enabled on a replica, then check that the exporter on the replica
    can query the database.

- Update exporter to use pass file

  - The exporter container now provides the ccp_monitoring password to
    postgres_exporter using a password file instead of an environment
    variable. With this, the password can be updated without requiring a
    container restart. The path to the password file has also been added to
    the exporter watcher logic, meaning that the postgres_exporter process
    will be restarted when either the queries directory or the password file
    changes. (A minimal sketch of this mechanism follows this list.)

  - The password change test is updated to check that the postgres_exporter
    pid has been updated before trying to reconnect.

- Update to pgMonitor 4.9

  - Update to pull pgMonitor 4.9 queries. The new version has a specific
    file for the global_dbsize metric that needs to be included when
    generating the default queries.

- Standby metrics testing

  - Now that the password for the monitoring user is configurable, users can
    configure a standby cluster to allow the exporter to query Postgres
    using the ccp_monitoring user. This change implements testing to
    validate this use case. This test is included in e2e-other because it
    requires more work: we need to ensure a backup is complete before
    attempting to curl metrics. See the note below.
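A minimal sketch of the pass-file mechanism described in the "Update exporter to use pass file" item above. The DATA_SOURCE_* names are postgres_exporter's documented environment variables; the file paths are placeholders, not the operator's actual layout:

    # postgres_exporter reads the password from a file rather than from an
    # environment variable, so rewriting that file (and restarting only the
    # exporter process) picks up a new password without a container restart.
    export DATA_SOURCE_URI="localhost:5432/postgres"
    export DATA_SOURCE_USER="ccp_monitoring"
    export DATA_SOURCE_PASS_FILE="/path/to/ccp_monitoring/password"  # placeholder path
    postgres_exporter --extend.query-path=/path/to/queries.yml       # placeholder path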

Note: Move standby and replica tests to e2e-other

These two tests can fail because of a scrape_error if a backup has not
completed. They need to be updated to check that a backup is complete
before attempting to collect metrics. There is a related story in our
backlog.

Due to the race condition (the backup not being complete), they could pass
or fail. After a backup check is in place, they should be able to move back
into the e2e directory.
@jmckulk force-pushed the custom_exporter_password branch from 72ad15f to be108de on September 22, 2023 21:25
@jmckulk merged commit 467747a into CrunchyData:master on Sep 22, 2023
6 checks passed