
reconciling multiple trino clusters results in cluster-wide coordinator downtime #618

Open
maxgruber19 opened this issue Jul 23, 2024 · 0 comments


we're dealing with the issue of concurrent reconciliations whenever TrinoCluster resources change. this happens e.g. when a catalog is applied whose labels are matched by the catalog label selector of more than one cluster, or when all TrinoCluster resources are changed at the same time because they are configured in custom helm wrappers.
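for illustration, a minimal sketch (assuming the trino.stackable.tech/v1alpha1 CRDs and the clusterConfig field layout of recent operator versions; all names are made up) of a single TrinoCatalog that is matched by the catalogLabelSelector of two clusters, so one catalog upsert reconciles several clusters at once:

apiVersion: trino.stackable.tech/v1alpha1
kind: TrinoCatalog
metadata:
  name: lakehouse            # made-up catalog name
  labels:
    trino: shared            # this label is selected by more than one cluster
spec:
  connector:
    tpch: {}                 # any connector works for the example
---
apiVersion: trino.stackable.tech/v1alpha1
kind: TrinoCluster
metadata:
  name: analytics            # made-up cluster name; a second cluster "reporting" uses the same selector
spec:
  clusterConfig:
    catalogLabelSelector:
      matchLabels:
        trino: shared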

since we use argo for continuous deployment we are not able to change clusters / upsert catalogs one after another in a manual way.

we have not made progress with trino-lb (#490) yet, but I'm sure even with trino-lb running this would cause outages every time the TrinoCluster resources are (re-)configured or catalogs are upserted. unfortunately, running trino in a highly available way is mission critical for our production scenario.

possible solution: sequential reconciliation

introducing a flag for the operator (other product operators might be affected as well) which enables sequential, queue-style reconciliation instead of the parallel reconciliations that currently take all clusters offline at the same time. a hypothetical sketch of what such a setting could look like follows below.
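purely hypothetical — no such option exists in the operator today; this only sketches how the proposed behaviour could be exposed, e.g. via the operator's helm values:

# hypothetical values for the trino-operator helm chart; the keys below do
# not exist today and only illustrate the proposed queue-style reconciliation
reconciliation:
  mode: sequential     # reconcile TrinoCluster resources one at a time
  maxConcurrent: 1     # hypothetical cap on parallel reconciliations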

a disadvantage might be that one broken cluster resource blocks the whole reconciliation queue until that resource is fixed manually.

possible solution: pdb

we already defined the following pdb to make sure at least one coordinator per kubernetes cluster stays available. unfortunately the pdb is ignored and all coordinators get killed concurrently. @maltesander @sbernauer already mentioned that the operator deletes pods directly instead of evicting them, and pdbs are only respected by the eviction API. feel free to edit / add some further details

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: trino-highavailiability-coordinator
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: coordinator
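a possible refinement (a sketch, assuming the operator labels coordinator pods with app.kubernetes.io/instance set to the TrinoCluster name; "analytics" is made up) would be one pdb per TrinoCluster, so every cluster keeps its own coordinator protected instead of just one coordinator across the whole kubernetes cluster:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: trino-analytics-coordinator       # made-up name, one pdb per TrinoCluster
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: coordinator
      app.kubernetes.io/instance: analytics   # assumed to carry the TrinoCluster name

either way this only helps if the operator evicts pods instead of deleting them, since pdbs are only consulted by the eviction API.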

Seems like somebody is feeling similar pain with elasticsearch kubernetes/kubernetes#91808 (comment)
