
reconciling multiple trino clusters results in cluster-wide coordinator downtime #618

Open
maxgruber19 opened this issue Jul 23, 2024 · 0 comments


we're dealing with the issue of concurrent reconciliations whenever TrinoCluster resources change. this happens e.g. when a catalog is applied whose labels are matched by the catalog label selector of more than one cluster, or when all TrinoCluster resources are changed at the same time because they are configured in custom helm wrappers.
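for illustration, a minimal sketch (assuming the trino.stackable.tech/v1alpha1 CRDs and the clusterConfig field layout of recent operator versions; all names are made up) of a single TrinoCatalog that is matched by the catalogLabelSelector of two clusters, so one catalog upsert reconciles several clusters at once:

apiVersion: trino.stackable.tech/v1alpha1
kind: TrinoCatalog
metadata:
  name: lakehouse            # made-up catalog name
  labels:
    trino: shared            # this label is selected by more than one cluster
spec:
  connector:
    tpch: {}                 # any connector works for the example
---
apiVersion: trino.stackable.tech/v1alpha1
kind: TrinoCluster
metadata:
  name: analytics            # made-up cluster name; a second cluster "reporting" uses the same selector
spec:
  clusterConfig:
    catalogLabelSelector:
      matchLabels:
        trino: shared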

since we use argo for continuous deployment we are not able to change clusters / upsert catalogs one after another in a manual way.

we have not made progress with trino-lb (#490) yet, but I'm sure even with trino-lb running this would cause outages every time the TrinoCluster resources are (re-)configured or catalogs are upserted. unfortunately, running trino in a highly available way is mission critical for our production scenario.

possible solution: sequential reconciliation

introducing a flag for the operator (other product operators might be affected as well) which enables sequential, queue-style reconciliation instead of the parallel reconciliations that currently take all clusters offline at the same time. a hypothetical sketch of what such a setting could look like follows below.
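purely hypothetical — no such option exists in the operator today; this only sketches how the proposed behaviour could be exposed, e.g. via the operator's helm values:

# hypothetical values for the trino-operator helm chart; the keys below do
# not exist today and only illustrate the proposed queue-style reconciliation
reconciliation:
  mode: sequential     # reconcile TrinoCluster resources one at a time
  maxConcurrent: 1     # hypothetical cap on parallel reconciliations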

a disadvantage might be that one broken cluster resource blocks the whole reconciliation queue until that resource is fixed manually.

possible solution: pdb

we already defined the following pdb to make sure at least one coordinator per kubernetes cluster stays available. unfortunately the pdb is ignored and all coordinators get killed concurrently. @maltesander @sbernauer already mentioned that the operator deletes pods directly instead of evicting them, and pdbs are only respected by the eviction API. feel free to edit / add some further details

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: trino-highavailiability-coordinator
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: coordinator
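a possible refinement (a sketch, assuming the operator labels coordinator pods with app.kubernetes.io/instance set to the TrinoCluster name; "analytics" is made up) would be one pdb per TrinoCluster, so every cluster keeps its own coordinator protected instead of just one coordinator across the whole kubernetes cluster:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: trino-analytics-coordinator       # made-up name, one pdb per TrinoCluster
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: coordinator
      app.kubernetes.io/instance: analytics   # assumed to carry the TrinoCluster name

either way this only helps if the operator evicts pods instead of deleting them, since pdbs are only consulted by the eviction API.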

Seems like somebody is feeling similar pain with elasticsearch kubernetes/kubernetes#91808 (comment)
