Operator fails "silently" to update TrinoCluster's resource #342

Open
Tracked by #383
zaultooz opened this issue Nov 16, 2022 · 2 comments

Comments

@zaultooz

Hey Stackable Team

I have noticed that when a TrinoCluster object is updated with changes to replication, the Trino operator fails to apply the update. This results in a series of error logs for the TrinoCluster object:

ErrorResponse: { 
  status: "Failure", 
  message: "StatefulSet.apps \"gbif-trino-coordinator-default\" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy', 'persistentVolumeClaimRetentionPolicy' and 'minReadySeconds' are forbidden",
  reason: "Invalid", 
  code: 422 }

Steps to recreate:

  1. Create a TrinoCluster object in your namespace
  2. Change the resource quotas in a roleGroup
  3. Apply the changes
  4. See the Event log for the TrinoCluster object

As this is expected behavior for the StatefulSet resource, and the workaround is to recreate the object in Kubernetes, I don't see it as high priority.

I was thinking it would be nice to have the error from Kubernetes returned when trying to update the forbidden fields on the TrinoCluster, and to reject the object so that changes which are not applied until the StatefulSet is recreated do not get pushed.
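
To illustrate the kind of check that could reject such an update up front, here is a minimal Python sketch. It is not how the operator works today. The set of mutable fields is copied from the error message quoted above; the function name `forbidden_changes` and the example specs are hypothetical.

```python
# Top-level StatefulSet spec fields that may be updated in place,
# copied from the API server's error message quoted above.
MUTABLE_FIELDS = {
    "replicas",
    "template",
    "updateStrategy",
    "persistentVolumeClaimRetentionPolicy",
    "minReadySeconds",
}


def forbidden_changes(old_spec: dict, new_spec: dict) -> list[str]:
    """Return the changed top-level spec fields that cannot be updated in place."""
    changed = {
        key
        for key in old_spec.keys() | new_spec.keys()
        if old_spec.get(key) != new_spec.get(key)
    }
    return sorted(changed - MUTABLE_FIELDS)


# Hypothetical specs: the replica count and the storage size both change.
old = {
    "replicas": 1,
    "volumeClaimTemplates": [
        {"metadata": {"name": "data"},
         "spec": {"resources": {"requests": {"storage": "1Gi"}}}}
    ],
}
new = {
    "replicas": 2,
    "volumeClaimTemplates": [
        {"metadata": {"name": "data"},
         "spec": {"resources": {"requests": {"storage": "2Gi"}}}}
    ],
}

print(forbidden_changes(old, new))  # ['volumeClaimTemplates']
```

A non-empty result would be the signal to reject the change (or surface the error on the TrinoCluster) instead of retrying the update.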

Another option could be to allow the Trino operator to recreate the StatefulSets if fields like resource allocation are updated.

If you don't think the issue makes sense to look into, you can close it :)

@maltesander
Member

maltesander commented Mar 9, 2023

This probably affects every operator, not just Trino?

I could recreate the problem when changing the "data" storage size.

Changing CPU or Memory resources seems to work (results in a Pod restart).

In order to capture this we either need webhooks, or we need to track the resources in the status of the custom resource and delete and recreate the StatefulSet (PVCs should be resized automatically?).
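
As a rough Python sketch of the "track and recreate" idea, assuming the last applied storage requests are kept somewhere (for example in the custom resource status): the helper names `storage_requests` and `plan_update` are hypothetical, and sizes are compared as plain strings, so "1Gi" and "1024Mi" would count as different.

```python
def storage_requests(statefulset_spec: dict) -> dict[str, str]:
    """Map each volumeClaimTemplate name to its requested storage size."""
    sizes = {}
    for template in statefulset_spec.get("volumeClaimTemplates", []):
        name = template["metadata"]["name"]
        sizes[name] = template["spec"]["resources"]["requests"]["storage"]
    return sizes


def plan_update(applied_spec: dict, desired_spec: dict) -> str:
    """Return "recreate" if any storage request differs, otherwise "patch".

    CPU and memory live in the pod template, which can be patched in place,
    so only the volumeClaimTemplates are compared here.
    """
    if storage_requests(applied_spec) != storage_requests(desired_spec):
        return "recreate"
    return "patch"


def make_spec(storage: str, cpu: str) -> dict:
    """Build a small, hypothetical StatefulSet spec for the example."""
    return {
        "template": {"spec": {"containers": [
            {"name": "trino", "resources": {"requests": {"cpu": cpu}}}
        ]}},
        "volumeClaimTemplates": [
            {"metadata": {"name": "data"},
             "spec": {"resources": {"requests": {"storage": storage}}}}
        ],
    }


print(plan_update(make_spec("1Gi", "1"), make_spec("1Gi", "2")))  # patch
print(plan_update(make_spec("1Gi", "1"), make_spec("2Gi", "1")))  # recreate
```

Whether the existing PVCs are actually resized after a recreate is still the open question above; the sketch only decides which path to take.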

@Maleware Maleware self-assigned this May 4, 2023
@Maleware Maleware moved this from Next to Refinement: In Progress in Stackable Engineering May 4, 2023
@Maleware
Contributor

Maleware commented May 5, 2023

I've tested this further and it is the same in Hive; it seems we have to fix it in all operators.

I've tested that behaviour on Ionos and on gcloud. Changing memory and CPU on Ionos works perfectly fine; however, on gcloud we encounter a problem:

Warning FailedScheduling 8m52s default-scheduler 0/3 nodes are available: 3 waiting for ephemeral volume controller to create the persistentvolumeclaim "simple-trino-worker-default-0-server-tls-mount". preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
Warning FailedBinding 8m51s ephemeral_volume ephemeral volume hive-simple-s3-credentials-secret-class: PVC trino-test/simple-trino-worker-default-0-hive-simple-s3-credentials-secret-class was not created for pod trino-test/simple-trino-worker-default-0 (pod is not owner)
Warning FailedBinding 8m51s ephemeral_volume ephemeral volume internal-tls-mount: create PVC simple-trino-worker-default-0-internal-tls-mount: persistentvolumeclaims "simple-trino-worker-default-0-internal-tls-mount" already exists
Warning FailedScheduling 3m32s (x2 over 8m50s) default-scheduler running "NodeVolumeLimits" filter plugin: PVC trino-test/simple-trino-coordinator-default-0-server-tls-mount was not created for pod trino-test/simple-trino-coordinator-default-0 (pod is not owner)

edit: After a longer period of time the PVC gets attached to the volume again

Changing the size of PVCs always ends in the error documented above, no matter whether it is tested on Ionos or gcloud.
