k8ssandra-operator in CrashLoopBackOff during Kubernetes upgrade #1454

Open · c3-clement opened this issue Nov 22, 2024 · 3 comments · May be fixed by #1462
Labels
bug Something isn't working

Comments

@c3-clement (Contributor) commented Nov 22, 2024

What happened?

We had a K8ssandra cluster with Medusa enabled deployed on AKS 1.29.
A cluster admin upgraded AKS to 1.30; after that, k8ssandra-operator was in CrashLoopBackOff and the K8ssandra pods were not starting.
The K8ssandra pods finally started about 4.5 hours after the AKS upgrade.

Did you expect to see something different?

I expect a Kubernetes upgrade not to cause K8ssandra unavailability.

How to reproduce it (as minimally and precisely as possible):

  1. Deploy an AKS 1.29 cluster
  2. Deploy k8ssandra-operator v1.14
  3. Deploy a K8ssandra cluster with at least 3 replicas and Medusa backups enabled
  4. Create a MedusaBackupSchedule or create MedusaBackupJobs
  5. Delete the nodepool containing the Cassandra pods, so that all Cassandra pods are evicted
  6. k8ssandra-operator should crash when it attempts to reconcile a MedusaBackupJob for this K8ssandra cluster
  7. Create a new 1.30 nodepool

Environment

  • K8ssandra Operator version:
    1.14.0

  • Kubernetes version information:
    1.29 -> 1.30

  • Kubernetes cluster kind:
    AKS

  • Manifests:

K8ssandraCluster manifest (after recovery):

apiVersion: k8ssandra.io/v1alpha1
kind: K8ssandraCluster
metadata:
  annotations:
    k8ssandra.io/initial-system-replication: '{"cs-1855ea8cc2":3}'
  creationTimestamp: "2023-08-15T18:06:51Z"
  finalizers:
  - k8ssandracluster.k8ssandra.io/finalizer
  generation: 3236
  name: cs-1855ea8cc2
  namespace: qaazv8dow
  resourceVersion: "348869397"
  uid: 6ffc08c2-d358-4f3b-a102-e99dbd396651
spec:
  auth: true
  cassandra:
    datacenters:
    - config:
        cassandraYaml:
          audit_logging_options:
            enabled: true
          commitlog_directory: /c3/cassandra/data/commitlog
          data_file_directories:
          - /c3/cassandra/data/data
          saved_caches_directory: /c3/cassandra/data/saved_caches
        jvmOptions:
          gc: G1GC
          heapSize: 12G
      containers:
      - args:
        - -c
        - tail -n+1 -F /c3/cassandra/logs/system.log
        command:
        - /bin/sh
        image: prodqaacr.azurecr.io/bundled-cassandra-jre-bcfips:4.0.12-r1-202407171520
        imagePullPolicy: Always
        name: server-system-logger
        resources: {}
        volumeMounts:
        - mountPath: /c3/cassandra
          name: server-data
      - args:
        - mgmtapi
        command:
        - /docker-entrypoint.sh
        name: cassandra
        resources: {}
        volumeMounts:
        - mountPath: /c3/cassandra
          name: server-data
      - env:
        - name: MEDUSA_TMP_DIR
          value: /c3/cassandra/medusa/tmp
        name: medusa
        resources: {}
        volumeMounts:
        - mountPath: /c3/cassandra
          name: server-data
      initContainers:
      - command:
        - /usr/local/bin/entrypoint
        image: prodqaacr.azurecr.io/bundled-cassandra-jre-bcfips:4.0.12-r1-202407171520
        imagePullPolicy: Always
        name: server-config-init
        resources: {}
      - env:
        - name: MEDUSA_TMP_DIR
          value: /c3/cassandra/medusa/tmp
        name: medusa-restore
        resources: {}
        volumeMounts:
        - mountPath: /c3/cassandra
          name: server-data
      metadata:
        name: cs-1855ea8cc2
        pods:
          labels:
            azure.workload.identity/use: "true"
            c3__app-0: 0c30
            c3__app_id-0: 0qaazv8dow-c3-c30
            c3__cluster-0: 0qaazv8dow0
            c3__created-0: 02023-12-13T00_3A42_3A42.127Z0
            c3__created_by-0: 0worker0
            c3__created_from-0: 0qaazv8dow-c3-c3-k8spo-appleader-001-6648dbd5cbsq4hl0
            c3__env-0: 0c30
            c3__env_id-0: 0qaazv8dow-c30
            c3__func-0: 0k8sc3cass0
            c3__id-0: 0qaazv8dow-c3-c3-k8sc3cass0
            c3__namespace-0: 0qaazv8dow0
            c3__role-0: "00"
            c3__seq-0: "00"
            c3__service-0: 0qaazv8dow-c3-c3-k8scass-cs-0010
            c3__subseq-0: "00"
            c3__updated-0: 02023-12-13T00_3A42_3A42.127Z0
            c3__updated_by-0: 0worker0
            c3__updated_from-0: 0qaazv8dow-c3-c3-k8spo-appleader-001-6648dbd5cbsq4hl0
        services:
          additionalSeedService: {}
          allPodsService: {}
          dcService: {}
          nodePortService: {}
          seedService: {}
      perNodeConfigInitContainerImage: mikefarah/yq:4
      perNodeConfigMapRef: {}
      podSecurityContext:
        fsGroup: 999
        runAsGroup: 999
        runAsUser: 999
      serviceAccount: c3cassandra
      size: 3
      stargate:
        allowStargateOnDataNodes: false
        containerImage:
          name: stargate-v1
          registry: prodqaacr.azurecr.io
          repository: c3.ai
          tag: 1.0.85-r2-202410132043
        heapSize: 256M
        metadata:
          commonLabels:
            azure.workload.identity/use: "true"
            c3__app-0: 0c30
            c3__app_id-0: 0qaazv8dow-c3-c30
            c3__cluster-0: 0qaazv8dow0
            c3__created-0: 02023-12-13T00_3A42_3A42.127Z0
            c3__created_by-0: 0worker0
            c3__created_from-0: 0qaazv8dow-c3-c3-k8spo-appleader-001-6648dbd5cbsq4hl0
            c3__env-0: 0c30
            c3__env_id-0: 0qaazv8dow-c30
            c3__func-0: 0k8sc3cass0
            c3__id-0: 0qaazv8dow-c3-c3-k8sc3cass0
            c3__namespace-0: 0qaazv8dow0
            c3__role-0: "00"
            c3__seq-0: "00"
            c3__service-0: 0qaazv8dow-c3-c3-k8scass-cs-0010
            c3__subseq-0: "00"
            c3__updated-0: 02023-12-13T00_3A42_3A42.127Z0
            c3__updated_by-0: 0worker0
            c3__updated_from-0: 0qaazv8dow-c3-c3-k8spo-appleader-001-6648dbd5cbsq4hl0
          pods: {}
          service: {}
        secretsProvider: internal
        serviceAccount: c3cassandra
        size: 1
        tolerations:
        - effect: NoSchedule
          key: cassandra
          operator: Equal
          value: "true"
      stopped: false
      storageConfig:
        cassandraDataVolumeClaimSpec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 1Ti
          storageClassName: default
      tolerations:
      - effect: NoSchedule
        key: cassandra
        operator: Equal
        value: "true"
    jmxInitContainerImage:
      name: busybox
      registry: docker.io
      tag: 1.34.1
    metadata:
      pods: {}
      services:
        additionalSeedService: {}
        allPodsService: {}
        dcService: {}
        nodePortService: {}
        seedService: {}
    perNodeConfigInitContainerImage: mikefarah/yq:4
    resources:
      limits:
        cpu: "8"
        memory: 64Gi
      requests:
        cpu: "8"
        memory: 64Gi
    serverType: cassandra
    serverVersion: 4.0.13
    superuserSecretRef: {}
    telemetry:
      mcac:
        enabled: true
        metricFilters:
        - deny:org.apache.cassandra.metrics.Table
        - deny:org.apache.cassandra.metrics.table
        - allow:org.apache.cassandra.metrics.table.live_ss_table_count
        - allow:org.apache.cassandra.metrics.Table.LiveSSTableCount
        - allow:org.apache.cassandra.metrics.table.live_disk_space_used
        - allow:org.apache.cassandra.metrics.table.LiveDiskSpaceUsed
        - allow:org.apache.cassandra.metrics.Table.Pending
        - allow:org.apache.cassandra.metrics.Table.Memtable
        - allow:org.apache.cassandra.metrics.Table.Compaction
        - allow:org.apache.cassandra.metrics.table.read
        - allow:org.apache.cassandra.metrics.table.write
        - allow:org.apache.cassandra.metrics.table.range
        - allow:org.apache.cassandra.metrics.table.coordinator
        - allow:org.apache.cassandra.metrics.table.dropped_mutations
        - allow:org.apache.cassandra.metrics.Table.TombstoneScannedHistogram
        - allow:org.apache.cassandra.metrics.table.tombstone_scanned_histogram
      prometheus:
        enabled: true
  medusa:
    cassandraUserSecretRef: {}
    certificatesSecretRef: {}
    containerImage:
      name: c3-cassandra-medusa-fips
      pullPolicy: Always
      registry: prodqaacr.azurecr.io
      repository: c3.ai
      tag: 0.21.0-r5-202407181745
    medusaConfigurationRef: {}
    storageProperties:
      bucketName: qaazv8dow
      concurrentTransfers: 0
      maxBackupAge: 7
      maxBackupCount: 0
      multiPartUploadThreshold: 104857600
      prefix: c3cassandra-backup
      storageProvider: azure_blobs
      storageSecretRef:
        name: cs-1855ea8cc2-medusa-bucket-key
      transferMaxBandwidth: 50MB/s
  reaper:
    ServiceAccountName: c3cassandra
    autoScheduling:
      enabled: true
      initialDelayPeriod: PT15S
      percentUnrepairedThreshold: 10
      periodBetweenPolls: PT10M
      repairType: AUTO
      scheduleSpreadPeriod: PT6H
      timeBeforeFirstSchedule: PT5M
    cassandraUserSecretRef:
      name: cs-1855ea8cc2-superuser
    containerImage:
      name: cassandra-reaper-jre-bcfips
      registry: prodqaacr.azurecr.io
      repository: c3.ai
      tag: 3.5.0-202403142052
    deploymentMode: PER_DC
    heapSize: 2Gi
    httpManagement:
      enabled: false
    initContainerImage:
      name: cassandra-reaper-jre-bcfips
      registry: prodqaacr.azurecr.io
      repository: c3.ai
      tag: 3.5.0-202403142052
    jmxUserSecretRef: {}
    keyspace: reaper_db
    metadata:
      pods:
        labels:
          azure.workload.identity/use: "true"
          c3__app-0: 0c30
          c3__app_id-0: 0qaazv8dow-c3-c30
          c3__cluster-0: 0qaazv8dow0
          c3__created-0: 02023-12-13T00_3A42_3A42.127Z0
          c3__created_by-0: 0worker0
          c3__created_from-0: 0qaazv8dow-c3-c3-k8spo-appleader-001-6648dbd5cbsq4hl0
          c3__env-0: 0c30
          c3__env_id-0: 0qaazv8dow-c30
          c3__func-0: 0k8sc3cass0
          c3__id-0: 0qaazv8dow-c3-c3-k8sc3cass0
          c3__namespace-0: 0qaazv8dow0
          c3__role-0: "00"
          c3__seq-0: "00"
          c3__service-0: 0qaazv8dow-c3-c3-k8scass-cs-0010
          c3__subseq-0: "00"
          c3__updated-0: 02023-12-13T00_3A42_3A42.127Z0
          c3__updated_by-0: 0worker0
          c3__updated_from-0: 0qaazv8dow-c3-c3-k8spo-appleader-001-6648dbd5cbsq4hl0
      service: {}
    secretsProvider: internal
    tolerations:
    - effect: NoSchedule
      key: cassandra
      operator: Equal
      value: "true"
  secretsProvider: internal
status:
  conditions:
  - lastTransitionTime: "2023-08-15T18:58:42Z"
    status: "True"
    type: CassandraInitialized
  datacenters:
    cs-1855ea8cc2:
      cassandra:
        cassandraOperatorProgress: Ready
        conditions:
        - lastTransitionTime: "2023-12-13T07:23:09Z"
          message: ""
          reason: ""
          status: "True"
          type: Healthy
        - lastTransitionTime: "2023-08-15T18:58:40Z"
          message: ""
          reason: ""
          status: "False"
          type: Stopped
        - lastTransitionTime: "2023-08-15T18:58:40Z"
          message: ""
          reason: ""
          status: "False"
          type: ReplacingNodes
        - lastTransitionTime: "2024-11-22T07:12:08Z"
          message: ""
          reason: ""
          status: "False"
          type: Updating
        - lastTransitionTime: "2023-08-15T18:58:40Z"
          message: ""
          reason: ""
          status: "False"
          type: RollingRestart
        - lastTransitionTime: "2023-08-15T18:58:40Z"
          message: ""
          reason: ""
          status: "False"
          type: Resuming
        - lastTransitionTime: "2023-08-15T18:58:40Z"
          message: ""
          reason: ""
          status: "False"
          type: ScalingDown
        - lastTransitionTime: "2023-08-15T18:58:40Z"
          message: ""
          reason: ""
          status: "True"
          type: Valid
        - lastTransitionTime: "2023-08-15T18:58:40Z"
          message: ""
          reason: ""
          status: "True"
          type: Initialized
        - lastTransitionTime: "2023-08-15T18:58:40Z"
          message: ""
          reason: ""
          status: "True"
          type: Ready
        - lastTransitionTime: "2024-11-22T00:02:31Z"
          message: ""
          reason: ""
          status: "False"
          type: RequiresUpdate
        datacenterName: ""
        lastServerNodeStarted: "2024-11-22T07:11:38Z"
        nodeStatuses:
          cs-1855ea8cc2-cs-1855ea8cc2-default-sts-0:
            hostID: 975804b2-abad-425f-a780-ba8f66bec4be
          cs-1855ea8cc2-cs-1855ea8cc2-default-sts-1:
            hostID: 41c647a0-7057-460c-bf84-6d187175aefe
          cs-1855ea8cc2-cs-1855ea8cc2-default-sts-2:
            hostID: 475846a4-c5da-49e7-9dd5-35838abb81db
        observedGeneration: 64
        quietPeriod: "2024-11-22T07:12:14Z"
        superUserUpserted: "2024-11-22T07:12:09Z"
        usersUpserted: "2024-11-22T07:12:09Z"
      reaper:
        conditions:
        - lastTransitionTime: "2024-11-22T07:01:08Z"
          status: "True"
          type: Ready
        progress: Running
      stargate:
        availableReplicas: 1
        conditions:
        - lastTransitionTime: "2024-11-22T07:13:21Z"
          status: "True"
          type: Ready
        deploymentRefs:
        - cs-1855ea8cc2-cs-1855ea8cc2-default-stargate-deployment
        progress: Running
        readyReplicas: 1
        readyReplicasRatio: 1/1
        replicas: 1
        serviceRef: cs-1855ea8cc2-cs-1855ea8cc2-stargate-service
        updatedReplicas: 1
  error: None

MedusaBackupSchedule:

apiVersion: medusa.k8ssandra.io/v1alpha1
kind: MedusaBackupSchedule
metadata:
  creationTimestamp: "2023-12-13T05:13:26Z"
  generation: 1
  labels:
    app: c3aiops
    ops.c3.ai/parent-resource: qaazv8dow-c3cassandra-cs-1855ea8cc2
    role: c3aiops-C3Cassandra
  name: cs-1855ea8cc2-cs-1855ea8cc2
  namespace: qaazv8dow
  resourceVersion: "348632318"
  uid: 3c1a38e6-39dd-41ee-a0de-431f1cbfd2a4
spec:
  backupSpec:
    backupType: differential
    cassandraDatacenter: cs-1855ea8cc2
  cronSchedule: '@daily'
status:
  lastExecution: "2024-11-22T00:00:00Z"
  nextSchedule: "2024-11-23T00:00:00Z"

  • K8ssandra Operator Logs:
Error:
15m         Warning   FailedCreate           statefulset/cs-1855ea8cc2-cs-1855ea8cc2-default-sts                               create Pod cs-1855ea8cc2-cs-1855ea8cc2-default-sts-0 in StatefulSet cs-1855ea8cc2-cs-1855ea8cc2-default-sts failed error: Internal error occurred: failed calling webhook "mpod.kb.io": failed to call webhook: Post "https://c3aiops-k8ssandra-operator-webhook-service.c3-opsadmin.svc:443/mutate-v1-pod-secrets-inject?timeout=10s": no endpoints available for service "c3aiops-k8ssandra-operator-webhook-service"
2024-11-22T04:25:56.431Z        INFO    Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference        {"controller": "medusabackupjob", "controllerGroup": "medusa.k8ssandra.io", "controllerKind": "MedusaBackupJob", "MedusaBackupJob": {"name":"cs-1855ea8cc2-cs-1855ea8cc2-1717977600","namespace":"qaazv8dow"}, "namespace": "qaazv8dow", "name": "cs-1855ea8cc2-cs-1855ea8cc2-1717977600", "reconcileID": "cec567f9-0bc6-4c46-9c55-5ee233259314"}
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x48 pc=0x55f16f305904]

goroutine 597 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
        sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:119 +0x1e5
panic({0x55f16f77d520?, 0x55f170644c80?})
        runtime/panic.go:770 +0x132
github.com/k8ssandra/k8ssandra-operator/controllers/medusa.(*MedusaBackupJobReconciler).createMedusaBackup(0xc000846480, {0x55f16f9d1440, 0xc0007c78f0}, 0xc00089b040, 0x0, {{0x55f16f9d58e8?, 0xc0007c7950?}, 0x55f170746ce0?})
        github.com/k8ssandra/k8ssandra-operator/controllers/medusa/medusabackupjob_controller.go:275 +0x5a4
github.com/k8ssandra/k8ssandra-operator/controllers/medusa.(*MedusaBackupJobReconciler).Reconcile(0xc000846480, {0x55f16f9d1440, 0xc0007c78f0}, {{{0xc000ca6356, 0x9}, {0xc000cc2420, 0x26}}})
        github.com/k8ssandra/k8ssandra-operator/controllers/medusa/medusabackupjob_controller.go:171 +0x569
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x55f16f9d1440?, {0x55f16f9d1440?, 0xc0007c78f0?}, {{{0xc000ca6356?, 0x55f16f6bee00?}, {0xc000cc2420?, 0x10?}}})
        sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:122 +0xb7
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000188a00, {0x55f16f9d1478, 0xc0001f3400}, {0x55f16f7dea00, 0xc000b161c0})
        sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:323 +0x345
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000188a00, {0x55f16f9d1478, 0xc0001f3400})
        sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274 +0x1c9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
        sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235 +0x79
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 140
        sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:231 +0x50c

Logs once k8ssandra-operator reached a Running state after the above crash:

2024-11-22T06:32:06.089Z        ERROR   Failed to fetch datacenter pods {"controller": "k8ssandracluster", "controllerGroup": "k8ssandra.io", "controllerKind": "K8ssandraCluster", "K8ssandraCluster": {"name":"cs-1855ea8cc2","namespace":"qaazv8dow"}, "namespace": "qaazv8dow", "name": "cs-1855ea8cc2", "reconcileID": "f52dcd0e-f2fa-4de2-9cd8-a5cf249a4dd8", "K8ssandraCluster": "qaazv8dow/cs-1855ea8cc2", "CassandraDatacenter": "qaazv8dow/cs-1855ea8cc2", "K8SContext": "", "error": "no pods in READY state found in datacenter cs-1855ea8cc2"}
github.com/k8ssandra/k8ssandra-operator/pkg/cassandra.(*defaultManagementApiFacade).ListKeyspaces
        github.com/k8ssandra/k8ssandra-operator/pkg/cassandra/management.go:192
github.com/k8ssandra/k8ssandra-operator/pkg/cassandra.(*defaultManagementApiFacade).EnsureKeyspaceReplication
        github.com/k8ssandra/k8ssandra-operator/pkg/cassandra/management.go:291
github.com/k8ssandra/k8ssandra-operator/controllers/k8ssandra.(*K8ssandraClusterReconciler).updateReplicationOfSystemKeyspaces
        github.com/k8ssandra/k8ssandra-operator/controllers/k8ssandra/schemas.go:161
github.com/k8ssandra/k8ssandra-operator/controllers/k8ssandra.(*K8ssandraClusterReconciler).checkSchemas
        github.com/k8ssandra/k8ssandra-operator/controllers/k8ssandra/schemas.go:43
github.com/k8ssandra/k8ssandra-operator/controllers/k8ssandra.(*K8ssandraClusterReconciler).reconcileDatacenters
        github.com/k8ssandra/k8ssandra-operator/controllers/k8ssandra/datacenters.go:207
github.com/k8ssandra/k8ssandra-operator/controllers/k8ssandra.(*K8ssandraClusterReconciler).reconcile
        github.com/k8ssandra/k8ssandra-operator/controllers/k8ssandra/k8ssandracluster_controller.go:144
github.com/k8ssandra/k8ssandra-operator/controllers/k8ssandra.(*K8ssandraClusterReconciler).Reconcile
        github.com/k8ssandra/k8ssandra-operator/controllers/k8ssandra/k8ssandracluster_controller.go:92
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
        sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235
2024-11-22T06:32:06.090Z        ERROR   Failed to update replication    {"controller": "k8ssandracluster", "controllerGroup": "k8ssandra.io", "controllerKind": "K8ssandraCluster", "K8ssandraCluster": {"name":"cs-1855ea8cc2","namespace":"qaazv8dow"}, "namespace": "qaazv8dow", "name": "cs-1855ea8cc2", "reconcileID": "f52dcd0e-f2fa-4de2-9cd8-a5cf249a4dd8", "K8ssandraCluster": "qaazv8dow/cs-1855ea8cc2", "CassandraDatacenter": "qaazv8dow/cs-1855ea8cc2", "K8SContext": "", "keyspace": "system_traces", "error": "no pods in READY state found in datacenter cs-1855ea8cc2"}
github.com/k8ssandra/k8ssandra-operator/controllers/k8ssandra.(*K8ssandraClusterReconciler).updateReplicationOfSystemKeyspaces
        github.com/k8ssandra/k8ssandra-operator/controllers/k8ssandra/schemas.go:165
github.com/k8ssandra/k8ssandra-operator/controllers/k8ssandra.(*K8ssandraClusterReconciler).checkSchemas
        github.com/k8ssandra/k8ssandra-operator/controllers/k8ssandra/schemas.go:43
github.com/k8ssandra/k8ssandra-operator/controllers/k8ssandra.(*K8ssandraClusterReconciler).reconcileDatacenters
        github.com/k8ssandra/k8ssandra-operator/controllers/k8ssandra/datacenters.go:207
github.com/k8ssandra/k8ssandra-operator/controllers/k8ssandra.(*K8ssandraClusterReconciler).reconcile
        github.com/k8ssandra/k8ssandra-operator/controllers/k8ssandra/k8ssandracluster_controller.go:144
github.com/k8ssandra/k8ssandra-operator/controllers/k8ssandra.(*K8ssandraClusterReconciler).Reconcile
        github.com/k8ssandra/k8ssandra-operator/controllers/k8ssandra/k8ssandracluster_controller.go:92
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
        sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235
2024-11-22T06:32:06.090Z        DEBUG   events  no pods in READY state found in datacenter cs-1855ea8cc2        {"type": "Warning", "object": {"kind":"K8ssandraCluster","namespace":"qaazv8dow","name":"cs-1855ea8cc2","uid":"6ffc08c2-d358-4f3b-a102-e99dbd396651","apiVersion":"k8ssandra.io/v1alpha1","resourceVersion":"348839464"}, "reason": "Reconcile Error"}
2024-11-22T06:32:06.106Z        INFO    updated k8ssandracluster status {"controller": "k8ssandracluster", "controllerGroup": "k8ssandra.io", "controllerKind": "K8ssandraCluster", "K8ssandraCluster": {"name":"cs-1855ea8cc2","namespace":"qaazv8dow"}, "namespace": "qaazv8dow", "name": "cs-1855ea8cc2", "reconcileID": "f52dcd0e-f2fa-4de2-9cd8-a5cf249a4dd8", "K8ssandraCluster": "qaazv8dow/cs-1855ea8cc2"}
2024-11-22T06:32:06.106Z        ERROR   Reconciler error        {"controller": "k8ssandracluster", "controllerGroup": "k8ssandra.io", "controllerKind": "K8ssandraCluster", "K8ssandraCluster": {"name":"cs-1855ea8cc2","namespace":"qaazv8dow"}, "namespace": "qaazv8dow", "name": "cs-1855ea8cc2", "reconcileID": "f52dcd0e-f2fa-4de2-9cd8-a5cf249a4dd8", "error": "no pods in READY state found in datacenter cs-1855ea8cc2"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235

Anything else we need to know?:
I checked the code, and I believe all k8ssandra-operator versions are impacted, not only 1.14.

┆Issue is synchronized with this Jira Story by Unito
┆Issue Number: K8OP-294

@c3-clement added the bug (Something isn't working) label on Nov 22, 2024
@burmanm (Contributor) commented Nov 22, 2024

@rzvoncek This happens because we can exit here: https://github.com/k8ssandra/k8ssandra-operator/blob/main/controllers/medusa/medusabackupjob_controller.go#L251

No pod returned a backupSummary, but we also didn't encounter any error, so createMedusaBackup ends up being called with a nil backupSummary.
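
For illustration, here is a minimal, self-contained Go sketch of the failure shape described above. It is not the operator's actual code: BackupSummary, podBackupStatus, findBackupSummary, and the simplified createMedusaBackup signature are hypothetical stand-ins; only the control flow (a "not found" result with no error flowing into a dereference) mirrors the stack trace:

// Hypothetical, simplified sketch of the failure mode described above; it is not the
// actual k8ssandra-operator source. The names mirror the stack trace, but the types
// and signatures here are assumptions made for illustration.
package main

import (
	"context"
	"fmt"
)

type BackupSummary struct {
	TotalSize int64
}

type podBackupStatus struct {
	podName string
	summary *BackupSummary // nil when this pod does not report the backup
}

// findBackupSummary scans the pods for a matching backup. When no pod has it,
// it returns (nil, nil): no summary, but also no error.
func findBackupSummary(statuses []podBackupStatus) (*BackupSummary, error) {
	for _, s := range statuses {
		if s.summary != nil {
			return s.summary, nil
		}
	}
	return nil, nil // the silent "not found" branch pointed at above
}

func createMedusaBackup(_ context.Context, summary *BackupSummary) error {
	// Dereferencing a nil summary here is what triggers the SIGSEGV in the operator.
	fmt.Println("backup size:", summary.TotalSize)
	return nil
}

func main() {
	ctx := context.Background()

	// All Cassandra pods were evicted, so no pod reports the backup.
	summary, err := findBackupSummary(nil)
	if err != nil {
		panic(err)
	}

	// Without a guard like this, createMedusaBackup(ctx, nil) panics with a
	// nil pointer dereference, which is what crash-loops the operator.
	if summary == nil {
		fmt.Println("no backup summary found on any pod; requeue instead of creating the backup")
		return
	}
	_ = createMedusaBackup(ctx, summary)
}

Removing the nil guard in this sketch reproduces the same nil pointer dereference seen in the operator stack trace above.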

@c3-clement (Contributor, Author)

I updated the issue with the K8ssandraCluster and MedusaBackupSchedule manifests.

@c3-clement (Contributor, Author)

For more context, the cluster admin deleted the entire 1.29 nodepool containing the Cassandra pods and then re-created a 1.30 nodepool.

That explains why the StatefulSet had 0 pods running, which caused the crash in medusabackupjob_controller.
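
One defensive pattern that avoids this class of panic is to check for ready datacenter pods up front and requeue with a delay when there are none. The sketch below shows that pattern with standard controller-runtime and core/v1 types; it is only an illustration of the idea, not necessarily what the linked fix (#1462) implements, and reconcileBackupJob / hasReadyPod are hypothetical helpers:

// Sketch of a defensive guard in a controller-runtime reconciler: if the datacenter
// currently has no ready pods (for example because the nodepool was deleted),
// requeue with a delay instead of trying to collect a backup summary.
package medusasketch

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
)

// hasReadyPod reports whether at least one pod has the Ready condition set to True.
func hasReadyPod(pods []corev1.Pod) bool {
	for _, pod := range pods {
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				return true
			}
		}
	}
	return false
}

// reconcileBackupJob is a hypothetical reconcile step, not the operator's real method.
func reconcileBackupJob(ctx context.Context, dcPods []corev1.Pod) (ctrl.Result, error) {
	if !hasReadyPod(dcPods) {
		// Nothing to back up against yet; come back later instead of panicking.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	// ... proceed with fetching the backup summary and creating the MedusaBackup ...
	return ctrl.Result{}, nil
}

Requeueing would keep the MedusaBackupJob pending until the new nodepool brings Cassandra pods back, instead of crash-looping the operator.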
