
All nats-streaming pods in CrashLoopBackOff state #40

Open
popaaaandrei opened this issue Apr 5, 2019 · 6 comments
@popaaaandrei

Wasn't resilience supposed to be the great benefit of deploying NATS clusters? I came into the office this morning and found ALL nats-streaming-1-* pods in CrashLoopBackOff with around 500 restarts, and all messages have obviously been lost.

nats-cluster-1-1                                1/1     Running            0          41h
nats-cluster-1-2                                1/1     Running            0          42h
nats-cluster-1-3                                1/1     Running            0          41h
nats-operator-5b47bc4f8-77glm                   1/1     Running            0          42h
nats-streaming-1-1                              0/1     CrashLoopBackOff   490        42h
nats-streaming-1-2                              0/1     CrashLoopBackOff   486        41h
nats-streaming-1-3                              0/1     CrashLoopBackOff   490        42h
nats-streaming-operator-59647b496-v4vv5         1/1     Running            0          42h
$ kubectl logs nats-streaming-1-1
[1] 2019/04/05 10:43:45.406013 [INF] STREAM: Starting nats-streaming-server[nats-streaming-1] version 0.12.2
[1] 2019/04/05 10:43:45.406058 [INF] STREAM: ServerID: PlHofW9bI3tXiJYkkRkQCQ
[1] 2019/04/05 10:43:45.406061 [INF] STREAM: Go version: go1.11.6
[1] 2019/04/05 10:43:45.406064 [INF] STREAM: Git commit: [4489c46]
[1] 2019/04/05 10:43:45.422431 [INF] STREAM: Recovering the state...
[1] 2019/04/05 10:43:45.422755 [INF] STREAM: No recovered state
[1] 2019/04/05 10:43:45.422838 [INF] STREAM: Cluster Node ID : "nats-streaming-1-1"
[1] 2019/04/05 10:43:45.422847 [INF] STREAM: Cluster Log Path: nats-streaming-1/"nats-streaming-1-1"
[1] 2019/04/05 10:43:50.531934 [INF] STREAM: Shutting down.
[1] 2019/04/05 10:43:50.532450 [FTL] STREAM: Failed to start: failed to join Raft group nats-streaming-1

Even if I delete all the pods, it still doesn't recover. I have to delete the whole natsstreamingcluster.streaming.nats.io/nats-streaming-1 and recreate it to make it work.
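
For the record, the recreate workaround is roughly the following (assuming the NatsStreamingCluster manifest is saved locally as nats-streaming-1.yaml, which is just a placeholder name):

$ kubectl delete natsstreamingcluster.streaming.nats.io nats-streaming-1
$ kubectl apply -f nats-streaming-1.yaml    # recreate the NatsStreamingCluster from the original manifest
$ kubectl get pods -w                       # watch the operator bring the new pods up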

@wallyqs
Member

wallyqs commented Apr 5, 2019

Do you have a persistent volume for the replicas? Could you share more info about the deployment, for example, which cloud is it running on?

@popaaaandrei
Author

popaaaandrei commented Apr 5, 2019

Thank you for responding.
The setup is GKE (v1.12.6-gke.7) + nats-operator + nats-streaming-operator, updated to the latest releases.
I don't think there is a PV; I use the standard setup created through the operator, but at some point I will need to add persistence. This is still a dev environment, so we can play with various configs.

---
apiVersion: "nats.io/v1alpha2"
kind: "NatsCluster"
metadata:
  name: "nats-cluster-1"
spec:
  size: 3
---
apiVersion: "streaming.nats.io/v1alpha1"
kind: "NatsStreamingCluster"
metadata:
  name: "nats-streaming-1"
spec:
  size: 3
  natsSvc: "nats-cluster-1"

@wallyqs
Member

wallyqs commented Apr 5, 2019

Thanks for the info. On GKE, do you have automatic node upgrades enabled? That drains the nodes and restarts all instances in a way that I think could affect the quorum of the cluster if it is only using local disk.

@popaaaandrei
Author

I have Automatic node upgrades = Disabled and Automatic node repair = Enabled on that cluster. But I remember that I upgraded k8s manually 4 days ago. The thing is, after I upgraded the nodes I checked that all the pods were in a running state, so this happened after that.

@Quentin-M

Not having an individual PV per pod makes the whole operator worthless.

@Quentin-M

Also, because the operator creates individual pods (with no anti-affinity, even) rather than a StatefulSet, you can't actually create PVCs/PVs that would carefully match the scheduling decisions Kubernetes makes.

IMHO, people should probably just stop relying on the operator and use a proper StatefulSet straight up.
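
For anyone considering that route, here is a minimal sketch of what a plain StatefulSet could look like. All names, the image tag, the NATS URL, and the storage size are illustrative placeholders; the flags are the standard nats-streaming-server file-store/clustering options, so verify them against the docs for the version you run:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nats-streaming
spec:
  serviceName: nats-streaming            # requires a matching headless Service (not shown)
  replicas: 3
  selector:
    matchLabels:
      app: nats-streaming
  template:
    metadata:
      labels:
        app: nats-streaming
    spec:
      affinity:
        podAntiAffinity:                  # keep replicas on separate nodes
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: nats-streaming
            topologyKey: kubernetes.io/hostname
      containers:
      - name: stan
        image: nats-streaming:0.12.2
        env:
        - name: POD_NAME                  # used as the Raft node ID
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        args:
        - "-cluster_id=nats-streaming-1"
        - "-nats_server=nats://nats-cluster-1:4222"
        - "-store=file"
        - "-dir=/data/stan"
        - "-clustered"
        - "-cluster_node_id=$(POD_NAME)"
        - "-cluster_peers=nats-streaming-0,nats-streaming-1,nats-streaming-2"
        volumeMounts:
        - name: stan-data
          mountPath: /data/stan
  volumeClaimTemplates:                   # one PV per pod, re-attached across restarts
  - metadata:
      name: stan-data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi

With per-pod PVCs each replica keeps its Raft log across restarts, so a node drain or pod restart shouldn't leave the cluster unable to rejoin its Raft group the way the logs above show.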
