
All nats-streaming pods in CrashLoopBackOff state #40

Open
popaaaandrei opened this issue Apr 5, 2019 · 6 comments
@popaaaandrei

Wasn't resilience supposed to be the great benefit of deploying NATS clusters? I came into the office this morning and found ALL nats-streaming-1-* pods in CrashLoopBackOff with around 500 restarts, and all messages have obviously been lost.

nats-cluster-1-1                                1/1     Running            0          41h
nats-cluster-1-2                                1/1     Running            0          42h
nats-cluster-1-3                                1/1     Running            0          41h
nats-operator-5b47bc4f8-77glm                   1/1     Running            0          42h
nats-streaming-1-1                              0/1     CrashLoopBackOff   490        42h
nats-streaming-1-2                              0/1     CrashLoopBackOff   486        41h
nats-streaming-1-3                              0/1     CrashLoopBackOff   490        42h
nats-streaming-operator-59647b496-v4vv5         1/1     Running            0          42h
$ kubectl logs nats-streaming-1-1
[1] 2019/04/05 10:43:45.406013 [INF] STREAM: Starting nats-streaming-server[nats-streaming-1] version 0.12.2
[1] 2019/04/05 10:43:45.406058 [INF] STREAM: ServerID: PlHofW9bI3tXiJYkkRkQCQ
[1] 2019/04/05 10:43:45.406061 [INF] STREAM: Go version: go1.11.6
[1] 2019/04/05 10:43:45.406064 [INF] STREAM: Git commit: [4489c46]
[1] 2019/04/05 10:43:45.422431 [INF] STREAM: Recovering the state...
[1] 2019/04/05 10:43:45.422755 [INF] STREAM: No recovered state
[1] 2019/04/05 10:43:45.422838 [INF] STREAM: Cluster Node ID : "nats-streaming-1-1"
[1] 2019/04/05 10:43:45.422847 [INF] STREAM: Cluster Log Path: nats-streaming-1/"nats-streaming-1-1"
[1] 2019/04/05 10:43:50.531934 [INF] STREAM: Shutting down.
[1] 2019/04/05 10:43:50.532450 [FTL] STREAM: Failed to start: failed to join Raft group nats-streaming-1

Even if I delete all the pods, it still doesn't recover. I have to delete the whole natsstreamingcluster.streaming.nats.io/nats-streaming-1 and recreate it to make it work.
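
For the record, the recreate workaround is roughly the following (assuming the NatsStreamingCluster manifest is saved locally as nats-streaming-1.yaml, which is just a placeholder name):

$ kubectl delete natsstreamingcluster.streaming.nats.io nats-streaming-1
$ kubectl apply -f nats-streaming-1.yaml    # recreate the NatsStreamingCluster from the original manifest
$ kubectl get pods -w                       # watch the operator bring the new pods up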

@wallyqs
Member

wallyqs commented Apr 5, 2019

Do you have a persistent volume for the replicas? Could you share more info about the deployment, for example, which cloud is it running on?

@popaaaandrei
Author

popaaaandrei commented Apr 5, 2019

Thank you for responding.
The setup is GKE (v1.12.6-gke.7) + nats-operator + nats-streaming-operator, updated to the latest releases.
I don't think there is a PV; I use the standard setup created through the operator, but at some point I will need to add persistence. This is still a dev environment, so we can play with various configs.

---
apiVersion: "nats.io/v1alpha2"
kind: "NatsCluster"
metadata:
  name: "nats-cluster-1"
spec:
  size: 3
---
apiVersion: "streaming.nats.io/v1alpha1"
kind: "NatsStreamingCluster"
metadata:
  name: "nats-streaming-1"
spec:
  size: 3
  natsSvc: "nats-cluster-1"

@wallyqs
Member

wallyqs commented Apr 5, 2019

Thanks for the info. On GKE, do you have automatic node upgrades enabled? That drains the nodes and restarts all instances in a way that I think could affect the quorum of the cluster if it is only using local disk.

@popaaaandrei
Author

I have Automatic node upgrades = Disabled and Automatic node repair = Enabled on that cluster. But I remember that I upgraded k8s manually 4 days ago. The thing is, after I upgraded the nodes I checked that all the pods were in a running state, so this happened after that.

@Quentin-M

Not having an individual PV per pod makes the whole operator worthless.

@Quentin-M

Also, because the operator creates individual pods (with no anti-affinity, even) rather than a StatefulSet, you can't actually create PVCs/PVs that would carefully match the scheduling decisions Kubernetes makes.

IMHO, people should probably just stop relying on the operator and use a proper StatefulSet straight up.
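
For anyone considering that route, here is a minimal sketch of what a plain StatefulSet could look like. All names, the image tag, the NATS URL, and the storage size are illustrative placeholders; the flags are the standard nats-streaming-server file-store/clustering options, so verify them against the docs for the version you run:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nats-streaming
spec:
  serviceName: nats-streaming            # requires a matching headless Service (not shown)
  replicas: 3
  selector:
    matchLabels:
      app: nats-streaming
  template:
    metadata:
      labels:
        app: nats-streaming
    spec:
      affinity:
        podAntiAffinity:                  # keep replicas on separate nodes
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: nats-streaming
            topologyKey: kubernetes.io/hostname
      containers:
      - name: stan
        image: nats-streaming:0.12.2
        env:
        - name: POD_NAME                  # used as the Raft node ID
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        args:
        - "-cluster_id=nats-streaming-1"
        - "-nats_server=nats://nats-cluster-1:4222"
        - "-store=file"
        - "-dir=/data/stan"
        - "-clustered"
        - "-cluster_node_id=$(POD_NAME)"
        - "-cluster_peers=nats-streaming-0,nats-streaming-1,nats-streaming-2"
        volumeMounts:
        - name: stan-data
          mountPath: /data/stan
  volumeClaimTemplates:                   # one PV per pod, re-attached across restarts
  - metadata:
      name: stan-data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi

With per-pod PVCs each replica keeps its Raft log across restarts, so a node drain or pod restart shouldn't leave the cluster unable to rejoin its Raft group the way the logs above show.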
