Cluster in permanent failure state if all pods crash #24

pvanderlinden · 2018-12-05T16:45:17Z

If all pods go down, the cluster never comes back up. (In this case I was testing on a single-node minikube cluster, and the node crashed.)

From then on, the log permanently looks like this:

[1] 2018/12/05 16:43:14.755467 [INF] STREAM: Starting nats-streaming-server[stan] version 0.11.2
[1] 2018/12/05 16:43:14.755701 [INF] STREAM: ServerID: hX4G9R76qCMRvR1HThDCd5
[1] 2018/12/05 16:43:14.755745 [INF] STREAM: Go version: go1.11.1
[1] 2018/12/05 16:43:14.762222 [INF] STREAM: Recovering the state...
[1] 2018/12/05 16:43:14.762387 [INF] STREAM: No recovered state
[1] 2018/12/05 16:43:14.762463 [INF] STREAM: Cluster Node ID : "stan-2"
[1] 2018/12/05 16:43:14.762494 [INF] STREAM: Cluster Log Path: stan/"stan-2"
[1] 2018/12/05 16:43:19.908446 [INF] STREAM: Shutting down.
[1] 2018/12/05 16:43:19.910918 [FTL] STREAM: Failed to start: failed to join Raft group stan
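
For anyone who wants to detect this state automatically, here is a minimal sketch that scans a server log for the fatal Raft join failure. It assumes only the line layout visible in the log above (`[pid] date time [LEVEL] message`). The function names `find_fatal_lines` and `is_raft_join_failure` are hypothetical helpers, not part of nats-streaming-server or the operator.

```python
import re

# Matches the log layout shown above:
# [pid] YYYY/MM/DD HH:MM:SS.micro [LEVEL] message
LOG_LINE = re.compile(
    r"^\[(?P<pid>\d+)\] "
    r"(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"\[(?P<level>[A-Z]{3})\] (?P<msg>.*)$"
)


def find_fatal_lines(log_text):
    """Return the message part of every [FTL] line in the log text."""
    fatal = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") == "FTL":
            fatal.append(match.group("msg"))
    return fatal


def is_raft_join_failure(log_text):
    """True if any fatal line reports a failure to join the Raft group."""
    return any(
        "failed to join Raft group" in msg for msg in find_fatal_lines(log_text)
    )


# Sample taken from the log lines quoted in this issue.
sample = """\
[1] 2018/12/05 16:43:14.762387 [INF] STREAM: No recovered state
[1] 2018/12/05 16:43:19.908446 [INF] STREAM: Shutting down.
[1] 2018/12/05 16:43:19.910918 [FTL] STREAM: Failed to start: failed to join Raft group stan
"""

print(find_fatal_lines(sample))
print(is_raft_join_failure(sample))
```

The check is deliberately narrow: it only matches the exact fatal message seen here, so other startup failures would need their own patterns.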


pvanderlinden · 2018-12-05T16:47:51Z

It seems to be the same issue as the one I found here:
nats-io/nats-operator#104
stan-1 still exists, but in a failed state:

time="2018-12-05T16:42:51Z" level=info msg="Missing pods for 'nats-io/stan' cluster (size=2/3), creating 1 pods..."
time="2018-12-05T16:42:56Z" level=info msg="Missing pods for 'nats-io/stan' cluster (size=2/3), creating 1 pods..."
time="2018-12-05T16:43:01Z" level=info msg="Missing pods for 'nats-io/stan' cluster (size=2/3), creating 1 pods..."
time="2018-12-05T16:43:06Z" level=info msg="Missing pods for 'nats-io/stan' cluster (size=2/3), creating 1 pods..."
time="2018-12-05T16:43:06Z" level=info msg="Creating pod 'nats-io/stan-2'"

pvanderlinden · 2018-12-05T16:50:14Z

After deleting the existing, permanently failing pod, the cluster still does not recover: all pods fail to join the Raft group because it doesn't exist yet. The only way to fix this at the moment is to delete the actual cluster request, let it terminate, and then recreate it.

wallyqs · 2018-12-05T17:22:43Z

Yes, currently, if all pods crash while the cluster is running in Raft mode, it will not be able to recover: quorum has been lost, and there is no leader left that could bootstrap the cluster.
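
To make the quorum argument concrete, here is a toy model of the situation. It is a sketch under stated assumptions, not the actual nats-streaming-server or Raft library logic. The `Node` class, the `bootstrap` flag, and `can_form_cluster` are hypothetical names introduced for illustration, and the pod names are only examples. The one solid fact it relies on is that a Raft group of N voting members needs a majority of N // 2 + 1 to elect a leader.

```python
from dataclasses import dataclass


def quorum(cluster_size):
    """Smallest majority of a Raft group with `cluster_size` voting members."""
    return cluster_size // 2 + 1


@dataclass
class Node:
    name: str
    has_raft_state: bool     # False mirrors "No recovered state" in the log
    bootstrap: bool = False  # hypothetical: node is allowed to create a new group


def can_form_cluster(nodes, cluster_size):
    """Toy rule: the group comes up if a majority still holds Raft state,
    or if some node is allowed to bootstrap a brand-new group."""
    with_state = sum(1 for node in nodes if node.has_raft_state)
    if with_state >= quorum(cluster_size):
        return True
    return any(node.bootstrap for node in nodes)


# One pod lost its state: the other two still form a majority of three.
one_crashed = [
    Node("stan-1", has_raft_state=True),
    Node("stan-2", has_raft_state=True),
    Node("stan-3", has_raft_state=False),
]

# Every pod restarted with empty storage and none may bootstrap:
# each one tries to join a group that no longer exists.
all_crashed = [
    Node("stan-1", has_raft_state=False),
    Node("stan-2", has_raft_state=False),
    Node("stan-3", has_raft_state=False),
]

print(quorum(3))
print(can_form_cluster(one_crashed, 3))
print(can_form_cluster(all_crashed, 3))
```

In this model the only way out of the all-crashed case is for some node to be started as a bootstrap node, which is consistent with deleting and recreating the whole cluster being the only workaround reported in this thread.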

pvanderlinden · 2018-12-06T09:37:44Z

So there is no way to fix a cluster once quorum is lost? You have to destroy all data and start from zero?

pvanderlinden · 2018-12-06T11:13:21Z

This is probably made worse by the fact that it's currently not possible to use a PV for the store and/or the Raft store, since you can only create one PVC for all pods, due to #23.

phynias · 2019-01-11T22:26:27Z

I had the same problem, and I had to delete and recreate the whole thing several times before it suddenly worked. Is there something specific I can do to get it working the first time?

popaaaandrei · 2019-04-05T10:42:19Z

Wasn't resilience supposed to be the great benefit of all this? I came to the office this morning and found ALL nats-streaming-1-* pods in CrashLoopBackOff with around 500 restarts; meanwhile, ALL messages have obviously been lost.

Quentin-M · 2019-04-24T19:44:57Z

This is borderline ridiculous.
