Endpoint status and snapshot APIs don't work after quorum loss once the live etcd instances are restarted #18511

zhuchenwang · 2024-08-28T19:39:15Z


Hi team,

We deploy our own etcd cluster as a StatefulSet in a k8s cluster. If quorum is lost, we spin up a recovery job that calls the endpoint status API on the live etcd pods to find the one with the highest index, calls the snapshot API on that pod to take a snapshot, and then restores from that snapshot.
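The selection step of the flow above could be sketched roughly as follows. This is only an illustration under assumptions: it assumes the status is collected as JSON (for example with `etcdctl endpoint status -w json`) and that each entry carries `Endpoint` and `Status.raftIndex` fields; the helper name `pick_highest_index` and the sample endpoints are hypothetical.

```python
import json


def pick_highest_index(status_json: str) -> str:
    """Return the endpoint whose member reports the highest raft index.

    Assumes a JSON list of entries shaped like
    {"Endpoint": "...", "Status": {"raftIndex": N}} (an assumption
    about the status output, not something confirmed in this thread).
    """
    entries = json.loads(status_json)
    if not entries:
        raise ValueError("no endpoint status entries to choose from")
    # A missing raftIndex is treated as 0 so such members sort last.
    best = max(entries, key=lambda e: int(e["Status"].get("raftIndex", 0)))
    return best["Endpoint"]


# Hypothetical sample data for two surviving pods.
sample = json.dumps([
    {"Endpoint": "etcd-0:2379", "Status": {"raftIndex": 1201}},
    {"Endpoint": "etcd-1:2379", "Status": {"raftIndex": 1187}},
])
print(pick_highest_index(sample))
```

The snapshot would then be taken from the returned endpoint.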

This worked well until we found that, when the remaining live etcd pods were restarted after the quorum loss, the endpoint status API was unavailable because the etcd pod wanted to push its local configuration but there was no quorum. The snapshot API didn't work either.

Is this expected behavior? Any suggestions for the recovery process would also be appreciated.

Thanks,

Zhuchen

jmhbnz · 2024-09-15T21:34:04Z

Maintainer

Hi @zhuchenwang - Thanks for your question. Please refer to https://etcd.io/docs/v3.5/op-guide/recovery/#snapshotting-the-keyspace for guidance on etcd disaster recovery.

Specifically:

Snapshotting the keyspace

Recovering a cluster first needs a snapshot of the keyspace from an etcd member. A snapshot may either be taken from a live member with the etcdctl snapshot save command or by copying the member/snap/db file from an etcd data directory.

In situations where the snapshot API is not serving, you can copy the db file directly and use it later to restore the cluster, as covered in the guide.
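As a rough sketch of that fallback, the copy could look like the following. The `member/snap/db` location comes from the guide quoted above; the helper name `copy_db_file` and the example paths are hypothetical.

```python
import shutil
from pathlib import Path


def copy_db_file(data_dir: str, dest: str) -> Path:
    """Copy member/snap/db out of an etcd data directory.

    data_dir and dest are caller-supplied paths (hypothetical here).
    """
    src = Path(data_dir) / "member" / "snap" / "db"
    if not src.is_file():
        raise FileNotFoundError(f"no db file at {src}")
    dest_path = Path(dest)
    # Create the destination directory if it does not exist yet.
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest_path)
    return dest_path
```

For example, `copy_db_file("/var/lib/etcd", "/backup/etcd/db")` would copy the file into a backup location, assuming those paths match the deployment.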

2 replies

zhuchenwang Sep 16, 2024
Author

Hi @jmhbnz - Thanks for the explanation. Yes, we restore from the db file as a fallback. However, with the V2 API, the etcdctl command restored the cluster from the db file directly, whereas the V3 API uses a gRPC call. Given the limitation of the gRPC call, is there a plan to add a command to etcdctl for the V3 API that restores the cluster from the db file as well?

jmhbnz Sep 16, 2024
Maintainer

As detailed in https://etcd.io/docs/v3.5/op-guide/recovery/#restoring-from-snapshot, restoring the db from a snapshot is performed with etcdutl.
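A minimal sketch of how a recovery job might assemble that command is shown below. It only builds the argument list and does not run anything. The helper name `restore_command` and the paths are hypothetical, and cluster flags such as `--name` and `--initial-cluster` are left out for brevity. The linked guide notes that a db file copied from a data directory has no integrity hash, so restoring from it needs `--skip-hash-check`.

```python
import shlex


def restore_command(db_path: str, data_dir: str, from_copied_db: bool = False) -> list[str]:
    """Build an `etcdutl snapshot restore` argument list.

    from_copied_db=True adds --skip-hash-check, which the recovery guide
    requires for a db file copied from a data directory.
    """
    cmd = ["etcdutl", "snapshot", "restore", db_path, "--data-dir", data_dir]
    if from_copied_db:
        cmd.append("--skip-hash-check")
    return cmd


# Hypothetical paths; print the command instead of executing it.
print(shlex.join(restore_command("/backup/etcd/db", "/var/lib/etcd-restored", from_copied_db=True)))
```

The resulting list can be passed to `subprocess.run` once the remaining cluster flags for the deployment have been added.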
