Leader Node hung after Follower Defragmentation. #16694

krishna281803special · 2023-10-05T12:02:55Z

Hi All,

I have a 5-node etcd cluster (version 3.5.6; one node runs 3.5.7) on strong hardware (32 CPUs and 250 GB RAM per node). More than 1000 clients/agents connect to etcd for reads and writes.

Issues:
I am using the Patroni auto-failover tool, which uses etcd as its datastore.
The Patroni agents/clients interact with the etcd hosts.

As we understand it, the agents open read/write connections to etcd. If a follower node is unavailable, an agent immediately tries another node and keeps retrying until it gets a connection.
Sometimes the leader etcd node hangs completely and does not appear in the output of "etcdctl endpoint status".
a) If the leader hangs for any reason (we sometimes see a "network overloaded" message), leadership does not move and the cluster becomes unstable.
b) When I stop the leader node manually with systemctl, leadership moves to another node and the cluster returns to a normal state.

The logs only show connection errors to the leader node; there is no information about why it hung or what led to the hang.
We would like to know how to debug a hung leader node. (We have enabled debug mode.)
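
One way to narrow this down is to compare the endpoints that should answer with the ones that actually do, and to check which leader the reachable members still believe in. The sketch below is only an illustration: `summarize_status` is a hypothetical helper, and the JSON shape it reads (a list of objects with `Endpoint`, `Status.header.member_id` and `Status.leader`) is an assumption about what `etcdctl endpoint status -w json` prints on 3.5, so it should be checked against real output first.

```python
import json
from typing import Dict, List


def summarize_status(expected_endpoints: List[str], status_json: str) -> Dict[str, list]:
    """Compare the endpoints we expect with the ones that answered (hypothetical helper)."""
    entries = json.loads(status_json)
    # Assumed shape, to be verified against real etcdctl output:
    # [{"Endpoint": url, "Status": {"header": {"member_id": n}, "leader": n}}, ...]
    seen = {entry["Endpoint"]: entry["Status"] for entry in entries}

    # Endpoints that were queried but returned no status at all.
    missing = [ep for ep in expected_endpoints if ep not in seen]

    # Endpoints whose own member ID equals the leader ID they report.
    leader_endpoints = [
        ep for ep, st in seen.items() if st["header"]["member_id"] == st["leader"]
    ]

    # Leader IDs as believed by the members that did answer.
    believed_leader_ids = sorted({st["leader"] for st in seen.values()})

    return {
        "missing": missing,
        "leader_endpoints": leader_endpoints,
        "believed_leader_ids": believed_leader_ids,
    }
```

If `missing` contains one endpoint, `leader_endpoints` is empty, and `believed_leader_ids` holds a single ID, the reachable members still consider the silent node their leader, which would match symptom (a). The input would come from running `etcdctl endpoint status -w json` against all five endpoints and passing its standard output as `status_json`.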

When this issue occurred, we were running defragmentation on a rolling basis:

1. Run defragmentation on all follower nodes.
2. Check connectivity to all nodes.
3. Make one of the followers the leader.
4. Run defragmentation on the old leader.

The leader node was unreachable during the second step (the connectivity check).
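
For reference, the rolling procedure above can be written down as an ordered list of etcdctl invocations. This is a minimal sketch, not our actual script: the `Member` type and `plan_rolling_defrag` are hypothetical names, the first follower is picked as the new leader arbitrarily, and the function only builds argument lists (`defrag`, `endpoint health`, `move-leader`) without executing anything.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Member:
    endpoint: str    # client URL, e.g. "http://n1:2379"
    member_id: str   # hex member ID as shown by "etcdctl member list"
    is_leader: bool


def plan_rolling_defrag(members: List[Member]) -> List[List[str]]:
    """Return etcdctl argument lists in the order the rolling defrag runs them."""
    leaders = [m for m in members if m.is_leader]
    if len(leaders) != 1:
        raise ValueError(f"expected exactly one leader, found {len(leaders)}")
    leader = leaders[0]
    followers = [m for m in members if not m.is_leader]
    if not followers:
        raise ValueError("need at least one follower to take over leadership")

    all_endpoints = ",".join(m.endpoint for m in members)
    steps: List[List[str]] = []

    # Step 1: defragment every follower, one endpoint per invocation.
    for follower in followers:
        steps.append(["defrag", f"--endpoints={follower.endpoint}"])

    # Step 2: check connectivity to all nodes.
    steps.append(["endpoint", "health", f"--endpoints={all_endpoints}"])

    # Step 3: hand leadership to a follower; the request goes to the current leader.
    steps.append(["move-leader", followers[0].member_id, f"--endpoints={leader.endpoint}"])

    # Step 4: defragment the old leader, which is now a follower.
    steps.append(["defrag", f"--endpoints={leader.endpoint}"])
    return steps
```

Each returned list could be passed to `subprocess.run(["etcdctl", *step], check=True)`. Stopping at the first failing step would prevent the script from defragmenting further nodes while one member is already unreachable.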

Thank you very much in advance.
