Leader Node hung after Follower Defragmentation. #16694
krishna281803special started this conversation in General
Hi All,
I have a 5-node etcd cluster (version 3.5.6, with one node on 3.5.7) running on very good hardware (32 CPUs and 250 GB RAM per node). More than 1000 clients/agents connect to etcd for reads and writes.
Issues:
I am using Patroni, an automatic failover tool that uses etcd as its datastore.
The Patroni agents/clients interact with the etcd hosts.
As we understand it, the agents open read/write connections to etcd. If a follower node is unavailable, the agent immediately tries another node and keeps retrying until it gets a connection.
Sometimes the etcd leader node hangs completely and does not appear in the output of "etcdctl endpoint status".
a) If the leader hangs for any reason (we sometimes see a "network overloaded" message), leadership does not move to another node and the cluster becomes unstable.
b) When I stop the leader node manually with systemctl, leadership moves to another node and the cluster returns to a normal state.
The logs only show connection errors for the leader node. There is no information about why it hung or what led to the hang.
We would like to know how to debug this hang on the etcd leader node. (We have enabled debug mode.)
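To record which member is the leader and which endpoints are missing at the moment of a hang, the JSON output of "etcdctl endpoint status -w json" can be parsed. The following is only a minimal sketch: the field names (Endpoint, Status.leader, Status.header.member_id) are assumed to match what etcdctl 3.5 prints, the endpoint names used with it are hypothetical, and it assumes that an unreachable endpoint is simply absent from the JSON list.

```python
import json


def summarize_status(status_json, expected_endpoints):
    """Return (leader_endpoints, missing_endpoints).

    status_json: text captured from `etcdctl endpoint status -w json`
                 (field names below are assumed from etcd 3.5 output).
    expected_endpoints: every endpoint that was passed to etcdctl.
    """
    entries = json.loads(status_json)
    seen = set()
    leaders = []
    for entry in entries:
        endpoint = entry["Endpoint"]
        status = entry["Status"]
        seen.add(endpoint)
        # A member is the leader when the leader ID it reports
        # equals its own member ID from the response header.
        if status["leader"] == status["header"]["member_id"]:
            leaders.append(endpoint)
    # Endpoints that were queried but did not answer.
    missing = [ep for ep in expected_endpoints if ep not in seen]
    return leaders, missing
```

Running this periodically and logging the result with a timestamp would show whether the leader endpoint drops out of the list before or after the other members start reporting connection errors.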
When this issue occurred, we were running defragmentation on a rolling basis, one node at a time.
The leader node became unreachable during the second step.
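For context, a rolling defragmentation is usually ordered so that followers are handled first and the leader last, with a health check after each node. The sketch below only builds the command list and does not run anything. The endpoint names passed to it are hypothetical, the ordering is a common practice rather than necessarily what was run here, and it uses only the "defrag" and "endpoint health" subcommands of etcdctl.

```python
def plan_rolling_defrag(endpoints, leader):
    """Build etcdctl commands for a rolling defrag, followers first.

    endpoints: list of all member endpoints (hypothetical names).
    leader: the endpoint currently acting as leader.
    Returns a list of argument lists; nothing is executed here.
    """
    followers = [ep for ep in endpoints if ep != leader]
    # The leader goes last so that it is never defragmented while a
    # follower is still busy with its own defragmentation.
    ordered = followers + ([leader] if leader in endpoints else [])
    steps = []
    for ep in ordered:
        steps.append(["etcdctl", f"--endpoints={ep}", "defrag"])
        # Confirm the member answers again before touching the next one.
        steps.append(["etcdctl", f"--endpoints={ep}", "endpoint", "health"])
    return steps
```

Each returned argument list could be passed to subprocess.run one at a time, stopping at the first health check that fails.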
Thank you very much in advance.