HDDS-11380. Fixing the error message of node decommission to be more comprehensive #7155
Conversation
I believe that if we only report the IN-SERVICE nodes, the user is still confused about the state of the other nodes. For example: 3 IN-SERVICE out of 9 overall. Then what about the other 6 nodes, why were they not selected? That leads to misinterpretation by the user.
Providing information about the state of the other nodes that were not considered will help rule out this confusion.
Linking our discussion from the Jira here for visibility.
@siddhantsangwan @krishnaasawa1 Please review now. I have modified the error message based on the review comments and the discussion on the Jira.
@VarshaRaviCV thanks for working on this. I have added some comments for decommissioning; the same applies to maintenance as well.
What is the expected behavior when the maximum number of nodes are already in decommissioning or maintenance and we try to decommission or put more nodes into maintenance? Currently the error message only accounts for in-service or unhealthy nodes, but not nodes that are already out of service. Also, what is the expected message if we try to decommission or put into maintenance a node that is already in that mode? Ideally we could add tests for these messages as well. We don't need a full string match, but checking that the message contains things like "3 Unhealthy", "2 Healthy", etc. for various scenarios would be good.
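For illustration, a minimal sketch of what such an assertion could look like; the test class, the setup helper, and the thrown message text below are hypothetical and not taken from the existing TestNodeDecommissionManager tests:

```java
// Hypothetical sketch only: verify the error message reports per-state node counts
// without requiring a full string match. Names and the thrown message are illustrative.
import static org.junit.jupiter.api.Assertions.assertThrows;
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

class InsufficientNodesMessageSketch {

  @Test
  void decommissionErrorMessageListsNodeCounts() {
    Exception ex = assertThrows(Exception.class, this::decommissionTooManyNodes);

    String msg = ex.getMessage();
    // Assert on the counts per node state rather than on the exact wording.
    assertTrue(msg.contains("3 IN-SERVICE HEALTHY"));
    assertTrue(msg.contains("2 not IN-SERVICE or not HEALTHY"));
  }

  // Stand-in for the real cluster setup plus NodeDecommissionManager call.
  private void decommissionTooManyNodes() throws Exception {
    throw new Exception("Insufficient nodes. Tried to decommission 2 nodes out of "
        + "3 IN-SERVICE HEALTHY and 2 not IN-SERVICE or not HEALTHY nodes.");
  }
}
```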
There are similar messages for when we try to put a node into maintenance in checkIfMaintenancePossible(). We can update them too (in this PR or in another one, either is fine).
" nodes of which " + numMaintenance + " nodes were valid. Cluster has " + inServiceTotal + | ||
" IN-SERVICE nodes, " + minInService + " of which are required for minimum replication. "; | ||
" nodes out of " + inServiceTotal + " IN-SERVICE HEALTHY and " + unHealthyTotal + | ||
" UNHEALTHY nodes. Cannot enter maintenance mode as a minimum of " + minInService + |
Same suggestions as above
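For context, a rough sketch of how the updated maintenance message reads once the fragments in the diff above are assembled. The counter names come from the diff; the leading prefix is an assumption based on the CLI output shown later in this PR, and the wording may not match the final revision exactly:

```java
// Illustrative only: builds the maintenance-mode error text from example counts.
public class MaintenanceMessageSketch {
  public static void main(String[] args) {
    int numMaintenance = 3;   // nodes requested for maintenance
    int inServiceTotal = 3;   // IN-SERVICE HEALTHY nodes in the cluster
    int unHealthyTotal = 2;   // nodes counted as unhealthy
    int minInService = 2;     // minimum IN-SERVICE HEALTHY nodes required afterwards

    String msg = "Insufficient nodes. Tried to start maintenance for " + numMaintenance
        + " nodes out of " + inServiceTotal + " IN-SERVICE HEALTHY and " + unHealthyTotal
        + " UNHEALTHY nodes. Cannot enter maintenance mode as a minimum of " + minInService
        + " IN-SERVICE HEALTHY nodes are required to maintain replication after maintenance.";
    System.out.println(msg);
  }
}
```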
@VarshaRaviCV can we please see the latest error message that's shown on the CLI after your updates?
@siddhantsangwan Here are a few of the error messages for different scenarios as per the latest changes:
sh-4.4$ ozone admin datanode decommission ozone-datanode-1 ozone-datanode-5 ozone-datanode-3
sh-4.4$ ozone admin datanode decommission ozone-datanode-4 ozone-datanode-5
sh-4.4$ ozone admin datanode maintenance ozone-datanode-2 ozone-datanode-4 ozone-datanode-5
LGTM!
LGTM. @VarshaRaviCV thanks for improving this! Pending green CI.
From this comment, can you share the output for decommissioning a node that is already decommissioning or in maintenance, and for decommissioning a nonexistent node? We should also not merge this without unit tests on the numbers in the error message. From what I see, the current tests in
I'm expecting these messages to show up if the datanode is in maintenance or missing.
@errose28 If the node is in the decommissioning state or is already decommissioned, the logs will show the error message below.
The CLI however will not show any error.
I will add validation of the node count in the error message to the existing tests for insufficient nodes.
…ests under TestNodeDecommissionManager
@siddhantsangwan @Tejaskriya @errose28 I have added assertions for the node count in the error message in the existing tests of TestNodeDecommissionManager.java. Please check once.
Thanks @VarshaRaviCV for the patch.
@@ -398,10 +398,12 @@ private synchronized boolean checkIfDecommissionPossible(List<DatanodeDetails> d
    if (opState != NodeOperationalState.IN_SERVICE) {
      numDecom--;
      validDns.remove(dn);
      LOG.warn("Cannot decommission {} because it is not IN-SERVICE", dn.getHostName());
I think this message (and the similar one for maintenance) could be improved by including the actual opState the node is in.
Also, please consider replacing dn.getHostName() with dn to rely on:
ozone/hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/protocol/DatanodeDetails.java
Lines 561 to 563 in f784a84
public String toString() {
  return uuidString + "(" + hostName + "/" + ipAddress + ")";
}
This shows just a bit more info to make the node uniquely identifiable while keeping the hostname for convenience. (Applies to all similar logs.)
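A rough sketch of what that suggestion could look like when applied to the decommission branch shown in the diff above. This is not the actual follow-up change; it reuses the variables from the diff fragment (opState, dn, validDns, numDecom, LOG) and is a method fragment, not a standalone class:

```java
// Sketch of the suggested tweak: log the node's actual operational state and pass
// the DatanodeDetails object itself so its toString() (uuid(hostname/ip)) is used.
if (opState != NodeOperationalState.IN_SERVICE) {
  numDecom--;
  validDns.remove(dn);
  LOG.warn("Cannot decommission {} because it is not IN-SERVICE; its current state is {}",
      dn, opState);
}
```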
@VarshaRaviCV do you plan to update the patch based on my suggestion, or should we do it in a follow-up?
@adoroszlai I will pick up the suggestion in a follow-up PR since this PR already has approvals. |
Thanks @VarshaRaviCV for the patch, @errose28, @krishnaasawa1, @siddhantsangwan, @Tejaskriya for the review.
What changes were proposed in this pull request?
HDDS-11380. Fixing the error message of node decommission to be more comprehensive
Please describe your PR in detail:
Right now, when decommissioning fails quickly due to insufficient nodes, the error message says:
This does not clearly explain how many nodes are required for the decommission to proceed. So in this PR we change that message to clearly state how many more IN-SERVICE nodes are needed for decommissioning to proceed.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-11380
How was this patch tested?
Ran the existing tests in TestNodeDecommissionManager.java
Also tested locally in Docker:
sh-4.4$ ozone admin datanode decommission ozone-datanode-4 ozone-datanode-5
Started decommissioning datanode(s):
ozone-datanode-4
ozone-datanode-5
Error: AllHosts: Insufficient nodes. Tried to decommission 2 nodes out of 3 IN-SERVICE HEALTHY and 2 not IN-SERVICE or not HEALTHY nodes. Cannot decommission as a minimum of 3 IN-SERVICE HEALTHY nodes are required to maintain replication after decommission.
Some nodes could not enter the decommission workflow
sh-4.4$ ozone admin datanode maintenance ozone-datanode-2 ozone-datanode-4 ozone-datanode-5
Entering maintenance mode on datanode(s):
ozone-datanode-2
ozone-datanode-4
ozone-datanode-5
Error: AllHosts: Insufficient nodes. Tried to start maintenance for 3 nodes out of 3 IN-SERVICE HEALTHY and 2 not IN-SERVICE or not HEALTHY nodes. Cannot enter maintenance mode as a minimum of 2 IN-SERVICE HEALTHY nodes are required to maintain replication after maintenance.
Some nodes could not start the maintenance workflow