
HDDS-11380. Fixing the error message of node decommission to be more comprehensive #7155

Merged: 5 commits into apache:master on Oct 29, 2024

Conversation

@VarshaRaviCV (Contributor) commented Sep 4, 2024:

What changes were proposed in this pull request?

HDDS-11380. Fixing the error message of node decommission to be more comprehensive

Please describe your PR in detail:
Right now, when decommissioning fails quickly due to insufficient nodes, the error message says:

Insufficient nodes. Tried to decommission 1 nodes of which 1 nodes were valid. Cluster has 3 IN-SERVICE nodes, 3 of which are required for minimum replication.

This does not clearly explain how many nodes are required for the decommission to proceed. In this PR we change the above message to state clearly how many IN-SERVICE nodes are needed for decommissioning to proceed:

Insufficient nodes. Tried to decommission 2 nodes out of 4 IN-SERVICE HEALTHY and 1 UNHEALTHY nodes. Cannot decommission as a minimum of 3 IN-SERVICE HEALTHY nodes are required to maintain replication after decommission.
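For illustration, a minimal sketch of how such a message might be assembled (the variable names numDecom, inServiceTotal, unHealthyTotal and minInService are taken from snippets quoted later in this conversation and may not match the patch exactly; the exact wording was refined further during review):

    // Sketch only (illustrative values): build the improved decommission error
    // text from the node counts that the decommission check already tracks.
    int numDecom = 2, inServiceTotal = 4, unHealthyTotal = 1, minInService = 3;
    String insufficientNodesMsg =
        "Insufficient nodes. Tried to decommission " + numDecom + " nodes out of " +
        inServiceTotal + " IN-SERVICE HEALTHY and " + unHealthyTotal + " UNHEALTHY nodes. " +
        "Cannot decommission as a minimum of " + minInService + " IN-SERVICE HEALTHY " +
        "nodes are required to maintain replication after decommission.";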

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-11380

How was this patch tested?

Ran the existing tests in TestNodeDecommissionManager.java.
Also tested locally in Docker:

sh-4.4$ ozone admin datanode decommission ozone-datanode-4 ozone-datanode-5
Started decommissioning datanode(s):
ozone-datanode-4
ozone-datanode-5
Error: AllHosts: Insufficient nodes. Tried to decommission 2 nodes out of 3 IN-SERVICE HEALTHY and 2 not IN-SERVICE or not HEALTHY nodes. Cannot decommission as a minimum of 3 IN-SERVICE HEALTHY nodes are required to maintain replication after decommission.
Some nodes could not enter the decommission workflow

sh-4.4$ ozone admin datanode maintenance ozone-datanode-2 ozone-datanode-4 ozone-datanode-5
Entering maintenance mode on datanode(s):
ozone-datanode-2
ozone-datanode-4
ozone-datanode-5
Error: AllHosts: Insufficient nodes. Tried to start maintenance for 3 nodes out of 3 IN-SERVICE HEALTHY and 2 not IN-SERVICE or not HEALTHY nodes. Cannot enter maintenance mode as a minimum of 2 IN-SERVICE HEALTHY nodes are required to maintain replication after maintenance.
Some nodes could not start the maintenance workflow

@krishnaasawa1 left a comment:

I believe that if we only report the IN-SERVICE nodes, the user may still be confused about the state of the other nodes. For example, with 3 IN-SERVICE nodes out of 9 overall, the user is left wondering why the other 6 were not selected, which can lead to misinterpretation.
Including the state of the nodes that were not considered would help rule out this confusion.

@siddhantsangwan (Contributor) commented:
Linking our discussion from the jira here for visibility.

@VarshaRaviCV (Contributor, Author) commented:
@siddhantsangwan @krishnaasawa1 Please review now. I have modified the error message based on review comments and discussions on the jira.

@siddhantsangwan (Contributor) left a comment:

@VarshaRaviCV thanks for working on this. I have added some comments for decommissioning; the same apply to maintenance as well.

@errose28 (Contributor) commented Sep 5, 2024:

What is the expected behavior when the maximum number of nodes is already decommissioning or in maintenance and we try to decommission or put more nodes into maintenance? Currently the error message only accounts for IN-SERVICE or unhealthy nodes, but not for nodes that are already out of service. Also, what is the expected message if we try to decommission or put into maintenance a node that is already in that mode?

Ideally we could add tests for these messages as well. We don't need a full string match, but checking that the message has things like "3 Unhealthy", "2 Healthy", etc. for various scenarios would be good.
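A rough sketch of what such a check could look like (illustrative only; the helper name, the way the message is obtained, and the expected counts are assumptions rather than the assertions actually added in this PR):

    // Illustrative only: assert on the counts embedded in the message rather
    // than on the full sentence, so small wording tweaks don't break the test.
    // Requires: import static org.junit.jupiter.api.Assertions.assertTrue;
    private static void assertCountsInMessage(String errorMessage) {
      assertTrue(errorMessage.contains("3 IN-SERVICE HEALTHY"),
          "message should report the IN-SERVICE HEALTHY node count");
      assertTrue(errorMessage.contains("2 not IN-SERVICE or not HEALTHY"),
          "message should report how many nodes are not usable");
    }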

@Tejaskriya (Contributor) left a comment:

There are similar messages for when we try to put a node into maintenance in checkIfMaintenancePossible(). We can update them too (in this PR or in another one, either is fine):

" nodes of which " + numMaintenance + " nodes were valid. Cluster has " + inServiceTotal +
" IN-SERVICE nodes, " + minInService + " of which are required for minimum replication. ";
" nodes out of " + inServiceTotal + " IN-SERVICE HEALTHY and " + unHealthyTotal +
" UNHEALTHY nodes. Cannot enter maintenance mode as a minimum of " + minInService +
Another inline review comment:

Same suggestions as above

@siddhantsangwan (Contributor) commented:
@VarshaRaviCV can we please see the latest error message that's shown on the CLI after your updates?

@VarshaRaviCV (Contributor, Author) commented Sep 10, 2024:

@siddhantsangwan Here are a few of the error messages for different scenarios as per the latest changes:

sh-4.4$ ozone admin datanode decommission ozone-datanode-1 ozone-datanode-5 ozone-datanode-3
Started decommissioning datanode(s):
ozone-datanode-1
ozone-datanode-5
ozone-datanode-3
Error: AllHosts: Insufficient nodes. Tried to decommission 3 nodes out of 5 IN-SERVICE HEALTHY and 0 not IN-SERVICE or not HEALTHY nodes. Cannot decommission as a minimum of 3 IN-SERVICE HEALTHY nodes are required to maintain replication after decommission.
Some nodes could not enter the decommission workflow

sh-4.4$ ozone admin datanode decommission ozone-datanode-4 ozone-datanode-5
Started decommissioning datanode(s):
ozone-datanode-4
ozone-datanode-5
Error: AllHosts: Insufficient nodes. Tried to decommission 2 nodes out of 3 IN-SERVICE HEALTHY and 2 not IN-SERVICE or not HEALTHY nodes. Cannot decommission as a minimum of 3 IN-SERVICE HEALTHY nodes are required to maintain replication after decommission.
Some nodes could not enter the decommission workflow

sh-4.4$ ozone admin datanode maintenance ozone-datanode-2 ozone-datanode-4 ozone-datanode-5
Entering maintenance mode on datanode(s):
ozone-datanode-2
ozone-datanode-4
ozone-datanode-5
Error: AllHosts: Insufficient nodes. Tried to start maintenance for 3 nodes out of 3 IN-SERVICE HEALTHY and 2 not IN-SERVICE or not HEALTHY nodes. Cannot enter maintenance mode as a minimum of 2 IN-SERVICE HEALTHY nodes are required to maintain replication after maintenance.
Some nodes could not start the maintenance workflow

@Tejaskriya (Contributor) left a comment:

LGTM!

@siddhantsangwan (Contributor) left a comment:

LGTM. @VarshaRaviCV thanks for improving this! Pending green CI.

@errose28 (Contributor) commented:
From this comment, can you share the output for decommissioning a node that is already decommissioning or in maintenance, and for decommissioning a nonexistent node?

We should also not merge this without unit tests on the numbers in the error message. From what I see, the current tests in TestNodeDecommissionManager only cover success/failure of different scenarios.

@siddhantsangwan (Contributor) commented:
I'm expecting these messages to show up if the datanode is in maintenance or missing:

      } catch (NodeNotFoundException e) {
        // We already validated the host strings and retrieved the DnDetails
        // object from the node manager. Therefore we should never get a
        // NodeNotFoundException here except if the node is removed in the
        // very short window between validation and starting decom. Therefore
        // log a warning and ignore the exception.
        LOG.warn("The host {} was not found in SCM. Ignoring the request to " +
            "decommission it", dn.getHostName());
        errors.add(new DatanodeAdminError(dn.getHostName(),
            "The host was not found in SCM"));
      } catch (InvalidNodeStateException e) {
        errors.add(new DatanodeAdminError(dn.getHostName(), e.getMessage()));
      }
    }
    return errors;
  }

@VarshaRaviCV (Contributor, Author) commented:
@errose28 If the node is in the decommissioning state or is already decommissioned, the logs will show the messages below.

2024-09-16 16:17:39 2024-09-16 10:47:39,592 [IPC Server handler 6 on default port 9860] INFO node.NodeDecommissionManager: Force flag = false. Checking if decommission is possible for dns: [7b15ac8b-b11d-4235-a5cd-494f649b0553(ozone-datanode-1.ozone_default/172.18.0.8)]
2024-09-16 16:17:39 2024-09-16 10:47:39,592 [IPC Server handler 6 on default port 9860] WARN node.NodeDecommissionManager: Cannot decommission ozone-datanode-1.ozone_default because it is not IN-SERVICE
2024-09-16 16:17:39 2024-09-16 10:47:39,592 [IPC Server handler 6 on default port 9860] INFO node.NodeDecommissionManager: Start Decommission called on node 7b15ac8b-b11d-4235-a5cd-494f649b0553(ozone-datanode-1.ozone_default/172.18.0.8) in state DECOMMISSIONING. Nothing to do.

2024-09-16 16:18:19 2024-09-16 10:48:19,993 [IPC Server handler 9 on default port 9860] INFO node.NodeDecommissionManager: Force flag = false. Checking if decommission is possible for dns: [d067994d-ee5f-4735-912d-5a7b93121ac0(ozone-datanode-3.ozone_default/172.18.0.9)]
2024-09-16 16:18:19 2024-09-16 10:48:19,993 [IPC Server handler 9 on default port 9860] WARN node.NodeDecommissionManager: Cannot decommission ozone-datanode-3.ozone_default because it is not IN-SERVICE
2024-09-16 16:18:19 2024-09-16 10:48:19,993 [IPC Server handler 9 on default port 9860] INFO node.NodeDecommissionManager: Start Decommission called on node d067994d-ee5f-4735-912d-5a7b93121ac0(ozone-datanode-3.ozone_default/172.18.0.9) in state DECOMMISSIONED. Nothing to do.

The CLI, however, will not show any error:

sh-4.4$ ozone admin datanode decommission ozone-datanode-1
Started decommissioning datanode(s):
ozone-datanode-1

sh-4.4$ ozone admin datanode decommission ozone-datanode-3
Started decommissioning datanode(s):
ozone-datanode-3

I will add validation of the node counts in the error message to the existing tests for insufficient nodes.

@VarshaRaviCV (Contributor, Author) commented:
@siddhantsangwan @Tejaskriya @errose28 I have added assertions for the node counts in the error message to the existing tests in TestNodeDecommissionManager.java. Please take a look.

@adoroszlai (Contributor) left a comment:

Thanks @VarshaRaviCV for the patch.

@@ -398,10 +398,12 @@ private synchronized boolean checkIfDecommissionPossible(List<DatanodeDetails> d
    if (opState != NodeOperationalState.IN_SERVICE) {
      numDecom--;
      validDns.remove(dn);
      LOG.warn("Cannot decommission {} because it is not IN-SERVICE", dn.getHostName());

I think this message (and the similar one for maintenance) could be improved by including the actual opState the node is in.

Also, please consider replacing dn.getHostName() with dn to rely on:

    public String toString() {
      return uuidString + "(" + hostName + "/" + ipAddress + ")";
    }

This shows just a bit more info to make the node uniquely identifiable while keeping the hostname for convenience. (Applies to all similar logs.)
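For example, the follow-up might end up with something roughly like the following (a sketch only, reusing the opState and dn variables visible in the diff above; the exact wording is left to the follow-up PR):

    // Sketch: include the node's actual operational state and pass dn itself so
    // DatanodeDetails#toString() prints uuid, hostname and IP for the node.
    LOG.warn("Cannot decommission {} because it is in state {}, not IN-SERVICE",
        dn, opState);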

@adoroszlai (Contributor) commented:
@VarshaRaviCV do you plan to update the patch based on my suggestion, or should we do it in a follow-up?

@VarshaRaviCV (Contributor, Author) commented:
@adoroszlai I will pick up the suggestion in a follow-up PR since this PR already has approvals.

@adoroszlai merged commit 980b960 into apache:master on Oct 29, 2024 (39 checks passed).
@adoroszlai (Contributor) commented:
Thanks @VarshaRaviCV for the patch, @errose28, @krishnaasawa1, @siddhantsangwan, @Tejaskriya for the review.
