
HDDS-11380. Fixing the error message of node decommission to be more comprehensive #7155

Merged: 5 commits into apache:master on Oct 29, 2024

Conversation

@VarshaRaviCV (Contributor) commented Sep 4, 2024:

What changes were proposed in this pull request?

HDDS-11380. Fixing the error message of node decommission to be more comprehensive

Please describe your PR in detail:
Right now, when decommissioning fails quickly due to insufficient nodes, the error message says:

Insufficient nodes. Tried to decommission 1 nodes of which 1 nodes were valid. Cluster has 3 IN-SERVICE nodes, 3 of which are required for minimum replication.

This does not clearly explain how many nodes are required for the decommission to proceed. In this PR we change the above message to state clearly how many IN-SERVICE nodes are needed for decommissioning to proceed:

Insufficient nodes. Tried to decommission 2 nodes out of 4 IN-SERVICE HEALTHY and 1 UNHEALTHY nodes. Cannot decommission as a minimum of 3 IN-SERVICE HEALTHY nodes are required to maintain replication after decommission.
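For illustration, a minimal sketch of how such a message might be assembled (the variable names numDecom, inServiceTotal, unHealthyTotal and minInService are taken from snippets quoted later in this conversation and may not match the patch exactly; the exact wording was refined further during review):

    // Sketch only (illustrative values): build the improved decommission error
    // text from the node counts that the decommission check already tracks.
    int numDecom = 2, inServiceTotal = 4, unHealthyTotal = 1, minInService = 3;
    String insufficientNodesMsg =
        "Insufficient nodes. Tried to decommission " + numDecom + " nodes out of " +
        inServiceTotal + " IN-SERVICE HEALTHY and " + unHealthyTotal + " UNHEALTHY nodes. " +
        "Cannot decommission as a minimum of " + minInService + " IN-SERVICE HEALTHY " +
        "nodes are required to maintain replication after decommission.";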

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-11380

How was this patch tested?

Ran the existing tests in TestNodeDecommissionManager.java.
Also tested locally in Docker:

sh-4.4$ ozone admin datanode decommission ozone-datanode-4 ozone-datanode-5
Started decommissioning datanode(s):
ozone-datanode-4
ozone-datanode-5
Error: AllHosts: Insufficient nodes. Tried to decommission 2 nodes out of 3 IN-SERVICE HEALTHY and 2 not IN-SERVICE or not HEALTHY nodes. Cannot decommission as a minimum of 3 IN-SERVICE HEALTHY nodes are required to maintain replication after decommission.
Some nodes could not enter the decommission workflow

sh-4.4$ ozone admin datanode maintenance ozone-datanode-2 ozone-datanode-4 ozone-datanode-5
Entering maintenance mode on datanode(s):
ozone-datanode-2
ozone-datanode-4
ozone-datanode-5
Error: AllHosts: Insufficient nodes. Tried to start maintenance for 3 nodes out of 3 IN-SERVICE HEALTHY and 2 not IN-SERVICE or not HEALTHY nodes. Cannot enter maintenance mode as a minimum of 2 IN-SERVICE HEALTHY nodes are required to maintain replication after maintenance.
Some nodes could not start the maintenance workflow

@krishnaasawa1 left a comment:

I believe that if we only report the IN-SERVICE nodes, the user may still be confused about the state of the other nodes. For example, with 3 IN-SERVICE nodes out of 9 overall, the user is left wondering why the other 6 were not selected, which can lead to misinterpretation.
Including the state of the nodes that were not considered would help rule out this confusion.

@siddhantsangwan (Contributor) commented:
Linking our discussion from the jira here for visibility.

@VarshaRaviCV (Contributor, Author) commented:
@siddhantsangwan @krishnaasawa1 Please review now. I have modified the error message based on review comments and discussions on the jira.

@siddhantsangwan (Contributor) left a comment:

@VarshaRaviCV thanks for working on this. I have added some comments for decommissioning; the same apply to maintenance as well.

@errose28 (Contributor) commented Sep 5, 2024:

What is the expected behavior when the maximum number of nodes is already decommissioning or in maintenance and we try to decommission or put more nodes into maintenance? Currently the error message only accounts for IN-SERVICE or unhealthy nodes, but not for nodes that are already out of service. Also, what is the expected message if we try to decommission or put into maintenance a node that is already in that mode?

Ideally we could add tests for these messages as well. We don't need a full string match, but checking that the message has things like "3 Unhealthy", "2 Healthy", etc. for various scenarios would be good.
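A rough sketch of what such a check could look like (illustrative only; the helper name, the way the message is obtained, and the expected counts are assumptions rather than the assertions actually added in this PR):

    // Illustrative only: assert on the counts embedded in the message rather
    // than on the full sentence, so small wording tweaks don't break the test.
    // Requires: import static org.junit.jupiter.api.Assertions.assertTrue;
    private static void assertCountsInMessage(String errorMessage) {
      assertTrue(errorMessage.contains("3 IN-SERVICE HEALTHY"),
          "message should report the IN-SERVICE HEALTHY node count");
      assertTrue(errorMessage.contains("2 not IN-SERVICE or not HEALTHY"),
          "message should report how many nodes are not usable");
    }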

@Tejaskriya (Contributor) left a comment:

There are similar messages for when we try to put a node into maintenance in checkIfMaintenancePossible(). We can update them too (in this PR or in another one, either is fine):

" nodes of which " + numMaintenance + " nodes were valid. Cluster has " + inServiceTotal +
" IN-SERVICE nodes, " + minInService + " of which are required for minimum replication. ";
" nodes out of " + inServiceTotal + " IN-SERVICE HEALTHY and " + unHealthyTotal +
" UNHEALTHY nodes. Cannot enter maintenance mode as a minimum of " + minInService +
Another inline review comment:

Same suggestions as above

@siddhantsangwan (Contributor) commented:
@VarshaRaviCV can we please see the latest error message that's shown on the CLI after your updates?

@VarshaRaviCV (Contributor, Author) commented Sep 10, 2024:

@siddhantsangwan Here are a few of the error messages for different scenarios as per the latest changes:

sh-4.4$ ozone admin datanode decommission ozone-datanode-1 ozone-datanode-5 ozone-datanode-3
Started decommissioning datanode(s):
ozone-datanode-1
ozone-datanode-5
ozone-datanode-3
Error: AllHosts: Insufficient nodes. Tried to decommission 3 nodes out of 5 IN-SERVICE HEALTHY and 0 not IN-SERVICE or not HEALTHY nodes. Cannot decommission as a minimum of 3 IN-SERVICE HEALTHY nodes are required to maintain replication after decommission.
Some nodes could not enter the decommission workflow

sh-4.4$ ozone admin datanode decommission ozone-datanode-4 ozone-datanode-5
Started decommissioning datanode(s):
ozone-datanode-4
ozone-datanode-5
Error: AllHosts: Insufficient nodes. Tried to decommission 2 nodes out of 3 IN-SERVICE HEALTHY and 2 not IN-SERVICE or not HEALTHY nodes. Cannot decommission as a minimum of 3 IN-SERVICE HEALTHY nodes are required to maintain replication after decommission.
Some nodes could not enter the decommission workflow

sh-4.4$ ozone admin datanode maintenance ozone-datanode-2 ozone-datanode-4 ozone-datanode-5
Entering maintenance mode on datanode(s):
ozone-datanode-2
ozone-datanode-4
ozone-datanode-5
Error: AllHosts: Insufficient nodes. Tried to start maintenance for 3 nodes out of 3 IN-SERVICE HEALTHY and 2 not IN-SERVICE or not HEALTHY nodes. Cannot enter maintenance mode as a minimum of 2 IN-SERVICE HEALTHY nodes are required to maintain replication after maintenance.
Some nodes could not start the maintenance workflow

@Tejaskriya (Contributor) left a comment:

LGTM!

@siddhantsangwan (Contributor) left a comment:

LGTM. @VarshaRaviCV thanks for improving this! Pending green CI.

@errose28 (Contributor) commented:
From this comment, can you share the output for decommissioning a node that is already decommissioning or in maintenance, and for decommissioning a nonexistent node?

We should also not merge this without unit tests on the numbers in the error message. From what I see, the current tests in TestNodeDecommissionManager only cover success/failure of different scenarios.

@siddhantsangwan (Contributor) commented:
I'm expecting these messages to show up if the datanode is in maintenance or missing:

      } catch (NodeNotFoundException e) {
        // We already validated the host strings and retrieved the DnDetails
        // object from the node manager. Therefore we should never get a
        // NodeNotFoundException here except if the node is removed in the
        // very short window between validation and starting decom. Therefore
        // log a warning and ignore the exception.
        LOG.warn("The host {} was not found in SCM. Ignoring the request to " +
            "decommission it", dn.getHostName());
        errors.add(new DatanodeAdminError(dn.getHostName(),
            "The host was not found in SCM"));
      } catch (InvalidNodeStateException e) {
        errors.add(new DatanodeAdminError(dn.getHostName(), e.getMessage()));
      }
    }
    return errors;
  }

@VarshaRaviCV (Contributor, Author) commented:
@errose28 If the node is in the decommissioning state or is already decommissioned, the logs will show the messages below.

2024-09-16 16:17:39 2024-09-16 10:47:39,592 [IPC Server handler 6 on default port 9860] INFO node.NodeDecommissionManager: Force flag = false. Checking if decommission is possible for dns: [7b15ac8b-b11d-4235-a5cd-494f649b0553(ozone-datanode-1.ozone_default/172.18.0.8)]
2024-09-16 16:17:39 2024-09-16 10:47:39,592 [IPC Server handler 6 on default port 9860] WARN node.NodeDecommissionManager: Cannot decommission ozone-datanode-1.ozone_default because it is not IN-SERVICE
2024-09-16 16:17:39 2024-09-16 10:47:39,592 [IPC Server handler 6 on default port 9860] INFO node.NodeDecommissionManager: Start Decommission called on node 7b15ac8b-b11d-4235-a5cd-494f649b0553(ozone-datanode-1.ozone_default/172.18.0.8) in state DECOMMISSIONING. Nothing to do.

2024-09-16 16:18:19 2024-09-16 10:48:19,993 [IPC Server handler 9 on default port 9860] INFO node.NodeDecommissionManager: Force flag = false. Checking if decommission is possible for dns: [d067994d-ee5f-4735-912d-5a7b93121ac0(ozone-datanode-3.ozone_default/172.18.0.9)]
2024-09-16 16:18:19 2024-09-16 10:48:19,993 [IPC Server handler 9 on default port 9860] WARN node.NodeDecommissionManager: Cannot decommission ozone-datanode-3.ozone_default because it is not IN-SERVICE
2024-09-16 16:18:19 2024-09-16 10:48:19,993 [IPC Server handler 9 on default port 9860] INFO node.NodeDecommissionManager: Start Decommission called on node d067994d-ee5f-4735-912d-5a7b93121ac0(ozone-datanode-3.ozone_default/172.18.0.9) in state DECOMMISSIONED. Nothing to do.

The CLI, however, will not show any error:

sh-4.4$ ozone admin datanode decommission ozone-datanode-1
Started decommissioning datanode(s):
ozone-datanode-1

sh-4.4$ ozone admin datanode decommission ozone-datanode-3
Started decommissioning datanode(s):
ozone-datanode-3

I will add validation of the node counts in the error message to the existing tests for insufficient nodes.

@VarshaRaviCV (Contributor, Author) commented:
@siddhantsangwan @Tejaskriya @errose28 I have added assertions for the node counts in the error message to the existing tests in TestNodeDecommissionManager.java. Please take a look.

@adoroszlai (Contributor) left a comment:

Thanks @VarshaRaviCV for the patch.

@@ -398,10 +398,12 @@ private synchronized boolean checkIfDecommissionPossible(List<DatanodeDetails> d
    if (opState != NodeOperationalState.IN_SERVICE) {
      numDecom--;
      validDns.remove(dn);
      LOG.warn("Cannot decommission {} because it is not IN-SERVICE", dn.getHostName());

I think this message (and the similar one for maintenance) could be improved by including the actual opState the node is in.

Also, please consider replacing dn.getHostName() with dn to rely on:

    public String toString() {
      return uuidString + "(" + hostName + "/" + ipAddress + ")";
    }

This shows just a bit more info to make the node uniquely identifiable while keeping the hostname for convenience. (Applies to all similar logs.)
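For example, the follow-up might end up with something roughly like the following (a sketch only, reusing the opState and dn variables visible in the diff above; the exact wording is left to the follow-up PR):

    // Sketch: include the node's actual operational state and pass dn itself so
    // DatanodeDetails#toString() prints uuid, hostname and IP for the node.
    LOG.warn("Cannot decommission {} because it is in state {}, not IN-SERVICE",
        dn, opState);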

@adoroszlai (Contributor) commented:
@VarshaRaviCV do you plan to update the patch based on my suggestion, or should we do it in a follow-up?

@VarshaRaviCV (Contributor, Author) commented:
@adoroszlai I will pick up the suggestion in a follow-up PR since this PR already has approvals.

@adoroszlai merged commit 980b960 into apache:master on Oct 29, 2024 (39 checks passed).
@adoroszlai (Contributor) commented:
Thanks @VarshaRaviCV for the patch, @errose28, @krishnaasawa1, @siddhantsangwan, @Tejaskriya for the review.
