
HDDS-10377. Allow datanodes to do chunk level modifications to closed containers. #7111

Draft · wants to merge 4 commits into base: HDDS-10239-container-reconciliation
Conversation

@aswinshakil (Member)

What changes were proposed in this pull request?

Right now we cannot write chunks for CLOSED, QUASI_CLOSED, and UNHEALTHY containers. As part of reconciliation, we need to reuse WriteChunkRequest to write to non-open containers. This patch adds an API to write chunks to existing non-open containers.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-10377

How was this patch tested?

  • Pending Tests

@errose28 (Contributor) left a comment

For this change we shouldn't need any protos or new request types. We just need a method that can be called from within the datanode that will be passed the chunk info obtained from a read chunk call. The method can be private within KeyValueHandler because currently it will only be called by KeyValueHandler#reconcileContainer.

There's also an existing bug in ContainerData#updateWriteStats which is called from FilePerBlockStrategy#writeChunk where it will call incrWriteBytes to increase the volume's used space even for an overwrite. The volume's space should instead be adjusted for the new chunk size.
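The volume-accounting fix described above could be sketched as follows. This is an illustrative stand-in, not the actual `ContainerData`/`HddsVolume` API: on an overwrite, the volume's used space should change by the size delta rather than by the full length of the new chunk.

```java
// Hypothetical sketch of the suggested fix. Class and method names are
// illustrative; in Ozone the real accounting lives in
// ContainerData#updateWriteStats and the volume's usage tracking.
public class VolumeUsageSketch {
  private long usedBytes;

  void onChunkWrite(long existingChunkLen, long newChunkLen, boolean overwrite) {
    if (overwrite) {
      // Only the difference in size changes the volume's used space.
      usedBytes += newChunkLen - existingChunkLen;
    } else {
      usedBytes += newChunkLen;
    }
  }

  long getUsedBytes() {
    return usedBytes;
  }

  public static void main(String[] args) {
    VolumeUsageSketch v = new VolumeUsageSketch();
    v.onChunkWrite(0, 4096, false);       // fresh write: +4096
    v.onChunkWrite(4096, 4096, true);     // same-size overwrite: +0
    System.out.println(v.getUsedBytes()); // prints 4096
  }
}
```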

@slfan1989 (Contributor)

@aswinshakil Thank you very much for providing this functionality! I would like to ask if we have a similar design for EC (Erasure Coding) containers. For 3-replica blocks, if we find that a block write operation has an issue, we can repair it using the other replicas. However, for EC blocks, it becomes more challenging to determine the true length of the block.

@errose28 (Contributor)

Hi @slfan1989 this is being developed as part of the container reconciliation feature in HDDS-10239. This feature provides two high level functionalities for containers:

  1. The ability to report their contents to SCM via a container level hash which can be compared to other replicas.
  2. The ability to "reconcile" a container replica with its peers when that hash differs. This means making incremental updates to a container based on data a peer node has that the current node may be missing or have lost.

The current design document can be found here. In particular you can refer to the section on phases of implementation. We are currently implementing phase 1, which only applies to Ratis containers. Support for EC containers is in phase 3, which we have not planned for yet. This is because EC already has a reconciliation algorithm as described in (2) above, which is reconstruction.

For 3-replica blocks, if we find that a block write operation has an issue, we can repair it using the other replicas.

So in this case, the fix should be made in the reconstruction code path, since that is an existing way to repair EC containers after they have been closed.

However, for EC blocks, it becomes more challenging to determine the true length of the block.

EC and Ratis differ here. In Ratis the longest block length wins, because we have a quorum on the server side to commit the last write. In EC, the shortest block wins because it is up to the client to make sure all datanode replicas have committed the last issued write before the client commits that length back to the OM. If only a few datanodes commit, that stripe is invalid and not committed back to OM.
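The commit rule described above can be reduced to a one-line difference, sketched here with plain Java (this is an illustration of the rule, not Ozone code): with Ratis the longest reported replica length wins, while with EC the shortest wins, because a stripe that only some datanodes acknowledged is invalid and never committed back to OM.

```java
import java.util.Collections;
import java.util.List;

// Illustrative sketch of the block-length commit rule for the two
// replication types. Method names are hypothetical.
public class BlockLengthRule {
  static long ratisCommittedLength(List<Long> replicaLengths) {
    // A server-side quorum committed the last write, so the longest wins.
    return Collections.max(replicaLengths);
  }

  static long ecCommittedLength(List<Long> replicaLengths) {
    // The client only commits a length once every replica in the stripe
    // has acknowledged, so the shortest wins.
    return Collections.min(replicaLengths);
  }

  public static void main(String[] args) {
    List<Long> lengths = List.of(1024L, 2048L, 2048L);
    System.out.println(ratisCommittedLength(lengths)); // prints 2048
    System.out.println(ecCommittedLength(lengths));    // prints 1024
  }
}
```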

@errose28 (Contributor)

@aswinshakil thanks for the changes. Overall just calling ChunkManager#writeChunk should handle the case for the file update, and we should propagate any exceptions.

For additional functionality, the helper should determine whether overwrite is required or not based on the position of the chunkInfo relative to the current entry. I think this is what the TODO currently indicates. We may also want to verify that we do not leave gaps in the file. Additionally, we need to handle the case where we are appending to a chunk or block with new data this replica missed during write, and actually need to add checksums to the DB because they don't exist on this replica.
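The overwrite/append/gap decision suggested above could look roughly like this. The class and method names are hypothetical; the real helper would take the offsets from the `ChunkInfo` and the block's current committed length.

```java
// A minimal sketch of the classification the helper could perform,
// assuming the incoming chunk's offset/length and the block's current
// length are known. Not actual KeyValueHandler code.
public class ChunkPlacement {
  enum Action { OVERWRITE, APPEND, GAP }

  static Action classify(long chunkOffset, long chunkLen, long currentBlockLen) {
    if (chunkOffset + chunkLen <= currentBlockLen) {
      return Action.OVERWRITE; // chunk falls entirely within existing data
    }
    if (chunkOffset <= currentBlockLen) {
      return Action.APPEND;    // chunk starts at or before the current end
    }
    return Action.GAP;         // writing here would leave a hole in the file
  }

  public static void main(String[] args) {
    System.out.println(classify(0, 4096, 8192));     // OVERWRITE
    System.out.println(classify(8192, 4096, 8192));  // APPEND
    System.out.println(classify(16384, 4096, 8192)); // GAP
  }
}
```

The GAP case is where the helper would reject the write (or fetch the missing range first) to guarantee the file stays contiguous.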

@kerneltime (Contributor)

Seems fine, will give it one more look tomorrow.

@slfan1989 (Contributor)

(quoting @errose28's reply above in full)

@errose28 Thank you very much for your response! The content is very thorough and complete.

Comment on lines 1203 to 1208
private void checkContainerClose(KeyValueContainer kvContainer)
    throws StorageContainerException {

  final State containerState = kvContainer.getContainerState();
  if (containerState == State.QUASI_CLOSED || containerState == State.CLOSED
      || containerState == State.UNHEALTHY) {
Contributor

nit. Line length was increased to 120 so you might need to update your IDE settings to avoid overly aggressive wrapping.

DispatcherContext dispatcherContext = DispatcherContext.getHandleWriteChunk();

// Set CHUNK_OVERWRITE to overwrite existing chunk.
chunkInfo.addMetadata(OzoneConsts.CHUNK_OVERWRITE, "true");
Contributor

Let's allow the caller to specify whether this is an overwrite with a boolean flag or something similar passed in to this method. The reconciliation algorithm will know if it is adding new data or replacing existing data when it calls this method. This can avoid any surprises that might come up if we always assume overwrite even if that is not the intent.
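A sketch of that suggestion, assuming a plain map stands in for the chunk's metadata: the caller states its intent with a boolean, and the helper only marks the chunk as an overwrite when asked. The literal value of the metadata key here is an assumption mirroring `OzoneConsts.CHUNK_OVERWRITE`.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: the caller decides whether this write is an
// overwrite, instead of the helper unconditionally assuming it is.
public class OverwriteFlagSketch {
  // Stand-in for OzoneConsts.CHUNK_OVERWRITE; the literal is an assumption.
  static final String CHUNK_OVERWRITE = "OverWriteRequested";

  static Map<String, String> buildChunkMetadata(boolean overwrite) {
    Map<String, String> metadata = new HashMap<>();
    if (overwrite) {
      // Only mark the chunk as an overwrite when the caller asked for one.
      metadata.put(CHUNK_OVERWRITE, "true");
    }
    return metadata;
  }

  public static void main(String[] args) {
    System.out.println(buildChunkMetadata(true));  // contains the overwrite key
    System.out.println(buildChunkMetadata(false)); // empty map
  }
}
```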

Comment on lines 1004 to 1005
chunks.add(chunkInfo.getProtoBufMessage());
blockData.setChunks(chunks);
Contributor

I think this list is implicitly sorted by offset at the time of writing. It's possible the chunk we are patching is in the middle, so will need to re-sort this. This should also show up when tests are added to read data that has been patched with this method.
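The re-sort could be as simple as the sketch below, where a small record stands in for the `ChunkInfo` protobuf message: after replacing a chunk that may land in the middle of the block, restore offset order before writing the list back.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch: keep the block's chunk list sorted by offset after
// patching an entry. The Chunk record is a stand-in for ChunkInfo.
public class ChunkSortSketch {
  record Chunk(long offset, long len) { }

  static void patchChunk(List<Chunk> chunks, Chunk patched) {
    chunks.removeIf(c -> c.offset() == patched.offset()); // drop any entry at this offset
    chunks.add(patched);
    chunks.sort(Comparator.comparingLong(Chunk::offset)); // restore offset order
  }

  public static void main(String[] args) {
    List<Chunk> chunks = new ArrayList<>(
        List.of(new Chunk(0, 4096), new Chunk(8192, 4096)));
    patchChunk(chunks, new Chunk(4096, 4096)); // patched chunk lands in the middle
    System.out.println(chunks.get(1).offset()); // prints 4096
  }
}
```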

// To be set from the Replica's BCSId
blockData.setBlockCommitSequenceId(dispatcherContext.getLogIndex());

blockManager.putBlock(kvContainer, blockData);
Contributor

The endOfBlock boolean is implicitly set to true here. I think that's a quirk on the write path we don't need to worry about in this case and can explicitly set it to false.

metrics.incContainerBytesStats(Type.WriteChunk, chunkInfo.getLen());
metrics.incContainerOpsLatencies(Type.WriteChunk, Time.monotonicNowNanos() - writeChunkStartTime);

if (blockData != null) {
Contributor

Should we separate the write chunk and put block methods?

blockData.setChunks(chunks);

// To be set from the Replica's BCSId
blockData.setBlockCommitSequenceId(dispatcherContext.getLogIndex());
Contributor

This will always be 0 because it was never set in the DispatcherContext. In the big picture of how this is used, the client should get the block data's BCSID when it reads the chunk and pass it in to this method, probably already as a part of the BlockData parameter.
