
HDDS-10377. Allow datanodes to do chunk level modifications to closed containers. #7111

Draft · wants to merge 4 commits into base: HDDS-10239-container-reconciliation
Conversation

@aswinshakil (Member)

What changes were proposed in this pull request?

Right now we cannot write chunks for CLOSED, QUASI_CLOSED, and UNHEALTHY containers. As part of reconciliation, we need to reuse WriteChunkRequest to write to non-open containers. This patch adds an API to write chunks to existing non-open containers.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-10377

How was this patch tested?

  • Pending Tests

@errose28 (Contributor) left a comment

For this change we shouldn't need any protos or new request types. We just need a method that can be called from within the datanode that will be passed the chunk info obtained from a read chunk call. The method can be private within KeyValueHandler because currently it will only be called by KeyValueHandler#reconcileContainer.

There's also an existing bug in ContainerData#updateWriteStats which is called from FilePerBlockStrategy#writeChunk where it will call incrWriteBytes to increase the volume's used space even for an overwrite. The volume's space should instead be adjusted for the new chunk size.
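The volume-accounting fix described above could be sketched as follows. This is an illustrative stand-in, not the actual `ContainerData`/`HddsVolume` API: on an overwrite, the volume's used space should change by the size delta rather than by the full length of the new chunk.

```java
// Hypothetical sketch of the suggested fix. Class and method names are
// illustrative; in Ozone the real accounting lives in
// ContainerData#updateWriteStats and the volume's usage tracking.
public class VolumeUsageSketch {
  private long usedBytes;

  void onChunkWrite(long existingChunkLen, long newChunkLen, boolean overwrite) {
    if (overwrite) {
      // Only the difference in size changes the volume's used space.
      usedBytes += newChunkLen - existingChunkLen;
    } else {
      usedBytes += newChunkLen;
    }
  }

  long getUsedBytes() {
    return usedBytes;
  }

  public static void main(String[] args) {
    VolumeUsageSketch v = new VolumeUsageSketch();
    v.onChunkWrite(0, 4096, false);       // fresh write: +4096
    v.onChunkWrite(4096, 4096, true);     // same-size overwrite: +0
    System.out.println(v.getUsedBytes()); // prints 4096
  }
}
```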

@slfan1989 (Contributor)

@aswinshakil Thank you very much for providing this functionality! I would like to ask if we have a similar design for EC (Erasure Coding) containers. For 3-replica blocks, if we find that a block write operation has an issue, we can repair it using the other replicas. However, for EC blocks, it becomes more challenging to determine the true length of the block.

@errose28 (Contributor)

Hi @slfan1989 this is being developed as part of the container reconciliation feature in HDDS-10239. This feature provides two high level functionalities for containers:

  1. The ability to report their contents to SCM via a container level hash which can be compared to other replicas.
  2. The ability to "reconcile" a container replica with its peers when that hash differs. This means making incremental updates to a container based on data a peer node has that the current node may be missing or have lost.

The current design document can be found here. In particular you can refer to the section on phases of implementation. We are currently implementing phase 1, which only applies to Ratis containers. Support for EC containers is in phase 3, which we have not planned for yet. This is because EC already has a reconciliation algorithm as described in (2) above, which is reconstruction.

For 3-replica blocks, if we find that a block write operation has an issue, we can repair it using the other replicas.

So in this case, the fix should be made in the reconstruction code path, since that is an existing way to repair EC containers after they have been closed.

However, for EC blocks, it becomes more challenging to determine the true length of the block.

EC and Ratis differ here. In Ratis the longest block length wins, because we have a quorum on the server side to commit the last write. In EC, the shortest block wins because it is up to the client to make sure all datanode replicas have committed the last issued write before the client commits that length back to the OM. If only a few datanodes commit, that stripe is invalid and not committed back to OM.
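The commit rule described above can be reduced to a one-line difference, sketched here with plain Java (this is an illustration of the rule, not Ozone code): with Ratis the longest reported replica length wins, while with EC the shortest wins, because a stripe that only some datanodes acknowledged is invalid and never committed back to OM.

```java
import java.util.Collections;
import java.util.List;

// Illustrative sketch of the block-length commit rule for the two
// replication types. Method names are hypothetical.
public class BlockLengthRule {
  static long ratisCommittedLength(List<Long> replicaLengths) {
    // A server-side quorum committed the last write, so the longest wins.
    return Collections.max(replicaLengths);
  }

  static long ecCommittedLength(List<Long> replicaLengths) {
    // The client only commits a length once every replica in the stripe
    // has acknowledged, so the shortest wins.
    return Collections.min(replicaLengths);
  }

  public static void main(String[] args) {
    List<Long> lengths = List.of(1024L, 2048L, 2048L);
    System.out.println(ratisCommittedLength(lengths)); // prints 2048
    System.out.println(ecCommittedLength(lengths));    // prints 1024
  }
}
```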

@errose28 (Contributor)

@aswinshakil thanks for the changes. Overall just calling ChunkManager#writeChunk should handle the case for the file update, and we should propagate any exceptions.

For additional functionality, the helper should determine whether overwrite is required or not based on the position of the chunkInfo relative to the current entry. I think this is what the TODO currently indicates. We may also want to verify that we do not leave gaps in the file. Additionally, we need to handle the case where we are appending to a chunk or block with new data this replica missed during write, and actually need to add checksums to the DB because they don't exist on this replica.
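The overwrite/append/gap decision suggested above could look roughly like this. The class and method names are hypothetical; the real helper would take the offsets from the `ChunkInfo` and the block's current committed length.

```java
// A minimal sketch of the classification the helper could perform,
// assuming the incoming chunk's offset/length and the block's current
// length are known. Not actual KeyValueHandler code.
public class ChunkPlacement {
  enum Action { OVERWRITE, APPEND, GAP }

  static Action classify(long chunkOffset, long chunkLen, long currentBlockLen) {
    if (chunkOffset + chunkLen <= currentBlockLen) {
      return Action.OVERWRITE; // chunk falls entirely within existing data
    }
    if (chunkOffset <= currentBlockLen) {
      return Action.APPEND;    // chunk starts at or before the current end
    }
    return Action.GAP;         // writing here would leave a hole in the file
  }

  public static void main(String[] args) {
    System.out.println(classify(0, 4096, 8192));     // OVERWRITE
    System.out.println(classify(8192, 4096, 8192));  // APPEND
    System.out.println(classify(16384, 4096, 8192)); // GAP
  }
}
```

The GAP case is where the helper would reject the write (or fetch the missing range first) to guarantee the file stays contiguous.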

@kerneltime (Contributor)

Seems fine, will give it one more look tomorrow.

@slfan1989 (Contributor)

(quoting @errose28's reply above in full)

@errose28 Thank you very much for your response! The content is very thorough and complete.

Comment on lines 1203 to 1208
private void checkContainerClose(KeyValueContainer kvContainer)
    throws StorageContainerException {

  final State containerState = kvContainer.getContainerState();
  if (containerState == State.QUASI_CLOSED || containerState == State.CLOSED
      || containerState == State.UNHEALTHY) {
Contributor

nit. Line length was increased to 120 so you might need to update your IDE settings to avoid overly aggressive wrapping.

DispatcherContext dispatcherContext = DispatcherContext.getHandleWriteChunk();

// Set CHUNK_OVERWRITE to overwrite existing chunk.
chunkInfo.addMetadata(OzoneConsts.CHUNK_OVERWRITE, "true");
Contributor

Let's allow the caller to specify whether this is an overwrite with a boolean flag or something similar passed in to this method. The reconciliation algorithm will know if it is adding new data or replacing existing data when it calls this method. This can avoid any surprises that might come up if we always assume overwrite even if that is not the intent.
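A sketch of that suggestion, assuming a plain map stands in for the chunk's metadata: the caller states its intent with a boolean, and the helper only marks the chunk as an overwrite when asked. The literal value of the metadata key here is an assumption mirroring `OzoneConsts.CHUNK_OVERWRITE`.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: the caller decides whether this write is an
// overwrite, instead of the helper unconditionally assuming it is.
public class OverwriteFlagSketch {
  // Stand-in for OzoneConsts.CHUNK_OVERWRITE; the literal is an assumption.
  static final String CHUNK_OVERWRITE = "OverWriteRequested";

  static Map<String, String> buildChunkMetadata(boolean overwrite) {
    Map<String, String> metadata = new HashMap<>();
    if (overwrite) {
      // Only mark the chunk as an overwrite when the caller asked for one.
      metadata.put(CHUNK_OVERWRITE, "true");
    }
    return metadata;
  }

  public static void main(String[] args) {
    System.out.println(buildChunkMetadata(true));  // contains the overwrite key
    System.out.println(buildChunkMetadata(false)); // empty map
  }
}
```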

Comment on lines 1004 to 1005
chunks.add(chunkInfo.getProtoBufMessage());
blockData.setChunks(chunks);
Contributor

I think this list is implicitly sorted by offset at the time of writing. It's possible the chunk we are patching is in the middle, so will need to re-sort this. This should also show up when tests are added to read data that has been patched with this method.
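The re-sort could be as simple as the sketch below, where a small record stands in for the `ChunkInfo` protobuf message: after replacing a chunk that may land in the middle of the block, restore offset order before writing the list back.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch: keep the block's chunk list sorted by offset after
// patching an entry. The Chunk record is a stand-in for ChunkInfo.
public class ChunkSortSketch {
  record Chunk(long offset, long len) { }

  static void patchChunk(List<Chunk> chunks, Chunk patched) {
    chunks.removeIf(c -> c.offset() == patched.offset()); // drop any entry at this offset
    chunks.add(patched);
    chunks.sort(Comparator.comparingLong(Chunk::offset)); // restore offset order
  }

  public static void main(String[] args) {
    List<Chunk> chunks = new ArrayList<>(
        List.of(new Chunk(0, 4096), new Chunk(8192, 4096)));
    patchChunk(chunks, new Chunk(4096, 4096)); // patched chunk lands in the middle
    System.out.println(chunks.get(1).offset()); // prints 4096
  }
}
```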

// To be set from the Replica's BCSId
blockData.setBlockCommitSequenceId(dispatcherContext.getLogIndex());

blockManager.putBlock(kvContainer, blockData);
Contributor

The endOfBlock boolean is implicitly set to true here. I think that's a quirk on the write path we don't need to worry about in this case and can explicitly set it to false.

metrics.incContainerBytesStats(Type.WriteChunk, chunkInfo.getLen());
metrics.incContainerOpsLatencies(Type.WriteChunk, Time.monotonicNowNanos() - writeChunkStartTime);

if (blockData != null) {
Contributor

Should we separate the write chunk and put block methods?

blockData.setChunks(chunks);

// To be set from the Replica's BCSId
blockData.setBlockCommitSequenceId(dispatcherContext.getLogIndex());
Contributor

This will always be 0 because it was never set in the DispatcherContext. In the big picture of how this is used, the client should get the block data's BCSID when it reads the chunk and pass it in to this method, probably already as a part of the BlockData parameter.
