HDDS-10377. Allow datanodes to do chunk level modifications to closed containers. #7111
base: HDDS-10239-container-reconciliation
Conversation
For this change we shouldn't need any protos or new request types. We just need a method that can be called from within the datanode and passed the chunk info obtained from a read chunk call. The method can be private within KeyValueHandler, because currently it will only be called by KeyValueHandler#reconcileContainer.
There's also an existing bug in ContainerData#updateWriteStats, which is called from FilePerBlockStrategy#writeChunk: it calls incrWriteBytes to increase the volume's used space even for an overwrite. The volume's space should instead be adjusted by the difference between the new and old chunk sizes.
@aswinshakil Thank you very much for providing this functionality! I would like to ask if we have a similar design for EC (Erasure Coding) containers. For 3-replica blocks, if we find that a block write operation has an issue, we can repair it using the other replicas. However, for EC blocks, it becomes more challenging to determine the true length of the block.
Hi @slfan1989, this is being developed as part of the container reconciliation feature in HDDS-10239. This feature provides two high-level functionalities for containers.
The current design document can be found here; in particular, you can refer to the section on phases of implementation. We are currently implementing phase 1, which only applies to Ratis containers. Support for EC containers is in phase 3, which we have not planned for yet, because EC already has a reconciliation algorithm as described in (2) above, which is reconstruction.
So in this case, the fix should be made in the reconstruction code path, since that is an existing way to repair EC containers after they have been closed.
EC and Ratis differ here. In Ratis, the longest block length wins, because we have a quorum on the server side to commit the last write. In EC, the shortest block wins, because it is up to the client to make sure all datanode replicas have committed the last issued write before the client commits that length back to the OM. If only a few datanodes commit, that stripe is invalid and not committed back to OM.
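The two rules above can be expressed concisely. This is an illustrative standalone example, not Ozone code; the class and method names are hypothetical.

```java
// Illustrative only: how the committed block length would be chosen from
// replica lengths under each replication type.
public class CommittedLength {
    // Ratis: a server-side quorum commits the last write, so the longest
    // replica length wins during reconciliation.
    static long ratisCommittedLength(long[] replicaLengths) {
        long max = 0;
        for (long len : replicaLengths) {
            max = Math.max(max, len);
        }
        return max;
    }

    // EC: the client commits a length to OM only after every datanode in
    // the stripe has acknowledged the write, so the shortest replica
    // length bounds what is valid.
    static long ecCommittedLength(long[] replicaLengths) {
        long min = Long.MAX_VALUE;
        for (long len : replicaLengths) {
            min = Math.min(min, len);
        }
        return min;
    }
}
```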
@aswinshakil thanks for the changes. For additional functionality, the helper should determine whether overwrite is required or not based on the position of the chunkInfo relative to the current entry. I think this is what the TODO currently indicates. We may also want to verify that we do not leave gaps in the file. Additionally, we need to handle the case where we are appending to a chunk or block with new data this replica missed during write, and actually need to add checksums to the DB because they don't exist on this replica.
Seems fine, will give it one more look tomorrow.
@errose28 Thank you very much for your response! The content is very thorough and complete.
```java
private void checkContainerClose(KeyValueContainer kvContainer)
    throws StorageContainerException {

  final State containerState = kvContainer.getContainerState();
  if (containerState == State.QUASI_CLOSED || containerState == State.CLOSED
      || containerState == State.UNHEALTHY) {
```
nit: the line length limit was increased to 120, so you might need to update your IDE settings to avoid overly aggressive wrapping.
```java
DispatcherContext dispatcherContext = DispatcherContext.getHandleWriteChunk();

// Set CHUNK_OVERWRITE to overwrite existing chunk.
chunkInfo.addMetadata(OzoneConsts.CHUNK_OVERWRITE, "true");
```
Let's allow the caller to specify whether this is an overwrite with a boolean flag or something similar passed in to this method. The reconciliation algorithm will know if it is adding new data or replacing existing data when it calls this method. This can avoid any surprises that might come up if we always assume overwrite even if that is not the intent.
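One possible shape for that suggestion, as a standalone sketch. The class, method, and metadata key below are hypothetical stand-ins (the key stands in for whatever literal `OzoneConsts.CHUNK_OVERWRITE` resolves to), not the actual Ozone API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the caller declares overwrite intent via a flag
// instead of the helper unconditionally tagging the chunk as an overwrite.
public class OverwriteIntent {
    // Stand-in key for OzoneConsts.CHUNK_OVERWRITE.
    static final String CHUNK_OVERWRITE = "ChunkOverwrite";

    static Map<String, String> chunkMetadata(boolean overwrite) {
        Map<String, String> metadata = new HashMap<>();
        if (overwrite) {
            // Only mark the chunk when reconciliation is replacing data,
            // not when it is filling in data this replica never had.
            metadata.put(CHUNK_OVERWRITE, "true");
        }
        return metadata;
    }
}
```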
```java
chunks.add(chunkInfo.getProtoBufMessage());
blockData.setChunks(chunks);
```
I think this list is implicitly sorted by offset at the time of writing. It's possible the chunk we are patching is in the middle, so we will need to re-sort this list. This should also show up when tests are added to read data that has been patched with this method.
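The re-sort itself is a one-liner. Below is a self-contained sketch where `Chunk` is a minimal stand-in for the real `ChunkInfo` proto, which carries an offset:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of re-sorting a block's chunk list by offset after patching a
// chunk that may land in the middle of the block.
public class SortChunks {
    // Minimal stand-in for ChunkInfo; only the offset matters here.
    record Chunk(long offset, long len) {}

    static List<Chunk> sortByOffset(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // Readers assume the list is ordered by offset, so restore that
        // invariant after inserting or replacing a chunk.
        sorted.sort(Comparator.comparingLong(Chunk::offset));
        return sorted;
    }
}
```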
```java
// To be set from the Replica's BCSId
blockData.setBlockCommitSequenceId(dispatcherContext.getLogIndex());

blockManager.putBlock(kvContainer, blockData);
```
The endOfBlock boolean is implicitly set to true here. I think that's a quirk on the write path we don't need to worry about in this case, and we can explicitly set it to false.
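A minimal sketch of making the flag explicit; the interface and method below are hypothetical stand-ins for `BlockManager#putBlock`, used only to illustrate passing the flag rather than relying on an overload's default.

```java
// Hypothetical stand-in for BlockManager#putBlock with an explicit
// endOfBlock flag; names are illustrative only.
public class PutBlockCall {
    interface BlockManager {
        void putBlock(Object container, Object blockData, boolean endOfBlock);
    }

    static void persistPatchedBlock(BlockManager blockManager,
                                    Object container, Object blockData) {
        // Pass endOfBlock explicitly as false instead of relying on the
        // write-path overload that implicitly sets it to true.
        blockManager.putBlock(container, blockData, false);
    }
}
```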
```java
metrics.incContainerBytesStats(Type.WriteChunk, chunkInfo.getLen());
metrics.incContainerOpsLatencies(Type.WriteChunk, Time.monotonicNowNanos() - writeChunkStartTime);

if (blockData != null) {
```
Should we separate the write chunk and put block methods?
```java
blockData.setChunks(chunks);

// To be set from the Replica's BCSId
blockData.setBlockCommitSequenceId(dispatcherContext.getLogIndex());
```
This will always be 0, because it was never set in the DispatcherContext. In the big picture of how this is used, the client should get the block data's BCSID when it reads the chunk and pass it in to this method, probably already as part of the BlockData parameter.
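The intended flow can be sketched as follows. `BlockData` here is a minimal stand-in for the real class, and `patchLocalBlock` is a hypothetical helper, used only to show the BCSID travelling from the replica's block data instead of from a local log index.

```java
// Illustrative flow: carry the BCSID read from the healthy replica's
// BlockData instead of taking the local log index (which is 0 here).
public class BcsIdPropagation {
    // Minimal stand-in for BlockData.
    static class BlockData {
        private long bcsId;
        BlockData(long bcsId) { this.bcsId = bcsId; }
        long getBlockCommitSequenceId() { return bcsId; }
        void setBlockCommitSequenceId(long id) { this.bcsId = id; }
    }

    static void patchLocalBlock(BlockData fromReplica, BlockData local) {
        // Use the replica's committed BCSID, not a DispatcherContext log
        // index that was never set on this code path.
        local.setBlockCommitSequenceId(fromReplica.getBlockCommitSequenceId());
    }
}
```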
What changes were proposed in this pull request?
Right now we cannot write chunks for CLOSED, QUASI_CLOSED, and UNHEALTHY containers. As a part of reconciliation, we need to reuse WriteChunkRequest to write to unopen containers. In this patch, I have added an API to write chunks to existing unopen containers.
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-10377
How was this patch tested?