HDDS-11239. Fix KeyOutputStream's exception handling when calling hsync concurrently #7047
I'll rebase when finishing the test on this branch.
hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/io/KeyOutputStream.java
@@ -379,7 +380,7 @@ BlockOutputStreamEntry getCurrentStreamEntry() {
 * @return the new current open stream to write to
update javadoc for the new parameter. When should it be true/false?
apart from a few questions it looks good.
Did a rebase and reran the repeating test.
+1
writes = runConcurrentWriteHSyncWithException(file, out, data, syncerThreads, errors, errorInjector);
}
validateWrittenFile(file, fs, data, writes);
fs.delete(file, false);
seems to be failing consistently: TestHSync.testUncommittedBlocks:515 expected: but was:
I wonder if it's related to the new commit just merged #7074
To address the failure in TestHSync#testUncommittedBlocks, ensure that the deletedTable is empty after fs.delete(file, false);
Rebase to master and add a line after fs.delete(file, false); to wait for KeyDeletingService to clean up the deletedTable:
waitForEmptyDeletedTable();
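Waiting for a background service to finish cleanup usually comes down to polling a condition until it holds or a timeout expires. The sketch below shows that pattern only; WaitSketch and waitUntil are hypothetical names, and this is not the actual waitForEmptyDeletedTable helper from the Ozone test code, whose implementation is not shown in this thread.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical polling helper for tests; not part of Ozone. */
final class WaitSketch {
  private WaitSketch() { }

  /**
   * Polls {@code condition} every {@code intervalMillis} until it returns true
   * or {@code timeoutMillis} has elapsed.
   *
   * @return true if the condition was met, false on timeout or interruption
   */
  static boolean waitUntil(BooleanSupplier condition, long intervalMillis, long timeoutMillis) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      if (System.nanoTime() - deadline >= 0) {
        return false; // timed out before the condition held
      }
      try {
        Thread.sleep(intervalMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return false;
      }
    }
  }
}
```

In a test, the condition would be whatever check reports that the deletedTable has no remaining entries, and the test would fail if waitUntil returns false.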
Got it. Fixed as @chungen0126 suggested.
Merged. Thanks @duongkame for the patch and @smengcl @chungen0126 for reviews!
What changes were proposed in this pull request?
When a block (or BlockOutputStream; I'll use the two terms interchangeably in this description) becomes faulty (any exception occurs while sending requests to the pipeline), KeyOutputStream allocates a new block (as the current block) and hands off all pending changes in the BufferPool to the current block. This is done by KeyOutputStream's handleException.
Two essential changes are needed to allow KeyOutputStream to be used by multiple threads concurrently.
handleException can be called multiple times for a given block. This is different from when KeyOutputStream is fully synchronized. The second handleException call on a block will see that the block has already been closed, and some assertions that hold under full synchronization no longer hold.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-11239
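The repeated handleException scenario from the description can be illustrated with a small, self-contained sketch. It is a simplification under stated assumptions, not the code in this patch: FailoverSketch and Block are hypothetical stand-ins for KeyOutputStream and BlockOutputStream, and the "already replaced" check is just one plausible way to make a second call on the same block harmless.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for KeyOutputStream; not the real Ozone class. */
class FailoverSketch {
  /** Hypothetical stand-in for a block / BlockOutputStream. */
  static final class Block {
    final int id;
    Block(int id) { this.id = id; }
  }

  private final AtomicInteger allocations = new AtomicInteger();
  private Block current = new Block(0);

  synchronized Block getCurrent() { return current; }

  int allocationCount() { return allocations.get(); }

  /**
   * Called by any writer that saw a failure on {@code failed}.
   * Only the first caller for a given block replaces it; later callers find
   * the block already replaced and return without asserting on its state.
   */
  synchronized Block handleException(Block failed) {
    if (failed != current) {
      return current; // another thread already handled this block's failure
    }
    current = new Block(allocations.incrementAndGet());
    return current;
  }

  /** Starts {@code threads} callers that all report the same failed block. */
  static FailoverSketch runConcurrentFailures(int threads) {
    FailoverSketch stream = new FailoverSketch();
    Block failed = stream.getCurrent();
    CountDownLatch start = new CountDownLatch(1);
    List<Thread> workers = new ArrayList<>();
    for (int i = 0; i < threads; i++) {
      Thread t = new Thread(() -> {
        try {
          start.await(); // release all callers at roughly the same time
          stream.handleException(failed);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
      workers.add(t);
      t.start();
    }
    start.countDown();
    try {
      for (Thread t : workers) {
        t.join();
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for workers", e);
    }
    return stream;
  }
}
```

However many threads report the same failed block, this sketch allocates exactly one replacement. The real KeyOutputStream also has to hand off pending BufferPool data, which the sketch leaves out.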
How was this patch tested?
New tests were added to verify the failure and the fix: TestHSync#testConcurrentExceptionHandling
Before fix: https://github.com/duongkame/ozone/actions/runs/10308913291
After fix:
run 1: 20 x 1 https://github.com/duongkame/ozone/actions/runs/10427687646
run 2: 10 x 10 https://github.com/duongkame/ozone/actions/runs/10428271747
Regression test on normal concurrent hsync: https://github.com/duongkame/ozone/actions/runs/10428273927