Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved handling of parquet writes for read-modify-write #4212

Merged
merged 35 commits into from
Aug 10, 2023

Conversation

malhotrashivam
Copy link
Contributor

@malhotrashivam malhotrashivam commented Jul 20, 2023

Bug fix, closes #2831

Currently, while writing a parquet file, we overwrite if there is any existing files at that path. And in case the write fails (for whatever reason), we delete the partially written file. This is particularly bad for the read-modify-write scenario where we open a file handle to an existing parquet file, do some table operations and then try to write it at the same path. In the current case, we can overwrite the file while we are reading from it which will lead to a bad state.

In this PR, we are changing the write functionality by first writing to a temporary file in the same directory. Once the write completes, we place the file at its destination path and swap out if there is any existing files at that path.

@malhotrashivam malhotrashivam added parquet Related to the Parquet integration NoDocumentationNeeded NoReleaseNotesNeeded No release notes are needed. labels Jul 20, 2023
@malhotrashivam malhotrashivam added this to the July 2023 milestone Jul 20, 2023
@malhotrashivam malhotrashivam self-assigned this Jul 20, 2023
@malhotrashivam malhotrashivam added the bug Something isn't working label Jul 31, 2023
for (Map.Entry<String, GroupingColumnWritingInfo> entry : groupingColumnsWritingInfoMap.entrySet()) {
final String groupingColumnName = entry.getKey();
final Table auxiliaryTable = groupingAsTable(t, groupingColumnName);
final String parquetColumnName = entry.getValue().parquetColumnName;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General thing to think about:
ParquetWriteInstructions are used at multiple levels. This code could do with a cleanup around control flow and the responsibilities of each layer.

}
throw e;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we didn't have to declare this in throws. Your existing version is better.

@malhotrashivam malhotrashivam changed the title [WIP] Improved handling of failed parquet writes (WIP) Improved handling of failed parquet writes Aug 9, 2023
@malhotrashivam malhotrashivam changed the title (WIP) Improved handling of failed parquet writes Improved handling of failed parquet writes Aug 10, 2023
@malhotrashivam malhotrashivam merged commit 57b429e into deephaven:main Aug 10, 2023
10 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Aug 10, 2023
@malhotrashivam malhotrashivam changed the title Improved handling of failed parquet writes Improved handling of parquet writes for read-modify-write Aug 10, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working NoDocumentationNeeded NoReleaseNotesNeeded No release notes are needed. parquet Related to the Parquet integration
Projects
None yet
2 participants