-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added support to read partitioned parquet files from S3 #5206
Conversation
71f917e
to
c16ac2f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python LGTM
7e913f2
to
2512562
Compare
2512562
to
3dd9fb9
Compare
@@ -849,7 +849,11 @@ private static Pair<TableDefinition, ParquetInstructions> infer( | |||
allColumns.add(ColumnDefinition.fromGenericType(partitionKey, dataType, null, | |||
ColumnDefinition.ColumnType.Partitioning)); | |||
} | |||
allColumns.addAll(schemaInfo.getFirst()); | |||
// Only read non-partitioning columns from the parquet files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was needed because the parquet files written by Iceberg had partitioning columns data included in the parquet files. So was leading to duplicate columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to make sure we aren't breaking any existing behavior w/ this change...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests are passing and I am able to read all the files in deephaven-examples.
I will wait for Ryan to have a final comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This raises a question for partitioned writing: should we include partitioning columns redundantly in the parquet files we write?
.../table/src/main/java/io/deephaven/parquet/table/layout/ParquetKeyValuePartitionedLayout.java
Outdated
Show resolved
Hide resolved
b6876cd
...in/java/io/deephaven/engine/table/impl/locations/local/URIStreamKeyValuePartitionLayout.java
Outdated
Show resolved
Hide resolved
...in/java/io/deephaven/engine/table/impl/locations/local/URIStreamKeyValuePartitionLayout.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still need to review s3-related files.
Util/channel/src/main/java/io/deephaven/util/channel/SeekableChannelsProvider.java
Outdated
Show resolved
Hide resolved
...rc/main/java/io/deephaven/engine/table/impl/locations/local/FileKeyValuePartitionLayout.java
Outdated
Show resolved
Hide resolved
...in/java/io/deephaven/engine/table/impl/locations/local/URIStreamKeyValuePartitionLayout.java
Outdated
Show resolved
Hide resolved
...in/java/io/deephaven/engine/table/impl/locations/local/URIStreamKeyValuePartitionLayout.java
Outdated
Show resolved
Hide resolved
extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetTools.java
Outdated
Show resolved
Hide resolved
...quet/table/src/main/java/io/deephaven/parquet/table/layout/ParquetFlatPartitionedLayout.java
Outdated
Show resolved
Hide resolved
.../table/src/main/java/io/deephaven/parquet/table/layout/ParquetKeyValuePartitionedLayout.java
Outdated
Show resolved
Hide resolved
.../table/src/main/java/io/deephaven/parquet/table/layout/ParquetKeyValuePartitionedLayout.java
Outdated
Show resolved
Hide resolved
.../table/src/main/java/io/deephaven/parquet/table/layout/ParquetKeyValuePartitionedLayout.java
Outdated
Show resolved
Hide resolved
extensions/s3/src/main/java/io/deephaven/extensions/s3/S3SeekableChannelProviderPlugin.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.
...in/java/io/deephaven/engine/table/impl/locations/local/URIStreamKeyValuePartitionLayout.java
Show resolved
Hide resolved
Util/channel/src/main/java/io/deephaven/util/channel/SeekableChannelsProvider.java
Show resolved
Hide resolved
...in/java/io/deephaven/engine/table/impl/locations/local/URIStreamKeyValuePartitionLayout.java
Outdated
Show resolved
Hide resolved
extensions/s3/src/main/java/io/deephaven/extensions/s3/S3ChannelContext.java
Outdated
Show resolved
Hide resolved
extensions/s3/src/main/java/io/deephaven/extensions/s3/S3SeekableByteChannel.java
Outdated
Show resolved
Hide resolved
extensions/s3/src/main/java/io/deephaven/extensions/s3/S3SeekableChannelProvider.java
Outdated
Show resolved
Hide resolved
extensions/s3/src/main/java/io/deephaven/extensions/s3/S3SeekableChannelProvider.java
Outdated
Show resolved
Hide resolved
extensions/s3/src/main/java/io/deephaven/extensions/s3/S3SeekableChannelProvider.java
Outdated
Show resolved
Hide resolved
extensions/s3/src/main/java/io/deephaven/extensions/s3/S3SeekableChannelProvider.java
Outdated
Show resolved
Hide resolved
...rc/main/java/io/deephaven/engine/table/impl/locations/local/FileKeyValuePartitionLayout.java
Show resolved
Hide resolved
...le/src/main/java/io/deephaven/engine/table/impl/locations/local/KeyValuePartitionLayout.java
Outdated
Show resolved
Hide resolved
extensions/s3/src/main/java/io/deephaven/extensions/s3/S3ChannelContext.java
Outdated
Show resolved
Hide resolved
extensions/s3/src/main/java/io/deephaven/extensions/s3/S3ChannelContext.java
Outdated
Show resolved
Hide resolved
extensions/s3/src/main/java/io/deephaven/extensions/s3/S3ChannelContext.java
Outdated
Show resolved
Hide resolved
extensions/s3/src/main/java/io/deephaven/extensions/s3/S3ChannelContext.java
Outdated
Show resolved
Hide resolved
extensions/s3/src/main/java/io/deephaven/extensions/s3/S3ChannelContext.java
Outdated
Show resolved
Hide resolved
extensions/s3/src/main/java/io/deephaven/extensions/s3/S3ChannelContext.java
Outdated
Show resolved
Hide resolved
extensions/s3/src/main/java/io/deephaven/extensions/s3/S3SeekableChannelProvider.java
Outdated
Show resolved
Hide resolved
extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetTools.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewing last commit before refactor
py/server/tests/test_parquet.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing unit tests for the new functionality
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @chipkent, the new changes that I added since your last review don't add any new functionality in the python code. I just deprecated a number of APIs in the Java code and used the new APIs in the python code. So the existing tests completely cover all the cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reverted the python code in this PR to the state that you last approved and merged this PR.
All the follow up work is now being done in #5358.
extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetInstructions.java
Outdated
Show resolved
Hide resolved
extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetInstructions.java
Outdated
Show resolved
Hide resolved
extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetInstructions.java
Outdated
Show resolved
Hide resolved
extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetTools.java
Outdated
Show resolved
Hide resolved
extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetTools.java
Outdated
Show resolved
Hide resolved
extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetTools.java
Outdated
Show resolved
Hide resolved
extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetTools.java
Outdated
Show resolved
Hide resolved
extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetTools.java
Outdated
Show resolved
Hide resolved
extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetTools.java
Outdated
Show resolved
Hide resolved
extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetTools.java
Outdated
Show resolved
Hide resolved
I have reverted the last two comments and will move the refactoring for ParquetTools to a separate PR. All the pending refactoring related comments will be solved as part of that PR (#5358). For this PR, I have marked the four new methods as deprecated, and these four will be deleted in the next refactoring PR. |
Labels indicate documentation is required. Issues for documentation have been opened: Community: deephaven/deephaven-docs-community#189 |
Closes #5065
Breaking Change: Renamed
KeyValuePartitionLayout
toFileKeyValuePartitionLayout
.Reason:
KeyValuePartitionLayout
is now more generic and has common functionalities which are used by bothFileKeyValuePartitionLayout
and a newURIStreamKeyValuePartitionLayout
. As the name suggestes,FileKeyValuePartitionLayout
is used to process key-value partitioned data accessed through javaFile
objects, whereasURIStreamKeyValuePartitionLayout
is used to processURI
s.