Added support to read partitioned parquet files from S3 #5206

malhotrashivam · 2024-02-28T23:53:57Z

Breaking Change: Renamed KeyValuePartitionLayout to FileKeyValuePartitionLayout.
Reason: KeyValuePartitionLayout is now more generic and holds the functionality common to both FileKeyValuePartitionLayout and the new URIStreamKeyValuePartitionLayout. As the names suggest, FileKeyValuePartitionLayout processes key-value partitioned data accessed through Java File objects, whereas URIStreamKeyValuePartitionLayout processes data identified by URIs.
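
To make the idea of a URI-based key-value layout concrete, here is a minimal sketch; it is not the code from this PR. It assumes a Hive-style directory scheme in which every directory between the table root and a data file is named `key=value`. The class `PartitionPathSketch`, the method `partitionValues`, and the `s3://bucket/table/` URIs are hypothetical names chosen for illustration only.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this is NOT the Deephaven implementation.
final class PartitionPathSketch {
    private PartitionPathSketch() {}

    /**
     * Extracts the key=value directory segments found between rootUri and fileUri,
     * preserving the order in which they appear in the path.
     */
    static Map<String, String> partitionValues(URI rootUri, URI fileUri) {
        // URI.relativize returns fileUri unchanged when rootUri is not a prefix of it
        URI relative = rootUri.relativize(fileUri);
        if (relative.isAbsolute()) {
            throw new IllegalArgumentException(fileUri + " is not located under " + rootUri);
        }
        String[] segments = relative.getPath().split("/");
        Map<String, String> result = new LinkedHashMap<>();
        // The last segment is the file name, so only the directories before it are parsed
        for (int i = 0; i < segments.length - 1; i++) {
            String segment = segments[i];
            int eq = segment.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected a key=value directory, found: " + segment);
            }
            result.put(segment.substring(0, eq), segment.substring(eq + 1));
        }
        return result;
    }
}
```

Under these assumptions, calling `partitionValues` with the root `s3://bucket/table/` and the file `s3://bucket/table/year=2024/month=03/data.parquet` yields the entries `year` → `2024` and `month` → `03`, in that order.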

chipkent

Python LGTM

chipkent

Python LGTM

malhotrashivam · 2024-03-22T22:19:51Z

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetTools.java

@@ -849,7 +849,11 @@ private static Pair<TableDefinition, ParquetInstructions> infer(
            allColumns.add(ColumnDefinition.fromGenericType(partitionKey, dataType, null,
                    ColumnDefinition.ColumnType.Partitioning));
        }
-        allColumns.addAll(schemaInfo.getFirst());
+        // Only read non-partitioning columns from the parquet files


This was needed because the parquet files written by Iceberg include the partitioning columns' data in the files themselves, which was leading to duplicate columns.

devinrsmith · Mar 22, 2024

I want to make sure we aren't breaking any existing behavior w/ this change...

malhotrashivam · Mar 27, 2024

The tests are passing and I am able to read all the files in deephaven-examples.
I will wait for Ryan to have a final comment.

rcaudy · Apr 8, 2024

This raises a question for partitioned writing: should we include partitioning columns redundantly in the parquet files we write?
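
The duplicate-column problem discussed in this thread can be sketched as follows. This is a simplified illustration, not the change made in `ParquetTools.infer`: it works on plain column names rather than `ColumnDefinition` objects, and the class `ColumnMergeSketch` and method `mergeColumns` are hypothetical. The partitioning columns inferred from the directory layout come first, and any column from the file schema that repeats a partitioning column name is skipped.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; the real code operates on ColumnDefinition objects.
final class ColumnMergeSketch {
    private ColumnMergeSketch() {}

    /**
     * Returns the partitioning columns followed by the file-schema columns,
     * omitting any file column whose name matches a partitioning column.
     */
    static List<String> mergeColumns(List<String> partitionKeys, List<String> fileColumns) {
        Set<String> partitioning = new HashSet<>(partitionKeys);
        List<String> allColumns = new ArrayList<>(partitionKeys);
        for (String column : fileColumns) {
            // Only take non-partitioning columns from the parquet file schema
            if (!partitioning.contains(column)) {
                allColumns.add(column);
            }
        }
        return allColumns;
    }
}
```

For example, with partition key `year` and file columns `year`, `price`, `qty`, the merged list is `year`, `price`, `qty` rather than containing `year` twice.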

.../table/src/main/java/io/deephaven/parquet/table/layout/ParquetKeyValuePartitionedLayout.java

...in/java/io/deephaven/engine/table/impl/locations/local/URIStreamKeyValuePartitionLayout.java

rcaudy

Still need to review s3-related files.

Util/channel/src/main/java/io/deephaven/util/channel/SeekableChannelsProvider.java

...rc/main/java/io/deephaven/engine/table/impl/locations/local/FileKeyValuePartitionLayout.java

...in/java/io/deephaven/engine/table/impl/locations/local/URIStreamKeyValuePartitionLayout.java

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetTools.java

...quet/table/src/main/java/io/deephaven/parquet/table/layout/ParquetFlatPartitionedLayout.java

.../table/src/main/java/io/deephaven/parquet/table/layout/ParquetKeyValuePartitionedLayout.java

extensions/s3/src/main/java/io/deephaven/extensions/s3/S3SeekableChannelProviderPlugin.java

rcaudy

.

Base/src/main/java/io/deephaven/base/FileUtils.java

...in/java/io/deephaven/engine/table/impl/locations/local/URIStreamKeyValuePartitionLayout.java

Util/channel/src/main/java/io/deephaven/util/channel/SeekableChannelsProvider.java

...in/java/io/deephaven/engine/table/impl/locations/local/URIStreamKeyValuePartitionLayout.java

extensions/s3/src/main/java/io/deephaven/extensions/s3/S3ChannelContext.java

extensions/s3/src/main/java/io/deephaven/extensions/s3/S3SeekableByteChannel.java

extensions/s3/src/main/java/io/deephaven/extensions/s3/S3SeekableChannelProvider.java

...rc/main/java/io/deephaven/engine/table/impl/locations/local/FileKeyValuePartitionLayout.java

...le/src/main/java/io/deephaven/engine/table/impl/locations/local/KeyValuePartitionLayout.java

extensions/s3/src/main/java/io/deephaven/extensions/s3/S3ChannelContext.java

extensions/s3/src/main/java/io/deephaven/extensions/s3/S3SeekableChannelProvider.java

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetTools.java

rcaudy

Reviewing last commit before refactor

chipkent · 2024-04-12T15:52:40Z

py/server/tests/test_parquet.py

Missing unit tests for the new functionality

malhotrashivam · Apr 12, 2024

Hi @chipkent, the changes I added since your last review don't add any new functionality to the Python code. I only deprecated a number of APIs in the Java code and switched the Python code to the new APIs, so the existing tests already cover all the cases.

malhotrashivam · Apr 13, 2024

I reverted the Python code in this PR to the state that you last approved and merged this PR.
All the follow-up work is now being done in #5358.

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetInstructions.java

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetTools.java

malhotrashivam · 2024-04-12T21:46:40Z

I have reverted the last two commits and will move the ParquetTools refactoring to a separate PR. All pending refactoring-related comments will be addressed as part of that PR (#5358).

For this PR, I have marked the four new methods as deprecated, and these four will be deleted in the next refactoring PR.

deephaven-internal · 2024-04-12T23:32:45Z

Labels indicate documentation is required. Issues for documentation have been opened:

Community: deephaven/deephaven-docs-community#189

Initial commit

7734255

malhotrashivam added the parquet, DocumentationNeeded, ReleaseNotesNeeded, and s3 labels Feb 28, 2024

malhotrashivam added this to the 2. February 2024 (end of month) milestone Feb 28, 2024

malhotrashivam self-assigned this Feb 28, 2024

malhotrashivam changed the title ~~[WIP] Added support to read flat partitioned parquet files from S3~~ [WIP] Added support to read flat partitioned directory of parquet files from S3 Feb 28, 2024

malhotrashivam requested review from chipkent, jmao-denver and rcaudy as code owners March 1, 2024 23:48

Working state, can be optimized further

c16ac2f

malhotrashivam force-pushed the sm-pq-s3-flat branch from 71f917e to c16ac2f Compare March 1, 2024 23:49

Changed interface of flat partitioned reader

e8c4cda

pete-petey modified the milestones: 2. February 2024 (end of month), 1. March 2024 Mar 11, 2024

chipkent reviewed Mar 11, 2024

Added key value partitioned parquet reader

d15dc6a

chipkent reviewed Mar 12, 2024

malhotrashivam added 3 commits March 13, 2024 15:05

Fixed some issues with partitioned reading

7df60e2

Fixed reading partitioned data with partitioned column in data

d232f39

Merge branch 'main' into sm-pq-s3-flat

f9df8f1

malhotrashivam force-pushed the sm-pq-s3-flat branch from 7e913f2 to 2512562 Compare March 18, 2024 22:30

Minor improvements after rebasing

3dd9fb9

malhotrashivam force-pushed the sm-pq-s3-flat branch from 2512562 to 3dd9fb9 Compare March 18, 2024 23:59

malhotrashivam changed the title ~~[WIP] Added support to read flat partitioned directory of parquet files from S3~~ [WIP] Added support to read partitioned parquet files from S3 Mar 19, 2024

malhotrashivam added 2 commits March 19, 2024 10:41

WIP commit

cde1feb

Separated URI list processing into a separate class

2838495

malhotrashivam commented Mar 22, 2024

malhotrashivam requested a review from devinrsmith March 22, 2024 22:22

Review with Ryan Part 1

b6876cd

malhotrashivam dismissed stale reviews from chipkent and devinrsmith via b6876cd April 8, 2024 20:02

malhotrashivam commented Apr 8, 2024

...in/java/io/deephaven/engine/table/impl/locations/local/URIStreamKeyValuePartitionLayout.java

...in/java/io/deephaven/engine/table/impl/locations/local/URIStreamKeyValuePartitionLayout.java

Review with Ryan part 2

11d0b68

rcaudy reviewed Apr 8, 2024

malhotrashivam added 3 commits April 9, 2024 10:37

Review with Ryan part 3

9ec7e3e

Review contd.

05a294d

Merge branch 'main' into sm-pq-s3-flat

b9a1b87

rcaudy reviewed Apr 9, 2024

More review comments resolved

b0b3022

malhotrashivam commented Apr 9, 2024

...rc/main/java/io/deephaven/engine/table/impl/locations/local/FileKeyValuePartitionLayout.java

rcaudy reviewed Apr 10, 2024

malhotrashivam added 2 commits April 10, 2024 18:03

Resolved more comments

1383850

Deprecated all File overloads from ParquetTools

8c68079

malhotrashivam commented Apr 10, 2024

extensions/parquet/table/src/main/java/io/deephaven/parquet/table/ParquetTools.java

Resolved some javadoc issues

78dd6ac

rcaudy reviewed Apr 11, 2024

chipkent reviewed Apr 12, 2024

malhotrashivam added the breaking label Apr 12, 2024

rcaudy reviewed Apr 12, 2024

malhotrashivam added 3 commits April 12, 2024 16:36

Revert "Resolved some javadoc issues"

6fa3629

This reverts commit 78dd6ac.

Revert "Deprecated all File overloads from ParquetTools"

4ab6a31

This reverts commit 8c68079.

Tagged the new methods as Deprecated

79bd133

rcaudy approved these changes Apr 12, 2024

malhotrashivam merged commit bf6fcdb into deephaven:main Apr 12, 2024
15 checks passed

github-actions bot locked and limited conversation to collaborators Apr 12, 2024
