forked from delta-io/delta
[Kernel] Parquet writer `TableClient` APIs and default implementation (delta-io#2626)

Add the following API to `ParquetHandler` to support writing Parquet files.

```
/**
 * Write the given data batches to Parquet files. Try to keep the Parquet file size to the given
 * size. If the current file exceeds this size, close the current file and start writing to a new
 * file.
 * <p>
 *
 * @param directoryPath Path to the directory where the Parquet files should be written.
 * @param dataIter      Iterator of data batches to write.
 * @param maxFileSize   Target maximum size of the created Parquet file in bytes.
 * @param statsColumns  List of columns to collect statistics for. The statistics collection is
 *                      optional. If the implementation does not support statistics collection,
 *                      it is ok to return no statistics.
 * @return an iterator of {@link DataFileStatus} containing the status of the written files.
 * Each status contains the file path and the optionally collected statistics for the file.
 * It is the responsibility of the caller to close the iterator.
 *
 * @throws IOException if an I/O error occurs during the file writing. This may leave some files
 *                     already written in the directory. It is the responsibility of the caller
 *                     to clean up.
 * @since 3.2.0
 */
CloseableIterator<DataFileStatus> writeParquetFiles(
    String directoryPath,
    CloseableIterator<FilteredColumnarBatch> dataIter,
    long maxFileSize,
    List<Column> statsColumns) throws IOException;
```

The default implementation of the above interface uses the `parquet-mr` library.

## How was this patch tested?

Added support for all Delta types except `timestamp_ntz`. Tested writing different data types with variations of nested levels, null/non-null values and target file size.

## Followup work

* Support 2-level structures for array and map type data writing
* Support INT64 format timestamp writing
* Decimal legacy format (always binary) support
* Uniform support to add field id for intermediate elements in `MAP`, `LIST` data types.
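The file-rolling behaviour described in the Javadoc ("if the current file exceeds this size, close the current file and start writing to a new file") can be illustrated with a small, self-contained sketch. This is not the Kernel implementation: `RollingWriterSketch` and `planFiles` are hypothetical names, batches are reduced to their byte sizes, and the sketch assumes the size check happens once after each batch is appended, so a file may overshoot the target by up to one batch.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch of size-based file rolling; not the Kernel implementation. */
class RollingWriterSketch {

    /**
     * Groups batch sizes (in bytes) into files. The current file is closed as soon as
     * its accumulated size exceeds maxFileSize (assumption: checked after each batch).
     *
     * @return the size of each file that would be written, in write order
     */
    static List<Long> planFiles(Iterator<Long> batchSizes, long maxFileSize) {
        List<Long> fileSizes = new ArrayList<>();
        long currentSize = 0;
        boolean fileOpen = false;
        while (batchSizes.hasNext()) {
            currentSize += batchSizes.next();
            fileOpen = true;
            if (currentSize > maxFileSize) {
                // Target exceeded: close the current file and start a new one.
                fileSizes.add(currentSize);
                currentSize = 0;
                fileOpen = false;
            }
        }
        if (fileOpen) {
            // Flush the last, partially filled file.
            fileSizes.add(currentSize);
        }
        return fileSizes;
    }

    public static void main(String[] args) {
        // Five 40-byte batches with a 100-byte target.
        List<Long> sizes = planFiles(List.of(40L, 40L, 40L, 40L, 40L).iterator(), 100L);
        System.out.println(sizes); // prints [120, 80]
    }
}
```

Because `maxFileSize` is documented as a target rather than a hard limit, a sketch like this treats overshoot as acceptable; a real writer would also have to account for Parquet's buffered row groups when estimating the current file size.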
1 parent
665aa7d
commit 4ecfa45
Showing 14 changed files with 1,997 additions and 149 deletions.
94 changes: 94 additions & 0 deletions
kernel/kernel-api/src/main/java/io/delta/kernel/utils/DataFileStatistics.java
@@ -0,0 +1,94 @@
/*
 * Copyright (2023) The Delta Lake Project Authors.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package io.delta.kernel.utils;

import java.util.Collections;
import java.util.Map;

import io.delta.kernel.expressions.Column;
import io.delta.kernel.expressions.Literal;

/**
 * Statistics about a data file in a Delta Lake table.
 */
public class DataFileStatistics {
    private final long numRecords;
    private final Map<Column, Literal> minValues;
    private final Map<Column, Literal> maxValues;
    private final Map<Column, Long> nullCounts;

    /**
     * Create a new instance of {@link DataFileStatistics}.
     *
     * @param numRecords Number of records in the data file.
     * @param minValues  Map of column to its minimum value in the data file. If the data file has
     *                   all nulls for the column, the value will be null or not present in the
     *                   map.
     * @param maxValues  Map of column to its maximum value in the data file. If the data file has
     *                   all nulls for the column, the value will be null or not present in the
     *                   map.
     * @param nullCounts Map of column to number of nulls in the data file.
     */
    public DataFileStatistics(
            long numRecords,
            Map<Column, Literal> minValues,
            Map<Column, Literal> maxValues,
            Map<Column, Long> nullCounts) {
        this.numRecords = numRecords;
        this.minValues = Collections.unmodifiableMap(minValues);
        this.maxValues = Collections.unmodifiableMap(maxValues);
        this.nullCounts = Collections.unmodifiableMap(nullCounts);
    }

    /**
     * Get the number of records in the data file.
     *
     * @return Number of records in the data file.
     */
    public long getNumRecords() {
        return numRecords;
    }

    /**
     * Get the minimum values of the columns in the data file. The map may contain statistics for
     * only a subset of columns in the data file.
     *
     * @return Map of column to its minimum value in the data file.
     */
    public Map<Column, Literal> getMinValues() {
        return minValues;
    }

    /**
     * Get the maximum values of the columns in the data file. The map may contain statistics for
     * only a subset of columns in the data file.
     *
     * @return Map of column to its maximum value in the data file.
     */
    public Map<Column, Literal> getMaxValues() {
        return maxValues;
    }

    /**
     * Get the number of nulls of columns in the data file. The map may contain statistics for only
     * a subset of columns in the data file.
     *
     * @return Map of column to number of nulls in the data file.
     */
    public Map<Column, Long> getNullCounts() {
        return nullCounts;
    }
}
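To make the contract of `DataFileStatistics` concrete, here is a minimal sketch of how the four fields might be computed for a single column. It is an illustration under stated assumptions, not the statistics collector from this commit: `ColumnStatsSketch`, `Stats` and `collect` are hypothetical names, columns are keyed by plain `String` names instead of `Column`, and values are `Long` instead of `Literal` so the example runs without the Kernel on the classpath. An all-null column is simply left out of the min/max maps, which is one of the two behaviours the constructor Javadoc allows.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in that mirrors the shape of DataFileStatistics. */
class ColumnStatsSketch {

    static final class Stats {
        final long numRecords;
        final Map<String, Long> minValues;
        final Map<String, Long> maxValues;
        final Map<String, Long> nullCounts;

        Stats(long numRecords, Map<String, Long> minValues,
              Map<String, Long> maxValues, Map<String, Long> nullCounts) {
            this.numRecords = numRecords;
            // Same defensive wrapping as the DataFileStatistics constructor.
            this.minValues = Collections.unmodifiableMap(minValues);
            this.maxValues = Collections.unmodifiableMap(maxValues);
            this.nullCounts = Collections.unmodifiableMap(nullCounts);
        }
    }

    /** Computes record count, min, max and null count for one column of nullable values. */
    static Stats collect(String column, List<Long> values) {
        long nulls = 0;
        Long min = null;
        Long max = null;
        for (Long v : values) {
            if (v == null) {
                nulls++;
                continue;
            }
            if (min == null || v < min) {
                min = v;
            }
            if (max == null || v > max) {
                max = v;
            }
        }
        Map<String, Long> mins = new HashMap<>();
        Map<String, Long> maxs = new HashMap<>();
        if (min != null) {
            // Only record min/max when the column has at least one non-null value.
            mins.put(column, min);
            maxs.put(column, max);
        }
        Map<String, Long> nullCounts = new HashMap<>();
        nullCounts.put(column, nulls);
        return new Stats(values.size(), mins, maxs, nullCounts);
    }

    public static void main(String[] args) {
        Stats s = collect("id", Arrays.asList(3L, null, 7L, 1L));
        System.out.println(s.numRecords + " records, min=" + s.minValues.get("id")
            + ", max=" + s.maxValues.get("id") + ", nulls=" + s.nullCounts.get("id"));
    }
}
```

Readers of such statistics should therefore treat a missing map entry as "unknown" rather than as a zero or an empty range.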
54 changes: 54 additions & 0 deletions
kernel/kernel-api/src/main/java/io/delta/kernel/utils/DataFileStatus.java
@@ -0,0 +1,54 @@
/*
 * Copyright (2023) The Delta Lake Project Authors.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package io.delta.kernel.utils;

import java.util.Optional;

/**
 * Extends {@link FileStatus} to include additional details such as column level statistics
 * of the data file in the Delta Lake table.
 */
public class DataFileStatus extends FileStatus {

    private final Optional<DataFileStatistics> statistics;

    /**
     * Create a new instance of {@link DataFileStatus}.
     *
     * @param path             Fully qualified file path.
     * @param size             File size in bytes.
     * @param modificationTime Last modification time of the file in epoch milliseconds.
     * @param statistics       Optional column and file level statistics in the data file.
     */
    public DataFileStatus(
            String path,
            long size,
            long modificationTime,
            Optional<DataFileStatistics> statistics) {
        super(path, size, modificationTime);
        this.statistics = statistics;
    }

    /**
     * Get the statistics of the data file encapsulated in this object.
     *
     * @return Statistics of the file.
     */
    public Optional<DataFileStatistics> getStatistics() {
        return statistics;
    }
}
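Since `getStatistics()` returns an `Optional`, callers of `writeParquetFiles` have to handle files written without statistics. The sketch below shows one way to do that. It is only illustrative: `FileStatusSketch`, `FileEntry` and `describe` are hypothetical stand-ins, and the statistics are reduced to an optional record count so the example compiles without the Kernel classes.

```java
import java.util.Optional;

/** Hypothetical stand-in for consuming a DataFileStatus-like result. */
class FileStatusSketch {

    /** Minimal stand-in: path, size and an optional record count. */
    static final class FileEntry {
        final String path;
        final long size;
        final Optional<Long> numRecords;

        FileEntry(String path, long size, Optional<Long> numRecords) {
            this.path = path;
            this.size = size;
            this.numRecords = numRecords;
        }
    }

    /** Formats a one-line summary, falling back gracefully when statistics are absent. */
    static String describe(FileEntry f) {
        String base = f.path + ": " + f.size + " bytes";
        return f.numRecords
            .map(n -> base + ", " + n + " records")
            .orElse(base + ", no statistics");
    }

    public static void main(String[] args) {
        System.out.println(describe(new FileEntry("part-0.parquet", 1024L, Optional.of(10L))));
        System.out.println(describe(new FileEntry("part-1.parquet", 2048L, Optional.empty())));
    }
}
```

Using `map(...).orElse(...)` keeps the "no statistics" case explicit at the call site, which matches the Javadoc's statement that implementations may return no statistics.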