Optimize creating indexes when writing partitioned table #5292

malhotrashivam · 2024-03-25T21:54:54Z

In the current code, when writing a source table in a key-value partitioned fashion, if the user provides indexing columns, we compute the constituent tables to write and then add an index on each of the provided columns for each constituent table.

This can be optimized by reusing any existing indexes on the original table instead of building new ones for every constituent. We can perform a transform that intersects each index table's row sets with the constituent's row set, then filters out the empty row sets.

Pseudocode for intersecting (courtesy of @rcaudy), probably with bad parentheses, etc:

partitionedSource = sourceTable.partitionBy(partitioningColumns)
for dataIndex in sourceTable indexes:
    indexTable = dataIndex.table()
    partitionedIndexTable = partitionedSource.transform(c ->
        // Do this with `FunctionalColumns` to avoid compiles
        indexTable
            .update(List.of(new FunctionalColumn(dataIndex.rowSetColumnName(), RowSet.class, dataIndex.rowSetColumnName(), RowSet.class, dataIndexRowSet -> dataIndexRowSet.intersect(c.getRowSet()))))
            .updateView(List.of(new FunctionalColumn(dataIndex.rowSetColumnName(), RowSet.class, "__NON_EMPTY__", Boolean.class, RowSet::isNonempty)))
            .where(Filter.isTrue(ColumnName.of("__NON_EMPTY__"))))
    // We *could* add `StandaloneDataIndex`es to the constituents, or just write from the partitioned index table
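
To make the intersect-then-filter step concrete, here is a minimal, self-contained sketch in plain Java. It does not use the Deephaven API: `SortedSet<Long>` stands in for `RowSet`, a `Map` from key to row set stands in for the index table, and the class and method names (`IndexIntersectionSketch`, `restrictIndex`) are hypothetical. It also assumes the constituent's row keys live in the same key space as the source table's index.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

public class IndexIntersectionSketch {

    /**
     * Restricts a source-level index (key -> row keys) to the rows of one constituent.
     * Keys whose intersection with the constituent is empty are dropped.
     * Stand-in for the update / updateView / where chain in the pseudocode.
     */
    public static Map<String, SortedSet<Long>> restrictIndex(
            Map<String, SortedSet<Long>> sourceIndex,
            SortedSet<Long> constituentRows) {
        Map<String, SortedSet<Long>> result = new LinkedHashMap<>();
        for (Map.Entry<String, SortedSet<Long>> entry : sourceIndex.entrySet()) {
            // Copy first so the source index is left untouched, then intersect.
            SortedSet<Long> intersection = new TreeSet<>(entry.getValue());
            intersection.retainAll(constituentRows);
            // Equivalent of filtering on the "__NON_EMPTY__" column.
            if (!intersection.isEmpty()) {
                result.put(entry.getKey(), intersection);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical index on the source table: key -> row keys holding that key.
        Map<String, SortedSet<Long>> sourceIndex = new LinkedHashMap<>();
        sourceIndex.put("AAPL", new TreeSet<>(List.of(0L, 2L, 5L)));
        sourceIndex.put("MSFT", new TreeSet<>(List.of(1L, 3L)));
        sourceIndex.put("GOOG", new TreeSet<>(List.of(4L)));

        // Rows that ended up in one constituent after partitioning.
        SortedSet<Long> constituentRows = new TreeSet<>(List.of(0L, 1L, 2L));

        System.out.println(restrictIndex(sourceIndex, constituentRows));
    }
}
```

With the sample data in `main`, the constituent covering rows 0 through 2 keeps `AAPL` with rows 0 and 2 and `MSFT` with row 1, while `GOOG` is dropped because its intersection is empty. The point of the design is that no grouping pass over the constituent's data is needed; only set intersections against the already-computed index.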

(Found during #5105)


malhotrashivam added the feature request, triage, parquet, NoDocumentationNeeded, and NoReleaseNotesNeeded labels Mar 25, 2024

malhotrashivam added this to the 5. Backlog milestone Mar 25, 2024

malhotrashivam assigned lbooker42 and malhotrashivam Mar 25, 2024

rcaudy removed the triage label Mar 25, 2024

rcaudy modified the milestones: 5. Backlog, 4. Unscheduled Mar 25, 2024

malhotrashivam mentioned this issue Mar 25, 2024

Added support to write metadata files in parquet #5105

Merged

pete-petey added the 2023_unscheduled label Aug 26, 2024

pete-petey modified the milestones: 4. Unscheduled, 5. Backlog Aug 26, 2024
