Optimize creating indexes when writing partitioned table #5292

Open
malhotrashivam opened this issue Mar 25, 2024 · 0 comments
Labels: 2023_unscheduled, feature request, NoDocumentationNeeded, NoReleaseNotesNeeded, parquet

malhotrashivam commented Mar 25, 2024

In the current code, when writing a source table in a key-value partitioned fashion, if the user provides indexing columns, we compute the constituent tables to write and then add an index on each of the provided columns for each constituent table.

This can be optimized by reusing any existing indexes on the original table instead. We can perform a transform that intersects the index tables' row sets with each constituent's row set, then filters out the rows whose resulting row sets are empty.

Pseudocode for intersecting (courtesy of @rcaudy), probably with bad parentheses, etc:

partitionedSource = sourceTable.partitionBy(partitioningColumns)
for dataIndex in sourceTable indexes:
    indexTable = dataIndex.table()
    rowSetColumn = dataIndex.rowSetColumnName()
    partitionedIndexTable = partitionedSource.transform(c ->
        // Do this with `FunctionalColumn`s to avoid compiles
        indexTable
            // Intersect each index row set with the constituent's row set
            .update(List.of(new FunctionalColumn(
                rowSetColumn, RowSet.class,
                rowSetColumn, RowSet.class,
                dataIndexRowSet -> dataIndexRowSet.intersect(c.getRowSet()))))
            // Flag the non-empty row sets, then keep only those rows
            .updateView(List.of(new FunctionalColumn(
                rowSetColumn, RowSet.class,
                "__NON_EMPTY__", Boolean.class,
                RowSet::isNonempty)))
            .where(Filter.isTrue(ColumnName.of("__NON_EMPTY__"))))
    // We *could* add `StandaloneDataIndex`es to the constituents, or just write from the partitioned index table
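
To make the intersect-then-filter step concrete, here is a minimal plain-Java sketch that does not use the engine API at all. It assumes a row set can be modeled as a `TreeSet<Long>` of row positions and an index as a map from key to row set; the class `IndexIntersectionSketch`, the method `restrictIndex`, and the sample `Sym` data are all hypothetical names made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Illustration only: models a row set as a sorted set of row positions. */
final class IndexIntersectionSketch {

    /**
     * Restricts a source-table index (key -> row positions) to one constituent:
     * intersect every index row set with the constituent's row set, then drop
     * the keys whose intersection is empty.
     */
    static <K> Map<K, TreeSet<Long>> restrictIndex(
            Map<K, TreeSet<Long>> sourceIndex, TreeSet<Long> constituentRows) {
        Map<K, TreeSet<Long>> result = new LinkedHashMap<>();
        for (Map.Entry<K, TreeSet<Long>> entry : sourceIndex.entrySet()) {
            // Copy first so the source index is left untouched.
            TreeSet<Long> intersection = new TreeSet<>(entry.getValue());
            intersection.retainAll(constituentRows);
            if (!intersection.isEmpty()) {
                result.put(entry.getKey(), intersection);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical index on a "Sym" column of a six-row source table.
        Map<String, TreeSet<Long>> symIndex = new LinkedHashMap<>();
        symIndex.put("AAPL", new TreeSet<>(List.of(0L, 3L)));
        symIndex.put("MSFT", new TreeSet<>(List.of(1L, 4L)));
        symIndex.put("GOOG", new TreeSet<>(List.of(2L, 5L)));

        // Rows of the source table that belong to one constituent.
        TreeSet<Long> constituentRows = new TreeSet<>(List.of(0L, 1L, 3L));

        // Prints {AAPL=[0, 3], MSFT=[1]}; GOOG is dropped because its
        // intersection with the constituent is empty.
        System.out.println(restrictIndex(symIndex, constituentRows));
    }
}
```

The point of the sketch is that each constituent's index comes from set intersections over the already-grouped source index, rather than from regrouping the constituent's data by the indexed column.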

(Found during #5105)
