Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: improve merge performance by using predicate non-partition columns min/max for prefiltering #2513

Merged

Conversation

JonasDev1
Copy link
Contributor

Description

This pr improves the merging performance by adding min/max filters to the early filter.
The number of files scanned from the target file table is reduced by using the table statistics.
I have extended the early filter for this purpose. This filter is responsible for pre-filtering the target table.
Previously, the early filter only consisted of partition columns by filtering for all unique values from the source. Now the non-partition columns are also used by aggregating the min/max values from the source and adding a between expression to the early filter.

It is also automatically part of the conflict detection based on the predicate.

I added a property extended_early_filter to make this advanced filtering optional. I don't know if this is important, and maybe we can replace the bool with an enum. What do you think about this?

Example:

Merge into table t with partition date

Predicate: source.date = target.date and source.timestamp = target.timestamp and source.id = target.id and frob > 42

Early filter before: date = '2024-‚05-14' and frob > 42
Early filter now: date = '2024-05-14' and timestamp BETWEEN '…15:00' AND '…15:05' and id BETWEEN 'A' AND 'B' and frob > 42

@github-actions github-actions bot added the binding/rust Issues for the Rust crate label May 14, 2024
@JonasDev1 JonasDev1 changed the title feat: Improve merge performance by using predicate non-partition columns min/max for prefiltering feat: improve merge performance by using predicate non-partition columns min/max for prefiltering May 14, 2024
Copy link

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@JonasDev1
Copy link
Contributor Author

#2411

@ion-elgreco ion-elgreco requested a review from Blajda May 14, 2024 15:03
@ion-elgreco
Copy link
Collaborator

@JonasDev1 why did you make the advanced filtering optional?

If this provides better performance across the board, we should enable it always (so also for python bindings)

@JonasDev1
Copy link
Contributor Author

My concern was if you want to do e.g. merges via columns with null this would not work, but I think that it would not work without the advanced filtering either as is equal for null is not defined in sql.

Spark has an extra null safe operator <=> for this, which is not in Datafusion available yet.

@JonasDev1
Copy link
Contributor Author

What about the review?

I can of course also remove the flag again

@ion-elgreco
Copy link
Collaborator

My concern was if you want to do e.g. merges via columns with null this would not work, but I think that it would not work without the advanced filtering either as is equal for null is not defined in sql.

Spark has an extra null safe operator <=> for this, which is not in Datafusion available yet.

My main issue is that it might work or not work based on the contents of the data. I think that's a bit tricky because a person needs to be aware of the contents of their data they are trying to write.

<=> for this, which is not in Datafusion available yet. @JonasDev1 can you maybe raise an issue about this upstream in datafusion, so we get this first

@thomasfrederikhoeck
Copy link
Contributor

@JonasDev1 do you know if the issue was raised in DataFusion? I would love to see this feature to really enhance large table performance in the cases where partition is not meaningful due to high cardinality so it is very nice that you have initiated it.

@JonasDev1
Copy link
Contributor Author

@ion-elgreco I have tested the merge with null values in merge predicate with extend filtering and without. In both cases the merge doesn't work as expected and will lead to duplicate rows. Therefore this pull request will create no changes and the behaviour is simillar to Spark.

The null-safe operation would be a nice extension, but is not bounded to this pull request

@ion-elgreco
Copy link
Collaborator

@JonasDev1 can you resolve the merge conflicts? Then I'll do another round of reviewing

@JonasDev1 JonasDev1 requested a review from hntd187 as a code owner July 30, 2024 12:27
@JonasDev1
Copy link
Contributor Author

@ion-elgreco Done

Copy link
Collaborator

@ion-elgreco ion-elgreco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thankss, and sorry for the delay! @JonasDev1

@ion-elgreco ion-elgreco added this pull request to the merge queue Aug 1, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to a conflict with the base branch Aug 1, 2024
@ion-elgreco
Copy link
Collaborator

@rtyler can't a branch be merged if it can't be rebased anymore?

@JonasDev1 are you perhaps able to manually rebase and force push it?

@ion-elgreco ion-elgreco added this pull request to the merge queue Aug 1, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to a conflict with the base branch Aug 1, 2024
@ion-elgreco ion-elgreco added this to the python v0.19 milestone Aug 3, 2024
@ion-elgreco ion-elgreco force-pushed the merge-non-partition-col-filtering branch from 32743e4 to c2cf650 Compare August 3, 2024 09:54
@ion-elgreco ion-elgreco added this pull request to the merge queue Aug 3, 2024
Merged via the queue into delta-io:main with commit e92ec86 Aug 3, 2024
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants