Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Spark] Batch resolve DeltaMergeAction in ResolveMergeInto #3366

Merged
merged 10 commits into from
Jul 19, 2024

Conversation

c27kwan
Copy link
Contributor

@c27kwan c27kwan commented Jul 11, 2024

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

In this PR, we batch resolve DeltaMergeActions in ResolveMergeInto to avoid redundant invocations of the analyzer.

For example, if we have the clause

WHEN NOT MATCHED THEN INSERT (c1, c2) VALUES (src.c1, 22) 

Previously, we would call resolveSingleExprOrFail which invokes the analyzer 4 times: 1 call to resolve c1, 1 call to resolve src.c1, 1 call to resolve c2, and 1 call to resolve 22.
With this PR, we batch the resolution of the target column name parts ([c1, c2]) against the target relation and the resolution of assignment expression ([src.c1, 22]) together. We can resolve a Merge clause with 2 calls using batching. This helps us with the analyzer performance on wide tables.

How was this patch tested?

Existing tests pass.

Does this PR introduce any user-facing changes?

No.

@c27kwan c27kwan changed the title [Spark] Add memoization to ResolveMergeInto [Spark] Batch resolve DeltaMergeAction in ResolveMergeInto Jul 16, 2024
@@ -502,7 +502,7 @@ trait MergeIntoSchemaEvolutionBaseTests {
testEvolution(s"case-sensitive insert")(
targetData = Seq((0, 0), (1, 10), (3, 30)).toDF("key", "value"),
sourceData = Seq((1, 1), (2, 2)).toDF("key", "VALUE"),
clauses = insert("(key, value, VALUE) VALUES (s.key, s.value, s.VALUE)") :: Nil,
clauses = insert("(key, value) VALUES (s.key, s.value)") :: Nil,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were two different possible errors here. It can throw a different message dependent on whether we batch. I removed the ambiguity so that the error is always the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a TODO for the other test, since it's an unrelated new test and we didn't have coverage for it before.

@c27kwan
Copy link
Contributor Author

c27kwan commented Jul 16, 2024

I got rid of the memoization because batching will make memoization at that level obsolete. The opportunity for memoization within merge at that level is probably not very useful. We don't resolve the same expression multiple times against the same plans, unless we have many different conditional when matched/when not matched clauses.

Copy link
Contributor

@larsk-db larsk-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@allisonport-db allisonport-db merged commit 2bfc2f2 into delta-io:master Jul 19, 2024
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants