[SPARK] Optimize: SELECT COUNT(*) FROM Table WHERE partition=1 #3345

Open
7mming7 wants to merge 2 commits into master


7mming7 commented Jul 9, 2024

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Running the query "SELECT COUNT(*), MIN(X), MAX(X) FROM table WHERE partition_column = 1" takes a long time on large tables. Spark scans all the Parquet files just to return the row count and the min/max values, even though that information is already available in the Delta log.
Resolves #1916
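
To illustrate the idea outside Spark, here is a minimal Python sketch. It is not the code in this PR. It assumes a simplified list of `add` actions whose field names (`partitionValues`, `stats`, `numRecords`, `minValues`, `maxValues`) follow the Delta protocol; the file paths, the values, and the helper `aggregate_from_log` are hypothetical. It also ignores cases such as deletion vectors, where file-level stats alone would not be enough.

```python
import json

# Hypothetical, simplified "add" actions as they might appear in a Delta log.
# Paths and numbers are made up for illustration only.
add_actions = [
    {"path": "partition_column=1/a.parquet",
     "partitionValues": {"partition_column": "1"},
     "stats": json.dumps({"numRecords": 100,
                          "minValues": {"X": 3}, "maxValues": {"X": 40}})},
    {"path": "partition_column=1/b.parquet",
     "partitionValues": {"partition_column": "1"},
     "stats": json.dumps({"numRecords": 50,
                          "minValues": {"X": -2}, "maxValues": {"X": 17}})},
    {"path": "partition_column=2/c.parquet",
     "partitionValues": {"partition_column": "2"},
     "stats": json.dumps({"numRecords": 70,
                          "minValues": {"X": 0}, "maxValues": {"X": 9}})},
]


def aggregate_from_log(actions, partition_column, partition_value, column):
    """Return (count, min, max) using file-level stats only.

    Returns None when any matching file lacks the needed stats, which
    signals that the caller has to fall back to scanning the data files.
    """
    count, lo, hi = 0, None, None
    for action in actions:
        # Partition values are stored as strings in the log.
        if action["partitionValues"].get(partition_column) != str(partition_value):
            continue  # partition pruning: skip files in other partitions
        if not action.get("stats"):
            return None
        stats = json.loads(action["stats"])
        file_min = stats.get("minValues", {}).get(column)
        file_max = stats.get("maxValues", {}).get(column)
        if "numRecords" not in stats or file_min is None or file_max is None:
            return None
        count += stats["numRecords"]
        lo = file_min if lo is None else min(lo, file_min)
        hi = file_max if hi is None else max(hi, file_max)
    return count, lo, hi


print(aggregate_from_log(add_actions, "partition_column", 1, "X"))
```

Only the two files in partition 1 contribute, so the result is derived from three small JSON records without opening any Parquet file.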

How was this patch tested?

Created unit tests to validate that the optimization works.

Does this PR introduce any user-facing changes?

No, only a performance improvement.


7mming7 commented Jul 10, 2024

cc @felipepessoto Can you help with review or guidance?

felipepessoto (Contributor) commented

@7mming7 I'm travelling on vacation, so it will take some time for me to review it. In any case, you will need approval from one of the maintainers.


7mming7 commented Jul 22, 2024

@felipepessoto I see. Have a nice vacation.


Successfully merging this pull request may close these issues.

[Feature Request] optimize COUNT(*) on partitioned tables