
[Feature Request] Dynamic spark.databricks.delta.snapshotPartitions based on size of snapshot #3351

santosh-d3vpl3x opened this issue Jul 9, 2024 · 3 comments
Labels: enhancement (New feature or request)

Comments


santosh-d3vpl3x commented Jul 9, 2024

Feature request

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Overview

Currently, the spark.databricks.delta.snapshotPartitions value is static. The idea is to make it dependent on the size of the snapshot, so that partitions are well sized.

Motivation

Delta computes the snapshot to determine which Parquet files to read, and caches that snapshot to keep planning and execution performant. The number of cached partitions is controlled by spark.databricks.delta.snapshotPartitions. For large tables the default value of 50 may be reasonable, but for small tables it usually yields partitions of only a few bytes each. This interacts badly with dynamic allocation: killing an executor that holds cached partitions is discouraged, and by default Spark sets the idle timeout for such executors to infinity. As a result, an executor often stays alive just because it holds a few bytes of Delta cache. The converse also holds: for a very large snapshot the value may be too small, causing the job to fail. This value should be abstracted away from users.
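
For concreteness, the knob in question is a fixed integer today; a minimal illustration (the default of 50 is per the description above):

```scala
// Static today: the same repartition count is applied to the cached snapshot
// state of every table, regardless of its size (default: 50).
spark.conf.set("spark.databricks.delta.snapshotPartitions", "50")
// For a tiny table this produces ~50 cached partitions of a few bytes each,
// which can pin executors under dynamic allocation as described above.
```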

Further details

A naïve approach: can we leverage AQE here? Perhaps introduce a configuration that deals directly with the size of the snapshot and remove the num-partitions setting.
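
A rough sketch of what a size-based configuration could look like; the config name spark.databricks.delta.snapshotTargetPartitionSize and the sizing heuristic below are hypothetical, not an existing Delta API:

```scala
// Hypothetical sketch: derive the partition count from the snapshot's byte
// size rather than using a fixed number. Neither the config name nor the
// heuristic exists in Delta today.
val targetBytes = spark.conf
  .get("spark.databricks.delta.snapshotTargetPartitionSize", (128L << 20).toString)
  .toLong // assumed default: 128 MiB per partition

def snapshotPartitions(snapshotSizeBytes: Long): Int =
  math.max(1, math.ceil(snapshotSizeBytes.toDouble / targetBytes).toInt)

// e.g. a 5 MiB snapshot -> 1 partition; a ~6.4 GiB snapshot -> 50 partitions
```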

A simpler win could be to also allow users to configure the storage level for their snapshot caches.

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

  • Yes. I can contribute this feature independently.
  • Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
  • No. I cannot contribute this feature at this time.
santosh-d3vpl3x added the enhancement label on Jul 9, 2024
Kimahriman (Contributor) commented

You can set the storage level for the snapshot cache: #1000. But this would also be a good feature; I've thought about doing it for a while but never got to it. I believe you have the file info for all the files needed to load the snapshot, so you could perhaps compute some stats from their total size.
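
For reference, the config added in #1000 can be set like any other Spark SQL conf. A sketch, assuming the config name from that PR is spark.databricks.delta.snapshotCache.storageLevel (check the PR if it differs):

```scala
// Sketch: override the storage level used for the cached Delta snapshot state.
// Any name accepted by Spark's StorageLevel.fromString should work,
// e.g. "MEMORY_AND_DISK_SER", or "NONE" to skip caching entirely.
spark.conf.set("spark.databricks.delta.snapshotCache.storageLevel", "MEMORY_AND_DISK_SER")
```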

santosh-d3vpl3x (Author) commented Jul 11, 2024

TIL! I couldn't find any reference to that config anywhere, thanks!

I wish the Delta configs also indicated the version in which each one was introduced. Which version of Delta includes this change? It looks like 1.2.0.

Kimahriman (Contributor) commented

Yeah, a lot of the configs you just kind of need to know about, or dig through the source code to find and see what they do. I actually added that one specifically for the dynamic allocation issue you mentioned, hah. I'm not sure why the configs don't record a version; the only real way to tell is to look at which tag the commit made it into, which it looks like you did: 1.2.0.
