Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GLUTEN-8025][VL] Respect config kSpillReadBufferSize and add spill compression codec #8045

Merged
merged 1 commit into from
Dec 14, 2024

Conversation

jinchengchenghh
Copy link
Contributor

@jinchengchenghh jinchengchenghh commented Nov 26, 2024

Respect Spark config UNSAFE_SORTER_SPILL_READER_BUFFER_SIZE. as Velox kSpillReadBufferSize.
Add the config spark.gluten.sql.columnar.backend.velox.spillCompressionCodec, if not set, use Spark config spark.io.compression.codec instead.
Respect Spark spark.shuffle.spill.diskWriteBufferSize as Velox kSpillWriteBufferSize.

@github-actions github-actions bot added CORE works for Gluten Core VELOX labels Nov 26, 2024
Copy link

#8025

Copy link

Run Gluten Clickhouse CI on x86

Copy link

Run Gluten Clickhouse CI on x86

@@ -1585,13 +1590,6 @@ object GlutenConfig {
.bytesConf(ByteUnit.BYTE)
.createWithDefaultString("100G")

val COLUMNAR_VELOX_MAX_SPILL_WRITE_BUFFER_SIZE =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's put all configurations in GlutenConfig.scala even it's not documented in Gluten doc.

Can you add spillreadbuffersize as well?

@@ -732,7 +734,10 @@ object GlutenConfig {
COLUMNAR_MEMORY_BACKTRACE_ALLOCATION.defaultValueString),
(
GLUTEN_COLUMNAR_TO_ROW_MEM_THRESHOLD.key,
GLUTEN_COLUMNAR_TO_ROW_MEM_THRESHOLD.defaultValue.get.toString)
GLUTEN_COLUMNAR_TO_ROW_MEM_THRESHOLD.defaultValue.get.toString),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's check if spark.unsafe.sorter.spill.read.ahead.enabled is enabled or not. If not, close the spill read adhead buffer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FileInputStream implement decides it must read some buffer ahead even if spark.unsafe.sorter.spill.read.ahead.enabled is disabled. We could add a new class FileInputReader to read the file as need.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if we set the buffer size to 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not check the config value reasonable.
It may cause dead cycle because it read file 0 byte one time.
https://github.com/facebookincubator/velox/blob/main/velox/common/file/FileInputStream.cpp#L207

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then let's ignore it for now and open an enhancement issue on it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, added, facebookincubator/velox#11673.
It won't affect us now, because Spark config requires the value is more than default value.

private[spark] val UNSAFE_SORTER_SPILL_READER_BUFFER_SIZE =
    ConfigBuilder("spark.unsafe.sorter.spill.reader.buffer.size")
      .internal()
      .version("2.1.0")
      .bytesConf(ByteUnit.BYTE)
      .checkValue(v => 1024 * 1024 <= v && v <= MAX_BUFFER_SIZE_BYTES,
        s"The value must be in allowed range [1,048,576, ${MAX_BUFFER_SIZE_BYTES}].")
      .createWithDefault(1024 * 1024)

I will add the FileReadStream to Velox to support disable read ahead.

@FelixYBW
Copy link
Contributor

some spark tunning parameters about spill, FYI

https://mageswaran1989.medium.com/spark-optimizations-for-advanced-users-spark-cheat-sheet-d74464618c20

Copy link

Run Gluten Clickhouse CI on x86

1 similar comment
Copy link

Run Gluten Clickhouse CI on x86

Copy link

Run Gluten Clickhouse CI on x86

Copy link

Run Gluten Clickhouse CI on x86

2 similar comments
Copy link

Run Gluten Clickhouse CI on x86

Copy link

Run Gluten Clickhouse CI on x86

@FelixYBW
Copy link
Contributor

@jinchengchenghh can you rebase. Let's merge it

Copy link

Run Gluten Clickhouse CI on x86

@jinchengchenghh
Copy link
Contributor Author

Can you help approve and merge this PR? Thanks! @FelixYBW

@FelixYBW FelixYBW merged commit b1211a8 into apache:main Dec 14, 2024
48 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CORE works for Gluten Core VELOX
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants