[CORE] Gluten should honor the spark configs as much as possible #8043

FelixYBW · 2024-11-26T01:33:19Z

Description

During the analysis of spill in issue #8025, we noted that some issues are common to Gluten and vanilla Spark, such as the spill read/write buffer size. Some configurations are not even documented in Spark, such as spark.unsafe.sorter.spill.reader.buffer.size.

Furthermore, I noted that Gluten doesn't have a list of honored Spark configurations. We should create such a list in the documentation. @zhouyuan

FelixYBW · 2024-11-26T04:26:29Z

@marin-ma Which of the following shuffle configurations are used by Gluten? Can you help fill in the missing ones? Feel free to correct anything. "Gluten Config" is the config renamed by Gluten; if one exists, we should either remove the Gluten config or set its default value to the Spark config's value.

Spark Config	Respected by Gluten	Transparent to Gluten
spark.reducer.maxSizeInFlight	Y	Y
spark.reducer.maxReqsInFlight	Y	Y
spark.reducer.maxBlocksInFlightPerAddress	Y	Y
spark.shuffle.compress	Y
spark.io.compression.codec	Only if `spark.gluten.sql.columnar.shuffle.codec` is not set
spark.shuffle.file.buffer	Y
spark.shuffle.unsafe.file.output.buffer	N
spark.shuffle.spill.diskWriteBufferSize	N Gluten uses fixed 16k buffer
spark.shuffle.io.maxRetries	Y	Y
spark.shuffle.io.numConnectionsPerPeer	Y	Y
spark.shuffle.io.preferDirectBufs	Y	Y
spark.shuffle.io.retryWait	Y	Y
spark.shuffle.io.backLog	Y	Y
spark.shuffle.io.connectionTimeout	Y	Y
spark.shuffle.io.connectionCreationTimeout	Y	Y
spark.shuffle.service.enabled	Y	Y
spark.shuffle.service.port	Y	Y
spark.shuffle.service.name	Y	Y
spark.shuffle.service.index.cache.size	Y	Y
spark.shuffle.service.removeShuffle	Y	Y
spark.shuffle.maxChunksBeingTransferred	Y	Y
spark.shuffle.sort.bypassMergeThreshold	N
spark.shuffle.sort.io.plugin.class	N
spark.shuffle.spill.compress	N Gluten uses spark.shuffle.compress
spark.shuffle.accurateBlockThreshold	Y	Y
spark.shuffle.registration.timeout	Y	Y
spark.shuffle.registration.maxAttempts	Y	Y
spark.shuffle.reduceLocality.enabled	Y	Y
spark.shuffle.mapOutput.minSizeForBroadcast	Y	Y
spark.shuffle.mapOutput.dispatcher.numThreads	Y	Y
spark.shuffle.detectCorrupt	Y	Y
spark.shuffle.detectCorrupt.useExtraMemory	Y	Y
spark.shuffle.useOldFetchProtocol	Y	Y
spark.shuffle.readHostLocalDisk	Y	Y
spark.files.io.connectionTimeout	Y	Y
spark.files.io.connectionCreationTimeout	Y	Y
spark.shuffle.checksum.enabled	N Gluten currently doesn't support it
spark.shuffle.checksum.algorithm	N Gluten currently doesn't support it
spark.shuffle.service.fetch.rdd.enabled	Y	Y
spark.shuffle.service.db.enabled	Y	Y
spark.shuffle.service.db.backend	Y	Y
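
The fallback rule noted for `spark.io.compression.codec` in the table above can be sketched as follows. This is a minimal illustration in plain Python, not Gluten's actual code: the `resolve_shuffle_codec` helper and the dict standing in for a Spark conf are hypothetical, and `lz4` is assumed as the final fallback because it is Spark's documented default for `spark.io.compression.codec`.

```python
from typing import Mapping

GLUTEN_CODEC_KEY = "spark.gluten.sql.columnar.shuffle.codec"
SPARK_CODEC_KEY = "spark.io.compression.codec"
SPARK_DEFAULT_CODEC = "lz4"  # assumption: Spark's documented default codec


def resolve_shuffle_codec(conf: Mapping[str, str]) -> str:
    """Hypothetical helper: use the Gluten key if set, else Spark's key."""
    if GLUTEN_CODEC_KEY in conf:
        return conf[GLUTEN_CODEC_KEY]
    # The Spark config is only consulted when the Gluten config is absent.
    return conf.get(SPARK_CODEC_KEY, SPARK_DEFAULT_CODEC)
```

The same pattern (a Gluten key whose default is the Spark key's value) is what the comment above proposes for any config that Gluten has renamed.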

yikf · 2024-11-26T07:06:44Z

@FelixYBW Thank you for initiating this. I also noticed some configurations related to table write. Please correct me if I'm missing something.

Spark Config	Respected by Gluten	Gluten Config	Transparent to Gluten
spark.sql.parquet.compression.codec	Y	N/A	Y

The Parquet configuration options in the table options are also respected by Gluten, as follows:

Options	Respected by Gluten	Gluten Config	Transparent to Gluten
compression	Y	N/A	Y
parquet.compression	Y	N/A	Y
parquet.block.size	Y	spark.gluten.sql.columnar.parquet.write.blockSize	Y
parquet.block.rows	Y	spark.gluten.sql.native.parquet.write.blockRows	Y
parquet.gzip.windowSize	Y	N/A	Y

Apart from the configurations above, no other Spark configuration related to table write is currently respected by Gluten.
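
One way to read the options table above is as a lookup from table option to Gluten config. The sketch below is only an illustration of that reading: the `split_write_options` helper is hypothetical and does not reflect how Gluten actually plumbs table properties into the native writer. The key names are copied from the table.

```python
from typing import Dict, Mapping, Tuple

# Table option -> Gluten config, copied from the table above.
OPTION_TO_GLUTEN_CONF = {
    "parquet.block.size": "spark.gluten.sql.columnar.parquet.write.blockSize",
    "parquet.block.rows": "spark.gluten.sql.native.parquet.write.blockRows",
}
# Options listed as respected but with no dedicated Gluten config (N/A).
PASS_THROUGH_OPTIONS = {"compression", "parquet.compression", "parquet.gzip.windowSize"}


def split_write_options(
    options: Mapping[str, str],
) -> Tuple[Dict[str, str], Dict[str, str], Dict[str, str]]:
    """Hypothetical helper: classify table options per the table above.

    Returns (gluten_conf, pass_through, not_respected).
    """
    gluten_conf: Dict[str, str] = {}
    pass_through: Dict[str, str] = {}
    not_respected: Dict[str, str] = {}
    for key, value in options.items():
        if key in OPTION_TO_GLUTEN_CONF:
            gluten_conf[OPTION_TO_GLUTEN_CONF[key]] = value
        elif key in PASS_THROUGH_OPTIONS:
            pass_through[key] = value
        else:
            not_respected[key] = value
    return gluten_conf, pass_through, not_respected
```

Anything that lands in the third dictionary corresponds to the "not respected at the moment" case described above.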

marin-ma · 2024-11-26T07:34:05Z

@FelixYBW Updated the table. PTAL. Thanks!

FelixYBW · 2024-11-26T17:54:44Z

spark.shuffle.file.buffer

Thank you! Can you submit a PR to support the configs:

spark.shuffle.file.buffer
spark.shuffle.spill.diskWriteBufferSize
spark.shuffle.spill.compress

jinchengchenghh · 2024-11-27T08:34:32Z

We may also need to document the mapping between Spark and Velox configs.

zhouyuan · 2024-11-28T06:19:36Z

Looks like a long-tail task again. There are also some gaps in the data loading part: the configs for HDFS, Iceberg, Hudi and DeltaLake, as these are newly implemented in native C++.
For other operators (other than spill and table scan) and expressions (other than ANSI), behavior should be aligned with Spark.

FelixYBW · 2024-11-28T19:03:14Z

> Looks like a long-tail task again. There are also some gaps in the data loading part: the configs for HDFS, Iceberg, Hudi and DeltaLake, as these are newly implemented in native C++. For other operators (other than spill and table scan) and expressions (other than ANSI), behavior should be aligned with Spark.

Let's start from the Spark 3.5 configurations here and go through the categories one by one.

https://spark.apache.org/docs/3.5.1/configuration.html

marin-ma · 2024-12-11T02:31:41Z

> Thank you! Can you submit a PR to support the configs:
>
> spark.shuffle.file.buffer
> spark.shuffle.spill.diskWriteBufferSize
> spark.shuffle.spill.compress

spark.shuffle.file.buffer: [GLUTEN-7860][CORE] In shuffle writer, replace MemoryMappedFile to avoid OOM #7861
spark.shuffle.spill.diskWriteBufferSize: [GLUTEN-8043] Use spark.shuffle.spill.diskWriteBufferSize in sort-based shuffle #8203
spark.shuffle.spill.compress: @FelixYBW Not sure if we must support it in Gluten. Spark's sort-based shuffle does not currently respect this configuration by default: https://github.com/apache/spark/blob/a3cf28ea73f0bd1147af6557954b329cad5226ea/core/src/main/java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java#L197-L201. See the discussion in https://issues.apache.org/jira/browse/SPARK-3426

jinchengchenghh · 2024-12-11T05:45:38Z

I respect these two configs for spill in #8045:
spark.shuffle.spill.diskWriteBufferSize
spark.shuffle.spill.compress

UnsafeSorterSpillReader receives TempLocalBlockId, so it respects this config.

jinchengchenghh · 2024-12-11T05:56:15Z

Spark controls compression here: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/serializer/SerializerManager.scala#L117

A shuffle block is a TempShuffleBlockId, so it should respect spark.shuffle.compress, which is now honored.

marin-ma · 2024-12-11T07:30:34Z

> UnsafeSorterSpillReader receives TempLocalBlockId, so it respects this config.

Seems like the naming of spark.shuffle.spill.compress is quite confusing. It controls the spill compression behavior of UnsafeSorterSpillWriter, which is used by sort, agg, etc., but is not used by shuffle.
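
The distinction discussed here can be sketched as a lookup from block-ID kind to the config that governs its compression. This is a simplified paraphrase of the match in Spark's `SerializerManager.shouldCompress` linked above, written in plain Python for illustration. Block kinds are modeled as strings, a few kinds are omitted, and the `should_compress` helper is hypothetical; the Spark source remains the reference.

```python
from typing import Mapping

# Block-ID kind -> (config key, Spark default). Simplified: some block-ID
# kinds handled by Spark are omitted; unknown kinds are not compressed.
COMPRESS_CONF_BY_BLOCK_KIND = {
    "ShuffleBlockId": ("spark.shuffle.compress", True),
    "TempShuffleBlockId": ("spark.shuffle.compress", True),
    "TempLocalBlockId": ("spark.shuffle.spill.compress", True),
    "BroadcastBlockId": ("spark.broadcast.compress", True),
    "RDDBlockId": ("spark.rdd.compress", False),
}


def should_compress(block_kind: str, conf: Mapping[str, str]) -> bool:
    """Hypothetical helper mirroring the per-block-kind compression choice."""
    entry = COMPRESS_CONF_BY_BLOCK_KIND.get(block_kind)
    if entry is None:
        return False
    key, default = entry
    return conf.get(key, str(default)).lower() == "true"
```

Under this reading, setting `spark.shuffle.spill.compress=false` affects blocks written as `TempLocalBlockId` (operator spill) but leaves `TempShuffleBlockId` blocks governed by `spark.shuffle.compress`.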

jinchengchenghh · 2024-12-12T01:34:56Z

Yes, the name is incorrect.

FelixYBW · 2024-12-13T01:13:57Z

Let's fix this. Only shuffle's spill uses it.

FelixYBW added the enhancement (New feature or request) label Nov 26, 2024

zhztheplayer changed the title ~~[Core] Gluten should honor the spark configs as much as possible~~ [CORE] Gluten should honor the spark configs as much as possible Nov 26, 2024

FelixYBW mentioned this issue Nov 26, 2024

[GLUTEN-8039][VL] Native writer should respect table properties #8040

Merged

github-actions bot mentioned this issue Dec 11, 2024

[GLUTEN-8043] Use spark.shuffle.spill.diskWriteBufferSize in sort-based shuffle #8203

Merged
