[CORE] Gluten should honor the spark configs as much as possible #8043

Open

FelixYBW opened this issue Nov 26, 2024 · 13 comments
Labels
enhancement New feature or request

Comments

@FelixYBW
Contributor

FelixYBW commented Nov 26, 2024

Description

While analyzing spill behavior in issue #8025, we noted that some issues are common to Gluten and vanilla Spark, such as the spill read/write buffer sizes. Some configurations are not even documented in Spark, e.g. spark.unsafe.sorter.spill.reader.buffer.size.

Furthermore, I noted that Gluten doesn't publish a list of the Spark configurations it honors. We should add such a list to the documentation. @zhouyuan
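For illustration (a snippet of mine, not from the issue): both spill buffers can be tuned through SparkConf even though the reader-side key is undocumented. The values shown are Spark's defaults.

```scala
// Illustration only: tuning the spill read/write buffers discussed above.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Documented: in-memory buffer used when writing sorter spill data to disk.
  .set("spark.shuffle.spill.diskWriteBufferSize", "1m")
  // Undocumented in Spark's configuration reference, but read by
  // UnsafeSorterSpillReader when spill files are read back.
  .set("spark.unsafe.sorter.spill.reader.buffer.size", "1m")
```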

@FelixYBW FelixYBW added the enhancement New feature or request label Nov 26, 2024
@zhztheplayer zhztheplayer changed the title [Core] Gluten should honor the spark configs as much as possible [CORE] Gluten should honor the spark configs as much as possible Nov 26, 2024
@FelixYBW
Contributor Author

FelixYBW commented Nov 26, 2024

@marin-ma which of the following shuffle configurations are used by Gluten? Can you help fill in the missing ones? Feel free to correct any entries. "Gluten Config" is the config as renamed by Gluten; where such a renamed config exists, we should either remove it or default it to the Spark config's value (see the sketch after the table).

| Spark Config | Respected by Gluten | Gluten Config | Transparent to Gluten |
| --- | --- | --- | --- |
| spark.reducer.maxSizeInFlight | Y | | Y |
| spark.reducer.maxReqsInFlight | Y | | Y |
| spark.reducer.maxBlocksInFlightPerAddress | Y | | Y |
| spark.shuffle.compress | Y | | |
| spark.io.compression.codec | Only if spark.gluten.sql.columnar.shuffle.codec is not set | | |
| spark.shuffle.file.buffer | Y | | |
| spark.shuffle.unsafe.file.output.buffer | N | | |
| spark.shuffle.spill.diskWriteBufferSize | N (Gluten uses a fixed 16k buffer) | | |
| spark.shuffle.io.maxRetries | Y | | Y |
| spark.shuffle.io.numConnectionsPerPeer | Y | | Y |
| spark.shuffle.io.preferDirectBufs | Y | | Y |
| spark.shuffle.io.retryWait | Y | | Y |
| spark.shuffle.io.backLog | Y | | Y |
| spark.shuffle.io.connectionTimeout | Y | | Y |
| spark.shuffle.io.connectionCreationTimeout | Y | | Y |
| spark.shuffle.service.enabled | Y | | Y |
| spark.shuffle.service.port | Y | | Y |
| spark.shuffle.service.name | Y | | Y |
| spark.shuffle.service.index.cache.size | Y | | Y |
| spark.shuffle.service.removeShuffle | Y | | Y |
| spark.shuffle.maxChunksBeingTransferred | Y | | Y |
| spark.shuffle.sort.bypassMergeThreshold | N | | |
| spark.shuffle.sort.io.plugin.class | N | | |
| spark.shuffle.spill.compress | N (Gluten uses spark.shuffle.compress) | | |
| spark.shuffle.accurateBlockThreshold | Y | | Y |
| spark.shuffle.registration.timeout | Y | | Y |
| spark.shuffle.registration.maxAttempts | Y | | Y |
| spark.shuffle.reduceLocality.enabled | Y | | Y |
| spark.shuffle.mapOutput.minSizeForBroadcast | Y | | Y |
| spark.shuffle.mapOutput.dispatcher.numThreads | Y | | Y |
| spark.shuffle.detectCorrupt | Y | | Y |
| spark.shuffle.detectCorrupt.useExtraMemory | Y | | Y |
| spark.shuffle.useOldFetchProtocol | Y | | Y |
| spark.shuffle.readHostLocalDisk | Y | | Y |
| spark.files.io.connectionTimeout | Y | | Y |
| spark.files.io.connectionCreationTimeout | Y | | Y |
| spark.shuffle.checksum.enabled | N (Gluten currently doesn't support it) | | |
| spark.shuffle.checksum.algorithm | N (Gluten currently doesn't support it) | | |
| spark.shuffle.service.fetch.rdd.enabled | Y | | Y |
| spark.shuffle.service.db.enabled | Y | | Y |
| spark.shuffle.service.db.backend | Y | | Y |
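The proposal above (defaulting a Gluten config to the Spark value) could look roughly like this. This is a sketch, not Gluten's actual code, and the Gluten key name here is hypothetical:

```scala
// Sketch of the proposed fix for spark.shuffle.spill.diskWriteBufferSize:
// read the Spark config instead of hardcoding 16k, and let a hypothetical
// Gluten-specific override fall back to the Spark value when unset.
import org.apache.spark.SparkConf

def diskWriteBufferBytes(conf: SparkConf): Long = {
  // Spark's documented default for this key is 1m.
  val sparkValue = conf.getSizeAsBytes("spark.shuffle.spill.diskWriteBufferSize", "1m")
  // Hypothetical Gluten override key; defaults to the Spark config's value.
  conf.getSizeAsBytes("spark.gluten.sql.columnar.shuffle.diskWriteBufferSize", sparkValue)
}
```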

@yikf
Contributor

yikf commented Nov 26, 2024

@FelixYBW thank you for initiating this. I also noticed some configurations related to table writes. Please correct me if I'm missing something.

| Spark Config | Respected by Gluten | Gluten Config | Transparent to Gluten |
| --- | --- | --- | --- |
| spark.sql.parquet.compression.codec | Y | N/A | Y |

The Parquet options passed via table options are also respected by Gluten, as follows:

| Option | Respected by Gluten | Gluten Config | Transparent to Gluten |
| --- | --- | --- | --- |
| compression | Y | N/A | Y |
| parquet.compression | Y | N/A | Y |
| parquet.block.size | Y | spark.gluten.sql.columnar.parquet.write.blockSize | Y |
| parquet.block.rows | Y | spark.gluten.sql.native.parquet.write.blockRows | Y |
| parquet.gzip.windowSize | Y | N/A | Y |

Apart from the configurations above, no other table-write-related Spark configuration is currently respected by Gluten; an illustrative write using the honored options is sketched below.
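A minimal example (mine, not from the thread) of a write that exercises the options listed as honored above:

```scala
// Illustration only: table-write options that, per the tables above, the
// native parquet writer honors. Paths and values are arbitrary.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-write-options").getOrCreate()

spark.range(1000000).toDF("id")
  .write
  .option("compression", "zstd")            // honored table option
  .option("parquet.block.size", 134217728L) // honored; 128m row groups
  .parquet("/tmp/parquet_write_demo")
```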

@marin-ma
Contributor

@FelixYBW Updated the table. PTAL. Thanks!

@FelixYBW
Contributor Author

spark.shuffle.file.buffer

Thank you! Can you submit a PR to support these configs? (A sketch of reading them from SparkConf follows the list.)

  • spark.shuffle.file.buffer
  • spark.shuffle.spill.diskWriteBufferSize
  • spark.shuffle.spill.compress
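A sketch (not Gluten's actual code) of how a shuffle writer could pick these three values up from SparkConf instead of hardcoding them, with defaults matching vanilla Spark:

```scala
// Sketch: read the three requested configs with vanilla Spark's defaults.
import org.apache.spark.SparkConf

def shuffleWriterSettings(conf: SparkConf): (Long, Long, Boolean) = {
  val fileBufferBytes      = conf.getSizeAsBytes("spark.shuffle.file.buffer", "32k")
  val diskWriteBufferBytes = conf.getSizeAsBytes("spark.shuffle.spill.diskWriteBufferSize", "1m")
  val compressSpill        = conf.getBoolean("spark.shuffle.spill.compress", defaultValue = true)
  (fileBufferBytes, diskWriteBufferBytes, compressSpill)
}
```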

@jinchengchenghh
Contributor

We may also need to document the mapping between Spark and Velox configs.

@zhouyuan
Contributor

Looks like a long-tail task again. There are also some gaps on the data-loading side: the configs for HDFS, Iceberg, Hudi, and DeltaLake, as these are newly implemented in native C++.
For other operators (besides spill and table scan) and expressions (besides ANSI), behavior should be aligned with Spark.

@FelixYBW
Contributor Author

Looks like a long-tail task again. There are also some gaps on the data-loading side: the configs for HDFS, Iceberg, Hudi, and DeltaLake, as these are newly implemented in native C++. For other operators (besides spill and table scan) and expressions (besides ANSI), behavior should be aligned with Spark.

Let's start from Spark 3.5's configurations and go through the categories one by one.

https://spark.apache.org/docs/3.5.1/configuration.html

@marin-ma
Contributor

Thank you! Can you submit a PR to support these configs?

  • spark.shuffle.file.buffer
  • spark.shuffle.spill.diskWriteBufferSize
  • spark.shuffle.spill.compress

@jinchengchenghh
Contributor

jinchengchenghh commented Dec 11, 2024

I respected these two configs for spill in #8045:

  • spark.shuffle.spill.diskWriteBufferSize
  • spark.shuffle.spill.compress

UnsafeSorterSpillReader receives a TempLocalBlockId, so it respects this config.

@jinchengchenghh
Contributor

Spark controls compression here: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/serializer/SerializerManager.scala#L117

A shuffle block is a TempShuffleBlockId, so it should respect spark.shuffle.compress; that is now honored.
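The linked logic, paraphrased below as a standalone function (exact code may differ across Spark versions), chooses the compression flag per block-id type, which is exactly why shuffle blocks follow spark.shuffle.compress while sorter spill files follow spark.shuffle.spill.compress:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage._

// Standalone paraphrase of SerializerManager.shouldCompress:
// compression is decided by the type of the block being written.
def shouldCompress(conf: SparkConf, blockId: BlockId): Boolean = blockId match {
  case _: ShuffleBlockId     => conf.getBoolean("spark.shuffle.compress", true)
  case _: BroadcastBlockId   => conf.getBoolean("spark.broadcast.compress", true)
  case _: RDDBlockId         => conf.getBoolean("spark.rdd.compress", false)
  case _: TempLocalBlockId   => conf.getBoolean("spark.shuffle.spill.compress", true) // sorter/agg spills
  case _: TempShuffleBlockId => conf.getBoolean("spark.shuffle.compress", true)       // shuffle spills
  case _                     => false
}
```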

@marin-ma
Contributor

UnsafeSorterSpillReader receives a TempLocalBlockId, so it respects this config.

Seems like the naming of spark.shuffle.spill.compress is quite confusing. It controls the spill compression behavior of UnsafeSorterSpillWriter, which is used by sort, aggregate, etc., but not by shuffle.

@jinchengchenghh
Contributor

Yes, the name is incorrect.

@FelixYBW
Contributor Author

Let's fix this. Only shuffle's spill uses it.
