Set parallelism for the parallelize job in recursiveListDirs (#3708) · sumeet-db/delta@538e736

#### Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description


`DeltaFileOperations.recursiveListDirs` calls `parallelize` without
specifying the number of partitions, so Spark falls back to
`defaultParallelism`, which is typically [the number of available cores on a
cluster](https://github.com/apache/spark/blob/d2e8c1cb60e34a1c7e92374c07d682aa5ca79145/core/src/main/scala/org/apache/spark/SparkContext.scala#L1003).
When a cluster has many cores but `subDirs` has only a few entries, most
of the launched tasks are empty.

This PR makes a small change to use
`subDirs.length.min(spark.sparkContext.defaultParallelism)` as the
parallelism, so that when `subDirs` has fewer entries than the number of
available cores, no empty tasks are launched.
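
To illustrate the effect, here is a minimal, Spark-free sketch. It assumes a simplified model of how a local collection is split into evenly sized index ranges, which is a common way to do this and not necessarily Spark's exact implementation. The names `SliceSketch`, `slice`, and `cappedParallelism`, the example paths, and the value `16` for the default parallelism are all hypothetical and exist only for this example.

```scala
object SliceSketch {
  // Simplified model (an assumption): split `seq` into `numSlices`
  // contiguous index ranges of near-equal size. Ranges can be empty
  // when numSlices exceeds seq.length.
  def slice[T](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    require(numSlices >= 1, "numSlices must be positive")
    (0 until numSlices).map { i =>
      val start = ((i * seq.length.toLong) / numSlices).toInt
      val end = (((i + 1) * seq.length.toLong) / numSlices).toInt
      seq.slice(start, end)
    }
  }

  // The cap described in this PR: never ask for more partitions than
  // there are directories. The real code returns early when subDirs is
  // empty, so numDirs is at least 1 when this is reached.
  def cappedParallelism(numDirs: Int, defaultParallelism: Int): Int =
    numDirs.min(defaultParallelism)

  def main(args: Array[String]): Unit = {
    val subDirs = Seq("/table/a", "/table/b", "/table/c") // hypothetical paths
    val defaultParallelism = 16 // hypothetical core count

    val uncapped = slice(subDirs, defaultParallelism)
    val capped = slice(subDirs, cappedParallelism(subDirs.length, defaultParallelism))

    // With 3 directories and 16 slices, 13 of the 16 partitions are empty.
    println(s"uncapped: ${uncapped.size} partitions, ${uncapped.count(_.isEmpty)} empty")
    // With the cap, there are 3 partitions and none is empty.
    println(s"capped: ${capped.size} partitions, ${capped.count(_.isEmpty)} empty")
  }
}
```

Under this model each empty partition would still correspond to a scheduled task that does no listing work, which is the overhead the cap is meant to avoid.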

## How was this patch tested?

Existing tests.

## Does this PR introduce _any_ user-facing changes?


zsxwing authored Sep 23, 2024

1 parent a99f62b commit 538e736

spark/src/main/scala/org/apache/spark/sql/delta/util/DeltaFileOperations.scala

@@ -243,7 +243,10 @@ object DeltaFileOperations extends DeltaLogging {
         import org.apache.spark.sql.delta.implicits._
         if (subDirs.isEmpty) return spark.emptyDataset[SerializableFileStatus]
         val listParallelism = fileListingParallelism.getOrElse(spark.sparkContext.defaultParallelism)
-        val dirsAndFiles = spark.sparkContext.parallelize(subDirs).mapPartitions { dirs =>
+        val subDirsParallelism = subDirs.length.min(spark.sparkContext.defaultParallelism)
+        val dirsAndFiles = spark.sparkContext.parallelize(
+            subDirs,
+            subDirsParallelism).mapPartitions { dirs =>
           val logStore = LogStore(SparkEnv.get.conf, hadoopConf.value.value)
           listUsingLogStore(
             logStore,
