An example project showcasing a possible issue with Spark AQE's dynamic cache repartitioning mechanism.
The assembled jar should be run with:

    spark-submit --class ExampleApp \
      --packages org.apache.spark:spark-avro_2.12:3.5.0 \
      --deploy-mode cluster \
      --master spark://spark-master:6066 \
      --conf spark.sql.autoBroadcastJoinThreshold=-1 \
      --conf spark.eventLog.enabled=true \
      --conf spark.eventLog.dir=file:///spark-event-log \
      --conf spark.cores.max=3 \
      --driver-cores 1 --driver-memory 1g \
      --executor-cores 1 --executor-memory 1g \
      /data/shared/test.jar
Before assembling, point the application at the proper Avro data paths.
Project:
- Spark 3.5.0
- sbt
- Scala 2.12
Description: Because auto broadcast joins are disabled (spark.sql.autoBroadcastJoinThreshold=-1), Spark falls back to SortMergeJoin. The cluster runs in cluster mode with 2 workers: one driver (1 core / 1 GB) and two executors (1 core / 1 GB each).
In the given example, parentDF is self-joined, the result is cached, and the cached result is then joined with childDF.
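
A minimal sketch of that flow, assuming hypothetical input paths and an assumed join key "id" (the real project reads its own Avro data):

    import org.apache.spark.sql.SparkSession

    object ExampleApp {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("aqe-cached-plan-repartitioning-repro")
          .getOrCreate()

        // Hypothetical paths; substitute the real Avro inputs.
        val parentDF = spark.read.format("avro").load("/data/shared/parent")
        val childDF  = spark.read.format("avro").load("/data/shared/child")

        // Self-join parentDF on the assumed key "id" and cache the result.
        val selfJoined = parentDF.as("a").join(parentDF.as("b"), "id").cache()

        // Join the cached result with childDF; with broadcast joins disabled,
        // explain() should show SortMergeJoin in the physical plan.
        val result = selfJoined.join(childDF, "id")
        result.explain()

        // On the 2-worker cluster this count comes out lower than expected.
        println(s"count = ${result.count()}")

        spark.stop()
      }
    }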
Expected behaviour: the correct count with no data loss.
Actual behaviour: data loss; the count is lower than expected.
Observed behaviour:
- Not reproducible on single-executor flows
- There seems to be a file-size threshold above which data loss is observed (possibly implying that it occurs once both workers start reading the same data)
- Not reproducible when spark.sql.optimizer.canChangeCachedPlanOutputPartitioning is disabled (see the sketch after this list)
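
For reference, a sketch of how that flag could be disabled in code; it can equally be passed as an extra --conf on the spark-submit line above. This assumes the flag is set before any DataFrames are cached:

    // Sketch: disable AQE's re-partitioning of cached plan output
    // at session build time, before anything is cached.
    val spark = SparkSession.builder()
      .config("spark.sql.optimizer.canChangeCachedPlanOutputPartitioning", "false")
      .getOrCreate()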