Spline doesn't track some in-memory operations in pyspark #795
A quick investigation revealed the following: it happens because when calling the mentioned
More investigation is required to put a future-proof fix to this, but as a quick workaround you may simply add:

```
pyspark ... \
  --conf "spark.spline.plugins.za.co.absa.spline.harvester.plugin.embedded.NonPersistentActionsCapturePlugin.funcNames=head,count,collect,collectAsList,collectToPython,toLocalIterator"
```
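The reason the extra names are needed: PySpark's `df.collect()` does not call `collect` on the JVM `DataFrame` — it goes through `collectToPython`, so an allow-list built from the Scala-side action names silently misses it. A minimal Python sketch of the allow-list idea (the real `NonPersistentActionsCapturePlugin` is Scala; only the names from the `--conf` value above come from the source, the function itself is illustrative):

```python
# Names copied from the workaround's funcNames conf value; the gate
# function itself is a hypothetical stand-in for the Scala plugin logic.
DEFAULT_FUNC_NAMES = {
    "head", "count", "collect", "collectAsList",
    "collectToPython", "toLocalIterator",
}

def is_captured(func_name: str, allow_list: set = DEFAULT_FUNC_NAMES) -> bool:
    """Return True if an action with this JVM-side name would be captured."""
    return func_name in allow_list

# PySpark's df.collect() reaches the JVM as "collectToPython", not
# "collect", so a default list lacking that name skips the action.
print(is_captured("collectToPython"))                             # True
print(is_captured("collectToPython", {"head", "count", "collect"}))  # False
```

This also explains why the same actions are tracked from `spark-shell`: there, `collect` really is named `collect` on the JVM side.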
@wajda Thank you for the response. I tried to run pyspark with that option, but it's still not tracked by Spline.
You don't even need Spline for this investigation. Just create and register your own simple listener that prints out stuff.
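A real query-execution listener has to live on the JVM (Scala/Java, registered via `spark.sql.queryExecutionListeners`), so the following Python stand-in only sketches the callback shape such a debug listener would have; the class and method names here are illustrative, not a pyspark API:

```python
# Hypothetical Python mock of a Spark query-execution listener, showing
# what a "print out stuff" debug listener would reveal: the name of the
# triggering action as Spark reports it on the JVM side.
class PrintingListener:
    def __init__(self):
        self.seen = []

    def on_success(self, func_name, duration_ns):
        # For a pyspark df.collect(), a real listener would receive
        # "collectToPython" here rather than "collect".
        self.seen.append(func_name)
        print(f"onSuccess: funcName={func_name}, durationNs={duration_ns}")

listener = PrintingListener()
for name in ("count", "collectToPython"):  # simulated action names
    listener.on_success(name, 0)
```

Printing the reported function name for each action is exactly what distinguishes the pyspark and spark-shell behavior described in this issue.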
I found that Spline in pyspark doesn't track some in-memory operations like collect, head, and toPandas.
The operations count and show are tracked as expected.
I used Spline with the spark-3.2 agent bundle in my test:
https://mvnrepository.com/artifact/za.co.absa.spline.agent.spark/spark-3.2-spline-agent-bundle_2.12/2.0.0
Here are my pyspark options:

```
JAVA_HOME=/Users/alexey.balyshev/Library/Java/JavaVirtualMachines/corretto-1.8.0_402/Contents/Home/ ~/spark-3.2.2-bin-hadoop3.2/bin/pyspark \
  --master local --deploy-mode client \
  --jars ~/Documents/spark-3.2-spline-agent-bundle_2.12-2.0.0.jar \
  --num-executors 1 \
  --conf "spark.executor.cores=1" \
  --conf "spark.sql.queryExecutionListeners=za.co.absa.spline.harvester.listener.SplineQueryExecutionListener" \
  --conf "spark.spline.lineageDispatcher=console" \
  --conf "spark.spline.plugins.za.co.absa.spline.harvester.plugin.embedded.NonPersistentActionsCapturePlugin.enabled=true"
```
After show and count operations I could see the execution plan in JSON format, but after collect, head, and toPandas operations I got empty output.
At the same time, in spark-shell all in-memory operations are tracked as expected.