Spark executor overheads #543
Replies: 7 comments 3 replies
-
Yeah, as I was indicating - the churn through pandas is a really nasty time sink. It's pretty easy to hack it into Spark, using Arrow entirely and skipping pandas (which makes the operation you see here vanishingly small) - but you have to hack and re-roll your own Spark distribution.
-
@lgray thanks for the link! We are going to implement the fix and see if the performance gets close to that of the Dask executor. Personally, I would not be comfortable showing performance studies for native Spark, knowing that the timing is completely dominated by a feature that can be fixed so easily. The entire Dask vs. Spark comparison in this case just boils down to the effect of that feature.
-
@lgray I tried your suggestion, but I'm not sure if it's working - timing did not improve, and if I put print() statements near the lines that we are hacking in pyspark, I don't see those printouts in the output. Is there any way to check if these functions are actually getting called?
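One generic way to narrow this down (a stdlib sketch, not specific to this patch) is to check where Python will actually load the module from:

```python
import importlib.util

# Locate the file a module would be imported from, without importing it.
# "json" is a stand-in here; on the cluster, check "pyspark" instead - the
# origin should point inside the hacked distribution, and if it ends with
# something like "pyspark.zip/pyspark/__init__.py", the workers are reading
# a zipped copy and edits to loose .py files will never show up.
spec = importlib.util.find_spec("json")
print(spec.origin)
```

Note also that print() calls executed on Spark workers land in the executor logs, not the driver's stdout, which can make a working patch look silent.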
-
That means your spark build didn't catch the changes.
How'd you roll up the spark distribution?
-L

> On Thu, Aug 12, 2021 at 1:00 PM Dmitry Kondratyev wrote:
>
> @lgray I tried your suggestion, but I'm not sure if it's working - timing
> did not improve, and if I put print() statements near the lines that we are
> hacking in pyspark, I don't see those printouts in the output.
> Is there any way to check if these functions are actually getting called?
-
So there's a secret python zip ball in the spark distribution that you have to rebuild in order for this to take effect.
You need to:
- edit the code
- rebuild your own spark.tgz
- then deploy the *rebuilt* spark tgz

Annoying - but it works!
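The step that's easy to miss is regenerating the zipped pyspark sources inside the distribution; re-tarring a directory whose zip is stale changes nothing on the workers. A minimal sketch of that repackaging in Python (the `python/lib/pyspark.zip` layout is an assumption about the 2.4.4 tarball, demonstrated here on a mock tree):

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

def rebuild_pyspark_zip(dist: Path) -> Path:
    """Rebuild python/lib/pyspark.zip from the (edited) python/pyspark tree.

    If workers import pyspark from this zip, edits to the loose .py files
    alone never reach them.
    """
    zip_path = dist / "python" / "lib" / "pyspark.zip"
    zip_path.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in sorted((dist / "python" / "pyspark").rglob("*")):
            if f.is_file():
                # store entries as pyspark/... so the zip stays importable
                zf.write(f, f.relative_to(dist / "python"))
    return zip_path

# Demo on a mock layout - point `dist` at your real unpacked distribution:
tmp = Path(tempfile.mkdtemp())
dist = tmp / "spark-2.4.4-bin-hadoop2.7"
(dist / "python" / "pyspark").mkdir(parents=True)
(dist / "python" / "pyspark" / "serializers.py").write_text("# edited\n")
zp = rebuild_pyspark_zip(dist)
print(zipfile.ZipFile(zp).namelist())  # ['pyspark/serializers.py']

# Then re-tar the whole distribution (shutil produces .tar.gz; rename to
# .tgz if your deployment expects that) and deploy the rebuilt tarball:
shutil.make_archive(str(tmp / "spark-2.4.4-bin-hadoop2.7"), "gztar",
                    root_dir=tmp, base_dir="spark-2.4.4-bin-hadoop2.7")
```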
-L

> On Thu, Aug 12, 2021 at 1:18 PM Dmitry Kondratyev wrote:
>
> We install Spark into user space as follows:
> - extract spark-2.4.4-bin-hadoop2.7.tgz into a directory where I have full access
> - create a new module with LMOD, which knows the path to the location where the spark distribution was extracted
> - in the module settings, set up necessary envs like JAVA_HOME, PYSPARK_PYTHON, PYSPARK_DRIVER_PYTHON
> - load the module
> - run the code with Spark executor
>
> I checked that the code doesn't run before the module is loaded, but does run after it is loaded. So I'm almost sure that it accesses the correct location.
-
@lgray
tar -cvzf spark-2.4.4-bin-hadoop2.7.tgz spark-2.4.4-bin-hadoop2.7 (version with edited files)
tar -xvzf spark-2.4.4-bin-hadoop2.7 (into other directory)
this didn't change much. Am I supposed to repack it somehow differently?
-
Follow the instructions here:
https://spark.apache.org/docs/2.4.4/building-spark.html#building-a-runnable-distribution
-L

> On Thu, Aug 12, 2021 at 2:15 PM Dmitry Kondratyev wrote:
>
> @lgray
> tar -cvzf spark-2.4.4-bin-hadoop2.7.tgz spark-2.4.4-bin-hadoop2.7 (version with edited files)
> tar -xvzf spark-2.4.4-bin-hadoop2.7 (into other directory)
> this didn't change much. Am I supposed to repack it somehow differently?
-
@nsmith- @lgray @jpivarski @mcremone
Follow-up on our discussion about Spark executor:
in my tests (partition size 100k), ~50% of the entire processing time is spent on this line:
https://github.com/CoffeaTeam/coffea/blob/6d548538653e7003281a572f8eec5d68ca57b19f/coffea/processor/templates/spark.py.tmpl#L7
which I think is the conversion from `pandas` (?) to `awkward.Array` before the data is even loaded into a processor instance. I haven't completely figured out the other overheads yet, but they are less significant.
Actual useful work with the Spark executor currently takes 25-30% of total processing time.