add option to do a spark-submit with a SparkListener to gather events from Spark #113
Hi @pjfanning! I'm so glad you thought the talk was interesting :) For anybody else reading who wants to see it, they've told us it will be posted on Nov 3. What you've outlined here is a great suggestion! While I have not tried it myself yet, adding listeners through the spark-submit conf should already work through existing means, like this:
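For example, the kind of invocation shown in the issue text below (the listener class name is just a placeholder):

spark-submit --conf spark.extraListeners=com.mycompany.MetricsListener ...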
If that works out of the box, then getting that output bundled with the spark-bench output would be the logical next step. @pjfanning Is this something you'd be interested in investigating? Thanks again for your helpful suggestion! I am in shaky wifi territory for the next two days but will be back in regular communication after that :)
@ecurtin I may not have much time over the coming weeks, but if I do find some time, I'll try prototyping something.
👍
I have a very early prototype at https://github.com/pjfanning/spark-bench/pull/2/files
The aim is to gather more metrics with the listener and to include them with the other benchmark results.
A CSV file recording the task durations of all tasks would be better.
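A minimal sketch of what such a listener might look like, assuming Spark 2.x listener APIs (the class name, output path, and CSV columns are hypothetical, not part of spark-bench):

```scala
import java.io.PrintWriter

import scala.collection.mutable.ArrayBuffer

import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd, SparkListenerTaskEnd}

// Hypothetical sketch, not spark-bench code: record one CSV row per finished
// task and write the file out when the application ends.
class MetricsListener extends SparkListener {

  private val rows = ArrayBuffer[String]()

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val info = taskEnd.taskInfo
    // taskMetrics can be null (e.g. for failed tasks), so guard before reading it
    val metrics = Option(taskEnd.taskMetrics)
    val shuffleRead = metrics.map(_.shuffleReadMetrics.totalBytesRead).getOrElse(0L)
    val shuffleWrite = metrics.map(_.shuffleWriteMetrics.bytesWritten).getOrElse(0L)
    rows += s"${taskEnd.stageId},${info.taskId},${info.duration},$shuffleRead,$shuffleWrite"
  }

  override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit = {
    // Dump everything once the application finishes
    val out = new PrintWriter("task-metrics.csv")
    try {
      out.println("stageId,taskId,durationMs,shuffleReadBytes,shuffleWriteBytes")
      rows.foreach(out.println)
    } finally out.close()
  }
}
```

Spark delivers listener events on a single listener-bus thread, so the plain buffer is safe here; the class needs a zero-arg constructor to be registered via spark.extraListeners, or it can be attached programmatically with sparkContext.addSparkListener.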
I was at Emily Curtin's Spark Summit Europe presentation today (which was very interesting). An attendee asked if Spark Bench gathered Spark executor metrics.
A SparkListener can be used to gather benchmark data about how long tasks spent running and how much data was shuffled (basically, any data visible in the Spark UI could be picked up and summarised).
https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/scheduler/SparkListener.html
spark-submit --conf spark.extraListeners=com.mycompany.MetricsListener
https://github.com/LucaCanali/sparkMeasure has a Spark listener that gathers metrics.
https://github.com/groupon/sparklint also has one.
One possible design would be to bundle such a listener with spark-bench and register it via the spark.extraListeners conf when spark-bench does its spark-submit.
Another approach would be to run Spark with spark.eventLog.enabled=true (and spark.eventLog.dir set) and parse the JSON-lines output. https://github.com/groupon/sparklint also has code to summarise event logs to create metrics.
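As a rough sketch of the event-log approach, assuming the standard field names Spark writes for SparkListenerTaskEnd events (each line of the log is one JSON object; json4s is used here because Spark already ships with it, and the object and file layout are hypothetical):

```scala
import scala.io.Source

import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Hypothetical sketch: read one event-log file (one JSON object per line)
// and summarise task durations from the SparkListenerTaskEnd events.
object EventLogSummary {
  implicit val formats: Formats = DefaultFormats

  def main(args: Array[String]): Unit = {
    val source = Source.fromFile(args(0)) // path to an event-log file
    try {
      val durations = source.getLines()
        .map(line => parse(line))
        .filter(json => (json \ "Event") == JString("SparkListenerTaskEnd"))
        .map { json =>
          val info = json \ "Task Info"
          (info \ "Finish Time").extract[Long] - (info \ "Launch Time").extract[Long]
        }
        .toVector
      println(s"tasks=${durations.size}, totalTaskTimeMs=${durations.sum}")
    } finally source.close()
  }
}
```

sparklint and sparkMeasure do this far more thoroughly; the point is just that the event log is plain JSON lines and straightforward to post-process.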