YARN build fixes #892
Thank you for submitting this pull request. Unfortunately, the automated tests for this request have failed. Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/830/ |
Thank you for submitting this pull request. All automated tests for this request have passed. Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/831/ |
Jey, the goal of my change was for the "assembly" project to include all the code you'd want to run on Spark. This is why I removed the assembly from "yarn". Notice that assemblyProj depends on YARN if active. I think the problem is just that the "Running on YARN" doc is wrong; fix that to use the assembly.jar in assembly/. |
Also, the examples should depend on maybeYarn in case you want to run them on YARN. |
Applications do not need to depend on spark-yarn. They only need to specify spark-core as their dependency. |
Ah, okay, that makes sense. But for the assembly definitely just use the one in assembly/ instead. Part of the reason I did this is that I wanted to build fewer assemblies (only examples and assemblyProj). |
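For readers following along, the conditional dependency discussed above can be modelled roughly as below. This is a plain-Scala sketch, not the actual sbt build definition: the `Project` case class, `yarnEnabled`, and `build` are hypothetical names for illustration, and reading a `SPARK_YARN` variable set to "true" is taken from a later comment in this thread rather than from the build file itself.

```scala
// Hypothetical model of the dependency wiring discussed in this thread.
// It is NOT the real SparkBuild.scala; all names here are illustrative.
object ConditionalDeps {
  final case class Project(name: String, dependsOn: Seq[Project] = Nil)

  // Assumption: YARN support is switched on by SPARK_YARN=true.
  def yarnEnabled(env: Map[String, String]): Boolean =
    env.get("SPARK_YARN").exists(_.equalsIgnoreCase("true"))

  // Returns (assembly, examples) for the given environment.
  def build(env: Map[String, String]): (Project, Project) = {
    val core = Project("core")
    val yarn = Project("yarn", Seq(core))

    // Empty when YARN is off, so dependents need no special casing.
    val maybeYarn: Seq[Project] =
      if (yarnEnabled(env)) Seq(yarn) else Seq.empty

    // The single runtime assembly pulls in YARN only when it is active.
    val assembly = Project("assembly", Seq(core) ++ maybeYarn)
    // Applications such as the examples only need core.
    val examples = Project("examples", Seq(core))
    (assembly, examples)
  }
}
```

Modelling the optional module as a possibly empty sequence keeps the number of assemblies at two (runtime and examples) whether or not YARN is enabled.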
Makes sense. I'm updating my patch accordingly |
How do we want YARN users to build/run their apps? How about if we say that they must always create a single assembly that contains spark-yarn and their own code, and modify Then an example invocation would look like this:
|
I think we should ask some of the YARN people about that. One advantage of separating the Spark JAR is that it might be possible to put it in a standard path on HDFS and then YARN will cache it locally on each worker node, avoiding a long download. Then the user's JAR can be compiled with spark listed as "provided" and will contain only the user's own classes and other dependencies that aren't in the YARN JAR. @mridulm, @tgravescs any thoughts on this? |
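As a rough illustration of the "provided" approach described above, a user's sbt build might look something like the fragment below. This is only a sketch: the project name is made up, and the Spark coordinates (group, artifact, version) are assumptions pieced together from names mentioned in this thread, not confirmed published artifacts.

```scala
// build.sbt (sketch): hypothetical user application, not part of this PR.
name := "my-yarn-app"

scalaVersion := "2.9.3"

// "provided" keeps Spark on the compile classpath but out of the user's JAR,
// so the shared Spark JAR (e.g. one cached from HDFS by YARN) supplies it at
// run time. The coordinates below are assumed for illustration only.
libraryDependencies += "org.apache.spark" %% "spark-core" % "0.8.0-SNAPSHOT" % "provided"
```

With this setup the user's JAR would contain only the application's own classes plus any dependencies that are not already in the shared Spark JAR.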
@mateiz I have tried using the spark-assembly jar, but it seems org.apache.spark.deploy.yarn.Client is not built into it? |
@fxc123 did you build with the environment variable SPARK_YARN set to "true"? You need to do
It might also help to do sbt clean before. |
I do not like the idea of people having to include their own code with the Spark assembly JAR. That makes it impossible to deploy just a Spark JAR that multiple people can share and that, as Matei said, can be cached on the Hadoop nodes instead of being downloaded every time. It would be nice to decide what we are doing with the assemblies, though. The last time I used mvn and tried to use the core assembly, it didn't have the YarnClientImpl that was needed to run on YARN, so I used the yarn JAR. But I don't think it included the REPL stuff. |
@tgravescs the assembly built into the assembly/ directory should now have YARN, the REPL, and all the user libraries in Spark. Try that one out. |
Does the REPL (i.e. spark-shell) work on YARN? How should I run it? |
There's a patch for it (#868), so although we probably won't merge that patch in 0.8.0, let's include it. We will merge it in 0.8.1. |
(More generally, I also wanted to build as few assembly JARs as possible because it's slow. One "runtime environment" one and one "examples" one is the least we could do.) |
@mateiz, I just tried using the assembly (assembly/target/scala-2.9.3/spark-assembly-0.8.0-SNAPSHOT-hadoop0.23.7.jar) with run-example (built using mvn -Phadoop2-yarn package), but it fails to run because it's missing org/apache/hadoop/yarn/client/YarnClientImpl. This class does exist in the yarn module's spark-yarn-0.8.0-SNAPSHOT-shaded.jar. |
Thank you for submitting this pull request. All automated tests for this request have passed. Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/848/ |
This PR has been significantly updated: it now updates the YARN build docs and fixes the build under Maven with Hadoop 0.23.x.
Thank you for submitting this pull request. All automated tests for this request have passed. Refer to this link for build results: http://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/849/ |
Looks good, thanks! |
This PR updates the YARN build docs and fixes the build under Maven with Hadoop 0.23.x