How do you run Spark jobs on YARN from Hue? Getting this to work often takes some trial and error.
Hue leverages Apache Oozie to submit the jobs. It focuses on the yarn-client mode, as Oozie already runs the spark-submit command inside a MapReduce2 task in the cluster. You can read more about the Spark modes here.
Here is how to get started successfully:
PySpark
Simple script with no dependency.
Script with a dependency on another script (e.g. hello imports hello2).
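A minimal sketch of that two-script layout (file contents are illustrative; the SparkContext lines are shown as comments since they only run once the job is actually submitted):

```python
# hello2.py (helper module shipped alongside the main script)
def transform(x):
    """Example business logic kept out of the entry-point script."""
    return x * 2

# hello.py (the script submitted by the Spark action) would contain:
#   from hello2 import transform
#   rdd = sc.parallelize(range(5)).map(transform)
#   print(rdd.collect())
# As long as hello2.py is listed as a File dependency (or sits in the
# workspace), the import resolves when the job runs on the cluster.
result = [transform(x) for x in range(5)]  # same logic, without a SparkContext
```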
For more complex dependencies, like Pandas, have a look at this documentation.
Jars (Java or Scala)
Add the jars as File dependency and specify the name of the main jar:
Another solution is to put your jars in the ‘lib’ directory in the workspace (‘Folder’ icon on the top right of the editor).
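As a sketch, the workspace of the workflow would then look something like this (file names are illustrative):

```
workspace/
├── workflow.xml
└── lib/
    ├── my-spark-app.jar       # main jar, referenced by name in the action
    └── some-dependency.jar    # extra jars picked up from lib automatically
```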
11 Comments
-
Sorry to trouble you, could you help me? When I build a workflow in Oozie using PySpark and Spark SQL, I get a message saying:
The specified datastore driver (“org.apache.derby.jdbc.EmbeddedDriver”) was not found in the CLASSPATH.
Though I added derby-10.10.1.1.jar to /user/oozie/share/lib and other directories, it still fails.
I’m using CDH 5.9.0 and Hue 3.11.
Author
Could you add it to a ‘lib’ directory in the workflow?
Or have the Oozie property ‘oozie.libpath’ point to it in HDFS?
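For the second option, a sketch of the relevant job.properties entries (the share/lib path below is an assumption; adjust it to your HDFS layout):

```
oozie.use.system.libpath=true
oozie.libpath=${nameNode}/user/oozie/share/lib/spark
```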
-
Hey team – I am getting the same error as Jiang Hao. I uploaded hive-site.xml to HDFS and attached it via the File attribute on the Spark action. It seems the Spark action is not reading the hive-site.xml configuration file.
Author
Could you include your hive-site.xml in a directory named ‘lib’ in the workspace of the workflow?
-
I am trying to run a sample Python file using the Spark notebook. It is running into the exception below.
Please help identify the issue.
hue version – hue-3.9.0
livy version – hue-livy-3.9.0
spark version – spark-1.6.1
[[email protected] logs]# tailf hue-mapr-livy_server-sn1.out
11:46:59.596 [ForkJoinPool-1-worker-77] ERROR c.c.hue.livy.server.SessionServlet$ – internal error
java.lang.NullPointerException: null
at com.cloudera.hue.livy.spark.SparkSubmitProcessBuilder.com$cloudera$hue$livy$spark$SparkSubmitProcessBuilder$$fromPath(SparkSubmitProcessBuilder.scala:292) ~[livy-assembly.jar:na]
at com.cloudera.hue.livy.spark.SparkSubmitProcessBuilder.start(SparkSubmitProcessBuilder.scala:270) ~[livy-assembly.jar:na]
at com.cloudera.hue.livy.server.batch.BatchSessionYarn$.apply(BatchSessionYarn.scala:41) ~[livy-assembly.jar:na]
at com.cloudera.hue.livy.server.batch.BatchSessionYarnFactory.create(BatchSessionYarnFactory.scala:31) ~[livy-assembly.jar:na]
at com.cloudera.hue.livy.server.batch.BatchSessionFactory.create(BatchSessionFactory.scala:28) ~[livy-assembly.jar:na]
at com.cloudera.hue.livy.server.batch.BatchSessionFactory.create(BatchSessionFactory.scala:26) ~[livy-assembly.jar:na]
at com.cloudera.hue.livy.server.SessionManager.create(SessionManager.scala:52) ~[livy-assembly.jar:na]
at com.cloudera.hue.livy.server.SessionServlet$$anonfun$17$$anon$2$$anonfun$18.apply(SessionServlet.scala:113) ~[livy-assembly.jar:na]
at com.cloudera.hue.livy.server.SessionServlet$$anonfun$17$$anon$2$$anonfun$18.apply(SessionServlet.scala:112) ~[livy-assembly.jar:na]
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) ~[livy-assembly.jar:na]
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) ~[livy-assembly.jar:na]
at scala.concurrent.forkjoin.ForkJoinTask$AdaptedRunnableAction.exec(ForkJoinTask.java:1417) ~[livy-assembly.jar:na]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:262) ~[livy-assembly.jar:na]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975) ~[livy-assembly.jar:na]
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1478) [livy-assembly.jar:na]
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104) [livy-assembly.jar:na]
10.10.70.180 – – – 09/Nov/2017:11:46:59 -0800 “POST /batches HTTP/1.1” 500 30
The other options like Scala and PySpark are working fine. It is running on YARN.
Author
Sounds like a Livy error; could you ask on the project site? livy.io
-
If I want to print some logs to debug my program, what should I do?
I have tried printing logs via System.out.println and SLF4J log.info, but nothing can be seen in the Livy output log or anywhere else.
Also, it says “It focuses on the yarn-client mode”. Does it run applications in yarn-client mode by default? Can I change the running mode?
-
If I run a Spark job from the command line like below:
/usr/hdp/current/spark2-client/bin/spark-submit --files ./jaas.conf,./fosun_test.keytab \
--driver-java-options "-Djava.security.auth.login.config=./jaas.conf" \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf" \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 \
--class com.fonova.test.StartTest \
--queue default \
--master yarn \
--deploy-mode client \
--driver-memory 2g \
--executor-cores 2 \
--executor-memory 2g \
--num-executors 2 \
/home/fosun_test/test-2.0-SNAPSHOT-shaded.jar
then how could I supply these two lines as arguments?
--driver-java-options "-Djava.security.auth.login.config=./jaas.conf"
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf"
Author
On the top right of the editor, you should see a little cog icon to open the options panel.
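For reference, options entered there end up in the generated Oozie Spark action's spark-opts element, roughly like this sketch (element name per the Oozie Spark action schema; the values are taken from the command above):

```xml
<spark-opts>
  --driver-java-options "-Djava.security.auth.login.config=./jaas.conf"
  --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf"
</spark-opts>
```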