How to Submit Spark jobs with Spark on YARN and Oozie

How do you run Spark jobs with Spark on YARN? Getting this to work often takes some trial and error.

Hue leverages Apache Oozie to submit the jobs. It focuses on the yarn-client mode, as Oozie is already running the spark-submit command in a MapReduce2 task in the cluster. You can read more about the Spark modes here.
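For orientation, the command that the Oozie launcher task runs in yarn-client mode looks roughly like the one assembled below. This is only a sketch: the helper `build_spark_submit` and the file name `hello.py` are hypothetical, and the real launcher passes more options than shown here.

```python
import shlex


def build_spark_submit(app, app_args=(), master="yarn", deploy_mode="client"):
    """Assemble a spark-submit command line as a list of arguments."""
    cmd = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode, app]
    cmd.extend(app_args)
    return cmd


if __name__ == "__main__":
    # hello.py is a placeholder for your own application file.
    command = build_spark_submit("hello.py")
    print(" ".join(shlex.quote(part) for part in command))
```

Older Spark 1.x releases also accepted the combined form `--master yarn-client` in place of the two separate options.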

Here is how to get started successfully:

PySpark

Simple script with no dependency.

oozie-pyspark-simple
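As an illustration, a dependency-free script of the kind meant here could look like the sketch below. The app name and the computation are arbitrary choices, not taken from the screenshot, and the import guard is only there so the file can also be loaded on a machine without Spark.

```python
try:
    from pyspark import SparkContext
except ImportError:  # allows importing this file where Spark is not installed
    SparkContext = None


def square(x):
    """Function applied to every element by map()."""
    return x * x


def main():
    if SparkContext is None:
        print("pyspark is not importable; run this through spark-submit or Oozie")
        return
    sc = SparkContext(appName="oozie-pyspark-simple")  # app name is arbitrary
    try:
        total = sc.parallelize(range(1, 11)).map(square).sum()
        print("Sum of squares 1..10: %d" % total)
    finally:
        sc.stop()


if __name__ == "__main__":
    main()
```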

Script with a dependency on another script (e.g. hello imports hello2).

oozie-pyspark-dependencies
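To make the hello/hello2 case concrete, here is a minimal sketch of the two scripts, shown in one listing so that it runs as a single file. The function `greet` and its message are made up for illustration; in the real two-file layout, hello2.py would be added as an extra file of the action next to hello.py.

```python
# ---- hello2.py: the dependency (hypothetical content) ----
def greet(name):
    return "Hello, %s!" % name


# ---- hello.py: the main script ----
# In the two-file layout this file would start with:
#     from hello2 import greet
try:
    from pyspark import SparkContext
except ImportError:  # allows importing this file where Spark is not installed
    SparkContext = None


def main():
    if SparkContext is None:
        print("pyspark is not importable; run this through spark-submit or Oozie")
        return
    sc = SparkContext(appName="oozie-pyspark-dependencies")  # arbitrary name
    try:
        # greet() is called on the driver only in this sketch.
        print(greet("Oozie"))
    finally:
        sc.stop()


if __name__ == "__main__":
    main()
```

If the imported module is also used inside RDD operations, it has to reach the executors as well, for example through `sc.addPyFile` or the `--py-files` option of spark-submit.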

For more complex dependencies, like Pandas, have a look at this documentation.

Jars (Java or Scala)

Add the jars as File dependency and specify the name of the main jar:

spark-action-jar
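Behind the form, these settings end up in an Oozie Spark action. The sketch below builds a comparable XML fragment in Python to show how the main jar name and the File dependency relate. The schema version `0.2`, the class name `com.example.MyApp`, and the jar path are assumptions for illustration; the workflow that Hue actually generates may differ.

```python
import xml.etree.ElementTree as ET

SPARK_NS = "uri:oozie:spark-action:0.2"  # assumed schema version


def spark_action(name, main_class, main_jar, jar_paths):
    """Build a <spark> action element with one <file> entry per jar."""
    spark = ET.Element("spark", {"xmlns": SPARK_NS})
    fields = [
        ("job-tracker", "${jobTracker}"),
        ("name-node", "${nameNode}"),
        ("master", "yarn"),
        ("mode", "client"),
        ("name", name),
        ("class", main_class),
        ("jar", main_jar),  # name of the main jar, as entered in the form
    ]
    for tag, text in fields:
        ET.SubElement(spark, tag).text = text
    for path in jar_paths:
        # "path#link" makes the file available under the short name "link".
        link = path.rsplit("/", 1)[-1]
        ET.SubElement(spark, "file").text = "%s#%s" % (path, link)
    return spark


if __name__ == "__main__":
    # All names and paths below are hypothetical.
    action = spark_action("MySpark", "com.example.MyApp", "my-app.jar",
                          ["/user/me/jars/my-app.jar"])
    print(ET.tostring(action, encoding="unicode"))
```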

Another solution is to put your jars in the ‘lib’ directory in the workspace (‘Folder’ icon on the top right of the editor).

oozie-spark-lib2

Hue keeps improving the user experience here, and Hue 4 will provide an even simpler solution.

If you have any questions, feel free to comment here or on the hue-user list or @gethue!

7 Comments

  1. Jiang Hao 12 months ago

    Sorry to trouble you, could you help me? When I build a workflow on Oozie using PySpark and Spark SQL, I get a message saying:
    The specified datastore driver (“org.apache.derby.jdbc.EmbeddedDriver”) was not found in the CLASSPATH.
    Though I added derby-10.10.1.1.jar to /user/oozie/share/lib and other directories, it still fails.
    I’m using CDH 5.9.0 and Hue 3.11.

    • Author
      Hue Team 11 months ago

      Could you add it to a ‘lib’ directory in the workflow?
      Or have the Oozie property ‘oozie.libpath’ point to it in HDFS?

  2. Krishna Thirumalasetty 9 months ago

    Hey team – I am getting the same error as Jiang Hao. I uploaded hive-site.xml to HDFS and attached it in the FILE attribute of the Spark action. It seems the Spark action is not reading the hive-site.xml configuration file.

    • Author
      Hue Team 9 months ago

      Could you include your hive-site.xml in a directory named ‘lib’ in the workspace of the workflow?

  3. Gireesh 2 weeks ago

    I am trying to run a sample Python file using the Spark notebook. It runs into the exception below.
    Please help me identify the issue.

    hue version – hue-3.9.0
    livy version – hue-livy-3.9.0
    spark version – spark-1.6.1

    [[email protected] logs]# tailf hue-mapr-livy_server-sn1.out

    11:46:59.596 [ForkJoinPool-1-worker-77] ERROR c.c.hue.livy.server.SessionServlet$ – internal error
    java.lang.NullPointerException: null
    at com.cloudera.hue.livy.spark.SparkSubmitProcessBuilder.com$cloudera$hue$livy$spark$SparkSubmitProcessBuilder$$fromPath(SparkSubmitProcessBuilder.scala:292) ~[livy-assembly.jar:na]
    at com.cloudera.hue.livy.spark.SparkSubmitProcessBuilder.start(SparkSubmitProcessBuilder.scala:270) ~[livy-assembly.jar:na]
    at com.cloudera.hue.livy.server.batch.BatchSessionYarn$.apply(BatchSessionYarn.scala:41) ~[livy-assembly.jar:na]
    at com.cloudera.hue.livy.server.batch.BatchSessionYarnFactory.create(BatchSessionYarnFactory.scala:31) ~[livy-assembly.jar:na]
    at com.cloudera.hue.livy.server.batch.BatchSessionFactory.create(BatchSessionFactory.scala:28) ~[livy-assembly.jar:na]
    at com.cloudera.hue.livy.server.batch.BatchSessionFactory.create(BatchSessionFactory.scala:26) ~[livy-assembly.jar:na]
    at com.cloudera.hue.livy.server.SessionManager.create(SessionManager.scala:52) ~[livy-assembly.jar:na]
    at com.cloudera.hue.livy.server.SessionServlet$$anonfun$17$$anon$2$$anonfun$18.apply(SessionServlet.scala:113) ~[livy-assembly.jar:na]
    at com.cloudera.hue.livy.server.SessionServlet$$anonfun$17$$anon$2$$anonfun$18.apply(SessionServlet.scala:112) ~[livy-assembly.jar:na]
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) ~[livy-assembly.jar:na]
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) ~[livy-assembly.jar:na]
    at scala.concurrent.forkjoin.ForkJoinTask$AdaptedRunnableAction.exec(ForkJoinTask.java:1417) ~[livy-assembly.jar:na]
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:262) ~[livy-assembly.jar:na]
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975) ~[livy-assembly.jar:na]
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1478) [livy-assembly.jar:na]
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104) [livy-assembly.jar:na]
    10.10.70.180 – – – 09/Nov/2017:11:46:59 -0800 “POST /batches HTTP/1.1” 500 30

    • Gireesh 2 weeks ago

      The other options, like Scala and PySpark, are working fine. It is running on YARN.

    • Author
      Hue Team 2 weeks ago

      Sounds like a Livy error. Could you ask on the Livy project? livy.io
