How to Submit Spark jobs with Spark on YARN and Oozie

How do you run Spark jobs with Spark on YARN? Getting it to work often takes some trial and error.

Hue leverages Apache Oozie to submit the jobs. It focuses on the yarn-client mode, as Oozie already runs the spark-submit command in a MapReduce2 task in the cluster. You can read more about the Spark modes here.
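To make the yarn-client setup more concrete, here is a small sketch that assembles the kind of spark-submit command line a Spark action roughly boils down to. It is illustrative only: `build_spark_submit_args` is a hypothetical helper, not part of Hue or Oozie, and the script name, application name and memory setting are placeholder values. The `--master`, `--deploy-mode`, `--name` and `--conf` flags are standard spark-submit options.

```python
def build_spark_submit_args(app, name, deploy_mode="client", conf=None, app_args=None):
    """Return an argv list for spark-submit on YARN (hypothetical helper)."""
    args = ["spark-submit", "--master", "yarn", "--deploy-mode", deploy_mode, "--name", name]
    # Each Spark property becomes its own --conf key=value pair.
    for key, value in sorted((conf or {}).items()):
        args += ["--conf", "%s=%s" % (key, value)]
    # The application (script or jar) comes last, followed by its own arguments.
    args.append(app)
    args += list(app_args or [])
    return args


# Placeholder values; replace them with your own script and settings.
command = build_spark_submit_args(
    "hello.py",
    name="HelloSimple",
    conf={"spark.executor.memory": "1g"},
)
print(" ".join(command))
```

In client mode the driver runs inside the process that launched spark-submit, which in this setup is the Oozie launcher task on the cluster.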

Here is how to get started successfully:

PySpark

Simple script with no dependency.

oozie-pyspark-simple
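As a point of reference, a script with no dependency could look like the sketch below. This is an assumed example, not the script from the screenshot: the file name `hello.py`, the application name and the `square` helper are all made up for illustration. The import guard only exists so the helper can be checked on a machine without Spark installed.

```python
"""hello.py: a minimal PySpark job with no extra dependencies (illustrative)."""

try:
    from pyspark import SparkContext
except ImportError:
    # Spark is not installed here; the helper below can still be imported and tested.
    SparkContext = None


def square(x):
    """Pure function that Spark applies to each element."""
    return x * x


def main():
    sc = SparkContext(appName="HelloSimple")  # the app name is an arbitrary choice
    try:
        # Distribute the numbers 0..9, square each one, and add up the results.
        total = sc.parallelize(range(10)).map(square).sum()
        print("Sum of squares: %d" % total)
    finally:
        sc.stop()


if __name__ == "__main__":
    if SparkContext is None:
        print("pyspark not found; submit this script through Spark instead")
    else:
        main()
```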

Script with a dependency on another script (e.g. hello imports hello2).

oozie-pyspark-dependencies

For more complex dependencies, such as Pandas, have a look at this documentation.

Jars (Java or Scala)

Add the jars as File dependencies and specify the name of the main jar:

spark-action-jar
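Behind the editor, a setup like this ends up as a Spark action in an Oozie workflow. The sketch below builds such an element with Python's standard library, mainly to make the element names explicit. The element names follow the Oozie Spark action schema (version 0.2 is assumed here), but every value is a placeholder (`com.example.Main`, `my-app.jar`, the HDFS path), `spark_action` is a hypothetical helper, and the XML that Hue actually generates may differ.

```python
import xml.etree.ElementTree as ET

SPARK_NS = "uri:oozie:spark-action:0.2"  # assumed schema version


def spark_action(name, main_class, main_jar, files):
    """Build an Oozie <spark> element for a jar-based job (placeholder values)."""
    spark = ET.Element("spark", {"xmlns": SPARK_NS})
    fields = [
        ("job-tracker", "${jobTracker}"),
        ("name-node", "${nameNode}"),
        ("master", "yarn"),
        ("mode", "client"),
        ("name", name),
        ("class", main_class),
        ("jar", main_jar),  # name of the main jar
    ]
    for tag, text in fields:
        ET.SubElement(spark, tag).text = text
    # Each File dependency becomes a <file> element (HDFS path#local name).
    for path in files:
        ET.SubElement(spark, "file").text = path
    return spark


action = spark_action(
    name="MySparkJob",
    main_class="com.example.Main",
    main_jar="my-app.jar",
    files=["/user/hue/jars/my-app.jar#my-app.jar"],
)
xml_text = ET.tostring(action, encoding="unicode")
print(xml_text)
```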

Another solution is to put your jars in the ‘lib’ directory in the workspace (‘Folder’ icon on the top right of the editor).

oozie-spark-lib2

The latest Hue is improving the user experience and will provide an even simpler solution in Hue 4.

If you have any questions, feel free to comment here or on the hue-user list or @gethue!

11 Comments

  1. Jiang Hao 2 years ago

    Sorry to trouble you, could you help me? When I build a workflow in Oozie using PySpark and Spark SQL, I get a message saying:
    The specified datastore driver (“org.apache.derby.jdbc.EmbeddedDriver”) was not found in the CLASSPATH.
    Though I added derby-10.10.1.1.jar to /user/oozie/share/lib and other directories, it still fails.
    I’m using CDH 5.9.0 and Hue 3.11.

    • Author
      Hue Team 2 years ago

      Could you add it to a ‘lib’ directory in the workflow?
      Or have the Oozie property ‘oozie.libpath’ point to it in HDFS?

  2. Krishna Thirumalasetty 2 years ago

    Hey team, I am getting the same error as Jiang Hao. I uploaded hive-site.xml to HDFS and attached it in the File attribute of the Spark action. It seems the Spark action is not reading the hive-site.xml configuration file.

    • Author
      Hue Team 2 years ago

      Could you include your hive-site.xml in a directory named ‘lib’ in the workspace of the workflow?

  3. Gireesh 1 year ago

    I am trying to run a sample Python file using the Spark notebook. It runs into the exception below.
    Please help me identify the issue.

    hue version – hue-3.9.0
    livy version – hue-livy-3.9.0
    spark version – spark-1.6.1

    [[email protected] logs]# tailf hue-mapr-livy_server-sn1.out

    11:46:59.596 [ForkJoinPool-1-worker-77] ERROR c.c.hue.livy.server.SessionServlet$ – internal error
    java.lang.NullPointerException: null
    at com.cloudera.hue.livy.spark.SparkSubmitProcessBuilder.com$cloudera$hue$livy$spark$SparkSubmitProcessBuilder$$fromPath(SparkSubmitProcessBuilder.scala:292) ~[livy-assembly.jar:na]
    at com.cloudera.hue.livy.spark.SparkSubmitProcessBuilder.start(SparkSubmitProcessBuilder.scala:270) ~[livy-assembly.jar:na]
    at com.cloudera.hue.livy.server.batch.BatchSessionYarn$.apply(BatchSessionYarn.scala:41) ~[livy-assembly.jar:na]
    at com.cloudera.hue.livy.server.batch.BatchSessionYarnFactory.create(BatchSessionYarnFactory.scala:31) ~[livy-assembly.jar:na]
    at com.cloudera.hue.livy.server.batch.BatchSessionFactory.create(BatchSessionFactory.scala:28) ~[livy-assembly.jar:na]
    at com.cloudera.hue.livy.server.batch.BatchSessionFactory.create(BatchSessionFactory.scala:26) ~[livy-assembly.jar:na]
    at com.cloudera.hue.livy.server.SessionManager.create(SessionManager.scala:52) ~[livy-assembly.jar:na]
    at com.cloudera.hue.livy.server.SessionServlet$$anonfun$17$$anon$2$$anonfun$18.apply(SessionServlet.scala:113) ~[livy-assembly.jar:na]
    at com.cloudera.hue.livy.server.SessionServlet$$anonfun$17$$anon$2$$anonfun$18.apply(SessionServlet.scala:112) ~[livy-assembly.jar:na]
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) ~[livy-assembly.jar:na]
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) ~[livy-assembly.jar:na]
    at scala.concurrent.forkjoin.ForkJoinTask$AdaptedRunnableAction.exec(ForkJoinTask.java:1417) ~[livy-assembly.jar:na]
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:262) ~[livy-assembly.jar:na]
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975) ~[livy-assembly.jar:na]
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1478) [livy-assembly.jar:na]
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104) [livy-assembly.jar:na]
    10.10.70.180 – – – 09/Nov/2017:11:46:59 -0800 “POST /batches HTTP/1.1” 500 30

    • Gireesh 1 year ago

      The other options, like Scala and PySpark, are working fine. It is running on YARN.

    • Author
      Hue Team 1 year ago

      This sounds like a Livy error. Could you ask on the project (livy.io)?

  4. Chen 11 months ago

    If I want to print some logs to debug my program, what should I do?
    I have tried to print logs via System.out.println or SLF4J log.info, but nothing can be seen in the Livy output log or anywhere else.

  5. Chen 11 months ago

    Also, the post says “It focuses on the yarn-client mode”. Does it run applications in yarn-client mode by default? Can I change the running mode?

  6. gabriele ran 2 months ago

    If I run a Spark job on the command line like below:
    /usr/hdp/current/spark2-client/bin/spark-submit --files ./jaas.conf,./fosun_test.keytab \
    --driver-java-options "-Djava.security.auth.login.config=./jaas.conf" \
    --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf" \
    --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 \
    --class com.fonova.test.StartTest \
    --queue default \
    --master yarn \
    --deploy-mode client \
    --driver-memory 2g \
    --executor-cores 2 \
    --executor-memory 2g \
    --num-executors 2 \
    /home/fosun_test/test-2.0-SNAPSHOT-shaded.jar

    Then how could I supply arguments like these two lines?
    --driver-java-options "-Djava.security.auth.login.config=./jaas.conf" \
    --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf" \

    • Author
      Hue Team 2 months ago

      On the top right of the Editor, you should see a little cog icon that opens the options panel.
