Use the Spark Action in Oozie

Update September 2016: this post is superseded by http://gethue.com/how-to-schedule-spark-jobs-with-spark-on-yarn-and-oozie/

Hue offers a notebook for Hadoop and Spark, but the following steps will guide you through executing a Spark Action from the Oozie Editor.

Run job in Spark Local Mode

To submit a job locally, Spark Master can be one of the following:

  • local: Run Spark locally with one worker thread.
  • local[K]: Run Spark locally with K worker threads.
  • local[*]: Run Spark locally with as many worker threads as there are logical cores on your machine.

Set the Mode to client and provide a local or HDFS jar path in the Jars/py files field. You also need to specify the App name and the Main class of the jar, and add any arguments by clicking the ARGUMENTS+ button.
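
For reference, the action that the editor generates behind the scenes looks roughly like the sketch below; the app name, main class, and jar path are placeholders, not values from this post.

<action name="spark-local">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- local[*]: one worker thread per logical core -->
        <master>local[*]</master>
        <mode>client</mode>
        <name>MySparkApp</name>
        <class>com.example.MyMainClass</class>
        <!-- local or HDFS path to the application jar -->
        <jar>${nameNode}/user/hue/my-app.jar</jar>
        <arg>inputArg1</arg>
    </spark>
    <ok to="End"/>
    <error to="Kill"/>
</action>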

[Screenshot: Spark action configured for local mode]

Note: Spark’s local mode doesn’t run with Kerberos.

Run job on Yarn

To submit a job on a Yarn cluster, you need to change Spark Master to yarn-cluster, Mode to cluster, and give the complete HDFS path for the jar in the Jars/py files field.
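
Relative to the local-mode sketch above, only the master, mode, and jar elements change, for example:

<master>yarn-cluster</master>
<mode>cluster</mode>
<!-- cluster mode requires the complete HDFS path -->
<jar>${nameNode}/user/hue/my-app.jar</jar>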

[Screenshot: Spark action configured for yarn-cluster mode]

Similarly, to submit a job in yarn-client mode, change Spark Master to yarn-client and Mode to client, keeping the rest of the fields the same as above. The jar path can be local or HDFS.

[Screenshot: Spark action configured for yarn-client mode]

Additional Spark action properties can be set by clicking the settings button at the top-right corner before you submit the job.
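
These settings end up in the action as Hadoop configuration properties or as Spark options; a minimal sketch with example values (in the spark-action schema, the configuration element comes before master, and spark-opts comes after jar):

<configuration>
    <property>
        <name>mapred.compress.map.output</name>
        <value>true</value>
    </property>
</configuration>
<spark-opts>--executor-memory 2G --conf spark.eventLog.enabled=true</spark-opts>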

[Screenshot: Spark job running]

Note: If you see the error "Required executor memory (xxxxMB) is above the max threshold...", increase 'yarn.scheduler.maximum-allocation-mb' in the Yarn configuration and restart the Yarn service from CM.
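
If you manage the configuration by hand rather than through CM, the same limit lives in yarn-site.xml; the value below is only an example:

<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <!-- must cover the executor/driver memory plus overhead requested by the job -->
    <value>8192</value>
</property>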

The next version will include HUE-2645, which will make the UI simpler and more intuitive. As usual, feel free to comment on the hue-user list or @gethue!

33 Comments

  1. XiaoBendan 2 years ago

    I have a shell program like this to submit my Spark task:
    #!/bin/bash
    DAY=$1
    if [ "$#" -ne 1 ]; then
    echo "Param \"Day\" required!"
    exit
    else
    echo "Solve ID for Day[$DAY], beginning..."
    fi
    echo $DAY

    HBASE_HOME=/opt/cloudera/parcels/CDH/lib/hbase
    HIVE_HOME=/opt/cloudera/parcels/CDH/lib/hive
    export SPARK_CLASSPATH="$HIVE_HOME/conf:$HBASE_HOME/conf/:$HBASE_HOME/hbase-client.jar:$HBASE_HOME/hbase-protocol.jar:$HBASE_HOME/lib/htrace-core.jar:$HBASE_HOME/lib/htrace-core-3.1.0-incubating.jar" &&
    spark-submit --class "com.gridsum.aud.AudienceIdentitySolverApp" --driver-cores 2 --driver-memory 2G --master yarn-client --executor-memory 15G --conf spark.shuffle.memoryFraction=0.50 --executor-cores 3 --num-executors 14 aud-id-recognize-1.0-SNAPSHOT-jar-with-dependencies.jar solveId.yml $DAY

    But how can I transplant this task to Oozie with Hue?

    • XiaoBendan 2 years ago

      I don't know how to make the export and the spark-submit work in the same action.

    • Hue Team 2 years ago

      It should work, and/or you could try a Shell action with --proxy-user USERNAME to run it as the user you want.
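
      For reference, a minimal sketch of that Shell action, assuming your script is saved as submit-spark.sh (a hypothetical name, as is the ${day} parameter) and uploaded next to the workflow, so that the export and the spark-submit run in the same action:

      <action name="shell-spark-submit">
          <shell xmlns="uri:oozie:shell-action:0.2">
              <job-tracker>${jobTracker}</job-tracker>
              <name-node>${nameNode}</name-node>
              <exec>submit-spark.sh</exec>
              <argument>${day}</argument>
              <!-- ship the script alongside the action -->
              <file>submit-spark.sh#submit-spark.sh</file>
          </shell>
          <ok to="End"/>
          <error to="Kill"/>
      </action>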

  2. XiaoBendan 2 years ago

    How can I edit or delete my comment?? There is something wrong!!!

  3. Ben 2 years ago

    Does this work in Hue 3.7? CDH 5.4.8? I am having problems with it in yarn-client and yarn-cluster modes.

    • Hue Team 2 years ago

      It was tested in CDH 5.5/5.7; the Spark Action in Oozie is still experimental from what we saw.

      • Ben 2 years ago

        I tested the Spark Action in CDH 5.5.2, and it works. It was just CDH 5.4.8 where it didn’t.

        • Hue Team 2 years ago

          Thanks for reporting!

  4. Vincent 2 years ago

    I tried in CDH 5.5.2 to launch a Spark program (org.apache.oozie.example.SparkFileCopy) in a workflow. It seems to work, because the Spark job executes well, but my workflow doesn't finish; the status stays "suspended" all the time.

    I tried a "Dry run" too and I have the same issue.

    Have you any idea ?

    Thanks

    • Hue Team 2 years ago

      Did you look at the Oozie logs? Do you have any YARN workers? Is any memory lacking?

  5. Vincent 2 years ago

    Yes, you are right, I have this error in the Oozie logs, but I don't see why it doesn't work for this Oozie Spark job, because other jobs work fine:

    Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=mapred, access=READ, inode="/user/history/done_intermediate/vincent.moreno/job_1457994974205_1851-1462378126124-vincent.moreno-oozie%3Alauncher%3AT%3Dspark%3AW%3DSpark%3AA%3Dspark%2Dd909%3AID%3D000-1462378163221-1-0-SUCCEEDED-root.vincent_dot_moreno-1462378132610.jhist":vincent.moreno:supergroup:-rwxrwx---

  6. Vincent 2 years ago

    I tried to change the permissions on the folder during the job execution (776) and the job runs fine. Do you have any idea why the files don't have the right permissions?
    Thanks a lot.

  7. fslan 1 year ago

    Hi All,

    I got an error when I tried to run a Spark job with Oozie in the Hue interface. The error is below; I have been searching for this error on the internet but haven't found any useful information yet. Does any of you have this error or know how to solve it? If so, please help. Thanks.

    =================================================================

    >>> Invoking Spark class now >>>

    Intercepting System.exit(1)

    <<< Invocation of Main class completed <<<

    Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], exit code [1]

    Oozie Launcher failed, finishing Hadoop job gracefully

    Oozie Launcher, uploading action data to HDFS sequence file: hdfs://localhost:8020/user/training/oozie-oozi/0000021-160531235012261-oozie-oozi-W/spark-c3a2--spark/action-data.seq

    Oozie Launcher ends

    • Hue Team 1 year ago

      In the Job Browser, if you look at the logs of the Oozie launcher, what do you see?

  8. yashwanth 1 year ago

    I uploaded the jar file to HDFS and tried to run it, but I got the exception below:

    Warning: Local jar /user/yxr6907/sparkhbase-0.0.1-SNAPSHOT.jar does not exist, skipping.

    I also tried copying the same jar into my local folder and giving the local path, but I got the same error.

    Am I doing anything wrong here?
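
    (For reference, and only a guess from the warning: the Jars/py files field has to resolve to the jar's full HDFS location, i.e. a jar element like the sketch below, where ${nameNode} expands to the hdfs://host:port prefix.)

    <jar>${nameNode}/user/yxr6907/sparkhbase-0.0.1-SNAPSHOT.jar</jar>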

  9. Abhishek Gupta 1 year ago

    Hi, I am able to run the Spark job from the command line using the command below:
    spark-submit \
    --master yarn-client \
    --class com.vw.hy.classname \
    --properties-file /etc/path/spark.conf \
    --files /etc/path/log4j.properties \
    --conf "spark.executor.extraJavaOptions=-Dconfig.resource=application.conf -Dlog4j.configuration=log4j.properties" \
    --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
    --driver-java-options -Dconfig.file=/etc/path/application.conf \
    /opt/vw/path/to/jar/main.jar
    When I create the Oozie workflow, where do I put options like conf, driver-java-options, and files? I couldn't find where to put them in Hue either.
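
    (For reference, a sketch of where these flags would land: the Options list field in Hue maps to the spark-opts element of the generated action. Whether every flag is honored depends on the Oozie/Spark versions, as Miles notes below, and the sketch assumes spark.conf, log4j.properties, and application.conf are uploaded next to the workflow.)

    <spark-opts>--properties-file spark.conf --files log4j.properties --conf spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties --conf spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties --driver-java-options -Dconfig.file=application.conf</spark-opts>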

  10. Miles Y. 1 year ago

    Looks like CDH 5.7 / Hue 3.9 has yet to implement the following (mainly around log4j config):

    * A separate file upload list in the UI for --files and --properties-file, in particular for log4j.properties and an app-specific config file (this seems to be added in your latest implementation?).

    * The Spark Action also doesn't seem to support overriding Spark properties as on the command line: [--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties"]. Adding these to the UI Properties field seems to have no effect, nor does adding the above verbatim to the "Options list" field; the former elicits the message "Warning: Ignoring non-spark config property" in the driver stderr.

    Thanks,
    Miles

  11. Thomas 1 year ago

    Hi,

    I'm running a Spark job on a Yarn cluster. The workflow starts, but the Spark job stays in the "ACCEPTED" status with "The application might not be running yet or there is no Node Manager or Container available". This is the only thing running on the EMR cluster; there are no log files for the Spark job and no errors in the workflow logs.

    • Hue Team 1 year ago

      Do you have at least one YARN node manager up?

      • Thomas 1 year ago

        Yes, I've got a little bit further now. It went to the running state, but now it fails at 5% with the error java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.internal.SessionState'. I did some testing and it seems to give the error the moment I use something from spark.sql. Has anyone had this issue before? I've run this job as a step on an EMR cluster and with spark-submit without a problem.

        • Saurab 12 months ago

          I am facing the same issue. Was your issue resolved with spark-sql?

  12. Ashish Kumar Singh 1 year ago

    Hello,

    I want to invoke the Spark Scala Kafka consumer through an Oozie Spark action:
    spark-submit --class org.sabre.consumer.PSSConsumer \
    --master yarn-cluster \
    --driver-memory 4G \
    --executor-memory 3G \
    Consumer-0.0.1-SNAPSHOT.jar bqrhlc130:9092,bqrhlc140:9092 air_sell_segment_pss /data/airsell/segment/landing/pss

    where the first parameter is the list of brokers "bqrhlc130:9092,bqrhlc140:9092",
    the second parameter is the name of the topic "air_sell_segment_pss",
    and the third parameter is the landing directory "/data/airsell/segment/landing/pss".

    Please help me out with where to put the first, second, and third parameters in the workflow.
    I am using the job.properties below:
    nameNode=xxxxx
    jobTracker=yyyyy
    oozie.wf.application.path=/spark/sparkOozie/
    oozie.use.system.libpath=true
    master=yarn
    mode=client
    broker1=bqrhlc130:9092
    broker2=bqrhlc140:9092
    topic=air_sell_segment_pss

    This is my workflow.xml:

    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>

        <configuration>
            <property>
                <name>mapred.compress.map.output</name>
                <value>true</value>
            </property>
        </configuration>

        <master>${master}</master>
        <mode>${mode}</mode>
        <name>PSSConsumer</name>
        <class>org.sabre.consumer.PSSConsumer</class>
        <jar>${nameNode}/spark/sparkOozie/lib/Consumer-0.0.1-SNAPSHOT.jar</jar>
        <spark-opts>--driver-memory 1G --executor-memory 1G --num-executors 3 --conf spark.eventLog.dir=${nameNode}/user/spark/applicationHistory --conf spark.yarn.historyServer.address=http:${nameNode}:18088 --conf spark.eventLog.enabled=true</spark-opts>
        <arg>${broker1}</arg>
        <arg>${broker2}</arg>
        <arg>${topic}</arg>
        <arg>${nameNode}/data/airsell/segment/landing/pss_test_oozie</arg>
    </spark>

  13. naveen 10 months ago

    Hi,

    We have around 50 sources. We have one Spark jar built to run the process for all of the sources at different frequencies. Each run, the jar accepts one source and one frequency.

    We have to take the source name and frequency information from an HBase table.

    We would like to schedule the job from the Hue dashboard. How can we pass those values dynamically in Hue with a single workflow? Kindly help.
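
    (A common pattern, sketched here without testing: leave the source and frequency unbound as workflow parameters in the Spark action's argument list; Hue then prompts for their values each time the workflow is submitted, and a coordinator can supply them per run.)

    <!-- in the Spark action's argument list -->
    <arg>${source}</arg>
    <arg>${frequency}</arg>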

  14. maxiulin 4 months ago

    Hello, Hue team. Please tell me: can I use Spark 2 in Oozie via Hue?

  15. naveen 2 months ago

    Hi, I am not getting the Oozie workflow in my project.
