How to Submit Spark jobs with Spark on YARN and Oozie

23 August 2016 in Scheduling / Querying - 1 minute read

How to run Spark jobs with Spark on YARN? This often requires trial and error in order to make it work.

Hue is leveraging Apache Oozie to submit the jobs. It focuses on the yarn-client mode, as Oozie is already running the spark-summit command in a MapReduce2 task in the cluster. You can read more about the Spark modes here.

Here is how to get started successfully:

PySpark

Simple script with no dependency.

Script with a dependency on another script (e.g. hello imports hello2).

For more complex dependencies, like Panda, have a look at this documentation.

 

Jars (Java or Scala)

Add the jars as File dependency and specify the name of the main jar:

Another solution is to put your jars in the ‘lib’ directory in the workspace (‘Folder’ icon on the top right of the editor).

 

The latest Hue is improving the user experience and will provide an even simpler solution in Hue 4.

If you have any questions, feel free to comment here or on the hue-user list or @gethue!


comments powered by Disqus

More recent stories

25 December 2019
A more collaborating Datawarehousing Experience with SQL query sharing via links or gists
Read More
05 December 2019
Hue 4.6 and its improvements are out!
Read More
13 November 2019
Visually surfacing SQL information like Primary Keys, Foreign Keys, Views and Complex Types
Read More