How to use the Livy Spark REST Job Server API for doing some interactive Spark with curl

Published on 24 September 2015 in Development - 5 minutes read - Last modified on 19 April 2021

Livy is an open source REST interface for using Spark from anywhere.

Note: Livy is not supported in CDH, only in the upstream Hue community.

 

It supports executing snippets of code or programs in a Spark Context that runs locally or in YARN. This makes it ideal for building applications or Notebooks that can interact with Spark in real time. For example, it is currently used for powering the Spark snippets of the Hadoop Notebook in Hue.

In this post we see how we can execute some Spark 1.5 snippets in Python.

 

Livy sits between the remote users and the Spark cluster

 

Starting the REST server

Based on the README, we check out Livy's code. It currently lives in the Hue repository for simplicity, but will hopefully graduate into its own top-level project.

git clone git@github.com:cloudera/hue.git

Then we compile Livy with

cd hue/apps/spark/java

mvn -DskipTests clean package

Export these variables

export SPARK_HOME=/usr/lib/spark

export HADOOP_CONF_DIR=/etc/hadoop/conf

And start it

./bin/livy-server

Note: Livy defaults to Spark local mode; to use YARN mode, copy the configuration template file apps/spark/java/conf/livy-defaults.conf.tmpl to livy-defaults.conf and set the property:

livy.server.session.factory = yarn
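For example (a minimal sketch; paths are relative to the Livy checkout from above):

cp apps/spark/java/conf/livy-defaults.conf.tmpl apps/spark/java/conf/livy-defaults.conf

# then edit livy-defaults.conf and set:
#   livy.server.session.factory = yarn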

 

Executing some Spark

Now that the REST server is running, we can communicate with it. We are on the same machine, so we will use 'localhost' as the address of Livy.

Let's list our open sessions

curl localhost:8998/sessions

{"from":0,"total":0,"sessions":[]}

Note

You can append

 | python -m json.tool

to the command to prettify the output (adding -s silences curl's progress meter when piping), e.g.:

curl -s localhost:8998/sessions/0 | python -m json.tool

 

There are no open sessions yet, so we create an interactive PySpark session:

curl -X POST --data '{"kind": "pyspark"}' -H "Content-Type: application/json" localhost:8998/sessions

{"id":0,"state":"starting","kind":"pyspark","log":[]}

 

Session ids are incrementing numbers starting from 0. We can then reference the session later by its id.

Livy supports the three languages of Spark:

Language   Kind
Scala      spark
Python     pyspark
R          sparkr

 

We check the status of the session until its state becomes idle, which means it is ready to execute snippets of PySpark:

curl -s localhost:8998/sessions/0 | python -m json.tool


{
    "id": 0,
    "kind": "pyspark",
    "log": [
        "15/09/03 17:44:14 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.",
        "15/09/03 17:44:14 INFO ui.SparkUI: Started SparkUI at http://172.21.2.198:4040",
        "15/09/03 17:44:14 INFO spark.SparkContext: Added JAR file:/home/romain/projects/hue/apps/spark/java-lib/livy-assembly.jar at http://172.21.2.198:33590/jars/livy-assembly.jar with timestamp 1441327454666",
        "15/09/03 17:44:14 WARN metrics.MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.",
        "15/09/03 17:44:14 INFO executor.Executor: Starting executor ID driver on host localhost",
        "15/09/03 17:44:14 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 54584.",
        "15/09/03 17:44:14 INFO netty.NettyBlockTransferService: Server created on 54584",
        "15/09/03 17:44:14 INFO storage.BlockManagerMaster: Trying to register BlockManager",
        "15/09/03 17:44:14 INFO storage.BlockManagerMasterEndpoint: Registering block manager localhost:54584 with 530.3 MB RAM, BlockManagerId(driver, localhost, 54584)",
        "15/09/03 17:44:15 INFO storage.BlockManagerMaster: Registered BlockManager"
    ],
    "state": "idle"
}
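Rather than re-running the command by hand, a small shell loop can do the polling (a sketch; it assumes the session id is 0 and uses Python only to extract the state field from the JSON):

# poll every second until the session reports the 'idle' state
until [ "$(curl -s localhost:8998/sessions/0 |
           python -c 'import sys, json; print(json.load(sys.stdin)["state"])')" = "idle" ]; do
  sleep 1
done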

 

In YARN mode, Livy creates a remote Spark Shell in the cluster that can be accessed easily with REST

 

When the session state is idle, it means it is ready to accept statements! Let's compute 1 + 1:

curl localhost:8998/sessions/0/statements -X POST -H 'Content-Type: application/json' -d '{"code":"1 + 1"}'

{"id":0,"state":"running","output":null}

We check the result of statement 0 once its state becomes available:

curl localhost:8998/sessions/0/statements/0

{"id":0,"state":"available","output":{"status":"ok","execution_count":0,"data":{"text/plain":"2"}}}

Note

If the statement takes less than a few milliseconds, Livy returns the result directly in the response of the POST command.

Statement ids are incrementing too, and all statements share the same context, so we can run a sequence:

curl localhost:8998/sessions/0/statements -X POST -H 'Content-Type: application/json' -d '{"code":"a = 10"}'

{"id":1,"state":"available","output":{"status":"ok","execution_count":1,"data":{"text/plain":""}}}

The context spans multiple statements, so we can reference the variable in a later statement:

curl localhost:8998/sessions/0/statements -X POST -H 'Content-Type: application/json' -d '{"code":"a + 1"}'

{"id":2,"state":"available","output":{"status":"ok","execution_count":2,"data":{"text/plain":"11"}}}

 

Let's close the session to free up the cluster. Note that Livy automatically expires idle sessions after 1 hour (configurable).

curl localhost:8998/sessions/0 -X DELETE

{"msg":"deleted"}

 

Impersonation

Let's say we want to create a shell running as the user bob; this is particularly useful when multiple users are sharing a Notebook server:

curl -X POST --data '{"kind": "pyspark", "proxyUser": "bob"}' -H "Content-Type: application/json" localhost:8998/sessions

{"id":0,"state":"starting","kind":"pyspark","proxyUser":"bob","log":[]}

Do not forget to add the user running Hue (your current login in dev, or hue in production) to the Hadoop proxy user list in /etc/hadoop/conf/core-site.xml (a restart or refresh of the Hadoop services is typically needed for the change to take effect):



<property>
  <name>hadoop.proxyuser.hue.hosts</name>
  <value>*</value>
</property>

<property>
  <name>hadoop.proxyuser.hue.groups</name>
  <value>*</value>
</property>



Additional properties

All the properties supported by the Spark shells, like the number of executors, the memory, etc., can be changed at session creation. Their format is the same as when typing spark-shell -h:

curl -X POST --data '{"kind": "pyspark", "numExecutors": "3", "executorMemory": "2G"}' -H "Content-Type: application/json" localhost:8998/sessions

{"id":0,"state":"starting","kind":"pyspark","numExecutors":"3","executorMemory":"2G","log":[]} 

 

And that's it! Next time we will explore some more advanced features like the magic keywords for introspecting data or printing images. Then, we will detail how to do batch submissions in compiled Scala, Java or Python (i.e. jar or py files).

The architecture of Livy was presented for the first time at Big Data Scala by the Bay last August and next updates will be at the Spark meetup before Strata NYC and Spark Summit in Amsterdam.

 

Feel free to ask any questions about the architecture or usage of the server in the comments, at @gethue or on the hue-user list. And pull requests are always welcome!

 

 

