How to use the Livy Spark REST Job Server API for doing some interactive Spark with curl

Livy is an open source REST interface for using Spark from anywhere.

Note: Livy is not supported in CDH, only in the upstream Hue community.

 

It supports executing snippets of code or programs in a Spark Context that runs locally or in YARN. This makes it ideal for building applications or Notebooks that can interact with Spark in real time. For example, it is currently used for powering the Spark snippets of the Hadoop Notebook in Hue.

In this post we see how we can execute some Spark 1.5 snippets in Python.

 


Livy sits between the remote users and the Spark cluster

 

Starting the REST server

Based on the README, we check out Livy’s code. It currently lives in the Hue repository for simplicity, but will hopefully graduate into its own top-level project.

git clone git@github.com:cloudera/hue.git

Then we compile Livy with

cd hue/apps/spark/java
mvn -DskipTests clean package

Export these variables

export SPARK_HOME=/usr/lib/spark
export HADOOP_CONF_DIR=/etc/hadoop/conf

And start it

./bin/livy-server

Note: Livy defaults to Spark local mode. To use YARN mode, copy the configuration template file apps/spark/java/conf/livy-defaults.conf.tmpl to livy-defaults.conf and set the property:

livy.server.session.factory = yarn
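
For example, a minimal sketch of those two steps, run from the root of the hue checkout:

cp apps/spark/java/conf/livy-defaults.conf.tmpl apps/spark/java/conf/livy-defaults.conf
echo 'livy.server.session.factory = yarn' >> apps/spark/java/conf/livy-defaults.conf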

 

Executing some Spark

Now that the REST server is running, we can communicate with it. We are on the same machine, so we will use ‘localhost’ as the address of Livy.

Let’s list our open sessions

curl localhost:8998/sessions

{"from":0,"total":0,"sessions":[]}

Note
You can use

 | python -m json.tool

at the end of the command to prettify the output, e.g.:

curl localhost:8998/sessions/0 | python -m json.tool
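
Adding curl’s -s flag also hides the progress meter that curl otherwise prints (visible in the outputs further below), e.g.:

curl -s localhost:8998/sessions/0 | python -m json.tool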

 

There are no sessions yet. Let’s create an interactive PySpark session:

curl -X POST --data '{"kind": "pyspark"}' -H "Content-Type: application/json" localhost:8998/sessions

{"id":0,"state":"starting","kind":"pyspark","log":[]}

 

Session ids are incrementing numbers starting from 0. We can then reference the session later by its id.

Livy supports the three languages of Spark:

Kind      Language
spark     Scala
pyspark   Python
sparkr    R

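For example, an interactive Scala shell would be requested with the spark kind, in exactly the same way as the PySpark session above (a minimal sketch):

curl -X POST --data '{"kind": "spark"}' -H "Content-Type: application/json" localhost:8998/sessions
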
 

We check the status of the session until its state becomes idle, which means it is ready to execute snippets of PySpark:

curl localhost:8998/sessions/0 | python -m json.tool


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1185    0  1185    0     0  72712      0 --:--:-- --:--:-- --:--:-- 79000
{
    "id": 5,
    "kind": "pyspark",
    "log": [
       "15/09/03 17:44:14 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.",
       "15/09/03 17:44:14 INFO ui.SparkUI: Started SparkUI at http://172.21.2.198:4040",
       "15/09/03 17:44:14 INFO spark.SparkContext: Added JAR file:/home/romain/projects/hue/apps/spark/java-lib/livy-assembly.jar at http://172.21.2.198:33590/jars/livy-assembly.jar with timestamp 1441327454666",
       "15/09/03 17:44:14 WARN metrics.MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.",
       "15/09/03 17:44:14 INFO executor.Executor: Starting executor ID driver on host localhost",
       "15/09/03 17:44:14 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 54584.",
       "15/09/03 17:44:14 INFO netty.NettyBlockTransferService: Server created on 54584",
       "15/09/03 17:44:14 INFO storage.BlockManagerMaster: Trying to register BlockManager",
       "15/09/03 17:44:14 INFO storage.BlockManagerMasterEndpoint: Registering block manager localhost:54584 with 530.3 MB RAM, BlockManagerId(driver, localhost, 54584)",
       "15/09/03 17:44:15 INFO storage.BlockManagerMaster: Registered BlockManager"
    ],
    "state": "idle"
}

 


In YARN mode, Livy creates a remote Spark Shell in the cluster that can be accessed easily with REST

 

When the session state is idle, it is ready to accept statements! Let’s compute 1 + 1:

curl localhost:8998/sessions/0/statements -X POST -H 'Content-Type: application/json' -d '{"code":"1 + 1"}'

{"id":0,"state":"running","output":null}

We check the result of statement 0 once its state becomes available:

curl localhost:8998/sessions/0/statements/0

{"id":0,"state":"available","output":{"status":"ok","execution_count":0,"data":{"text/plain":"2"}}}

Note: if the statement takes less than a few milliseconds, Livy returns the result directly in the response of the POST command.

Statement ids are also incrementing, and all statements share the same context, so we can run a sequence:

curl localhost:8998/sessions/0/statements -X POST -H 'Content-Type: application/json' -d '{"code":"a = 10"}'

{"id":1,"state":"available","output":{"status":"ok","execution_count":1,"data":{"text/plain":""}}}

The same context carries over to later statements:

curl localhost:8998/sessions/5/statements -X POST -H 'Content-Type: application/json' -d '{"code":"a + 1"}'

{"id":2,"state":"available","output":{"status":"ok","execution_count":2,"data":{"text/plain":"11"}}}

 

Let’s close the session to free up the cluster. Note that Livy automatically closes idle sessions after 1 hour (configurable).

curl localhost:8998/sessions/0 -X DELETE

{"msg":"deleted"}

 

Impersonation

Let’s say we want to create a shell running as the user bob. This is particularly useful when multiple users are sharing a Notebook server:

curl -X POST --data '{"kind": "pyspark", "proxyUser": "bob"}' -H "Content-Type: application/json" localhost:8998/sessions
{"id":0,"state":"starting","kind":"pyspark","proxyUser":"bob","log":[]}

Do not forget to add the user running Hue (your current login in dev, or hue in production) to the Hadoop proxy user list (/etc/hadoop/conf/core-site.xml):

<property>
  <name>hadoop.proxyuser.hue.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hue.groups</name>
  <value>*</value>
</property>

Additional properties

All the properties supported by the Spark shells, like the number of executors, the memory, etc., can be set at session creation. Their format is the same as when typing spark-shell -h:

curl -X POST --data '{"kind": "pyspark", "numExecutors": "3", "executorMemory": "2G"}' -H "Content-Type: application/json" localhost:8998/sessions
{"id":0,"state":"starting","kind":"pyspark","numExecutors":"3","executorMemory":"2G","log":[]} 

 

And that’s it! Next time we will explore some more advanced features like the magic keywords for introspecting data or printing images. Then, we will detail how to do batch submissions in compiled Scala, Java or Python (i.e. jar or py files).

The architecture of Livy was presented for the first time at Big Data Scala by the Bay last August and next updates will be at the Spark meetup before Strata NYC and Spark Summit in Amsterdam.

 

Feel free to ask any questions about the architecture or usage of the server in the comments, at @gethue, or on the hue-user list. And pull requests are always welcome!

 

 

46 Comments

  1. Peter Rudenko 2 years ago

    Does it support multiple spark contexts?

    • Hue Team 2 years ago

      Livy can manage multiple spark sessions, which each have their own contexts, but at the moment it doesn’t support a single session having multiple contexts.

  2. Ruslan 2 years ago

    That’s a great feature.
    Would Hue / Livy Server close a Spark Context once user closes a Spark Notebook page?
    Similarly as it closes e.g. connection to Impala when a page with SQL is closed.
    Otherwise I can see eventually we’ll have a lot of orphan sessions, and yarn resources will be exhausted quickly.

    • Hue Team 2 years ago

      You shouldn’t have to worry about that. Livy has two mechanisms to deal with this. First, closing a session will tear down the Spark Context. Second, there is a timer that will kill sessions if they haven’t received any activity in the past hour. This is configurable with the `livy.server.session.timeout` option.

      • Ruslan 2 years ago

        Thanks for explaining. That sounds great. Hadoop Notebooks is the most anticipated feature of Hue 3.9 release (we wait for CDH 5.5 to be released). Livy Server is available in 3.9 too, not just 3.10, right?

        • Hue Team 2 years ago

          Yes, Livy is available in 3.9. It’s in active development, so if you do run into any problems, please also check the master branch to see if we’ve already fixed whatever problems you may encounter.

  3. sumit 2 years ago

    Does it need – Cloudera to be installed
    Can I try it on a Cloudera VM just for testing purposes ?

  4. sumit 2 years ago

    Thanks for the answers – appreciated – some more questions please

    – I have explored Spark Job Server from OOYALA – how is this different currently or in the future – from Spark Job Server ?

    • Hue Team 2 years ago

      The main use case of Livy is to launch interactive Spark shells inside YARN, whereas, last I checked, the Ooyala Job Server is mainly about launching batch jar Spark jobs (which Livy can also do). So there is no need to implement an interface and compile your code, you can just submit snippets. Livy supports PySpark and R too. Also, Livy runs the drivers in the YARN cluster, so if Livy crashes we don’t lose the current jobs. There are plans to integrate with additional backends/protocols.

  5. Ashish 2 years ago

    Is there a way to submit python script (.py) with the post request instead of writing raw code as {“code”: “a+1”}?

  6. Damien Carol 2 years ago

    Does Livy can use dynamic allocation with YARN?

  7. lonely7345 2 years ago

    When I use a Hue notebook to execute a Scala program for Spark, it’s successful,
    but when I open another browser to execute the same program, it fails.
    The /api/newsessions call returns a 504 Gateway Timeout.

    I must kill the session in the first page, and then the other one will run.
    Does the Livy server only support one session?

    • Hue Team 2 years ago

      It supports multiple sessions. Are you using the very latest version?

  8. sashi 2 years ago

    Hi Im getting the following error when trying to access spark notebook from hue.

    Error: org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at: org.apache.spark.SparkContext.(SparkContext.scala:81) com.cloudera.hue.livy.repl.scala.interpreter.Interpreter.start(Interpreter.scala:75) com.cloudera.hue.livy.repl.scala.SparkSession.(SparkSession.scala:41) com.cloudera.hue.livy.repl.scala.SparkSession$.create(SparkSession.scala:31) com.cloudera.hue.livy.repl.ScalatraBootstrap.init(Main.scala:106) org.scalatra.servlet.ScalatraListener.configureCycleClass(ScalatraListener.scala:67)

    Can you please help me in resolving the error :

  9. Ravi 2 years ago

    Hi,
    I am getting a gateway timeout when I run curl -X POST --data '{"kind": "pyspark"}' -H "Content-Type: application/json" localhost:8998/sessions
    Before that, everything was working perfectly fine. What could be the reason?

  10. Federico Ponzi 2 years ago

    Hi,
    For some reason, this code is not working:
    curl localhost:8998/sessions/2/statements -X POST -H 'Content-Type: application/json' -d '{"code":"def r():\n i=1 + 1\n return i\n"}'

    Basically it doesn’t work if a statement like def or for has more than one line of definition. I need to run a larger program using Livy. How can I resolve this? I also need the impersonation feature.

    Thanks a lot for help

  11. Riya 2 years ago

    Hi team,

    What is the difference between "livy.server.session.factory = yarn" mode and "livy.server.session.factory = process" mode? And I believe that if this property is not set, the value defaults to local. How is the execution different in these three cases?

    • Hue Team 2 years ago

      Like the Spark options: yarn-cluster mode or local mode

    • Hue Team 2 years ago

      Thanks & Updated!

  12. lidl 1 year ago

    Hi team.
    I set conf `ivy.server.session.factory = yarn`
    But when I create session, response message is :
    `URI ‘local:spark’ is not supported by any registered client factories.`

    Any advise?

  13. Mahdi 1 year ago

    Hi

    I have couple of questions

    1) How do I run livy server so when I close my session it doesn’t stop? Right now, I have to SSH to one of the hosts and run Livy server and then I can use it.

    2) which node of the cluster I should do the livy server installation? which role? My Spark home is set to /opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/lib/spark/ is this a correct practice ?

    • Author
      Hue Team 1 year ago

      #1 Hue automatically sends a close to tell Livy to close the session when leaving the Notebook
      #2 Livy is not in CDH yet, so it does not have any role. Any host with a Spark Gateway will work

  14. Carlos Barichello 12 months ago

    This web page shows that statements are incrementing and all share the same context. It then goes on to show an example where the variable “a” gets a value of 10 in session 0. Next a statement in session 5 adds “a” to 1. That implies that a variable spans sessions. Did you mean for the second line (a+1) to be in session 0 instead of session 5?

    • Author
      Hue Team 12 months ago

      There is only one session here but multiple snippets of code. The snippets could all be in the same box, but by putting them in individual boxes they can be executed separately rather than all at once.

  15. hwy 12 months ago

    Hi, I am using Hue 3.11. How can I configure Spark SQL in Hue with impersonation? I have configured Spark SQL through the Spark Thrift Server. Thanks, best regards!

    • Author
      Hue Team 12 months ago

      In the hue.ini config, you would need to configure and uncomment:

      [beeswax]
      hive_server_host=localhost
      hive_server_port=10000


      [notebook]
      [[[sparksql]]]
      name=SparkSql
      interface=hiveserver2

  16. Diego 11 months ago

    Hi,
    Regarding comparison with Spark Job Server.

    I’d like a live application that would allow me to run jars, but keeping a Spark Context alive, mainly to keep some variables broadcast in memory alive between Spark jobs. This is possible in Spark Job Server. How would this be done in Livy?

    • Author
      Hue Team 11 months ago

      I would recommend to ask the livy.io people, the project was moved there!

  17. Azer ILA 11 months ago

    Hi

    I want to write an interactive spark driver that accomplish some operations such as reading from hdfs to data frame and execute some query on it and I want an external client (e.g. an open source dashboard or a java program ) to communicate with driver program, send request and receive results in form of some format such as JSON. Can I develop this scenario with livy.

  18. shaozhipeng 9 months ago

    I need help. When the Livy server started, I ran curl bigdata1:8998/sessions and then
    curl -X POST --data '{"kind": "pyspark"}' -H "Content-Type: application/json" bigdata1:8998/sessions
    curl bigdata1:8998/sessions/0 | python -m json.tool
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100   309  100   309    0     0  10953      0 --:--:-- --:--:-- --:--:-- 11444
    {
        "appId": "application_1482889733217_0003",
        "appInfo": {
            "driverLogUrl": "http://bigdata3:8042/node/containerlogs/container_1482889733217_0003_01_000001/hadoop",
            "sparkUiUrl": "http://bigdata1:8088/proxy/application_1482889733217_0003/"
        },
        "id": 0,
        "kind": "pyspark",
        "log": [],
        "owner": null,
        "proxyUser": null,
        "state": "idle"
    }

    But when I run curl bigdata1:8998/sessions/0/statements -X POST -H 'Content-Type: application/json' -d '{"code":"1 + 1"}'
    it returns
    {"id":0,"state":"waiting","output":null}
    and then
    curl bigdata1:8998/sessions/0 | python -m json.tool
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100   310  100   310    0     0   8478      0 --:--:-- --:--:-- --:--:--  9117
    {
        "id": 0,
        "appId": "application_1482975521942_0001",
        "owner": null,
        "proxyUser": null,
        "state": "error",
        "kind": "pyspark",
        "appInfo": {
            "driverLogUrl": "http://bigdata3:8042/node/containerlogs/container_1482975521942_0001_01_000001/hadoop",
            "sparkUiUrl": "http://bigdata1:8088/proxy/application_1482975521942_0001/"
        },
        "log": []
    }

    • Author
      Hue Team 9 months ago

      Livy was moved to its own project: livy.io, we recommend to ask Livy specific questions there.

  19. wuchang 5 months ago

    I have configured hue proxyuser in core-site.xml ,but any user login into hue can remove other users’ data

    <property>
      <name>hadoop.proxyuser.hue.hosts</name>
      <value>*</value>
    </property>
    <property>
      <name>hadoop.proxyuser.hue.groups</name>
      <value>*</value>
    </property>

    from notebook pyspark:
    For example , user named ‘hue’ login into hue system, and he new a notebook and the pyspark code is like this:
    import os
    os.system('hadoop fs -rm -r /user/appuser/test.dat')

    this pyspark code can remove the data of user named ‘appuser’.
    the permission info of test.data is:
    -rw-r--r-- 2 appuser supergroup 0 2017-05-02 10:12 /user/appuser/test.

    my hue version is 3.11.0

    Anyone can give me some suggestions?

    • Author
      Hue Team 5 months ago

      In Job Browser, what is the username of the running Spark job?

  20. David 4 months ago

    When I try to use … POST --data '{"kind":"pyspark"}' -H "Content-Type: application/json" … (as was instructed in the create an interactive pyspark section), I am getting an error that says that 'kind' is not a recognizable field. Why could this be?

  21. Ute 2 months ago

    Can I run python code using the same Livy session in parallel in Spark ?
