A Spark SQL Editor via Hue and the Spark SQL Server

Published on 31 December 2020 in Tutorial / Version 4.9 / Spark SQL - 3 minutes read - Last modified on 06 March 2021 - Read in jp

Write and Execute some Spark SQL quickly in your own Web Editor.

Initially published on https://medium.com/data-querying/a-sparksql-editor-via-hue-and-the-spark-sql-server-f82e72bbdfc7

Apache Spark is popular for wrangling/preparing data, especially when embedding some SQL snippets to keep the data manipulation programs declarative and simpler.

One good news is that the SQL syntax is very similar to Apache Hive so the very powerful Hive autocomplete of Hue works very well.

SparkSql Editor in action via both types of connection

Here we will describe how to integrate with the Spark SQL Thrift Server interface that might be already available in your stack.

The article comes with a One click demo setup. The scenario is pretty simple and about batch querying, we will see for more live data in a dedicated follow-up episode.

Two connectors possible for communicating with the SparkSql Thrift Server

For fetching the Docker Compose configuration and starting everything:

mkdir spark
cd spark
wget https://raw.githubusercontent.com/romainr/query-demo/master/spark/docker-compose.yml
docker-compose up -d
>
Creating network "spark_default" with the default driver
Creating hue-database ... done
Creating livy-spark   ... done
Creating spark-master ... done
Creating spark-sql    ... done
Creating hue          ... done

Then those URLs will be up:

  • http://127.0.0.1:8080/ Spark Master Web UI
  • http://127.0.0.1:4040/environment/ Thrift SQL UI
  • http://127.0.0.1:7070 Spark Master
  • http://localhost:8998 Livy REST Server

For stopping everything:

docker-compose down

Hello World

How to create a SQL table representing some cities and number of inhabitants:

CREATE TABLE IF NOT EXISTS us_population (
  state CHAR(2),
  city VARCHAR(20),
  population BIGINT
)
;


INSERT INTO us_population
VALUES
('NY', 'New York', 8143197),
('CA', 'Los Angeles', 3844829),
('IL', 'Chicago', 2842518),
('TX', 'Houston', 2016582),
('PA', 'Philadelphia', 1463281),
('AZ', 'Phoenix', 1461575),
('TX', 'San Antonio', 1256509),
('CA', 'San Diego', 1255540),
('TX', 'Dallas', 1213825),
('CA', 'San Jose', 91233)
;


SELECT
  state as State,
  count(city) as CityCount,
  sum(population) as PopulationSum
FROM
  us_population
GROUP BY
  state
ORDER BY
  sum(population) DESC
;

Which interface for connecting with the Spark SQL Server?

We previously demoed how to leverage Apache Livy to submit some Spark SQL via Hue. As detailed there, Livy was initially created within the Hue project and offers a lightweight submission of interactive or batch PySpark / Scala Spark /SparkSql statements.

However one main drawback is that it might appear less official than the Distributed SQL Engine (also known as “Thrift Server”) shipped within Spark.

Hue can connect to the Spark SQL Thrift Server via two interfaces:

Long story short: the main advantage of SqlAlchemy is to be have more SparkSql nits ironed out but queries are submitted synchronously (i.e. queries of more than a few seconds don’t have progress report yet and long ones will time out, unless the Hue Task Server is setup).

So we recommend to get started with SqlAlchemy but help report/contribute back small fixes on the HiveServer2 API which is more native/advanced.

Note: SqlAlchemy interface requires the Hive connector which does not work out of the box because of the issue #150. But Hue ships and show a slightly patched module that works: https://github.com/gethue/PyHive

Spark SQL Editor

Configurations

In the hue.ini configure the connectors and make sure you installed the PyHive connector as shown in the docs:

[notebook]
[[interpreters]]

[[[sparksql-alchemy]]]
name=SparkSql (via SqlAlchemy)
interface=sqlalchemy
options='{"url": "hive://localhost:10000/default"}'

[[[sparksql]]]
# Must be named 'sparksql', hostname and more options are
# in the 'spark' section
name=SparkSql (via HiveServer2)
interface=hiveserver2


[spark]
sql_server_host=localhost
sql_server_port=10000

What’s next?

Et voila!

Next time we will describe the progress on the Hue SQL Scratchpad component that can be leveraged for easily editing and quick testing embedded SparkSql snippets as well as how to query live data.

Onwards!

Romain


comments powered by Disqus

More recent stories

26 June 2024
Integrating Trino Editor in Hue: Supporting Data Mesh and SQL Federation
Read More
03 May 2023
Discover the power of Apache Ozone using the Hue File Browser
Read More