Bay Area bike share analysis with the Hadoop Notebook and Spark & SQL

In a previous post, we demonstrated how to use Hue’s Search app to seamlessly index and visualize trip data from Bay Area Bike Share and use Spark to supplement that analysis by adding weather data to our dashboard.

In this tutorial, we’ll use the Notebook app to take a deeper look at the peak usage of the Bay Area Bike Share (BABS) system.

To start, download the latest data set (the original website no longer hosts the data). This post uses the data from August 2013 through February 2014.

The notebook can be downloaded here and imported the same way as a workflow or a search dashboard.

Importing CSV Data with the Metastore App

The BABS data set consists of four CSV files containing data for stations, trips, rebalancing (availability), and weather. Using Hue’s Metastore import wizard, we can easily import these data sets and create tables whose schema is inferred from the CSV header.

File Upload to Metastore

Metastore Sample

The import wizard also provides the opportunity to override any field names or types, which we’ll do for the Trip data to change the “duration” field from a TINYINT to an INT.

Metastore Schema Fields
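
To see why the type override matters, here is a minimal Python sketch that picks the narrowest Hive integer type able to hold a set of values. The function name `narrowest_int_type` and the sample durations are hypothetical, and this is not a description of how Hue’s wizard infers types; it only illustrates that a trip duration in seconds quickly exceeds the TINYINT range of -128 to 127.

```python
# Signed value ranges of Hive's integer types, narrowest first.
HIVE_INT_RANGES = [
    ("TINYINT", -2**7, 2**7 - 1),
    ("SMALLINT", -2**15, 2**15 - 1),
    ("INT", -2**31, 2**31 - 1),
    ("BIGINT", -2**63, 2**63 - 1),
]


def narrowest_int_type(values):
    """Return the narrowest Hive integer type that can hold every value."""
    lo, hi = min(values), max(values)
    for name, type_min, type_max in HIVE_INT_RANGES:
        if type_min <= lo and hi <= type_max:
            return name
    raise ValueError("values do not fit in BIGINT")


# Hypothetical durations in seconds: a small sample vs. a fuller set.
print(narrowest_int_type([63, 70, 71]))         # TINYINT
print(narrowest_int_type([63, 70, 71, 83915]))  # INT
```

A type guessed from a handful of short trips can therefore be too narrow for the full file, which is one reason to check the proposed types before importing.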

Interactive Analysis with a Hadoop Notebook

Lightning-Fast Impala Queries

Now that we’ve imported the data into our cluster, we can create a new Notebook to perform our data crunching. To start, we’ll run some quick exploration queries using Impala.

Let’s find the top 10 most popular start stations based on the trip data:

SELECT startterminal, startstation, COUNT(1) AS count FROM bikeshare.trips GROUP BY startterminal, startstation ORDER BY count DESC LIMIT 10


Once our results are returned, we can easily visualize this data; a bar graph works nicely for a simple COUNT..GROUP BY query.

Impala Bar Graph
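
For readers without a cluster at hand, the same COUNT..GROUP BY pattern can be tried locally. The sketch below runs an equivalent query against an in-memory SQLite table using Python’s standard library; the function name `top_start_stations`, the alias `trip_count`, and all sample rows other than the SF Caltrain station name are hypothetical, and SQLite stands in for Impala purely for illustration.

```python
import sqlite3

# Hypothetical sample rows: (startterminal, startstation).
SAMPLE_TRIPS = [
    (70, "San Francisco Caltrain (Townsend at 4th)"),
    (70, "San Francisco Caltrain (Townsend at 4th)"),
    (70, "San Francisco Caltrain (Townsend at 4th)"),
    (1, "Example Station A"),
    (1, "Example Station A"),
    (2, "Example Station B"),
]


def top_start_stations(rows, limit=10):
    """Count trips per start station and return the most popular ones."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trips (startterminal INTEGER, startstation TEXT)")
    conn.executemany("INSERT INTO trips VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT startterminal, startstation, COUNT(1) AS trip_count "
        "FROM trips GROUP BY startterminal, startstation "
        "ORDER BY trip_count DESC LIMIT ?",
        (limit,),
    ).fetchall()
    conn.close()
    return result


for row in top_start_stations(SAMPLE_TRIPS):
    print(row)
```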

It seems that the San Francisco Caltrain (Townsend at 4th) was by far the most common start station. Let’s determine which end stations, for trips starting from the SF Caltrain Townsend station, were the most popular. We’ll fetch the latitude and longitude coordinates so that we can visualize the results on a map.

SELECT s.station_id, s.name, s.lat, s.long,
 COUNT(1) AS count
FROM `bikeshare`.`trips` t
JOIN `bikeshare`.`stations` s ON s.station_id = t.endterminal
WHERE t.startterminal = 70
GROUP BY s.station_id, s.name, s.lat, s.long
ORDER BY count DESC LIMIT 10

Bike Share Map

The map visualization indicates that the most popular trips starting from the SF Caltrain station are in fairly close proximity to the station, with most of the destinations being clustered around the Financial District and SOMA.

Long Running Queries with Hive

For longer-running SQL queries, or queries that require use of Hive’s built-in functions, we can add a Hive snippet to our notebook to perform this analysis.

Let’s say we wanted to dig further into the trip data for the SF Caltrain station and find the total number of trips and average duration (in minutes) of those trips, grouped by hour.

Since the trip data stores startdate as a STRING, we’ll need to apply some string-manipulation to extract the hour within an inline SQL query. The outer query will aggregate the count of trips and the average duration.

SELECT
    hour,
    COUNT(1) AS trips,
    ROUND(AVG(duration) / 60) AS avg_duration
FROM (
    SELECT
        CAST(SPLIT(SPLIT(t.startdate, ' ')[1], ':')[0] AS INT) AS hour,
        t.duration AS duration
    FROM `bikeshare`.`trips` t
    WHERE
        t.startterminal = 70
        AND t.duration IS NOT NULL
    ) r
GROUP BY hour
ORDER BY hour ASC;

Since this query returns several numeric dimensions, we can visualize the results using a scatterplot, with the hour as the x-axis, the number of trips as the y-axis, and the average duration as the point size.

Bike Share Scatter Plot
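
The nested SPLIT expression used to extract the hour can be mirrored in plain Python, which makes it easier to check against a few rows by hand. This is a minimal sketch that assumes startdate values look like "8/29/2013 14:13"; the function names and sample trips are hypothetical, and the final rounding step of the Hive query is left out.

```python
def start_hour(startdate):
    """Mirror CAST(SPLIT(SPLIT(startdate, ' ')[1], ':')[0] AS INT)."""
    time_part = startdate.split(" ")[1]  # e.g. "14:13"
    return int(time_part.split(":")[0])  # e.g. 14


def hourly_stats(trips):
    """Group (startdate, duration_in_seconds) pairs by start hour.

    Returns {hour: (trip_count, average_duration_in_minutes)}.
    """
    totals = {}
    for startdate, duration in trips:
        if duration is None:  # same filter as "duration IS NOT NULL"
            continue
        hour = start_hour(startdate)
        count, seconds = totals.get(hour, (0, 0))
        totals[hour] = (count + 1, seconds + duration)
    return {
        hour: (count, seconds / count / 60.0)
        for hour, (count, seconds) in totals.items()
    }


# Hypothetical trips; the last one has no duration and is skipped.
sample_trips = [
    ("8/29/2013 8:05", 600),
    ("8/29/2013 8:40", 1200),
    ("8/29/2013 17:15", 300),
    ("8/30/2013 9:00", None),
]
print(hourly_stats(sample_trips))
```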

Let’s add another Hive snippet to analyze an hour-by-hour breakdown of availability at the SF Caltrain Station:

SELECT
  hour,
  ROUND(AVG(bikes_available)) AS avg_bikes,
  ROUND(AVG(docks_available)) AS avg_docks
FROM (
  SELECT
    r.time AS time,
    CAST(SUBSTR(r.time, 12, 2) AS INT) AS hour,
    CAST(r.bikes_available AS INT) AS bikes_available,
    CAST(r.docks_available AS INT) AS docks_available
  FROM `bikeshare`.`rebalancing` r
  JOIN `bikeshare`.`stations` s ON r.station_id = s.station_id
  WHERE
    r.station_id = 70
    AND SUBSTR(r.time, 15, 2) = '00'
  ) t
GROUP BY hour
ORDER BY hour ASC;

We’ll visualize the results as a line graph, which indicates that bike availability tends to fall starting at 6 AM and recovers around 6 PM.

Bike Share Availability Line Graph
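
The availability query relies on SUBSTR positions, which are 1-based in Hive, so it can help to check them against a sample value. The sketch below emulates that behavior in Python and assumes the rebalancing time column is laid out like "2013/08/29 12:00:01"; the helper names and the sample timestamps are hypothetical.

```python
def substr(text, start, length):
    """Emulate Hive's SUBSTR for a positive, 1-based start position."""
    return text[start - 1:start - 1 + length]


def is_top_of_hour(timestamp):
    """Same test as SUBSTR(time, 15, 2) = '00': the minutes field is '00'."""
    return substr(timestamp, 15, 2) == "00"


sample = "2013/08/29 12:00:01"  # assumed layout of the `time` column
print(substr(sample, 12, 2))    # 12
print(is_top_of_hour(sample))   # True
```

Under that assumption, the minutes filter keeps only the readings taken during the first minute of each hour, so the averages are computed from roughly one sample per hour per day rather than from every status record.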

Robust Data Analysis with PySpark

At a certain point, your data analysis may exceed the limits of relational analysis with SQL or require a more expressive, full-fledged API.

Hue’s Spark notebooks allow users to mix exploratory SQL-analysis with custom Scala, Python (pyspark), and R code that utilizes the Spark API.

For example, we can open a pyspark snippet, load the trip data directly from the Hive warehouse, and apply a sequence of filter, map, reduceByKey, and combineByKey operations to determine the average number of trips per hour of the day starting from the SF Caltrain Station:

trips = sc.textFile('/user/hive/warehouse/bikeshare.db/trips/201402_trip_data.csv')

# Split each CSV line into its fields
trips = trips.map(lambda line: line.split(","))

# Keep only trips whose start terminal (field 4) is the SF Caltrain station
station_70 = trips.filter(lambda x: x[4] == '70')

# Emit tuple of ((date, hour), 1)
trips_by_day_hour = station_70.map(lambda x: ((x[2].split()[0], x[2].split()[1].split(':')[0]), 1))

# Count trips per (date, hour)
trips_by_day_hour = trips_by_day_hour.reduceByKey(lambda a, b: a+b)

# Emit tuple of (hour, count)
trips_by_hour = trips_by_day_hour.map(lambda x: (int(x[0][1]), x[1]))

# Accumulate a (sum, count) pair per hour, then divide to get the average
avg_trips_by_hour = trips_by_hour.combineByKey(
    (lambda x: (x, 1)),
    (lambda x, y: (x[0] + y, x[1] + 1)),
    (lambda x, y: (x[0] + y[0], x[1] + y[1])))
# float() avoids integer division when running under Python 2
avg_trips_by_hour = avg_trips_by_hour.mapValues(lambda v: v[0] / float(v[1]))

avg_trips_sorted = sorted(avg_trips_by_hour.collect())
%table avg_trips_sorted

Notebook pyspark bar graph
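
The combineByKey step is the least obvious part of the snippet above, so here is a plain-Python sketch of the idea that runs without Spark. It is a single-partition stand-in, not Spark’s implementation: `combine_by_key` and `average_by_key` are hypothetical helpers, the third (merge combiners) function is omitted because there is only one partition, and the input pairs are made up.

```python
def combine_by_key(pairs, create_combiner, merge_value):
    """Single-partition stand-in for RDD.combineByKey (no mergeCombiners)."""
    combined = {}
    for key, value in pairs:
        if key in combined:
            combined[key] = merge_value(combined[key], value)
        else:
            combined[key] = create_combiner(value)
    return combined


def average_by_key(pairs):
    """Average the values of each key via a running (sum, count) pair."""
    sums_and_counts = combine_by_key(
        pairs,
        lambda v: (v, 1),                         # first value seen for a key
        lambda acc, v: (acc[0] + v, acc[1] + 1),  # fold in another value
    )
    return {k: total / float(n) for k, (total, n) in sums_and_counts.items()}


# Hypothetical (hour, trips_on_one_day) pairs, as produced before combineByKey.
daily_counts = [(8, 40), (8, 50), (17, 30), (17, 36), (17, 24)]
print(sorted(average_by_key(daily_counts).items()))
# [(8, 45.0), (17, 30.0)]
```

Carrying a (sum, count) pair per key is what lets Spark compute an average in one pass: partial pairs from different partitions can be added together, whereas partial averages cannot.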


As you can see, Hue’s Notebook app enables easy interactive data analysis and visualization with a powerful mix of tools. To learn more about the Spark Notebook work, read about Livy, the Spark REST job server. See you at the upcoming Hadoop World in New York and Spark Summit in Amsterdam!

Stay tuned for a number of exciting improvements to the notebook app, and as usual feel free to comment on the hue-user list or @gethue!


Helpful Tips

Importing quoted-CSV data

The BABS rebalancing data (named 201402_status_data.csv) uses quotes. In these cases, it is easier to create the table in Hive with the Beeswax editor and use the OpenCSV row SerDe for Hive:

CREATE TABLE rebalancing(station_id int, bikes_available int, docks_available int, time string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\"",
"escapeChar" = "\\"
)
STORED AS TEXTFILE;

Then you can go back to the Metastore to import the CSV into the table; note that you may have to remove the header line manually.
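
If you prefer to inspect the file and strip the header before uploading, Python’s standard csv module handles the quoting. This is a minimal sketch: the sample rows, the header names, and the helper names `strip_header` and `parse_rows` are assumptions made for illustration, not taken from the actual file.

```python
import csv
import io

# Hypothetical excerpt shaped like the table above; header names are assumed.
SAMPLE = (
    '"station_id","bikes_available","docks_available","time"\n'
    '"70","12","7","2013/08/29 12:06:01"\n'
    '"70","11","8","2013/08/29 12:07:01"\n'
)


def strip_header(csv_text):
    """Return the CSV text without its first (header) line."""
    return csv_text.split("\n", 1)[1]


def parse_rows(csv_text):
    """Parse quoted CSV rows; csv.reader removes the surrounding quotes."""
    return list(csv.reader(io.StringIO(csv_text), quotechar='"'))


rows = parse_rows(strip_header(SAMPLE))
print(rows[0])  # ['70', '12', '7', '2013/08/29 12:06:01']
```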

Reset Impala’s Metastore Cache

When you create new databases or tables and plan to query them in an Impala snippet, it’s a good idea to run an INVALIDATE METADATA; command first to reset the metastore cache. Otherwise, you may encounter an error where the database or table is not recognized.


  1. Ruslan 3 years ago

    That’s awesome. How to enable Hadoop notebooks? We upgraded to CDH 5.5 / Hue 3.9 and can’t find how to access that. Thanks.

    • Hue Team 3 years ago

      If the Spark app is not visible in the ‘Editor’ menu, you will need to unblacklist it from the hue.ini:

      Note that this is not supported yet in CDH, only offered as beta as the Spark Server is not integrated yet:


      Note: To override a value in Cloudera Manager, you need to enter verbatim each mini section from below into the Hue Safety Valve: Hue Service → Configuration → Service-Wide → Advanced → Hue Service Advanced Configuration Snippet (Safety Valve) for hue_safety_valve.ini

  2. Giridhar 2 years ago

    Is there a plan for Cloudera to officially provide the notebook functionality, especially the Spark one?

  3. Miro 1 year ago
    • Author
      Hue Team 1 year ago

      Thanks! Link updated!

  4. Miro 1 year ago

    Where is file 201402_status_data.csv?

  5. Ron 1 year ago

    SELECT s.station_id, s.name, s.lat, s.long,
    COUNT(1) AS count
    FROM `bikeshare`.`trips` t
    JOIN `bikeshare`.`stations` s ON s.station_id = t.endterminal
    WHERE t.startterminal = 70
    GROUP BY s.station_id, s.name, s.lat, s.long
    ORDER BY count DESC LIMIT 10

    In the above query, instead of hardcoding 70, can we get the ID from the user (dynamically)?
