SQL Querying a log stream and outputting Calculations to another stream

Published on 10 January 2021 in Tutorial / Version 4.9 / Flink SQL / ksqlDB - 3 minutes read - Last modified on 06 March 2021 - Read in jp

Initially published on https://medium.com/data-querying/live-sql-querying-live-logs-and-sending-live-updates-easily-e6297150cf92

Log analysis tutorial from an Apache Kafka data stream via Flink SQL, ksqlDB & Hue Editor.

Real time queries on streams on data is a modern way to perform powerful analyses as demoed in the previous post. This time we will see a more personalized scenario by querying our own logs generated in the Web Query Editor.

First, thank you to the community for all the improvements on the open source projects mentioned below, with in particular Flink Version 1.12 and the SQL gateway as well as the Hue Editor.

The goal is to demo the current SQL capabilities and ease of use of interactively building queries on streams of data.

Querying a data log stream via Flink SQL and ksqlDB

Querying a data log stream via Flink SQL and ksqlDB

Architecture

The article comes with a live demo setup so that you can play with the products easily locally.

Raw logs from the Hue Editor..

[29/Dec/2020 22:43:21 -0800] access  INFO   172.21.0.1 romain - "POST /notebook/api/get_logs HTTP/1.1" returned in 30ms 200 81

.. are being collected via Fluentd, which forwards them directly to a Kafka topic after filtering out the non access/INFO rows (to keep the data simpler).

{"container_id":"7d4fa988b26e2034670bbe8df3f1d0745cd30fc9645c19d35e8004e7fcf8c71d","container_name":"/hue","source":"stdout","log":"[29/Dec/2020 22:43:21 -0800] access  INFO   172.21.0.1 romain - \"POST /notebook/api/get_logs HTTP/1.1\" returned in 30ms 200 81"}

This data is then picked from the Kafka topic, then analyzed interactively before being turned into a long running query calculating how many API calls per logged-in user are being made within rolling windows of 10 seconds.

Live stream analysis Architecture

Live stream analysis Architecture

Demo

For fetching the Docker Compose configuration and starting everything:

mkdir stream-sql-logs
cd stream-sql-logs

wget https://raw.githubusercontent.com/romainr/query-demo/master/stream-sql-logs/docker-compose.yml

docker-compose up -d
>
Creating network "stream-sql-logs_default" with the default driver
Creating hue-database                 ... done
Creating stream-sql-logs_jobmanager_1 ... done
Creating stream-sql-logs_fluentd_1    ... done
Creating stream-sql-logs_zookeeper_1   ... done
Creating ksqldb-server                 ... done
Creating hue                           ... done
Creating stream-sql-logs_taskmanager_1 ... done
Creating flink-sql-api                 ... done
Creating stream-sql-logs_kafka_1       ... done

Then those URLs will be up:

For stopping everything:

docker-compose down

Scenario

While interacting with the Web Editor, Web logs are being generated. We will ingest a subset of them into a Kafka Topic that we will query via Flink SQL. ksqlDB is used to prove that at the end of the day all the SQL SELECTs and INSERTs are purely going through standard Kafka topics.

Once again, the TUMBLE function is used to easily create live windows of aggregation.

We logged-in as two separate users (‘demo’ and ‘romain’) to show the grouping by username.

One of the novelty in Flink is the new UPSERT into Kafka connector, which will let us send the rolling aggregated data back into Kafka. This enables further downwards consumption by reporting or alerting systems which can simply read from the topic.

Calculating and inserting a rolling window of live stats into a Kafka TopicCalculating and inserting a rolling window of live stats into a Kafka Topic

As a follow-up, the querying could be more elaborate via the extraction of the real datetime and HTTP code fields as well as outputting alert messages when the counts go above/below a certain threshold, which is perfect for building a live application on top of it.

One of niceties of the editor is to let you interactively fiddle with SQL functions like REGEXP_EXTRACT, DATE_FORMAT… which can be time consuming to get right.

SQL

Here is the SQL source typed in the Query Editor:

SHOW TOPICS;
CREATE STREAM hue_logs (
log STRING
) WITH (
kafka_topic='hue_logs',
value_format='json',
partitions=1
)
;
SELECT
log,
rowtime
FROM
hue_logs
EMIT CHANGES
LIMIT 10
;
CREATE STREAM hue_stats (
ts string,
username STRING,
callsCt BIGINT
) WITH (
kafka_topic='hue_stats',
value_format='json',
partitions=1
)
;
SELECT
*
FROM
hue_stats
EMIT CHANGES
LIMIT 10
;

Et voila!

Any feedback or question? Feel free to comment here!

All these projects are also open source and welcome feedback and contributions.

In the case of the Hue Editor, the Forum or Github issues are good places for that. More sophisticated SQL autocompletes and connectors, Web Socket and Celery Task server integrations are some improvement ideas.

Onwards!

Romain


More recent stories

26 June 2024
Integrating Trino Editor in Hue: Supporting Data Mesh and SQL Federation
Read More
03 May 2023
Discover the power of Apache Ozone using the Hue File Browser
Read More