Initially published on https://medium.com/data-querying/live-sql-querying-live-logs-and-sending-live-updates-easily-e6297150cf92
A log analysis tutorial on an Apache Kafka data stream via Flink SQL, ksqlDB and the Hue Editor.
Real-time queries on streams of data are a modern way to perform powerful analyses, as demoed in the previous post. This time we will look at a more personalized scenario: querying our own logs, generated in the Web Query Editor.
First, thank you to the community for all the improvements to the open source projects mentioned below, in particular Flink 1.12 with its SQL gateway, as well as the Hue Editor.
The goal is to demo the current SQL capabilities and the ease of interactively building queries on streams of data.
Querying a data log stream via Flink SQL and ksqlDB
Architecture
The article comes with a live demo setup so that you can play with the products easily locally.
Raw logs from the Hue Editor..
[29/Dec/2020 22:43:21 -0800] access INFO 172.21.0.1 romain - "POST /notebook/api/get_logs HTTP/1.1" returned in 30ms 200 81
.. are being collected via Fluentd, which forwards them directly to a Kafka topic after filtering out the non-access/INFO rows (to keep the data simpler).
{"container_id":"7d4fa988b26e2034670bbe8df3f1d0745cd30fc9645c19d35e8004e7fcf8c71d","container_name":"/hue","source":"stdout","log":"[29/Dec/2020 22:43:21 -0800] access INFO 172.21.0.1 romain - \"POST /notebook/api/get_logs HTTP/1.1\" returned in 30ms 200 81"}
This data is then read from the Kafka topic and analyzed interactively, before being turned into a long-running query that calculates how many API calls per logged-in user are made within rolling windows of 10 seconds.
Live stream analysis Architecture
Demo
For fetching the Docker Compose configuration and starting everything:
mkdir stream-sql-logs
cd stream-sql-logs
wget https://raw.githubusercontent.com/romainr/query-demo/master/stream-sql-logs/docker-compose.yml
docker-compose up -d
>
Creating network "stream-sql-logs_default" with the default driver
Creating hue-database ... done
Creating stream-sql-logs_jobmanager_1 ... done
Creating stream-sql-logs_fluentd_1 ... done
Creating stream-sql-logs_zookeeper_1 ... done
Creating ksqldb-server ... done
Creating hue ... done
Creating stream-sql-logs_taskmanager_1 ... done
Creating flink-sql-api ... done
Creating stream-sql-logs_kafka_1 ... done
Then those URLs will be up:
- http://localhost:8888/ Hue Editor
- http://localhost:8081/ Flink Dashboard
For stopping everything:
docker-compose down
Scenario
While interacting with the Web Editor, Web logs are being generated. We will ingest a subset of them into a Kafka Topic that we will query via Flink SQL. ksqlDB is used to show that, at the end of the day, all the SQL SELECTs and INSERTs go through standard Kafka topics.
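For instance, once everything is up, you can peek at the raw records of a topic directly from ksqlDB. A quick sketch (PRINT reads the topic as-is, without any stream declared on top of it):

PRINT 'hue_logs' FROM BEGINNING LIMIT 5;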
Once again, the TUMBLE function is used to easily create live windows of aggregation.
We logged in as two separate users (‘demo’ and ‘romain’) to show the grouping by username.
One of the novelties in Flink 1.12 is the new upsert-kafka connector, which lets us send the rolling aggregated data back into Kafka. This enables further downstream consumption by reporting or alerting systems, which can simply read from the topic.
Calculating and inserting a rolling window of live stats into a Kafka Topic
As a follow-up, the querying could be made more elaborate by extracting the real datetime and HTTP code fields, or by outputting alert messages when the counts go above or below a certain threshold, which is perfect for building a live application on top of it.
One of the niceties of the editor is that it lets you interactively fiddle with SQL functions like REGEXP_EXTRACT and DATE_FORMAT, which can be time-consuming to get right.
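For instance, here is a minimal sketch of such an alert query, reading from the `hue_stats` Topic created in the SQL below (the threshold of 50 calls is an arbitrary assumption):

-- Hypothetical alert: windows where a user made more than 50 API calls
SELECT ts, username, callsCt
FROM hue_stats
WHERE callsCt > 50
;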
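For example, a quick sketch combining the two to turn the raw log date into a standard timestamp string (the TO_TIMESTAMP pattern is an assumption based on the log sample shown earlier):

-- Parse the extracted date string, then reformat it
SELECT DATE_FORMAT(
  TO_TIMESTAMP('31/Dec/2020 06:02:36', 'dd/MMM/yyyy HH:mm:ss'),
  'yyyy-MM-dd HH:mm:ss'
) as dt
;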
SQL
Here is the SQL source typed in the Query Editor:
-- Mapping Topic data to a Table
CREATE TABLE hue_logs (
  log STRING,
  proctime AS PROCTIME()
) WITH (
  'connector' = 'kafka',
  'topic' = 'hue_logs',
  'properties.bootstrap.servers' = 'kafka:9094',
  'format' = 'json'
);

-- Fiddling with functions
SELECT
  REGEXP_EXTRACT('[31/Dec/2020 06:02:36 -0800] access INFO 172.26.0.1 romain - "POST /notebook/api/check_status HTTP/1.1" returned in 45ms 200 276', '\[([^ ]+ [^ ]+).*', 1) as dt
;

-- Extracting some of the fields
SELECT
  REGEXP_EXTRACT(log, '\[([^ ]+ [^ ]+).*', 1) as dt,
  REGEXP_EXTRACT(log, '.*access\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+\-.*', 3) as username,
  log
FROM hue_logs
LIMIT 100
;

-- Live counts of API calls made per user
SELECT
  username,
  COUNT(*) as callsCt
FROM (
  SELECT REGEXP_EXTRACT(log, '.*access\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+\-.*', 3) as username
  FROM hue_logs
)
GROUP BY username
;

-- Table for outputting calculations to a new Topic
CREATE TABLE hue_stats (
  ts TIMESTAMP(3),
  -- ts BIGINT, -- Depends if using `SECOND()` in `TUMBLE` or not
  username STRING,
  callsCt BIGINT,
  PRIMARY KEY (username, ts) NOT ENFORCED
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'hue_stats',
  'properties.bootstrap.servers' = 'kafka:9094',
  'key.format' = 'json',
  'value.format' = 'json'
);

-- INSERT INTO hue_stats -- Uncomment to send to the Topic after creating the `hue_stats` table above
-- Rolling windows of aggregation
SELECT
  TUMBLE_START(proctime, INTERVAL '10' SECOND) as ts,
  username,
  COUNT(*) as callsCt
FROM (
  SELECT
    REGEXP_EXTRACT(log, '.*access\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+\-.*', 3) as username,
    proctime
  FROM hue_logs
)
GROUP BY TUMBLE(proctime, INTERVAL '10' SECOND), username
;

-- Poke at the generated stats
SELECT *
FROM hue_stats
LIMIT 100
;
And here are the equivalent ksqlDB statements:

SHOW TOPICS;

CREATE STREAM hue_logs (
  log STRING
) WITH (
  kafka_topic='hue_logs',
  value_format='json',
  partitions=1
)
;

SELECT
  log,
  rowtime
FROM hue_logs
EMIT CHANGES
LIMIT 10
;

CREATE STREAM hue_stats (
  ts STRING,
  username STRING,
  callsCt BIGINT
) WITH (
  kafka_topic='hue_stats',
  value_format='json',
  partitions=1
)
;

SELECT *
FROM hue_stats
EMIT CHANGES
LIMIT 10
;
Et voilà!
Any feedback or question? Feel free to comment here!
All these projects are also open source and welcome feedback and contributions.
In the case of the Hue Editor, the Forum or GitHub issues are good places for that. More sophisticated SQL autocompletes and connectors, WebSocket and Celery Task server integrations are some improvement ideas.
Onwards!
Romain