(De-)normalization & how ClickHouse® dictionaries can solve typical problems

Written by Stefan Kaeser, DoubleCloud Solutions Architect

A few days ago I stumbled across a blog post from our friends over at ClickHouse® about Exploring massive, real-world data sets: 100+ Years of Weather Records in ClickHouse.

It’s a well-written post and really shows what ClickHouse is capable of when it comes to transforming billions of data points with little effort. They showed how to load data from files at external URLs, how to use regexes to format data, how to create dictionaries, and so on. If you haven’t already, go ahead and read through their post.

However… there was something in the blog I couldn’t stop thinking about:

Their main query for analyzing the data still took 14 seconds after optimization, which is a really long time when you are used to ClickHouse and its normally sub-second responses. The second thing that made me feel uncomfortable is the denormalized structure of their main table.

When I first learned about normalization at university two decades ago, denormalization was bad. It was drummed into us that you should normalize as much as possible, and back then, when OLTP was practically the only workload, that was the right choice. But times have changed: OLAP has taken over in a lot of places, and with big analytical workloads denormalization has had a revival and helped maximize query speeds in many cases.

But there are still things to consider when you denormalize, especially when handling huge amounts of data:

  • When data changes, it can take a very long time to update old rows (see the sketch after this list)
  • With billions or trillions of rows, even small columns can take up a lot of space
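
Taking the weather table we’ll see in a moment (sensors_blog) as an example, here is a minimal sketch of the first problem; the station id and the new name are made up for illustration. Renaming a single station on the denormalized table is a mutation that has to rewrite every data part containing that station:

-- Hypothetical rename of one station on the denormalized table:
-- the mutation rewrites every part that contains rows for that station.
ALTER TABLE sensors_blog
    UPDATE name = 'RENAMED STATION'
    WHERE station_id = 'XX000000001';

-- Mutations run asynchronously; their progress can be watched here.
SELECT command, parts_to_do, is_done
FROM system.mutations
WHERE table = 'sensors_blog'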

How can dictionaries help with those problems?

Spoiler Alert: You can speed the query up from multiple seconds to milliseconds!

First of all, if you want to play around with the optimizations yourself, I encourage you once again to read ClickHouse’s blog post, as I won’t repeat their loading process but instead build on it. But for those of you just enjoying the read without replaying it, I’ll give you the table definitions and some cardinalities. The main table, which holds weather data from weather stations all over the world since 1900, looks like this:

CREATE TABLE sensors_blog
(
    `station_id` LowCardinality(String),
    `date` Date32,
    `tempAvg` Nullable(Int32) COMMENT 'Average temperature (tenths of degrees C)',
    `tempMax` Nullable(Int32) COMMENT 'Maximum temperature (tenths of degrees C)',
    `tempMin` Nullable(Int32) COMMENT 'Minimum temperature (tenths of degrees C)',
    `precipitation` Nullable(UInt32) COMMENT 'Precipitation (tenths of mm)',
    `snowfall` Nullable(UInt32) COMMENT 'Snowfall (mm)',
    `snowDepth` Nullable(UInt32) COMMENT 'Snow depth (mm)',
    `percentDailySun` Nullable(UInt8) COMMENT 'Daily percent of possible sunshine (percent)',
    `averageWindSpeed` Nullable(UInt32) COMMENT 'Average daily wind speed (tenths of meters per second)',
    `maxWindSpeed` Nullable(UInt32) COMMENT 'Peak gust wind speed (tenths of meters per second)',
    `weatherType` Nullable(Enum8('Normal' = 0, 'Fog' = 1, 'Heavy Fog' = 2, 'Thunder' = 3, 'Small Hail' = 4, 'Hail' = 5, 'Glaze' = 6, 'Dust/Ash' = 7, 'Smoke/Haze' = 8, 
                      'Blowing/Drifting Snow' = 9, 'Tornado' = 10, 'High Winds' = 11, 'Blowing Spray' = 12, 'Mist' = 13, 'Drizzle' = 14, 'Freezing Drizzle' = 15, 
                      'Rain' = 16, 'Freezing Rain' = 17, 'Snow' = 18, 'Unknown Precipitation' = 19, 'Ground Fog' = 21, 'Freezing Fog' = 22)),
    `location` Point,
    `elevation` Float32,
    `name` LowCardinality(String)
) ENGINE = MergeTree() ORDER BY (station_id, date);

Notice I’ve changed the name from noaa to sensors_blog and added some Nullables for internal reasons. The table contains around 1.08 billion rows, with around 120k unique stations in it.
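
If you have loaded the data yourself and want to double-check those two numbers, a quick count does the trick (a small sketch, using the table name from above):

SELECT
    count() AS total_rows,
    uniqExact(station_id) AS unique_stations
FROM sensors_blog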

The columns location, elevation and name all depend directly on station_id and make up the denormalized part, so let’s see whether it’s worth adding them to the table at all and what it costs:

SELECT
    name,
    formatReadableSize(SUM(data_compressed_bytes)) AS compressed,
    formatReadableSize(SUM(data_uncompressed_bytes)) AS uncompressed
FROM system.columns
WHERE (database = 'weather') AND (table = 'sensors_blog')
GROUP BY name
    WITH TOTALS

┌─name─────────────┬─compressed─┬─uncompressed─┐
│ date             │ 2.46 GiB   │ 4.01 GiB     │
│ snowfall         │ 124.18 MiB │ 5.01 GiB     │
│ tempMax          │ 1.17 GiB   │ 5.01 GiB     │
│ tempMin          │ 1.17 GiB   │ 5.01 GiB     │
│ tempAvg          │ 279.95 MiB │ 5.01 GiB     │
│ weatherType      │ 40.65 MiB  │ 2.00 GiB     │
│ snowDepth        │ 141.74 MiB │ 5.01 GiB     │
│ precipitation    │ 1.11 GiB   │ 5.01 GiB     │
│ averageWindSpeed │ 41.31 MiB  │ 5.01 GiB     │
│ maxWindSpeed     │ 35.07 MiB  │ 5.01 GiB     │
│ percentDailySun  │ 10.28 MiB  │ 2.00 GiB     │

│ station_id       │ 16.63 MiB  │ 1.98 GiB     │

│ name             │ 20.85 MiB  │ 1.99 GiB     │
│ location         │ 87.51 MiB  │ 16.04 GiB    │
│ elevation        │ 22.52 MiB  │ 4.01 GiB     │
└──────────────────┴────────────┴──────────────┘

Totals:
┌─name─┬─compressed─┬─uncompressed─┐
│      │ 6.71 GiB   │ 72.14 GiB    │
└──────┴────────────┴──────────────┘

I’ve separated the denormalized columns for better readability.

As the table is ordered by station_id, the denormalized dependent columns compress extremely well, but they still take up nearly 2% of the total compressed data. Uncompressed they account for roughly 30% of the data, which will eat some of your CPU cycles for decompression when querying. So denormalization not only costs more space, it can also slow down your queries.

Let’s take a look now at the main query which bothered me the most when I read the original post:

SELECT
    tempMax / 10 AS maxTemp,
    station_id,
    date,
    location
FROM sensors_blog
WHERE dictGet(country_polygons, 'name', location) = 'Portugal'
ORDER BY tempMax DESC
LIMIT 5

Query id: 2e2bd273-f4a0-4268-856c-20ef5fec9f61

┌─maxTemp─┬─station_id──┬───────date─┬─location──────────┐
│    45.8 │ PO000008549 │ 1944-07-30 │ (-8.4167,40.2)    │
│    45.4 │ PO000008562 │ 2003-08-01 │ (-7.8667,38.0167) │
│    45.2 │ PO000008562 │ 1995-07-23 │ (-7.8667,38.0167) │
│    44.5 │ POM00008558 │ 2003-08-01 │ (-7.9,38.533)     │
│    44.2 │ POM00008558 │ 2022-07-13 │ (-7.9,38.533)     │
└─────────┴─────────────┴────────────┴───────────────────┘

5 rows in set. Elapsed: 14.427 sec. Processed 1.08 billion rows, 35.94 GB (74.61 million rows/s., 2.49 GB/s.)

The query fetches the top five temperatures recorded in Portugal since recordings began. It’s already optimized by using a dictionary to map each station’s location to its country, but it still takes nearly 15 seconds… and for ClickHouse that’s like an eternity.
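
In case you skipped the original post: country_polygons is a polygon dictionary the ClickHouse team built from country boundary data, so dictGet(country_polygons, 'name', location) returns the name of the country a point lies in. A rough sketch of what such a dictionary can look like (the source table country_polygons_src is a made-up name here, not their original definition):

-- Sketch only: a polygon dictionary mapping a Point to a country name.
-- 'country_polygons_src' is assumed to hold one (multi)polygon plus a name per row.
CREATE DICTIONARY country_polygons
(
    key Array(Array(Array(Array(Float64)))),
    name String
)
PRIMARY KEY key
SOURCE(CLICKHOUSE(TABLE 'country_polygons_src'))
LIFETIME(MIN 0 MAX 0)
LAYOUT(POLYGON(STORE_POLYGON_KEY_COLUMN 1))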

So can it be improved?

Actually it can, and the authors of the original blog already did so, as they found out that the first two characters of station_id contain a kind of country code:

SELECT
    tempMax / 10 AS maxTemp,
    station_id,
    date,
    location
FROM sensors_blog
WHERE substring(station_id,1,2) = 'PO'
ORDER BY tempMax DESC
LIMIT 5

┌─maxTemp─┬─station_id──┬───────date─┬─location──────────┐
│    45.8 │ PO000008549 │ 1944-07-30 │ (-8.4167,40.2)    │
│    45.4 │ PO000008562 │ 2003-08-01 │ (-7.8667,38.0167) │
│    45.2 │ PO000008562 │ 1995-07-23 │ (-7.8667,38.0167) │
│    44.5 │ POM00008558 │ 2003-08-01 │ (-7.9,38.533)     │
│    44.2 │ POM00008558 │ 2022-07-13 │ (-7.9,38.533)     │
└─────────┴─────────────┴────────────┴───────────────────┘

5 rows in set. Elapsed: 1.440 sec. Processed 1.08 billion rows, 7.39 GB (747.38 million rows/s., 5.13 GB/s.)

We can see that it is already 10 times faster, but only because we have some meta information about how the station_id is built. If the station_id were just a generic id, we would not be able to use it to speed up the query.

So let’s go back to the original query and analyze what’s taking so long.

Actually, it is the denormalization of the main table that slows down the whole query. For each of the 1.08 billion rows, ClickHouse has to read the location from disk, decompress it (16 GB in total), feed it into the dictionary (which is cached, of course), and then throw the row away if the 'Portugal' condition does not hold.

The same problem occurs with the already optimized version, just much faster, since station_id is much smaller and handling substrings is of course quicker than checking whether a point lies within a polygon.
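
If you want to verify how much data each variant actually reads, the query ids printed above can be looked up in system.query_log (query logging is enabled by default); a quick sketch using the id of the slow query:

SELECT
    read_rows,
    formatReadableSize(read_bytes) AS read,
    query_duration_ms
FROM system.query_log
WHERE (query_id = '2e2bd273-f4a0-4268-856c-20ef5fec9f61') AND (type = 'QueryFinish')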

Speeding up the query by normalization

After analyzing the query and what it does, especially reading a lot of redundant data from disk and decompressing it, we can speed up the query by undoing the denormalization.

CREATE TABLE stations_blog (
    station_id String,
    `location` Point,
    elevation Float32, 
    name String
) Engine=ReplacingMergeTree PARTITION BY tuple() ORDER BY (station_id);

INSERT INTO stations_blog
SELECT DISTINCT station_id, location, elevation, name
FROM sensors_blog

SELECT count(*)
FROM stations_blog

Query id: 700a0ef9-d97b-4a1b-9144-bdaf5bbbf2cc

┌─count()─┐
│  121564 │
└─────────┘
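
One caveat: ReplacingMergeTree only collapses duplicates during background merges, so if you run the INSERT above more than once, the count may temporarily be higher. If that happens, you can force deduplication (optional sketch):

-- Trigger a merge so duplicate station rows are replaced right away...
OPTIMIZE TABLE stations_blog FINAL

-- ...or deduplicate at query time instead.
SELECT count(*)
FROM stations_blog FINAL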

Of course, if you’ve downloaded the original stations.txt, you could have created the table from that file as well. Now that we have our stations_blog table, we can finally speed up the query for the top five temperatures in Portugal and get the result at a speed that actually feels like ClickHouse:

SELECT
    tempMax / 10 AS maxTemp,
    station_id,
    date,
    location
FROM sensors_blog
-- WHERE dictGet(country_polygons, 'name', location) = 'Portugal'
WHERE station_id IN (
    SELECT station_id
    FROM stations_blog
    WHERE dictGet(country_polygons, 'name', location) = 'Portugal'
)
ORDER BY tempMax DESC
LIMIT 5

Query id: 5e08add8-ab76-4273-9019-ae1519b72988

┌─maxTemp─┬─station_id──┬───────date─┬─location──────────┐
│    45.8 │ PO000008549 │ 1944-07-30 │ (-8.4167,40.2)    │
│    45.4 │ PO000008562 │ 2003-08-01 │ (-7.8667,38.0167) │
│    45.2 │ PO000008562 │ 1995-07-23 │ (-7.8667,38.0167) │
│    44.5 │ POM00008558 │ 2003-08-01 │ (-7.9,38.533)     │
│    44.2 │ POM00008558 │ 2022-07-13 │ (-7.9,38.533)     │
└─────────┴─────────────┴────────────┴───────────────────┘

5 rows in set. Elapsed: 0.034 sec. Processed 581.63 thousand rows, 20.13 MB (17.27 million rows/s., 597.57 MB/s.)

0.034 sec: now that’s what I consider ClickHouse speed.

So here we have an example where denormalization does not help at all or, even worse, prevents you from getting the fastest results.

Dictionaries to the rescue!

But what about other queries where we need to access the denormalized columns? Will normalizing slow them down?

Well, let’s test!
Let’s say we want the five stations with the highest average maximum temperature, with their names printed.
Denormalized:

SELECT
    AVG(tempMax / 10) AS avgMaxTemp,
    station_id,
    any(name) AS stationName
FROM sensors_blog
GROUP BY station_id
ORDER BY avgMaxTemp DESC
LIMIT 5

Query id: 1dd12cbb-6962-4b22-948d-bbc1976b691b

┌─avgMaxTemp─┬─station_id──┬─stationName────────┐
│       44.4 │ USC00143323 │ KS HADDAM          │
│      41.65 │ IN019180300 │ PHALODI            │
│       40.6 │ USC00413142 │ TX FIFE            │
│      40.55 │ USC00207427 │ MI SECORD DAM      │
│      39.85 │ USC00416953 │ TX PERRYTON 11 WNW │
└────────────┴─────────────┴────────────────────┘

5 rows in set. Elapsed: 3.596 sec. Processed 1.08 billion rows, 17.18 GB (299.34 million rows/s., 4.78 GB/s.)

Normalized and joining the data:

SELECT
    AVG(tempMax / 10) AS avgMaxTemp,
    station_id,
    any(stations_blog.name) AS stationName
FROM sensors_blog
INNER JOIN stations_blog ON stations_blog.station_id = sensors_blog.station_id
GROUP BY station_id
ORDER BY avgMaxTemp DESC
LIMIT 5

Query id: 7fdf927d-1446-443c-af5a-48573440060c

┌─avgMaxTemp─┬─station_id──┬─stationName────────┐
│       44.4 │ USC00143323 │ KS HADDAM          │
│      41.65 │ IN019180300 │ PHALODI            │
│       40.6 │ USC00413142 │ TX FIFE            │
│      40.55 │ USC00207427 │ MI SECORD DAM      │
│      39.85 │ USC00416953 │ TX PERRYTON 11 WNW │
└────────────┴─────────────┴────────────────────┘

5 rows in set. Elapsed: 9.005 sec. Processed 1.08 billion rows, 10.98 GB (119.54 million rows/s., 1.22 GB/s.)

So, as most of you might have expected already, joining the stations table makes the query nearly three times slower, going up from 3.6 sec to 9.0 sec… Of course there are ways to avoid the join, by fetching the station names later or by building more complicated queries with subselects or some special tricks (one such variant is sketched below), but that just isn’t fun, and ClickHouse is all about fun… right?
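
For completeness, here is a sketch of one of those less fun alternatives: aggregate first and join only the five resulting rows to fetch their names (not benchmarked here):

SELECT
    avgMaxTemp,
    station_id,
    name AS stationName
FROM
(
    SELECT
        AVG(tempMax / 10) AS avgMaxTemp,
        station_id
    FROM sensors_blog
    GROUP BY station_id
    ORDER BY avgMaxTemp DESC
    LIMIT 5
) AS t
INNER JOIN stations_blog ON stations_blog.station_id = t.station_id
ORDER BY avgMaxTemp DESC

It avoids reading the name column for every row, but the query grows with every denormalized column you need, which quickly stops being fun.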

So what can we do to speed up those types of select queries?

We can build a dictionary out of the stations_blog table and use that instead of the join:

CREATE DICTIONARY stations_dict
(
   `station_id` String,
   `name` String
)
PRIMARY KEY station_id
SOURCE(CLICKHOUSE(TABLE 'stations_blog'))
LIFETIME(MIN 0 MAX 0)
LAYOUT(complex_key_hashed())
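
A quick sanity check that the dictionary resolves a station id, using one of the Portuguese stations from the results above (on older ClickHouse versions you may need to wrap the key in tuple()):

SELECT dictGet(stations_dict, 'name', 'PO000008549') AS stationName

With the dictionary in place, the join is simply replaced by a dictGet call: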

SELECT
    AVG(tempMax / 10) AS avgMaxTemp,
    station_id,
    dictGet(stations_dict, 'name', station_id) AS stationName
FROM sensors_blog
GROUP BY station_id
ORDER BY avgMaxTemp DESC
LIMIT 5

Query id: df158271-faee-4b48-9f9b-7e82e48a1009

┌─avgMaxTemp─┬─station_id──┬─stationName────────┐
│       44.4 │ USC00143323 │ KS HADDAM          │
│      41.65 │ IN019180300 │ PHALODI            │
│       40.6 │ USC00413142 │ TX FIFE            │
│      40.55 │ USC00207427 │ MI SECORD DAM      │
│      39.85 │ USC00416953 │ TX PERRYTON 11 WNW │
└────────────┴─────────────┴────────────────────┘

5 rows in set. Elapsed: 1.930 sec. Processed 1.08 billion rows, 10.97 GB (557.63 million rows/s., 5.69 GB/s.)

1.9 sec, wow! That’s even faster than the select on the denormalized table.

Again, the reason is that ClickHouse does not have to read and decompress the name column at all, but uses station_id to fetch the corresponding name from the cached dictionary.
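
You can also see the dictionary sitting in memory via system.dictionaries (a quick sketch):

SELECT
    name,
    status,
    element_count,
    formatReadableSize(bytes_allocated) AS memoryUsed
FROM system.dictionaries
WHERE name = 'stations_dict'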

So what should you take away?

Even in OLAP scenarios, denormalization can be problematic and slow down your queries, on top of the other well-known drawbacks. Normalization can help with optimization but can complicate things. ClickHouse dictionaries can sometimes give you the best of both worlds. It’s always worth trying to optimize: more often than not you’ll learn something new along the way. So have fun using ClickHouse, support the open-source world, and when you need help managing ClickHouse in the cloud, ask us!

Get started with DoubleCloud

* ClickHouse® is a trademark of ClickHouse, Inc. https://clickhouse.com
