Deduplication in ClickHouse® - A practical approach
Written by: Stefan Kaeser, Senior Solutions Architect, DoubleCloud
As a solutions architect working with customers on their ClickHouse® cluster setups, I often get asked the same question: “How do I make sure that, when something in my pipeline breaks and I restart it, I don’t end up with duplicate data in my dashboards?”
Data deduplication in ClickHouse is a frequently discussed topic among ClickHouse users and experts.
That’s because, unfortunately, there’s no built-in solution that handles all cases (at least at the time of writing this blog post, but with ClickHouse being developed so fast, there might be one by the time you read this…).
Primary Keys In ClickHouse
The reason there’s no general built-in solution lies deep in how ClickHouse is built and in what makes it so fast.
When creating a table in ClickHouse you can specify a primary key, but unlike in other DBMSs such as MySQL, a primary key in ClickHouse does not mean you cannot have the same entry twice.
The primary key is used for sorting the data and accessing data blocks, but it’s not meant to be unique. Depending on the table engine you choose, the primary key will also be used for merging multiple rows together.
Because of how ClickHouse is built under the hood, this makes it very hard to ensure uniqueness.
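To make this concrete, here is a minimal sketch with a hypothetical pk_demo table, showing that ClickHouse happily accepts the same primary key value twice:
-- Hypothetical demo table: `id` is the primary/sorting key.
CREATE TABLE pk_demo
(
    `id` UInt64,
    `value` String
)
ENGINE = MergeTree
ORDER BY id;

-- Both inserts are accepted; no uniqueness check happens.
INSERT INTO pk_demo VALUES (1, 'first');
INSERT INTO pk_demo VALUES (1, 'second');

SELECT count() FROM pk_demo WHERE id = 1; -- returns 2, not 1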
Ways Around ClickHouse’s Limitations
Ok, so we know that row uniqueness doesn’t come as an easy built-in feature, but there are ways to tackle the issue, right?
Actually, there are multiple ways to reach that goal.
Our friends at Altinity have a whole section about it in their knowledge base: https://kb.altinity.com/altinity-kb-schema-design/row-level-deduplication/.
Depending on your data and workload, different approaches might work better for you, and others won’t work at all.
In my experience, a combination of approaches 2 and 5 works best for most use cases.
First, we have to assume a few things to get the logic right:
- Data is not sharded or the sharding logic is also based on the deduplication key
- Data duplication can only happen within a specific time window (for example, up to one week in the past)
- There are no materialized views involved (as MVs in ClickHouse are actually insert triggers, and they don’t care about the data being in a table already)
Practical Implementation
As we saw in the knowledge base, each approach has some limitations, like being complicated to implement, being slow when fetching the data, etc.
So how can we combine different approaches to get a solution which works for most cases? Let’s assume our base table looks like this:
CREATE TABLE dt_longterm
(
    `id` UInt64,
    `evDate` Date,
    `int_col` Int32,
    `str_col` String
)
ENGINE = ReplacingMergeTree
PARTITION BY toYYYYMM(evDate)
ORDER BY (id, evDate);
We want to ensure that the id is unique for a given day.
As we are using ReplacingMergeTree, we can already use SELECT … FROM dt_longterm FINAL and all is well. Unfortunately, the FINAL keyword means that the select can get a lot slower.
But what if we are sure that there are no duplicates in the table?
Then we can just omit the FINAL keyword and get the full speed of ClickHouse. Of course, the question is: how can we be sure about the uniqueness of id in that table?
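For illustration, the two query shapes might look like this (the date is just an example):
-- With possible duplicates, FINAL merges rows at query time: correct, but slower.
SELECT count() FROM dt_longterm FINAL WHERE evDate = '2022-10-30';

-- If we already know the data is deduplicated, we can omit FINAL
-- and let ClickHouse read at full speed.
SELECT count() FROM dt_longterm WHERE evDate = '2022-10-30';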
This is where we combine approach #2 (using ReplacingMergeTree) with the idea of approach #5 (using multiple tables): we create a second table, where we handle the deduplication of the newer data:
CREATE TABLE dt_shortterm
(
    `id` UInt64,
    `evDate` Date,
    `int_col` Int32,
    `str_col` String
)
ENGINE = ReplacingMergeTree
PARTITION BY toYYYYMMDD(evDate)
ORDER BY (id, evDate);
Note that the table looks almost identical to the original; it just uses a more granular partition key, so we can easily deduplicate full days in the backend by using OPTIMIZE TABLE … PARTITION … FINAL.
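For illustration, deduplicating a single day of the short term table could look like this; since the partition key is toYYYYMMDD(evDate), the partition value is simply the numeric YYYYMMDD of that day (the date here is just an example):
-- Merge all parts of one daily partition and apply the
-- ReplacingMergeTree deduplication for that day (2022-11-01 => 20221101).
OPTIMIZE TABLE dt_shortterm PARTITION 20221101 FINAL;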
With this additional short term table, we can now change our inserts to go directly to this table.
But of course we’re not finished, as we still need a script which does the deduplication when filling the long term table.
Fortunately, those scripts can be quite short, as they only have to execute small, simple SQL snippets to copy the deduplicated data into the long term table:
INSERT INTO dt_longterm
SELECT *
FROM dt_shortterm FINAL
WHERE evDate = toDate(now()) - toIntervalDay(7)
Ok, now we can be sure that our long term table only contains deduplicated data, as the FINAL keyword enforces deduplication when moving the data over.
However, to get the most recent data, we would still have to access two different tables, which would make SELECT queries complicated.
Fortunately we can just create a simple view to select the data from both tables:
CREATE OR REPLACE VIEW dt_combine AS (
SELECT * FROM dt_longterm WHERE evDate < toDate(now()) - toIntervalDay(7)
UNION ALL
SELECT * FROM dt_shortterm FINAL WHERE evDate >= toDate(now()) - toIntervalDay(7)
);
Notice that we use the FINAL keyword when selecting data from the short term table and omit it when selecting from the long term one.
That way we don’t have to think about when to use FINAL in our application’s select queries, and we still get the data the fastest way possible. The change in the application itself is minimal: you just point inserts at the short term table and point selects at the view which combines short and long term data.
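As a minimal sketch (with made-up values), the application-side changes boil down to this:
-- Inserts now target the short term table ...
INSERT INTO dt_shortterm (id, evDate, int_col, str_col)
VALUES (42, today(), 7, 'example');

-- ... while selects go through the combined view, with no FINAL
-- needed in the application query itself.
SELECT count() FROM dt_combine WHERE evDate = today();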
Some Speed Tests
So does the solution really work the way it should?
Well, we can just use some test data and run some simple selects. Note that the cutoff date between long term and short term was the 1st of November when writing this blog!
Select data within the deduplication window:
SELECT
    avg(length(str_col)) AS avgl,
    count(*) AS ct,
    int_col % 5 AS grcol
FROM dt_combine
WHERE evDate = '2022-11-01'
GROUP BY grcol
Query id: 6c320aad-48e5-43b7-ae54-8290445f5497
┌─avgl─┬─────ct─┬─grcol─┐
│  3.5 │ 200000 │     0 │
│  4.5 │ 200000 │     1 │
│  5.5 │ 200000 │     2 │
│  6.5 │ 200000 │     3 │
│  7.5 │ 200000 │     4 │
└──────┴────────┴───────┘
5 rows in set. Elapsed: 0.028 sec. Processed 1.06 million rows, 30.13 MB (37.98 million rows/s., 1.08 GB/s.)
Now select data older than the deduplication window:
SELECT
    avg(length(str_col)) AS avgl,
    count(*) AS ct,
    int_col % 5 AS grcol
FROM dt_combine
WHERE evDate = '2022-10-30'
GROUP BY grcol
Query id: bbaced99-2ffe-44f0-9594-a978dbdf2f58
┌─avgl─┬─────ct─┬─grcol─┐
│  3.5 │ 200000 │     0 │
│  4.5 │ 200000 │     1 │
│  5.5 │ 200000 │     2 │
│  6.5 │ 200000 │     3 │
│  7.5 │ 200000 │     4 │
└──────┴────────┴───────┘
5 rows in set. Elapsed: 0.010 sec. Processed 5.29 million rows, 29.25 MB (546.45 million rows/s., 3.02 GB/s.)
As you can see, the query for the 30th of October processes more than 500 million rows per second, while the query for the 1st of November only processes about 40 million rows per second because of the FINAL keyword.
However, as the short term table has a more granular partitioning that can be used to reduce data access, the query is only about three times slower overall (0.028 s vs. 0.010 s).
Final Thoughts
In the end, it’s not that hard to hide all of the deduplication inside ClickHouse itself without changing your application too much.
The impact on speed is also limited with this solution: inserts won’t slow down at all, and selects will only be slower within the “deduplication window”.
Still, there might be ways to improve it further for the exact cases you have.
However, I’m quite sure that in a year or so I will look back at this blog and just smile, because by then there will be a built-in solution which handles deduplication out of the box, without the need for FINAL, views, multiple tables, etc., and will still be fast as heck!
ClickHouse® is a trademark of ClickHouse, Inc. https://clickhouse.com