Why is everyone talking about ClickHouse®?
It’s a good question as the platform seems to have taken the data world by storm, with more and more people adopting it within their data stack.
Now, cards on the table, we’re talking about ClickHouse® as DoubleCloud offers a managed ClickHouse service but that doesn’t explain why everyone else is…
We think the reason is probably because we all now live firmly in the world of Big Data.
Big Data can help organizations derive actionable business intelligence from a tangle of information coming into their business from multiple, disparate sources but if it’s to be useful it needs to be both fast and actionable.
Realizing the data was indicating you should do something weeks after the issue has passed isn’t helpful for anyone.
That’s where ClickHouse comes in.
It’s a solution designed to help organizations store, manage and analyze Big Data.
ClickHouse, short for Clickstream Data Warehouse, was created as a columnar OLAP database that’s become known for it’s incredibly high insert rates, its equally fast analytical queries and an SQL-like code language.
It’s highly scalable but perhaps most important, it’s open-source, cutting out on the vendor lock-in other data solutions can create.
Its proponents (ourselves included) state that ClickHouse is the perfect solution when working with enormous structured data sets for analytics, reporting and much, much more.
Why Is ClickHouse So Good?
ClickHouse is good for so many reasons, the main one being its high performance output.
Those high performance levels can largely be attributed to:
- It’s column-oriented data storage architecture
- It’s data compression functionality
- It’s vector computation engine
- The clever use of physical sparse indices
Massively increased performance isn’t the only benefit to ClickHouse though.
It’s important to remember that ClickHouse isn’t just a database (even a great database); it’s also a sophisticated data management system that is easily capable of supporting distributed query processing, partitioning, replication & sharding.
What Can ClickHouse Do?
What can’t ClickHouse do?
One of our favorite things about ClickHouse is the vibrant and active open-source dev-sphere that surrounds the product with many developers contributing third-party apps and functionality upstream (including, we’re proud to say, many of our own developers here at DoubleCloud).
Don’t think though that means there’s loads of half built, untested applications and plug-in’s floating around.
Despite ClickHouse’s rapid growth, the vetting procedure for new features and plug-ins remains robust and rigorous.
Aside from these third-party add-ons though, what else does ClickHouse do? What are its core functions?
Column-Oriented DBMS
This feature is probably the most well known of all of ClickHouse’s features; the one everyone always seems to know about.
ClickHouse stores data in a DBMS in which the values of the same columns physically next to each other are stored with no extra data attached to each value.
Why is that important?
In a data set with a few thousands or so records it really isn’t. But consider having billions or even trillions of items in the table. Saving even a small amount of extra data massively adds up over those kinds of numbers and has huge benefits when it comes to affecting the speed of compression and decompression when utilizing the records.
Data Compression
Speaking of data compression…
Like a lot of data management solutions, to hit their performance targets, ClickHouse utilizes a lot of data compression.
There’s your run of the mill compression you’re probably familiar with but also a whole lot of specialized compression techniques that target different types of data stored in separate columns.
It makes reading your data a lot faster if your query involves just looking up a few columns (as you don’t need to read entire rows).
Query Processing Across Multiple Servers
Another great feature of ClickHouse is that it supports distributed query processing with any data stored across different shards. Large queries get parallelized across multiple cores meaning they only use resources when they’re needed, making this system more efficient and cheaper!
Vector Computation Engine
When ClickHouse processes an organization’s data, it does so in what it calls vectors, something other OLAP databases haven’t yet been able to achieve.
Even with compression and columnar data storage, most other solutions still rely on incremental processing to pre-compute aggregated data.
These vectors are groupings of columns that when processed, are treated as one rather than individual values, again, massively speeding up the performance of the DBMS.
This ‘vectorised computing’ also grants ClickHouse the opportunity to use more efficient code on its platform, with modern SIMD processors
No Database Locks
Unlike other solutions, ClickHouse updates all of your tables continuously to improve performance, without the need of locks when adding in new data.
Primary Data Skipping Indices
As you can see, almost all of these features are designed to improve performance.
Another way ClickHouse does this is by storing data using a primary key. Secondary indices are then used to show in advance which data won’t match any filtering criteria which may be applied (and can therefore just be skipped over).
Approximated Calculations
Finally, as if another boost to performance was needed, ClickHouse supports calculations based on samples of data held rather than the entire set, with a balance between accuracy and high-end performance being finely tuned.
Why Did ClickHouse Base Their Tech On A Column-Oriented Database Management System?
To understand why ClickHouse settled on their column-oriented approach it’s first important to understand OLTP’s and OLAP’s, and in particular, the granular way they manipulate data.
OLTP (Online Transaction Processing)
OLTP apps are used to perform very small but very frequent operations that need to insert, update or select small numbers of rows.
For an OLTP, the de-facto norm has always been to use a row-oriented approach as the most efficient method.
OLAP (Online Analytical Processing)
OLAP’s are almost the complete opposite of OLTP’s.
OLAP’s never target single lines. Instead they’re created to deal with hundreds of thousands (even millions) of records at a time.
Data in an OLAP is often represented by events and because of that rarely needs updating. That means they only a fraction of a field needs to be retrieved or processed at any one time.
Whilst capable of handling much more data, this does make them inefficient at handling complete rows.
That means OLTP apps are great for storing and easily updating individual rows whilst OLAP apps are almost always used for analyzing massive swathes of data.
OLAP & Column Oriented Systems
Column-oriented solutions like ClickHouse were designed to solve OLAP performance issues.
With ClickHouse, the data is physically grouped and stored by columns. That approach minimizes disk access whilst also improving performance as a query now only needs to read a fraction of the data.
It’s great for compression as all the columns contain the same type of data.
It’s also a lot easier to manage, with less overhead as files can be easily created or deleted. By contrast, when adding a new column in a row-oriented database, every single row would need to be updated!
Disadvantages of ClickHouse
DoubleCloud are an upfront kind of organization and we recognise that ClickHouse and our managed ClickHouse solution may not be the perfect fit for everyone.
We think ClickHouse can be an awesome fit for most scenarios but it’s important to remember that it might not be suitable for everyone…
One of the biggest complaints we hear is that, as ClickHouse is so unique and different to other solutions out there, it’s too easy to overlook something that can lead to days of trawling for a mistake, leading to suboptimal performance (although the easy answer there is to opt for a managed ClickHouse solution with a company offering a free trial of their platform >>.
Other minor complaints that are worth highlighting is that ClickHouse is SQL ‘like’ but not quite SQL — which can cause confusion for some.
It doesn’t currently support full-fledged transactions and its sparse index means it’s not quite as efficient as other solutions for point queries when retrieving single rows by their keys.
That’s because it was designed for OLAP applications so if you’re just looking for an OLTP app, you’re probably better off looking elsewhere.
Is ClickHouse The Right Choice For Me?
As we say, ClickHouse was designed for OLAP operations, so if that’s what you’re looking for, it’s probably the right choice for you.
Working with huge amounts of data (we’re talking terabytes +)? ClickHouse is what you need.
Working with tables that have huge amounts of columns with short values? ClickHouse is almost certainly what you need!
It’s also worth investigating if your data is well-structured but not aggregated, you’re looking to insert data in large batches, you don’t need to modify a lot of data later, you’re not looking to retrieve specific rows and you don’t need transactions.
ClickHouse is a trademark of ClickHouse, Inc. https://clickhouse.com