7 essential tips for a production ClickHouse cluster
Written by: Vladimir Ivoninskii
ClickHouse is a fast, resource-efficient OLAP database that can query billions of rows in milliseconds and is trusted by thousands of companies for real-time analytics.
Here are seven tips that will help you spin up a production ClickHouse cluster and avoid the most common mistakes.
Use multiple replicas
While testing ClickHouse, it’s natural to deploy a configuration with only one host because you may not want to use additional resources or take on unnecessary expenses.
There’s nothing wrong with this in a development or testing environment, but it can come at a cost in production: with a single host and a single replica, one failure puts all your data at risk.
For production loads, you should use several hosts and replicate data across them. Not only does this keep data safe when a host fails, it also lets you balance the user load across hosts, which makes resource-intensive queries faster.
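As a minimal sketch, replication is usually set up with the ReplicatedMergeTree engine. The table, cluster, and ZooKeeper/Keeper path below are illustrative assumptions, not values from this article; the {shard} and {replica} macros come from each server’s configuration.

```sql
-- Sketch: a replicated table on a cluster named 'production' (name is an assumption).
CREATE TABLE events ON CLUSTER production
(
    event_time DateTime,
    user_id    UInt64,
    message    String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
ORDER BY (user_id, event_time);
```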
Don’t be shy with RAM
ClickHouse is fast, but its speed depends on available resources, especially RAM. You can see great performance when running a ClickHouse cluster with the minimum amount of RAM in a development or testing environment, but that may change when the load increases.
In a production environment with many simultaneous read and write operations, a lack of RAM becomes much more noticeable: queries slow down across the board, and complex ones take noticeably longer to execute.
On top of that, when ClickHouse performs resource-intensive operations, it may compete with the operating system itself for RAM, which can end in an OOM kill of the server process, downtime, and data loss.
The ClickHouse developers recommend at least 16 GB of RAM to keep a cluster stable. You can opt for less memory, but only when you know the load will stay low.
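One practical safeguard, as a sketch: cap how much memory a single query may allocate, so a heavy query fails fast instead of starving the host. The 10 GB value below is an assumption; tune it to your hardware.

```sql
-- Limit per-query memory usage (value is illustrative, not a recommendation
-- from this article). The same setting can be applied per user in a profile.
SET max_memory_usage = 10000000000;  -- 10 GB
```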
Think twice when choosing a table engine
ClickHouse supports several table engines with different characteristics, but an engine from the MergeTree family will most likely be the right choice. Specialized engines are tailored for specific uses and have limitations that may not be obvious at first glance: Log family engines may seem ideal for logs, for example, but they don’t support replication and are limited in database size.
Table engines in the MergeTree family are the default choice, and they provide the core capabilities that ClickHouse is known for. Unless you know exactly why you need a different table engine, pick one from the MergeTree family; it will cover most of your use cases.
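A minimal sketch of a typical MergeTree table (all names are illustrative):

```sql
CREATE TABLE page_views
(
    event_time DateTime,
    user_id    UInt64,
    url        String
)
ENGINE = MergeTree
ORDER BY (user_id, event_time);
```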
Don’t use more than three columns for the primary key
Primary keys in ClickHouse don’t serve the same purpose as in traditional databases. They don’t enforce uniqueness; instead, they define how data is sorted on disk and then retrieved. You can read more about primary keys in the blog post Deduplication in ClickHouse by our Solutions Architect Stefan.
Using more columns in the primary key, even all of them, can make some queries faster. But ClickHouse performance depends not only on reading data but also on writing it: when the primary key contains many columns, every insert slows the whole cluster down.
The optimal primary key size in ClickHouse is two or three columns: enough to speed up queries without slowing down inserts. When choosing the columns, think about the queries that will be run and pick the columns that appear most often in filters.
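For example, as a sketch (table and column names are assumptions), if most queries filter by customer and time range, a two-column key serves them well:

```sql
CREATE TABLE orders
(
    created_at  DateTime,
    customer_id UInt64,
    status      LowCardinality(String),
    amount      Decimal(12, 2)
)
ENGINE = MergeTree
ORDER BY (customer_id, created_at);  -- this tuple is also the primary key

-- This filter can use the primary key index to skip most of the data:
SELECT count()
FROM orders
WHERE customer_id = 42 AND created_at >= now() - INTERVAL 7 DAY;
```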
Avoid small inserts
When you insert data into ClickHouse, it writes each insert to disk as a new data part, sorted by the primary key; background merges then combine these parts. If you insert small chunks of data very often, ClickHouse creates a separate part for every small insert, which slows down the whole cluster and can end in the “Too many parts” error.
To insert data efficiently, add it in big chunks and avoid sending more than one insert statement per second. ClickHouse can ingest data at a very high rate, even 100K rows per second, but it should arrive as one bulk insert rather than many small ones.
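In other words, prefer one multi-row statement to many single-row ones. A sketch, reusing the illustrative page_views table from above:

```sql
-- Good: one bulk insert that creates a single data part.
INSERT INTO page_views (event_time, user_id, url) VALUES
    ('2024-01-01 00:00:00', 1, '/home'),
    ('2024-01-01 00:00:01', 2, '/pricing'),
    ('2024-01-01 00:00:02', 3, '/docs');

-- Bad: sending the same rows as three separate INSERT statements,
-- each of which creates its own data part on disk.
```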
If your data arrives in small portions, consider using an external system such as Managed Kafka to collect it into batches. ClickHouse integrates well with Kafka and can consume data from it efficiently.
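As a sketch of that integration, ClickHouse’s Kafka table engine can read a topic, and a materialized view can move the rows into a MergeTree table. The broker address, topic, and consumer group names below are assumptions:

```sql
CREATE TABLE events_queue
(
    event_time DateTime,
    user_id    UInt64,
    url        String
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list  = 'events',
         kafka_group_name  = 'clickhouse-consumer',
         kafka_format      = 'JSONEachRow';

-- Background consumer: pushes every batch read from Kafka into page_views.
CREATE MATERIALIZED VIEW events_mv TO page_views AS
SELECT event_time, user_id, url FROM events_queue;
```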
Think of how you will get rid of duplicate data
Primary keys in ClickHouse don’t ensure that data is unique. Unlike in other databases, if you insert duplicate data into ClickHouse, it will be stored as is.
Thus, the best option is to ensure the data is unique before inserting it, for example in a stream processing application such as Apache Kafka. If that’s not possible, there are ways to deal with duplicates at query time. One option is to use argMax to select only the latest version of a duplicate row. You can also use the ReplacingMergeTree engine, which removes duplicate entries by design. Finally, you can run OPTIMIZE TABLE ... FINAL to merge data parts, but that’s a resource-demanding operation, and you should only run it when you know it won’t affect cluster performance.
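A sketch of those options side by side (table and column names are assumptions):

```sql
-- ReplacingMergeTree keeps the row with the highest 'updated' value
-- for each primary key value when parts are merged.
CREATE TABLE users
(
    user_id UInt64,
    name    String,
    updated DateTime
)
ENGINE = ReplacingMergeTree(updated)
ORDER BY user_id;

-- Merges happen in the background, so duplicates may still be visible;
-- argMax collapses them at query time:
SELECT user_id, argMax(name, updated) AS name
FROM users
GROUP BY user_id;

-- Force the merge (resource-demanding, use with care):
OPTIMIZE TABLE users FINAL;
```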
Don’t create an index for every column
Just like with primary keys, you may be tempted to create many indexes to improve performance. An index helps when you query with filters that match it, but on the whole, extra indexes won’t make your queries faster.
At the same time, you’ll certainly feel the downsides of this strategy: multiple indexes significantly slow down data inserts, because ClickHouse has to both write the data to the right place and update every index.
When you want to create indexes in a production cluster, select the columns that correlate with the primary key.
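For instance, a data-skipping index works best on a column whose values correlate with the table’s sort order. A sketch, reusing the illustrative page_views table and assuming url has that property:

```sql
-- A bloom-filter skipping index on url (index name and parameters are assumptions).
ALTER TABLE page_views
    ADD INDEX url_idx url TYPE bloom_filter GRANULARITY 4;

-- Build the index for parts written before it existed (optional, resource-heavy):
ALTER TABLE page_views MATERIALIZE INDEX url_idx;
```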