Troubleshoot an overloaded Managed ClickHouse® cluster

ClickHouse® is known to be fast and resource efficient, but it has limits like any other system. When a ClickHouse® cluster is overloaded, its performance degrades, which can cause failures in applications that rely on it for data.

This page explains how to investigate why your ClickHouse® cluster is overloaded and how to optimize it for increased loads, so that they don’t negatively affect performance.

Monitor the cluster health regularly. Doing so helps you confirm that the cluster works reliably and can handle temporary usage spikes should they occur.

Increased CPU usage

High CPU usage is the main indicator that a cluster is under high load. If your cluster shows increased CPU usage, first check how long it has lasted. You don’t need to worry about short bursts of CPU usage because they can be caused by a large insert operation, a complex query, or something similar.

If the CPU usage has been between 80% and 100% for over a day, check if the number of connections has increased too.
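As a minimal sketch, the current number of open client connections can be read from the system.metrics table. This assumes the TCPConnection and HTTPConnection metrics are exposed in your ClickHouse® version. The values are point-in-time only, so compare them with the history in your monitoring charts to see whether the number has grown.

```sql
-- Current number of open client connections, by protocol.
-- Point-in-time values: they do not show how the number changed over time.
SELECT metric, value
FROM system.metrics
WHERE metric IN ('TCPConnection', 'HTTPConnection');
```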

Higher number of connections

If the increase in CPU usage has coincided with a rise in the number of connections, the overload is most likely caused by increased request concurrency.

To check the number of queries that are currently executing, run the following query:

SELECT value FROM system.metrics WHERE metric = 'Query'
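To see which queries make up that number, one option is to list the longest-running entries in the system.processes table. This is a sketch rather than a required step, and the limit of 10 rows is an arbitrary choice.

```sql
-- Queries that are executing right now, longest-running first.
SELECT
    query_id,
    user,
    elapsed,                                        -- seconds since the query started
    formatReadableSize(memory_usage) AS memory,     -- memory used by the query so far
    query
FROM system.processes
ORDER BY elapsed DESC
LIMIT 10;
```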

There are three ways to handle increased concurrency:

  • Consider using materialized views to precompute results so that queries execute faster.

  • If slower queries are okay for your use case, limit the number of threads each query can use by setting max_threads to half of its current value.

  • Add more replicas. Every replica can execute queries independently, so if you scale out from three to five replicas, you can run about 67% more concurrent queries.
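The first two options can be sketched as follows. All table, column, and view names here (events, events_per_hour, events_per_hour_mv) are hypothetical, and the value 4 for max_threads is only an example. On a replicated cluster you would likely need the Replicated variants of the table engines, which this sketch leaves out.

```sql
-- Option 1: a materialized view that pre-aggregates data at insert time.
-- Hypothetical source table.
CREATE TABLE events
(
    event_time DateTime,
    user_id    UInt64
)
ENGINE = MergeTree
ORDER BY event_time;

-- Hypothetical target table that stores one running count per hour.
CREATE TABLE events_per_hour
(
    hour   DateTime,
    events UInt64
)
ENGINE = SummingMergeTree
ORDER BY hour;

-- Every insert into events also writes aggregated rows into events_per_hour.
CREATE MATERIALIZED VIEW events_per_hour_mv TO events_per_hour AS
SELECT
    toStartOfHour(event_time) AS hour,
    count() AS events
FROM events
GROUP BY hour;

-- Read from the small pre-aggregated table instead of scanning events.
-- sum() is still needed because background merges may not have run yet.
SELECT hour, sum(events) AS events
FROM events_per_hour
GROUP BY hour
ORDER BY hour;

-- Option 2: check the current max_threads value, then lower it for this session.
SELECT name, value
FROM system.settings
WHERE name = 'max_threads';

-- Example only: if the current value is 8, half of it is 4.
SET max_threads = 4;
```

The SET statement only affects the current session. To apply the limit to all queries, change the setting in the user profile or the cluster settings instead.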

Same number of connections

If the number of connections hasn't changed but the CPU usage has increased, check the RAM usage in the cluster.
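Besides the monitoring charts, a rough view of current memory usage is available from inside ClickHouse®. The sketch below assumes the MemoryTracking metric and the MemoryResident asynchronous metric exist in your version. Both are point-in-time values for the node you are connected to.

```sql
-- Memory currently tracked as allocated by the server process.
SELECT metric, formatReadableSize(value) AS size
FROM system.metrics
WHERE metric = 'MemoryTracking';

-- Resident memory of the server process as reported by the operating system.
SELECT metric, formatReadableSize(value) AS size
FROM system.asynchronous_metrics
WHERE metric = 'MemoryResident';
```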

If the RAM usage has also increased, add more RAM to the cluster. ClickHouse® performs best when it has enough RAM.

If the RAM usage has remained the same, the CPU usage may have increased because of a higher ingestion rate. To mitigate this:

  • Consider distributing the ingestion load across nodes using a load balancer.

  • Make sure you’re using bulk inserts. An insert can contain tens of thousands or even millions of rows, but there shouldn't be more than one insert per node per second.
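To check whether a node stays within the one-insert-per-second guideline, one approach is to count finished inserts per minute in the query log. This sketch assumes that system.query_log is enabled and that your version has the query_kind and written_rows columns. The one-hour window is an arbitrary choice.

```sql
-- Inserts per minute on the current node over the last hour.
-- More than 60 inserts in a minute means more than one insert per second on average.
SELECT
    toStartOfMinute(event_time) AS minute,
    count() AS inserts,
    sum(written_rows) AS rows_written
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query_kind = 'Insert'
  AND event_time > now() - INTERVAL 1 HOUR
GROUP BY minute
ORDER BY minute;
```

A high insert count combined with a low rows_written value per insert suggests that the client sends many small inserts that could be batched.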

Increased RAM consumption

If the CPU usage remains under 60% and only the RAM usage in the cluster is high, check how the RAM usage has changed over a longer time range.

Gradual increase in RAM consumption

If RAM usage in the cluster has been increasing steadily over a long period of time, it usually indicates that more data has been inserted into the cluster. In this case, the instance has become too small to handle the increased load.

Write operations in ClickHouse® are resource intensive, and the cluster performance decreases when ClickHouse® needs to write a lot of data.

To mitigate this, select a resource preset with more RAM.
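To confirm that data growth is behind the trend, you can look at how much data each table currently holds. The following is a sketch based on the system.parts table. It shows the current state only, so run it periodically or compare it with earlier measurements to see the growth.

```sql
-- Largest tables on the current node, counting only active data parts.
SELECT
    database,
    table,
    sum(rows) AS total_rows,
    formatReadableSize(sum(bytes_on_disk)) AS size_on_disk
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY sum(bytes_on_disk) DESC
LIMIT 10;
```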

Spike in RAM consumption

A jump in memory usage is typically caused by one or several queries that aren't performing well. Queries that use JOIN also need a lot of RAM.

To avoid such spikes in RAM usage, check the query log for resource-intensive queries and consider optimizing them.
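As a starting point, a query along these lines lists the finished queries with the highest peak memory usage. It assumes that system.query_log is enabled. The one-day window, the limit of 10 rows, and the 200-character cut of the query text are arbitrary choices.

```sql
-- Finished queries from the last day, ordered by peak memory usage.
SELECT
    event_time,
    query_duration_ms,
    formatReadableSize(memory_usage) AS peak_memory,
    read_rows,
    substring(query, 1, 200) AS query_start    -- first 200 characters of the query text
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 DAY
ORDER BY memory_usage DESC
LIMIT 10;
```

Queries that failed, for example because they hit a memory limit, are logged with the type ExceptionWhileProcessing, so you may want to run the same query with that type as well.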

Autoscaling

To prevent your Managed ClickHouse® cluster from running out of resources when the load increases, you can configure instance and storage autoscaling.