ClickHouse® sharding

Sharding is a method of splitting and storing data from a single Managed ClickHouse® cluster on different partitions or shards. Each shard has the same schema and columns, yet the data in each shard is unique and independent of other shards.

A shard consists of one or more replica hosts. This means you can send a write or read request to any shard replica because there is no dedicated master. When you insert the data into a shard, the Managed Service for ClickHouse® service takes it from the replica on which the INSERT command was executed. After the data has been inserted, it's copied to other replicas within the shard.

This article features:

Benefits of sharding

  • Capability to efficiently scale up resources and improve query performance within your cluster

    If you operate large data sets and the application is running at hardware performance limits, sharding allows read operations to be performed concurrently.

  • Improved query performance

    Under load, requests compete for resources of cluster hosts so that it reduces the processing rate. A sharded cluster can execute queries to the same table in parallel. This reduces request competition for resources and improves query processing time.

  • Improved fault tolerance

    When you don't use sharding, if one host or a set of replicas fails, all the data they contain may become inaccessible. However 80% of the data is still available if one shard out of five fails.

Use of sharding

When executing the INSERT query, ClickHouse® uses a sharding key (similar to partitioning key ) to determine where to place the data. Its value determines which shard the query is directed to.

SELECT queries send subqueries to all shards in your cluster, regardless of how data is distributed across the shards.

Sharding management in Managed Service for ClickHouse®

You can specify the number of shards when creating or editing a cluster. For more information, see Create a cluster.

Sharding best practices

We suggest having at least two shards on your cluster.

Using a distributed table and a single shard is equivalent to using replication with no sharding. It protects against data loss when one or more replica hosts fail, but provides no distributed data storage and doesn't improve query performance.

To efficiently set up distributed data storage with additional fault tolerance, use multiple replicas per shard with tables operated with the ReplicatedMergeTree engine.

See also