ClickHouse® sharding
Sharding is a method of splitting the data of a single Managed Service for ClickHouse® cluster and storing it on separate partitions called shards. All shards share the same table schema and columns, but each shard stores its own portion of the data, independent of the other shards.
A shard consists of one or more replica hosts. There is no dedicated master, so you can send a write or read request to any replica of the shard. When you insert data into a shard, Managed Service for ClickHouse® takes it from the replica on which the INSERT query was executed. After the data has been inserted, it is copied to the other replicas within the shard.
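To see how hosts are grouped into shards and replicas, you can query the `system.clusters` table on any host. This is a minimal sketch; the exact cluster name shown depends on your deployment:

```sql
-- Each row is one replica host; shard_num groups replicas into shards.
SELECT cluster, shard_num, replica_num, host_name
FROM system.clusters;
```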
This article covers:
- Benefits of sharding
- Use of sharding
- Sharding management in Managed Service for ClickHouse®
- Sharding best practices
Benefits of sharding
- Efficient scaling of cluster resources

  If you operate large data sets and your application runs at the limits of hardware performance, sharding allows read operations to be distributed across shards and performed concurrently.
- Improved query performance

  Under load, queries compete for the resources of cluster hosts, which reduces the processing rate. A sharded cluster can execute queries against the same table on multiple shards in parallel. This reduces competition for resources and improves query processing time.
- Improved fault tolerance

  Without sharding, the failure of one host or a set of replicas may make all the data they contain inaccessible. With sharding, the failure of one shard affects only its portion of the data: for example, if one shard out of five fails, 80% of the data remains available.
Use of sharding
When executing an INSERT query, ClickHouse® uses a sharding key (similar to a partitioning key) to determine which shard the data should be written to.
SELECT queries send subqueries to all shards in your cluster, regardless of how data is distributed across the shards.
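As an illustration, a Distributed table routes INSERT queries by sharding key and fans SELECT queries out to every shard. The table names (`hits`, `hits_distributed`), columns, and sharding key below are hypothetical; `'{cluster}'` is a macro that expands to the cluster name on each host:

```sql
-- Local table: exists on every shard and stores that shard's portion of the data.
CREATE TABLE hits ON CLUSTER '{cluster}'
(
    user_id    UInt64,
    event_time DateTime
)
ENGINE = MergeTree
ORDER BY (user_id, event_time);

-- Distributed table: a proxy that routes queries to the local tables.
-- cityHash64(user_id) is the sharding key: rows with the same user_id
-- always land on the same shard.
CREATE TABLE hits_distributed ON CLUSTER '{cluster}'
AS hits
ENGINE = Distributed('{cluster}', currentDatabase(), hits, cityHash64(user_id));

-- The INSERT uses the sharding key to pick a shard;
-- the SELECT sends subqueries to all shards and merges the results.
INSERT INTO hits_distributed VALUES (42, now());
SELECT count() FROM hits_distributed;
```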
Sharding management in Managed Service for ClickHouse®
You can specify the number of shards when creating or editing a cluster. For more information, see Create a cluster.
Sharding best practices
We recommend using at least two shards in your cluster.
Using a distributed table with a single shard is equivalent to using replication without sharding: it protects against data loss if one or more replica hosts fail, but provides no distributed data storage and doesn't improve query performance.
To set up efficient distributed data storage with additional fault tolerance, use multiple replicas per shard with tables based on the ReplicatedMergeTree engine.
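A replicated local table can be sketched as follows; replicas within a shard synchronize data through the coordination service. The table name `events` and its columns are hypothetical, and the `{shard}` and `{replica}` macros are filled in per host:

```sql
-- One replicated table per shard: the ZooKeeper path is shared by all
-- replicas of a shard, while the replica name is unique per host.
CREATE TABLE events ON CLUSTER '{cluster}'
(
    id UInt64,
    ts DateTime
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
ORDER BY id;
```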