Sharding is a method of splitting the data of a single Managed ClickHouse® cluster and storing it on separate partitions, called shards. Each shard has the same schema and columns, yet the data in each shard is unique and independent of other shards.
A shard consists of one or more replica hosts. There is no dedicated master, so you can send a write or read request to any replica in a shard. When you insert data into a shard, Managed Service for ClickHouse® takes it from the replica on which the INSERT command was executed and then copies it to the other replicas within the shard.
Benefits of sharding
Capability to efficiently scale up resources and improve query performance within your cluster
If you work with large data sets and your application is running at the limits of its hardware, sharding spreads the data across several hosts so that read operations can run concurrently on different shards.
Improved query performance
Under load, requests compete for the resources of cluster hosts, which slows down processing. A sharded cluster can execute queries to the same table in parallel on different shards. This reduces competition for resources and shortens query processing time.
Improved fault tolerance
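Without sharding, if one host or a set of replicas fails, all the data they contain may become inaccessible. With sharding, the failure affects only the data on the failed shard: if one shard out of five fails, about 80% of the data is still available, assuming the data is spread evenly across the shards.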
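The arithmetic behind the 80% figure can be sketched in a few lines. This is only an illustration: the function name `available_fraction` is hypothetical, and it assumes rows are spread evenly across shards and that a failed shard has no surviving replica.

```python
def available_fraction(total_shards: int, failed_shards: int) -> float:
    """Fraction of data still reachable when some shards are down.

    Assumption: every shard holds an equal share of the data.
    """
    if total_shards <= 0:
        raise ValueError("total_shards must be positive")
    if not 0 <= failed_shards <= total_shards:
        raise ValueError("failed_shards must be between 0 and total_shards")
    return (total_shards - failed_shards) / total_shards


# One shard out of five fails.
print(f"{available_fraction(5, 1):.0%}")  # 80%
```

With a single shard, the same calculation gives 0% as soon as that shard is lost, which is the unsharded case described above.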
Use of sharding
When executing an INSERT query, ClickHouse® uses a sharding key (similar to a partitioning key) to determine where to place the data. The value of the key determines which shard the data is sent to.
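As a rough illustration of how a sharding key value can map to a shard, the sketch below follows the scheme described in the ClickHouse® Distributed table engine documentation: the key value is divided by the total weight of the shards, and the remainder selects a shard. The function `pick_shard`, the `user_id` column, and the sample rows are hypothetical and are not part of any ClickHouse® API.

```python
from bisect import bisect_right
from itertools import accumulate


def pick_shard(sharding_value: int, weights: list[int]) -> int:
    """Return the index of the shard that should store a row.

    The remainder of sharding_value divided by the total weight is
    matched against each shard's half-open interval of remainders.
    """
    total = sum(weights)
    remainder = sharding_value % total
    # Shard i owns remainders in [sum(weights[:i]), sum(weights[:i + 1])).
    upper_bounds = list(accumulate(weights))
    return bisect_right(upper_bounds, remainder)


# Hypothetical rows, sharded by user_id across three equally weighted shards.
rows = [{"user_id": uid, "event": "click"} for uid in (10, 11, 12, 13)]
weights = [1, 1, 1]
placement = {row["user_id"]: pick_shard(row["user_id"], weights) for row in rows}
print(placement)  # {10: 1, 11: 2, 12: 0, 13: 1}
```

Because the same key value always yields the same remainder, all rows with a given `user_id` end up on the same shard in this sketch.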
SELECT queries send subqueries to all shards in your cluster, regardless of how data is distributed across the shards.
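The fan-out behaviour of SELECT queries can be pictured as a scatter-gather step: a subquery runs on every shard, and the partial results are merged. The sketch below models this with in-memory lists standing in for shards. The names `SHARDS`, `run_subquery`, and `distributed_sum` are hypothetical, and the code does not reflect how ClickHouse® implements query execution internally.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory shards: each holds its own unique slice of the table.
SHARDS = [
    [{"user_id": 12, "clicks": 3}],
    [{"user_id": 10, "clicks": 5}, {"user_id": 13, "clicks": 1}],
    [{"user_id": 11, "clicks": 2}],
]


def run_subquery(shard_rows):
    """Partial aggregate on one shard: the sum of clicks over its local rows."""
    return sum(row["clicks"] for row in shard_rows)


def distributed_sum(shards):
    """Send the subquery to every shard in parallel, then merge the partial sums."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(run_subquery, shards))
    return sum(partials)


print(distributed_sum(SHARDS))  # 11
```

Every shard is queried here even if only one of them holds matching rows, which mirrors the statement that subqueries go to all shards regardless of how the data is distributed.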
Sharding management in Managed Service for ClickHouse®
You can specify the number of shards when creating or editing a cluster. For more information, see Create a cluster.
Use at least two shards in your cluster to get the full benefits of sharding.
Using a distributed table and a single shard is equivalent to using replication with no sharding. It protects against data loss when one or more replica hosts fail, but provides no distributed data storage and doesn't improve query performance.