Hybrid storage in Managed ClickHouse® clusters

Hybrid storage provides fault tolerance for data storage. It lets you manage data placement for MergeTree-family tables: depending on the storage policy set for a table, its data is placed either on the cluster's network disks or in object storage.

This feature is enabled by default: the DoubleCloud service always creates an S3 bucket for each Managed ClickHouse® cluster as a system disk. To use the feature, specify a non-default storage policy when creating a table.

The size of the object storage is not limited, but you are charged for each piece of data you store there. You are not charged for the object storage until you enable it and put some data there.

Specifying a TTL for hybrid storage is optional, but it lets you explicitly control which data goes to object storage. If you don't specify a TTL, data is placed in object storage only when the network disks run out of space.

To learn more about storage policies and disks, refer to the following ClickHouse® documentation section: Using Multiple Block Devices for Data Storage.
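To see which storage policies and disks a cluster exposes, you can query the ClickHouse® system tables. The following is a minimal sketch: it assumes your user is allowed to read system.storage_policies and system.disks, and the policy and disk names it returns depend on the cluster.

```sql
-- List the storage policies, their volumes, and the disks in each volume.
-- move_factor is the share of free space on a volume below which ClickHouse
-- starts moving data to the next volume.
SELECT policy_name, volume_name, volume_priority, disks, move_factor
FROM system.storage_policies
ORDER BY policy_name, volume_priority;

-- List the disks with their free and total space.
SELECT name, path,
       formatReadableSize(free_space) AS free,
       formatReadableSize(total_space) AS total
FROM system.disks;
```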

How to use hybrid storage

The following settings are required to configure a table that splits its data between hot storage (network disks) and cold storage (object storage):

  • The TTL expression defines how long a table should preserve data. Combined with the TO DISK rule, it also defines what happens to data whose lifetime has expired. You can set TTL only for values of the Date or DateTime types and only in MergeTree-family tables.

  • INTERVAL specifies how long data is kept before it's transferred to the object storage.

  • PARTITION BY specifies a custom partitioning key. It defines the property by which the data is split into partitions.

  • The storage_policy setting defines how data whose TTL has expired is stored. When the storage_policy is hybrid_storage, rows are placed only in the object storage, and data is not transferred between storages.

A sample query that creates a table with TTL looks as follows:

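CREATE TABLE table_name ON CLUSTER 'default'
      (date_column Date, ...)
      ENGINE = ReplicatedMergeTree
      -- Partition and sort by the date column, not by the Date type name
      PARTITION BY date_column
      ORDER BY (date_column)
      -- Move rows to the object storage 5 days after the date_column value
      TTL date_column + INTERVAL 5 DAY TO DISK 'object_storage'
      SETTINGS storage_policy = 'hybrid_storage'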

This query works as follows:

  • If fewer than 5 days have passed between the date_column value and the current date (that is, the lifetime has not yet expired), the data is stored on network drives.

  • If 5 or more days have passed between the date_column value and the current date (that is, the lifetime has already expired), the data is placed in the object storage according to the TO DISK 'object_storage' rule.
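To check where the parts of a table currently reside, you can query the system.parts table. The following is a sketch rather than a definitive procedure: it assumes that the table_name table from the sample query exists in the current database, and the 30-day interval in the second statement is an arbitrary illustrative value.

```sql
-- Show the disk on which each active part of the table is stored.
SELECT partition, name, disk_name, rows,
       formatReadableSize(bytes_on_disk) AS size
FROM system.parts
WHERE database = currentDatabase()
  AND table = 'table_name'
  AND active
ORDER BY partition, name;

-- Change the TTL of the existing table (30 days is an illustrative value).
ALTER TABLE table_name ON CLUSTER 'default'
    MODIFY TTL date_column + INTERVAL 30 DAY TO DISK 'object_storage';
```

Parts whose disk_name is object_storage have already been moved. ClickHouse® applies TTL moves in the background, so a part may remain on a network disk for some time after its lifetime expires.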

Caching in the DoubleCloud ClickHouse® object storage

DoubleCloud automatically caches data stored in the object storage. The cache uses the LRU (least recently used) algorithm, so recently accessed data stays cached and your cluster can read it much faster than other data in the object storage. This feature is enabled by default for all Managed ClickHouse® clusters running version 22.5 or later.
