Hybrid storage in Managed ClickHouse® clusters

Hybrid storage provides fault tolerance for data storage. It lets you manage data placement for MergeTree-family tables: data is stored either on the cluster's network disks or in object storage, depending on the table's storage policy.

This feature is enabled by default. The DoubleCloud service creates an S3 bucket (if the cluster is located in AWS) or a Cloud Storage bucket (if it's located in Google Cloud) for each Managed ClickHouse® cluster as a system disk. To use the feature, specify a non-default storage policy when creating a table.

The size of the object storage isn't limited, but you're charged for the data you store there. You aren't charged for the object storage until you start storing data in it.

You're not required to set a TTL (Time To Live) for hybrid storage, but doing so gives you direct control over which data gets stored in object storage. Without specifying a TTL, data will be moved to object storage only when the storage on network disks runs out of space.
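As a minimal sketch of the no-TTL case, the table below only sets the storage policy. The table and column names (events_no_ttl, event_date, message) are hypothetical, and the policy name hybrid_storage is taken from the sample query later in this article. With this definition, data is expected to move to object storage only when the network disks run out of space:

```sql
-- Hypothetical table: no TTL clause, only a non-default storage policy.
CREATE TABLE events_no_ttl ON CLUSTER default (
    event_date Date,
    message String
)
ENGINE = ReplicatedMergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY event_date
SETTINGS storage_policy = 'hybrid_storage';
```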

To learn more about storage policies and disks, refer to the following ClickHouse® documentation: Using Multiple Block Devices for Data Storage.
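To see which storage policies and disks a cluster actually exposes, you can query the standard ClickHouse® system tables. This is a sketch only: the policy and disk names returned depend on your cluster, and nothing here is specific to DoubleCloud.

```sql
-- List storage policies with their volumes and the disks in each volume.
SELECT policy_name, volume_name, volume_priority, disks
FROM system.storage_policies
ORDER BY policy_name, volume_priority;

-- List the configured disks with their free and total space.
SELECT
    name,
    path,
    formatReadableSize(free_space) AS free,
    formatReadableSize(total_space) AS total
FROM system.disks;
```

The names returned in the policy_name and name columns are the values to use in the storage_policy setting and the TO DISK rule.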

How to use hybrid storage

The following settings are required to configure a table that will separate the data and place it in the cold and hot storage sections:

  • The TTL expression defines how long a table preserves data. Combined with the TO DISK rule, it also defines what happens to expiring data. You can set a TTL only on values of the Date or DateTime types and only in MergeTree-family tables.

  • INTERVAL specifies how long data stays on network disks before it's transferred to the object storage.

  • PARTITION BY specifies a custom partitioning key, that is, the property by which data is split into partitions.

  • The storage_policy setting defines which disks the table can use and how data with an expired TTL is stored. With the hybrid_storage policy, new data is written to network disks and moved to the object storage when its TTL expires or when the network disks run out of space.

A sample query to create a table with TTL looks as follows:

CREATE TABLE table_name ON CLUSTER default (
      date_column Date,
      ...
      )
      ENGINE = ReplicatedMergeTree
      PARTITION BY date_column
      ORDER BY (date_column)
      TTL date_column + INTERVAL 5 DAY TO DISK 'object_storage'
      SETTINGS storage_policy = 'hybrid_storage'

This query works as follows:

  • If the number of days from the date_column value to the current date is less than the TTL value (5 days in this example), the data is stored on network drives.

  • If the number of days from the date_column value to the current date is greater than or equal to the TTL value (that is, the lifetime has already expired), the data is placed in the object storage according to the TO DISK 'object_storage' rule.
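To check where the data parts of such a table currently reside, you can query the system.parts table. This is a sketch that assumes the table is named table_name, as in the sample above, and lives in the current database:

```sql
-- Show the disk that holds each active data part of the table.
SELECT
    partition,
    name AS part_name,
    disk_name,
    formatReadableSize(bytes_on_disk) AS size
FROM system.parts
WHERE database = currentDatabase()
  AND table = 'table_name'
  AND active
ORDER BY partition, name;
```

Parts whose TTL has expired should eventually report the object storage disk in the disk_name column. Moves run in the background, so a part may stay on a network disk for some time after its TTL expires.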

Caching in the DoubleCloud ClickHouse® object storage

DoubleCloud automatically caches data stored in the object storage. It uses the LRU (least recently used) algorithm, so the most recently used data stays in the cache, and your cluster can access it much faster than other data in the object storage. This feature is enabled by default for all Managed ClickHouse® clusters starting from version 22.5.

When you create a ClickHouse® cluster, its allocated cache size is 50% of the SSD Storage volume you configure. For example, the minimum possible s1-c2-m4 resource preset with 32 GB of available storage will have 16 GB of available cache.

After the cluster is created, its cache size remains fixed.

Tip

After increasing your cluster's SSD Storage capacity, contact our Support team to update its allocated cache size.
