Use hybrid storage for the DoubleCloud Managed ClickHouse® clusters
Hybrid storage in Managed ClickHouse® clusters allows you to optimize how data is stored and decrease storage costs. When hybrid storage is configured, frequently used hot data remains on the local disk, while cold data is offloaded to object storage: AWS S3 for clusters running on AWS and Google Cloud Storage for clusters running on Google Cloud.
This scenario shows how to create a database and a table with a TTL expression that distributes data between the local disk and object storage.
Before you start
- If you haven't already, create a ClickHouse® cluster using a method of your choice.
- Make sure you have the ClickHouse® client installed.
Connect to the cluster
- Select Clusters from the list of services on the left.
- Select the name of your cluster to open its information page. By default, you will see the Overview tab.
- Under Connection strings, find the Native interface string and click Copy.
- Open your terminal and run a command to connect to your cluster:
Docker

docker run --network host --rm -it clickhouse/<Native interface connection string>

The complete Docker command structure:

docker run --network host --rm -it \
clickhouse/clickhouse-client \
--host <FQDN of your cluster> \
--secure \
--user <cluster user name> \
--password <cluster user password> \
--port 9440

Native clickhouse-client

<Native interface connection string>
- Once connected, you can also check if your cluster has all the required policies. Get the available storage policies with the following query:
SELECT * FROM system.storage_policies
Look at the policy_name column. The output should display the following policies: default, local (an alias for default), object_storage, and hybrid_storage, which combines the default and object_storage volumes:

┌─policy_name────┬─volume_name────┬─volume_priority─┬─disks──────────────┬─────┬─prefer_not_to_merge─┐
│ default        │ default        │               1 │ ['default']        │.....│                   0 │
│ hybrid_storage │ default        │               1 │ ['default']        │.....│                   0 │
│ hybrid_storage │ object_storage │               2 │ ['object_storage'] │.....│                   1 │
│ local          │ default        │               1 │ ['default']        │.....│                   0 │
│ object_storage │ object_storage │               1 │ ['object_storage'] │.....│                   1 │
└────────────────┴────────────────┴─────────────────┴────────────────────┴─────┴─────────────────────┘
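If you also want to see which physical disks back these policies, you can query the system.disks table. A minimal sketch; the exact set of columns may vary between ClickHouse® versions:

SELECT name, path, type
FROM system.disks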
Create a database and a table with a non-default storage policy
When you've confirmed that all the required policies are present, you can create a database.
- Type the following command to create a database:
CREATE DATABASE IF NOT EXISTS "db_hits"
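To confirm the database was created, you can list the databases on the cluster; db_hits should appear in the output:

SHOW DATABASES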
- You have the database, and now it's time to create a table for the data you'll upload later. You'll need to define a TTL expression that determines when data is moved to object storage.

Send the following query to create a table partitioned by date:
CREATE TABLE db_hits.hybrid_storage_table ON CLUSTER default
(
    Hit_ID Int32,
    Date Date,
    Time_Spent Float32,
    Cookie_Enabled Int32,
    Region_ID Int32,
    Gender String,
    Browser String,
    Traffic_Source String,
    Technology String
)
ENGINE = ReplicatedMergeTree
PARTITION BY Date
ORDER BY (Hit_ID, Date)
TTL Date + toIntervalDay(dateDiff('day', toDate('2016-12-31'), now())) TO DISK 'object_storage'
SETTINGS storage_policy = 'hybrid_storage'
Warning
The TTL expression above is tailored to the selected test dataset: the data is fixed and was collected long ago, so the expression splits it into parts for placement on different storage tiers. It moves every row with a Date of 2016-12-31 or earlier to object storage, because for those rows the computed TTL is already in the past. For most tables that are constantly updated with new data, a simpler TTL expression will suffice.

For example, to move data older than 5 days to object storage, use the following clause:

TTL Date + INTERVAL 5 DAY TO DISK 'object_storage'
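If you've already created the table and want to switch to a simpler expression like the one above, ClickHouse® lets you change the TTL in place. A minimal sketch, assuming the table from this scenario:

ALTER TABLE db_hits.hybrid_storage_table ON CLUSTER default
MODIFY TTL Date + INTERVAL 5 DAY TO DISK 'object_storage'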
Upload data to your cluster
- Now, open another terminal instance or exit from the current clickhouse-client session with the exit command.
- Run the following query to fetch the sample data from our S3 bucket and upload it to the table with an INSERT statement:

Docker

curl https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/hits_sample.csv \
| docker run --network host --rm -i clickhouse/<Native interface connection string> \
--query="INSERT INTO db_hits.hybrid_storage_table FORMAT CSVWithNames" \
--format_csv_delimiter=";"

The complete Docker command structure:

curl https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/hits_sample.csv \
| docker run --network host --rm -i clickhouse/clickhouse-client \
--host <FQDN of your cluster> \
--port 9440 \
--secure \
--user <your cluster username> \
--password <your cluster password> \
--query="INSERT INTO db_hits.hybrid_storage_table FORMAT CSVWithNames" \
--format_csv_delimiter=";"

Native clickhouse-client

curl https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/hits_sample.csv \
| <Native interface connection string> \
--query="INSERT INTO db_hits.hybrid_storage_table FORMAT CSVWithNames" \
--format_csv_delimiter=";"

The --format_csv_delimiter=";" flag is required because the sample dataset uses semicolons instead of commas, and the CSVWithNames format tells ClickHouse® to treat the first row as column headers.
If the upload was successful, perform a query to select partitions, their names, and disks where these partitions are located:
Docker

docker run --network host --rm -i clickhouse/<Native interface connection string> \
--query="SELECT partition, name, disk_name FROM system.parts WHERE table = 'hybrid_storage_table'"

The complete Docker command structure:

docker run --network host --rm -i clickhouse/clickhouse-client \
--host <FQDN of your cluster> \
--port 9440 \
--secure \
--user <your cluster username> \
--password <your cluster password> \
--query="SELECT partition, name, disk_name FROM system.parts WHERE table = 'hybrid_storage_table'"

Native clickhouse-client

<Native interface connection string> \
--query="SELECT partition, name, disk_name FROM system.parts WHERE table = 'hybrid_storage_table'"
If you configured your table and uploaded data correctly, the terminal output will show that your data is divided between the local disk and object storage. A fragment of this output looks as follows:
┌─partition──┬─name───────────┬─disk_name──────┐
│ 2016-01-10 │ 20160110_0_0_0 │ object_storage │
│ 2016-01-30 │ 20160130_0_0_0 │ object_storage │
│ .......... │ .............. │ .............. │
│ 2016-12-25 │ 20161225_0_0_0 │ object_storage │
│ 2016-12-31 │ 20161231_0_0_0 │ object_storage │
│ 2017-01-06 │ 20170106_0_0_0 │ default │
│ 2017-01-12 │ 20170112_0_0_0 │ default │
│ 2017-01-18 │ 20170118_0_0_0 │ default │
│ .......... │ .............. │ .............. │
│ 2017-09-09 │ 20170909_0_0_0 │ default │
└────────────┴────────────────┴────────────────┘
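To see how much data ended up on each tier, you can aggregate the part sizes by disk. A minimal sketch using the same system.parts table; run it the same way as the query above:

SELECT
    disk_name,
    sum(rows) AS rows,
    formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE table = 'hybrid_storage_table' AND active
GROUP BY disk_name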