Use hybrid storage for the DoubleCloud Managed ClickHouse® clusters

DoubleCloud Managed ClickHouse® clusters allow you to store data in two tiers: frequently accessed hot data stays on local disk storage, while cold data is offloaded to object storage (S3).

This scenario shows how to create a Managed ClickHouse® cluster, a database, and a table with the specified TTL policy. After that, we'll upload data that the service will automatically split between the disk and object storage.

Create a Managed ClickHouse® cluster

  1. Go to the Clusters overview page.

  2. Click Create cluster in the upper-right corner of the page.

    1. Select ClickHouse®.

    2. Choose a provider and a region closest to your geographical location.

    3. Under Resources:

      • Select a preset for CPU, RAM capacity, and storage space. The default s1-c2-m8 preset will be more than enough for this scenario.

      • Select 1 replica and 1 shard.

    4. Under Basic settings:

      • Enter the cluster Name: clickhouse-for-s3.

      • Keep the Version as is - this is the latest stable version of ClickHouse®.

    5. Under Networking, in the VPC field, specify the DoubleCloud VPC to locate your cluster in. Use the default value in the previously selected region if you don't need to create this cluster in a specific network.

    6. Click Submit.

Your cluster will appear with the Creating status on the Clusters page in the console. Setting everything up may take some time. When the cluster is ready, it changes its state to Alive.

Install clickhouse-client

Use one of the following ways to install the CLI client:

Docker

  1. Open your terminal.

  2. (Optional) Start Docker if needed:

    service docker start
    
  3. Pull the clickhouse-client Docker image:

    docker pull clickhouse/clickhouse-client
    
DEB-based Linux distributions

  1. Open your terminal.

  2. Add the official ClickHouse® DEB repository to your Linux system:

    sudo apt update && sudo apt install -y apt-transport-https ca-certificates dirmngr && \
    sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 8919F6BD2B48D754 && \
    echo "deb https://packages.clickhouse.com/deb stable main" | sudo tee \
    /etc/apt/sources.list.d/clickhouse.list
    
  3. Refresh the package list and install clickhouse-client:

    sudo apt update && sudo apt install -y clickhouse-client
    
RPM-based Linux distributions

  1. Open your terminal.

  2. Add the official ClickHouse® RPM repository to your Linux system:

    sudo yum install -y yum-utils
    sudo yum-config-manager --add-repo https://packages.clickhouse.com/rpm/clickhouse.repo
    
  3. Install clickhouse-client:

    sudo yum install -y clickhouse-client
    

Warning

If you run a RedHat 7-based Linux distribution, such as CentOS 7 or Oracle Linux 7, you need to download and install trusted certificates and manually add the path to them in the clickhouse-client configuration file as follows:

  1. Install the root certificate:

    curl https://letsencrypt.org/certs/isrg-root-x2-cross-signed.pem | sudo tee \
    /etc/pki/ca-trust/source/anchors/isrg-root-x2-cross-signed.pem
    
  2. Install the intermediate certificate:

    curl https://letsencrypt.org/certs/lets-encrypt-r3-cross-signed.pem | sudo tee \
    /etc/pki/ca-trust/source/anchors/lets-encrypt-r3-cross-signed.pem
    
  3. Update the list of trusted certificates:

    sudo update-ca-trust
    
  4. Locate your clickhouse-client configuration file (by default, you can find it at /etc/clickhouse-client/config.xml) and add the path to the certificates into the <openSSL> section:

    <client> <!-- Used for connection to server's secure tcp port -->
       <loadDefaultCAFile>true</loadDefaultCAFile>
       <cacheSessions>true</cacheSessions>
       <disableProtocols>sslv2,sslv3</disableProtocols>
       <preferServerCiphers>true</preferServerCiphers>
       <caConfig>/etc/ssl/certs/ca-bundle.crt</caConfig>
       <!-- Use for self-signed: <verificationMode>none</verificationMode> -->
       <invalidCertificateHandler>
          <!-- Use for self-signed: <name>AcceptCertificateHandler</name> -->
          <name>RejectCertificateHandler</name>
       </invalidCertificateHandler>
    </client>
    

Connect to the cluster

  1. Select Clusters from the list of services on the left.

  2. Select the name of your cluster to open its information page. By default, you will see the Overview tab.

  3. Under Connection strings, find the Native interface string and click Copy.

  4. Open your Linux terminal and run a command to connect to your cluster:

    docker run --network host --rm -it \
    clickhouse/<Native interface connection string>
    
    The complete Docker command structure
    docker run --network host --rm -it \
                clickhouse/clickhouse-client \
                --host <FQDN of your cluster> \
                --secure \
                --user admin \
                --password <Cluster user password> \
                --port 9440
    
    Alternatively, if you installed clickhouse-client natively from the DEB or RPM repository, paste the Native interface connection string you copied earlier into your terminal and run it.
  5. Once connected, check that your cluster has all the required storage policies. Execute the following query:

    Get the available storage policies (see the policy_name column):

    SELECT * FROM system.storage_policies
    

    The output should display the policies for the default and object_storage disks: default, local (an alias for default), object_storage, and hybrid_storage, which combines both disks:

    
    ┌─policy_name────┬─volume_name────┬─volume_priority─┬─disks──────────────┬─────┬─prefer_not_to_merge─┐
    │ default        │ default        │               1 │ ['default']        │.....│                   0 │
    │ hybrid_storage │ default        │               1 │ ['default']        │.....│                   0 │
    │ hybrid_storage │ object_storage │               2 │ ['object_storage'] │.....│                   1 │
    │ local          │ default        │               1 │ ['default']        │.....│                   0 │
    │ object_storage │ object_storage │               1 │ ['object_storage'] │.....│                   1 │
    └────────────────┴────────────────┴─────────────────┴────────────────────┴─────┴─────────────────────┘
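You can also list the disks behind these policies. The following query is a sketch against the standard system.disks table; the exact column set may vary with your ClickHouse® version:

```sql
-- List the disks available to the cluster. On a hybrid-storage cluster
-- you should see the local 'default' disk and the S3-backed
-- 'object_storage' disk.
SELECT name, path, type FROM system.disks
```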
    
    

Create a database and a table with a non-default storage policy

When you've confirmed that all the required policies are present, you can proceed to creating a database.

  1. Type the following command to create a database:

    CREATE DATABASE IF NOT EXISTS "db_hits"
    
  2. You have the database, and now it's time to create a table for the data you'll upload later. You'll need to define a TTL expression, which sets the lifetime of a table row. In this case, the interval is the number of days from the last date in the dataset (2016-12-31) to the current date. Rows dated after 2016-12-31 fall within this interval and stay on the local disk; rows dated on or before it are moved to object storage.

    Send the following query to create a table that will automatically split data between the storage tiers:

    CREATE TABLE db_hits.hybrid_storage_table ON CLUSTER 'default' \
       (Hit_ID Int32, Date Date, Time_Spent Float32, Cookie_Enabled Int32, Region_ID Int32, Gender String, Browser String, Traffic_Source String, Technology String) \
       ENGINE = ReplicatedMergeTree \
       PARTITION BY Date \
       ORDER BY (Hit_ID, Date) \
       TTL Date + toIntervalDay(dateDiff('day', toDate('2016-12-31'), now())) TO DISK 'object_storage' \
       SETTINGS storage_policy = 'hybrid_storage'
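
    To see how this TTL expression separates the data, you can evaluate it for sample dates. This is a sketch; the boundary date 2016-12-31 comes from the test dataset used in this scenario:

```sql
-- For a row dated 2016-06-01, the TTL timestamp is already in the past,
-- so ClickHouse® moves the row to the object_storage disk:
SELECT toDate('2016-06-01')
       + toIntervalDay(dateDiff('day', toDate('2016-12-31'), now())) <= now();

-- For a row dated 2017-01-06, the TTL timestamp is still in the future,
-- so the row stays on the default (local) disk:
SELECT toDate('2017-01-06')
       + toIntervalDay(dateDiff('day', toDate('2016-12-31'), now())) <= now();
```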
    

    Warning

    The TTL expression above is tailored to the selected test dataset: it splits fixed, historical data into parts for placement at different storage tiers. For most tables, which constantly receive new data, a simpler TTL expression will suffice.

    For example, to move the data older than 5 days to the object storage, execute the following query:

    TTL Date + INTERVAL 5 DAY TO DISK 'object_storage'
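
    If the table already exists, you don't need to recreate it to change the policy: ClickHouse® supports replacing a table's TTL expression with ALTER TABLE. A sketch for the table created above:

```sql
-- Replace the dataset-specific TTL with a rolling 5-day window:
ALTER TABLE db_hits.hybrid_storage_table ON CLUSTER 'default'
   MODIFY TTL Date + INTERVAL 5 DAY TO DISK 'object_storage';
```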
    

Upload data to your cluster

Docker

  1. Open another terminal instance, or exit the clickhouse-client session by typing the exit command.

  2. In a separate terminal, run the following command. It fetches the sample data from our S3 bucket and uploads it to your table with an INSERT query:

    curl https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/hits_sample.csv \
    | docker run --network host --rm -i \
    clickhouse/<Native interface connection string> \
    --query="INSERT INTO db_hits.hybrid_storage_table FORMAT CSVWithNames" \
    --format_csv_delimiter=";"
    
    The complete Docker command structure
    curl https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/hits_sample.csv \
    | docker run --network host --rm -i \
       clickhouse/clickhouse-client \
       --host <FQDN of your cluster> \
       --port 9440 --secure \
       --user admin \
       --password <your cluster password> \
       --query="INSERT INTO db_hits.hybrid_storage_table FORMAT CSVWithNames" \
       --format_csv_delimiter=";"
    
Native clickhouse-client

  1. Open another terminal instance, or exit the clickhouse-client session by typing the exit command.

  2. Run the following command. It fetches the sample data from our S3 bucket and uploads it to your table with an INSERT query:

    curl https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/hits_sample.csv \
    | <Native interface connection string> \
    --query="INSERT INTO db_hits.hybrid_storage_table FORMAT CSVWithNames" \
    --format_csv_delimiter=";"
    

If the upload was successful, run a query that lists the partitions, their names, and the disks where these partitions are located:

Docker

  1. Open another terminal instance, or exit the clickhouse-client session by typing the exit command.

  2. In a separate terminal, run the following query. It lists the partitions of your table and the disks they're located on:

    docker run --network host --rm -i \
    clickhouse/<Native interface connection string> \
    --query="SELECT partition, name, disk_name FROM system.parts WHERE table = 'hybrid_storage_table'"

    The complete Docker command structure
    docker run --network host --rm -i \
       clickhouse/clickhouse-client \
       --host <FQDN of your cluster> \
       --port 9440 --secure \
       --user admin \
       --password <your cluster password> \
       --query="SELECT partition, name, disk_name FROM system.parts WHERE table = 'hybrid_storage_table'"
Native clickhouse-client

  1. Open another terminal instance, or exit the current clickhouse-client session with the exit command.

  2. Run the following query. It lists the partitions of your table and the disks they're located on:

    <Native interface connection string> \
    --query="SELECT partition, name, disk_name FROM system.parts WHERE table = 'hybrid_storage_table'"
    

If you configured your table and uploaded data correctly, the terminal output will show that your data is divided between the local disk and object storage. A fragment of this output looks as follows:

┌─partition──┬─name───────────┬─disk_name──────┐
│ 2016-01-10 │ 20160110_0_0_0 │ object_storage │
│ 2016-01-30 │ 20160130_0_0_0 │ object_storage │
│ .......... │ .............. │ .............. │
│ 2016-12-25 │ 20161225_0_0_0 │ object_storage │
│ 2016-12-31 │ 20161231_0_0_0 │ object_storage │
│ 2017-01-06 │ 20170106_0_0_0 │ default        │
│ 2017-01-12 │ 20170112_0_0_0 │ default        │
│ 2017-01-18 │ 20170118_0_0_0 │ default        │
│ .......... │ .............. │ .............. │
│ 2017-09-09 │ 20170909_0_0_0 │ default        │
└────────────┴────────────────┴────────────────┘
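
To get a per-disk summary instead of the per-part listing above, you can aggregate over the standard system.parts table. This is a sketch; only active parts are counted:

```sql
-- Total compressed size and number of active parts per disk:
SELECT disk_name,
       count() AS parts,
       formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE table = 'hybrid_storage_table' AND active
GROUP BY disk_name
```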

See also: