Getting started with Transfer

Transfer is a tool that allows you to copy or replicate data between databases and stream processing services by creating endpoints and configuring transfers between them.

In this tutorial, you'll transfer a Parquet file from an Amazon S3 bucket to a Managed ClickHouse® cluster.

To see how Transfer works with CSV files, refer to The complete end-to-end DoubleCloud services guide.

Prepare a Managed ClickHouse® cluster

A Managed ClickHouse® cluster consists of one or more database hosts.

First, we need to create a cluster and prepare it for data upload.

Create a Managed ClickHouse® cluster

  1. Go to the Clusters overview page in the console.

  2. Click Create cluster in the upper-right corner of the page.

    1. Select ClickHouse®.

    2. Choose a provider and a region closest to your location.

    3. Under Resources:

      • Select the s1-c2-m4 preset for CPU, RAM capacity, and storage space to create a cluster with minimal configuration.

      • Choose the number of replicas. For this tutorial, keep the default single replica.

      • Select the number of shards. Keep a single shard.

    4. Under Basic settings:

      • Enter the cluster Name; in this tutorial, it's first-cluster.

      • From the Version drop-down list, select the ClickHouse® version the Managed ClickHouse® cluster will use. For most clusters, we recommend using the latest version.

    5. Under Networking → VPC, specify in which DoubleCloud VPC to locate your cluster. Use the default value in the previously selected region if you don't need to create this cluster in a specific network.

    6. Click Submit.

Your cluster will appear with the Creating status on the Clusters page in the console. Setting everything up may take some time. When the cluster is ready, it changes its state to Alive.

Click on the cluster and you'll see the following page:

cluster-created

Tip

The DoubleCloud service creates the superuser admin and its password automatically. You can find both the User and the Password in the Overview tab on the cluster information page.

To create users for other roles, see Manage ClickHouse® users.
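
For instance, once connected as admin, a read-only user can be created with standard ClickHouse® SQL (the user name and password below are illustrative, not part of this tutorial):

```sql
-- Hypothetical read-only user; pick your own name and password
CREATE USER reader IDENTIFIED BY 'change_this_password';
GRANT SELECT ON default.* TO reader;
```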

Connect to the cluster and create a database

Tip

This tutorial shows how to use a CLI client with Docker and with clickhouse-client on DEB-based and RPM-based Linux distributions. You can also use other tools of your choice.

For more connection options, see Connect to a ClickHouse® database.

  1. Install the software:

    Docker

    1. Open your terminal.

    2. (Optional) Start Docker if needed:

      sudo service docker start
      
    3. Pull the clickhouse-client Docker image:

      docker pull clickhouse/clickhouse-client
      
    DEB-based distributions

    1. Open your terminal.

    2. Add the official ClickHouse® DEB repository to your Linux system:

      sudo apt update && sudo apt install -y apt-transport-https ca-certificates dirmngr && \
      sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 8919F6BD2B48D754 && \
      echo "deb https://packages.clickhouse.com/deb stable main" | sudo tee \
      /etc/apt/sources.list.d/clickhouse.list
      
    3. Refresh the package list and install clickhouse-client:

      sudo apt update && sudo apt install -y clickhouse-client
      
    RPM-based distributions

    1. Open your terminal.

    2. Add the official ClickHouse® RPM repository to your Linux system:

      sudo yum install -y yum-utils
      sudo yum-config-manager --add-repo https://packages.clickhouse.com/rpm/clickhouse.repo
      
    3. Install clickhouse-client:

      sudo yum install -y clickhouse-client
      

    Warning

    If you run a Red Hat 7-based Linux distribution, such as CentOS 7 or Oracle Linux 7, you need to download and install trusted certificates and manually add the path to them in the clickhouse-client configuration file as follows:

    1. Install the root certificate:

      curl https://letsencrypt.org/certs/isrg-root-x2-cross-signed.pem | sudo tee \
      /etc/pki/ca-trust/source/anchors/isrg-root-x2-cross-signed.pem
      
    2. Install the intermediate certificate:

      curl https://letsencrypt.org/certs/lets-encrypt-r3-cross-signed.pem | sudo tee \
      /etc/pki/ca-trust/source/anchors/lets-encrypt-r3-cross-signed.pem
      
    3. Update the list of trusted certificates:

      sudo update-ca-trust
      
    4. Locate your clickhouse-client configuration file (by default, you can find it at /etc/clickhouse-client/config.xml) and add the path to the certificates into the <openSSL> section:

      <client> <!-- Used for connection to server's secure tcp port -->
         <loadDefaultCAFile>true</loadDefaultCAFile>
         <cacheSessions>true</cacheSessions>
         <disableProtocols>sslv2,sslv3</disableProtocols>
         <preferServerCiphers>true</preferServerCiphers>
         <caConfig>/etc/ssl/certs/ca-bundle.crt</caConfig>
         <!-- Use for self-signed: <verificationMode>none</verificationMode> -->
         <invalidCertificateHandler>
         <!-- Use for self-signed: <name>AcceptCertificateHandler</name> -->
         <name>RejectCertificateHandler</name>
         </invalidCertificateHandler>
      </client>
      

    Now you have a CLI tool to manage your Managed ClickHouse® cluster. Let's connect to it:

  2. Go to the Clusters overview page in the console.

  3. Select the name of your cluster to open its information page. By default, you will see the Overview tab.

  4. Under Connection strings, find the Native interface string and click Copy.

  5. Connect to your ClickHouse® cluster:

    • Docker

      docker run --network host --rm -it \
      clickhouse/<Native interface connection string>

      The complete Docker command structure:

      docker run --network host --rm -it \
                  clickhouse/clickhouse-client \
                  --host <FQDN of your cluster> \
                  --secure \
                  --user admin \
                  --password <Cluster user password> \
                  --port 9440

    • Native clickhouse-client

      <Native interface connection string>

  6. After you have connected to the cluster, run the CREATE query:

    CREATE DATABASE IF NOT EXISTS db_for_s3
    
  7. Send the following query to make sure that your database was successfully created:

    SHOW DATABASES
    

    The terminal readout should be the following:

    ┌─name───────────────┐
    │ INFORMATION_SCHEMA │
    │ _system            │
    │ db_for_s3          │
    │ default            │
    │ information_schema │
    │ system             │
    └────────────────────┘
    
  8. Keep this terminal open: we'll need it later.
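
If you'd rather not keep an interactive session open, clickhouse-client can also execute a single query and exit. A sketch using the same connection parameters (substitute your own host and password):

```shell
clickhouse-client \
    --host <FQDN of your cluster> \
    --secure \
    --port 9440 \
    --user admin \
    --password <Cluster user password> \
    --query "SHOW DATABASES"
```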

At this stage, go to the console and configure Transfer.

Transfer data from S3

Now it's time to set up the tools that will get the data from a remote source and transfer it to your db_for_s3 ClickHouse® database.

Create a source endpoint

This is your data fetcher. It will connect to a remote source and send the data to the target endpoint.

  1. In the list of services, select Transfer.

  2. Select the Endpoints tab, click Create endpoint and select Source.

    create-source-endpoint

  3. Select S3 as the Source type.

  4. Under Basic settings:

    1. Enter the Name of the endpoint: first-s3-endpoint.

    2. (optional) Enter a Description of the endpoint.

  5. Specify endpoint parameters under Endpoint settings:

    1. Specify the Dataset: hits.

    2. Provide the Path pattern: data-sets/hits.parquet.

    3. Auto-infer the Schema by leaving the default value - {}.

    Your filled settings should be the following:

    endpoint-settings

  6. Select the data format - parquet.

  7. Under parquet, set the Buffer Size to 10000. Keep the rest of the fields at their default values.

  8. Under S3: Amazon Web Services, enter the name of the Bucket: doublecloud-docs. As the bucket is public, leave the rest of the fields blank.

    This is how the rest of your configuration should look:

    format-specific-settings

  9. Click Submit.
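
A note on the Path pattern from step 5: S3 sources of this kind typically accept glob-style wildcards in addition to literal paths (an assumption here; check the endpoint reference for the exact syntax). A minimal Python sketch of how such matching selects objects:

```python
from fnmatch import fnmatch

# Hypothetical object keys in a bucket; a source endpoint only picks up
# keys that match the configured Path pattern.
keys = [
    "data-sets/hits.parquet",
    "data-sets/visits.parquet",
    "logs/2016/app.log",
]

# The literal pattern from this tutorial matches exactly one object:
print([k for k in keys if fnmatch(k, "data-sets/hits.parquet")])

# A wildcard pattern would match every parquet file under data-sets/:
print([k for k in keys if fnmatch(k, "data-sets/*.parquet")])
```

With the literal pattern, only data-sets/hits.parquet is selected; the wildcard version would also pick up data-sets/visits.parquet.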

The transmitter is ready to go. Now we need to create an endpoint that will receive the data from a remote source.
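
As for the Buffer Size set in step 7: conceptually, it caps how many rows accumulate in memory before a batch is flushed to the target. The exact mechanics are internal to Transfer, but the batching logic can be sketched as:

```python
def batches(rows, buffer_size=10000):
    """Yield rows in lists of at most buffer_size items."""
    buf = []
    for row in rows:
        buf.append(row)
        if len(buf) == buffer_size:
            yield buf      # flush a full batch
            buf = []
    if buf:
        yield buf          # flush the remainder

# 25,000 rows with a 10,000-row buffer produce three batches:
sizes = [len(b) for b in batches(range(25000))]
print(sizes)  # [10000, 10000, 5000]
```

A larger buffer means fewer, bigger writes to ClickHouse® (which generally favors large inserts) at the cost of more memory per batch.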

Create a target endpoint

This is your receiver. It will acquire the data sent by the source endpoint and write it to the database on your Managed ClickHouse® cluster.

  1. In the list of services, select Transfer.

  2. Select the Endpoints tab, click Create endpoint and select Target.

    target-endpoint-create

  3. Select ClickHouse® as the Target type.

  4. Under Basic settings:

    1. Enter the Name of the endpoint: clickhouse-target-endpoint

    2. (optional) Enter a Description of the endpoint.

  5. Specify endpoint parameters under Endpoint settings:

    1. Select connection type - Managed cluster.

    2. Specify the connection properties:

      • Under Managed cluster, select your cluster name (first-cluster) from the drop-down list.

      • Specify the User of the database: admin.

      • Enter the Password of the database user.

      • Specify the Database name you want to transfer the data to: db_for_s3.

    3. Under Cleanup policy, select Drop.

  6. Leave all the other fields blank or with their default values.

    You should have the following configuration:

    target-endpoint-configuration

  7. Click Submit.

Good work. Now we've created an endpoint that will receive and write the data to your ClickHouse® database. All we need now is the tool that will connect both endpoints and transfer the data.

Create and activate a transfer

This is a tool that activates the transfer process through the data pipeline. It will connect your endpoints and ensure the integrity of the data.

  1. In the list of services, select Transfer.

  2. Click Create transfer.

  3. Under Endpoints:

    1. Select first-s3-endpoint from the Source drop-down menu.

    2. Select clickhouse-target-endpoint from the Target.

  4. Under Basic settings:

    1. Enter the transfer Name: transfer-quickstart

    2. (optional) Enter the transfer Description.

  5. Under Transfer settings → Transfer type, make sure that the default Snapshot type is selected.

  6. Leave all the other fields blank or with their default values.

  7. Click Submit. You will see the following line on your Transfer page:

    transfer-ready

  8. After you've created a transfer, click Activate.

  9. Wait until your transfer status changes to Done.

Query the data in the Managed ClickHouse® cluster

Check the data transferred to your ClickHouse® database:

  1. Open the terminal you used to create the database, or connect to your cluster again as shown above:

    • Docker

      docker run --network host --rm -it \
      clickhouse/<Native interface connection string>

      The complete Docker command structure:

      docker run --network host --rm -it \
         clickhouse/clickhouse-client \
         --host <FQDN of your cluster> \
         --port 9440 --secure \
         --user admin \
         --password <your cluster password>
      
    • Native clickhouse-client

      <Native interface connection string>
      
  2. Send the following query to check that your data exists in the cluster. The name of the table in db_for_s3 corresponds to the name of the source dataset (hits).

    SELECT * FROM db_for_s3.hits
    

    The terminal readout should display rows like the following:

    ┌─Browser─┬─Cookie_Enabled─┬─Date───────┬─Gender─┬─Hit_ID─┬─Region_ID─┬─Technology───────────┬─Time_Spent─────────┬─Traffic_Source─┐
    │ Firefox │              0 │ 2016-01-10 │ Male   │  67112 │       184 │ PC (Mac)             │ 388.93975903614455 │ Direct         │
    │ Chrome  │              0 │ 2016-03-12 │ Female │  54562 │        51 │ Smartphone (I0S)     │ 325.20392156862744 │ Search engine  │ 
    │ Chrome  │              1 │ 2016-03-18 │ Female │  63833 │        51 │ Smartphone (I0S)     │ 316.09774436090225 │ Search engine  │ 
    │ Firefox │              1 │ 2016-03-24 │ Male   │  43941 │        51 │ PC (Windows)         │  263.7365269461078 │ Search engine  │ 
    │ Safari  │              0 │ 2016-03-30 │ Female │  38583 │        51 │ Smartphone (Android) │  363.8421052631579 │ Internal       │
    

Nice work! You now have all the data from the remote source replicated, with complete integrity, in your own ClickHouse® database. Now let's put this data to work.
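
For example, any SQL now works against the new table, such as an aggregate over the transferred rows (column names taken from the readout above):

```sql
SELECT
    Browser,
    count() AS hits,
    avg(Time_Spent) AS avg_time_spent
FROM db_for_s3.hits
GROUP BY Browser
ORDER BY hits DESC
```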

See also