Getting started with Transfer

Transfer is a tool that allows you to copy or replicate data between databases and stream processing services by creating endpoints and configuring transfers between them.

This tutorial transfers a parquet file from an Amazon S3 bucket to a Managed ClickHouse® cluster.

To see how the Transfer works with CSV files, refer to The complete end-to-end DoubleCloud services guide.

Prepare a Managed ClickHouse® cluster

Managed ClickHouse® clusters are one or more database hosts.

First of all, we need to create a cluster and prepare it for data upload.

Create a Managed ClickHouse® cluster

  1. Go to the Clusters overview page in the console.

  2. Click Create cluster in the upper-right corner of the page.

    1. Select ClickHouse®.

    2. Choose a provider and a region closest to your location.

    3. Under Resources:

      • Select the s2-c2-m4 preset for CPU, RAM capacity, and storage space to create a cluster with minimal configuration.

      • Choose a number of replicas. Let's keep it as is with a single replica.

      • Select a number of shards. Keep a single shard.

    4. Under Basic settings:

      • Enter the cluster Name, in this tutorial - first-cluster.

      • From the Version drop-down list, select the ClickHouse® version the Managed ClickHouse® cluster will use. For most clusters, we recommend using the latest version.

    5. Under NetworkingVPC, specify in which DoubleCloud VPC to locate your cluster. Use the default value in the previously selected region if you don't need to create this cluster in a specific network.

    6. Click Submit.

Your cluster will appear with the Creating status on the Clusters page in the console. Setting everything up may take some time. When the cluster is ready, it changes its state to Alive.

Click the cluster name and you'll see the following page:

cluster-created

Tip

The DoubleCloud service creates the superuser admin and its password automatically. You can find both the User and the Password in the Overview tab on the cluster information page.

To create users for other roles, see Manage ClickHouse® users

To create a ClickHouse® cluster, use the ClusterService create method with the following parameters:

  • project_id - the ID of your project. You can get this value on your project's information page.

  • cloud_type - aws or gcp.

  • region_id - for the purpose of this quickstart, specify eu-central-1 for AWS or europe-west3 for GCP.

  • name - first-cluster.

  • resources - specify the following from the doublecloud.clickhouse.v1.ClusterResources model:

    • resource_preset_id - s2-c2-m4.

    • disk_size - let's go with the default 34359738368 bytes (32 GB).

    • replica_count - 1.

    • shard_count - 1.

Connect to the cluster and create a database

Tip

This tutorial shows how to use a CLI client with Docker and with clickhouse-client on DEB-based and RPM-based Linux distributions. You can also use other tools of your choice.

For more connection options, see Connect to a ClickHouse® database.

  1. Install the software:

    1. Open your terminal.

    2. (Optional) Start Docker if needed:

      sudo service docker start
      
    3. Pull the clickhouse-client Docker image:

      docker pull clickhouse/clickhouse-client
      
    1. Open your terminal.

    2. Connect to the ClickHouse® official DEB repository from your Linux system:

      sudo apt update && sudo apt install -y apt-transport-https ca-certificates dirmngr && \
      sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 8919F6BD2B48D754 && \
      echo "deb https://packages.clickhouse.com/deb stable main" | sudo tee \
      /etc/apt/sources.list.d/clickhouse.list
      
    3. Refresh the package list and install the clickhouse-client :

      sudo apt update && sudo apt install -y clickhouse-client
      
    1. Open your terminal.

    2. Connect to a ClickHouse® official RPM repository from your Linux system:

      sudo yum install -y yum-utils
      sudo yum-config-manager --add-repo https://packages.clickhouse.com/rpm/clickhouse.repo
      
    3. Install the clickhouse-client :

      sudo yum install -y clickhouse-client
      

    Warning

    If you run a RedHat 7-based Linux distribution, including Cent OS 7, Oracle Linux 7 and others, you need to download and install trusted certificates and manually add the path to them in the clickhouse-client configuration file as follows:

    1. Install the root certificate:

      curl https://letsencrypt.org/certs/isrg-root-x2-cross-signed.pem > \ 
      /etc/pki/ca-trust/source/anchors/isrg-root-x2-cross-signed.pem
      
    2. Install the intermediate certificate:

      curl https://letsencrypt.org/certs/lets-encrypt-r3-cross-signed.pem > \
      /etc/pki/ca-trust/source/anchors/lets-encrypt-r3-cross-signed.pem
      
    3. Update the list of trusted certificates:

      sudo update-ca-trust
      
    4. Locate your clickhouse-client configuration file (by default, you can find it at /etc/clickhouse-client/config.xml) and add the path to the certificates into the <openSSL> section:

      <client> <!-- Used for connection to server's secure tcp port -->
         <loadDefaultCAFile>true</loadDefaultCAFile>
         <cacheSessions>true</cacheSessions>
         <disableProtocols>sslv2,sslv3</disableProtocols>
         <preferServerCiphers>true</preferServerCiphers>
         <caConfig>/etc/ssl/certs/ca-bundle.crt</caConfig>
         <!-- Use for self-signed: <verificationMode>none</verificationMode> -->
         <invalidCertificateHandler>
         <!-- Use for self-signed: <name>AcceptCertificateHandler</name> -->
         <name>RejectCertificateHandler</name>
         </invalidCertificateHandler>
      </client>
      

    Now you have a CLI tool to manage your Managed ClickHouse® cluster. Let's connect to it:

  2. Go to the Clusters overview page in the console.

  3. Select the name of your cluster to open its information page. By default, you'll see the Overview tab.

  4. Under Connection strings, find the Native interface string and click Copy.

  5. Connect to your ClickHouse® cluster:

    docker run --network host --rm -it clickhouse/<Native interface connection string>
    
    The complete Docker command structure
    docker run --network host --rm -it \ 
                clickhouse/clickhouse-client \
                --host <FQDN of your cluster> \
                --secure \
                --user <cluster user name> \
                --password <cluster user password> \
                --port 9440 
    
    <Native interface connection string>
    
  6. After you have connected to the cluster, run the CREATE query:

    CREATE DATABASE IF NOT EXISTS db_for_s3
    
  7. Send the following query to make sure that your database is successfully created:

    SHOW DATABASES
    

    The terminal readout should be the following:

    ┌─name───────────────┐
    │ INFORMATION_SCHEMA │
    │ _system            │
    │ db_for_s3          │
    │ default            │
    │ information_schema │
    │ system             │
    └────────────────────┘
    
  8. Keep this terminal opened - we'll need it later.

At this stage, go to the console and configure Transfer.

Transfer data from S3

Now it's time to set up the tools that will get the data from a remote source and transfer it to your db_for_s3 ClickHouse® database.

Create a source endpoint

This is your data fetcher. It will connect to a remote source and send the data to the target endpoint.

  1. Open the Transfer page in the console.

  2. Select Endpoints tab, click Create endpoint and select Source.

    create-source-endpoint

  3. Select S3 as the Source type.

  4. Under Basic settings:

    1. Enter the Name of the endpoint: first-s3-endpoint.

    2. (optional) Enter a Description of the endpoint.

  5. Specify endpoint parameters under Endpoint settings:

    1. Specify the Dataset: hits.

    2. Provide the Path pattern: data-sets/hits.parquet.

    3. Auto-infer the Schema by leaving the default value - {}.

    Your filled settings should be the following:

    endpoint-settings

  6. Select the data format - parquet.

  7. Under parquet, specify the Buffer Size - 10000. Keep the rest of the fields with their default values.

  8. Under S3: Amazon Web Services, enter the name of the Bucket: doublecloud-docs. As the bucket is public, leave the rest of the fields blank.

    This is how the rest of your configuration should look:

    format-specific-settings

  9. Optionally, you can test your source endpoint:

    After configuring your endpoint, click Test. You'll see an endpoint test dialog:

    endpoint-test

    You can use two runtime types for connection testing - Dedicated and Serverless.

    Runtime compatibility warning

    Don't use endpoints with different runtime types in the same transfer - this will cause it to fail.

    Dedicated

    The Transfer service uses this runtime to connect to your data source via an internal or external network.

    This runtime is useful when you need to use a specific network - it may be an external network or an internal one with a peer connection.

    To run the test with this runtime:

    1. Under Runtime, select Dedicated.

    2. From the drop-down list, select the network to use to connect to your data source.

    3. Click Test connection.

      After completing the test procedure, you'll see a list of the endpoint's data sources. For Apache Kafka® endpoints, you'll also see data samples for each data source.

    Serverless

    The Transfer service uses this runtime to connect to your data sources available from the internet via an automatically defined network.

    Use this runtime to test an endpoint to a data source located outside isolated networks.

    To run the test with this runtime:

    1. Under Runtime, select Serverless.

    2. Click Test.

      After completing the test procedure, you'll see a list of the endpoint's data sources. For Apache Kafka® endpoints, you'll also see data samples for each data source.

    Warning

    Please be patient - testing may take up to several minutes.

  10. Click Submit.

To create a source endpoint, let's use the EndpointService create method and pass the following parameters:

  • project_id - The ID of the project in which you want to create your endpoint. You can get this value on your project's information page.

  • name - the name of your new source endpoint, first-s3-endpoint.

  • settings - the type of source endpoint. For this quickstart, use the s3_source, as reflected in the S3Source endpoint model:

    • dataset - hits, this is the name of our dataset at the S3 source.

    • path_pattern - data-sets/hits.parquet, this is the path to our dataset at the S3 source.

    • schema - for this quickstart we don't need to define a schema, use the default value {}.

    • format - parquet, this is the dataset's format. We need to pass an additional parameter for this data type:

      • buffer_size - 10000, this is the size of the reading buffer for the dataset.
    • provider - bucket with the value doublecloud-docs. As the source bucket is public, you don't need to provide any other parameters from the doublecloud.transfer.v1.endpoint.airbyte.S3Source.Provider model.

The transmitter is ready to go. Now we need to create an endpoint that will receive the data from a remote source.

Create a target endpoint

This is your receiver. It will acquire the data sent by the source endpoint and write it to the database on your Managed ClickHouse® cluster.

  1. In the list of services, select Transfer.

  2. Select Endpoints tab, click Create endpoint and select Target.

    traget-endpoint-create

  3. Select ClickHouse® as the Target type.

  4. Under Basic settings:

    1. Enter the Name of the endpoint: clickhouse-target-endpoint

    2. (optional) Enter a Description of the endpoint.

  5. Specify endpoint parameters under Endpoint settings:

    1. Select connection type - Managed cluster.

    2. Specify the connection properties:

      • Under Managed cluster, select your cluster name (first-cluster) from the drop-down list.

      • Specify the User of the database: admin.

      • Enter the Password of the database user.

      • Specify the Database name you want to transfer the data to: db_for_s3.

    3. Under Cleanup policy, select Drop.

  6. Leave all the other fields blank or with their default values.

    You should have the following configuration:

    traget-endpoint-configuration

  7. Click Submit.

To create a target endpoint, let's use the EndpointService create method and pass the following parameters:

  • project_id - The ID of the project in which you want to create your endpoint. You can get this value on your project's information page.

  • name - the name of your new target endpoint, clickhouse-target-endpoint.

  • settings - the type of target endpoint. For this quickstart, use the clickhouse_target, as reflected in the clickhouse_target endpoint model:

    • connection.connection_options - your endpoint's connection settings:

      • address - mdb_cluster_id in this quickstart, we'll connect to a managed ClickHouse® cluster, so we need to pass the cluster ID value.

        To get your cluster's ID, get a list of clusters in the project.

      • database - db_for_s3, this is your ClickHouse® database.

      • user - admin, your ClickHouse® cluster username.

      • password.value.raw - your ClickHouse® cluster password.

  • clickhouse_cluster_name - first-cluster, this is your cluster's name.

  • cleanup_policy - CLICKHOUSE_CLEANUP_POLICY_DROP.

Good work. Now we've created an endpoint that will receive and write the data to your ClickHouse® database. All we need now is the tool that will connect both endpoints and transfer the data.

Create and activate a transfer

This is a tool that activates the transfer process through the data pipeline. It will connect your endpoints and ensure the integrity of the data.

  1. In the list of services, select Transfer.

  2. Click Create transfer.

  3. Under Endpoints:

    1. Select first-s3-endpoint from the Source drop-down menu.

    2. Select clickhouse-target-endpoint from the Target.

  4. Under Basic settings:

    1. Enter the transfer Name: transfer-quickstart

    2. (optional) Enter the transfer Description.

  5. Under Transfer settingsTransfer type, make sure you select the default Snapshot type.

  6. Leave all the other fields blank or with their default values.

  7. Click Submit. You will see the following line on your Transfer page:

    transfer-ready

  8. After you've created a transfer, click Activate.

  9. Wait until your transfer status changes to Done.

  1. Let's create a transfer using the TrasnferService create method with the following parameters:

    • source_id - the endpoint ID for the source endpoint.

      To find the endpoint ID, get a list of endpoints in the project.

    • target_id - the endpoint ID for the target endpoint.

    • name - the transfer name, transfer-quickstart.

    • project_id - the ID of the project in which you create a transfer. You can get this value on your project's information page.

    • type - SNAPSHOT_ONLY.

  2. Now, let's activate it using the TransferService activate method and pass the transfer ID in the transfer_id request parameter.

    To find the transfer ID, get a list of transfers in the project.

Query the data in the Managed ClickHouse® cluster

Check the data transferred to your ClickHouse® database:

  1. Open the terminal you used to create the database or connect to your cluster once again as we did above:

    docker run --network host --rm -it clickhouse/<Native interface connection string>
    
    The complete Docker command structure
    docker run --network host --rm -it \ 
                clickhouse/clickhouse-client \
                --host <FQDN of your cluster> \
                --secure \
                --user <cluster user name> \
                --password <cluster user password> \
                --port 9440 
    
    <Native interface connection string>
    
  2. Send the following query to check if your data exists in the cluster. The name of a table in the db_for_s3 corresponds to the name of the source dataset (hits).

    SELECT * FROM db_for_s3.hits
    

    The terminal readout should display the following data:

    ┌─Browser─┬─Cookie_Enabled─┬─Date───────┬─Gender─┬─Hit_ID─┬─Region_ID─┬─Technology───────────┬─Time_Spent─────────┬─Traffic_Source─┐
    │ Firefox │              0 │ 2016-01-10 │ Male   │  67112 │       184 │ PC (Mac)             │ 388.93975903614455 │ Direct         │
    │ Chrome  │              0 │ 2016-03-12 │ Female │  54562 │        51 │ Smartphone (I0S)     │ 325.20392156862744 │ Search engine  │ 
    │ Chrome  │              1 │ 2016-03-18 │ Female │  63833 │        51 │ Smartphone (I0S)     │ 316.09774436090225 │ Search engine  │ 
    │ Firefox │              1 │ 2016-03-24 │ Male   │  43941 │        51 │ PC (Windows)         │  263.7365269461078 │ Search engine  │ 
    │ Safari  │              0 │ 2016-03-30 │ Female │  38583 │        51 │ Smartphone (Android) │  363.8421052631579 │ Internal       │
    

Nice work! You have all the data transferred from a remote source and replicated with complete integrity in your own ClickHouse® database. Now let's make this data earn its worth.

See also

Previous