Getting started with Transfer

Transfer is a tool that allows you to copy or replicate data between databases and stream processing services by creating endpoints and configuring transfers between them.

This tutorial transfers a parquet file from an Amazon S3 bucket to a Managed ClickHouse® cluster.

To see how Transfer works with CSV files, refer to The complete end-to-end DoubleCloud services guide.

Prepare a Managed ClickHouse® cluster

A Managed ClickHouse® cluster consists of one or more database hosts.

First of all, we need to create a cluster and prepare it for data upload.

Create a Managed ClickHouse® cluster

  1. Go to the Clusters page in the console.

  2. Click Create cluster in the upper-right corner of the page.

    1. Select ClickHouse®.

    2. Choose a provider and a region closest to your location.

    3. Under Resources:

      • Select the s2-c2-m4 preset for CPU, RAM capacity, and storage space to create a cluster with minimal configuration.

      • Choose the number of replicas. For this tutorial, keep the default single replica.

      • Select the number of shards. Keep a single shard.

    4. Under Basic settings:

      • Enter the cluster Name. In this tutorial, it's clickhouse-dev.

      • From the Version drop-down list, select the ClickHouse® version the Managed ClickHouse® cluster will use. For most clusters, we recommend using the latest version.

    5. Under Networking → VPC, select the network where you want to create the cluster.

      If you don’t need to place the cluster in a specific network, leave the preselected default option.

    6. Click Submit.

Your cluster will appear with the Creating status on the Clusters page in the console. Setting everything up may take some time. When the cluster is ready, it changes its state to Alive.

Click the cluster name and you'll see the following page:

Screenshot of a ClickHouse® cluster page in the DoubleCloud console

Tip

DoubleCloud creates the admin superuser and its password automatically. You can find the username and password under Credentials in the Overview tab on the cluster page.

To learn how to create users for other roles, refer to Manage ClickHouse® users.

Alternatively, to create the ClickHouse® cluster via the API, use the ClusterService create method with the following parameters:

  • project_id - the ID of your project. You can get this value on your project's information page.

  • cloud_type - aws or gcp.

  • region_id - for the purpose of this quickstart, specify eu-central-1 for AWS or europe-west3 for GCP.

  • name - clickhouse-dev.

  • resources - specify the following from the doublecloud.clickhouse.v1.ClusterResources model:

    • resource_preset_id - s2-c2-m4.

    • disk_size - let's go with the default 34359738368 bytes (32 GB).

    • replica_count - 1.

    • shard_count - 1.

When the cluster is ready, its status changes from Creating to Alive.
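
For reference, here's how those parameters fit together, sketched as a Python dict that mirrors the request fields above. The commented-out call is a hypothetical placeholder for whatever API client you use against ClusterService; the exact nesting of resources is an assumption based on the doublecloud.clickhouse.v1.ClusterResources model named above.

    # Request fields for ClusterService.Create, mirroring the parameters above.
    cluster_request = {
        "project_id": "<your_project_id>",  # from your project's information page
        "cloud_type": "aws",                # or "gcp"
        "region_id": "eu-central-1",        # or "europe-west3" for GCP
        "name": "clickhouse-dev",
        "resources": {
            "resource_preset_id": "s2-c2-m4",
            "disk_size": 34359738368,       # 32 GB, the default
            "replica_count": 1,
            "shard_count": 1,
        },
    }
    # operation = client.create(**cluster_request)  # hypothetical client wrapper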

Connect to the cluster and create a database

To connect to the cluster, you can use WebSQL — a DoubleCloud service that provides a full-fledged SQL editor in your browser tab. You can also connect using the ClickHouse client or any other tool of your choice.

  1. After the cluster status has changed to Alive, select it in the cluster list.

  2. Click WebSQL at the top right.

  3. In WebSQL, click on any database in the connection manager on the left to open the query editor.

  4. Create a database with the CREATE query:

    CREATE DATABASE IF NOT EXISTS db_for_s3 ON CLUSTER default
    
  5. Make sure that the database has been created:

    SHOW DATABASES
    

    The output should be the following:

    ┌─name───────────────┐
    │ INFORMATION_SCHEMA │
    │ _system            │
    │ db_for_s3          │  // your database
    │ default            │
    │ information_schema │
    │ system             │
    └────────────────────┘
    

Transfer data from S3

Now it's time to set up the tools that will get the data from a remote source and transfer it to your db_for_s3 ClickHouse® database.

Create a source endpoint

This is your data fetcher. It will connect to a remote source and send the data to the target endpoint.

  1. Go to the Transfer page in the console.

  2. Click Create → Source endpoint.

  3. In Source type, select Object storage.

  4. Under Basic settings:

    1. Enter the Name of the endpoint: first-s3-endpoint.

    2. (optional) Enter a Description of the endpoint.

  5. Specify endpoint parameters under Endpoint settings:

    1. Specify the Dataset: hits.

    2. Provide the Path pattern: data-sets/hits.parquet.

    3. Auto-infer the Schema by leaving the default value - {}.

    Your filled settings should be the following:

    Screenshot of the endpoint settings

  6. Select the data format - parquet.

  7. Under parquet, specify the Buffer Size - 10000. Keep the rest of the fields with their default values.

  8. Under S3: Amazon Web Services, enter the name of the Bucket: doublecloud-docs. As the bucket is public, leave the rest of the fields blank.

    This is how the rest of your configuration should look:

    Screenshot of the format-specific settings

  9. Optionally, you can test your source endpoint:

    After configuring your endpoint, click Test connection. You'll see the following dialog:

    Screenshot of the endpoint testing dialog

    You can use two runtime types for connection testing - Dedicated and Serverless.

    Runtime compatibility warning

    Don't use endpoints with different runtime types in the same transfer — this will cause the transfer to fail.

    Dedicated

    Transfer uses this runtime to connect to your data source via an internal or external network.

    This runtime is useful when you need to use a specific network — it may be an external network or an internal network with a peer connection.

    To run the test with this runtime:

    1. Under Runtime, select Dedicated.

    2. In the dropdown, select the network you want to use to connect to your data source.

    3. Click Test connection.

      After completing the test procedure, you'll see a list of the endpoint's data sources. For Apache Kafka® endpoints, you'll also see data samples for each data source.

    Serverless

    Transfer uses this runtime to connect to your data sources available from the internet via an automatically defined network.

    Use this runtime to test an endpoint to a data source located outside isolated networks.

    To run the test with this runtime:

    1. Under Runtime, select Serverless.

    2. Click Test connection.

      After completing the test procedure, you'll see a list of the endpoint's data sources. For Apache Kafka® endpoints, you'll also see data samples for each data source.

    Warning

    Testing a connection may take up to several minutes.

  10. Click Submit.

To create the source endpoint via the API, use the EndpointService create method and pass the following parameters:

  • project_id - The ID of the project in which you want to create your endpoint. You can get this value on your project's information page.

  • name - the name of your new source endpoint, first-s3-endpoint.

  • settings - the type of source endpoint. For this quickstart, use the s3_source, as reflected in the S3Source endpoint model:

    • dataset - hits, this is the name of our dataset at the S3 source.

    • path_pattern - data-sets/hits.parquet, this is the path to our dataset at the S3 source.

    • schema - for this quickstart we don't need to define a schema, use the default value {}.

    • format - parquet, this is the dataset's format. We need to pass an additional parameter for this data type:

      • buffer_size - 10000, this is the size of the reading buffer for the dataset.

    • provider - bucket with the value doublecloud-docs. As the source bucket is public, you don't need to provide any other parameters from the doublecloud.transfer.v1.endpoint.airbyte.S3Source.Provider model.
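
As with the cluster, here's a sketch of these parameters as a Python dict. The nesting under settings and format is an assumption based on the models named above; the commented-out call stands in for your actual API client.

    # Request fields for EndpointService.Create (source), mirroring the parameters above.
    source_endpoint_request = {
        "project_id": "<your_project_id>",
        "name": "first-s3-endpoint",
        "settings": {
            "s3_source": {
                "dataset": "hits",
                "path_pattern": "data-sets/hits.parquet",
                "schema": "{}",                                  # auto-infer the schema
                "format": {"parquet": {"buffer_size": 10000}},
                "provider": {"bucket": "doublecloud-docs"},      # public bucket, no credentials needed
            }
        },
    }
    # operation = client.create(**source_endpoint_request)  # hypothetical client wrapper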

The transmitter is ready to go. Now we need to create an endpoint that will receive the data from a remote source.

Create a target endpoint

This is your receiver. It will acquire the data sent by the source endpoint and write it to the database on your Managed ClickHouse® cluster.

  1. In the list of services, select Transfer.

  2. Click Create → Target endpoint.

  3. In Target type, select ClickHouse.

  4. Under Basic settings:

    1. Enter the Name of the endpoint: clickhouse-target-endpoint

    2. (optional) Enter a Description of the endpoint.

  5. In Connection settings → Connection type, select Managed cluster.

  6. In Managed cluster, select clickhouse-dev from the dropdown list.

  7. In Authentication, select Default to connect to the cluster as the admin user.

  8. In Database, enter db_for_s3 — the name of the database you created earlier.

  9. In Cleanup policy, select Drop.

  10. Leave all the other fields blank or with their default values.

  11. Click Submit.

To create the target endpoint via the API, use the EndpointService create method and pass the following parameters:

  • project_id - The ID of the project in which you want to create your endpoint. You can get this value on your project's information page.

  • name - the name of your new target endpoint, clickhouse-target-endpoint.

  • settings - the type of target endpoint. For this quickstart, use the clickhouse_target, as reflected in the clickhouse_target endpoint model:

    • connection.connection_options - your endpoint's connection settings:

      • address - mdb_cluster_id. In this quickstart, we connect to a Managed ClickHouse® cluster, so we need to pass the cluster ID value.

        To get your cluster's ID, get a list of clusters in the project.

      • database - db_for_s3, this is your ClickHouse® database.

      • user - admin, your ClickHouse® cluster username.

      • password.value.raw - your ClickHouse® cluster password.

  • clickhouse_cluster_name - clickhouse-dev, this is your cluster's name.

  • cleanup_policy - CLICKHOUSE_CLEANUP_POLICY_DROP.
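
Here's the same kind of sketch for the target endpoint. Whether clickhouse_cluster_name and cleanup_policy sit inside the clickhouse_target settings is an assumption on my part; check the endpoint model for the exact layout.

    # Request fields for EndpointService.Create (target), mirroring the parameters above.
    target_endpoint_request = {
        "project_id": "<your_project_id>",
        "name": "clickhouse-target-endpoint",
        "settings": {
            "clickhouse_target": {
                "connection": {
                    "connection_options": {
                        "mdb_cluster_id": "<your_cluster_id>",   # the managed cluster's ID
                        "database": "db_for_s3",
                        "user": "admin",
                        "password": {"value": {"raw": "<your_password>"}},
                    }
                },
                "clickhouse_cluster_name": "clickhouse-dev",
                "cleanup_policy": "CLICKHOUSE_CLEANUP_POLICY_DROP",
            }
        },
    }
    # operation = client.create(**target_endpoint_request)  # hypothetical client wrapper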

Good work. Now we've created an endpoint that will receive and write the data to your ClickHouse® database. All we need now is the tool that will connect both endpoints and transfer the data.

Create and activate a transfer

This is a tool that activates the transfer process through the data pipeline. It will connect your endpoints and ensure the integrity of the data.

  1. In the list of services, select Transfer.

  2. Click Create transfer.

  3. Under Endpoints:

    1. Select first-s3-endpoint from the Source drop-down menu.

    2. Select clickhouse-target-endpoint from the Target drop-down menu.

  4. Under Basic settings:

    1. Enter the transfer Name: transfer-quickstart

    2. (optional) Enter the transfer Description.

  5. Under Transfer settings → Transfer type, make sure the default Snapshot type is selected.

  6. Leave all the other fields blank or with their default values.

  7. Click Submit. You will see the following line on your Transfer page:

    Screenshot of the transfer line on the Transfer page

  8. After you've created a transfer, click Activate.

  9. Wait until your transfer status changes to Done.

  1. To create the transfer via the API, use the TransferService create method with the following parameters:

    • source_id - the endpoint ID for the source endpoint.

      To find the endpoint ID, get a list of endpoints in the project.

    • target_id - the endpoint ID for the target endpoint.

    • name - the transfer name, transfer-quickstart.

    • project_id - the ID of the project in which you create a transfer. You can get this value on your project's information page.

    • type - SNAPSHOT_ONLY.

  2. Now, let's activate it using the TransferService activate method and pass the transfer ID in the transfer_id request parameter.

    To find the transfer ID, get a list of transfers in the project.
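
Sketched the same way, the create-and-activate sequence looks like this. Both calls are hypothetical client wrappers around TransferService; only the request fields come from the parameters above.

    # Request fields for TransferService.Create, mirroring the parameters above.
    transfer_request = {
        "source_id": "<source_endpoint_id>",
        "target_id": "<target_endpoint_id>",
        "name": "transfer-quickstart",
        "project_id": "<your_project_id>",
        "type": "SNAPSHOT_ONLY",
    }
    # operation = client.create(**transfer_request)       # hypothetical client wrapper
    # client.activate(transfer_id="<your_transfer_id>")   # TransferService.Activate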

Query the data in the Managed ClickHouse® cluster

Make sure that the data has been transferred to your ClickHouse® database:

  1. Open WebSQL from your ClickHouse® cluster page.

  2. Send the following query to check that your data exists in the cluster. The name of the table in db_for_s3 corresponds to the name of the source dataset (hits).

    SELECT * FROM db_for_s3.hits
    

    The query output should look as follows:

    ┌─Browser─┬─Cookie_Enabled─┬─Date───────┬─Gender─┬─Hit_ID─┬─Region_ID─┬─Technology───────────┬─Time_Spent─────────┬─Traffic_Source─┐
    │ Firefox │              0 │ 2016-01-10 │ Male   │  67112 │       184 │ PC (Mac)             │ 388.93975903614455 │ Direct         │
    │ Chrome  │              0 │ 2016-03-12 │ Female │  54562 │        51 │ Smartphone (I0S)     │ 325.20392156862744 │ Search engine  │ 
    │ Chrome  │              1 │ 2016-03-18 │ Female │  63833 │        51 │ Smartphone (I0S)     │ 316.09774436090225 │ Search engine  │ 
    │ Firefox │              1 │ 2016-03-24 │ Male   │  43941 │        51 │ PC (Windows)         │  263.7365269461078 │ Search engine  │ 
    │ Safari  │              0 │ 2016-03-30 │ Female │  38583 │        51 │ Smartphone (Android) │  363.8421052631579 │ Internal       │
    
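
If you'd rather verify the result from code instead of WebSQL, here's a minimal sketch using the clickhouse-connect Python driver. The driver choice is an assumption (any ClickHouse® client works), and the host, port, and password are placeholders; take the real values from Credentials on your cluster's Overview tab.

    import clickhouse_connect  # pip install clickhouse-connect

    # Connect to the cluster over HTTPS; all connection details are placeholders.
    client = clickhouse_connect.get_client(
        host="<your-cluster-host>",
        port=8443,                  # assumption: the HTTPS port from your cluster's connection info
        username="admin",
        password="<your_password>",
        secure=True,
    )

    # Count the transferred rows, then peek at a few of them.
    print(client.command("SELECT count(*) FROM db_for_s3.hits"))
    for row in client.query("SELECT * FROM db_for_s3.hits LIMIT 5").result_rows:
        print(row)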

Nice work! You've transferred all the data from a remote source and replicated it with complete integrity in your own ClickHouse® database. Now let's put this data to work.
