Transfer data from Amazon S3 to your Managed ClickHouse® cluster and vice versa

DoubleCloud provides multiple ways to upload data from an Amazon S3 bucket to your Managed ClickHouse® cluster. ClickHouse® table functions and engines are designed to fetch data only from Amazon buckets, not other S3-compatible buckets. In case you have any bucket other than Amazon S3, use Transfer.

All the methods described here allow users to upload data from public and private buckets.

You can select a way to integrate Amazon S3 that fits your needs the best from the options below. They are divided into two groups depending on whether you need to transfer data only once or constantly.

  1. One-time data upload:

  2. Continuous transfer:

Besides the S3 integrations mentioned above, each Managed ClickHouse® cluster in DoubleCloud comes with an integrated bucket by default. You don't pay for it until you use it to store 'cold' data.

To learn more about this storage policy, see Hybrid storage in Managed ClickHouse®.

Use the S3 table function

Use this built-in table function as part of a query to a bucket. This method is useful when you need to transfer data only once.

You can use this function to perform SELECT and INSERT queries on an Amazon S3 bucket. This will upload data to ClickHouse® from Amazon S3 or vice versa.

The function syntax is described below.

Upload to ClickHouse®:

SELECT * FROM
s3(bucket_URL, [format,] [aws_access_key_id, aws_secret_access_key,] [table_structure,] [compression]);

Upload to Amazon S3:

INSERT INTO FUNCTION s3(bucket_URL, [format,] [aws_access_key_id, aws_secret_access_key,] [table_structure,] [compression])
SELECT field_name1, field_name2 FROM existing_table;

Tip

If you don't want to expose your AWS credentials, use named collections to verify your queries:

Create a named collection and update your upload query

To create a named collection, a user with admin privileges must send this query:

CREATE NAMED COLLECTION s3_upload AS
access_key_id = 'AWS KEY',
secret_access_key = 'AWS SECRET KEY',
url = 'https://your_s3_bucket.s3.eu-central-1.amazonaws.com/';

Now, you can send the above query without exposing your Amazon S3 credentials:

INSERT INTO FUNCTION s3(bucket_URL, [format,] s3_upload [table_structure,] [compression])
SELECT field_name1, field_name2 FROM existing_table;

Parameters

The possible engine parameters are the following:

path

An URL to a bucket with a file path. Refer to the Wildcards in path in the ClickHouse® documentation to see how to upload multiple files.

format

The file format .

aws_access_key_id

An AWS access key ID used to access the bucket. This parameter is required for private buckets.

aws_secret_access_key

An AWS account Secret Access Key. This parameter is required for private buckets.

table structure

The structure of the table described as field_name1 FieldType1, field_name2 FieldType2, ....

compression

Compression type. Supported values: none, gzip/gz, brotli/br, xz/LZMA, zstd/zst. This parameter is optional. If not specified, ClickHouse® detects compression by file extension.

You can also apply table settings to tweak the data migration process.

The use of these functions is shown in the DoubleCloud documentation for two different formats:

Use the S3Cluster table function

Use this table function as part of a query to a bucket. This method is useful when you need to transfer data only once.

The key difference from the s3 function is that you can utilize the resources of multiple hosts in your cluster. When using this function, you can parallelize the data transfer process. As a result, the query execution is faster, but the overall resource consumption is higher.

You can use this function to perform SELECT and INSERT queries on an Amazon S3 bucket. This will upload data to ClickHouse® from Amazon S3 or vice versa.

The function syntax is described below.

Upload to ClickHouse®:

SELECT * FROM
s3Cluster(cluster, path, [aws_access_key_id, aws_secret_access_key,] [format,] [structure]);

Upload to Amazon S3:

INSERT INTO FUNCTION s3(cluster, path, [aws_access_key_id, aws_secret_access_key,] [format,] [structure])
SELECT field_name1, field_name2 FROM existing_table;

Parameters

The possible engine parameters are the following:

cluster

The cluster name in DoubleCloud in which you execute this query. We recommend that you use the {cluster} macro to substitute the required value automatically.

path

An URL to a bucket with a file path. Refer to the Wildcards in path in the ClickHouse® documentation to see how to upload multiple files.

aws_access_key_id

An AWS access key ID used to access the bucket. This parameter is required for private buckets.

aws_secret_access_key

An AWS account Secret Access Key. This parameter is required for private buckets.

format

The file format .

structure

The structure of the table described as field_name1 FieldType1, field_name2 FieldType2, ....

A practical example of the s3Cluster function usage

  1. Create a ClickHouse® cluster or select an existing one on Clusters page.

  2. On the Overview tab, go to the ConnectionSQL console line and click Open.

    The console will open in a new tab.

  3. To get access to the cluster, enter your username and password in the top right corner.

  4. Run the following query to create a table to insert data from Amazon S3:

    CREATE TABLE default.newtable ON CLUSTER default (
       region String, 
       country String, 
       item_type String, 
       order_date Date, 
       order_id Int32, 
       units_sold Int32, 
       unit_price Float64, 
       unit_cost Float64, 
       total_revenue Float64, 
       total_cost Float64, 
       total_profit Float64
    ) 
    ENGINE = ReplicatedMergeTree() 
    ORDER BY (order_id)
    
  5. Execute the following query with the s3Cluster table function to fetch data from the DoubleCloud public bucket:

    INSERT INTO newtable
    SELECT * 
    FROM s3Cluster(
       '{cluster}', 
       'https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/sales_info.csv', 
       'CSVWithNames',
       'region String, 
       country String, 
       item_type String, 
       order_date Date, 
       order_id Int32, 
       units_sold Int32, 
       unit_price Float64, 
       unit_cost Float64, 
       total_revenue Float64, 
       total_cost Float64, 
       total_profit Float64'
    )
    
  6. Send a SELECT query to check the data upload results:

    SELECT * FROM s3_engine_table LIMIT 5
    

    You should get the following output:

    ┌─region───────────────────────┬─country────┬─item_type──┬─────┬─unit_cost─┬─total_revenue─┬─total_cost─┬─total_profit─┐
    │ Europe                       │ Norway     │ Cereal     │ ... │    117.11 │        133705 │    76121.5 │      57583.5 │
    │ Sub-Saharan Africa           │ Mali       │ Vegetables │ ... │     90.93 │       63934.9 │   37735.95 │     26198.95 │
    │ Middle East and North Africa │ Azerbaijan │ Baby Food  │ ... │    159.42 │    1359621.28 │  849070.92 │    510550.36 │
    │ Sub-Saharan Africa           │ Botswana   │ Vegetables │ ... │     90.93 │     336467.04 │  198591.12 │    137875.92 │
    │ Europe                       │ Croatia    │ Baby Food  │ ... │    159.42 │    1339454.16 │  836476.74 │    502977.42 │
    └──────────────────────────────┴────────────┴────────────┴─────┴───────────┴───────────────┴────────────┴──────────────┘
    

Create a table with the S3 table engine

When you need to constantly upload data from an Amazon S3 bucket, create a table in a database with an S3 engine . This table will access your bucket with the provided credentials to fetch data automatically.

The S3 table engine creates a two-way connection between a ClickHouse® cluster and a bucket. It means that when you upload data to your table, the data is automatically uploaded to the bucket, and vice versa. No caching is applied and the data is uploaded immediately.

Use this method for regular data upload and download when you need to transfer data as is, especially if you have any data on the source that has any transformations applied.

The table DDL query structure is the following:

CREATE TABLE s3_engine_table ON CLUSTER default (
   field_name1 FieldType1, 
   field_name2 FieldType2
)
ENGINE = S3 (
   bucket_URL, 
   [aws_access_key_id, aws_secret_access_key,] 
   format, 
   [compression]
)
[SETTINGS]

Parameters

The possible engine parameters are the following:

path

An URL to a bucket with a file path. Refer to the Wildcards in path in the ClickHouse® documentation to see how to upload multiple files.

format

The file format .

aws_secret_access_key

An AWS account Secret Access Key. This parameter is required for private buckets.

compression

Compression type. Supported values: none, gzip/gz, brotli/br, xz/LZMA, zstd/zst. This parameter is optional. If not specified, ClickHouse® detects compression by file extension.

Settings

Besides the parameters, you can configure table settings . S3 tables also have several settings specific to this engine:

S3-related:

You can specify these settings with SQL and in the server configuration files.

s3_max_single_part_upload_size

The maximum size of an object to upload using single part upload to S3. The default value is 64Mb.

s3_min_upload_part_size

The minimum size of parts to upload during multipart upload to S3. The default value is 512Mb.

s3_max_redirects

The maximum number of S3 redirects hops allowed. The default value is 10.

s3_single_read_retries

The maximum number of attempts during a single read. The default value is 4.

A practical guide to creating a table with the S3 engine

  1. Create a ClickHouse® cluster or select an existing one on the Clusters page.

  2. On the Overview tab, go to the ConnectionSQL console line and click Open.

    The console will open in a new tab.

  3. To get access to the cluster, enter your username and password in the top right corner.

  4. Run the following query to create a table and fill it with the data from the DoubleCloud public bucket:

    CREATE TABLE default.s3_engine_table ON CLUSTER default (
       Hit_ID Int32, 
       Date Date, 
       Time_Spent Float32, 
       Cookie_Enabled Int32, 
       Region_ID Int32, 
       Gender String, 
       Browser String,
       Traffic_Source String, 
       Technology String
    )
    ENGINE=S3(
       'https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/hits_sample.csv', 'CSVWithNames'
    )
    SETTINGS format_csv_delimiter=';'
    
  5. Send a SELECT query to check the data upload:

    SELECT * FROM s3_engine_table LIMIT 3
    

    You should get the following output:

    ┌─Hit_ID─┬───────Date─┬─Time_Spent─┬─Cookie_Enabled─┬─Region_ID─┬─Gender─┬─Browser─┬─Traffic_Source─┬─Technology───┐
    │  67112 │ 2016-01-10 │  388.93976 │              0 │       184 │ Male   │ Firefox │ Direct         │ PC (Mac)     │
    │  65411 │ 2016-01-30 │  267.13525 │              0 │       229 │ Female │ Firefox │ Search engine  │ PC (Windows) │
    │  57240 │ 2016-02-05 │  321.99017 │              0 │       229 │ Female │ Firefox │ Search engine  │ PC (Windows) │
    └────────┴────────────┴────────────┴────────────────┴───────────┴────────┴─────────┴────────────────┴──────────────┘
    

After you create the S3 table in your Managed ClickHouse® cluster, you can upload more data to the source bucket. ClickHouse® will automatically transfer this newly uploaded data to the table.

Use Transfer to migrate data

This method differs from the S3 table functions and engine since it relies on the DoubleCloud services integration. Each transfer snapshots the current state of the bucket.

With Transfer, you can optimize the resulting tables to your needs during the data migration process. You only collect data in Amazon S3, but you perform all the transformations in DoubleCloud.

Transfer has the following key features:

  • Supports 4 data types - CSV, parquet, Avro, and JSON Lines.

  • Data transformations allow altering the list of tables or columns to transfer. With these transformations, you can modify the data transfer process on the fly.

  • Copy schedule allows you to select when and how to transfer data. Use this feature to set cursor fields for incremental copies and upload only new or modified data.

  • Runtimes make it possible to configure transfer performance and resource consumption.

The use of Transfer is demonstrated in the following articles for two different formats: