Transfer data from Amazon S3 to your Managed ClickHouse® cluster and vice versa
DoubleCloud provides multiple ways to upload data from an Amazon S3 bucket to your Managed ClickHouse® cluster and back. The ClickHouse® table functions and engines described here are designed to fetch data only from Amazon S3 buckets, not from other S3-compatible storage. If your bucket isn't hosted on Amazon S3, use Transfer.
All the methods described here work with both public and private buckets.
Select the Amazon S3 integration that best fits your needs from the options below. They are divided into two groups depending on whether you need to transfer data only once or continuously.
-
One-time data upload: the s3 table function and the s3Cluster table function.
-
Continuous transfer: a table with the S3 table engine and the Transfer service.
Besides the S3 integrations mentioned above, each Managed ClickHouse® cluster in DoubleCloud comes with an integrated bucket by default. You don't pay for it until you use it to store 'cold' data.
To learn more about this storage policy, see Hybrid storage in Managed ClickHouse®.
Use the S3 table function
Use this built-in table function as part of a query to a bucket. This method is useful when you need to transfer data only once.
You can use this function to perform SELECT and INSERT queries on an Amazon S3 bucket. This will upload data to ClickHouse® from Amazon S3 or vice versa.
The function syntax is described below.
Upload to ClickHouse®:
SELECT * FROM
s3(bucket_URL, [format,] [aws_access_key_id, aws_secret_access_key,] [table_structure,] [compression]);
Upload to Amazon S3:
INSERT INTO FUNCTION s3(bucket_URL, [format,] [aws_access_key_id, aws_secret_access_key,] [table_structure,] [compression])
SELECT field_name1, field_name2 FROM existing_table;
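For example, a query along these lines reads the public sales_info.csv data set that also appears in the s3Cluster example below. The bucket is public, so the credentials are omitted:
SELECT *
FROM s3(
    'https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/sales_info.csv',
    'CSVWithNames'
)
LIMIT 5;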
Tip
If you don't want to expose your AWS credentials, use named collections to authenticate your queries:
Create a named collection and update your upload query
To create a named collection, a user with admin privileges must send this query:
CREATE NAMED COLLECTION s3_upload AS
access_key_id = 'AWS KEY',
secret_access_key = 'AWS SECRET KEY',
url = 'https://your_s3_bucket.s3.eu-central-1.amazonaws.com/';
Now, you can send the above query without exposing your Amazon S3 credentials:
INSERT INTO FUNCTION s3(s3_upload, [format,] [table_structure,] [compression])
SELECT field_name1, field_name2 FROM existing_table;
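For instance, a hypothetical export of two columns to a data.csv object in the bucket defined by the collection could look like this (the filename value and column names are placeholders):
INSERT INTO FUNCTION s3(s3_upload, filename = 'data.csv', format = 'CSVWithNames')
SELECT field_name1, field_name2 FROM existing_table;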
Parameters
The possible function parameters are the following:
| Parameter | Description |
|-----------|-------------|
| bucket_URL | A URL to a bucket with a file path. See the Wildcards in path section of the ClickHouse® documentation for details. |
| format | The file format of the data. |
| aws_access_key_id | An AWS access key ID used to access the bucket. This parameter is required for private buckets. |
| aws_secret_access_key | An AWS secret access key used to access the bucket. This parameter is required for private buckets. |
| table_structure | The structure of the table described as 'column1_name column1_type, column2_name column2_type, ...'. |
| compression | Compression type. Supported values: none, gzip/gz, brotli/br, xz/LZMA, zstd/zst. |
You can also apply table settings to the queries that use this function, as shown in the example below.
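For example, a format setting can be attached with a SETTINGS clause. This sketch assumes the semicolon-delimited hits_sample.csv data set that's used later in this article:
SELECT *
FROM s3(
    'https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/hits_sample.csv',
    'CSVWithNames'
)
LIMIT 3
SETTINGS format_csv_delimiter = ';';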
The use of this function is shown in the DoubleCloud documentation for two different formats.
Use the S3Cluster table function
Use this table function as part of a query to a bucket. This method is useful when you need to transfer data only once.
The key difference from the s3 function is that you can utilize the resources of multiple hosts in your cluster. When using this function, you can parallelize the data transfer process. As a result, the query execution is faster, but the overall resource consumption is higher.
You can use this function to perform SELECT and INSERT queries on an Amazon S3 bucket. This will upload data to ClickHouse® from Amazon S3 or vice versa.
The function syntax is described below.
Upload to ClickHouse®:
SELECT * FROM
s3Cluster(cluster, path, [aws_access_key_id, aws_secret_access_key,] [format,] [structure]);
Upload to Amazon S3:
INSERT INTO FUNCTION s3Cluster(cluster, path, [aws_access_key_id, aws_secret_access_key,] [format,] [structure])
SELECT field_name1, field_name2 FROM existing_table;
Parameters
The possible function parameters are the following:
| Parameter | Description |
|-----------|-------------|
| cluster | The name of the cluster in DoubleCloud in which you execute this query. We recommend that you use the {cluster} macro, which resolves to the name of the cluster running the query. |
| path | A URL to a bucket with a file path. See the Wildcards in path section of the ClickHouse® documentation for details. |
| aws_access_key_id | An AWS access key ID used to access the bucket. This parameter is required for private buckets. |
| aws_secret_access_key | An AWS secret access key used to access the bucket. This parameter is required for private buckets. |
| format | The file format of the data. |
| structure | The structure of the table described as 'column1_name column1_type, column2_name column2_type, ...'. |
A practical example of the s3Cluster function usage
-
Create a ClickHouse® cluster or select an existing one on the Clusters page.
-
Click WebSQL at the top right.
-
In WebSQL, click on any database in the connection manager on the left to open the query editor.
-
Run the following query to create a table to insert data from Amazon S3:
CREATE TABLE default.newtable ON CLUSTER default
(
    region String,
    country String,
    item_type String,
    order_date Date,
    order_id Int32,
    units_sold Int32,
    unit_price Float64,
    unit_cost Float64,
    total_revenue Float64,
    total_cost Float64,
    total_profit Float64
)
ENGINE = ReplicatedMergeTree()
ORDER BY (order_id)
-
Execute the following query with the s3Cluster table function to fetch data from the DoubleCloud public bucket:
INSERT INTO newtable
SELECT * FROM s3Cluster(
    '{cluster}',
    'https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/sales_info.csv',
    'CSVWithNames',
    'region String, country String, item_type String, order_date Date, order_id Int32, units_sold Int32, unit_price Float64, unit_cost Float64, total_revenue Float64, total_cost Float64, total_profit Float64'
)
-
Send a SELECT query to check the data upload results:
SELECT * FROM newtable LIMIT 5
You should get the following output:
┌─region───────────────────────┬─country────┬─item_type──┬─────┬─unit_cost─┬─total_revenue─┬─total_cost─┬─total_profit─┐
│ Europe                       │ Norway     │ Cereal     │ ... │    117.11 │        133705 │    76121.5 │      57583.5 │
│ Sub-Saharan Africa           │ Mali       │ Vegetables │ ... │     90.93 │       63934.9 │   37735.95 │     26198.95 │
│ Middle East and North Africa │ Azerbaijan │ Baby Food  │ ... │    159.42 │    1359621.28 │  849070.92 │    510550.36 │
│ Sub-Saharan Africa           │ Botswana   │ Vegetables │ ... │     90.93 │     336467.04 │  198591.12 │    137875.92 │
│ Europe                       │ Croatia    │ Baby Food  │ ... │    159.42 │    1339454.16 │  836476.74 │    502977.42 │
└──────────────────────────────┴────────────┴────────────┴─────┴───────────┴───────────────┴────────────┴──────────────┘
Create a table with the S3 table engine
When you need to constantly upload data from an Amazon S3 bucket, create a table in a database with the S3 table engine.
The S3 table engine creates a two-way connection between a ClickHouse® cluster and a bucket. It means that when you upload data to your table, the data is automatically uploaded to the bucket, and vice versa. No caching is applied and the data is uploaded immediately.
Use this method for regular data uploads and downloads when you need to transfer the data as is, especially when the data at the source already has all the required transformations applied.
The table DDL query structure is the following:
CREATE TABLE s3_engine_table ON CLUSTER default (
field_name1 FieldType1,
field_name2 FieldType2
)
ENGINE = S3 (
bucket_URL,
[aws_access_key_id, aws_secret_access_key,]
format,
[compression]
)
[SETTINGS]
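As an illustration, here's what this template could look like for a hypothetical private bucket; the bucket URL, the credentials, and the columns are placeholders:
CREATE TABLE example_s3_table ON CLUSTER default (
    order_id Int32,
    total_revenue Float64
)
ENGINE = S3 (
    'https://your_s3_bucket.s3.eu-central-1.amazonaws.com/sales/data.csv.gz',
    'AWS KEY', 'AWS SECRET KEY',
    'CSVWithNames',
    'gzip'
)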
Parameters
The possible engine parameters are the following:
| Parameter | Description |
|-----------|-------------|
| bucket_URL | A URL to a bucket with a file path. See the Wildcards in path section of the ClickHouse® documentation for details. |
| aws_access_key_id | An AWS access key ID used to access the bucket. This parameter is required for private buckets. |
| aws_secret_access_key | An AWS secret access key used to access the bucket. This parameter is required for private buckets. |
| format | The file format of the data. |
| compression | Compression type. Supported values: none, gzip/gz, brotli/br, xz/LZMA, zstd/zst. |
Settings
Besides the parameters, you can configure the S3-related table settings listed below. You can specify these settings with SQL or in the server configuration files.
| Setting | Description |
|---------|-------------|
| s3_max_single_part_upload_size | The maximum size of an object to upload using single part upload to S3. |
| s3_min_upload_part_size | The minimum size of a part to upload during multipart upload. |
| s3_max_redirects | The maximum number of S3 redirect hops allowed. |
| s3_single_read_retries | The maximum number of attempts during a single read. |
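For example, assuming the setting names listed above, you can adjust them for a session or for a single query with SQL. This is a sketch, not a DoubleCloud-specific requirement:
-- Session level: applies to the following queries in this session
SET s3_single_read_retries = 8;

-- Query level: applies to this query only
SELECT count(*)
FROM s3('https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/hits_sample.csv', 'CSVWithNames')
SETTINGS s3_max_redirects = 5, format_csv_delimiter = ';';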
A practical guide to creating a table with the S3 engine
-
Create a ClickHouse® cluster or select an existing one on the Clusters page.
-
Click WebSQL at the top right.
-
In WebSQL, click on any database in the connection manager on the left to open the query editor.
-
Run the following query to create a table and fill it with the data from the DoubleCloud public bucket:
CREATE TABLE default.s3_engine_table ON CLUSTER default
(
    Hit_ID Int32,
    Date Date,
    Time_Spent Float32,
    Cookie_Enabled Int32,
    Region_ID Int32,
    Gender String,
    Browser String,
    Traffic_Source String,
    Technology String
)
ENGINE = S3(
    'https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/hits_sample.csv',
    'CSVWithNames'
)
SETTINGS format_csv_delimiter = ';'
-
Send a SELECT query to check the data upload:
SELECT * FROM s3_engine_table LIMIT 3
┌─Hit_ID─┬───────Date─┬─Time_Spent─┬─Cookie_Enabled─┬─Region_ID─┬─Gender─┬─Browser─┬─Traffic_Source─┬─Technology───┐
│  67112 │ 2016-01-10 │  388.93976 │              0 │       184 │ Male   │ Firefox │ Direct         │ PC (Mac)     │
│  65411 │ 2016-01-30 │  267.13525 │              0 │       229 │ Female │ Firefox │ Search engine  │ PC (Windows) │
│  57240 │ 2016-02-05 │  321.99017 │              0 │       229 │ Female │ Firefox │ Search engine  │ PC (Windows) │
└────────┴────────────┴────────────┴────────────────┴───────────┴────────┴─────────┴────────────────┴──────────────┘
After you create the S3 table in your Managed ClickHouse® cluster, you can upload more data to the source bucket. ClickHouse® will automatically transfer this newly uploaded data to the table.
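The connection also works in the other direction: an INSERT into the table writes data to the bucket. Below is a sketch that assumes a table named my_s3_table with the same schema as above, pointing at a bucket you can write to (the public doublecloud-docs bucket is read-only). The s3_create_new_file_on_insert setting is one way to avoid an error when the target object already exists:
-- Create a new object in the bucket instead of failing because the target file already exists
SET s3_create_new_file_on_insert = 1;

INSERT INTO my_s3_table VALUES
    (1, '2016-01-10', 388.9, 0, 184, 'Male', 'Firefox', 'Direct', 'PC (Mac)');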
Use Transfer to migrate data
This method differs from the S3 table functions and engine since it relies on the DoubleCloud services integration. Each transfer snapshots the current state of the bucket.
With Transfer, you can tailor the resulting tables to your needs during the data migration process: you only collect data in Amazon S3 and perform all the transformations in DoubleCloud.
Transfer has the following key features:
-
Supports four data formats: CSV, parquet, Avro, and JSON Lines.
-
Data transformations allow altering the list of tables or columns to transfer. With these transformations, you can modify the data transfer process on the fly.
-
Copy schedule allows you to select when and how to transfer data. Use this feature to set cursor fields for incremental copies and upload only new or modified data.
-
Runtimes make it possible to configure transfer performance and resource consumption.
The use of Transfer is demonstrated in the following articles for two different formats:
-
Quick-start into the DoubleCloud world. This introductory tutorial illustrates how to transfer a CSV file from an Amazon S3 bucket to the Managed ClickHouse® cluster.
-
Getting started with Transfer shows how to transfer a parquet file from Amazon S3 to ClickHouse®.