DoubleCloud provides multiple ways to upload data from an Amazon S3 bucket to your Managed ClickHouse® cluster. ClickHouse® table functions and engines are designed to fetch data only from Amazon S3 buckets, not from other S3-compatible storage. If your bucket is hosted anywhere other than Amazon S3, use Transfer.
All the methods described here let you upload data from both public and private buckets.
Select the way to integrate Amazon S3 that best fits your needs from the options below. They're divided into two groups depending on whether you need to transfer data once or continuously.
Besides the S3 integrations mentioned above, each Managed ClickHouse® cluster in DoubleCloud comes with an integrated bucket by default. You don't pay for it until you use it to store 'cold' data.
Use this built-in table function as part of a query to a bucket. This method is useful when you need to transfer data only once.
You can use this function to perform SELECT and INSERT queries on an Amazon S3 bucket, uploading data from Amazon S3 to ClickHouse® or vice versa.
aws_access_key_id
An AWS access key ID used to access the bucket. This parameter is required for private buckets.
aws_secret_access_key
An AWS account Secret Access Key. This parameter is required for private buckets.
structure
The structure of the table described as field_name1 FieldType1, field_name2 FieldType2, ....
compression
Compression type. Supported values: none, gzip/gz, brotli/br, xz/LZMA, zstd/zst. This parameter is optional. If not specified, ClickHouse® detects compression by file extension.
You can also apply table settings to tweak the data migration process.
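For illustration, queries with the s3 table function might look like the following sketch. The bucket URL, file paths, column names, and the hits table are hypothetical, and the credentials are placeholders to replace with your own:

```sql
-- Read the first rows of a CSV file from a (hypothetical) public bucket
SELECT *
FROM s3(
    'https://example-bucket.s3.amazonaws.com/data/hits.csv',
    'CSV',
    'user_id UInt64, event_time DateTime, url String'
)
LIMIT 10;

-- Export data from a local table to the same (hypothetical) bucket,
-- passing credentials because the bucket is private
INSERT INTO FUNCTION s3(
    'https://example-bucket.s3.amazonaws.com/export/hits.csv',
    '<aws_access_key_id>',
    '<aws_secret_access_key>',
    'CSV',
    'user_id UInt64, event_time DateTime, url String'
)
SELECT user_id, event_time, url FROM hits;
```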
The use of these functions is shown in the DoubleCloud documentation for two different formats:
Use this table function as part of a query to a bucket. This method is useful when you need to transfer data only once.
The key difference from the s3 function is that you can utilize the resources of multiple hosts in your cluster. When using this function, you can parallelize the data transfer process. As a result, the query execution is faster, but the overall resource consumption is higher.
You can use this function to perform SELECT and INSERT queries on an Amazon S3 bucket, uploading data from Amazon S3 to ClickHouse® or vice versa.
cluster_name
The name of the DoubleCloud cluster in which you execute this query. We recommend that you use the {cluster} macro to substitute the required value automatically.
path
A URL to a bucket with a file path. Refer to Wildcards in path in the ClickHouse® documentation to see how to upload multiple files.
aws_access_key_id
An AWS access key ID used to access the bucket. This parameter is required for private buckets.
aws_secret_access_key
An AWS account Secret Access Key. This parameter is required for private buckets.
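As a sketch, an s3Cluster query over a private bucket might look like this; the bucket, wildcard path, credentials, and table structure are placeholders:

```sql
-- Count rows across many Parquet files, parallelizing the read over all cluster hosts
SELECT count()
FROM s3Cluster(
    '{cluster}',
    'https://example-bucket.s3.amazonaws.com/data/*.parquet',
    '<aws_access_key_id>',
    '<aws_secret_access_key>',
    'Parquet',
    'user_id UInt64, event_time DateTime, url String'
);
```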
When you need to upload data from an Amazon S3 bucket continuously, create a table in a database with the S3 engine. This table accesses your bucket with the provided credentials to fetch data automatically.
The S3 table engine creates a two-way connection between a ClickHouse® cluster and a bucket: when you upload data to your table, it's automatically written to the bucket, and vice versa. No caching is applied, and the data is transferred immediately.
Use this method for regular data uploads and downloads when you need to transfer data as is, for example, when the data on the source already has the required transformations applied.
aws_secret_access_key
An AWS account Secret Access Key. This parameter is required for private buckets.
compression
Compression type. Supported values: none, gzip/gz, brotli/br, xz/LZMA, zstd/zst. This parameter is optional. If not specified, ClickHouse® detects compression by file extension.
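To make the setup concrete, here's a minimal sketch of a table with the S3 engine. The table name, bucket URL, credentials, and columns are assumptions for illustration; the wildcard in the path makes the table read all matching files:

```sql
-- Create a table backed by (hypothetical) CSV files in a private bucket
CREATE TABLE s3_events
(
    user_id UInt64,
    event_time DateTime,
    url String
)
ENGINE = S3(
    'https://example-bucket.s3.amazonaws.com/events/*.csv',
    '<aws_access_key_id>',
    '<aws_secret_access_key>',
    'CSV'
);
```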
Settings
Besides the parameters, you can configure table settings. S3 tables also have several settings specific to this engine:
S3-related:
You can specify these settings with SQL and in the server configuration files.
s3_max_single_part_upload_size
The maximum size of an object to upload using a single-part upload to S3. The default value is 64 MB.
s3_min_upload_part_size
The minimum size of parts to upload during multipart upload to S3. The default value is 512 MB.
s3_max_redirects
The maximum number of S3 redirect hops allowed. The default value is 10.
s3_single_read_retries
The maximum number of attempts during a single read. The default value is 4.
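For instance, a session-level override of one of these settings might look like the sketch below; the value is an arbitrary example expressed in bytes:

```sql
-- Raise the minimum multipart upload part size for the current session to 128 MB
SET s3_min_upload_part_size = 134217728;
```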
A practical guide to creating a table with the S3 engine
After you create the S3 table in your Managed ClickHouse® cluster, you can upload more data to the source bucket. ClickHouse® will automatically transfer this newly uploaded data to the table.
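For example, assuming the hypothetical s3_events table from the sketch above, a simple query is enough to see files added to the bucket after the table was created:

```sql
-- New objects matching the table's path are visible to subsequent queries
SELECT count() FROM s3_events;
```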
Use Transfer to migrate data
This method differs from the S3 table functions and engine since it relies on the integration between DoubleCloud services. Each transfer snapshots the current state of the bucket.
With Transfer, you can optimize the resulting tables to your needs during the data migration process. You only collect data in Amazon S3, but you perform all the transformations in DoubleCloud.
Transfer has the following key features:
Supports four data formats: CSV, Parquet, Avro, and JSON Lines.
Data transformations allow altering the list of tables or columns to transfer. With these transformations, you can modify the data transfer process on the fly.
Copy schedule allows you to select when and how to transfer data. Use this feature to set cursor fields for incremental copies and upload only new or modified data.
Runtimes make it possible to configure transfer performance and resource consumption.
The use of Transfer is demonstrated in the following articles for two different formats:
Quick-start into the DoubleCloud world. This introductory tutorial illustrates how to transfer a CSV file from an Amazon S3 bucket to a Managed ClickHouse® cluster.