Upload Parquet data to your Managed ClickHouse® cluster
What these instructions are about
This article shows you how to upload Parquet files to your Managed ClickHouse® cluster in DoubleCloud using command-line tools.
We tested these instructions in the following environments:
- Ubuntu 20.04 LTS:
  - Bash: 5.1.16
  - clickhouse-client: 22.7.3.5
  - Docker: 20.10.12
Before you start
- Download a sample Parquet file; the import examples below refer to it as sample_data.parquet.
- Install the ClickHouse® client.
- Connect to your cluster with a connection string:
  - Go to the Clusters page and select your cluster.
  - Under Connection strings, find the Native interface string and click Copy.
  - Open your terminal and run a command to connect to your cluster:

    Docker:

    docker run --network host --rm -it clickhouse/<Native interface connection string>

    The complete Docker command structure:

    docker run --network host --rm -it \
    clickhouse/clickhouse-client \
    --host <FQDN of your cluster> \
    --secure \
    --user <cluster user name> \
    --password <cluster user password> \
    --port 9440

    Native clickhouse-client:

    <Native interface connection string>
- Create a table to import your Parquet data to.

  The table must have the same structure as your Parquet file (if you're unsure what that structure is, see the inspection sketch after this procedure). For this tutorial, create a simple table for a sample set of user records. Call it parquet_test_data:

  CREATE TABLE parquet_test_data ON CLUSTER default
  (
      `registration_date` String,
      `id` Int,
      `first_name` String,
      `last_name` String,
      `email` String,
      `gender` String,
      `ip_address` String,
      `country` String,
      `birthdate` String,
      `title` String
  )
  ENGINE = ReplicatedMergeTree()
  ORDER BY (id)
- Exit the cluster command-line interface by typing exit, so that you can access local files in the next step.
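If you're not sure which columns your Parquet file contains, you can inspect it locally before writing the CREATE TABLE statement above. This is a minimal Python sketch using pandas; the file path is a placeholder for your downloaded sample:

import pandas as pd

# Print the column names and pandas dtypes of the Parquet file;
# use them to choose matching ClickHouse® column types.
df = pd.read_parquet('<path to your .parquet file>')
print(df.dtypes)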
Upload your data
- Import the data from your Parquet file into the table on your cluster.

  You need to specify the path to your Parquet file, connect to the cluster, and perform an INSERT query. Type the following in your terminal:

  Native clickhouse-client:

  cat <path to your Parquet file> | <Native interface connection string> \
  --query="INSERT INTO default.parquet_test_data FORMAT Parquet"

  Docker:

  cat <path to sample_data.parquet> | docker run --network host --rm -i clickhouse/<Native interface connection string> --query="INSERT INTO default.parquet_test_data FORMAT Parquet"

  The complete Docker command structure:

  cat <path to sample_data.parquet> | docker run --network host --rm -i \
  clickhouse/clickhouse-client \
  --host <FQDN of your cluster> \
  --secure \
  --user <your cluster username> \
  --password <cluster user password> \
  --port 9440 \
  --query="INSERT INTO default.parquet_test_data FORMAT Parquet"
- Check your imported data:

  Native clickhouse-client:

  <Native interface connection string> --query="SELECT * FROM parquet_test_data LIMIT 20"

  Docker:

  docker run --network host --rm -it clickhouse/<Native interface connection string> --query="SELECT * FROM parquet_test_data LIMIT 20"

  The complete Docker command structure:

  docker run --network host --rm -it \
  clickhouse/clickhouse-client \
  --host <FQDN of your cluster> \
  --secure \
  --user <your cluster username> \
  --password <cluster user password> \
  --port 9440 \
  --query="SELECT * FROM parquet_test_data LIMIT 20"

  You should see the data from your Parquet file in the terminal output:
┌─registration_date─────────────┬─id─┬─first_name─┬─last_name─┬─email────────────────────┬─gender─┬─ip_address──────┬─country────────────────┬─birthdate──┬─title────────────────────────┐
│ 2016-02-03 07:55:29.000000000 │  1 │ Amanda     │ Jordan    │ ajordan0@com.com         │ Female │ 1.197.201.2     │ Indonesia              │ 3/8/1971   │ Internal Auditor             │
│ 2016-02-03 17:04:03.000000000 │  2 │ Albert     │ Freeman   │ afreeman1@is.gd          │ Male   │ 218.111.175.34  │ Canada                 │ 1/16/1968  │ Accountant IV                │
│ 2016-02-03 01:09:31.000000000 │  3 │ Evelyn     │ Morgan    │ emorgan2@altervista.org  │ Female │ 7.161.136.94    │ Russia                 │ 2/1/1960   │ Structural Engineer          │
│ 2016-02-03 00:36:21.000000000 │  4 │ Denise     │ Riley     │ driley3@gmpg.org         │ Female │ 140.35.109.83   │ China                  │ 4/8/1997   │ Senior Cost Accountant       │
│ 2016-02-03 05:05:31.000000000 │  5 │ Carlos     │ Burns     │ cburns4@miitbeian.gov.cn │        │ 169.113.235.40  │ South Africa           │            │                              │
│ 2016-02-03 07:22:34.000000000 │  6 │ Kathryn    │ White     │ kwhite5@google.com       │ Female │ 195.131.81.179  │ Indonesia              │ 2/25/1983  │ Account Executive            │
│ 2016-02-03 08:33:08.000000000 │  7 │ Samuel     │ Holmes    │ sholmes6@foxnews.com     │ Male   │ 232.234.81.197  │ Portugal               │ 12/18/1987 │ Senior Financial Analyst     │
│ 2016-02-03 06:47:06.000000000 │  8 │ Harry      │ Howell    │ hhowell7@eepurl.com      │ Male   │ 91.235.51.73    │ Bosnia and Herzegovina │ 3/1/1962   │ Web Developer IV             │
│ 2016-02-03 03:52:53.000000000 │  9 │ Jose       │ Foster    │ jfoster8@yelp.com        │ Male   │ 132.31.53.61    │ South Korea            │ 3/27/1992  │ Software Test Engineer I     │
│ 2016-02-03 18:29:47.000000000 │ 10 │ Emily      │ Stewart   │ estewart9@opensource.org │ Female │ 143.28.251.245  │ Nigeria                │ 1/28/1997  │ Health Coach IV              │
│ 2016-02-03 00:10:42.000000000 │ 11 │ Susan      │ Perkins   │ sperkinsa@patch.com      │ Female │ 180.85.0.62     │ Russia                 │            │                              │
│ 2016-02-03 18:04:34.000000000 │ 12 │ Alice      │ Berry     │ aberryb@wikipedia.org    │ Female │ 246.225.12.189  │ China                  │ 8/12/1968  │ Quality Engineer             │
│ 2016-02-03 18:48:17.000000000 │ 13 │ Justin     │ Berry     │ jberryc@usatoday.com     │ Male   │ 157.7.146.43    │ Zambia                 │ 8/15/1975  │ Structural Analysis Engineer │
│ 2016-02-03 21:46:52.000000000 │ 14 │ Kathy      │ Reynolds  │ kreynoldsd@redcross.org  │ Female │ 81.254.172.13   │ Bosnia and Herzegovina │ 6/27/1970  │ Librarian                    │
│ 2016-02-03 08:53:23.000000000 │ 15 │ Dorothy    │ Hudson    │ dhudsone@blogger.com     │ Female │ 8.59.7.0        │ Japan                  │ 12/20/1989 │ Nurse Practicioner           │
│ 2016-02-03 00:44:01.000000000 │ 16 │ Bruce      │ Willis    │ bwillisf@bluehost.com    │ Male   │ 239.182.219.189 │ Brazil                 │            │                              │
│ 2016-02-03 00:57:45.000000000 │ 17 │ Emily      │ Andrews   │ eandrewsg@cornell.edu    │ Female │ 29.231.180.172  │ Russia                 │ 4/13/1990  │ Food Chemist                 │
│ 2016-02-03 16:44:24.000000000 │ 18 │ Stephen    │ Wallace   │ swallaceh@netvibes.com   │ Male   │ 152.49.213.62   │ Ukraine                │ 1/15/1978  │ Account Representative I     │
│ 2016-02-03 11:45:54.000000000 │ 19 │ Clarence   │ Lawson    │ clawsoni@vkontakte.ru    │ Male   │ 107.175.15.152  │ Russia                 │            │                              │
│ 2016-02-03 10:30:36.000000000 │ 20 │ Rebecca    │ Bell      │ rbellj@bandcamp.com      │ Female │ 172.215.104.127 │ China                  │            │                              │
└───────────────────────────────┴────┴────────────┴───────────┴──────────────────────────┴────────┴─────────────────┴────────────────────────┴────────────┴──────────────────────────────┘
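If you'd rather verify the import programmatically than through the CLI, you can run the same query with the clickhouse-driver Python package (installed in the Sample connection scripts section below). This is a minimal sketch; the host, user, and password placeholders are for you to fill in:

from clickhouse_driver import Client

# Connect over the secure native protocol, the same port 9440 the CLI examples use
client = Client(
    host='<FQDN of your cluster>',
    port=9440,
    user='<cluster user name>',
    password='<cluster user password>',
    secure=True,
)

# Fetch a sample of the imported rows
for row in client.execute('SELECT * FROM default.parquet_test_data LIMIT 20'):
    print(row)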
Upload data from an Amazon S3 bucket
Uploading a Parquet file from an Amazon S3 bucket gives you extra flexibility with DoubleCloud Managed Service for ClickHouse®. It's a sequence of manual steps that gives you complete control over the process but lacks automation.
In short, you need to complete the following steps:
- Connect to your cluster.
- Create a table in the cluster that follows the structure of the Parquet file you want to import.
- Use the s3 table function of ClickHouse® to insert the contents of your Parquet file from an Amazon S3 bucket into your Managed ClickHouse® cluster.
Before you start, it's essential to familiarize yourself with the virtual-hosted-style URLs used by Amazon S3:

https://<Bucket-name>.s3.<Region>.amazonaws.com/<Key-name>

- <Bucket-name> is the name of the bucket that contains the file.
- <Region> is the identifier of the AWS region.
- <Key-name> is the key of the file you want to link to.
For example, suppose there's a bucket named doublecloud-docs located in Frankfurt (eu-central-1), and we want to access the userdata1.parquet file in the data-sets folder. The URL will look as follows:
https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/userdata1.parquet
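If you assemble such URLs often, a few lines of Python can help you avoid typos. The build_s3_url helper below is a hypothetical convenience function, not part of any AWS SDK:

def build_s3_url(bucket: str, region: str, key: str) -> str:
    """Build a virtual-hosted-style Amazon S3 URL from its three parts."""
    return f'https://{bucket}.s3.{region}.amazonaws.com/{key}'

# Reproduces the example URL from this section
print(build_s3_url('doublecloud-docs', 'eu-central-1', 'data-sets/userdata1.parquet'))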
Now let's follow every step in detail.
- Connect to your cluster with a connection string:
  - Go to the Clusters page and select your cluster.
  - Under Connection strings, find the Native interface string and click Copy.
  - Open your terminal and run a command to connect to your cluster:

    Docker:

    docker run --network host --rm -it clickhouse/<Native interface connection string>

    The complete Docker command structure:

    docker run --network host --rm -it \
    clickhouse/clickhouse-client \
    --host <FQDN of your cluster> \
    --secure \
    --user <cluster user name> \
    --password <cluster user password> \
    --port 9440

    Native clickhouse-client:

    <Native interface connection string>
- Create a table to import your Parquet data to.

  The table must have the same structure as your Parquet file. For this tutorial, create a simple table for a sample set of user records. Call it parquet_test:

  CREATE TABLE parquet_test ON CLUSTER default
  (
      `registration_date` String,
      `id` Int,
      `first_name` String,
      `last_name` String,
      `email` String,
      `gender` String,
      `ip_address` String,
      `country` String,
      `birthdate` String,
      `title` String
  )
  ENGINE = ReplicatedMergeTree()
  ORDER BY (id)
- Insert the data from your Parquet file located in the Amazon S3 bucket.

  To do this, use the s3 table function of ClickHouse®. The syntax of this function looks like this:

  s3('<URL>', '<format>', '<structure>')

  - <URL> is the path to your file.
  - <format> is the format of the file you want to import, Parquet in this case.
  - <structure> is the structure of the target table.

  Pay attention to the URL argument in the following code example: it incorporates the path to the Parquet file, and the whole function call is wrapped in an INSERT query (a Python alternative is sketched after this procedure):

  INSERT INTO parquet_test
  SELECT * FROM s3
  (
      'https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/userdata1.parquet',
      'Parquet',
      'registration_dttm String, id Int, first_name String, last_name String, email String, gender String, ip_address String, country String, birthdate String, title String'
  )
- Check your imported data:

  SELECT * FROM parquet_test LIMIT 20
You should see the data from your Parquet file in the terminal output:
┌─registration_date─────────────┬─id─┬─first_name─┬─last_name─┬─email────────────────────┬─gender─┬─ip_address──────┬─country────────────────┬─birthdate──┬─title────────────────────────┐
│ 2016-02-03 07:55:29.000000000 │  1 │ Amanda     │ Jordan    │ ajordan0@com.com         │ Female │ 1.197.201.2     │ Indonesia              │ 3/8/1971   │ Internal Auditor             │
│ 2016-02-03 17:04:03.000000000 │  2 │ Albert     │ Freeman   │ afreeman1@is.gd          │ Male   │ 218.111.175.34  │ Canada                 │ 1/16/1968  │ Accountant IV                │
│ 2016-02-03 01:09:31.000000000 │  3 │ Evelyn     │ Morgan    │ emorgan2@altervista.org  │ Female │ 7.161.136.94    │ Russia                 │ 2/1/1960   │ Structural Engineer          │
│ 2016-02-03 00:36:21.000000000 │  4 │ Denise     │ Riley     │ driley3@gmpg.org         │ Female │ 140.35.109.83   │ China                  │ 4/8/1997   │ Senior Cost Accountant       │
│ 2016-02-03 05:05:31.000000000 │  5 │ Carlos     │ Burns     │ cburns4@miitbeian.gov.cn │        │ 169.113.235.40  │ South Africa           │            │                              │
│ 2016-02-03 07:22:34.000000000 │  6 │ Kathryn    │ White     │ kwhite5@google.com       │ Female │ 195.131.81.179  │ Indonesia              │ 2/25/1983  │ Account Executive            │
│ 2016-02-03 08:33:08.000000000 │  7 │ Samuel     │ Holmes    │ sholmes6@foxnews.com     │ Male   │ 232.234.81.197  │ Portugal               │ 12/18/1987 │ Senior Financial Analyst     │
│ 2016-02-03 06:47:06.000000000 │  8 │ Harry      │ Howell    │ hhowell7@eepurl.com      │ Male   │ 91.235.51.73    │ Bosnia and Herzegovina │ 3/1/1962   │ Web Developer IV             │
│ 2016-02-03 03:52:53.000000000 │  9 │ Jose       │ Foster    │ jfoster8@yelp.com        │ Male   │ 132.31.53.61    │ South Korea            │ 3/27/1992  │ Software Test Engineer I     │
│ 2016-02-03 18:29:47.000000000 │ 10 │ Emily      │ Stewart   │ estewart9@opensource.org │ Female │ 143.28.251.245  │ Nigeria                │ 1/28/1997  │ Health Coach IV              │
│ 2016-02-03 00:10:42.000000000 │ 11 │ Susan      │ Perkins   │ sperkinsa@patch.com      │ Female │ 180.85.0.62     │ Russia                 │            │                              │
│ 2016-02-03 18:04:34.000000000 │ 12 │ Alice      │ Berry     │ aberryb@wikipedia.org    │ Female │ 246.225.12.189  │ China                  │ 8/12/1968  │ Quality Engineer             │
│ 2016-02-03 18:48:17.000000000 │ 13 │ Justin     │ Berry     │ jberryc@usatoday.com     │ Male   │ 157.7.146.43    │ Zambia                 │ 8/15/1975  │ Structural Analysis Engineer │
│ 2016-02-03 21:46:52.000000000 │ 14 │ Kathy      │ Reynolds  │ kreynoldsd@redcross.org  │ Female │ 81.254.172.13   │ Bosnia and Herzegovina │ 6/27/1970  │ Librarian                    │
│ 2016-02-03 08:53:23.000000000 │ 15 │ Dorothy    │ Hudson    │ dhudsone@blogger.com     │ Female │ 8.59.7.0        │ Japan                  │ 12/20/1989 │ Nurse Practicioner           │
│ 2016-02-03 00:44:01.000000000 │ 16 │ Bruce      │ Willis    │ bwillisf@bluehost.com    │ Male   │ 239.182.219.189 │ Brazil                 │            │                              │
│ 2016-02-03 00:57:45.000000000 │ 17 │ Emily      │ Andrews   │ eandrewsg@cornell.edu    │ Female │ 29.231.180.172  │ Russia                 │ 4/13/1990  │ Food Chemist                 │
│ 2016-02-03 16:44:24.000000000 │ 18 │ Stephen    │ Wallace   │ swallaceh@netvibes.com   │ Male   │ 152.49.213.62   │ Ukraine                │ 1/15/1978  │ Account Representative I     │
│ 2016-02-03 11:45:54.000000000 │ 19 │ Clarence   │ Lawson    │ clawsoni@vkontakte.ru    │ Male   │ 107.175.15.152  │ Russia                 │            │                              │
│ 2016-02-03 10:30:36.000000000 │ 20 │ Rebecca    │ Bell      │ rbellj@bandcamp.com      │ Female │ 172.215.104.127 │ China                  │            │                              │
└───────────────────────────────┴────┴────────────┴───────────┴──────────────────────────┴────────┴─────────────────┴────────────────────────┴────────────┴──────────────────────────────┘
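You can also issue the same s3-based INSERT from Python instead of the interactive client, as mentioned in the insert step above. This is a minimal sketch using the clickhouse-driver package from the Sample connection scripts section below; the connection placeholders are for you to fill in:

from clickhouse_driver import Client

client = Client(
    host='<FQDN of your cluster>',
    port=9440,
    user='<cluster user name>',
    password='<cluster user password>',
    secure=True,
)

# Run the s3() INSERT shown above; the cluster fetches the file itself,
# so nothing streams through your local machine.
client.execute("""
    INSERT INTO parquet_test
    SELECT * FROM s3(
        'https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/userdata1.parquet',
        'Parquet',
        'registration_dttm String, id Int, first_name String, last_name String, email String, gender String, ip_address String, country String, birthdate String, title String'
    )
""")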
Sample connection scripts
We tested these scripts in the following environments:
- Ubuntu 20.04:
  - Python: 3.8.10, pip: 20.0.2
These instructions show you how to create a script that performs all the actions needed to import your Parquet file with a single terminal command, while retaining complete control over the process.
In short, you need to complete the following steps:
- Install the necessary software.
- Write a script to upload your data to the cluster.
- Upload the data to your cluster.
Let's follow every step in detail.
- Install the software and dependencies:

  sudo apt update && sudo apt install -y python3 python3-pip && pip3 install requests
- Install the ClickHouse® driver and the dependencies needed to process .parquet files:

  pip install clickhouse-driver pandas fastparquet
- Write the script to upload your data to the cluster:

  parquet_uploader_test.py

  import pandas as pd
  from clickhouse_driver import Client

  # Connect to your ClickHouse® cluster over the secure native protocol
  client = Client(
      host='<FQDN of any ClickHouse® host on your cluster>',
      port=9440,
      user='<Cluster user name>',
      password='<Cluster user password>',
      secure=True)

  # Read the Parquet file into a pandas DataFrame
  # (fastparquet, installed in the previous step, serves as the engine)
  df = pd.read_parquet('<path to your .parquet file>', engine='fastparquet')

  # Insert the rows into the table created earlier. clickhouse-driver
  # sends data over the native protocol, so the query uses VALUES rather
  # than FORMAT Parquet; the tuples must match the column order of the
  # target table.
  client.execute(
      'INSERT INTO default.parquet_test_data VALUES',
      list(df.itertuples(index=False, name=None))
  )
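Run the script from your terminal with python3 parquet_uploader_test.py. To confirm that the upload succeeded, you can reuse the client object from the script to count the inserted rows, for example:

# Should print [(n,)], where n is the number of rows in your Parquet file
print(client.execute('SELECT count() FROM default.parquet_test_data'))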