Under Connection strings, find the Native interface string and click Copy.
Open your terminal and run a command to connect to your cluster:
Docker
Native clickhouse-client
docker run --network host --rm -it clickhouse/<Native interface connection string>
The complete Docker command structure
docker run --network host --rm -it \
clickhouse/clickhouse-client \
--host <FQDN of your cluster> \
--secure \
--user <cluster user name> \
--password <cluster user password> \
--port 9440
<Native interface connection string>
Create a table to import your Parquet data to.
The table must have the same structure as your Parquet file.
For this tutorial, create a simple table containing a set of sample user records.
Call it parquet_test_data:
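A minimal table definition matching the column list used in the S3 section later in this tutorial might look as follows (the MergeTree engine and ORDER BY key here are assumptions; adjust them to your needs):
CREATE TABLE default.parquet_test_data
(
    registration_dttm String,
    id Int32,
    first_name String,
    last_name String,
    email String,
    gender String,
    ip_address String,
    country String,
    birthdate String,
    title String
)
ENGINE = MergeTree
ORDER BY id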
Type exit to leave the cluster command-line interface. You'll need access to your local files in the next step.
Upload your data
Import the data from your Parquet file into the table on your cluster.
You need to specify the path to your Parquet file, connect to the cluster and perform an INSERT query. Type the following in your terminal:
Native clickhouse-client
Docker
cat <path to your Parquet file> | <Native interface connection string> \
--query="INSERT INTO default.parquet_test_data FORMAT Parquet"
cat <path to sample_data.parquet> | docker run --network host --rm -i \
clickhouse/<Native interface connection string> \
--query="INSERT INTO default.parquet_test_data FORMAT Parquet"
The complete Docker command structure
cat <path to your Parquet file> | docker run --network host --rm -i \
clickhouse/clickhouse-client \
--host <FQDN of your cluster> \
--secure \
--user <your cluster username> \
--password <cluster user password> \
--port 9440 \
--query="INSERT INTO default.parquet_test_data FORMAT Parquet"
Check your imported data:
Native clickhouse-client
Docker
<Native interface connection string> --query="SELECT * FROM parquet_test_data LIMIT 20"
docker run --network host --rm -it \
clickhouse/<Native interface connection string> \
--query="SELECT * FROM parquet_test_data LIMIT 20"
The complete Docker command structure
docker run --network host --rm -it \
clickhouse/clickhouse-client \
--host <FQDN of your cluster> \
--secure \
--user <your cluster username> \
--password <cluster user password> \
--port 9440 \
--query="SELECT * FROM parquet_test_data LIMIT 20"
You should see the data from your Parquet file in the terminal output:
┌─registration_date─────────────┬─id─┬─first_name─┬─last_name─┬─email────────────────────┬─gender─┬─ip_address──────┬─country────────────────┬─birthdate──┬─title────────────────────────┐
│ 2016-02-03 07:55:29.000000000 │ 1 │ Amanda │ Jordan │ ajordan0@com.com │ Female │ 1.197.201.2 │ Indonesia │ 3/8/1971 │ Internal Auditor │
│ 2016-02-03 17:04:03.000000000 │ 2 │ Albert │ Freeman │ afreeman1@is.gd │ Male │ 218.111.175.34 │ Canada │ 1/16/1968 │ Accountant IV │
│ 2016-02-03 01:09:31.000000000 │ 3 │ Evelyn │ Morgan │ emorgan2@altervista.org │ Female │ 7.161.136.94 │ Russia │ 2/1/1960 │ Structural Engineer │
│ 2016-02-03 00:36:21.000000000 │ 4 │ Denise │ Riley │ driley3@gmpg.org │ Female │ 140.35.109.83 │ China │ 4/8/1997 │ Senior Cost Accountant │
│ 2016-02-03 05:05:31.000000000 │ 5 │ Carlos │ Burns │ cburns4@miitbeian.gov.cn │ │ 169.113.235.40 │ South Africa │ │ │
│ 2016-02-03 07:22:34.000000000 │ 6 │ Kathryn │ White │ kwhite5@google.com │ Female │ 195.131.81.179 │ Indonesia │ 2/25/1983 │ Account Executive │
│ 2016-02-03 08:33:08.000000000 │ 7 │ Samuel │ Holmes │ sholmes6@foxnews.com │ Male │ 232.234.81.197 │ Portugal │ 12/18/1987 │ Senior Financial Analyst │
│ 2016-02-03 06:47:06.000000000 │ 8 │ Harry │ Howell │ hhowell7@eepurl.com │ Male │ 91.235.51.73 │ Bosnia and Herzegovina │ 3/1/1962 │ Web Developer IV │
│ 2016-02-03 03:52:53.000000000 │ 9 │ Jose │ Foster │ jfoster8@yelp.com │ Male │ 132.31.53.61 │ South Korea │ 3/27/1992 │ Software Test Engineer I │
│ 2016-02-03 18:29:47.000000000 │ 10 │ Emily │ Stewart │ estewart9@opensource.org │ Female │ 143.28.251.245 │ Nigeria │ 1/28/1997 │ Health Coach IV │
│ 2016-02-03 00:10:42.000000000 │ 11 │ Susan │ Perkins │ sperkinsa@patch.com │ Female │ 180.85.0.62 │ Russia │ │ │
│ 2016-02-03 18:04:34.000000000 │ 12 │ Alice │ Berry │ aberryb@wikipedia.org │ Female │ 246.225.12.189 │ China │ 8/12/1968 │ Quality Engineer │
│ 2016-02-03 18:48:17.000000000 │ 13 │ Justin │ Berry │ jberryc@usatoday.com │ Male │ 157.7.146.43 │ Zambia │ 8/15/1975 │ Structural Analysis Engineer │
│ 2016-02-03 21:46:52.000000000 │ 14 │ Kathy │ Reynolds │ kreynoldsd@redcross.org │ Female │ 81.254.172.13 │ Bosnia and Herzegovina │ 6/27/1970 │ Librarian │
│ 2016-02-03 08:53:23.000000000 │ 15 │ Dorothy │ Hudson │ dhudsone@blogger.com │ Female │ 8.59.7.0 │ Japan │ 12/20/1989 │ Nurse Practicioner │
│ 2016-02-03 00:44:01.000000000 │ 16 │ Bruce │ Willis │ bwillisf@bluehost.com │ Male │ 239.182.219.189 │ Brazil │ │ │
│ 2016-02-03 00:57:45.000000000 │ 17 │ Emily │ Andrews │ eandrewsg@cornell.edu │ Female │ 29.231.180.172 │ Russia │ 4/13/1990 │ Food Chemist │
│ 2016-02-03 16:44:24.000000000 │ 18 │ Stephen │ Wallace │ swallaceh@netvibes.com │ Male │ 152.49.213.62 │ Ukraine │ 1/15/1978 │ Account Representative I │
│ 2016-02-03 11:45:54.000000000 │ 19 │ Clarence │ Lawson │ clawsoni@vkontakte.ru │ Male │ 107.175.15.152 │ Russia │ │ │
│ 2016-02-03 10:30:36.000000000 │ 20 │ Rebecca │ Bell │ rbellj@bandcamp.com │ Female │ 172.215.104.127 │ China │ │ │
└───────────────────────────────┴────┴────────────┴───────────┴──────────────────────────┴────────┴─────────────────┴────────────────────────┴────────────┴──────────────────────────────┘
Amazon S3 URL
Uploading a Parquet file from an Amazon S3 bucket gives you extra flexibility with DoubleCloud Managed Service for ClickHouse®. This approach is a sequence of manual steps that gives you complete control over the process but lacks automation.
In short, you need to complete the following steps:
Connect to your cluster.
Create a table in the cluster that follows the structure of the Parquet file you want to import.
Use the s3 table function to insert the contents of your Parquet file from an Amazon S3 bucket into your Managed ClickHouse® cluster.
Before you start, it's essential to familiarize yourself with path-style URLs used by Amazon S3:
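https://<bucket-name>.s3.<region>.amazonaws.com/<key-name>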
Bucket-name is the name of the bucket that contains the file.
Key-name is the name of the file you want to link to.
For example, there's a bucket named doublecloud-docs located in Frankfurt (eu-central-1), and we want to access the userdata1.parquet file in the /data-sets folder. The path-style URL will be as follows:
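https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/userdata1.parquet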
Under Connection strings, find the Native interface string and click Copy.
Open your terminal and run a command to connect to your cluster:
Docker
Native clickhouse-client
docker run --network host --rm -it clickhouse/<Native interface connection string>
The complete Docker command structure
docker run --network host --rm -it \
clickhouse/clickhouse-client \
--host <FQDN of your cluster> \
--secure \
--user <cluster user name> \
--password <cluster user password> \
--port 9440
<Native interface connection string>
Create a table to import your Parquet data to.
The table must have the same structure as your Parquet file.
For this tutorial, create a simple table containing a set of sample user records.
Call it parquet_test:
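As before, a minimal definition matching the structure passed to the s3 function below might look like this (the engine and ordering key are assumptions):
CREATE TABLE parquet_test
(
    registration_dttm String,
    id Int32,
    first_name String,
    last_name String,
    email String,
    gender String,
    ip_address String,
    country String,
    birthdate String,
    title String
)
ENGINE = MergeTree
ORDER BY id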
Insert the data from your Parquet file located in the Amazon S3 bucket.
To do this, you need to use the s3 table function of ClickHouse®. The syntax of this function looks like this:
s3('<URL>', 'Parquet', '<structure>')
<URL> is the path to your file.
Parquet is the format of the file you want to import.
<structure> is the structure of the target table.
Pay attention to the s3 function call in the following code example: it incorporates the path to the Parquet file, its format, and the table structure into the INSERT query:
INSERT INTO parquet_test
SELECT * FROM s3
(
'https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/userdata1.parquet',
'Parquet',
'registration_dttm String, id Int, first_name String, last_name String, email String, gender String, ip_address String, country String, birthdate String, title String'
)
Check your imported data:
SELECT * FROM parquet_test LIMIT 20
You should see the data from your Parquet file in the terminal output:
┌─registration_date─────────────┬─id─┬─first_name─┬─last_name─┬─email────────────────────┬─gender─┬─ip_address──────┬─country────────────────┬─birthdate──┬─title────────────────────────┐
│ 2016-02-03 07:55:29.000000000 │ 1 │ Amanda │ Jordan │ ajordan0@com.com │ Female │ 1.197.201.2 │ Indonesia │ 3/8/1971 │ Internal Auditor │
│ 2016-02-03 17:04:03.000000000 │ 2 │ Albert │ Freeman │ afreeman1@is.gd │ Male │ 218.111.175.34 │ Canada │ 1/16/1968 │ Accountant IV │
│ 2016-02-03 01:09:31.000000000 │ 3 │ Evelyn │ Morgan │ emorgan2@altervista.org │ Female │ 7.161.136.94 │ Russia │ 2/1/1960 │ Structural Engineer │
│ 2016-02-03 00:36:21.000000000 │ 4 │ Denise │ Riley │ driley3@gmpg.org │ Female │ 140.35.109.83 │ China │ 4/8/1997 │ Senior Cost Accountant │
│ 2016-02-03 05:05:31.000000000 │ 5 │ Carlos │ Burns │ cburns4@miitbeian.gov.cn │ │ 169.113.235.40 │ South Africa │ │ │
│ 2016-02-03 07:22:34.000000000 │ 6 │ Kathryn │ White │ kwhite5@google.com │ Female │ 195.131.81.179 │ Indonesia │ 2/25/1983 │ Account Executive │
│ 2016-02-03 08:33:08.000000000 │ 7 │ Samuel │ Holmes │ sholmes6@foxnews.com │ Male │ 232.234.81.197 │ Portugal │ 12/18/1987 │ Senior Financial Analyst │
│ 2016-02-03 06:47:06.000000000 │ 8 │ Harry │ Howell │ hhowell7@eepurl.com │ Male │ 91.235.51.73 │ Bosnia and Herzegovina │ 3/1/1962 │ Web Developer IV │
│ 2016-02-03 03:52:53.000000000 │ 9 │ Jose │ Foster │ jfoster8@yelp.com │ Male │ 132.31.53.61 │ South Korea │ 3/27/1992 │ Software Test Engineer I │
│ 2016-02-03 18:29:47.000000000 │ 10 │ Emily │ Stewart │ estewart9@opensource.org │ Female │ 143.28.251.245 │ Nigeria │ 1/28/1997 │ Health Coach IV │
│ 2016-02-03 00:10:42.000000000 │ 11 │ Susan │ Perkins │ sperkinsa@patch.com │ Female │ 180.85.0.62 │ Russia │ │ │
│ 2016-02-03 18:04:34.000000000 │ 12 │ Alice │ Berry │ aberryb@wikipedia.org │ Female │ 246.225.12.189 │ China │ 8/12/1968 │ Quality Engineer │
│ 2016-02-03 18:48:17.000000000 │ 13 │ Justin │ Berry │ jberryc@usatoday.com │ Male │ 157.7.146.43 │ Zambia │ 8/15/1975 │ Structural Analysis Engineer │
│ 2016-02-03 21:46:52.000000000 │ 14 │ Kathy │ Reynolds │ kreynoldsd@redcross.org │ Female │ 81.254.172.13 │ Bosnia and Herzegovina │ 6/27/1970 │ Librarian │
│ 2016-02-03 08:53:23.000000000 │ 15 │ Dorothy │ Hudson │ dhudsone@blogger.com │ Female │ 8.59.7.0 │ Japan │ 12/20/1989 │ Nurse Practicioner │
│ 2016-02-03 00:44:01.000000000 │ 16 │ Bruce │ Willis │ bwillisf@bluehost.com │ Male │ 239.182.219.189 │ Brazil │ │ │
│ 2016-02-03 00:57:45.000000000 │ 17 │ Emily │ Andrews │ eandrewsg@cornell.edu │ Female │ 29.231.180.172 │ Russia │ 4/13/1990 │ Food Chemist │
│ 2016-02-03 16:44:24.000000000 │ 18 │ Stephen │ Wallace │ swallaceh@netvibes.com │ Male │ 152.49.213.62 │ Ukraine │ 1/15/1978 │ Account Representative I │
│ 2016-02-03 11:45:54.000000000 │ 19 │ Clarence │ Lawson │ clawsoni@vkontakte.ru │ Male │ 107.175.15.152 │ Russia │ │ │
│ 2016-02-03 10:30:36.000000000 │ 20 │ Rebecca │ Bell │ rbellj@bandcamp.com │ Female │ 172.215.104.127 │ China │ │ │
└───────────────────────────────┴────┴────────────┴───────────┴──────────────────────────┴────────┴─────────────────┴────────────────────────┴────────────┴──────────────────────────────┘
Sample connection scripts
We tested these scripts in the following environments:
Ubuntu 20.04
Python: 3.8.10, pip: 20.0.2
These instructions show you how to create a script that performs all the actions needed to import your Parquet file with a single terminal command, while retaining complete control over the process.
In short, you need to complete the following steps:
Install the necessary software.
Write a script to upload your data to the cluster.
Install the ClickHouse® driver and all the necessary dependencies to process .parquet files:
pip install clickhouse-driver pandas fastparquet
Write the script to upload your data to the cluster.
parquet_uploader_test.py
import pandas as pd
from clickhouse_driver import Client

# Connect to your ClickHouse® cluster over the secure native protocol
CH_Client = Client(
    host='<FQDN of any ClickHouse® host on your cluster>',
    port=9440,
    user='<Cluster user name>',
    password='<Cluster user password>',
    secure=True)

# Read the Parquet file into a pandas DataFrame
# (pandas picks up the fastparquet engine installed earlier)
df = pd.read_parquet('<path to your .parquet file>')

# Insert the DataFrame rows into the table created earlier.
# The native driver sends the data itself, so the query uses
# a VALUES clause rather than FORMAT Parquet.
CH_Client.execute(
    'INSERT INTO default.parquet_test_data VALUES',
    df.to_dict(orient='records')
)
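You can then perform the whole import with a single terminal command: python3 parquet_uploader_test.py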