Upload Parquet data to your Managed ClickHouse® cluster
What these instructions are about
This article shows you how to upload Parquet files to your Managed ClickHouse® cluster in DoubleCloud using command-line tools.
We tested these instructions in the following environments:
- Ubuntu 20.04 LTS:
  - Bash: 5.1.16
  - clickhouse-client: 22.7.3.5
  - Docker: 20.10.12
Before you start
- Download a sample Parquet file; the import examples below refer to it as sample_data.parquet.
- Install the ClickHouse® client.
- Connect to your cluster with a connection string:
  - Go to the Clusters page and select your cluster.
  - Under Connection strings, find the Native interface string and click Copy.
  - Open your terminal and run a command to connect to your cluster:

    Docker:

    docker run --network host --rm -it clickhouse/<Native interface connection string>

    The complete Docker command structure:

    docker run --network host --rm -it \
    clickhouse/clickhouse-client \
    --host <FQDN of your cluster> \
    --secure \
    --user <cluster user name> \
    --password <cluster user password> \
    --port 9440

    Native clickhouse-client:

    <Native interface connection string>
- Create a table to import your Parquet data to.

  The table must have the same structure as your Parquet file (if you're unsure what that structure is, see the inspection sketch after this procedure). For this tutorial, create a simple table for a sample set of user records. Call it parquet_test_data:

  CREATE TABLE parquet_test_data ON CLUSTER default
  (
      `registration_date` String,
      `id` Int,
      `first_name` String,
      `last_name` String,
      `email` String,
      `gender` String,
      `ip_address` String,
      `country` String,
      `birthdate` String,
      `title` String
  )
  ENGINE = ReplicatedMergeTree()
  ORDER BY (id)
- Exit the cluster command-line interface by typing exit, so that you can access local files in the next step.
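If you're not sure which columns your Parquet file contains, you can inspect it locally before writing the CREATE TABLE statement above. This is a minimal Python sketch using pandas; the file path is a placeholder for your downloaded sample:

import pandas as pd

# Print the column names and pandas dtypes of the Parquet file;
# use them to choose matching ClickHouse® column types.
df = pd.read_parquet('<path to your .parquet file>')
print(df.dtypes)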
Upload your data
- Import the data from your Parquet file into the table on your cluster.

  You need to specify the path to your Parquet file, connect to the cluster, and perform an INSERT query. Type the following in your terminal:

  Native clickhouse-client:

  cat <path to your Parquet file> | <Native interface connection string> \
  --query="INSERT INTO default.parquet_test_data FORMAT Parquet"

  Docker:

  cat <path to sample_data.parquet> | docker run --network host --rm -i clickhouse/<Native interface connection string> --query="INSERT INTO default.parquet_test_data FORMAT Parquet"

  The complete Docker command structure:

  cat <path to sample_data.parquet> | docker run --network host --rm -i \
  clickhouse/clickhouse-client \
  --host <FQDN of your cluster> \
  --secure \
  --user <your cluster username> \
  --password <cluster user password> \
  --port 9440 \
  --query="INSERT INTO default.parquet_test_data FORMAT Parquet"
- Check your imported data:

  Native clickhouse-client:

  <Native interface connection string> --query="SELECT * FROM parquet_test_data LIMIT 20"

  Docker:

  docker run --network host --rm -it clickhouse/<Native interface connection string> --query="SELECT * FROM parquet_test_data LIMIT 20"

  The complete Docker command structure:

  docker run --network host --rm -it \
  clickhouse/clickhouse-client \
  --host <FQDN of your cluster> \
  --secure \
  --user <your cluster username> \
  --password <cluster user password> \
  --port 9440 \
  --query="SELECT * FROM parquet_test_data LIMIT 20"

  You should see the data from your Parquet file in the terminal output:
┌─registration_date─────────────┬─id─┬─first_name─┬─last_name─┬─email────────────────────┬─gender─┬─ip_address──────┬─country────────────────┬─birthdate──┬─title────────────────────────┐
│ 2016-02-03 07:55:29.000000000 │  1 │ Amanda     │ Jordan    │ ajordan0@com.com         │ Female │ 1.197.201.2     │ Indonesia              │ 3/8/1971   │ Internal Auditor             │
│ 2016-02-03 17:04:03.000000000 │  2 │ Albert     │ Freeman   │ afreeman1@is.gd          │ Male   │ 218.111.175.34  │ Canada                 │ 1/16/1968  │ Accountant IV                │
│ 2016-02-03 01:09:31.000000000 │  3 │ Evelyn     │ Morgan    │ emorgan2@altervista.org  │ Female │ 7.161.136.94    │ Russia                 │ 2/1/1960   │ Structural Engineer          │
│ 2016-02-03 00:36:21.000000000 │  4 │ Denise     │ Riley     │ driley3@gmpg.org         │ Female │ 140.35.109.83   │ China                  │ 4/8/1997   │ Senior Cost Accountant       │
│ 2016-02-03 05:05:31.000000000 │  5 │ Carlos     │ Burns     │ cburns4@miitbeian.gov.cn │        │ 169.113.235.40  │ South Africa           │            │                              │
│ 2016-02-03 07:22:34.000000000 │  6 │ Kathryn    │ White     │ kwhite5@google.com       │ Female │ 195.131.81.179  │ Indonesia              │ 2/25/1983  │ Account Executive            │
│ 2016-02-03 08:33:08.000000000 │  7 │ Samuel     │ Holmes    │ sholmes6@foxnews.com     │ Male   │ 232.234.81.197  │ Portugal               │ 12/18/1987 │ Senior Financial Analyst     │
│ 2016-02-03 06:47:06.000000000 │  8 │ Harry      │ Howell    │ hhowell7@eepurl.com      │ Male   │ 91.235.51.73    │ Bosnia and Herzegovina │ 3/1/1962   │ Web Developer IV             │
│ 2016-02-03 03:52:53.000000000 │  9 │ Jose       │ Foster    │ jfoster8@yelp.com        │ Male   │ 132.31.53.61    │ South Korea            │ 3/27/1992  │ Software Test Engineer I     │
│ 2016-02-03 18:29:47.000000000 │ 10 │ Emily      │ Stewart   │ estewart9@opensource.org │ Female │ 143.28.251.245  │ Nigeria                │ 1/28/1997  │ Health Coach IV              │
│ 2016-02-03 00:10:42.000000000 │ 11 │ Susan      │ Perkins   │ sperkinsa@patch.com      │ Female │ 180.85.0.62     │ Russia                 │            │                              │
│ 2016-02-03 18:04:34.000000000 │ 12 │ Alice      │ Berry     │ aberryb@wikipedia.org    │ Female │ 246.225.12.189  │ China                  │ 8/12/1968  │ Quality Engineer             │
│ 2016-02-03 18:48:17.000000000 │ 13 │ Justin     │ Berry     │ jberryc@usatoday.com     │ Male   │ 157.7.146.43    │ Zambia                 │ 8/15/1975  │ Structural Analysis Engineer │
│ 2016-02-03 21:46:52.000000000 │ 14 │ Kathy      │ Reynolds  │ kreynoldsd@redcross.org  │ Female │ 81.254.172.13   │ Bosnia and Herzegovina │ 6/27/1970  │ Librarian                    │
│ 2016-02-03 08:53:23.000000000 │ 15 │ Dorothy    │ Hudson    │ dhudsone@blogger.com     │ Female │ 8.59.7.0        │ Japan                  │ 12/20/1989 │ Nurse Practicioner           │
│ 2016-02-03 00:44:01.000000000 │ 16 │ Bruce      │ Willis    │ bwillisf@bluehost.com    │ Male   │ 239.182.219.189 │ Brazil                 │            │                              │
│ 2016-02-03 00:57:45.000000000 │ 17 │ Emily      │ Andrews   │ eandrewsg@cornell.edu    │ Female │ 29.231.180.172  │ Russia                 │ 4/13/1990  │ Food Chemist                 │
│ 2016-02-03 16:44:24.000000000 │ 18 │ Stephen    │ Wallace   │ swallaceh@netvibes.com   │ Male   │ 152.49.213.62   │ Ukraine                │ 1/15/1978  │ Account Representative I     │
│ 2016-02-03 11:45:54.000000000 │ 19 │ Clarence   │ Lawson    │ clawsoni@vkontakte.ru    │ Male   │ 107.175.15.152  │ Russia                 │            │                              │
│ 2016-02-03 10:30:36.000000000 │ 20 │ Rebecca    │ Bell      │ rbellj@bandcamp.com      │ Female │ 172.215.104.127 │ China                  │            │                              │
└───────────────────────────────┴────┴────────────┴───────────┴──────────────────────────┴────────┴─────────────────┴────────────────────────┴────────────┴──────────────────────────────┘
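If you'd rather verify the import programmatically than through the CLI, you can run the same query with the clickhouse-driver Python package (installed in the Sample connection scripts section below). This is a minimal sketch; the host, user, and password placeholders are for you to fill in:

from clickhouse_driver import Client

# Connect over the secure native protocol, the same port 9440 the CLI examples use
client = Client(
    host='<FQDN of your cluster>',
    port=9440,
    user='<cluster user name>',
    password='<cluster user password>',
    secure=True,
)

# Fetch a sample of the imported rows
for row in client.execute('SELECT * FROM default.parquet_test_data LIMIT 20'):
    print(row)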
Upload data from an Amazon S3 bucket
Uploading a Parquet file from an Amazon S3 bucket gives you extra flexibility with DoubleCloud Managed Service for ClickHouse®. It's a sequence of manual steps that gives you complete control over the process but lacks automation.
In short, you need to complete the following steps:
- Connect to your cluster.
- Create a table in the cluster that follows the structure of the Parquet file you want to import.
- Use the s3 table function of ClickHouse® to insert the contents of your Parquet file from an Amazon S3 bucket into your Managed ClickHouse® cluster.
Before you start, it's essential to familiarize yourself with the virtual-hosted-style URLs used by Amazon S3:

https://<Bucket-name>.s3.<Region>.amazonaws.com/<Key-name>

- <Bucket-name> is the name of the bucket that contains the file.
- <Region> is the identifier of the AWS region.
- <Key-name> is the key of the file you want to link to.
For example, suppose there's a bucket named doublecloud-docs located in Frankfurt (eu-central-1), and we want to access the userdata1.parquet file in the data-sets folder. The URL will look as follows:
https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/userdata1.parquet
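If you assemble such URLs often, a few lines of Python can help you avoid typos. The build_s3_url helper below is a hypothetical convenience function, not part of any AWS SDK:

def build_s3_url(bucket: str, region: str, key: str) -> str:
    """Build a virtual-hosted-style Amazon S3 URL from its three parts."""
    return f'https://{bucket}.s3.{region}.amazonaws.com/{key}'

# Reproduces the example URL from this section
print(build_s3_url('doublecloud-docs', 'eu-central-1', 'data-sets/userdata1.parquet'))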
Now let's follow every step in detail.
- Connect to your cluster with a connection string:
  - Go to the Clusters page and select your cluster.
  - Under Connection strings, find the Native interface string and click Copy.
  - Open your terminal and run a command to connect to your cluster:

    Docker:

    docker run --network host --rm -it clickhouse/<Native interface connection string>

    The complete Docker command structure:

    docker run --network host --rm -it \
    clickhouse/clickhouse-client \
    --host <FQDN of your cluster> \
    --secure \
    --user <cluster user name> \
    --password <cluster user password> \
    --port 9440

    Native clickhouse-client:

    <Native interface connection string>
- Create a table to import your Parquet data to.

  The table must have the same structure as your Parquet file. For this tutorial, create a simple table for a sample set of user records. Call it parquet_test:

  CREATE TABLE parquet_test ON CLUSTER default
  (
      `registration_date` String,
      `id` Int,
      `first_name` String,
      `last_name` String,
      `email` String,
      `gender` String,
      `ip_address` String,
      `country` String,
      `birthdate` String,
      `title` String
  )
  ENGINE = ReplicatedMergeTree()
  ORDER BY (id)
- Insert the data from your Parquet file located in the Amazon S3 bucket.

  To do this, use the s3 table function of ClickHouse®. The syntax of this function looks like this:

  s3('<URL>', '<format>', '<structure>')

  - <URL> is the path to your file.
  - <format> is the format of the file you want to import, Parquet in this case.
  - <structure> is the structure of the target table.

  Pay attention to the URL argument in the following code example: it incorporates the path to the Parquet file, and the whole function call is wrapped in an INSERT query (a Python alternative is sketched after this procedure):

  INSERT INTO parquet_test
  SELECT * FROM s3
  (
      'https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/userdata1.parquet',
      'Parquet',
      'registration_dttm String, id Int, first_name String, last_name String, email String, gender String, ip_address String, country String, birthdate String, title String'
  )
- Check your imported data:

  SELECT * FROM parquet_test LIMIT 20
You should see the data from your Parquet file in the terminal output:
┌─registration_date─────────────┬─id─┬─first_name─┬─last_name─┬─email────────────────────┬─gender─┬─ip_address──────┬─country────────────────┬─birthdate──┬─title────────────────────────┐
│ 2016-02-03 07:55:29.000000000 │  1 │ Amanda     │ Jordan    │ ajordan0@com.com         │ Female │ 1.197.201.2     │ Indonesia              │ 3/8/1971   │ Internal Auditor             │
│ 2016-02-03 17:04:03.000000000 │  2 │ Albert     │ Freeman   │ afreeman1@is.gd          │ Male   │ 218.111.175.34  │ Canada                 │ 1/16/1968  │ Accountant IV                │
│ 2016-02-03 01:09:31.000000000 │  3 │ Evelyn     │ Morgan    │ emorgan2@altervista.org  │ Female │ 7.161.136.94    │ Russia                 │ 2/1/1960   │ Structural Engineer          │
│ 2016-02-03 00:36:21.000000000 │  4 │ Denise     │ Riley     │ driley3@gmpg.org         │ Female │ 140.35.109.83   │ China                  │ 4/8/1997   │ Senior Cost Accountant       │
│ 2016-02-03 05:05:31.000000000 │  5 │ Carlos     │ Burns     │ cburns4@miitbeian.gov.cn │        │ 169.113.235.40  │ South Africa           │            │                              │
│ 2016-02-03 07:22:34.000000000 │  6 │ Kathryn    │ White     │ kwhite5@google.com       │ Female │ 195.131.81.179  │ Indonesia              │ 2/25/1983  │ Account Executive            │
│ 2016-02-03 08:33:08.000000000 │  7 │ Samuel     │ Holmes    │ sholmes6@foxnews.com     │ Male   │ 232.234.81.197  │ Portugal               │ 12/18/1987 │ Senior Financial Analyst     │
│ 2016-02-03 06:47:06.000000000 │  8 │ Harry      │ Howell    │ hhowell7@eepurl.com      │ Male   │ 91.235.51.73    │ Bosnia and Herzegovina │ 3/1/1962   │ Web Developer IV             │
│ 2016-02-03 03:52:53.000000000 │  9 │ Jose       │ Foster    │ jfoster8@yelp.com        │ Male   │ 132.31.53.61    │ South Korea            │ 3/27/1992  │ Software Test Engineer I     │
│ 2016-02-03 18:29:47.000000000 │ 10 │ Emily      │ Stewart   │ estewart9@opensource.org │ Female │ 143.28.251.245  │ Nigeria                │ 1/28/1997  │ Health Coach IV              │
│ 2016-02-03 00:10:42.000000000 │ 11 │ Susan      │ Perkins   │ sperkinsa@patch.com      │ Female │ 180.85.0.62     │ Russia                 │            │                              │
│ 2016-02-03 18:04:34.000000000 │ 12 │ Alice      │ Berry     │ aberryb@wikipedia.org    │ Female │ 246.225.12.189  │ China                  │ 8/12/1968  │ Quality Engineer             │
│ 2016-02-03 18:48:17.000000000 │ 13 │ Justin     │ Berry     │ jberryc@usatoday.com     │ Male   │ 157.7.146.43    │ Zambia                 │ 8/15/1975  │ Structural Analysis Engineer │
│ 2016-02-03 21:46:52.000000000 │ 14 │ Kathy      │ Reynolds  │ kreynoldsd@redcross.org  │ Female │ 81.254.172.13   │ Bosnia and Herzegovina │ 6/27/1970  │ Librarian                    │
│ 2016-02-03 08:53:23.000000000 │ 15 │ Dorothy    │ Hudson    │ dhudsone@blogger.com     │ Female │ 8.59.7.0        │ Japan                  │ 12/20/1989 │ Nurse Practicioner           │
│ 2016-02-03 00:44:01.000000000 │ 16 │ Bruce      │ Willis    │ bwillisf@bluehost.com    │ Male   │ 239.182.219.189 │ Brazil                 │            │                              │
│ 2016-02-03 00:57:45.000000000 │ 17 │ Emily      │ Andrews   │ eandrewsg@cornell.edu    │ Female │ 29.231.180.172  │ Russia                 │ 4/13/1990  │ Food Chemist                 │
│ 2016-02-03 16:44:24.000000000 │ 18 │ Stephen    │ Wallace   │ swallaceh@netvibes.com   │ Male   │ 152.49.213.62   │ Ukraine                │ 1/15/1978  │ Account Representative I     │
│ 2016-02-03 11:45:54.000000000 │ 19 │ Clarence   │ Lawson    │ clawsoni@vkontakte.ru    │ Male   │ 107.175.15.152  │ Russia                 │            │                              │
│ 2016-02-03 10:30:36.000000000 │ 20 │ Rebecca    │ Bell      │ rbellj@bandcamp.com      │ Female │ 172.215.104.127 │ China                  │            │                              │
└───────────────────────────────┴────┴────────────┴───────────┴──────────────────────────┴────────┴─────────────────┴────────────────────────┴────────────┴──────────────────────────────┘
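You can also issue the same s3-based INSERT from Python instead of the interactive client, as mentioned in the insert step above. This is a minimal sketch using the clickhouse-driver package from the Sample connection scripts section below; the connection placeholders are for you to fill in:

from clickhouse_driver import Client

client = Client(
    host='<FQDN of your cluster>',
    port=9440,
    user='<cluster user name>',
    password='<cluster user password>',
    secure=True,
)

# Run the s3() INSERT shown above; the cluster fetches the file itself,
# so nothing streams through your local machine.
client.execute("""
    INSERT INTO parquet_test
    SELECT * FROM s3(
        'https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/userdata1.parquet',
        'Parquet',
        'registration_dttm String, id Int, first_name String, last_name String, email String, gender String, ip_address String, country String, birthdate String, title String'
    )
""")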
Sample connection scripts
We tested these scripts in the following environments:
- Ubuntu 20.04:
  - Python: 3.8.10, pip: 20.0.2
These instructions show you how to create a script that performs all the actions needed to import your Parquet file with a single terminal command, while retaining complete control over the process.
In short, you need to complete the following steps:
- Install the necessary software.
- Write a script to upload your data to the cluster.
- Upload the data to your cluster.
Let's follow every step in detail.
- Install the software and dependencies:

  sudo apt update && sudo apt install -y python3 python3-pip && pip3 install requests
- Install the ClickHouse® driver and the dependencies needed to process .parquet files:

  pip install clickhouse-driver pandas fastparquet
- Write the script to upload your data to the cluster:

  parquet_uploader_test.py

  import pandas as pd
  from clickhouse_driver import Client

  # Connect to your ClickHouse® cluster over the secure native protocol
  client = Client(
      host='<FQDN of any ClickHouse® host on your cluster>',
      port=9440,
      user='<Cluster user name>',
      password='<Cluster user password>',
      secure=True)

  # Read the Parquet file into a pandas DataFrame
  # (fastparquet, installed in the previous step, serves as the engine)
  df = pd.read_parquet('<path to your .parquet file>', engine='fastparquet')

  # Insert the rows into the table created earlier. clickhouse-driver
  # sends data over the native protocol, so the query uses VALUES rather
  # than FORMAT Parquet; the tuples must match the column order of the
  # target table.
  client.execute(
      'INSERT INTO default.parquet_test_data VALUES',
      list(df.itertuples(index=False, name=None))
  )
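Run the script from your terminal with python3 parquet_uploader_test.py. To confirm that the upload succeeded, you can reuse the client object from the script to count the inserted rows, for example:

# Should print [(n,)], where n is the number of rows in your Parquet file
print(client.execute('SELECT count() FROM default.parquet_test_data'))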