Upload Parquet data to your Managed ClickHouse® cluster

What these instructions are about

This article shows you how to upload Parquet files to your Managed ClickHouse® cluster from the command line within DoubleCloud.

Before you start, download the Parquet file with sample data. You can use it to follow the examples in this article.

We tested these instructions in the following environments:

  • Ubuntu 20.04 LTS:
    • Bash: 5.1.16
    • clickhouse-client: 22.7.3.5
    • Docker: 20.10.12

clickhouse-client

Before you start

  1. Download the Parquet file with sample data.

  2. Install the clickhouse-client:

    Docker

    1. Open your terminal.

    2. (Optional) Start the Docker daemon if it isn't already running:

      service docker start
      
    3. Pull the clickhouse-client Docker image:

      docker pull clickhouse/clickhouse-client
      
    DEB

    1. Open your terminal.

    2. Add the official ClickHouse® DEB repository to your Linux system:

      sudo apt update && sudo apt install -y apt-transport-https ca-certificates dirmngr && \
      sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 8919F6BD2B48D754 && \
      echo "deb https://packages.clickhouse.com/deb stable main" | sudo tee \
      /etc/apt/sources.list.d/clickhouse.list
      
    3. Refresh the package list and install the clickhouse-client:

      sudo apt update && sudo apt install -y clickhouse-client
      
    RPM

    1. Open your terminal.

    2. Add the official ClickHouse® RPM repository to your Linux system:

      sudo yum install -y yum-utils
      sudo yum-config-manager --add-repo https://packages.clickhouse.com/rpm/clickhouse.repo
      
    3. Install the clickhouse-client:

      sudo yum install -y clickhouse-client
      

    Warning

    If you run a Red Hat 7-based Linux distribution, such as CentOS 7 or Oracle Linux 7, you need to download and install trusted certificates, then manually add the path to them in the clickhouse-client configuration file as follows:

    1. Install the root certificate:

      sudo curl https://letsencrypt.org/certs/isrg-root-x2-cross-signed.pem \
      -o /etc/pki/ca-trust/source/anchors/isrg-root-x2-cross-signed.pem
      
    2. Install the intermediate certificate:

      sudo curl https://letsencrypt.org/certs/lets-encrypt-r3-cross-signed.pem \
      -o /etc/pki/ca-trust/source/anchors/lets-encrypt-r3-cross-signed.pem
      
    3. Update the list of trusted certificates:

      sudo update-ca-trust
      
    4. Locate your clickhouse-client configuration file (by default, you can find it at /etc/clickhouse-client/config.xml) and add the path to the certificates into the <openSSL> section:

      <client> <!-- Used for connection to server's secure tcp port -->
         <loadDefaultCAFile>true</loadDefaultCAFile>
         <cacheSessions>true</cacheSessions>
         <disableProtocols>sslv2,sslv3</disableProtocols>
         <preferServerCiphers>true</preferServerCiphers>
         <caConfig>/etc/ssl/certs/ca-bundle.crt</caConfig>
         <!-- Use for self-signed: <verificationMode>none</verificationMode> -->
         <invalidCertificateHandler>
            <!-- Use for self-signed: <name>AcceptCertificateHandler</name> -->
            <name>RejectCertificateHandler</name>
         </invalidCertificateHandler>
      </client>
      
  3. Connect to your cluster with a connection string:

    1. Go to the Clusters overview page.

    2. Under Connection strings, find the Native interface string and click Copy.

    3. Open your Linux terminal and run a command to connect to your cluster:

      <Native interface connection string>
      
      The complete structure of the Native interface connection string
      clickhouse-client --host <FQDN of the host> \
                        --secure \
                        --user <Cluster user name> \
                        --password <Cluster user password> \
                        --port 9440 
      
      docker run --network host --rm -it \
      clickhouse/<Native interface connection string>
      
      The complete Docker command structure
      docker run --network host --rm -it \
                  clickhouse/clickhouse-client \
                  --host <FQDN of your cluster> \
                  --secure \
                  --user admin \
                  --password <Cluster user password> \
                  --port 9440 
      
  4. Create a table to import your Parquet data to.

    Please note that the table must have the same structure as your Parquet file. For this scenario, create a simple table containing performance statistics for a group of sales representatives. Let's call it parquet_test_data:

    CREATE TABLE parquet_test_data
    (
       `registration_date` String, 
       `id` Int,
       `first_name` String,
       `last_name` String, 
       `email` String, 
       `gender` String, 
       `ip_address` String, 
       `country` String,
       `birthdate` String,
       `title` String
    )
    ENGINE = MergeTree() 
    ORDER BY (id)
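    If you prefer not to type the statement at the interactive prompt, the same DDL can be passed through the client's --query flag. A minimal sketch, with the clickhouse-client invocation left commented out because it needs a live cluster and your own connection values:

```shell
# Sketch: store the DDL in a variable and pass it via --query.
# IF NOT EXISTS is an addition here so the sketch is safe to re-run.
DDL='CREATE TABLE IF NOT EXISTS default.parquet_test_data
(
   `registration_date` String,
   `id` Int,
   `first_name` String,
   `last_name` String,
   `email` String,
   `gender` String,
   `ip_address` String,
   `country` String,
   `birthdate` String,
   `title` String
)
ENGINE = MergeTree()
ORDER BY (id)'

# Uncomment and substitute your cluster details to run it:
# clickhouse-client --host <FQDN of the host> --secure \
#                   --user admin --password <Cluster user password> \
#                   --port 9440 --query="$DDL"
printf '%s\n' "$DDL" | head -n 1
```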
    
  5. Exit the cluster command-line interface by typing exit. You'll need access to your local files in the next step.

Upload your data

  1. Import the data from your Parquet file into the table on your cluster.

    You need to specify the path to your Parquet file, connect to the cluster, and perform an INSERT query. Type the following in your Linux terminal:

    cat <path to your Parquet file> | <Native Interface connection string> \
    --query="INSERT INTO default.parquet_test_data FORMAT Parquet"
    
    cat <path to your Parquet file> | docker run --network host --rm -i \
    clickhouse/<Native interface connection string> \
    --query="INSERT INTO default.parquet_test_data FORMAT Parquet"
    
    The complete Docker command structure
    docker run --network host --rm -i \
                clickhouse/clickhouse-client \
                --host <FQDN of your cluster> \
                --secure \
                --user admin \
                --password <Cluster user password> \
                --port 9440 
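Putting the pipeline together with the native client instead of Docker, a fully expanded command might look like the following sketch. All angle-bracket values are placeholders, and the actual invocation is commented out because it needs a live cluster:

```shell
# Sketch: assemble the import pipeline. Substitute your own
# file path and connection values before running.
PARQUET_FILE="<path to your Parquet file>"
QUERY="INSERT INTO default.parquet_test_data FORMAT Parquet"

# Uncomment to run against a live cluster:
# cat "$PARQUET_FILE" | clickhouse-client \
#     --host <FQDN of the host> --secure \
#     --user admin --password <Cluster user password> \
#     --port 9440 --query="$QUERY"
echo "$QUERY"
```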
    
  2. Check your imported data:

    <Native interface connection string> --query="SELECT * FROM parquet_test_data LIMIT 20"
    
    docker run --network host --rm -it \
    clickhouse/<Native interface connection string> \
    --query="SELECT * FROM parquet_test_data LIMIT 20"
    
    The complete Docker command structure
    docker run --network host --rm -it \
                clickhouse/clickhouse-client \
                --host <FQDN of your cluster> \
                --secure \
                --user admin \
                --password <Cluster user password> \
                --port 9440 
    

    You should see the data from your Parquet file in the terminal output:

    ┌─registration_date─────────────┬─id─┬─first_name─┬─last_name─┬─email────────────────────┬─gender─┬─ip_address──────┬─country────────────────┬─birthdate──┬─title────────────────────────┐
    │ 2016-02-03 07:55:29.000000000 │  1 │ Amanda     │ Jordan    │ ajordan0@com.com         │ Female │ 1.197.201.2     │ Indonesia              │ 3/8/1971   │ Internal Auditor             │
    │ 2016-02-03 17:04:03.000000000 │  2 │ Albert     │ Freeman   │ afreeman1@is.gd          │ Male   │ 218.111.175.34  │ Canada                 │ 1/16/1968  │ Accountant IV                │
    │ 2016-02-03 01:09:31.000000000 │  3 │ Evelyn     │ Morgan    │ emorgan2@altervista.org  │ Female │ 7.161.136.94    │ Russia                 │ 2/1/1960   │ Structural Engineer          │
    │ 2016-02-03 00:36:21.000000000 │  4 │ Denise     │ Riley     │ driley3@gmpg.org         │ Female │ 140.35.109.83   │ China                  │ 4/8/1997   │ Senior Cost Accountant       │
    │ 2016-02-03 05:05:31.000000000 │  5 │ Carlos     │ Burns     │ cburns4@miitbeian.gov.cn │        │ 169.113.235.40  │ South Africa           │            │                              │
    │ 2016-02-03 07:22:34.000000000 │  6 │ Kathryn    │ White     │ kwhite5@google.com       │ Female │ 195.131.81.179  │ Indonesia              │ 2/25/1983  │ Account Executive            │
    │ 2016-02-03 08:33:08.000000000 │  7 │ Samuel     │ Holmes    │ sholmes6@foxnews.com     │ Male   │ 232.234.81.197  │ Portugal               │ 12/18/1987 │ Senior Financial Analyst     │
    │ 2016-02-03 06:47:06.000000000 │  8 │ Harry      │ Howell    │ hhowell7@eepurl.com      │ Male   │ 91.235.51.73    │ Bosnia and Herzegovina │ 3/1/1962   │ Web Developer IV             │
    │ 2016-02-03 03:52:53.000000000 │  9 │ Jose       │ Foster    │ jfoster8@yelp.com        │ Male   │ 132.31.53.61    │ South Korea            │ 3/27/1992  │ Software Test Engineer I     │
    │ 2016-02-03 18:29:47.000000000 │ 10 │ Emily      │ Stewart   │ estewart9@opensource.org │ Female │ 143.28.251.245  │ Nigeria                │ 1/28/1997  │ Health Coach IV              │
    │ 2016-02-03 00:10:42.000000000 │ 11 │ Susan      │ Perkins   │ sperkinsa@patch.com      │ Female │ 180.85.0.62     │ Russia                 │            │                              │
    │ 2016-02-03 18:04:34.000000000 │ 12 │ Alice      │ Berry     │ aberryb@wikipedia.org    │ Female │ 246.225.12.189  │ China                  │ 8/12/1968  │ Quality Engineer             │
    │ 2016-02-03 18:48:17.000000000 │ 13 │ Justin     │ Berry     │ jberryc@usatoday.com     │ Male   │ 157.7.146.43    │ Zambia                 │ 8/15/1975  │ Structural Analysis Engineer │
    │ 2016-02-03 21:46:52.000000000 │ 14 │ Kathy      │ Reynolds  │ kreynoldsd@redcross.org  │ Female │ 81.254.172.13   │ Bosnia and Herzegovina │ 6/27/1970  │ Librarian                    │
    │ 2016-02-03 08:53:23.000000000 │ 15 │ Dorothy    │ Hudson    │ dhudsone@blogger.com     │ Female │ 8.59.7.0        │ Japan                  │ 12/20/1989 │ Nurse Practicioner           │
    │ 2016-02-03 00:44:01.000000000 │ 16 │ Bruce      │ Willis    │ bwillisf@bluehost.com    │ Male   │ 239.182.219.189 │ Brazil                 │            │                              │
    │ 2016-02-03 00:57:45.000000000 │ 17 │ Emily      │ Andrews   │ eandrewsg@cornell.edu    │ Female │ 29.231.180.172  │ Russia                 │ 4/13/1990  │ Food Chemist                 │
    │ 2016-02-03 16:44:24.000000000 │ 18 │ Stephen    │ Wallace   │ swallaceh@netvibes.com   │ Male   │ 152.49.213.62   │ Ukraine                │ 1/15/1978  │ Account Representative I     │
    │ 2016-02-03 11:45:54.000000000 │ 19 │ Clarence   │ Lawson    │ clawsoni@vkontakte.ru    │ Male   │ 107.175.15.152  │ Russia                 │            │                              │
    │ 2016-02-03 10:30:36.000000000 │ 20 │ Rebecca    │ Bell      │ rbellj@bandcamp.com      │ Female │ 172.215.104.127 │ China                  │            │                              │
    └───────────────────────────────┴────┴────────────┴───────────┴──────────────────────────┴────────┴─────────────────┴────────────────────────┴────────────┴──────────────────────────────┘
    
    

Amazon S3 URL

Uploading a Parquet file from an Amazon S3 bucket gives you extra flexibility with DoubleCloud Managed Service for ClickHouse®. It's a sequence of manual steps that gives you complete control over the process but lacks automation.

In short, you need to complete the following steps:

  • Connect to your cluster.

  • Create a table in the cluster that matches the structure of the Parquet file you want to import.

  • Use the s3 table function to upload the contents of your Parquet file from an Amazon S3 bucket into your Managed ClickHouse® cluster.

Before you start, it's essential to familiarize yourself with the virtual-hosted-style URLs used by Amazon S3:

https://<Bucket-name>.s3.<Region>.amazonaws.com/<Key-name>

  • Bucket-name is the name of the bucket that contains the file.

  • Region is the identifier of the AWS region.

  • Key-name is the name of the file you want to link to.

For example, there's a bucket named doublecloud-docs located in Frankfurt (eu-central-1), and we want to access the userdata1.parquet file in the data-sets folder. The resulting URL will be as follows:

https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/userdata1.parquet
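The mapping from bucket, region, and key to the final URL is easy to check in the shell. A small sketch that rebuilds the example URL above from its components:

```shell
# Rebuild the example S3 URL from its components.
BUCKET="doublecloud-docs"
REGION="eu-central-1"
KEY="data-sets/userdata1.parquet"

URL="https://${BUCKET}.s3.${REGION}.amazonaws.com/${KEY}"
echo "$URL"
```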

Now let's follow every step in detail.

  1. Connect to your cluster with a connection string:

    1. Go to the Clusters overview page.

    2. Under Connection strings, find the Native interface string and click Copy.

    3. Open your Linux terminal and run a command to connect to your cluster:

      <Native interface connection string>
      
      The complete structure of the Native interface connection string
      clickhouse-client --host <FQDN of the host> \
                        --secure \
                        --user <Cluster user name> \
                        --password <Cluster user password> \
                        --port 9440 
      
      docker run --network host --rm -it \
      clickhouse/<Native interface connection string>
      
      The complete Docker command structure
      docker run --network host --rm -it \
                  clickhouse/clickhouse-client \
                  --host <FQDN of your cluster> \
                  --secure \
                  --user admin \
                  --password <Cluster user password> \
                  --port 9440 
      
  2. Create a table to import your Parquet data to.

    Please note that the table must have the same structure as your Parquet file. For this scenario, create a simple table containing performance statistics for a group of sales representatives. Let's call it parquet_test:

    CREATE TABLE parquet_test
    (
       `registration_date` String, 
       `id` Int,
       `first_name` String,
       `last_name` String, 
       `email` String, 
       `gender` String, 
       `ip_address` String, 
       `country` String,
       `birthdate` String,
       `title` String
    ) 
    ENGINE = MergeTree() 
    ORDER BY (id)
    
  3. Insert the data from your Parquet file located in the Amazon S3 bucket.

    To do this, you need to use the s3 table function of ClickHouse®. The syntax of this function looks like this:

    s3('<URL>', '<format>', '<structure>')
    
    • <URL> is the path to your file.
    • <format> is the format of the file you want to import, Parquet in this case.
    • <structure> is the structure of the target table.

    Pay attention to the s3 function call in the following code example: it incorporates the path to the Parquet file into the INSERT query:

    INSERT INTO parquet_test
    SELECT * FROM s3
    (
       'https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/userdata1.parquet', 
       'Parquet', 
       'registration_dttm String, id Int, first_name String, last_name String, email String, gender String, ip_address String, country String, birthdate String, title String'
    ) 
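    The third argument is just a comma-separated list of name Type pairs. When the column list is long, you can assemble it programmatically instead of typing it by hand; a small shell sketch using the columns from the example above:

```shell
# Build the s3() <structure> argument from "name Type" pairs.
structure=""
for col in "registration_dttm String" "id Int" "first_name String" \
           "last_name String" "email String" "gender String" \
           "ip_address String" "country String" "birthdate String" \
           "title String"; do
  # ${structure:+, } inserts a separator only after the first pair
  structure="${structure}${structure:+, }${col}"
done
echo "$structure"
```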
    
  4. Check your imported data:

    SELECT * FROM parquet_test LIMIT 20
    

    You should see the data from your Parquet file in the terminal output:

    ┌─registration_date─────────────┬─id─┬─first_name─┬─last_name─┬─email────────────────────┬─gender─┬─ip_address──────┬─country────────────────┬─birthdate──┬─title────────────────────────┐
    │ 2016-02-03 07:55:29.000000000 │  1 │ Amanda     │ Jordan    │ ajordan0@com.com         │ Female │ 1.197.201.2     │ Indonesia              │ 3/8/1971   │ Internal Auditor             │
    │ 2016-02-03 17:04:03.000000000 │  2 │ Albert     │ Freeman   │ afreeman1@is.gd          │ Male   │ 218.111.175.34  │ Canada                 │ 1/16/1968  │ Accountant IV                │
    │ 2016-02-03 01:09:31.000000000 │  3 │ Evelyn     │ Morgan    │ emorgan2@altervista.org  │ Female │ 7.161.136.94    │ Russia                 │ 2/1/1960   │ Structural Engineer          │
    │ 2016-02-03 00:36:21.000000000 │  4 │ Denise     │ Riley     │ driley3@gmpg.org         │ Female │ 140.35.109.83   │ China                  │ 4/8/1997   │ Senior Cost Accountant       │
    │ 2016-02-03 05:05:31.000000000 │  5 │ Carlos     │ Burns     │ cburns4@miitbeian.gov.cn │        │ 169.113.235.40  │ South Africa           │            │                              │
    │ 2016-02-03 07:22:34.000000000 │  6 │ Kathryn    │ White     │ kwhite5@google.com       │ Female │ 195.131.81.179  │ Indonesia              │ 2/25/1983  │ Account Executive            │
    │ 2016-02-03 08:33:08.000000000 │  7 │ Samuel     │ Holmes    │ sholmes6@foxnews.com     │ Male   │ 232.234.81.197  │ Portugal               │ 12/18/1987 │ Senior Financial Analyst     │
    │ 2016-02-03 06:47:06.000000000 │  8 │ Harry      │ Howell    │ hhowell7@eepurl.com      │ Male   │ 91.235.51.73    │ Bosnia and Herzegovina │ 3/1/1962   │ Web Developer IV             │
    │ 2016-02-03 03:52:53.000000000 │  9 │ Jose       │ Foster    │ jfoster8@yelp.com        │ Male   │ 132.31.53.61    │ South Korea            │ 3/27/1992  │ Software Test Engineer I     │
    │ 2016-02-03 18:29:47.000000000 │ 10 │ Emily      │ Stewart   │ estewart9@opensource.org │ Female │ 143.28.251.245  │ Nigeria                │ 1/28/1997  │ Health Coach IV              │
    │ 2016-02-03 00:10:42.000000000 │ 11 │ Susan      │ Perkins   │ sperkinsa@patch.com      │ Female │ 180.85.0.62     │ Russia                 │            │                              │
    │ 2016-02-03 18:04:34.000000000 │ 12 │ Alice      │ Berry     │ aberryb@wikipedia.org    │ Female │ 246.225.12.189  │ China                  │ 8/12/1968  │ Quality Engineer             │
    │ 2016-02-03 18:48:17.000000000 │ 13 │ Justin     │ Berry     │ jberryc@usatoday.com     │ Male   │ 157.7.146.43    │ Zambia                 │ 8/15/1975  │ Structural Analysis Engineer │
    │ 2016-02-03 21:46:52.000000000 │ 14 │ Kathy      │ Reynolds  │ kreynoldsd@redcross.org  │ Female │ 81.254.172.13   │ Bosnia and Herzegovina │ 6/27/1970  │ Librarian                    │
    │ 2016-02-03 08:53:23.000000000 │ 15 │ Dorothy    │ Hudson    │ dhudsone@blogger.com     │ Female │ 8.59.7.0        │ Japan                  │ 12/20/1989 │ Nurse Practicioner           │
    │ 2016-02-03 00:44:01.000000000 │ 16 │ Bruce      │ Willis    │ bwillisf@bluehost.com    │ Male   │ 239.182.219.189 │ Brazil                 │            │                              │
    │ 2016-02-03 00:57:45.000000000 │ 17 │ Emily      │ Andrews   │ eandrewsg@cornell.edu    │ Female │ 29.231.180.172  │ Russia                 │ 4/13/1990  │ Food Chemist                 │
    │ 2016-02-03 16:44:24.000000000 │ 18 │ Stephen    │ Wallace   │ swallaceh@netvibes.com   │ Male   │ 152.49.213.62   │ Ukraine                │ 1/15/1978  │ Account Representative I     │
    │ 2016-02-03 11:45:54.000000000 │ 19 │ Clarence   │ Lawson    │ clawsoni@vkontakte.ru    │ Male   │ 107.175.15.152  │ Russia                 │            │                              │
    │ 2016-02-03 10:30:36.000000000 │ 20 │ Rebecca    │ Bell      │ rbellj@bandcamp.com      │ Female │ 172.215.104.127 │ China                  │            │                              │
    └───────────────────────────────┴────┴────────────┴───────────┴──────────────────────────┴────────┴─────────────────┴────────────────────────┴────────────┴──────────────────────────────┘