Upload CSV to your Managed ClickHouse® cluster

What these instructions are about

This article describes how to upload CSV files to your Managed ClickHouse® cluster with GUI-based IDEs, the clickhouse-client console client, an Amazon S3 URL, or sample Python and Go connection scripts.

Before you start, download the CSV file with sample data. You can use it to follow the examples in this article.

Upload with GUI-based IDEs

We tested these scenarios in the following environments:

  • Windows 10 Enterprise 1909:

    • JetBrains DataGrip: 2021.2
    • DBeaver Community: 21.2.0

JetBrains DataGrip natively supports importing CSV files into ClickHouse®. You can accomplish the whole process without any third-party plugins or extensions:

  1. In the Database pane on the left of the main workspace window, right-click the database schema where you want to upload your CSV table.

  2. Click Import data from file...

  3. Locate the file you want to upload and click OK.

  4. In the Import file window, DataGrip fills in the import settings automatically. Review them, adjust if necessary, and click Import.

  5. Your table will appear in the Tables folder.

DBeaver Database Manager natively supports importing CSV files into ClickHouse®. You can accomplish the whole process without any third-party plugins or extensions:

  1. Right-click the name of the ClickHouse® database where you want to import data.

  2. Choose Import Data from the menu. You will see a Data transfer wizard.

  3. Make sure to select CSV at the Import source step and click Next.

  4. Locate the CSV file you want to import and click Open.

  5. Make sure the filename is correct and click Next.

  6. Ensure that the database name in the Target container field is correct and click Next.

  7. Default data load settings are sufficient for most cases. Click Next.

    Look at the settings. If you need to change any of them, return to any previous step by clicking the corresponding line in the wizard's left pane.

  8. Click Proceed to import the CSV table.

clickhouse-client

We tested these instructions in the following environments:

  • Ubuntu 20.04 LTS:
    • Bash: 5.1.16
    • clickhouse-client: 22.7.3.5
    • Docker: 20.10.12

Before you start

  1. If you haven't created a cluster yet, create a Managed ClickHouse® cluster.

  2. Download the CSV file with sample data.

  3. Install the clickhouse-client:

    Docker

    1. Open your terminal.

    2. (Optional) Start Docker if needed:

      sudo service docker start
      
    3. Pull the clickhouse-client Docker image:

      docker pull clickhouse/clickhouse-client
      
    DEB-based distributions (Ubuntu, Debian)

    1. Open your terminal.

    2. Connect to the ClickHouse® official DEB repository from your Linux system:

      sudo apt update && sudo apt install -y apt-transport-https ca-certificates dirmngr && \
      sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 8919F6BD2B48D754 && \
      echo "deb https://packages.clickhouse.com/deb stable main" | sudo tee \
      /etc/apt/sources.list.d/clickhouse.list
      
    3. Refresh the package list and install the clickhouse-client:

      sudo apt update && sudo apt install -y clickhouse-client
      
    RPM-based distributions (CentOS, RHEL)

    1. Open your terminal.

    2. Connect to a ClickHouse® official RPM repository from your Linux system:

      sudo yum install -y yum-utils
      sudo yum-config-manager --add-repo https://packages.clickhouse.com/rpm/clickhouse.repo
      
    3. Install the clickhouse-client:

      sudo yum install -y clickhouse-client
      

    Warning

    If you run a RedHat 7-based Linux distribution, including CentOS 7, Oracle Linux 7, and others, you need to download and install trusted certificates, then manually add the path to them in the clickhouse-client configuration file as follows:

    1. Install the root certificate:

      curl https://letsencrypt.org/certs/isrg-root-x2-cross-signed.pem | sudo tee \
      /etc/pki/ca-trust/source/anchors/isrg-root-x2-cross-signed.pem
      
    2. Install the intermediate certificate:

      curl https://letsencrypt.org/certs/lets-encrypt-r3-cross-signed.pem | sudo tee \
      /etc/pki/ca-trust/source/anchors/lets-encrypt-r3-cross-signed.pem
      
    3. Update the list of trusted certificates:

      sudo update-ca-trust
      
    4. Locate your clickhouse-client configuration file (by default, you can find it at /etc/clickhouse-client/config.xml) and add the path to the certificates into the <openSSL> section:

      <client> <!-- Used for connection to the server's secure TCP port -->
         <loadDefaultCAFile>true</loadDefaultCAFile>
         <cacheSessions>true</cacheSessions>
         <disableProtocols>sslv2,sslv3</disableProtocols>
         <preferServerCiphers>true</preferServerCiphers>
         <caConfig>/etc/ssl/certs/ca-bundle.crt</caConfig>
         <!-- Use for self-signed: <verificationMode>none</verificationMode> -->
         <invalidCertificateHandler>
            <!-- Use for self-signed: <name>AcceptCertificateHandler</name> -->
            <name>RejectCertificateHandler</name>
         </invalidCertificateHandler>
      </client>
      

Upload your data

Uploading a CSV file with the native clickhouse-client application is a sequence of manual steps. This gives you complete control over every stage of the process, without any automation.

In short, you need to complete the following steps:

  • Connect to your cluster.

  • Create a table in the cluster that follows the CSV file structure you want to import.

  • Disconnect from the cluster.

  • Run a shell command to upload the contents of your CSV file into your Managed ClickHouse® cluster.

Let's follow every step in detail.

  1. Connect to your cluster with a connection string:

    1. Go to the Clusters overview page.

    2. Select the name of the cluster you want to connect to.

    3. Under Connection strings, find the Native interface string and click Copy.

    4. Open your Linux terminal and run a command to connect to your cluster:

      <Native interface connection string>
      
      The complete structure of the Native interface connection string
      clickhouse-client --host <FQDN of the host> \
                        --secure \
                        --user <Cluster user name> \
                        --password <Cluster user password> \
                        --port 9440 
      
      docker run --network host --rm -it \
      clickhouse/<Native interface connection string>
      
      The complete Docker command structure
      docker run --network host --rm -it \
                  clickhouse/clickhouse-client \
                  --host <FQDN of your cluster> \
                  --secure \
                  --user admin \
                  --password <Cluster user password> \
                  --port 9440 
      
  2. Create a table to import your CSV data to.

    Please note that the table must have the same structure as your CSV file. For this scenario, create a simple table containing performance statistics for a group of sales representatives. Let's call it region_sales:

    CREATE TABLE region_sales
    (
       `order_date` Date, 
       `city` String,
       `district` String,
       `borough` String, 
       `rep` String, 
       `item` String, 
       `sold` Int, 
       `price` Float, 
       `total` Float
    ) 
    ENGINE = MergeTree() 
    ORDER BY (order_date)
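
    The table must mirror the CSV columns one-to-one. A quick way to sanity-check the width of your file before creating the table is to parse one row with Python's csv module; the sample row below is copied from the dataset output shown later in this article:

    ```python
    import csv
    import io

    # One data row from sample_data.csv, as printed later in this article
    sample_row = "2075-01-23,Night City,City Center,Corporate Plaza,Kivell,Binder,50,19.99,999.5"

    # Column names of the region_sales table defined above
    columns = ["order_date", "city", "district", "borough",
               "rep", "item", "sold", "price", "total"]

    row = next(csv.reader(io.StringIO(sample_row)))
    assert len(row) == len(columns), "CSV width must match the table definition"
    print(dict(zip(columns, row)))
    ```

    If the assertion fails, adjust either the table definition or the file before importing.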
    
  3. Exit the cluster command interface by typing exit to access local files in the next step.

  4. Import the data from the CSV file you downloaded earlier into the table on your cluster. You need to specify the path to your CSV file, connect to the cluster and perform an INSERT query. To do this, type the following sequence of commands in your Linux terminal:

    cat <path to sample_data.csv> | <Native Interface connection string> \
    --query="INSERT INTO region_sales FORMAT CSV"
    
    cat <path to sample_data.csv> | docker run --network host --rm -i \
    clickhouse/<Native interface connection string> \
    --query="INSERT INTO region_sales FORMAT CSV"
    
    The complete Docker command structure
    docker run --network host --rm -i \
                clickhouse/clickhouse-client \
                --host <FQDN of your cluster> \
                --secure \
                --user admin \
                --password <Cluster user password> \
                --port 9440 
    

    If you fetch the file over the network (for example, piping it from curl instead of cat), you'll see transfer statistics similar to the following; with a local file, a successful import completes without output:

      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100  3189  100  3189    0     0   7532      0 --:--:-- --:--:-- --:--:--  7556
    
  5. Check your imported data:

    <Native interface connection string> --query="SELECT * FROM region_sales LIMIT 10"
    
    docker run --network host --rm \
    clickhouse/<Native interface connection string> \
    --query="SELECT * FROM region_sales LIMIT 10"
    
    The complete Docker command structure
    docker run --network host --rm -i \
                clickhouse/clickhouse-client \
                --host <FQDN of your cluster> \
                --secure \
                --user admin \
                --password <Cluster user password> \
                --port 9440 
    

    You should see the data from your CSV file in the terminal output:

    2075-01-23	Night City	City Center	Corporate Plaza	Kivell	Binder	50	19.99	999.5
    2075-02-26	Night City	City Center	Corporate Plaza	Gill	Pen	27	19.99	539.73
    2075-03-15	Night City	Heywood	Japantown	Sorvino	Pencil	56	2.99	167.44
    2075-04-01	Night City	Watson	Kabuki	Jones	Binder	60	4.99	299.4
    2075-04-18	Night City	City Center	Corporate Plaza	Andrews	Pencil	75	1.99	149.25
    2075-05-05	Night City	City Center	Corporate Plaza	Jardine	Pencil	90	4.99	449.1
    2075-05-22	Night City	Heywood	Japantown	Thompson	Pencil	32	1.99	63.68
    2075-06-01	Night City	Watson	Kabuki	Jones	Pencil	95	1.99	189.05
    2075-06-08	Night City	Watson	Kabuki	Jones	Binder	60	8.99	539.4
    2075-06-25	Night City	City Center	Corporate Plaza	Morgan	Pencil	90	4.99	449.1
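
    As an extra integrity check, the total column of each row should equal sold multiplied by price. A minimal Python sketch using the first output row above:

    ```python
    # First row of the output above, tab-separated as printed by clickhouse-client
    row = ("2075-01-23\tNight City\tCity Center\tCorporate Plaza"
           "\tKivell\tBinder\t50\t19.99\t999.5").split("\t")

    sold, price, total = int(row[6]), float(row[7]), float(row[8])
    # Allow a small tolerance for floating-point rounding
    assert abs(sold * price - total) < 0.01
    print("total matches sold * price")
    ```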
    

Amazon S3 URL

Uploading a CSV file from an Amazon S3 bucket gives you extra flexibility with DoubleCloud Managed Service for ClickHouse®. Like the previous scenario, it's a sequence of manual steps that gives you complete control over the process.

In short, you need to complete the following steps:

  • Connect to your cluster.
  • Create a table in the cluster that follows the CSV file structure you want to import.
  • Run an INSERT query to upload the contents of your CSV file from an Amazon S3 bucket into your Managed ClickHouse® cluster.

Before you start this scenario, it's essential to familiarize yourself with the virtual-hosted style URLs used by Amazon S3:

https://<Bucket_name>.s3.<Region>.amazonaws.com/<Key_name>
  • <Bucket_name> is the name of the bucket that contains the file.
  • <Region> is the identifier of the AWS region.
  • <Key_name> is the path to the file inside the bucket.

For example, there's a bucket named doublecloud-docs located in Frankfurt (eu-central-1), and we want to access the sample_data.csv file in the data-sets folder. The URL will be as follows:

https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/sample_data.csv
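
If you script such uploads, it's safer to assemble the URL from its parts than to hard-code it. A minimal Python sketch using the bucket, region, and key from the example above:

```python
# Components of the example URL above
bucket = "doublecloud-docs"
region = "eu-central-1"
key = "data-sets/sample_data.csv"

# Assemble the S3 object URL
url = f"https://{bucket}.s3.{region}.amazonaws.com/{key}"
print(url)
# https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/sample_data.csv
```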

Now let's follow every step in detail.

  1. Connect to your cluster with a connection string:

    1. Go to the Clusters overview page.

    2. Under Connection strings, find the Native interface string and click Copy.

    3. Open your Linux terminal and run a command to connect to your cluster:

      <Native interface connection string>
      
      The complete structure of the Native interface connection string
      clickhouse-client --host <FQDN of the host> \
                        --secure \
                        --user <Cluster user name> \
                        --password <Cluster user password> \
                        --port 9440 
      
      docker run --network host --rm -it \
      clickhouse/<Native interface connection string>
      
      The complete Docker command structure
      docker run --network host --rm -it \
                  clickhouse/clickhouse-client \
                  --host <FQDN of your cluster> \
                  --secure \
                  --user admin \
                  --password <Cluster user password> \
                  --port 9440 
      
  2. Create a table to import your CSV data to. Please note that the table must have the same structure as your CSV file. For this scenario, create a simple table containing performance statistics for a group of sales representatives. Let's call it nc_sales:

    CREATE TABLE nc_sales
    (
       `order_date` Date, 
       `city` String,
       `district` String,
       `borough` String, 
       `rep` String, 
       `item` String, 
       `sold` Int, 
       `price` Float, 
       `total` Float
    ) 
    ENGINE = MergeTree() 
    ORDER BY (order_date)
    
  3. Insert the data from your CSV file located in the Amazon S3 bucket.

    To do this, use the s3 table function of ClickHouse®. Its arguments look like this:

    s3('<URL>', 'CSV', '<structure>')
    
    • <URL> is the path to your file.
    • CSV is the format of the file you want to import.
    • <structure> is the structure of the target table.

    Pay attention to the following code example: it incorporates both the full URL of the CSV file and the INSERT query:

    INSERT INTO nc_sales 
    SELECT * FROM s3
    (
       'https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/sample_data.csv', 
       'CSV',
       'order_date Date, city String, district String, borough String, rep String, item String, sold Int, price Float, total Float'
    )
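
    If you build such queries programmatically, the structure argument can be generated from a column list instead of typed by hand. A minimal Python sketch using the nc_sales columns and URL from the example above:

    ```python
    # Columns of the nc_sales table and their ClickHouse types
    columns = [
        ("order_date", "Date"), ("city", "String"), ("district", "String"),
        ("borough", "String"), ("rep", "String"), ("item", "String"),
        ("sold", "Int"), ("price", "Float"), ("total", "Float"),
    ]

    url = "https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/sample_data.csv"

    # Assemble the structure string and the full INSERT query
    structure = ", ".join(f"{name} {ch_type}" for name, ch_type in columns)
    query = f"INSERT INTO nc_sales SELECT * FROM s3('{url}', 'CSV', '{structure}')"
    print(query)
    ```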
    
  4. Use the following query to check your imported data:

    SELECT * FROM nc_sales LIMIT 20
    

    You should see the data from your CSV file in the terminal output:

    ┌─order_date─┬─city───────┬─district────┬─borough─────────┬─rep──────┬─item────┬─sold─┬─price─┬───total─┐
    │ 2075-01-23 │ Night City │ City Center │ Corporate Plaza │ Kivell   │ Binder  │   50 │ 19.99 │   999.5 │
    │ 2075-02-26 │ Night City │ City Center │ Corporate Plaza │ Gill     │ Pen     │   27 │ 19.99 │  539.73 │
    │ 2075-03-15 │ Night City │ Heywood     │ Japantown       │ Sorvino  │ Pencil  │   56 │  2.99 │  167.44 │
    │ 2075-04-01 │ Night City │ Watson      │ Kabuki          │ Jones    │ Binder  │   60 │  4.99 │   299.4 │
    │ 2075-04-18 │ Night City │ City Center │ Corporate Plaza │ Andrews  │ Pencil  │   75 │  1.99 │  149.25 │
    │ 2075-05-05 │ Night City │ City Center │ Corporate Plaza │ Jardine  │ Pencil  │   90 │  4.99 │   449.1 │
    │ 2075-05-22 │ Night City │ Heywood     │ Japantown       │ Thompson │ Pencil  │   32 │  1.99 │   63.68 │
    │ 2075-06-01 │ Night City │ Watson      │ Kabuki          │ Jones    │ Pencil  │   95 │  1.99 │  189.05 │
    │ 2075-06-08 │ Night City │ Watson      │ Kabuki          │ Jones    │ Binder  │   60 │  8.99 │   539.4 │
    │ 2075-06-25 │ Night City │ City Center │ Corporate Plaza │ Morgan   │ Pencil  │   90 │  4.99 │   449.1 │
    │ 2075-07-12 │ Night City │ Watson      │ Kabuki          │ Howard   │ Binder  │   29 │  1.99 │   57.71 │
    │ 2075-07-29 │ Night City │ Watson      │ Kabuki          │ Parent   │ Binder  │   81 │ 19.99 │ 1619.19 │
    │ 2075-08-11 │ Night City │ Watson      │ Kabuki          │ Parent   │ Pen     │   15 │ 19.99 │  299.85 │
    │ 2075-08-15 │ Night City │ Watson      │ Kabuki          │ Jones    │ Pencil  │   35 │  4.99 │  174.65 │
    │ 2075-09-01 │ Night City │ City Center │ Corporate Plaza │ Smith    │ Desk    │    2 │   125 │     250 │
    │ 2075-09-02 │ Night City │ City Center │ Corporate Plaza │ Jardine  │ Pencil  │   36 │  4.99 │  179.64 │
    │ 2075-09-18 │ Night City │ Watson      │ Kabuki          │ Jones    │ Pen Set │   16 │ 15.99 │  255.84 │
    │ 2075-10-05 │ Night City │ City Center │ Corporate Plaza │ Morgan   │ Binder  │   28 │  8.99 │  251.72 │
    │ 2075-10-22 │ Night City │ Watson      │ Kabuki          │ Jones    │ Pen     │   64 │  8.99 │  575.36 │
    │ 2075-11-25 │ Night City │ City Center │ Corporate Plaza │ Kivell   │ Pen Set │   96 │  4.99 │  479.04 │
    └────────────┴────────────┴─────────────┴─────────────────┴──────────┴─────────┴──────┴───────┴─────────┘
    

Sample connection scripts

We tested these scripts in the following environments:

  • Ubuntu 20.04
    • Python: 3.8.10, pip: 20.0.2
    • Go: 1.13.8

Python

These instructions show you how to create a script that performs all the actions needed to import your CSV file with a single terminal command, while you retain complete control over the process.

In short, you need to complete the following steps:

  • Install the necessary software.
  • Write a script to upload your data to the cluster.
  • Upload the data to your cluster.

Let's follow every step in detail.

  1. Install the software and dependencies:

    sudo apt update && sudo apt install -y python3 python3-pip
    
  2. Install the ClickHouse® driver for Python:

    pip3 install clickhouse-driver
    
  3. Write the script to upload your data to the cluster. For the purpose of this scenario, create a simple table that will contain performance statistics for a group of sales representatives. Let's call it RegionSales:

    csv_uploader_test.py

    import csv
    from datetime import datetime
    
    # Declare the Client interface class from the ClickHouse® driver
    from clickhouse_driver import Client
    
    # Provide the necessary information to connect to your cluster
    CH_Client = Client(
        '<FQDN of any ClickHouse® host on your cluster>',
        user='<Cluster user name>',
        password='<Cluster user password>',
        secure=True)
    
    # Create a table on your cluster
    CH_Client.execute(
        'CREATE TABLE RegionSales ('
        '`OrderDate` Date, '
        '`City` String, '
        '`District` String, '
        '`Borough` String, '
        '`Rep` String, '
        '`Item` String, '
        '`Sold` Int, '
        '`Price` Float, '
        '`Total` Float) ENGINE = MergeTree() '
        'PARTITION BY (City) ORDER BY (OrderDate)')
    
    
    # Create a generator that parses CSV rows and casts each value
    # to the type of the corresponding RegionSales column
    def row_reader():
        with open('<Path to your CSV file>') as csv_file:
            for row in csv.reader(csv_file):
                yield [
                    # Adjust the date format if your CSV differs
                    datetime.strptime(row[0], '%Y-%m-%d').date(),
                    row[1], row[2], row[3], row[4], row[5],
                    int(row[6]), float(row[7]), float(row[8]),
                ]
    
    # Use the generator to stream values into the table on your cluster
    CH_Client.execute('INSERT INTO RegionSales VALUES', row_reader())
    
    
  4. Run the script to upload the data to your cluster:

    python3 csv_uploader_test.py
    
  5. Check your imported data:

    <Native interface connection string> --query="SELECT * FROM RegionSales"
    

    You should see the data from your CSV file in the terminal output.

Go

These instructions show you how to create a script that performs all the actions needed to import your CSV file with a single terminal command, while you retain complete control over the process.

In short, you need to complete the following steps:

  • Install the necessary software.
  • Write a script to upload your data to the cluster.
  • Upload the data to your cluster.

Let's follow every step in detail.

  1. Install the necessary software and dependencies:

    1. Install the Go language interpreter:

      sudo apt update && sudo apt install -y golang-go
      
    2. Install Git:

      sudo apt update && sudo apt install -y git
      
    3. Install clickhouse-go and the necessary packages:

      go get -v -u github.com/ClickHouse/clickhouse-go/v2 \
                   github.com/google/uuid \
                   github.com/pierrec/lz4/v4
      
  2. Write the script to upload your data to the cluster. To keep the example short, the script creates a simple two-column table. Let's call it test_table:

    csv_uploader_test.go

    package main
    
    import (
       "database/sql"
       "encoding/csv"
       "fmt"
       _ "github.com/ClickHouse/clickhouse-go/v2"
       "io"
       "log"
       "os"
       "strconv"
    )
    
    func main() {
       host := "<FQDN of any ClickHouse® host on your cluster>"
       user := "<Cluster user name>"
       password := "<Cluster user password>"
    
   dsn := fmt.Sprintf("clickhouse://%s:%s@%s:9440?secure=true", user, password, host)
       client, err := sql.Open("clickhouse", dsn)
       if err != nil {
          log.Fatal(err)
       }
       if err := client.Ping(); err != nil {
          panic(err)
       }
    
       if _, err := client.Exec(`CREATE TABLE IF NOT EXISTS test_table ON CLUSTER '{cluster}' (
          name String,
          value Int
          ) ENGINE = ReplicatedMergeTree() PARTITION BY (name) ORDER BY (value)`); err != nil {
          panic(err)
       }
    
       scope, err := client.Begin()
       if err != nil {
          panic(err)
       }
       batch, err := scope.Prepare("INSERT INTO test_table")
       if err != nil {
          panic(err)
       }
    
   // Stream rows from the CSV reader into the batch
   for row := range readCSV() {
      if _, err := batch.Exec(row...); err != nil {
         panic(err)
      }
   }
    
       if err := scope.Commit(); err != nil {
          panic(err)
       }
    }
    
    func readCSV() chan []interface{} {
       c := make(chan []interface{})
    
       go func() {
          file, err := os.Open("<FILENAME>")
          if err != nil {
             panic(err)
          }
          defer file.Close()
    
      csvReader := csv.NewReader(file)
      csvReader.TrimLeadingSpace = true
      for {
         record, err := csvReader.Read()
         if err == io.EOF {
            break
         }
         if err != nil {
            panic(err)
         }

         // test_table expects a string and an integer per row
         value, err := strconv.Atoi(record[1])
         if err != nil {
            panic(err)
         }

         c <- []interface{}{
            record[0],
            int32(value),
         }
      }
    
          close(c)
       }()
    
       return c
    }
    
  3. Upload the data to your cluster:

    go run csv_uploader_test.go
    
  4. Check your imported data:

    <Native interface connection string> --query="SELECT * FROM test_table"
    
    The complete structure of the Native interface connection string
    clickhouse-client --host <FQDN of the host> \
                      --secure \
                      --user <Cluster user name> \
                      --password <Cluster user password> \
                      --port 9440 
    

    You should see the data from your CSV file in the terminal output.