Upload CSV to your Managed ClickHouse® cluster

What these instructions are about

This article describes how to upload CSV files to your Managed ClickHouse® cluster with:

  • GUI-based IDEs (JetBrains DataGrip and DBeaver)
  • clickhouse-client
  • an Amazon S3 URL
  • sample connection scripts in Python and Go

Before you start, download the CSV file with sample data. You can use it to follow the examples in this article.
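
For example, you can fetch it from the terminal, assuming the sample file is the one hosted in the doublecloud-docs bucket referenced later in this article:

curl -O https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/sample_data.csv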

Upload with GUI-based IDEs

We tested these scenarios in the following environments:

  • Windows 10 Enterprise 1909:

    • JetBrains DataGrip: 2021.2
    • DBeaver Community: 21.2.0

JetBrains DataGrip natively supports importing CSV files into ClickHouse®. You can accomplish the whole process without any third-party plugins or extensions:

  1. In the Database pane on the left of the main workspace window, right-click the database schema where you want to upload your CSV table.

  2. Click Import data from file...

  3. Locate the file you want to upload and click OK.

  4. In the Import file window, DataGrip configures the import settings automatically. Review them, adjust if necessary, and click Import.

  5. Your table will appear in the Tables folder.
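
  To confirm the upload, you can run a query in DataGrip's query console. A minimal check, assuming the new table inherited the name of the sample file:

    SELECT * FROM sample_data LIMIT 10;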

DBeaver Database Manager natively supports importing CSV files into ClickHouse®. You can accomplish the whole process without any third-party plugins or extensions:

  1. Right-click the name of the ClickHouse® database where you want to import data.

  2. Choose Import Data from the menu. You will see a Data transfer wizard.

  3. Make sure to select CSV at the Import source step and click Next.

  4. Locate the CSV file you want to import and click Open.

  5. Make sure the filename is correct and click Next.

  6. Ensure that the database name in the Target container field is correct and click Next.

  7. Default data load settings are sufficient for most cases. Click Next.

    Review the settings. To change any of them, return to the corresponding step by clicking its line in the wizard's left pane.

  8. Click Proceed to import the CSV table.
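
  To confirm the import, open a SQL editor on the target database and run a quick check. A minimal sketch, assuming the table kept the name of the imported file:

    SELECT count(*) FROM sample_data;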

clickhouse-client

We tested these instructions in the following environments:

  • Ubuntu 20.04 LTS:
    • Bash: 5.1.16
    • clickhouse-client: 22.7.3.5
    • Docker: 20.10.12

Before you start

  1. If you haven't already, create a Managed ClickHouse® cluster.

  2. Download the CSV file with sample data.

  3. Install the ClickHouse® client.

Upload your data

Uploading a CSV file with the native clickhouse-client application is a sequence of manual steps that gives you complete control over the process, at the cost of automation.

In short, you need to complete the following steps:

  • Connect to your cluster.

  • Create a table in the cluster that follows the CSV file structure you want to import.

  • Disconnect from the cluster.

  • Use shell commands to upload the contents of your CSV file into your Managed ClickHouse® cluster (a consolidated script sketch appears at the end of this section).

Let's follow every step in detail.

  1. Connect to your cluster with a connection string:

    1. Go to the Clusters page in the console.

    2. Select the name of the cluster you want to connect to.

    3. Under Connection strings, find the Native interface string and click Copy.

    4. Open your terminal and run a command to connect to your cluster.

      If you use the ClickHouse® client installed locally, run the Native interface string as-is:

      <Native interface connection string>
      
      If you use Docker, prepend the Docker invocation to the connection string:

      docker run --network host --rm -it clickhouse/<Native interface connection string>
      
      The complete Docker command structure
      docker run --network host --rm -it \
                  clickhouse/clickhouse-client \
                  --host <FQDN of your cluster> \
                  --secure \
                  --user <cluster user name> \
                  --password <cluster user password> \
                  --port 9440
      
  2. Create a table to import your CSV data to.

    The table must have the same structure as your CSV file. For this scenario, create a simple table containing performance statistics for a group of sales representatives. Let's call it region_sales:

    CREATE TABLE region_sales ON CLUSTER default
    (
       `order_date` Date, 
       `city` String,
       `district` String,
       `borough` String, 
       `rep` String, 
       `item` String, 
       `sold` Int, 
       `price` Float, 
       `total` Float
    ) 
    ENGINE = ReplicatedMergeTree() 
    ORDER BY (order_date)
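
    To verify that the table was created, you can run, for example:

    SHOW CREATE TABLE region_sales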
    
  3. Type exit to close the client session. You need access to your local files for the next step.

  4. Import the data from the CSV file you downloaded earlier into the table on your cluster. You need to specify the path to your CSV file, connect to the cluster, and perform an INSERT query. To do this, run one of the following commands in your terminal:

    If you use the ClickHouse® client installed locally:

    cat <path to sample_data.csv> | <Native interface connection string> \
    --query="INSERT INTO region_sales FORMAT CSV"
    
    If you use Docker:

    cat <path to sample_data.csv> | docker run --network host --rm -i \
    clickhouse/<Native interface connection string> \
    --query="INSERT INTO region_sales FORMAT CSV"
    
    The complete Docker command structure
    cat <path to sample_data.csv> | docker run --network host --rm -i \
                clickhouse/clickhouse-client \
                --host <FQDN of your cluster> \
                --secure \
                --user <cluster user name> \
                --password <cluster user password> \
                --port 9440 \
                --query="INSERT INTO region_sales FORMAT CSV"
    

    If the import succeeds, the command finishes without any terminal output.
    
  5. Check your imported data:

    If you use the ClickHouse® client installed locally:

    <Native interface connection string> --query="SELECT * FROM region_sales LIMIT 10"
    
    If you use Docker:

    docker run --network host --rm -it \
    clickhouse/<Native interface connection string> \
    --query="SELECT * FROM region_sales LIMIT 10"
    
    The complete Docker command structure
    docker run --network host --rm -it \
                clickhouse/clickhouse-client \
                --host <FQDN of your cluster> \
                --secure \
                --user <cluster user name> \
                --password <cluster user password> \
                --port 9440 \
                --query="SELECT * FROM region_sales LIMIT 10"
    

    You should see the data from your CSV file in the terminal output:

    2075-01-23	Night City	City Center	Corporate Plaza	Kivell	Binder	50	19.99	999.5
    2075-02-26	Night City	City Center	Corporate Plaza	Gill	Pen	27	19.99	539.73
    2075-03-15	Night City	Heywood	Japantown	Sorvino	Pencil	56	2.99	167.44
    2075-04-01	Night City	Watson	Kabuki	Jones	Binder	60	4.99	299.4
    2075-04-18	Night City	City Center	Corporate Plaza	Andrews	Pencil	75	1.99	149.25
    2075-05-05	Night City	City Center	Corporate Plaza	Jardine	Pencil	90	4.99	449.1
    2075-05-22	Night City	Heywood	Japantown	Thompson	Pencil	32	1.99	63.68
    2075-06-01	Night City	Watson	Kabuki	Jones	Pencil	95	1.99	189.05
    2075-06-08	Night City	Watson	Kabuki	Jones	Binder	60	8.99	539.4
    2075-06-25	Night City	City Center	Corporate Plaza	Morgan	Pencil	90	4.99	449.1
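
To repeat the upload without retyping each step, you can wrap the import into a small shell script. A minimal sketch, assuming clickhouse-client is installed locally and the placeholders are replaced with your cluster's connection details:

#!/usr/bin/env bash
set -euo pipefail

CSV_FILE="<path to sample_data.csv>"

# Stream the CSV file into the table created above
clickhouse-client \
    --host <FQDN of your cluster> \
    --secure \
    --user <cluster user name> \
    --password <cluster user password> \
    --port 9440 \
    --query="INSERT INTO region_sales FORMAT CSV" < "${CSV_FILE}"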
    

Amazon S3 URL

Uploading a CSV file from an Amazon S3 bucket gives you extra flexibility with DoubleCloud Managed Service for ClickHouse®. Like the previous scenario, it's a sequence of manual steps that gives you complete control over the process, at the cost of automation.

In short, you need to complete the following steps:

  • Connect to your cluster.
  • Create a table in the cluster that follows the CSV file structure you want to import.
  • Use the bash script to upload the contents of your CSV file from an Amazon S3 bucket into your Managed ClickHouse® cluster.

Before you start this scenario, familiarize yourself with the virtual-hosted-style URLs used by Amazon S3:

https://<Bucket-name>.s3.<Region>.amazonaws.com/<Key-name>
  • Region is the identifier of the AWS region.
  • Bucket-name is the name of the bucket that contains the file.
  • Key-name is the key of the object you want to access, including any folder prefix.

For example, there's a bucket named doublecloud-docs located in Frankfurt (eu-central-1), and we want to access the sample_data.csv file in the /data-sets folder. The URL will be as follows:

https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/sample_data.csv
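
You can check that the file is publicly reachable before importing it, for example:

curl -I https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/sample_data.csv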

Now let's follow every step in detail.

  1. Install the ClickHouse® client.

  2. Connect to your cluster with a connection string:

    1. Go to the Clusters page in the console.

    2. Select the name of the cluster you want to connect to.

    3. Under Connection strings, find the Native interface string and click Copy.

    4. Open your terminal and connect to the cluster using the ClickHouse® client.

  3. Create a table to import your CSV data to. The table must have the same structure as your CSV file. For this tutorial, create a simple table containing performance statistics for a group of sales representatives. Call it nc_sales:

    CREATE TABLE nc_sales ON CLUSTER default
    (
       `order_date` Date, 
       `city` String,
       `district` String,
       `borough` String, 
       `rep` String, 
       `item` String, 
       `sold` Int, 
       `price` Float, 
       `total` Float
    ) 
    ENGINE = ReplicatedMergeTree() 
    ORDER BY (order_date)
    
  4. Insert the data from your CSV file located in an Amazon S3 bucket.

    To do this, use the s3 table function of ClickHouse®. Its arguments look like this:

    s3('<URL>', 'CSV', '<structure>')
    
    • <URL> is the path to your file.
    • CSV is the format of the file you want to import.
    • <structure> is the structure of the target table.
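
    Before importing, you can preview the remote file with a plain SELECT built on the same s3 function arguments, for example:

    SELECT * FROM s3
    (
       'https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/sample_data.csv',
       'CSV',
       'order_date Date, city String, district String, borough String, rep String, item String, sold Int, price Float, total Float'
    )
    LIMIT 5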

    The following example wraps a SELECT from the s3 function in an INSERT query; the <URL> argument is the S3 URL of the CSV file:

    INSERT INTO nc_sales 
    SELECT * FROM s3
    (
       'https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/sample_data.csv', 
       'CSV',
       'order_date Date, city String, district String, borough String, rep String, item String, sold Int, price Float, total Float'
    )
    
  5. Use the following query to check your imported data:

    SELECT * FROM nc_sales LIMIT 20
    

    You should see the data from your CSV file in the terminal output:

    ┌─order_date─┬─city───────┬─district────┬─borough─────────┬─rep──────┬─item────┬─sold─┬─price─┬───total─┐
    │ 2075-01-23 │ Night City │ City Center │ Corporate Plaza │ Kivell   │ Binder  │   50 │ 19.99 │   999.5 │
    │ 2075-02-26 │ Night City │ City Center │ Corporate Plaza │ Gill     │ Pen     │   27 │ 19.99 │  539.73 │
    │ 2075-03-15 │ Night City │ Heywood     │ Japantown       │ Sorvino  │ Pencil  │   56 │  2.99 │  167.44 │
    │ 2075-04-01 │ Night City │ Watson      │ Kabuki          │ Jones    │ Binder  │   60 │  4.99 │   299.4 │
    │ 2075-04-18 │ Night City │ City Center │ Corporate Plaza │ Andrews  │ Pencil  │   75 │  1.99 │  149.25 │
    │ 2075-05-05 │ Night City │ City Center │ Corporate Plaza │ Jardine  │ Pencil  │   90 │  4.99 │   449.1 │
    │ 2075-05-22 │ Night City │ Heywood     │ Japantown       │ Thompson │ Pencil  │   32 │  1.99 │   63.68 │
    │ 2075-06-01 │ Night City │ Watson      │ Kabuki          │ Jones    │ Pencil  │   95 │  1.99 │  189.05 │
    │ 2075-06-08 │ Night City │ Watson      │ Kabuki          │ Jones    │ Binder  │   60 │  8.99 │   539.4 │
    │ 2075-06-25 │ Night City │ City Center │ Corporate Plaza │ Morgan   │ Pencil  │   90 │  4.99 │   449.1 │
    │ 2075-07-12 │ Night City │ Watson      │ Kabuki          │ Howard   │ Binder  │   29 │  1.99 │   57.71 │
    │ 2075-07-29 │ Night City │ Watson      │ Kabuki          │ Parent   │ Binder  │   81 │ 19.99 │ 1619.19 │
    │ 2075-08-11 │ Night City │ Watson      │ Kabuki          │ Parent   │ Pen     │   15 │ 19.99 │  299.85 │
    │ 2075-08-15 │ Night City │ Watson      │ Kabuki          │ Jones    │ Pencil  │   35 │  4.99 │  174.65 │
    │ 2075-09-01 │ Night City │ City Center │ Corporate Plaza │ Smith    │ Desk    │    2 │   125 │     250 │
    │ 2075-09-02 │ Night City │ City Center │ Corporate Plaza │ Jardine  │ Pencil  │   36 │  4.99 │  179.64 │
    │ 2075-09-18 │ Night City │ Watson      │ Kabuki          │ Jones    │ Pen Set │   16 │ 15.99 │  255.84 │
    │ 2075-10-05 │ Night City │ City Center │ Corporate Plaza │ Morgan   │ Binder  │   28 │  8.99 │  251.72 │
    │ 2075-10-22 │ Night City │ Watson      │ Kabuki          │ Jones    │ Pen     │   64 │  8.99 │  575.36 │
    │ 2075-11-25 │ Night City │ City Center │ Corporate Plaza │ Kivell   │ Pen Set │   96 │  4.99 │  479.04 │
    └────────────┴────────────┴─────────────┴─────────────────┴──────────┴─────────┴──────┴───────┴─────────┘
    

Sample connection scripts

We tested these scripts in the following environments:

  • Ubuntu 20.04
    • Python 3: 3.10.6, pip: 22.0.2
    • Go: 1.13.8

Python

These instructions show you how to create a script that imports your CSV file with a single terminal command while retaining complete control over the process.

In this example, we'll use the clickhouse-driver package. However, our Managed Service for ClickHouse® is compatible with multiple Python solutions, for example, clickhouse-connect. For more information on how to use it, see the ClickHouse® documentation.
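
A minimal connectivity sketch with clickhouse-connect, assuming you installed it with pip install clickhouse-connect and that your cluster accepts HTTPS connections on port 8443:

import clickhouse_connect

# Connect over the HTTPS interface and run a smoke-test query
client = clickhouse_connect.get_client(
    host='<FQDN of any ClickHouse® host on your cluster>',
    port=8443,
    username='<Cluster user name>',
    password='<Cluster user password>',
    secure=True)

print(client.command('SELECT version()'))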

In short, you need to complete the following steps:

  • Install the necessary software.
  • Write a script to upload your data to the cluster.
  • Upload the data to your cluster.

Let's follow every step in detail.

  1. Install the software and dependencies:

    sudo apt update && sudo apt install -y python3 python3-pip
    
  2. Install the ClickHouse® driver for Python:

    pip3 install clickhouse-driver
    
  3. Write the script to upload your data to the cluster. For the purpose of this scenario, create a simple table that will contain performance statistics for a group of sales representatives. Let's call it RegionSales:

    csv_uploader_test.py

    import csv
    from datetime import datetime
    
    # Declare the Client interface class from the ClickHouse® driver
    from clickhouse_driver import Client
    
    # Provide the necessary information to connect to your cluster
    CH_Client = Client(
        '<FQDN of any ClickHouse® host on your cluster>',
        user='<Cluster user name>',
        password='<Cluster user password>',
        secure=True)
    
    # Create a table on your cluster
    CH_Client.execute(
        'CREATE TABLE RegionSales ON CLUSTER default ('
        '`OrderDate` Date, '
        '`City` String, '
        '`District` String, '
        '`Borough` String, '
        '`Rep` String, '
        '`Item` String, '
        '`Sold` Int, '
        '`Price` Float, '
        '`Total` Float) ENGINE = ReplicatedMergeTree() '
        'PARTITION BY (City) ORDER BY (OrderDate)')
    
    
    # Create a generator that parses CSV rows and converts each value
    # to the type of the matching table column
    def row_reader():
        with open('<Path to your CSV file>') as csv_file:
            for line in csv.reader(csv_file):
                yield [
                    datetime.strptime(line[0], '%Y-%m-%d').date(),  # OrderDate
                    line[1],  # City
                    line[2],  # District
                    line[3],  # Borough
                    line[4],  # Rep
                    line[5],  # Item
                    int(line[6]),    # Sold
                    float(line[7]),  # Price
                    float(line[8]),  # Total
                ]
    
    # Use the generator to load values into the table on your cluster
    CH_Client.execute("INSERT INTO RegionSales VALUES", row_reader())
    
    
  4. Run the script to upload the data to your cluster:

    python3 csv_uploader_test.py
    
  5. Check your imported data:

    <Native interface connection string> --query="SELECT * FROM RegionSales"
    

    You should see the data from your CSV file in the terminal output.

Go

These instructions show you how to create a script that imports your CSV file with a single terminal command while retaining complete control over the process.

In short, you need to complete the following steps:

  • Install the necessary software.
  • Write a script to upload your data to the cluster.
  • Upload the data to your cluster.

Let's follow every step in detail.

  1. Install the necessary software and dependencies:

    1. Install the Go language interpreter:

      sudo apt update && sudo apt install -y golang-go
      
    2. Install Git:

      sudo apt update && sudo apt install -y git
      
    3. Install clickhouse-go and the necessary packages:

      go get -v -u github.com/ClickHouse/clickhouse-go/v2 \
                   github.com/google/uuid \
                   github.com/pierrec/lz4/v4
      
  2. Write the script to upload your data to the cluster. Unlike the previous scenarios, this example works with a simplified two-column table called test_table, so it expects a CSV file whose rows contain a string followed by an integer value:

    csv_uploader_test.go

    package main
    
    import (
       "database/sql"
       "encoding/csv"
       "fmt"
       _ "github.com/ClickHouse/clickhouse-go/v2"
       "io"
       "log"
       "os"
       "strconv"
    )
    
    func main() {
       host := "<FQDN of any ClickHouse® host on your cluster>"
       user := "<Cluster user name>"
       password := "<Cluster user password>"
    
       dsn := fmt.Sprintf("clickhouse://%s:%s@%s:9440?secure=true", user, password, host)
       client, err := sql.Open("clickhouse", dsn)
       if err != nil {
          log.Fatal(err)
       }
       if err := client.Ping(); err != nil {
          panic(err)
       }
    
       if _, err := client.Exec(`CREATE TABLE IF NOT EXISTS test_table ON CLUSTER '{cluster}' (
          name String,
          value Int
          ) ENGINE = ReplicatedMergeTree() PARTITION BY (name) ORDER BY (value)`); err != nil {
          panic(err)
       }
    
       scope, err := client.Begin()
       if err != nil {
          panic(err)
       }
       batch, err := scope.Prepare("INSERT INTO test_table")
       if err != nil {
          panic(err)
       }
    
       // Stream rows from the CSV reader and append each one to the batch
       for row := range readCSV() {
          if _, err := batch.Exec(row...); err != nil {
             panic(err)
          }
       }
    
       if err := scope.Commit(); err != nil {
          panic(err)
       }
    }
    
    func readCSV() chan []interface{} {
       c := make(chan []interface{})
    
       go func() {
          file, err := os.Open("<FILENAME>")
          if err != nil {
             panic(err)
          }
          defer file.Close()
    
          csvReader := csv.NewReader(file)
          csvReader.TrimLeadingSpace = true
          for {
             record, err := csvReader.Read()
             if err == io.EOF {
                break
             }
             if err != nil {
                panic(err)
             }
    
             // This example expects an integer value in the second CSV column
             value, err := strconv.Atoi(record[1])
             if err != nil {
                panic(err)
             }
    
             c <- []interface{}{
                record[0],
                int32(value),
             }
          }
    
          close(c)
       }()
    
       return c
    }
    
  3. Upload the data to your cluster:

    go run csv_uploader_test.go
    
  4. Check your imported data:

    If you use the ClickHouse® client installed locally:

    <Native interface connection string> --query="SELECT * FROM test_table"
    
    If you use Docker:

    docker run --network host --rm -it clickhouse/<Native interface connection string> \
    --query="SELECT * FROM test_table"
    
    The complete Docker command structure
    docker run --network host --rm -it \
                clickhouse/clickhouse-client \
                --host <FQDN of your cluster> \
                --secure \
                --user <cluster user name> \
                --password <cluster user password> \
                --port 9440 \
                --query="SELECT * FROM test_table"
    

    You should see the data from your CSV file in the terminal output.