Data transformation is the process of converting a raw data source into a ready-to-use format and structure. It's crucial for any organization seeking to get the most out of its data and produce meaningful, usable business insights.
dbt (data build tool) allows developers and data scientists to write transformations as queries and organize them in a straightforward way. dbt reads your project's structure, compiles it into SQL, and runs it against your database.
Below, you’ll see how to use dbt with a variety of tools to create a working data transformation flow with a DoubleCloud Managed ClickHouse® cluster:
First, we get raw table data from your DoubleCloud Managed ClickHouse® cluster and send it through the raw data processing models to get it ready for transformation.
Then, we send the resulting prepared data through another set of transformation models to get the data in a format that we need.
Let's get right to it.
Configure your cluster and local environment
To prepare for this scenario, you'll need to perform the following steps:
Create a Managed ClickHouse® cluster
Install dbt on your system
Install the dbt-clickhouse plugin
Create the database and upload the raw data
Write a dbt profile to work with your Managed ClickHouse® cluster
First, make sure you have pip installed on your system:
sudo apt-get install python3-pip
Now let's make sure your setuptools and wheel packages are up to date:
pip install --upgrade setuptools wheel
The dbt toolkit has a modular structure, so you don't need to install the complete list of available packages. For now, you just need the dbt-core package:
pip install dbt-core
Good job!
The core dbt service is ready. Now we need to install the adapter that'll make your dbt installation compatible with ClickHouse®:
pip install dbt-clickhouse
Check if the core and the plugin work correctly:
dbt --version
You should get the following output:
Core:
- installed: 1.1.1
- latest: 1.1.1 - Up to date!
Plugins:
- clickhouse: 1.1.6 - Up to date!
Nice! It's time to prepare your database and upload the raw data.
Create the database and upload the raw data
First, install and then start the Docker service if needed:
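On Debian or Ubuntu (an assumption, consistent with the apt-get commands used above), installing and starting Docker could look like this:
sudo apt-get install docker.io   # install the Docker engine
sudo systemctl start docker      # start the Docker service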
Create the db_corner_shop database by running clickhouse-client in a Docker container. Use the value of the Native interface field from the Overview tab on your cluster information page as part of the Docker command:
The command has the following structure:
docker run --network host --rm -it clickhouse/<Native interface connection string> \
--query="CREATE DATABASE IF NOT EXISTS db_corner_shop"
The complete Docker command structure
docker run --network host --rm -it clickhouse/clickhouse-client \
    --host <FQDN of the host> \
    --secure \
    --user <Cluster user name> \
    --password <Cluster user password> \
    --port 9440 \
    --query="CREATE DATABASE IF NOT EXISTS db_corner_shop"
Connect to the database to upload the data. For this scenario, we'll use the GUI-based DBeaver Community database management application:
Open Database → New Database Connection.
Select ClickHouse® from the list of database types.
Click Next.
On the Main tab:
Host - paste the host's domain name (FQDN) from the Hosts tab on your cluster information page.
Port - 8443 by default.
Database/Schema - enter db_corner_shop.
Copy Username and Password for your cluster from the Overview tab on your cluster information page.
On the Driver Properties tab:
Find ssl on the list of properties and change its value to true.
Specify the socket_timeout property. The recommended value is 300000 (milliseconds) or higher. Enter a larger value if you're going to run longer queries.
Click Test Connection.... If the connection is successful, you'll see a window with the connection status, information on the DBMS, and the driver.
Click Finish.
Download the ZIP archive with the sample CSV data tables and extract the archive into a suitable directory.
Use DBeaver to upload the CSV tables to your database:
In the Database Navigator, right-click db_corner_shop → Tables and select Import Data.
In the Import source section, keep everything as-is and click Next.
In the Select input files dialog, choose all three tables (raw_customers, raw_orders, raw_payments) and click Open.
Click Next.
In the Tables mapping section, keep everything as-is and click Next.
In the Data load settings, keep everything with default values, and click Next.
In the Confirm section, ensure everything is correct.
Click Proceed.
Transferring the data will take some time.
After the process is complete, you'll see the imported tables in the Database Navigator.
Great job! We've got a database and a set of raw data to transform.
Create a dbt project to transform your data
Let's take a quick look at what we need to do next.
We've got three tables with raw data:
raw_customers
raw_orders
raw_payments
This dataset simulates the data from a small retail CRM system, and in its raw form it isn't something a BI solution can work with. To address that, we need to:
Parse the data and apply appropriate data types to it.
Use the resulting structured data to create the constructs suitable for most BI solutions to ingest and process.
To achieve that, we'll use dbt as our one-stop solution for both steps above.
Write a dbt profile to work with your Managed ClickHouse® cluster
Now let's make sure your dbt installation can connect to your Managed ClickHouse® cluster. Create a profile with all the necessary credentials:
Create the ~/.dbt/ folder and make the profiles.yml file inside it:
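The exact contents depend on your cluster; below is a minimal sketch consistent with the dbt debug output shown later in this tutorial. The top-level profile name must match the profile field in dbt_project.yml, and the placeholders follow the same conventions as the Docker command above:
~/.dbt/profiles.yml
corner_shop:
  target: dev
  outputs:
    dev:
      type: clickhouse # use the dbt-clickhouse adapter
      schema: db_corner_shop # database to run the models against
      host: <FQDN of the host> # from the Hosts tab on your cluster information page
      port: 9440 # secure native interface port
      user: <Cluster user name>
      password: <Cluster user password>
      secure: True # the cluster accepts only TLS connections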
Good. Now we can use this profile to build a new dbt project.
Create a dbt project structure
Let's start with creating a dbt project structure.
For this scenario, create the /corner_shop project folder. Let's make the simplest possible file tree:
corner_shop # the project folder
├─ models # folder with configurations and scripts for transforming prepared data
│ ├─ staging # folder with configuration and scripts for transforming raw data
│ │ ├─ schema.yml
│ │ ├─ stg_customers.sql
│ │ ├─ stg_orders.sql
│ │ └─ stg_payments.sql
│ ├─ customers.sql
│ ├─ orders.sql
│ └─ schema.yml
└─ dbt_project.yml # configuration for the whole project
Download the complete project structure
You can download the archive with all the files listed in the above project structure and unpack it to your /corner_shop folder.
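If you'd rather create the structure by hand, a couple of shell commands will do it (assuming a Unix-like shell):
mkdir -p corner_shop/models/staging  # creates the project, models, and staging folders in one go
cd corner_shop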
Now let's go through the project structure in detail.
Create the dbt_project.yml file
Create the base of the project - the dbt_project.yml file:
name: 'corner_shop' # project folder name
config-version: 2
version: '0.1'
profile: 'corner_shop' # the profile to connect to your cluster
model-paths: ["models"] # folder containing models
target-path: "target"
clean-targets: # these folders will be created by dbt to store service files
  - "target"
  - "dbt_modules"
  - "logs"
require-dbt-version: [">=1.0.0", "<2.0.0"]

models:
  corner_shop: # here we provide the necessary ClickHouse®-specific settings
    materialized: table
    staging:
      materialized: view
Create raw data processing models
Create the /models/staging folder.
Make the /models/staging/schema.yml configuration file:
/models/staging/schema.yml
version: 2

models:
  - name: stg_customers # the name of the .sql file with the model
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
  - name: stg_payments
    columns:
      - name: payment_id
        tests:
          - unique
          - not_null
      - name: payment_method
        tests:
          - accepted_values:
              values: ['credit_card', 'coupon', 'bank_transfer', 'gift_card']
Create the models for assigning appropriate data types to raw data:
stg_customers:
This model prepares the data from the raw_customers table for transformation:
We take the id column and rename it to the CRM-friendly customer_id.
The other two columns don't need any modifications, so we just transfer them as-is to the new table.
You can see the full listing of this model below:
/models/staging/stg_customers.sql
WITH source AS (
    SELECT * FROM db_corner_shop.raw_customers --path to raw data table on your cluster
),
renamed AS ( --renaming sequence for raw data columns
    SELECT
        id AS customer_id,
        first_name,
        last_name
    FROM source
)
SELECT * FROM renamed --writing the result into the stg_customers table
stg_orders:
This model prepares the data from the raw_orders table for transformation:
We take the id column and rename it to the CRM-friendly order_id.
The user_id column contains the same data as the id column from the raw_customers table, so we rename it to customer_id for consistency further down the pipeline.
The other two columns don't need any modifications, so we just transfer them as-is to the new table.
You can see the full listing of this model below:
/models/staging/stg_orders.sql
WITH source AS (
    SELECT * FROM db_corner_shop.raw_orders --path to raw data table on your cluster
),
renamed AS ( --renaming sequence for raw data columns
    SELECT
        id AS order_id,
        user_id AS customer_id,
        order_date,
        status
    FROM source
)
SELECT * FROM renamed --writing the result into the stg_orders table
stg_payments:
This model prepares the data from the raw_payments table for transformation:
We take the id column and rename it to the CRM-friendly payment_id.
The amount column contains numbers, so we change its data type to an integer (Int32 in ClickHouse®) to be able to perform calculations with its contents further down the pipeline.
The other two columns don't need any modifications, so we just transfer them as-is to the new table.
You can see the full listing of this model below:
/models/staging/stg_payments.sql
WITH source AS (
    SELECT * FROM db_corner_shop.raw_payments --path to raw data table on your cluster
),
renamed AS ( --renaming sequence for raw data columns
    SELECT
        id AS payment_id,
        order_id,
        payment_method,
        CAST(amount AS INTEGER) AS amount
    FROM source
)
SELECT * FROM renamed --writing the result into the stg_payments table
The infrastructure to prepare your raw data for transformation is complete.
Create data transformation models
Put together the schema for transforming the prepared data:
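The project tree above also lists a /models/schema.yml file for these models. As a minimal sketch, it could declare tests mirroring the staging schema (the exact tests here are an assumption; adjust them to your needs):
/models/schema.yml
version: 2

models:
  - name: customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']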
customers:
This model transforms the data from the stg_customers, stg_orders and stg_payments models to make the BI-ready customers table:
We take the customer_id columns and join the tables on them.
We take the first_name and last_name columns and put them directly into the customers table.
We perform some analytics on the order_date column:
We take the minimum value and write it to the new first_order column.
We take the maximum value and write it to the new most_recent_order column.
We count the number of orders for each customer_id and write the result into the new number_of_orders column.
We sum up each customer's payments stored in the amount column of the stg_payments table and write the result into the new total_amount column. Then we rename it to customer_lifetime_value.
You can see the full listing of this model below:
/models/customers.sql
WITH customers AS (
    SELECT * FROM {{ ref('stg_customers') }} -- getting the prepared data from our stg_ models
),
orders AS (
    SELECT * FROM {{ ref('stg_orders') }}
),
payments AS (
    SELECT * FROM {{ ref('stg_payments') }}
),
customer_orders AS ( --transforming the customer_orders segment into 4 columns:
    SELECT
        customer_id,
        MIN(order_date) AS first_order,
        MAX(order_date) AS most_recent_order,
        COUNT(order_id) AS number_of_orders
    FROM orders
    GROUP BY customer_id --grouping the above columns
),
customer_payments AS ( --transforming the customer_payments segment into the total_amount column:
    SELECT
        orders.customer_id,
        SUM(amount) AS total_amount
    FROM payments
    LEFT JOIN orders --adding the order_id column from the payments table as is:
        ON payments.order_id = orders.order_id
    GROUP BY orders.customer_id --grouping the above columns
),
final AS ( --describing the structure of the final table:
    SELECT
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order,
        customer_orders.most_recent_order,
        customer_orders.number_of_orders,
        customer_payments.total_amount AS customer_lifetime_value
    FROM customers
    LEFT JOIN customer_orders --joining the columns with the same content
        ON customers.customer_id = customer_orders.customer_id
    LEFT JOIN customer_payments --joining the columns with the same content
        ON customers.customer_id = customer_payments.customer_id
)
SELECT * FROM final --writing all the data into the table
orders:
This model transforms the data from the stg_orders and stg_payments models to make the BI-ready orders table:
We take the order_id columns and join the tables on them.
We take the order_date and the status columns and put them directly into the orders table.
We perform some analytics on the payments data:
We declare the payment method types that occur in the stg_payments table: credit_card, coupon, bank_transfer, gift_card.
When we encounter one of the payment methods associated with the current order_id, we take the value from the amount column of the same row. If there's no value in the source table, we consider it 0.
Then we take the resulting value from the step above, sum all the available values with the same order_id and payment method, and write the total into a new column named {{ payment_method }}_amount (see the compiled example after this list).
We sum up the payments for each order stored in the amount column of the stg_payments table and write the result into the new total_amount column. Then we rename it to amount.
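To make this concrete, here's what the credit_card iteration of that Jinja loop compiles to once dbt substitutes the template variable:
SUM(CASE WHEN payment_method = 'credit_card' THEN amount ELSE 0 END) AS credit_card_amount,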
You can see the full listing of this model below:
/models/orders.sql
{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %} -- set up payment method column names for further calculations
WITH orders AS (
    SELECT * FROM {{ ref('stg_orders') }} -- getting the prepared data from our stg_ models
),
payments AS (
    SELECT * FROM {{ ref('stg_payments') }}
),
order_payments AS ( --calculating payment amounts and writing the results into the appropriate columns:
    SELECT
        order_id,
        {% for payment_method in payment_methods %}
        SUM(CASE WHEN payment_method = '{{ payment_method }}' THEN amount ELSE 0 END) AS {{ payment_method }}_amount,
        {% endfor -%}
        SUM(amount) AS total_amount --calculating the value of the total_amount column
    FROM payments
    GROUP BY order_id --grouping the above columns
),
final AS ( --describing the structure of the final table:
    SELECT
        orders.order_id,
        orders.customer_id,
        orders.order_date,
        orders.status,
        {% for payment_method in payment_methods %} --adding payment method columns to the table
        order_payments.{{ payment_method }}_amount,
        {% endfor -%}
        order_payments.total_amount AS amount --renaming the total_amount column
    FROM orders
    LEFT JOIN order_payments
        ON orders.order_id = order_payments.order_id
)
SELECT * FROM final --writing all the data into the table
All the project files are ready. Now let's make sure everything works.
Test and run your dbt project
Go back to your terminal and navigate to the /corner_shop folder with your dbt project.
First, test the project integrity and the connection between dbt and your Managed ClickHouse® cluster:
dbt debug
If everything works, you should see the following output:
Configuration:
profiles.yml file [OK found and valid]
dbt_project.yml file [OK found and valid]
Required dependencies:
- git [OK found]
Connection:
driver: native
host: <FQDN of your ClickHouse® cluster>
port: 9440
user: admin
schema: db_corner_shop
secure: True
verify: True
connect_timeout: 10
send_receive_timeout: 300
sync_request_timeout: 5
compress_block_size: 1048576
compression:
custom_settings: None
Connection test: [OK connection ok]
All checks passed!
Run your dbt project:
dbt run
Give dbt some time to perform data transformation. As a result, you should see the following output:
Running with dbt=1.1.1
Found 5 models, 20 tests, 0 snapshots, 0 analyses, 186 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
Concurrency: 1 threads (target='dev')
1 of 5 START table model db_corner_shop.stg_customers .......................... [RUN]
1 of 5 OK created table model db_corner_shop.stg_customers ..................... [OK in 1.85s]
2 of 5 START table model db_corner_shop.stg_orders ............................. [RUN]
2 of 5 OK created table model db_corner_shop.stg_orders ........................ [OK in 1.97s]
3 of 5 START table model db_corner_shop.stg_payments ........................... [RUN]
3 of 5 OK created table model db_corner_shop.stg_payments ...................... [OK in 1.68s]
4 of 5 START table model db_corner_shop.customers .............................. [RUN]
4 of 5 OK created table model db_corner_shop.customers ......................... [OK in 1.56s]
5 of 5 START table model db_corner_shop.orders ................................. [RUN]
5 of 5 OK created table model db_corner_shop.orders ............................ [OK in 2.24s]
Finished running 5 table models in 10.90s.
Completed successfully
Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5
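The run output also mentions the 20 tests declared in the schema.yml files. dbt run doesn't execute them; if you want to validate the transformed data against them, use the standard dbt command:
dbt test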
Great job! Data transformation is complete.
Query the results of data transformation
Let's check if all the tables are present in the db_corner_shop database:
CLI
GUI
Let's use your terminal to send the query to your Managed ClickHouse® cluster:
docker run --network host --rm -it clickhouse/<Native interface connection string> \
--query="SHOW TABLES FROM db_corner_shop"
The tables named customers and orders are the results of your data transformation. Feel free to see the results of your work with the following query:
docker run --network host --rm -it clickhouse/<Native interface connection string> \
--query="SELECT * FROM db_corner_shop.customers LIMIT 20"
Your output should look like this:
> --query="SELECT * FROM db_corner_shop.customers LIMIT 20"
1 Michael P. 2018-01-01 2018-02-10 2 3300
2 Shawn M. 2018-01-11 2018-01-11 1 2300
3 Kathleen P. 2018-01-02 2018-03-11 3 6500
4 Jimmy C. 0 0
5 Katherine R. 0 0
6 Sarah R. 2018-02-19 2018-02-19 1 800
7 Martin M. 2018-01-14 2018-01-14 1 2600
8 Frank R. 2018-01-29 2018-03-12 2 4500
9 Jennifer F. 2018-03-17 2018-03-17 1 3000
10 Henry W. 0 0
11 Fred S. 2018-03-23 2018-03-23 1 300
12 Amy D. 2018-03-03 2018-03-03 1 400
13 Kathleen M. 2018-03-07 2018-03-07 1 2600
14 Steve F. 0 0
15 Teresa H. 0 0
16 Amanda H. 2018-02-02 2018-02-02 1 1200
17 Kimberly R. 0 0
18 Johnny K. 2018-02-27 2018-02-27 1 2900
19 Virginia F. 2018-03-16 2018-03-16 1 300
20 Anna A. 2018-01-23 2018-01-23 1 1500
Let's look at your Database Navigator. You should see the customers and orders tables.
Double-click the customers table to check its contents.
This concludes our tutorial on how to use dbt-based data transformations with DoubleCloud Managed ClickHouse®. Take a look at other instructions related to ClickHouse® and keep exploring!