Transform your data in a Managed ClickHouse® cluster with dbt

Why do I need to transform my data?

Data transformation is the process of converting a raw data source into a ready-to-use format and structure. It's crucial for any organization seeking to get the most out of its data and produce meaningful, usable business insights.

One of the easiest data transformation tools to use is dbt (data build tool).

It allows developers and data scientists to write transformations as queries and organize them in a straightforward way. dbt reads your project's structure, compiles it into SQL and runs it against your database.
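
A dbt model is just a SELECT statement, optionally with Jinja templating. As a minimal sketch (the model and table names here are illustrative, not part of this scenario):

-- models/recent_orders.sql, a hypothetical one-line model
SELECT * FROM {{ ref('raw_orders_model') }} WHERE order_date > '2018-01-01'

-- roughly what dbt runs against the database after compiling ref():
SELECT * FROM db_example.raw_orders_model WHERE order_date > '2018-01-01'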

Below, you’ll see how to use dbt with a variety of tools to create a working data transformation flow with a DoubleCloud Managed ClickHouse® cluster:

scenario-diagram

First, we get raw table data from your DoubleCloud Managed ClickHouse® cluster and send it through the raw data processing models to get it ready for transformation.

Then, we send the resulting prepared data through another set of transformation models to get the data in a format that we need.

Let's get right to it.

Configure your cluster and local environment

To prepare for this scenario, you'll need to perform the following steps:

  1. Create a Managed ClickHouse® cluster

  2. Install dbt on your system

  3. Install the dbt-clickhouse plugin

  4. Create the database and upload the raw data

  5. Write a dbt profile to work with your Managed ClickHouse® cluster

Let's walk through each step.

Create a Managed ClickHouse® cluster

  1. Create a ClickHouse® cluster using a method of your choice.

Install dbt and dbt-clickhouse on your system

First, make sure you have pip installed on your system:

sudo apt-get install python3-pip

Now let's make sure your setuptools and wheel packages are up to date:

pip install --upgrade setuptools wheel

The dbt toolkit has a modular structure, so it isn't necessary to install the complete list of available packages. For now you just need to install the dbt-core package:

pip install dbt-core

Good job!

The core dbt service is ready. Now we need to install the adapter that'll make your dbt installation compatible with ClickHouse®:

pip install dbt-clickhouse

Check that the core and the plugin are installed correctly:

dbt --version

You should see output similar to the following (versions may differ):

Core:
  - installed: 1.1.1
  - latest:    1.1.1 - Up to date!

Plugins:
  - clickhouse: 1.1.6 - Up to date!

Nice! It's time to prepare your database and upload the raw data.

Create the database and upload the raw data

  1. First, install and then start the Docker service if needed:

    sudo service docker start
    
  2. Pull the clickhouse-client Docker image:

    docker pull clickhouse/clickhouse-client
    
  3. In the container, create the db_corner_shop database by running a command that contains a connection string. Use the value of the Native Interface field from the Overview tab on your cluster information page as part of the Docker connection string:

    connection-string

    The command has the following structure:

    docker run --network host --rm -it clickhouse/<Native interface connection string> \
    --query="CREATE DATABASE IF NOT EXISTS db_corner_shop"
    
    The complete Docker command structure
    docker run --network host --rm -it clickhouse/clickhouse-client \
          --host <FQDN of the host> \
          --secure \
          --user <Cluster user name> \
          --password <Cluster user password> \
          --port 9440 \
          --query="CREATE DATABASE IF NOT EXISTS db_corner_shop"
    
  4. Connect to the database to upload the data. For this scenario, we'll use the GUI-based DBeaver Community database management application:

    1. Open Database → New Database Connection.

    2. Select ClickHouse® from the list of database types.

    3. Click Next.

    4. On the Main tab:

      1. Host - paste the host's domain name (FQDN) from the Hosts tab on your cluster information page.

      2. Port - 8443 by default.

      3. Database/Schema - enter db_corner_shop.

    5. Copy Username and Password for your cluster from the Overview tab on your cluster information page and paste them into the corresponding fields.

    6. On the Driver Properties tab:

      1. Find ssl on the list of properties and change its value to true.
      2. Specify the socket_timeout. The recommended value is 300000 or higher. Enter a larger value if you are going to perform longer queries.

    7. Click Test Connection.... If the connection is successful, you'll see a window with the connection status, information on the DBMS, and the driver.

    8. Click Finish.

  5. Download the ZIP archive with the sample CSV data tables and extract the archive into a suitable directory.

  6. Use DBeaver to upload the CSV tables to your database:

    1. In the Database Navigator, right-click db_corner_shop → Tables and select Import Data:

      dbeaver-import-menu

    2. In the Import source section, keep everything as-is and click Next.

    3. In the Select input files dialog, choose all three tables (raw_customers, raw_orders, raw_payments) and click Open. You should see the following:

      dbeaver-input-files

    4. Click Next.

    5. In the Tables mapping section, you'll see the following:

      dbeaver-tables-mapping

      Keep everything as-is and click Next.

    6. In the Data load settings section, keep the default values and click Next.

    7. In the Confirm section, ensure everything is correct. This is what it should look like:

      dbeaver-confirm

    8. Click Proceed. Transferring the data will take some time.

      After the process is complete, you'll see the imported tables in the Database Navigator:

      dbeaver-import-success

Great job! We've got a database and a set of raw data to transform.
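
If you want to double-check the upload before moving on, a quick row count against each raw table should return non-zero numbers. A minimal sketch you can run from DBeaver's SQL editor:

-- count the rows imported into each raw table
SELECT 'raw_customers' AS tbl, count() AS row_count FROM db_corner_shop.raw_customers
UNION ALL
SELECT 'raw_orders', count() FROM db_corner_shop.raw_orders
UNION ALL
SELECT 'raw_payments', count() FROM db_corner_shop.raw_payments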

Create a dbt project to transform your data

Let's take a quick look at what we need to do next.

We've got three tables with raw data:

  • raw_customers
  • raw_orders
  • raw_payments

This dataset simulates the data from a small retail CRM system, and in its raw form it isn't something a BI solution can work with. To address that, we need to:

  1. Parse the data and apply appropriate data types to it.

  2. Use the resulting structured data to create the constructs suitable for most BI solutions to ingest and process.

To achieve that, we'll use dbt as our one-stop solution for both steps above.
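
Before writing any models, it can help to eyeball a few raw rows and see exactly what the staging models will have to rename and cast. For example:

-- sample the raw data that the staging models will reshape
SELECT * FROM db_corner_shop.raw_orders LIMIT 5;
SELECT * FROM db_corner_shop.raw_payments LIMIT 5;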

Write a dbt profile to work with your Managed ClickHouse® cluster

Now let's make sure your dbt installation can connect to your Managed ClickHouse® cluster. Create a profile with all the necessary credentials:

  1. Create the ~/.dbt/ folder and make the profiles.yml file inside it:

    corner_shop:
       target: dev
       outputs:
          dev:
             type: clickhouse
             driver: native
             host: <FQDN of the host>
             user: admin
             password: <your ClickHouse® cluster password>
             port: 9440
             secure: true
             schema: db_corner_shop
    

Good. Now we can use this profile to build a new dbt project.

Create a dbt project structure

Let's start with creating a dbt project structure.

For this scenario, create the /corner_shop project folder. Let's make the simplest possible file tree:

corner_shop # the project folder
├─ models # folder with configurations and scripts for transforming prepared data
│  ├─ staging # folder with configuration and scripts for transforming raw data 
│  │  ├─ schema.yml
│  │  ├─ stg_customers.sql
│  │  ├─ stg_orders.sql
│  │  └─ stg_payments.sql
│  ├─ customers.sql
│  ├─ orders.sql
│  └─ schema.yml
└─ dbt_project.yml # configuration for the whole project
Download the complete project structure

You can download the archive with all the files listed in the above project structure. Please unpack it to your /corner_shop folder.

Now let's go through the project structure in detail.

Create the dbt_project.yml file

Create the base of the project, the dbt_project.yml file:

name: 'corner_shop' # project folder name

config-version: 2
version: '0.1'

profile: 'corner_shop' # the profile to connect to your cluster

model-paths: ["models"] # folder containing models

target-path: "target" 
clean-targets: # these folders will be created by dbt to store service files
   - "target"
   - "dbt_modules"
   - "logs"

require-dbt-version: [">=1.0.0", "<2.0.0"]

models:
   corner_shop: # here we provide the necessary ClickHouse®-specific settings
      materialized: table
      staging:
         materialized: view
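
With this configuration, every model in the project is materialized as a table, except the models under staging, which the config marks as views. Roughly, and assuming dbt-clickhouse defaults, the generated DDL should differ like this (a sketch, not dbt's verbatim statements):

-- a staging model (materialized: view)
CREATE VIEW db_corner_shop.stg_customers AS
   SELECT id AS customer_id, first_name, last_name
   FROM db_corner_shop.raw_customers;

-- a final model (materialized: table); dbt-clickhouse defaults to the MergeTree engine
CREATE TABLE db_corner_shop.customers
ENGINE = MergeTree()
ORDER BY tuple()
AS SELECT * FROM db_corner_shop.stg_customers; -- the real statement runs the full model query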

Create raw data processing models

  1. Create the /models/staging folder.

  2. Make the /models/staging/schema.yml configuration file (the SQL its tests run is sketched after this list):

    /models/staging/schema.yml
    version: 2
    
    models:
      - name: stg_customers # the name of the .sql file with the model
        columns:
          - name: customer_id
            tests:
              - unique
              - not_null
    
      - name: stg_orders
        columns:
          - name: order_id
            tests:
              - unique
              - not_null
          - name: status
            tests:
              - accepted_values:
                  values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
    
      - name: stg_payments
        columns:
          - name: payment_id
            tests:
              - unique
              - not_null
          - name: payment_method
            tests:
              - accepted_values:
                  values: ['credit_card', 'coupon', 'bank_transfer', 'gift_card']
    
  3. Create the models for assigning appropriate data types to raw data:

    stg_customers:

    This model prepares the data from the raw_customers table for transformation:

    diag-stg-customers

    • We take the id column and rename it to the CRM-friendly customer_id.

    • The other two columns don't need any modifications, so we just transfer them as-is to the new table.

    You can see the full listing of this model below:

    /models/staging/stg_customers.sql
    WITH source AS (
       SELECT * FROM db_corner_shop.raw_customers --path to raw data table on your cluster
    ),
    
    renamed AS ( --renaming sequence for raw data columns
    
       SELECT
          id AS customer_id,
          first_name,
          last_name
    
       FROM source
    
    )
    
    SELECT * FROM renamed --writing the result into the stg_customers table
    

    stg_orders:

    This model prepares the data from the raw_orders table for transformation:

    diag-stg-orders

    • We take the id column and rename it to the CRM-friendly order_id.

    • The user_id column contains the same data as the id column from the raw_customers table, so we rename it to customer_id for consistency further downstream.

    • The other two columns don't need any modifications, so we just transfer them as-is to the new table.

    You can see the full listing of this model below:

    /models/staging/stg_orders.sql
    WITH source AS (
       SELECT * FROM db_corner_shop.raw_orders --path to raw data table on your cluster
    ),
    
    renamed AS ( --renaming sequence for raw data columns
    
       SELECT
          id AS order_id,
          user_id AS customer_id,
          order_date,
          status
    
       FROM source
    
    )
    
    SELECT * FROM renamed --writing the result into the stg_orders table
    

    stg_payments:

    This model prepares the data from the raw_payments table for transformation:

    diag-stg-payments

    • We take the id column and rename it to the CRM-friendly payment_id.

    • The amount column contains numbers, so we cast it to an integer type to be able to perform calculations with its contents further downstream.

    • The other two columns don't need any modifications, so we just transfer them as-is to the new table.

    You can see the full listing of this model below:

    /models/staging/stg_payments.sql
    WITH source AS (
       SELECT * FROM db_corner_shop.raw_payments --path to raw data table on your cluster
    ),
    
    renamed AS ( --renaming sequence for raw data columns
    
       SELECT
          id AS payment_id,
          order_id,
          payment_method,
          CAST(amount AS INTEGER) AS amount
    
       FROM source
    
    )
    
    SELECT * FROM renamed --writing the result into the stg_payments table
    

    The infrastructure to prepare your raw data for transformation is complete.
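
The tests declared in /models/staging/schema.yml compile to SQL too. For example, the unique test on customer_id runs a query along these lines (a sketch; dbt's generated SQL differs in details), and the test passes when the query returns zero rows:

-- a duplicate check roughly equivalent to the unique test on stg_customers
SELECT customer_id
FROM db_corner_shop.stg_customers
GROUP BY customer_id
HAVING count(*) > 1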

Create data transformation models

  1. Put together the schema for transforming the prepared data:

    /models/schema.yml
    version: 2
    
    models:
      - name: customers
        columns:
          - name: customer_id
            tests:
              - unique
              - not_null
          - name: first_name
          - name: last_name
          - name: first_order
          - name: most_recent_order
          - name: number_of_orders
          - name: total_order_amount
      - name: orders
        columns:
          - name: order_id
            tests:
              - unique
              - not_null
          - name: customer_id
            tests:
              - not_null
              - relationships:
                  to: ref('customers')
                  field: customer_id
          - name: order_date
          - name: status
            tests:
              - accepted_values:
                  values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
          - name: amount
            tests:
              - not_null
          - name: credit_card_amount
            tests:
              - not_null
          - name: coupon_amount
            tests:
              - not_null
          - name: bank_transfer_amount
            tests:
              - not_null
          - name: gift_card_amount
            tests:
              - not_null
    
  2. Create the models to transform prepared data:

    customers:

    This model transforms the data from the stg_customers, stg_orders, and stg_payments models to make the BI-ready customers table:

    diag-customers

    • We join the tables on their customer_id columns.

    • We take the first_name and last_name columns and put them directly into the customers table.

    • We perform some analytics on the order_date column:

      • We take the minimum value and write it to the new first_order column.

      • We take the maximum value and write it to the new most_recent_order column.

      • We count the number of orders for each customer_id and write the result into the new number_of_orders column.

    • We sum up the numbers for each customer stored in the amount column of the stg_payments table and write the result into the new total_amount column, which the final table exposes as customer_lifetime_value.

    You can see the full listing of this model below:

    /models/customers.sql
    WITH customers AS (
    
       SELECT * FROM {{ ref('stg_customers') }} -- getting the prepared data from our stg_ models
    
    ),
    
    orders AS (
    
       SELECT * FROM {{ ref('stg_orders') }}
    
    ),
    
    payments AS (
    
       SELECT * FROM {{ ref('stg_payments') }}
    
    ),
    
    customer_orders AS ( --transforming the customer_orders segment into 4 columns: 
    
       SELECT
          customer_id,
    
          MIN(order_date) AS first_order,
          MAX(order_date) AS most_recent_order,
          COUNT(order_id) AS number_of_orders
       FROM orders
    
       GROUP BY customer_id --grouping the above columns
    
    ),
    
    customer_payments AS ( -- transforming the customer_payments segment into the total_amount column:
    
       SELECT
          orders.customer_id,
          SUM(amount) AS total_amount
    
       FROM payments
    
       LEFT JOIN orders ON --joining payments to orders by order_id:
             payments.order_id = orders.order_id
    
       GROUP BY orders.customer_id --grouping the above columns
    
    ),
    
    final AS ( --describing the structure of the final table:
    
       SELECT
          customers.customer_id,
          customers.first_name,
          customers.last_name,
          customer_orders.first_order,
          customer_orders.most_recent_order,
          customer_orders.number_of_orders,
          customer_payments.total_amount AS customer_lifetime_value
    
       FROM customers
    
       LEFT JOIN customer_orders --joining the columns with the same content
          ON customers.customer_id = customer_orders.customer_id
    
       LEFT JOIN customer_payments --joining the columns with the same content
          ON customers.customer_id = customer_payments.customer_id
    
    )
    
    SELECT * FROM final --writing all the data into the table
    

    orders:

    This model transforms the data from the stg_orders and stg_payments models to make the BI-ready orders table:

    diag-orders

    • We join the tables on their order_id columns.

    • We take the order_date and the status columns and put them directly into the orders table.

    • We perform analytics on the payment data:

      • We declare the payment method types that occur in the stg_payments table: credit_card, coupon, bank_transfer, gift_card.

      • We write a CASE statement that works as follows:

        When we encounter one of the payment methods associated with the current order_id, we take the value from the amount column of the same row. If there's no value in the source table, we treat it as 0.

      • Then we take the resulting value from the step above, sum all the available values with the same order_id and payment method, and write it into a new column named {{ payment_method }}_amount (a rendered example appears after this list).

    • We sum up the numbers for each order stored in the amount column of the stg_payments table and write the result into the new total_amount column, which the final table exposes as amount.

    You can see the full listing of this model below:

    /models/orders.sql
    {% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %} -- set up payment method column names for further calculations
    
    WITH orders AS (
    
       SELECT * FROM {{ ref('stg_orders') }} -- getting the prepared data from our stg_ models
    
    ),
    
    payments AS (
    
       SELECT * FROM {{ ref('stg_payments') }}
    
    ),
    
    order_payments AS ( --calculating the payment amounts and writing the results into the appropriate columns
    
       SELECT
          order_id,
    
          {% for payment_method in payment_methods %}
          SUM(CASE WHEN payment_method = '{{ payment_method }}' THEN amount ELSE 0 END) AS {{ payment_method }}_amount,
          {% endfor -%}
    
          SUM(amount) AS total_amount --calculating the value of the total_amount column
    
       FROM payments
    
       GROUP BY order_id --grouping the above columns
    
    ),
    
    final AS ( --describing the structure of the final table:
    
       SELECT
          orders.order_id,
          orders.customer_id,
          orders.order_date,
          orders.status,
    
          {% for payment_method in payment_methods %} --adding payment method columns to the table
    
          order_payments.{{ payment_method }}_amount,
    
          {% endfor -%}
    
          order_payments.total_amount AS amount --renaming the total_amount column
    
       FROM orders
    
    
       LEFT JOIN order_payments
          ON orders.order_id = order_payments.order_id
    
    )
    
    SELECT * FROM final --writing all the data into the table
    

    All the files for the project are ready. Now let's make sure everything is ready to go.
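
To make the Jinja loop in orders.sql concrete, here's a sketch of what the order_payments query looks like after rendering, reduced to a single payment method:

-- the credit_card iteration of the loop, rendered to plain SQL
SELECT
   order_id,
   SUM(CASE WHEN payment_method = 'credit_card' THEN amount ELSE 0 END) AS credit_card_amount,
   SUM(amount) AS total_amount
FROM db_corner_shop.stg_payments
GROUP BY order_id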

Test and run your dbt project

  1. Go back to your terminal and navigate to the /corner_shop folder with your dbt project.

  2. First, test the project integrity and the connection between dbt and your Managed ClickHouse® cluster:

    dbt debug
    

    If everything works, you should see the following output:

    Configuration:
    profiles.yml file [OK found and valid]
    dbt_project.yml file [OK found and valid]
    
    Required dependencies:
    - git [OK found]
    
    Connection:
    driver: native
    host: <FQDN of your ClickHouse® cluster>
    port: 9440
    user: admin
    schema: db_corner_shop
    secure: True
    verify: True
    connect_timeout: 10
    send_receive_timeout: 300
    sync_request_timeout: 5
    compress_block_size: 1048576
    compression: 
    custom_settings: None
    Connection test: [OK connection ok]
    
    All checks passed!
    
  3. Run your dbt project:

    dbt run
    

    Give dbt some time to perform the data transformation. As a result, you should see output similar to the following:

    Running with dbt=1.1.1
    Found 5 models, 20 tests, 0 snapshots, 0 analyses, 186 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
      
    Concurrency: 1 threads (target='dev')
      
    1 of 5 START table model db_corner_shop.stg_customers .......................... [RUN]
    1 of 5 OK created table model db_corner_shop.stg_customers ..................... [OK in 1.85s]
    2 of 5 START table model db_corner_shop.stg_orders ............................. [RUN]
    2 of 5 OK created table model db_corner_shop.stg_orders ........................ [OK in 1.97s]
    3 of 5 START table model db_corner_shop.stg_payments ........................... [RUN]
    3 of 5 OK created table model db_corner_shop.stg_payments ...................... [OK in 1.68s]
    4 of 5 START table model db_corner_shop.customers .............................. [RUN]
    4 of 5 OK created table model db_corner_shop.customers ......................... [OK in 1.56s]
    5 of 5 START table model db_corner_shop.orders ................................. [RUN]
    5 of 5 OK created table model db_corner_shop.orders ............................ [OK in 2.24s]
      
    Finished running 5 table models in 10.90s.
      
    Completed successfully
    
    Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5
    

Great job! Data transformation is complete.

Query the results of data transformation

Let's check that all the tables are present in the db_corner_shop database. Use your terminal to send a query to your Managed ClickHouse® cluster:

docker run --network host --rm -it clickhouse/<Native interface connection string> \
--query="SHOW TABLES FROM db_corner_shop"

You should see the following output:

customers
orders
raw_customers
raw_orders
raw_payments
stg_customers
stg_orders
stg_payments

The tables named customers and orders are the results of your data transformation. Feel free to see the results of your work with the following query:

docker run --network host --rm -it clickhouse/<Native interface connection string> \
--query="SELECT * FROM db_corner_shop.customers LIMIT 20"

Your output should look like this:

1	Michael	P.	2018-01-01	2018-02-10	2	3300
2	Shawn	M.	2018-01-11	2018-01-11	1	2300
3	Kathleen	P.	2018-01-02	2018-03-11	3	6500
4	Jimmy	C.			0	0
5	Katherine	R.			0	0
6	Sarah	R.	2018-02-19	2018-02-19	1	800
7	Martin	M.	2018-01-14	2018-01-14	1	2600
8	Frank	R.	2018-01-29	2018-03-12	2	4500
9	Jennifer	F.	2018-03-17	2018-03-17	1	3000
10	Henry	W.			0	0
11	Fred	S.	2018-03-23	2018-03-23	1	300
12	Amy	D.	2018-03-03	2018-03-03	1	400
13	Kathleen	M.	2018-03-07	2018-03-07	1	2600
14	Steve	F.			0	0
15	Teresa	H.			0	0
16	Amanda	H.	2018-02-02	2018-02-02	1	1200
17	Kimberly	R.			0	0
18	Johnny	K.	2018-02-27	2018-02-27	1	2900
19	Virginia	F.	2018-03-16	2018-03-16	1	300
20	Anna	A.	2018-01-23	2018-01-23	1	1500
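
You can inspect the orders table the same way, including the per-method amount columns produced by the Jinja loop. For example, pass the following query through the same Docker command pattern:

SELECT order_id, status, credit_card_amount, coupon_amount, bank_transfer_amount, gift_card_amount, amount
FROM db_corner_shop.orders
LIMIT 10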

Let's look at your Database Navigator. You should see the customers and orders tables:

database-navigator-complete

Double-click the customers table. This is what its contents should look like:

dbeaver-customers-output

This concludes our tutorial on how to use dbt-based data transformations with DoubleCloud Managed ClickHouse®. Take a look at other instructions related to ClickHouse® and keep exploring!

See also