Transform your data in a Managed ClickHouse® cluster with dbt
Why do I need to transform my data?
Data transformation is the process of converting a raw data source into a ready-to-use format and structure. It's crucial for any entity or organization seeking to get the most out of its data and produce meaningful and usable business insights.
One of the easiest-to-use data transformation tools is dbt (data build tool). It allows developers and data scientists to write transformations as queries and organize them in a straightforward way. dbt reads your project's structure, compiles it into SQL, and runs it against your database.
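For example, a dbt model is nothing more than a SELECT statement in a .sql file, and dbt wraps it in the DDL needed to build a table or view. A minimal sketch (the model and table names here are illustrative, not part of this tutorial's project):

-- models/example.sql - what you write:
SELECT id, name FROM my_schema.raw_table

-- roughly what dbt runs when materializing the model as a table
-- (engine and dialect details vary by adapter):
CREATE TABLE my_schema.example AS
SELECT id, name FROM my_schema.raw_table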
Below, you’ll see how to use dbt with a variety of tools to create a working data transformation flow with a DoubleCloud Managed ClickHouse® cluster:
First, we get raw table data from your DoubleCloud Managed ClickHouse® cluster and send it through the raw data processing models to get it ready for transformation.
Then, we send the resulting prepared data through another set of transformation models to get the data in a format that we need.
Let's get right to it.
Configure your cluster and local environment
To prepare for this scenario, you'll need to perform the following steps:
- Create a Managed ClickHouse® cluster.
- Install dbt on your system.
- Install the dbt-clickhouse plugin.
- Create the database and upload the raw data.
- Write a dbt profile to work with your Managed ClickHouse® cluster.
Create a Managed ClickHouse® cluster
- Create a ClickHouse® cluster using a method of your choice.
Install dbt and dbt-clickhouse on your system
First, make sure you have pip installed on your system:
sudo apt-get install python3-pip
Now make sure your setuptools and wheel packages are up to date:
pip install --upgrade setuptools wheel
The dbt toolkit has a modular structure, so you don't need to install the complete list of available packages. For now, you only need the dbt-core package:
pip install dbt-core
Good job!
The core dbt service is ready. Now we need to install the adapter that'll make your dbt installation compatible with ClickHouse®:
pip install dbt-clickhouse
Check if the core and the plugin work correctly:
dbt --version
You should get the following output:
Core:
- installed: 1.1.1
- latest: 1.1.1 - Up to date!
Plugins:
- clickhouse: 1.1.6 - Up to date!
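Your exact version numbers may differ from the ones above. If the output reports that either package is out of date, upgrading both at once helps keep the core and the adapter compatible:

pip install --upgrade dbt-core dbt-clickhouse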
Nice! It's time to prepare your database and upload the raw data.
Create the database and upload the raw data
- First, install Docker and start the service:

  sudo service docker start

- Pull the clickhouse-client Docker image:

  docker pull clickhouse/clickhouse-client

- Create the db_corner_shop database by running a command that contains a connection string. Use the value of the Native Interface field from the Overview tab on your cluster information page as part of the Docker command. The command has the following structure:

  docker run --network host --rm -it clickhouse/<Native interface connection string> \
  --query="CREATE DATABASE IF NOT EXISTS db_corner_shop"

  The complete Docker command structure:

  docker run --network host --rm -it clickhouse/clickhouse-client \
  --host <FQDN of the host> --secure \
  --user <Cluster user name> --password <Cluster user password> --port 9440 \
  --query="CREATE DATABASE IF NOT EXISTS db_corner_shop"
- Connect to the database to upload the data. For this scenario, we'll use the GUI-based DBeaver Community edition:
  - Open Database → New Database Connection.
  - Select ClickHouse® from the list of database types.
  - Click Next.
  - On the Main tab, specify the following:
    - Host - paste the host's domain name (FQDN) from the Hosts tab on your cluster information page.
    - Port - 8443 by default.
    - Database/Schema - enter db_corner_shop.
    - Username and Password - copy them for your cluster from the Overview tab on your cluster information page.
  - On the Driver Properties tab:
    - Find ssl on the list of properties and change its value to true.
    - Specify the socket_timeout. The recommended value is 300000 or higher. Enter a larger value if you are going to perform longer queries.
  - Click Test Connection.... If the connection is successful, you'll see a window with the connection status, information on the DBMS, and the driver.
  - Click Finish.
- Download the ZIP archive with the sample CSV data tables and extract the archive into a suitable directory.
- Use DBeaver to upload the CSV tables to your database:
  - In the Database Navigator, right-click db_corner_shop → Tables and select Import Data.
  - In the Import source section, keep everything as-is and click Next.
  - In the Select input files dialog, choose all three tables (raw_customers, raw_orders, raw_payments) and click Open.
  - Click Next.
  - In the Tables mapping section, keep everything as-is and click Next.
  - In the Data load settings section, keep the default values and click Next.
  - In the Confirm section, make sure everything is correct.
  - Click Proceed. Transferring the data will take some time. After the process is complete, you'll see the imported tables in the Database Navigator.
Great job! We've got a database and a set of raw data to transform.
Create a dbt project to transform your data
Let's take a quick look at what we need to do next.
We've got three tables with raw data:
- raw_customers
- raw_orders
- raw_payments
This dataset simulates the data from a small retail CRM system and isn't yet in a shape that a BI solution can work with. To address that, we need to:

- Parse the data and apply appropriate data types to it.
- Use the resulting structured data to create constructs that most BI solutions can ingest and process.
To achieve that, we'll use dbt as our one-stop solution for both steps above.
Write a dbt profile to work with your Managed ClickHouse® cluster
Now let's make sure your dbt installation can connect to your Managed ClickHouse® cluster. Create a profile with all the necessary credentials:
- Create the ~/.dbt/ folder and make the profiles.yml file inside it:

  corner_shop:
    target: dev
    outputs:
      dev:
        type: clickhouse
        driver: native
        host: <FQDN of the host>
        user: admin
        password: <your ClickHouse® cluster password>
        port: 9440
        secure: true
        schema: db_corner_shop
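If you'd rather not keep the password in the file as plain text, dbt profiles support the env_var() Jinja function. A sketch, assuming you export an environment variable named CLICKHOUSE_PASSWORD (the variable name is an example) before running dbt:

  password: "{{ env_var('CLICKHOUSE_PASSWORD') }}"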
Good. Now we can use this profile to build a new dbt project.
Create a dbt project structure
Let's start with creating a dbt project structure.
For this scenario, create the /corner_shop project folder. Let's make the simplest possible file tree:
corner_shop # the project folder
├─ models # folder with configurations and scripts for transforming prepared data
│ ├─ staging # folder with configuration and scripts for transforming raw data
│ │ ├─ schema.yml
│ │ ├─ stg_customers.sql
│ │ ├─ stg_orders.sql
│ │ └─ stg_payments.sql
│ ├─ customers.sql
│ ├─ orders.sql
│ └─ schema.yml
└─ dbt_project.yml # configuration for the whole project
Download the complete project structure
You can download the archive with all the files listed in the above project structure. Unpack it to your /corner_shop folder.
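If you use the downloaded archive, you can already ask dbt to list the models it detects in the project. Run this from the /corner_shop folder; it should print the five models from the tree above:

dbt ls --resource-type model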
Now let's go through the project structure in detail.
Create the dbt_project.yml file
Create the base of the project - the dbt_project.yml file:
name: 'corner_shop' # project folder name
config-version: 2
version: '0.1'
profile: 'corner_shop' # the profile to connect to your cluster
model-paths: ["models"] # folder containing models
target-path: "target"
clean-targets: # these folders will be created by dbt to store service files
- "target"
- "dbt_modules"
- "logs"
require-dbt-version: [">=1.0.0", "<2.0.0"]
models:
  corner_shop: # here we provide the necessary ClickHouse®-specific settings
    materialized: table
    staging:
      materialized: view
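With this configuration, dbt materializes the staging models as views and everything else as tables. An individual model can also override the project-wide setting with a config() block at the top of its .sql file; a sketch, not used in this tutorial's project:

{{ config(materialized='view') }}

SELECT ...  -- the rest of the model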
Create raw data processing models
- Create the /models/staging folder.
- Make the /models/staging/schema.yml configuration file:

  /models/staging/schema.yml

  version: 2

  models:
    - name: stg_customers # the name of the .sql file with the model
      columns:
        - name: customer_id
          tests:
            - unique
            - not_null
    - name: stg_orders
      columns:
        - name: order_id
          tests:
            - unique
            - not_null
        - name: status
          tests:
            - accepted_values:
                values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
    - name: stg_payments
      columns:
        - name: payment_id
          tests:
            - unique
            - not_null
        - name: payment_method
          tests:
            - accepted_values:
                values: ['credit_card', 'coupon', 'bank_transfer', 'gift_card']
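  These tests don't run during dbt run; once the models are built, you can execute them at any point with dbt test. For example, to run only the tests attached to the stg_orders model:

  dbt test --select stg_orders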
- Create the models for assigning appropriate data types to raw data:

  stg_customers:

  This model prepares the data from the raw_customers table for transformation:

  - We take the id column and rename it to the CRM-friendly customer_id.
  - The other two columns don't need any modifications, so we just transfer them as-is to the new table.

  You can see the full listing of this model below:

  /models/staging/stg_customers.sql

  WITH source AS (
      SELECT * FROM db_corner_shop.raw_customers --path to the raw data table on your cluster
  ),
  renamed AS ( --renaming sequence for raw data columns
      SELECT
          id AS customer_id,
          first_name,
          last_name
      FROM source
  )
  SELECT * FROM renamed --writing the result into the stg_customers table
  stg_orders:

  This model prepares the data from the raw_orders table for transformation:

  - We take the id column and rename it to the CRM-friendly order_id.
  - The user_id column contains the same data as the id column from the raw_customers table, so we rename it to customer_id for consistency further downstream.
  - The other two columns don't need any modifications, so we just transfer them as-is to the new table.

  You can see the full listing of this model below:

  /models/staging/stg_orders.sql

  WITH source AS (
      SELECT * FROM db_corner_shop.raw_orders --path to the raw data table on your cluster
  ),
  renamed AS ( --renaming sequence for raw data columns
      SELECT
          id AS order_id,
          user_id AS customer_id,
          order_date,
          status
      FROM source
  )
  SELECT * FROM renamed --writing the result into the stg_orders table
  stg_payments:

  This model prepares the data from the raw_payments table for transformation:

  - We take the id column and rename it to the CRM-friendly payment_id.
  - The amount column contains numbers, so we cast it to an integer type (Int32 in ClickHouse®) to be able to perform calculations with its contents further downstream.
  - The other two columns don't need any modifications, so we just transfer them as-is to the new table.

  You can see the full listing of this model below:

  /models/staging/stg_payments.sql

  WITH source AS (
      SELECT * FROM db_corner_shop.raw_payments --path to the raw data table on your cluster
  ),
  renamed AS ( --renaming sequence for raw data columns
      SELECT
          id AS payment_id,
          order_id,
          payment_method,
          CAST(amount AS INTEGER) AS amount
      FROM source
  )
  SELECT * FROM renamed --writing the result into the stg_payments table
The infrastructure to prepare your raw data for transformation is complete.
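If you want to inspect the exact SQL dbt will run for these models, you can compile the project: dbt renders the Jinja and writes the resulting SQL under the target/ folder without building anything in the database:

dbt compile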
Create data transformation models
- Put together the schema for transforming the prepared data:

  /models/schema.yml

  version: 2

  models:
    - name: customers
      columns:
        - name: customer_id
          tests:
            - unique
            - not_null
        - name: first_name
        - name: last_name
        - name: first_order
        - name: most_recent_order
        - name: number_of_orders
        - name: total_order_amount
    - name: orders
      columns:
        - name: order_id
          tests:
            - unique
            - not_null
        - name: customer_id
          tests:
            - not_null
            - relationships:
                to: ref('customers')
                field: customer_id
        - name: order_date
        - name: status
          tests:
            - accepted_values:
                values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
        - name: amount
          tests:
            - not_null
        - name: credit_card_amount
          tests:
            - not_null
        - name: coupon_amount
          tests:
            - not_null
        - name: bank_transfer_amount
          tests:
            - not_null
        - name: gift_card_amount
          tests:
            - not_null
- Create the models to transform prepared data:

  customers:

  This model transforms the data from stg_customers, stg_orders, and stg_payments to make the BI-ready customers table:

  - We take the customer_id columns and join them.
  - We take the first_name and last_name columns and put them directly into the customers table.
  - We perform some analytics on the order_date column:
    - We take the minimum value and write it to the new first_order column.
    - We take the maximum value and write it to the new most_recent_order column.
    - We count the number of orders for each customer_id and write the result into the new number_of_orders column.
  - We sum up the payments for each customer stored in the amount column of the stg_payments table and write the result into the new total_amount column. Then we rename it to customer_lifetime_value.

  You can see the full listing of this model below:

  /models/customers.sql

  WITH customers AS (
      SELECT * FROM {{ ref('stg_customers') }} -- getting the prepared data from our stg_ models
  ),
  orders AS (
      SELECT * FROM {{ ref('stg_orders') }}
  ),
  payments AS (
      SELECT * FROM {{ ref('stg_payments') }}
  ),
  customer_orders AS ( --transforming the customer_orders segment into 4 columns:
      SELECT
          customer_id,
          MIN(order_date) AS first_order,
          MAX(order_date) AS most_recent_order,
          COUNT(order_id) AS number_of_orders
      FROM orders
      GROUP BY customer_id --grouping the above columns
  ),
  customer_payments AS ( --transforming the customer_payments segment into the total_amount column:
      SELECT
          orders.customer_id,
          SUM(amount) AS total_amount
      FROM payments
      LEFT JOIN orders --adding the order_id column from the payments table as-is:
          ON payments.order_id = orders.order_id
      GROUP BY orders.customer_id --grouping the above columns
  ),
  final AS ( --describing the structure of the final table:
      SELECT
          customers.customer_id,
          customers.first_name,
          customers.last_name,
          customer_orders.first_order,
          customer_orders.most_recent_order,
          customer_orders.number_of_orders,
          customer_payments.total_amount AS customer_lifetime_value
      FROM customers
      LEFT JOIN customer_orders --joining the columns with the same content
          ON customers.customer_id = customer_orders.customer_id
      LEFT JOIN customer_payments --joining the columns with the same content
          ON customers.customer_id = customer_payments.customer_id
  )
  SELECT * FROM final --writing all the data into the table
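  Once you've run the project (see below), you can sanity-check this model with a query like the following, which lists the five customers with the highest lifetime value:

  SELECT customer_id, first_name, last_name, customer_lifetime_value
  FROM db_corner_shop.customers
  ORDER BY customer_lifetime_value DESC
  LIMIT 5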
  orders:

  This model transforms the data from stg_orders and stg_payments to make the BI-ready orders table:

  - We take the order_id columns and join them.
  - We take the order_date and status columns and put them directly into the orders table.
  - We perform some analytics on the payment data:
    - We declare the payment method types that occur in the stg_payments table: credit_card, coupon, bank_transfer, gift_card.
    - We write a CASE expression: when we encounter one of the payment methods associated with the current order_id, we take the value from the amount column of the same row. If there's no value in the source table, we consider it 0.
    - Then we take the resulting value from the step above, sum all the available values with the same order_id and payment method, and write it into a new column named {{ payment_method }}_amount.
  - We sum up the payments for each order stored in the amount column of the stg_payments table and write the result into the new total_amount column. Then we rename it back to amount.

  You can see the full listing of this model below:

  /models/orders.sql

  {% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %} -- set up payment method column names for further calculations

  WITH orders AS (
      SELECT * FROM {{ ref('stg_orders') }} -- getting the prepared data from our stg_ models
  ),
  payments AS (
      SELECT * FROM {{ ref('stg_payments') }}
  ),
  order_payments AS ( --calculating payment amounts and writing the results into the appropriate columns
      SELECT
          order_id,
          {% for payment_method in payment_methods %}
          SUM(CASE WHEN payment_method = '{{ payment_method }}' THEN amount ELSE 0 END) AS {{ payment_method }}_amount,
          {% endfor -%}
          SUM(amount) AS total_amount --calculating the value of the total_amount column
      FROM payments
      GROUP BY order_id --grouping the above columns
  ),
  final AS ( --describing the structure of the final table:
      SELECT
          orders.order_id,
          orders.customer_id,
          orders.order_date,
          orders.status,
          {% for payment_method in payment_methods %} --adding payment method columns to the table
          order_payments.{{ payment_method }}_amount,
          {% endfor -%}
          order_payments.total_amount AS amount --renaming the total_amount column
      FROM orders
      LEFT JOIN order_payments
          ON orders.order_id = order_payments.order_id
  )
  SELECT * FROM final --writing all the data into the table
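  To make the Jinja loop more tangible: after compilation, the order_payments CTE expands to roughly the following plain SQL, with one summed column per payment method:

  SELECT
      order_id,
      SUM(CASE WHEN payment_method = 'credit_card' THEN amount ELSE 0 END) AS credit_card_amount,
      SUM(CASE WHEN payment_method = 'coupon' THEN amount ELSE 0 END) AS coupon_amount,
      SUM(CASE WHEN payment_method = 'bank_transfer' THEN amount ELSE 0 END) AS bank_transfer_amount,
      SUM(CASE WHEN payment_method = 'gift_card' THEN amount ELSE 0 END) AS gift_card_amount,
      SUM(amount) AS total_amount
  FROM payments
  GROUP BY order_id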
All the files for the project are ready. Now let's make sure everything works.
Test and run your dbt project
- Go back to your terminal and navigate to the /corner_shop folder with your dbt project.
- First, test the project integrity and the connection between dbt and your Managed ClickHouse® cluster:
dbt debug
If everything works, you should see the following output:
Configuration:
  profiles.yml file [OK found and valid]
  dbt_project.yml file [OK found and valid]

Required dependencies:
 - git [OK found]

Connection:
  driver: native
  host: <FQDN of your ClickHouse® cluster>
  port: 9440
  user: admin
  schema: db_corner_shop
  secure: True
  verify: True
  connect_timeout: 10
  send_receive_timeout: 300
  sync_request_timeout: 5
  compress_block_size: 1048576
  compression:
  custom_settings: None
  Connection test: [OK connection ok]

All checks passed!
- Run your dbt project:
dbt run
Give dbt some time to perform data transformation. As a result, you should see the following output:
Running with dbt=1.1.1
Found 5 models, 20 tests, 0 snapshots, 0 analyses, 186 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics

Concurrency: 1 threads (target='dev')

1 of 5 START table model db_corner_shop.stg_customers .......................... [RUN]
1 of 5 OK created table model db_corner_shop.stg_customers ..................... [OK in 1.85s]
2 of 5 START table model db_corner_shop.stg_orders ............................. [RUN]
2 of 5 OK created table model db_corner_shop.stg_orders ........................ [OK in 1.97s]
3 of 5 START table model db_corner_shop.stg_payments ........................... [RUN]
3 of 5 OK created table model db_corner_shop.stg_payments ...................... [OK in 1.68s]
4 of 5 START table model db_corner_shop.customers .............................. [RUN]
4 of 5 OK created table model db_corner_shop.customers ......................... [OK in 1.56s]
5 of 5 START table model db_corner_shop.orders ................................. [RUN]
5 of 5 OK created table model db_corner_shop.orders ............................ [OK in 2.24s]

Finished running 5 table models in 10.90s.

Completed successfully

Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5
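As the output notes, the project also carries 20 tests - the unique, not_null, accepted_values, and relationships checks defined in the schema.yml files. Now that the models are built, you can run them all; if the transformation went through correctly, every test should pass:

dbt test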
Great job! Data transformation is complete.
Query the results of data transformation
Let's check that all the tables are present in the db_corner_shop database. Use your terminal to send a query to your Managed ClickHouse® cluster:
docker run --network host --rm -it clickhouse/<Native interface connection string> \
--query="SHOW TABLES FROM db_corner_shop"
You should see the following output:
customers
orders
raw_customers
raw_orders
raw_payments
stg_customers
stg_orders
stg_payments
The customers and orders tables are the results of your data transformation. Feel free to see the results of your work with the following query:
docker run --network host --rm -it clickhouse/<Native interface connection string> \
--query="SELECT * FROM db_corner_shop.customers LIMIT 20"
Your output should look like this:
1 Michael P. 2018-01-01 2018-02-10 2 3300
2 Shawn M. 2018-01-11 2018-01-11 1 2300
3 Kathleen P. 2018-01-02 2018-03-11 3 6500
4 Jimmy C. 0 0
5 Katherine R. 0 0
6 Sarah R. 2018-02-19 2018-02-19 1 800
7 Martin M. 2018-01-14 2018-01-14 1 2600
8 Frank R. 2018-01-29 2018-03-12 2 4500
9 Jennifer F. 2018-03-17 2018-03-17 1 3000
10 Henry W. 0 0
11 Fred S. 2018-03-23 2018-03-23 1 300
12 Amy D. 2018-03-03 2018-03-03 1 400
13 Kathleen M. 2018-03-07 2018-03-07 1 2600
14 Steve F. 0 0
15 Teresa H. 0 0
16 Amanda H. 2018-02-02 2018-02-02 1 1200
17 Kimberly R. 0 0
18 Johnny K. 2018-02-27 2018-02-27 1 2900
19 Virginia F. 2018-03-16 2018-03-16 1 300
20 Anna A. 2018-01-23 2018-01-23 1 1500
Let's look at your Database Navigator. You should see the customers and orders tables.
Double-click the customers table to review its contents.
This concludes our tutorial on how to use dbt-based data transformations with DoubleCloud Managed ClickHouse®. Take a look at other instructions related to ClickHouse® and keep exploring!