Data transformation is the process of converting a raw data source into a ready-to-use format and structure. It's crucial for any organization seeking to get the most out of its data and produce meaningful, usable business insights.
dbt (data build tool) allows developers and data scientists to write transformations as queries and organize them in a straightforward way. dbt reads your project's structure, compiles it into SQL, and runs it against your database.
Below, you’ll see how to use dbt with a variety of tools to create a working data transformation flow with a DoubleCloud Managed ClickHouse® cluster:
First, we get raw table data from your DoubleCloud Managed ClickHouse® cluster and send it through the raw data processing models to get it ready for transformation.
Then, we send the resulting prepared data through another set of transformation models to get the data in a format that we need.
Let's get right to it.
Configure your cluster and local environment
To prepare for this scenario, you'll need to perform the following steps:
Create a Managed ClickHouse® cluster
Install dbt on your system
Install the dbt-clickhouse plugin
Create the database and upload the raw data
Write a dbt profile to work with your Managed ClickHouse® cluster
First, make sure you have pip installed on your system:
sudo apt-get install python3-pip
Now let's make sure your setuptools and wheel packages are up to date:
pip install --upgrade setuptools wheel
The dbt toolkit has a modular structure, so you don't need to install the complete list of available packages. For now, you just need the dbt-core package:
pip install dbt-core
Good job!
The core dbt service is ready. Now we need to install the adapter that'll make your dbt installation compatible with ClickHouse®:
pip install dbt-clickhouse
Check if the core and the plugin work correctly:
dbt --version
You should get the following output:
Core:
- installed: 1.1.1
- latest: 1.1.1 - Up to date!
Plugins:
- clickhouse: 1.1.6 - Up to date!
Nice! It's time to prepare your database and upload the raw data.
Create the database and upload the raw data
First, install and then start the Docker service if needed:
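On Debian or Ubuntu (an assumption, consistent with the apt-get commands used above), installing and starting Docker could look like this:
sudo apt-get install docker.io   # install the Docker engine
sudo systemctl start docker      # start the Docker service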
Create the db_corner_shop database by running clickhouse-client in a Docker container. Use the value of the Native interface field from the Overview tab on your cluster information page as part of the Docker command:
The command has the following structure:
docker run --network host --rm -it clickhouse/<Native interface connection string> \
--query="CREATE DATABASE IF NOT EXISTS db_corner_shop"
The complete Docker command structure
docker run --network host --rm -it clickhouse/clickhouse-client \
    --host <FQDN of the host> \
    --secure \
    --user <Cluster user name> \
    --password <Cluster user password> \
    --port 9440 \
    --query="CREATE DATABASE IF NOT EXISTS db_corner_shop"
Connect to the database to upload the data. For this scenario, we'll use the GUI-based DBeaver Community database management application:
Open Database → New Database Connection.
Select ClickHouse® from the list of database types.
Click Next.
On the Main tab:
Host - paste the host's domain name (FQDN) from the Hosts tab on your cluster information page.
Port - 8443 by default.
Database/Schema - enter db_corner_shop.
Copy Username and Password for your cluster from the Overview tab on your cluster information page.
On the Driver Properties tab:
Find ssl on the list of properties and change its value to true.
Specify the socket_timeout property. The recommended value is 300000 (milliseconds) or higher. Enter a larger value if you're going to run longer queries.
Click Test Connection.... If the connection is successful, you'll see a window with the connection status, information on the DBMS, and the driver.
Click Finish.
Download the ZIP archive with the sample CSV data tables and extract the archive into a suitable directory.
Use DBeaver to upload the CSV tables to your database:
In the Database Navigator, right-click db_corner_shop → Tables and select Import Data.
In the Import source section, keep everything as-is and click Next.
In the Select input files dialog, choose all three tables (raw_customers, raw_orders, raw_payments) and click Open.
Click Next.
In the Tables mapping section, keep everything as-is and click Next.
In the Data load settings, keep everything with default values, and click Next.
In the Confirm section, ensure everything is correct.
Click Proceed.
Transferring the data will take some time.
After the process is complete, you'll see the imported tables in the Database Navigator.
Great job! We've got a database and a set of raw data to transform.
Create a dbt project to transform your data
Let's take a quick look at what we need to do next.
We've got three tables with raw data:
raw_customers
raw_orders
raw_payments
This dataset simulates the data from a small retail CRM system, and in its raw form it isn't something a BI solution can work with. To address that, we need to:
Parse the data and apply appropriate data types to it.
Use the resulting structured data to create the constructs suitable for most BI solutions to ingest and process.
To achieve that, we'll use dbt as our one-stop solution for both steps above.
Write a dbt profile to work with your Managed ClickHouse® cluster
Now let's make sure your dbt installation can connect to your Managed ClickHouse® cluster. Create a profile with all the necessary credentials:
Create the ~/.dbt/ folder and make the profiles.yml file inside it:
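The exact contents depend on your cluster; below is a minimal sketch consistent with the dbt debug output shown later in this tutorial. The top-level profile name must match the profile field in dbt_project.yml, and the placeholders follow the same conventions as the Docker command above:
~/.dbt/profiles.yml
corner_shop:
  target: dev
  outputs:
    dev:
      type: clickhouse # use the dbt-clickhouse adapter
      schema: db_corner_shop # database to run the models against
      host: <FQDN of the host> # from the Hosts tab on your cluster information page
      port: 9440 # secure native interface port
      user: <Cluster user name>
      password: <Cluster user password>
      secure: True # the cluster accepts only TLS connections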
Good. Now we can use this profile to build a new dbt project.
Create a dbt project structure
Let's start with creating a dbt project structure.
For this scenario, create the /corner_shop project folder. Let's make the simplest possible file tree:
corner_shop # the project folder
├─ models # folder with configurations and scripts for transforming prepared data
│ ├─ staging # folder with configuration and scripts for transforming raw data
│ │ ├─ schema.yml
│ │ ├─ stg_customers.sql
│ │ ├─ stg_orders.sql
│ │ └─ stg_payments.sql
│ ├─ customers.sql
│ ├─ orders.sql
│ └─ schema.yml
└─ dbt_project.yml # configuration for the whole project
Download the complete project structure
You can download the archive with all the files listed in the above project structure and unpack it to your /corner_shop folder.
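If you'd rather create the structure by hand, a couple of shell commands will do it (assuming a Unix-like shell):
mkdir -p corner_shop/models/staging  # creates the project, models, and staging folders in one go
cd corner_shop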
Now let's go through the project structure in detail.
Create the dbt_project.yml file
Create the base of the project - the dbt_project.yml file:
name: 'corner_shop' # project folder name
config-version: 2
version: '0.1'
profile: 'corner_shop' # the profile to connect to your cluster
model-paths: ["models"] # folder containing models
target-path: "target"
clean-targets: # these folders will be created by dbt to store service files
  - "target"
  - "dbt_modules"
  - "logs"
require-dbt-version: [">=1.0.0", "<2.0.0"]

models:
  corner_shop: # here we provide the necessary ClickHouse®-specific settings
    materialized: table
    staging:
      materialized: view
Create raw data processing models
Create the /models/staging folder.
Make the /models/staging/schema.yml configuration file:
/models/staging/schema.yml
version: 2

models:
  - name: stg_customers # the name of the .sql file with the model
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
  - name: stg_payments
    columns:
      - name: payment_id
        tests:
          - unique
          - not_null
      - name: payment_method
        tests:
          - accepted_values:
              values: ['credit_card', 'coupon', 'bank_transfer', 'gift_card']
Create the models for assigning appropriate data types to raw data:
stg_customers:
This model prepares the data from the raw_customers table for transformation:
We take the id column and rename it to the CRM-friendly customer_id.
The other two columns don't need any modifications, so we just transfer them as-is to the new table.
You can see the full listing of this model below:
/models/staging/stg_customers.sql
WITH source AS (
    SELECT * FROM db_corner_shop.raw_customers --path to raw data table on your cluster
),
renamed AS ( --renaming sequence for raw data columns
    SELECT
        id AS customer_id,
        first_name,
        last_name
    FROM source
)
SELECT * FROM renamed --writing the result into the stg_customers table
stg_orders:
This model prepares the data from the raw_orders table for transformation:
We take the id column and rename it to the CRM-friendly order_id.
The user_id column contains the same data as the id column from the raw_customers table, so we rename it to customer_id for consistency further down the pipeline.
The other two columns don't need any modifications, so we just transfer them as-is to the new table.
You can see the full listing of this model below:
/models/staging/stg_orders.sql
WITH source AS (
    SELECT * FROM db_corner_shop.raw_orders --path to raw data table on your cluster
),
renamed AS ( --renaming sequence for raw data columns
    SELECT
        id AS order_id,
        user_id AS customer_id,
        order_date,
        status
    FROM source
)
SELECT * FROM renamed --writing the result into the stg_orders table
stg_payments:
This model prepares the data from the raw_payments table for transformation:
We take the id column and rename it to the CRM-friendly payment_id.
The amount column contains numbers, so we change its data type to an integer (Int32 in ClickHouse®) to be able to perform calculations with its contents further down the pipeline.
The other two columns don't need any modifications, so we just transfer them as-is to the new table.
You can see the full listing of this model below:
/models/staging/stg_payments.sql
WITH source AS (
    SELECT * FROM db_corner_shop.raw_payments --path to raw data table on your cluster
),
renamed AS ( --renaming sequence for raw data columns
    SELECT
        id AS payment_id,
        order_id,
        payment_method,
        CAST(amount AS INTEGER) AS amount
    FROM source
)
SELECT * FROM renamed --writing the result into the stg_payments table
The infrastructure to prepare your raw data for transformation is complete.
Create data transformation models
Put together the schema for transforming the prepared data:
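The project tree above also lists a /models/schema.yml file for these models. As a minimal sketch, it could declare tests mirroring the staging schema (the exact tests here are an assumption; adjust them to your needs):
/models/schema.yml
version: 2

models:
  - name: customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']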
customers:
This model transforms the data from the stg_customers, stg_orders and stg_payments models to make the BI-ready customers table:
We take the customer_id columns and join the tables on them.
We take the first_name and last_name columns and put them directly into the customers table.
We perform some analytics on the order_date column:
We take the minimum value and write it to the new first_order column.
We take the maximum value and write it to the new most_recent_order column.
We count the number of orders for each customer_id and write the result into the new number_of_orders column.
We sum up each customer's payments stored in the amount column of the stg_payments table and write the result into the new total_amount column. Then we rename it to customer_lifetime_value.
You can see the full listing of this model below:
/models/customers.sql
WITH customers AS (
    SELECT * FROM {{ ref('stg_customers') }} -- getting the prepared data from our stg_ models
),
orders AS (
    SELECT * FROM {{ ref('stg_orders') }}
),
payments AS (
    SELECT * FROM {{ ref('stg_payments') }}
),
customer_orders AS ( --transforming the customer_orders segment into 4 columns:
    SELECT
        customer_id,
        MIN(order_date) AS first_order,
        MAX(order_date) AS most_recent_order,
        COUNT(order_id) AS number_of_orders
    FROM orders
    GROUP BY customer_id --grouping the above columns
),
customer_payments AS ( --transforming the customer_payments segment into the total_amount column:
    SELECT
        orders.customer_id,
        SUM(amount) AS total_amount
    FROM payments
    LEFT JOIN orders --adding the order_id column from the payments table as is:
        ON payments.order_id = orders.order_id
    GROUP BY orders.customer_id --grouping the above columns
),
final AS ( --describing the structure of the final table:
    SELECT
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order,
        customer_orders.most_recent_order,
        customer_orders.number_of_orders,
        customer_payments.total_amount AS customer_lifetime_value
    FROM customers
    LEFT JOIN customer_orders --joining the columns with the same content
        ON customers.customer_id = customer_orders.customer_id
    LEFT JOIN customer_payments --joining the columns with the same content
        ON customers.customer_id = customer_payments.customer_id
)
SELECT * FROM final --writing all the data into the table
orders:
This model transforms the data from the stg_orders and stg_payments models to make the BI-ready orders table:
We take the order_id columns and join the tables on them.
We take the order_date and the status columns and put them directly into the orders table.
We perform some analytics on the payments data:
We declare the payment method types that occur in the stg_payments table: credit_card, coupon, bank_transfer, gift_card.
When we encounter one of the payment methods associated with the current order_id, we take the value from the amount column of the same row. If there's no value in the source table, we consider it 0.
Then we take the resulting value from the step above, sum all the available values with the same order_id and payment method, and write the total into a new column named {{ payment_method }}_amount (see the compiled example after this list).
We sum up the payments for each order stored in the amount column of the stg_payments table and write the result into the new total_amount column. Then we rename it to amount.
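To make this concrete, here's what the credit_card iteration of that Jinja loop compiles to once dbt substitutes the template variable:
SUM(CASE WHEN payment_method = 'credit_card' THEN amount ELSE 0 END) AS credit_card_amount,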
You can see the full listing of this model below:
/models/orders.sql
{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %} -- set up payment method column names for further calculations
WITH orders AS (
    SELECT * FROM {{ ref('stg_orders') }} -- getting the prepared data from our stg_ models
),
payments AS (
    SELECT * FROM {{ ref('stg_payments') }}
),
order_payments AS ( --calculating payment amounts and writing the results into the appropriate columns:
    SELECT
        order_id,
        {% for payment_method in payment_methods %}
        SUM(CASE WHEN payment_method = '{{ payment_method }}' THEN amount ELSE 0 END) AS {{ payment_method }}_amount,
        {% endfor -%}
        SUM(amount) AS total_amount --calculating the value of the total_amount column
    FROM payments
    GROUP BY order_id --grouping the above columns
),
final AS ( --describing the structure of the final table:
    SELECT
        orders.order_id,
        orders.customer_id,
        orders.order_date,
        orders.status,
        {% for payment_method in payment_methods %} --adding payment method columns to the table
        order_payments.{{ payment_method }}_amount,
        {% endfor -%}
        order_payments.total_amount AS amount --renaming the total_amount column
    FROM orders
    LEFT JOIN order_payments
        ON orders.order_id = order_payments.order_id
)
SELECT * FROM final --writing all the data into the table
All the project files are ready. Now let's make sure everything works.
Test and run your dbt project
Go back to your terminal and navigate to the /corner_shop folder with your dbt project.
First, test the project integrity and the connection between dbt and your Managed ClickHouse® cluster:
dbt debug
If everything works, you should see the following output:
Configuration:
profiles.yml file [OK found and valid]
dbt_project.yml file [OK found and valid]
Required dependencies:
- git [OK found]
Connection:
driver: native
host: <FQDN of your ClickHouse® cluster>
port: 9440
user: admin
schema: db_corner_shop
secure: True
verify: True
connect_timeout: 10
send_receive_timeout: 300
sync_request_timeout: 5
compress_block_size: 1048576
compression:
custom_settings: None
Connection test: [OK connection ok]
All checks passed!
Run your dbt project:
dbt run
Give dbt some time to perform data transformation. As a result, you should see the following output:
Running with dbt=1.1.1
Found 5 models, 20 tests, 0 snapshots, 0 analyses, 186 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
Concurrency: 1 threads (target='dev')
1 of 5 START table model db_corner_shop.stg_customers .......................... [RUN]
1 of 5 OK created table model db_corner_shop.stg_customers ..................... [OK in 1.85s]
2 of 5 START table model db_corner_shop.stg_orders ............................. [RUN]
2 of 5 OK created table model db_corner_shop.stg_orders ........................ [OK in 1.97s]
3 of 5 START table model db_corner_shop.stg_payments ........................... [RUN]
3 of 5 OK created table model db_corner_shop.stg_payments ...................... [OK in 1.68s]
4 of 5 START table model db_corner_shop.customers .............................. [RUN]
4 of 5 OK created table model db_corner_shop.customers ......................... [OK in 1.56s]
5 of 5 START table model db_corner_shop.orders ................................. [RUN]
5 of 5 OK created table model db_corner_shop.orders ............................ [OK in 2.24s]
Finished running 5 table models in 10.90s.
Completed successfully
Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5
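The run output also mentions the 20 tests declared in the schema.yml files. dbt run doesn't execute them; if you want to validate the transformed data against them, use the standard dbt command:
dbt test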
Great job! Data transformation is complete.
Query the results of data transformation
Let's check if all the tables are present in the db_corner_shop database:
CLI
GUI
Let's use your terminal to send the query to your Managed ClickHouse® cluster:
docker run --network host --rm -it clickhouse/<Native interface connection string> \
--query="SHOW TABLES FROM db_corner_shop"
The tables named customers and orders are the results of your data transformation. Feel free to see the results of your work with the following query:
docker run --network host --rm -it clickhouse/<Native interface connection string> \
--query="SELECT * FROM db_corner_shop.customers LIMIT 20"
Your output should look like this:
> --query="SELECT * FROM db_corner_shop.customers LIMIT 20"
1 Michael P. 2018-01-01 2018-02-10 2 3300
2 Shawn M. 2018-01-11 2018-01-11 1 2300
3 Kathleen P. 2018-01-02 2018-03-11 3 6500
4 Jimmy C. 0 0
5 Katherine R. 0 0
6 Sarah R. 2018-02-19 2018-02-19 1 800
7 Martin M. 2018-01-14 2018-01-14 1 2600
8 Frank R. 2018-01-29 2018-03-12 2 4500
9 Jennifer F. 2018-03-17 2018-03-17 1 3000
10 Henry W. 0 0
11 Fred S. 2018-03-23 2018-03-23 1 300
12 Amy D. 2018-03-03 2018-03-03 1 400
13 Kathleen M. 2018-03-07 2018-03-07 1 2600
14 Steve F. 0 0
15 Teresa H. 0 0
16 Amanda H. 2018-02-02 2018-02-02 1 1200
17 Kimberly R. 0 0
18 Johnny K. 2018-02-27 2018-02-27 1 2900
19 Virginia F. 2018-03-16 2018-03-16 1 300
20 Anna A. 2018-01-23 2018-01-23 1 1500
Let's look at your Database Navigator. You should see the customers and orders tables.
Double-click the customers table to check its contents.
This concludes our tutorial on how to use dbt-based data transformations with DoubleCloud Managed ClickHouse®. Take a look at other instructions related to ClickHouse® and keep exploring!