Send data from Apache Kafka® to Managed Service for ClickHouse® in DoubleCloud
Prepare your workspace
Before you start this tutorial, install the software you'll use later:
- Install the kafkacat (kcat) client to manage Apache Kafka®:

  Docker:

  Pull the kcat 1.7.1 version, but you can use the latest one:

  ```bash
  docker pull edenhill/kcat:1.7.1
  ```

  Native kcat (DEB):

  Install kcat from your package repository:

  ```bash
  sudo apt install kafkacat
  ```
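  After you create the Apache Kafka® cluster later in this tutorial, you can check that the client can reach it by listing the cluster metadata. This is a sketch: `-L` is a standard kcat flag, and the broker address and credentials are placeholders you take from the cluster's Overview page:

  ```bash
  # List brokers, topics, and partitions visible to this client.
  kafkacat -L \
    -b <Cluster FQDN from the Overview page>:9091 \
    -X security.protocol=SASL_SSL \
    -X sasl.mechanisms=SCRAM-SHA-512 \
    -X sasl.username="<username>" \
    -X sasl.password="<your cluster password>"
  ```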
- Install the jq utility to process JSON:

  Docker:

  ```bash
  docker pull stedolan/jq
  ```

  Native jq (DEB):

  Install jq from your Linux distribution's repository:

  ```bash
  sudo apt install jq -y
  ```
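  The producer step at the end of this tutorial uses jq to flatten a JSON array into one object per line. A quick illustration with made-up input: the `.[]` filter unpacks the array, and `-c` prints each resulting object on its own line, which is the newline-delimited layout kafkacat expects:

  ```bash
  # .[] splits the array into a stream of objects; -c compacts each onto one line.
  echo '[{"id": 1}, {"id": 2}]' | jq -rc '.[]'
  # Output:
  # {"id":1}
  # {"id":2}
  ```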
- Take a look at the JSON file with the sample data for this tutorial: https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/hits_sample.json
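  For example, you can preview the beginning of the file right from your terminal:

  ```bash
  # Fetch the sample dataset and show its first 20 lines.
  curl -s https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/hits_sample.json | head -n 20
  ```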
Configure Managed Service for Apache Kafka®
Create an Apache Kafka® cluster
- Go to the Clusters page in the console.

- Click Create cluster in the upper-right corner of the page.

- Select Apache Kafka®.

- Choose a provider and a region. Try to create your Apache Kafka® and ClickHouse® clusters in the same region.

- Under Resources:

  - Select the s2-c2-m4 preset for CPU, RAM capacity, and storage space to create a cluster with minimal configuration.

  - Select the number of zones and brokers. The number of zones multiplied by the number of brokers determines the number of hosts. This tutorial creates a cluster with two zones and two brokers, which makes four hosts (2 × 2 = 4).
- Under Basic settings:

  - Enter the cluster Name, for example, kafka-tutorial-cluster.

  - Select the version of Apache Kafka® for your cluster from the Version drop-down list. For most clusters, we recommend using the latest version.

- Click Submit.

Your cluster will appear with the Creating status on the Clusters page in the console. Setting everything up may take some time. When the cluster is ready, its status changes to Alive.
Create a topic
When you've created a cluster, create a topic in it:
- On the cluster's page, go to the Topics tab.

- Click Create.

- Under Topic Settings, specify the connection properties:

  - Cleanup policy: Delete. This policy deletes log segments when their retention time or log size reaches the limit.

  - Compression Type: Uncompressed. We don't need compression for our use case, so we disable it.

  - Retention Bytes: 1048576 (1 MB).

  - Retention Ms: 600000 (10 minutes).
- Specify the Basic Settings:

  - Name: the topic's name. Let's call it hits_topic.

  - Partitions: the number of the topic's partitions. Keep 1 to create the simplest topic.

  - Replication factor: the number of copies of a topic in a cluster. This parameter's value shouldn't exceed the number of brokers in the cluster. Let's keep the default setting: 1.
- Click Submit.
Create a Managed ClickHouse® cluster and configure settings
- Go to the Clusters page in the console.

- Click Create cluster in the upper-right corner of the page.

- Select ClickHouse®.

- Choose a provider and a region. Try to create your Apache Kafka® and ClickHouse® clusters in the same region.

- Under Resources:

  - Select the s2-c2-m4 preset for CPU, RAM capacity, and storage space to create a cluster with minimal configuration.

  - Choose the number of replicas. Let's keep it as is with a single replica.

  - Select the number of shards. Keep a single shard.
- Under Basic settings:

  - Enter the cluster Name, in this tutorial: clickhouse-tutorial-cluster.

  - From the Version drop-down list, select the ClickHouse® version the Managed ClickHouse® cluster will use. For most clusters, we recommend using the latest version.

- Under Advanced → Cluster settings → clickhouseConfig → kafka, specify the details of the connection to your previously created Apache Kafka® cluster:

  - securityProtocol: SASL_SSL

  - saslMechanism: SCRAM-SHA-512

  - saslUsername: your Apache Kafka® cluster username

  - saslPassword: your Apache Kafka® cluster password

  Warning

  If you apply these settings to an already existing cluster, restart the cluster to apply the changes.

- Click Submit.
Your cluster will appear with the Creating status on the Clusters page in the console. Setting everything up may take some time. When the cluster is ready, its status changes to Alive.
Tip

DoubleCloud creates the admin superuser and its password automatically. You can find the username and password under Credentials on the Overview tab of the cluster page.

To learn how to create users for other roles, refer to Manage ClickHouse® users.
Set up integration to store and process data from Apache Kafka® topic
Access the ClickHouse® cluster
Reading data from Apache Kafka® topics requires creating two different tables and a materialized view.

The structure of the table with the Kafka engine is the following:
```sql
CREATE TABLE [db.]table_name ON CLUSTER default
(
    name1 [type1],
    name2 [type2],
    ...
) ENGINE = Kafka()
SETTINGS
    kafka_broker_list = 'host:port',
    kafka_topic_list = 'topic1,topic2,...',
    kafka_group_name = 'group_name',
    kafka_format = 'data_format'[,]
```
Required parameters:

| Parameter | Description |
|---|---|
| kafka_broker_list | A comma-separated list of brokers. |
| kafka_topic_list | A comma-separated list of Kafka topics. |
| kafka_group_name | A group of Kafka consumers. Reading margins for each group are tracked separately. If you don't want messages to be duplicated in the ClickHouse® cluster, use the same group name everywhere. |
| kafka_format | A message format, for example, JSONEachRow. |
Refer to the ClickHouse® documentation for more information on the Kafka engine settings; a concrete example of such a table follows below.
Create tables to receive and process messages from the topic
- Go to the Clusters page in the console.

- Select the cluster you created earlier.

- Click WebSQL at the top right.

- In WebSQL, click on any database in the connection manager on the left to open the query editor.

- Create a table with the Kafka engine:

  Communication between different DoubleCloud services is usually done with private FQDN addresses and a special port. You can find them under the Connection strings → Private tab on the overview page.

  ```sql
  CREATE TABLE topic_messages ON CLUSTER default
  (
      Hit_ID Int32,
      Date Date,
      Time_Spent Float32,
      Cookie_Enabled Bool,
      Region_ID Int32,
      Gender String,
      Browser String,
      Traffic_Source String,
      Technology String
  )
  ENGINE = Kafka('<Public FQDN of your Kafka cluster>:9091', 'hits_topic', 'group1', 'JSONEachRow');
  ```
- Create a ReplicatedMergeTree table that will store the received data:

  ```sql
  CREATE TABLE hits_storage ON CLUSTER default
  (
      Hit_ID Int32,
      Date Date,
      Time_Spent Float32,
      Cookie_Enabled Bool,
      Region_ID Int32,
      Gender String,
      Browser String,
      Traffic_Source String,
      Technology String
  )
  ENGINE = ReplicatedMergeTree()
  ORDER BY (Hit_ID, Date)
  ```
- Create a materialized view to copy message data from the Kafka engine table to the MergeTree table. The materialized view collects data in the background, which allows you to continually receive messages from Kafka and convert them to the required format using the SELECT statement:

  ```sql
  CREATE MATERIALIZED VIEW materialized_messages TO hits_storage
  AS SELECT * FROM topic_messages;
  ```
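To check that the tables and the view are wired up, you can count the rows in the storage table. The sketch below assumes you have the clickhouse-client CLI installed locally; substitute the host and the admin password from the Credentials section of the cluster's Overview page (9440 is ClickHouse®'s conventional secure native port). You can also simply run the SELECT in WebSQL. The count stays at 0 until you run the producer below:

```bash
# Placeholder connection details; copy the real ones from the Overview page.
clickhouse-client \
  --host <Public FQDN of your ClickHouse cluster> \
  --port 9440 \
  --secure \
  --user admin \
  --password '<your cluster password>' \
  --query "SELECT count() FROM hits_storage"
```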
Run the data pipeline
The pipeline consists of the following entities:
- The Apache Kafka® producer

- A ClickHouse® table with the Kafka engine as a consumer

- A materialized view

- A table to which the materialized view copies data.
To send data through this pipeline, access the JSON file from our Amazon S3 bucket and send the data from this file to the Apache Kafka® topic.
Create a consumer and a producer
- (Optional) Run an Apache Kafka® consumer to see if the data is successfully received by the Apache Kafka® cluster. The data will appear in your ClickHouse® cluster anyway, but it will be easier for you to track the process and debug it if needed.

  Docker:

  ```bash
  docker run --network host --rm -it \
    edenhill/kcat:1.7.1 -C \
    -b <Cluster FQDN from the Overview page>:9091 \
    -t hits_topic \
    -X security.protocol=SASL_SSL \
    -X sasl.mechanisms=SCRAM-SHA-512 \
    -X sasl.username="<username>" \
    -X sasl.password="<your cluster password>"
  ```

  Native kafkacat:

  ```bash
  kafkacat -C \
    -b <Cluster FQDN from the Overview page>:9091 \
    -t hits_topic \
    -X security.protocol=SASL_SSL \
    -X sasl.mechanisms=SCRAM-SHA-512 \
    -X sasl.username="<username>" \
    -X sasl.password="<your cluster password>"
  ```

  Now, when you run a separate terminal for the producer, the received messages will also be displayed in the consumer terminal.
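  Before streaming the whole dataset, you can also push a single hand-written message and watch it appear in the consumer terminal. This is an optional sketch; the field values are invented to match the topic_messages schema:

  ```bash
  # The JSON fields mirror the topic_messages table; the values are made up.
  echo '{"Hit_ID": 1, "Date": "2017-01-01", "Time_Spent": 12.5, "Cookie_Enabled": true, "Region_ID": 1, "Gender": "Female", "Browser": "Chrome", "Traffic_Source": "Direct", "Technology": "PC (Windows)"}' \
    | kafkacat -P \
      -b <Cluster FQDN from the Overview page>:9091 \
      -t hits_topic \
      -X security.protocol=SASL_SSL \
      -X sasl.mechanisms=SCRAM-SHA-512 \
      -X sasl.username="<username>" \
      -X sasl.password="<your cluster password>"
  ```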
- Execute the following command in a separate terminal instance to create a producer and push the data. The jq step converts the downloaded JSON array into newline-delimited objects, so kafkacat sends each row as a separate message:

  Docker:

  ```bash
  curl https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/hits_sample.json \
    | docker run -i stedolan/jq -rc '.[]' \
    | docker run --name kcat --rm -i edenhill/kcat:1.7.1 -P \
      -b <Cluster FQDN from the Overview page>:9091 \
      -t hits_topic \
      -X security.protocol=SASL_SSL \
      -X sasl.mechanisms=SCRAM-SHA-512 \
      -X sasl.username="<username>" \
      -X sasl.password="<password>"
  ```

  DEB-based:

  ```bash
  curl https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/hits_sample.json \
    | jq -rc '.[]' \
    | kafkacat -P \
      -b <Cluster FQDN from the Overview page>:9091 \
      -t hits_topic \
      -X security.protocol=SASL_SSL \
      -X sasl.mechanisms=SCRAM-SHA-512 \
      -X sasl.username="<username>" \
      -X sasl.password="<password>"
  ```
If you've completed all the steps successfully and have an open consumer, the terminal will show the uploaded data:

```text
},
{
  "Hit_ID": 40668,
  "Date": "2017-09-09",
  "Time_Spent": "730.875",
  "Cookie_Enabled": 0,
  "Region_ID": 11,
  "Gender": "Female",
  "Browser": "Chrome",
  "Traffic_Source": "Social network",
  "Technology": "PC (Windows)"
}
]
% Reached end of topic hits_topic [0] at offset 1102
```
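If the consumer terminal stays empty, you can replay the topic from the start to check whether the messages ever arrived. The `-o beginning` option is standard kcat and tells it to read from the first offset:

```bash
# Re-read every message in the topic from offset 0.
kafkacat -C -o beginning \
  -b <Cluster FQDN from the Overview page>:9091 \
  -t hits_topic \
  -X security.protocol=SASL_SSL \
  -X sasl.mechanisms=SCRAM-SHA-512 \
  -X sasl.username="<username>" \
  -X sasl.password="<your cluster password>"
```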
Query the data in your ReplicatedMergeTree table
- Go to the Clusters page in the console.

- Click WebSQL at the top right.

- Execute the following SELECT query:

  ```sql
  SELECT * FROM hits_storage LIMIT 5
  ```

  ```text
  ┌─Hit_ID─┬───────Date─┬─Time_Spent─┬─Cookie_Enabled─┬─Region_ID─┬─Gender──┬─Browser────────┬─Traffic_Source──┬─Technology───────────┐
  │  14230 │ 2017-01-30 │  265.70175 │ true           │         2 │ Female  │ Firefox        │ Direct          │ PC (Windows)         │
  │  14877 │ 2017-04-12 │  317.82758 │ false          │       229 │ Female  │ Firefox        │ Direct          │ PC (Windows)         │
  │  14892 │ 2017-07-29 │   191.0125 │ true           │        55 │ Female  │ Safari         │ Recommendations │ Smartphone (Android) │
  │  15071 │ 2017-06-11 │  148.58064 │ true           │       159 │ Female  │ Chrome         │ Ad traffic      │ PC (Windows)         │
  │  15110 │ 2016-09-02 │  289.48334 │ true           │       169 │ Female  │ Chrome         │ Search engine   │ Smartphone (IOS)     │
  └────────┴────────────┴────────────┴────────────────┴───────────┴─────────┴────────────────┴─────────────────┴──────────────────────┘
  ```
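  From here, you can analyze the dataset like any other ClickHouse® table. As an illustrative sketch, the following groups hits by browser; run the query in WebSQL, or through clickhouse-client as shown (the connection values are placeholders):

  ```bash
  # Placeholder connection details; copy the real ones from the Overview page.
  clickhouse-client \
    --host <Public FQDN of your ClickHouse cluster> \
    --port 9440 \
    --secure \
    --user admin \
    --password '<your cluster password>' \
    --query "SELECT Browser, count() AS hits FROM hits_storage GROUP BY Browser ORDER BY hits DESC"
  ```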
This is the end of the tutorial for the Apache Kafka® and ClickHouse® integration in DoubleCloud. Take a look at the articles listed below to learn more about both services and other combinations.