Click Create cluster in the upper-right corner of the page.
Select Apache Kafka®.
Choose a provider and a region. Try to create your Apache Kafka® and ClickHouse® clusters in the same region.
Under Resources:
Select the s2-c2-m4 preset for CPU, RAM capacity, and storage space to create a cluster with minimal configuration.
Select the number of zones and brokers. The number of hosts equals the number of zones multiplied by the number of brokers per zone. This tutorial creates a cluster with two zones and two brokers per zone.
Under Basic settings:
Enter the cluster Name, for example, kafka-tutorial-cluster.
Select the version of Apache Kafka® for your cluster from the Version drop-down list. For most clusters, we recommend using the latest version.
Click Submit.
Your cluster will appear with the Creating status on the Clusters page in the console. Setting everything up may take some time. When the cluster is ready, it changes its state to Alive.
Create a topic
When you've created a cluster, create a topic in it:
On the cluster's page, go to the Topics tab.
Click Create.
Under Topic Settings, specify the topic properties:
Cleanup policy - Delete. This policy deletes log segments when their retention time expires or the log reaches its maximum size.
Compression Type - Uncompressed. We don't need compression for this use case, so let's disable it.
Retention Bytes - 1048576 (1 MB).
Retention Ms - 600000 (10 minutes).
Specify the Basic Settings:
Name
The topic's name. Let's call it hits_topic.
Partitions
The number of partitions in the topic. Keep 1 to create the simplest topic.
Replication factor
The number of copies of a topic in a cluster. This value must not exceed the number of brokers in the cluster. Let's keep the default setting - 1.
Click Submit.
Create a Managed ClickHouse® cluster and configure settings
Click Create cluster in the upper-right corner of the page.
Select ClickHouse®.
Choose a provider and a region. Try to create your Apache Kafka® and ClickHouse® clusters in the same region.
Under Resources:
Select the s2-c2-m4 preset for CPU, RAM capacity, and storage space to create a cluster with minimal configuration.
Choose the number of replicas. Let's keep a single replica.
Select the number of shards. Keep a single shard.
Under Basic settings:
Enter the cluster Name, for example, clickhouse-tutorial-cluster.
From the Version drop-down list, select the ClickHouse® version the Managed ClickHouse® cluster will use. For most clusters, we recommend using the latest version.
Under Advanced → Cluster settings → clickhouseConfig → kafka, specify the details of the connection to your previously created Apache Kafka® cluster:
securityProtocol: SASL_SSL
saslMechanism: SCRAM-SHA-512
saslUsername: Your Apache Kafka® cluster username
saslPassword: Your Apache Kafka® cluster password
Warning
If you apply these settings to an already existing cluster, restart the cluster to apply the changes.
Click Submit.
Your cluster will appear with the Creating status on the Clusters page in the console. Setting everything up may take some time. When the cluster is ready, it changes its state to Alive.
Tip
DoubleCloud creates the admin superuser and its password automatically.
You can find the username and password under Credentials in the Overview tab on the cluster page.
Set up integration to store and process data from Apache Kafka® topic
Access the ClickHouse® cluster
Reading data from Apache Kafka® topics requires two different tables and a materialized view. The reason is that you can't query the Kafka table directly - it operates only as a consumer. To convert the data to the required format, create a materialized view that reads from the Kafka table. Finally, create a table with an engine that can store and process the data.
Note the concept of a Kafka consumer group: read offsets are tracked separately for each group. If you don't want messages to be duplicated in the ClickHouse® cluster, use the same group name everywhere.
In WebSQL, click on any database in the connection manager on the left to open the query editor.
Create a table with the Kafka engine to receive messages from topics using the following statement.
Communication between different DoubleCloud services is usually done with private FQDN addresses and a special port. You can find them under the Connection strings → Private tab on the overview page.
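The exact statement depends on the structure of your data. Here's a minimal sketch, assuming the hits_topic topic created earlier, a hypothetical consumer group named clickhouse-group, and a few hypothetical columns standing in for the actual fields of hits_sample.json:

CREATE TABLE topic_messages
(
    -- hypothetical columns; replace them with the actual fields of hits_sample.json
    user_id Int64,
    url String,
    event_time DateTime
)
ENGINE = Kafka
SETTINGS
    kafka_broker_list = '<Apache Kafka® cluster private FQDN>:9091',
    kafka_topic_list = 'hits_topic',
    kafka_group_name = 'clickhouse-group', -- hypothetical name; use the same group name everywhere
    kafka_format = 'JSONEachRow';

The table name topic_messages matches the one the materialized view below reads from.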
Create a ReplicatedMergeTree table to permanently store and process data. After you create this table, it will be empty - the data will come from the materialized view.
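A sketch of this table, assuming the same hypothetical columns as in topic_messages above; hits_storage is the table the materialized view below writes to:

CREATE TABLE hits_storage
(
    -- same hypothetical columns as in topic_messages
    user_id Int64,
    url String,
    event_time DateTime
)
ENGINE = ReplicatedMergeTree -- managed clusters typically fill in the replication parameters automatically
ORDER BY (user_id, event_time);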
Create a materialized view to copy message data from the Kafka table to the MergeTree table. The materialized view collects data in the background, which lets you continuously receive messages from Kafka and convert them to the required format using a SELECT statement.
CREATE MATERIALIZED VIEW materialized_messages TO hits_storage
AS SELECT * FROM topic_messages;
Run the data pipeline
The pipeline consists of the following entities:
The Apache Kafka® producer
A ClickHouse® table with the Kafka engine as a consumer
A materialized view
A table to which the materialized view copies data
To send data through this pipeline, access the JSON file from our Amazon S3 bucket and send the data from this file to the Apache Kafka® topic.
Create a consumer and a producer
(Optional) Run an Apache Kafka® consumer to check that the Apache Kafka® cluster receives the data. The data will arrive in your ClickHouse® cluster either way, but a consumer makes it easier to track the process and debug it if needed. A sketch of such a command is shown below.
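This sketch uses kcat in Docker with the same FQDN, username, and password placeholders as the producer command further below:

docker run --rm -i edenhill/kcat:1.7.1 \
    -C \
    -b <Cluster FQDN from the Overview page>:9091 \
    -t hits_topic \
    -X security.protocol=SASL_SSL \
    -X sasl.mechanisms=SCRAM-SHA-512 \
    -X sasl.username="<username>" \
    -X sasl.password="<password>"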
Now, when you run the producer in a separate terminal, the messages it sends will also appear in the consumer terminal.
Execute the following command in a separate terminal instance to create a producer and push the data. This command also processes the JSON file into the format the topic requires:
curl https://doublecloud-docs.s3.eu-central-1.amazonaws.com/data-sets/hits_sample.json \
| docker run -i stedolan/jq -c . \
| docker run --name kcat --rm -i edenhill/kcat:1.7.1 \
    -P \
    -b <Cluster FQDN from the Overview page>:9091 \
    -t hits_topic \
    -X security.protocol=SASL_SSL \
    -X sasl.mechanisms=SCRAM-SHA-512 \
    -X sasl.username="<username>" \
    -X sasl.password="<password>"
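After the producer finishes, you can check that the data reached the storage table created earlier. For example, run a quick count in WebSQL:

SELECT count() FROM hits_storage;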
This is the end of the tutorial for the Apache Kafka® and ClickHouse® integration in DoubleCloud. Take a look at the articles listed below to see more about both services and other combinations.