How to combine Apache Kafka® and ClickHouse® to create data streams and visualizations

Modern organizations have a multitude of web-based analytics platforms to choose from that help measure business-critical KPIs. Used correctly, the information these platforms gather allows organizations to make reliable, data-driven decisions to improve the usability of their applications, raise conversion rates, and meet other key business goals.

In this article, we’re going to show you how to build a real-time data pipeline for web analytics. Before we jump right into the technical details, let's think about the most common requirements and challenges we need to consider when building such a system.

A typical analytics solution will have three main components:

  • A data ingestion layer that accumulates the click stream data coming from the application

  • An analytical database that stores the data and handles the queries

  • A BI tool to easily visualize data, build business reports and dashboards, and generate insights from the data.

architecture-diagram

Diagram Legend:

  1. Users' actions are captured for further analysis as they browse a website or use a mobile application.

  2. Events need to be durably stored, and quite often they need to be transformed or enriched before they can be loaded into an analytical database. Modern event streaming platforms like Apache Kafka® combine fast ingestion with a rich ecosystem of components to process and transform your data.

  3. (Optional) Perform data transformations, such as joining data from multiple sources, and aggregate it if required. Write results of your transformations into a separate Apache Kafka® topic.

  4. ClickHouse® has a built-in Apache Kafka® plugin, developed and supported by an open source community, that allows you to consume messages directly from Apache Kafka®. DoubleCloud also has a managed data transfer tool that you can use instead of the plugin.

  5. ClickHouse® supports a lot of open source BI / Data visualization tools like Apache Superset and Grafana, as well as traditional BI tools like Tableau or PowerBI. DoubleCloud also has a BI tool called Visualization letting you build dashboards quickly and easily.

At the end of this tutorial, you will have a dashboard with charts. Data for these charts will stream from the data source to an Apache Kafka® cluster, then to the Managed ClickHouse® cluster and, after that, to the Visualization connection. You'll see how to build this sequence by following the steps below:

  1. Prepare your workspace

  2. Configure Managed Service for Apache Kafka®

    1. Create an Apache Kafka® cluster

    2. Create a topic

  3. Configure Managed Service for ClickHouse®

    1. Create a Managed ClickHouse® cluster

    2. Create a database in the cluster

  4. Transfer data between clusters

    1. Create a source endpoint

    2. Create a target endpoint

    3. Configure and activate a transfer

  5. Start the Apache Kafka® REST endpoint

  6. Send data to your Apache Kafka® REST endpoint

  7. Visualize data

    1. Create a workbook and a connection

    2. Configure the dataset based on connection

    3. Create a dashboard and add charts

    4. Add selectors

    5. Link the selectors and charts

Prepare your workspace

Before you start with this tutorial, you need to install software to use later:

  1. Install the clickhouse-client in one of the following ways:

    Docker:

    1. Open your terminal.

    2. (Optional) Start Docker if needed:

      service docker start
      
    3. Pull the clickhouse-client Docker image:

      docker pull clickhouse/clickhouse-client
      
    DEB (Debian, Ubuntu):

    1. Open your terminal.

    2. Connect to the ClickHouse® official DEB repository from your Linux system:

      sudo apt update && sudo apt install -y apt-transport-https ca-certificates dirmngr && \
      sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 8919F6BD2B48D754 && \
      echo "deb https://packages.clickhouse.com/deb stable main" | sudo tee \
      /etc/apt/sources.list.d/clickhouse.list
      
    3. Refresh the package list and install the clickhouse-client:

      sudo apt update && sudo apt install -y clickhouse-client
      
    RPM (CentOS, RHEL):

    1. Open your terminal.

    2. Connect to the ClickHouse® official RPM repository from your Linux system:

      sudo yum install -y yum-utils
      sudo yum-config-manager --add-repo https://packages.clickhouse.com/rpm/clickhouse.repo
      
    3. Install the clickhouse-client:

      sudo yum install -y clickhouse-client
      

    Warning

    If you run a RedHat 7-based Linux distribution, such as CentOS 7 or Oracle Linux 7, you need to download and install trusted certificates and manually add the path to them in the clickhouse-client configuration file as follows:

    1. Install the root certificate:

      curl https://letsencrypt.org/certs/isrg-root-x2-cross-signed.pem | sudo tee \
      /etc/pki/ca-trust/source/anchors/isrg-root-x2-cross-signed.pem
      
    2. Install the intermediate certificate:

      curl https://letsencrypt.org/certs/lets-encrypt-r3-cross-signed.pem | sudo tee \
      /etc/pki/ca-trust/source/anchors/lets-encrypt-r3-cross-signed.pem
      
    3. Update the list of trusted certificates:

      sudo update-ca-trust
      
    4. Locate your clickhouse-client configuration file (by default, you can find it at /etc/clickhouse-client/config.xml) and add the path to the certificates into the <openSSL> section:

      <client> <!-- Used for connection to server's secure tcp port -->
         <loadDefaultCAFile>true</loadDefaultCAFile>
         <cacheSessions>true</cacheSessions>
         <disableProtocols>sslv2,sslv3</disableProtocols>
         <preferServerCiphers>true</preferServerCiphers>
         <caConfig>/etc/ssl/certs/ca-bundle.crt</caConfig>
         <!-- Use for self-signed: <verificationMode>none</verificationMode> -->
         <invalidCertificateHandler>
         <!-- Use for self-signed: <name>AcceptCertificateHandler</name> -->
         <name>RejectCertificateHandler</name>
         </invalidCertificateHandler>
      </client>
      
  2. Install Python 3 and pip if you don't have them in your environment. Debian-based systems have Python 3 by default, so you usually need to install pip only.

    On Debian-based systems:

    sudo apt update && sudo apt install python3-venv python3-pip
    
    On RPM-based systems:

    sudo yum install -y python3 python3-pip
    
    

    Install the requests module:

    python3 -m pip install requests
    
  3. Take a look at the JSON file that contains 100 objects with sample user visits data. We'll download it later programmatically.
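To get a feel for the data, here is a hypothetical sketch of a single visit object. The field names (Date_String, Page, Completion_Rate, User_Engagement) are assumptions based on the fields used later in this tutorial; the actual visits.json schema may differ:

```python
import json

# Hypothetical sample record; the real visits.json schema may differ.
visit = {
    "Date_String": "2023-04-01",
    "Page": "/pricing",
    "Completion_Rate": 0.42,
    "User_Engagement": 37,
}

# Serialize the object the same way the upload script does later.
payload = json.dumps(visit)
print(payload)
```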

Configure Managed Service for Apache Kafka®

Website and mobile traffic tends to vary a lot during the day. It's prone to unpredictable traffic bursts which is why a distributed event streaming platform like Apache Kafka® is a popular architectural choice for the data ingestion layer. In addition to providing throughput scalability and durable storage, an event streaming platform also offers other important advantages:

  • The ability to support multiple independent consumers

  • Replay of captured events, when required, in the same order they arrived

  • Data enrichment, transformation, and aggregation before the data is loaded into the database
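To illustrate the replay idea, here is a toy Python sketch (not the Kafka API) of an append-only log that a consumer can re-read from any offset, in the original order:

```python
# Toy model of a single partition: an append-only log.
log = []

def produce(event):
    """Append an event and return its offset in the log."""
    log.append(event)
    return len(log) - 1

def consume(from_offset=0):
    """Re-read events from any earlier offset, in arrival order."""
    return log[from_offset:]

for e in ["page_view", "click", "purchase"]:
    produce(e)

assert consume(0) == ["page_view", "click", "purchase"]  # full replay, same order
assert consume(1) == ["click", "purchase"]               # replay from an offset
```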

Create an Apache Kafka® cluster

The first infrastructure entity you need to create is an Apache Kafka® cluster with brokers and at least one topic:

Warning

During the trial period, you can create clusters with up to 8 cores, 32 GB RAM, and 400 GB storage. If you need to raise the quotas, don't hesitate to contact our support.

  1. Go to the console.

  2. Select Clusters from the list of services on the left.

  3. Click Create cluster in the upper-right corner of the page.

  4. Select Apache Kafka®.

  5. Choose a provider and a region.

    1. Under Resources:

      • Select the s1-c2-m4 preset for CPU, RAM capacity, and storage space to create a cluster with minimal configuration.

      • Select the number of zones and brokers. The number of zones multiplied by the number of brokers determines the number of hosts. This tutorial creates a cluster with two zones and two brokers.

    2. Under Basic settings:

      • Enter the cluster Name, for example, web_analytics_stream.

      • Select the version of Apache Kafka® for your cluster from the Version drop-down list. For most clusters, we recommend using the latest version.

    3. Under Networking → VPC, specify in which DoubleCloud VPC to locate your cluster. Use the default value in the previously selected region if you don't need to create this cluster in a specific network.

    4. Click Submit.

Your cluster will appear with the Creating status on the Clusters page in the console. Setting everything up may take some time. When the cluster is ready, it changes its state to Alive.

Create a topic

When you've created a cluster, create a topic in it:

  1. On the cluster's page, go to the Topics tab.

  2. Click Create.

  3. Under Topic Settings, specify the connection properties:

    • Cleanup policy - Delete. This policy deletes log segments when their retention time or log size reaches the limit.

    • Compression Type - Uncompressed. We don't need compression for our use case. Let's disable it.

    • Retention Bytes - 1048576 (1 MB).

    • Retention Ms - 600000 (10 minutes).

  4. Specify the Basic Settings:

    • Name

      A topic's name. Let's call it analytics-topic.

    • Partitions

      The number of partitions in the topic. Keep 1 to create the simplest topic.

    • Replication factor

      Specifies the number of copies of a topic in a cluster. This parameter's value should not exceed the number of brokers in the cluster. Let's keep the default setting - 1.

  5. Click Submit.
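As a rough mental model of the Delete cleanup policy, here is a toy Python sketch (not Kafka's actual segment logic) of how the retention limits above interact: segments are dropped when they exceed the retention time, and the oldest segments are evicted when the total size exceeds the retention bytes:

```python
# Toy sketch of log cleanup under cleanup.policy=delete.
def prune(segments, retention_bytes, retention_ms, now_ms):
    # Drop segments older than the retention time.
    kept = [s for s in segments if now_ms - s["created_ms"] <= retention_ms]
    # Then evict oldest segments until the total size fits the limit.
    while sum(s["size"] for s in kept) > retention_bytes:
        kept.pop(0)
    return kept

segments = [
    {"size": 700_000, "created_ms": 0},
    {"size": 700_000, "created_ms": 500_000},
]
# With the tutorial's settings (1 MB, 10 minutes), at t = 550 000 ms both
# segments are within the age limit, but the size limit evicts the oldest one.
assert prune(segments, 1_048_576, 600_000, 550_000) == [segments[1]]
```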

Configure Managed Service for ClickHouse®

Now, let's look closer at the database layer. There are plenty of options available among cloud-native and open source solutions, each with its pros and cons, making it hard to choose. To make a good web analytics database, the solution has to be able to ingest data quickly, store it efficiently, and provide fast access to the data using SQL.

ClickHouse® was built from the ground up to serve real-time web analytics at the scale of billions of clicks per day.

It features columnar storage with multiple data encodings and compression codecs, rich SQL support with numerous functions, and, with the recent support for asynchronous inserts, it can handle thousands of inserts per second per node. ClickHouse® also uses a multi-leader replication model, so you can write to and read from any node, and data is available to query immediately after insertion.

ClickHouse® is also known for its query processing speed, as fast as single-digit milliseconds in the best-case scenarios, which makes it a great option for latency-sensitive workloads. ClickHouse® can act as an Apache Kafka® consumer and producer, which greatly reduces the complexity of the data pipeline.
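The "fewer, larger inserts" idea behind asynchronous inserts can be sketched client-side as a simple batching buffer. This is a toy illustration of the design choice, not ClickHouse's implementation:

```python
# Toy batching buffer: accumulate rows and hand them to a sink in batches,
# mimicking the "batch before insert" pattern ClickHouse favors.
class BatchingWriter:
    def __init__(self, flush_size, sink):
        self.flush_size = flush_size
        self.sink = sink          # callable that receives a list of rows
        self.buffer = []

    def write(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(list(self.buffer))
            self.buffer.clear()

batches = []
w = BatchingWriter(flush_size=3, sink=batches.append)
for i in range(7):
    w.write(i)
w.flush()  # flush the final, partial batch
assert batches == [[0, 1, 2], [3, 4, 5], [6]]
```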

Create a Managed ClickHouse® cluster

Warning

During the trial period, you can create clusters with up to 8 cores, 32 GB RAM, and 400 GB storage. If you need to raise the quotas, don't hesitate to contact our support.

  1. Select Clusters from the list of services on the left.

  2. Click Create cluster in the upper-right corner of the page.

    1. Select ClickHouse®.

    2. Choose a provider and a region.

    3. Under Resources:

      • Select the s1-c2-m4 preset for CPU, RAM capacity, and storage space to create a cluster with minimal configuration.

      • Choose a number of replicas. Let's keep it as is with a single replica.

      • Select a number of shards. Keep a single shard.

    4. Under Basic settings:

      • Enter the cluster Name, in this tutorial - data_storage

      • From the Version drop-down list, select the ClickHouse® version the Managed ClickHouse® cluster will use. For most clusters, we recommend using the latest version.

    5. Under Networking → VPC, specify in which DoubleCloud VPC to locate your cluster. Use the default value in the previously selected region if you don't need to create this cluster in a specific network.

    6. Click Submit.


Your cluster will appear with the Creating status on the Clusters page in the console. Setting everything up may take some time. When the cluster is ready, it changes its state to Alive.

Tip

The DoubleCloud service creates the superuser admin and its password automatically. You can find both the User and the Password in the Overview tab on the cluster information page.

To create users for other roles, see Manage ClickHouse® users.

Create a database in the cluster

  1. Run the following command in your Linux terminal. It contains the string from the Native interface field on the cluster Overview page:

    docker run --network host --rm -it clickhouse/<Native interface connection string>
    
    The complete Docker command structure:

    docker run --network host --rm -it \
                clickhouse/clickhouse-client \
                --host <FQDN of your cluster> \
                --secure \
                --user admin \
                --password <Cluster user password> \
                --port 9440
    
  2. Create a database:

    CREATE DATABASE IF NOT EXISTS "analytics_db"
    

Tip

Alternatively, you can use a visual IDE of your choice to connect to a cluster and create a database.

Transfer data between clusters

Now it's time to configure a transfer between your clusters to keep their databases in sync.

To accomplish this task, you need to take three steps:

  1. Create an Apache Kafka® source endpoint.

    This tool will take data from your Apache Kafka® cluster and prepare it for transfer.

  2. Create a ClickHouse® target endpoint.

    This tool will receive the processed data from the source endpoint and replicate it to your ClickHouse® database as table rows.

  3. Configure and activate a transfer.

    This tool will perform the transfer of the data from Apache Kafka® topic to your ClickHouse® database.

Create a source endpoint

  1. Go to the console.

  2. In the list of services, select Transfer.

  3. Select Endpoints tab, click Create endpoint and select Source.

  4. Select the Apache Kafka® Source type.

  5. Under Basic settings:

    • Enter the Name of the endpoint - kafka-source.
  6. Under Endpoint settings → Managed cluster:

    1. Select your Managed Apache Kafka® cluster ID from the drop-down list.

    2. Specify Authentication credentials:

      • Your User Name - admin

      • Password for your cluster

  7. Specify the existing Topic full name. In this case, it's analytics-topic.

    Your endpoint settings should look as follows:

    kafka-endpoint-settings

  8. Under Conversion rules:

    1. Click + Conversion rules.

    2. From the dropdown menu, select the JSON Data format.

    3. Choose the Field list Data Scheme.

    4. Click + Field and specify the following field properties as shown on the screenshot below. You need to specify this information for each field you want to transfer:

      1. The Name of the field.

      2. The field Type.

      3. Provide the Path to redefine the names of columns in the table.

    kafka-endpoint-fields

  9. Click Submit.
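Conceptually, each conversion rule maps a name, a type, and a path onto the incoming JSON object. The following is a hypothetical Python sketch of that behavior; the actual Transfer service logic and its type names may differ:

```python
# Hypothetical casts keyed by a field's declared type.
CASTS = {"string": str, "double": float, "int64": int}

def apply_rules(event, rules):
    """Build a typed row from a raw JSON event using a list of field rules."""
    row = {}
    for rule in rules:
        value = event.get(rule["path"])       # Path: where to read in the JSON
        row[rule["name"]] = CASTS[rule["type"]](value)  # Name + Type: the column
    return row

rules = [
    {"name": "Page", "type": "string", "path": "Page"},
    {"name": "Completion_Rate", "type": "double", "path": "Completion_Rate"},
]
event = {"Page": "/pricing", "Completion_Rate": "0.42"}
assert apply_rules(event, rules) == {"Page": "/pricing", "Completion_Rate": 0.42}
```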

Create a target endpoint

  1. In the list of services, select Transfer.

  2. Open the Endpoints tab, click Create endpoint and select Target.

  3. In the Target type, choose the database type you want to transfer data to.

  4. Under Basic settings:

    1. Enter the Name of the endpoint - clickhouse-target.
  5. Specify endpoint parameters under Endpoint settings.

  6. Select the Managed cluster Connection type.

  7. Under Managed cluster, select your data_storage cluster from the drop-down list.

  8. Specify the connection properties:

    1. Specify admin as the User.

    2. Enter the cluster Password.

    3. Enter the name of the Database we created earlier - analytics_db. When you activate the transfer in the next step, the Transfer service will create a table for the transferred data in this database.

  9. You don't need to specify any other settings for this scenario, so you can click Submit.

Configure and activate a transfer

  1. In the list of services, select Transfer.

  2. Select the Transfer tab.

  3. Click Create transfer.

  4. Under Endpoints:

    1. Select the kafka-source Source endpoint.

    2. Select the clickhouse-target Target endpoint.

  5. Under Basic settings:

    1. Enter the transfer Name - analytics-kafka-clickhouse
  6. Under Transfer settings, select the Increment Transfer type. It transfers the data and keeps the target database in sync with the source database.

  7. Under Runtime, keep the default Serverless runtime setting.

    Your transfer should look as follows:

    transfer-config

  8. Click Submit.

It's time to start the data transfer process:

  1. In the list of services, select Transfer.

  2. Select the Transfer tab.

  3. To the right of the transfer name, click Activate.

After you've activated the transfer, all the data that will appear or is currently present in the source database will be continuously transferred to the target database.

Start the Apache Kafka® REST endpoint

To emulate the work of a real-world application, we will send data through a special REST endpoint:

  1. Configure your environment. Create a .env file with the following contents:

    KAFKA_REST_LISTENERS=http://0.0.0.0:8082
    KAFKA_REST_BOOTSTRAP_SERVERS=<hostname from the Overview tab>:9091
    KAFKA_REST_CLIENT_SECURITY_PROTOCOL=SASL_SSL
    KAFKA_REST_CLIENT_SASL_MECHANISM=SCRAM-SHA-512
    KAFKA_REST_CLIENT_SASL_JAAS_CONFIG=org.apache.kafka.common.security.scram.ScramLoginModule required username="admin" password="<your cluster password>";
    
  2. Navigate to the directory with this file and run the container in a separate terminal:

    docker run --network host --env-file .env -it confluentinc/cp-kafka-rest
    

Keep this terminal open so that the endpoint will receive and send data.

Send data to your Apache Kafka® REST endpoint

After successfully creating your clusters and the REST endpoint, you can now upload data to your DoubleCloud Managed ClickHouse® cluster through Apache Kafka®. This tutorial creates a simple Python script to perform an HTTP query to your REST endpoint:

  1. Copy the following code:

    send-data.py

     import json
     import requests
    
     headers = {
         "Content-Type": "application/vnd.kafka.json.v2+json",
         "Accept": "application/vnd.kafka.v2+json, application/vnd.kafka+json, application/json"
     }
    
     URL = 'https://doublecloud-docs.s3.eu-central-1.amazonaws.com/visits.json'
    
     source = json.loads(requests.get(URL).text)
    
     for obj in source:
    
         # Prepare the object structure
         # required by the Kafka® REST endpoint
         processedData = {"records": [{"value": obj}]}
    
         # json.dumps() converts an object into a JSON string
         res = requests.post('http://localhost:8082/topics/analytics-topic',
                             headers=headers, data=json.dumps(processedData))
         print(res)
    
    

    Before you run this script, ensure that your Apache Kafka® REST endpoint and the analytics-kafka-clickhouse Transfer are running.

  2. (Optional) Run an Apache Kafka® consumer to see if the data is successfully received by the Apache Kafka® cluster.

    Pull the kcat image from Docker Hub. We use version 1.7.1, but you can use the latest one:

    docker pull edenhill/kcat:1.7.1
    

    Run the following code in a separate terminal to create a consumer:

    docker run --network host --rm -it \
                edenhill/kcat:1.7.1 -C \
                -b <Hostname from the cluster page>:9091 \
                -t analytics-topic \
                -X security.protocol=SASL_SSL \
                -X sasl.mechanisms=SCRAM-SHA-512 \
                -X sasl.username="admin" \
                -X sasl.password="<your cluster password>" \
                -Z
    

    The data received by the Apache Kafka® cluster will appear here for each JSON object.

    Alternatively, install kafkacat from your Linux distribution's repository:

    sudo apt install kafkacat
    

    Run the following code in a separate terminal to create a consumer:

    kafkacat -C \
             -b <Hostname from the cluster page>:9091 \
             -t analytics-topic \
             -X security.protocol=SASL_SSL \
             -X sasl.mechanisms=SCRAM-SHA-512 \
             -X sasl.username="admin" \
             -X sasl.password="<your cluster password>" \
             -Z
    
  3. Run the above script:

    python3 send-data.py
    
  4. (Optional) Connect to your Managed ClickHouse® cluster to see if the data is successfully transferred:

    docker run --network host --rm -it clickhouse/clickhouse-client \
                --host <domain name of the host> \
                --secure \
                --user <Cluster user name> \
                --password <Cluster user password> \
                --port 9440 \
                -q "SELECT * FROM analytics_db.analytics_topic"
    
    Alternatively, append the query to the Native interface connection string:

    <Native interface connection string> --query "SELECT * FROM analytics_db.analytics_topic"
    

Visualize data

You now have a scalable data ingestion layer based on Apache Kafka® and ClickHouse® as a high-speed and scalable OLAP database.

Now it's time to think about the tools for visual representation of the data and business intelligence.

This section will show you how to visualize your data. The main steps include creating the main Visualization entities:

  • Workbook
  • Connection
  • Dataset
  • Dashboard and charts.

Once you complete this stage, you'll finish this scenario and have an analytics dashboard with real-time updates.

Create a workbook and a connection

Workbooks are visualization environments for your data that contain all other Visualization entities.

  1. Go to the console.

  2. Select Visualization from the list of services on the left.

  3. Click Create a workbook in the upper-right corner of the page.

  4. Give a name to your workbook (Analytics) and click Create.

When you have a workbook, you can open it and create the first element you need to visualize your data - a Connection. The connections access hosts in your cluster to fetch data for processing and visualization.

  1. Click Create → Connection in the upper-right corner of your workbook page.

  2. Select ClickHouse®.

  3. Specify the connection parameters:

    • Hostname. Copy the host's name from the Hosts tab on the information page of your cluster.
    • HTTP Interface Port (by default, 8443).
    • Username and Password for your cluster. You can find them in the Overview tab on the information page of your cluster.
  4. Leave other settings unchanged and click Check connection. Your connection settings should look like this:

    connection-settings

  5. If your connection is ready to operate, click Create.

  6. In the dialog window, name your connection data_storage_connection.

Configure the dataset based on connection

  1. Click Create dataset in the upper-right corner of your connection page.

    You'll see the data_storage_connection already selected.

  2. Drag and drop the analytics_topic to the section on the right.

  3. Click Save in the upper-right corner of your dataset page, name it (web_data) and click Create.

At this point, you have a dataset whose data is automatically cast to specific types. Now, you can process this data to prepare it for visualization.

Let's add a calculated field. Calculated fields are helpful when you need to transform data somehow. In this example, we'll use the DATE_PARSE function to convert a string value to the Date type.

  1. Go to the Fields tab.

  2. Click + Add Field and stay on the Formula tab.

  3. Name your field DATE_PARSE.

  4. Type the following to apply the correct data type to the Date_String field:

    DATE_PARSE([Date_String])
    
  5. Click Create.

  6. Click Save in the upper-right corner.
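In Python terms, the conversion the DATE_PARSE formula performs is roughly the following. This sketch assumes an ISO YYYY-MM-DD input string; the real function accepts more formats:

```python
from datetime import datetime, date

# Rough Python analogue of DATE_PARSE for an ISO-formatted string:
# turn a string column value into a proper Date value.
def date_parse(s: str) -> date:
    return datetime.strptime(s, "%Y-%m-%d").date()

assert date_parse("2023-04-01") == date(2023, 4, 1)
```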

As a result, the first lines of your dataset might look as follows:

dataset-lines

Create a dashboard and add charts

A dashboard is a customizable data display that allows you to create visualization elements, such as charts of different kinds and selectors.

  1. Click Create → Dashboard in the upper-right corner of your workbook page.

  2. Give a name to your dashboard - Analytics dashboard, and click Create.

  3. Click Add → Chart (Wizard).

  4. Select your web_data dataset from the list.

  5. Click the chart type field on the right of the dataset section and select the Line chart type.

  6. Drag and drop the DATE_PARSE field from the Dimensions section to the chart's X slot. The chart will redraw the corresponding axis with values.

  7. Drag the Completion_Rate field to the Y slot and the User_Engagement field to the Y2 slot. At this point, you have the simplest chart ready to operate.

  8. Click Save.

  9. Give a name to your chart, for example, Pageviews by date and click Done.

  10. Save your dashboard changes.

The sequence of steps above illustrates how to create most DoubleCloud chart types. Let's add the second chart - a table:

  1. Click Add → Chart (Wizard).

  2. Select your dataset - web_data - from the list.

  3. Click the chart type field on the right of the dataset section and select Table.

  4. Drag and drop the fields to display to the Columns section. We selected Page, Completion_Rate, and User_Engagement.

    Tip

    When your dataset contains a lot of data, it's a good idea to also create a Chart filter to limit the number of visualized records. The dataset in the current example contains 100 lines, so a filter isn't required.

  5. Enable Pagination and specify 20 records per page.

  6. Click Save.

  7. Give a name to your table, for example Pages Data and then click Done.

  8. Save your dashboard changes.

Add selectors

Selectors allow you to filter values of charts on a dashboard. Each selector works only with charts that are in the same tab with it.

  1. Click Edit in the upper-right corner of the dashboard.

  2. Click Add → Selector.

  3. Keep the selection - Based on dataset.

  4. Select the web_data Dataset.

  5. Select the Page Field.

  6. Keep the List Selector type to display the field values in a dropdown menu.

  7. Select the Operation - Belongs to a set.

  8. Enable Multiple choice.

  9. Under Default values, click Select all.

  10. Enter the selector name - Pages.

  11. Click Add.

  12. Click Save in the upper-right corner of the wizard to apply changes.

You can use the selector to choose one or more pages for which you want to see the information on your chart.

At this point, you should have a dashboard that looks like this:

first-tab

This is the end of the tutorial on using DoubleCloud Apache Kafka® and ClickHouse® clusters, Transfer, and Visualization. There are lots of other scenarios that you can implement with our services.

Take a look at the list below to learn more about our services:

See also

Apache Kafka®

ClickHouse®

Transfer

Visualization