ClickHouse as storage engine for MySQL? An experimental approach

Written By: Stefan Kaeser, DoubleCloud Senior Solution Architect

Sometimes I just love reading technical blog posts, even if it isn’t about a topic that brings you any specific value for a practical problem you have right now. Still it can open your mind, to think differently about certain things.

Lately I read a blog from the great LeFred on How to use MySQL FEDERATED engine?

His blog reminded me of an experiment I did several years ago, about possibilities to introduce ClickHouse into our production system of my company during that time: Is it possible to use ClickHouse as a storage engine within your MySQL environment?

To give a short Summary: Yes, you can. But most of the time you should not!

Prerequisites

To be able to do the experiments yourself, you need to have a basic setup. I used an t3.large AWS EC2 instance (2 vCPU Cores, 8GB Ram, 256GB gp3 storage) with Ubuntu 22.04 preinstalled.

On top of it I installed Percona Server for MySQL 8, but other flavors of MySQL or even MariaDB should also work as well.

Last but not least I installed the latest ClickHouse Server Version 23.3 LTS on the same host.

As test data I used my well beloved weather dataset, containing daily weather data since 1900, as described in my (De-)Normalization Blog Post last year. The exact dataset is not very important, as long as it’s big enough to test real use cases and see problems early, you should be able to build your experiment with your own data quite easily.

Configure MySQL

As mentioned in LeFreds blog, per default MySQL has not enabled the federated engine, as it is not under active development right now (State: April 2023). But fortunately it’s not hard to enable it.

If you use Percona Server 8 as i do, you just need to add one line to your /etc/mysql/my.cnf file in section [mysqld]:

federated

After a restart of MySQL server you can check if the setting works by issuing the SHOW ENGINES; command within your MySQL client:

SHOW ENGINES\G
...
...
*************************** 7. row ***************************
      Engine: FEDERATED
    Support: YES
    Comment: Federated MySQL storage engine
Transactions: NO
          XA: NO
  Savepoints: NO

Nothing more to prepare for MySQL.

Configure ClickHouse

Next we need to make sure that ClickHouse can be queried via the MySQL wire protocol. Open /etc/clickhouse-server/config.xml and check for this setting:

<mysql_port>9004</mysql_port>

You can double check everything by connecting to ClickHouse via the mysql client:

mysql -h 127.0.0.1 -P 9004 -u default --password
Enter password:

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 23.3.1.2823-ClickHouse

Copyright (c) 2009-2023 Percona LLC and/or its affiliates
Copyright (c) 2000, 2023, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>

Start the experiment

Create structure in MySQL

First of all, we will create our basic structure in MySQL, to have something to compare our later results with:

CREATE DATABASE weather;
USE weather;
CREATE TABLE `sensor_data` (
  `station_id` char(11) CHARACTER SET ascii COLLATE ascii_bin 
        NOT NULL,
  `date` date NOT NULL,
  `tempAvg` int DEFAULT NULL,
  `tempMax` int DEFAULT NULL,
  `tempMin` int DEFAULT NULL,
  `precipitation` int unsigned DEFAULT NULL,
  `snowfall` int unsigned DEFAULT NULL,
  `snowDepth` int unsigned DEFAULT NULL,
  `percentDailySun` tinyint unsigned DEFAULT NULL,
  `averageWindSpeed` int unsigned DEFAULT NULL,
  `maxWindSpeed` int unsigned DEFAULT NULL,
  `weatherType` enum('Normal','Fog','Heavy Fog','Thunder',
      'Small Hail','Hail','Glaze','Dust/Ash', 
      'Smoke/Haze', 'Blowing/Drifting Snow',
      'Tornado','High Winds','Blowing Spray',
      'Mist','Drizzle', 'Freezing Drizzle','Rain','Freezing Rain',
      'Snow','Unknown Precipitation','Ground Fog','Freezing Fog')
      DEFAULT NULL,
  PRIMARY KEY (`date`,`station_id`),
  KEY `station` (`station_id`)
) ENGINE=InnoDB;

After filling in the data (1 billion rows), it results in about 70GB, so nearly ten times of total memory of the instance itself.

We of course don’t expect a high speed, but we can get results in a somewhat reasonable time:

Q1:
SELECT COUNT(DISTINCT date) AS days, COUNT(DISTINCT station_id) AS stations, COUNT(*) AS data
FROM weather.sensor_data
WHERE date BETWEEN '2020-01-01' AND '2021-12-31'
  AND station_id LIKE 'US%';

+------+----------+----------+
| days | stations | data     |
+------+----------+----------+
|  731 |32429  | 16020472 |
+------+----------+----------+
1 row in set (26.38 sec)

Q2:
SELECT ROUND(AVG(tempAvg) / 10, 1) AS avgTemp, 
  TRUNCATE(YEAR(date), -1) AS dekade
FROM weather.sensor_data
WHERE date >= '1970-01-01'
GROUP BY dekade
ORDER BY dekade;
+---------+--------+
| avgTemp | dekade |
+---------+--------+
|     8.9 |   1970 |
|     9.9 |   1980 |
|    10.2 |   1990 |
|    10.8 |   2000 |
|    10.7 |   2010 |
|    10.8 |   2020 |
+---------+--------+
6 rows in set (8 min 42.05 sec)

Next we create the same structure in ClickHouse

Create structure in ClickHouse

We create the structure in ClickHouse quite similar to MySQL, just use some small ClickHouse specific optimizations

CREATE DATABASE weather;
USE weather;
CREATE TABLE sensor_data
(
  `station_id` LowCardinality(String),
  `date` Date32 CODEC(Delta(4), ZSTD(1)),
  `tempAvg` Nullable(Int32),
  `tempMax` Nullable(Int32),
  `tempMin` Nullable(Int32),
  `precipitation` Nullable(UInt32),
  `snowfall` Nullable(UInt32),
  `snowDepth` Nullable(UInt32),
  `percentDailySun` Nullable(UInt8),
  `averageWindSpeed` Nullable(UInt32),
  `maxWindSpeed` Nullable(UInt32),
  `weatherType` Nullable(Enum8('Normal' = 0, 'Fog' = 1, 
        'Heavy Fog' = 2, 'Thunder' = 3, 'Small Hail' = 4, 'Hail' = 5, 
        'Glaze' = 6, 'Dust/Ash' = 7, 'Smoke/Haze' = 8, 
        'Blowing/Drifting Snow' = 9, 'Tornado' = 10, 
        'High Winds' = 11, 'Blowing Spray' = 12, 'Mist' = 13, 
        'Drizzle' = 14, 'Freezing Drizzle' = 15, 'Rain' = 16, 
        'Freezing Rain' = 17, 'Snow' = 18, 
        'Unknown Precipitation' = 19, 'Ground Fog' = 21, 
        'Freezing Fog' = 22))
)
ENGINE = MergeTree
PARTITION BY toYear(date)
ORDER BY (station_id, date)
SETTINGS index_granularity = 8192

In ClickHouse because of very good compression, the same data only takes up 3.4GB, so it would even fit into memory of the machine, but as we are running ClickHouse on the same host as MySQL, the memory usage will be shared so we won’t be able to profit from this.

Still as we all expect the Queries on ClickHouse run a lot faster, even without the possibility to use multiple threads (as the instance only has 2 vCPU, hence issuing only half of them)

Q1:
SELECT COUNT(DISTINCT date) AS days, 
  COUNT(DISTINCT station_id) AS stations, COUNT(*) AS data
FROM weather.sensor_data
WHERE date BETWEEN '2020-01-01' AND '2021-12-31'
AND station_id LIKE 'US%';

Query id: 9587c875-24af-443b-aadd-6c5b3264cdf0

┌─days─┬─stations─┬─────data─┐
│  731 │    32429 │ 16020472 │
└──────┴──────────┴──────────┘

1 row in set. Elapsed: 1.173 sec. Processed 16.03 million rows, 145.68 MB (13.67 million rows/s., 124.22 MB/s.)

Q2:
SELECT
  ROUND(AVG(tempAvg) / 10, 1) AS avgTemp,
  TRUNCATE(YEAR(date), -1) AS dekade
FROM weather.sensor_data
WHERE date >= '1970-01-01'
GROUP BY dekade
ORDER BY dekade ASC


┌─avgTemp─┬─dekade─┐
│     8.9 │   1970 │
│     9.9 │   1980 │
│    10.2 │   1990 │
│    10.8 │   2000 │
│    10.7 │   2010 │
│    10.8 │   2020 │
└─────────┴────────┘

6 rows in set. Elapsed: 4.757 sec. Processed 596.29 million rows, 5.37 GB (125.35 million rows/s., 1.13 GB/s.)

Connect MySQL to ClickHouse

As ClickHouse can speak the MySQL wire protocol as we’ve seen before, it can simulate to be a MySQL Server. And as the sensor_data table in ClickHouse only needs 1:20 of space compared to MySQL, we would like to access the data in ClickHouse directly from within MySQL, that’s where the federated storage engine comes in, as it lets us do exactly that.

First of all, we need to create a connection within MySQL to the ClickHouse server by using the CREATE SERVER command:

CREATE SERVER localch
FOREIGN DATA WRAPPER mysql
OPTIONS (USER 'default', HOST '127.0.0.1', PORT 9004, DATABASE 'weather', PASSWORD '*****');

We gave the connection the name localch and pointed it to our ClickHouse instance based on the same host. Within the instance we named the database weather to be our source database.

Next we only need to create our remote table. As we already have a table in MySQL with the same structure, we can just copy the definition, and just change the engine part:

CREATE TABLE `sensor_data_ch` (
  `station_id` char(11) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
  `date` date NOT NULL,
  `tempAvg` int DEFAULT NULL,
  `tempMax` int DEFAULT NULL,
  `tempMin` int DEFAULT NULL,
  `precipitation` int unsigned DEFAULT NULL,
  `snowfall` int unsigned DEFAULT NULL,
  `snowDepth` int unsigned DEFAULT NULL,
  `percentDailySun` tinyint unsigned DEFAULT NULL,
  `averageWindSpeed` int unsigned DEFAULT NULL,
  `maxWindSpeed` int unsigned DEFAULT NULL,
  `weatherType` enum('Normal','Fog','Heavy Fog','Thunder',
      'Small Hail','Hail','Glaze','Dust/Ash', 
      'Smoke/Haze', 'Blowing/Drifting Snow',
      'Tornado','High Winds','Blowing Spray',
      'Mist','Drizzle', 'Freezing Drizzle','Rain','Freezing Rain',
      'Snow','Unknown Precipitation','Ground Fog','Freezing Fog')
      DEFAULT NULL,
  PRIMARY KEY (`station_id`, `date`),
  KEY `date` (`date`)
) ENGINE=FEDERATED
DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin
CONNECTION 'localch/sensor_data';

This actually happens in an instant, as MySQL does not store any real data, only metadata information, all the data remains on the remote MySQL (or in this case ClickHouse).

So to check, if everything is running, we can issue our first Query directly on the federated table in MySQL:

Q1:
SELECT COUNT(DISTINCT date) AS days, COUNT(DISTINCT station_id) AS stations, COUNT(*) AS data
FROM weather.sensor_data_ch
WHERE date BETWEEN '2020-01-01' AND '2021-12-31'
  AND station_id LIKE 'US%';

+------+----------+----------+
| days | stations | data     |
+------+----------+----------+
|  731 |32429  | 16020472 |
+------+----------+----------+
1 row in set (43.05 sec)

As we can see, the connection from MySQL to our local ClickHouse works perfectly, and we can query data from ClickHouse directly in MySQL. But there is an overhead in the process, as the same query in MySQL took only 26 seconds and now it takes 43 seconds.

This overhead comes because of the implementation of the federated engine in MySQL. It doesn’t send the query itself to the remote server, but just a few easy conditions like ‘=’, ‘>’, ‘<’… Then the rows matching these easy conditions are sent back to the initiator MySQL where all the calculations etc are done.

This can be seen, when we want to execute query 2 on that connection:

SELECT ROUND(AVG(tempAvg) / 10, 1) AS avgTemp, 
  TRUNCATE(YEAR(date), -1) AS dekade
FROM weather.sensor_data_ch
WHERE date >= '1970-01-01'
GROUP BY dekade
ORDER BY dekade;

ERROR 2013 (HY000): Lost connection to MySQL server during query

Actually I needed to greatly reduce the time condition to get any result. It is because MySQL simply issues a query like this on the remote server:

SELECT tempAvg, date
FROM weather.sensor_data_ch
WHERE date >= '1970-01-01'

So it really only uses the remote ClickHouse as the storage, but it doesn’t make use of ClickHouse’s power much. It just selects the data and sends it over the ‘internal’ wire.

In this case, it means MySQL tries to get that half billion rows into Memory for the later calculation and therefore gets OOM killed quite frankly.

Working use-case

So does this mean, all is lost?

Actually not. Of course, if you were thinking of only using ClickHouse to store the data, but do all the work in MySQL, then federated is not the right choice.

I would kindly lead you to TokuDB or RocksDB in those cases. Both have compression by default and work quite well when it comes to handling a few billion rows, with a low disk footprint.

But if you want to achieve the power of ClickHouse while still using mainly MySQL for your application, you have to add an additional step.

As we saw, the federated engine doesn’t really care about how the table is structured on the remote server. It just sends something like

SELECT a,b FROM source_table WHERE c = ‘x’ 
and then works on the result, like you would do with your own application.

But this quite simple approach can be utilized to achieve our goal. If the inquiring MySQL doesn’t care about how the remote server handles the tables internally, would it care if the tables are not even tables?

The answer is: No. It doesn’t care at all.

As long as it gets a result for those simple SELECTs then all is fine.

That means, it should be possible to create a view on ClickHouse and just expose the view as a federated table.

So within ClickHouse we create a new view, doing the hard work for us:

CREATE VIEW weather.avg_temp_per_dekade AS
SELECT ROUND(AVG(tempAvg) / 10, 1) AS avgTemp, 
              TRUNCATE(YEAR(date), -1) AS dekade  
FROM weather.sensor_data  
GROUP BY dekade  
ORDER BY dekade;

And within MySQL instead of going to the base table we access this view as a federated table:

CREATE TABLE weather.avg_temp_per_dekade (
  dekade INT PRIMARY KEY,
  avgTemp FLOAT
) ENGINE=FEDERATED
CONNECTION 'localch/avg_temp_per_dekade';

Our second Query in MySQL now looks a bit different, but also a lot easier!

Q2:
SELECT * 
FROM weather_view.avg_temp_per_dekade 
WHERE dekade >= 1970;

+--------+---------+
| dekade | avgTemp |
+--------+---------+
|   1970 |     8.9 |
|   1980 |     9.9 |
|   1990 |   10.2 |
|   2000 |   10.8 |
|   2010 |   10.7 |
|   2020 |   10.8 |
+--------+---------+
6 rows in set (4.94 sec)

We are now down to the query-speed of ClickHouse, while still being able to work with the final result in MySQL.

Of Course in the end, this approach might lead to quite a lot of created views in ClickHouse, but as those are just metadata, you don’t have to worry about disk space etc.

So this could be the solution to avoid a lot of pre-aggregated Tables in MySQL and can therefore save you a lot of disk space.

Final thoughts

So, you really can use ClickHouse as a storage engine in MySQL, by utilizing the federated engine shipped with MySQL. But it is missing a lot of polishment, for considering it as a good solution.

The fact that I could get the MySQL instance killed multiple times by a single query which produced too many internal results, does not make me feel comfortable, when I think about production use cases.

The better approach is definitely, to do it the other way round: Use ClickHouse to connect to MySQL for some OLTP data and do OLAP queries all together in ClickHouse.

ClickHouse itself is quite active in its development, therefore the MySQL features to connect from ClickHouse to MySQL are a lot more mature than connecting from MySQL to ClickHouse (or another MySQL).

Still, if you only have one or two queries in your application, where you need to access ClickHouse, and all others can run on MySQL, it might be a way to go with the federated engine.

Either way, if you want to utilize the federated engine with clickhouse, want to optimize the code, or just have an idea for a similar blog post, feel free to reach out.

In this article, we’ll talk about:

DoubleCloud Managed Service for ClickHouse®

An open-source, managed ClickHouse DBMS service for sub-second analytics.

Start your trial today

Sign in to save this post