Written By: Stefan Kaeser, DoubleCloud Solution Architect
June 8, 2023
When we talk to potential customers, it's typically not very hard to convince them that ClickHouse is one of the best OLAP solutions on the market. You can show a lot of benchmarks, demonstrate fancy queries, and explain the fun of working with ClickHouse.
So, when someone wants to start a new product utilizing a fast OLAP DBMS, ClickHouse is often the first choice.
But most customers don’t start a new product. They come to us because they already have a well-running product, but it has outgrown their current solution, or it got too expensive to still run on easy-to-start systems like BigQuery.
For these customers, it's typically not so easy to see why ClickHouse would be the better solution, because they have to think not only about the technology but also about code migration.
Contrary to what you might expect, most of the work involved in migrating to another DBMS is not migrating the data itself. Those tasks are quite similar for most customers, so there is already plenty of documentation and there are many examples of how to move multiple TB (or even PB) of data from one system to another.
No, the bulk of the work always falls to the engineering department. The existing application might have hundreds of queries which need to be changed.
Even though there is a standard for SQL, it has become quite a habit for database creators to also invent their own dialect of SQL. And especially when query performance matters, keeping your queries close to the SQL standard might result in working but badly performing queries. That's why experts in a given system can always outperform generic query builders.
But in the end, this leads to queries optimized for a specific DBMS which won’t run at all, or run with worse performance on another system.
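To give a concrete flavor of how dialects diverge, here is the same prefix predicate written for both systems (an illustrative sketch; the table and column names are made up, but the function names are the ones each system actually ships):

```sql
-- BigQuery
SELECT * FROM pages WHERE STARTS_WITH(pagetype, 'category/');

-- ClickHouse
SELECT * FROM pages WHERE startsWith(pagetype, 'category/');
```

The logic is identical, but a query using one spelling simply fails to parse on the other system.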
And this is where migrations to different DBMSs get hard. Now, with the current hype about ChatGPT and other AI solutions, we were pondering: Can ChatGPT help with those tasks?
Don’t worry fellow DB Experts, right now your job is safe!
To find out if ChatGPT is able to help with DB migrations, let's start with something simple. When you need to migrate data from one system to another, you typically start with the basic table schemas and try to recreate the structure in the new system.
So let’s take a simple DDL for a MySQL table, and ask ChatGPT if it can convert it into ClickHouse dialect:
CREATE TABLE users (
groupId Int32 NOT NULL,
userId Int32 NOT NULL,
password_hash FixedString(16) DEFAULT NULL,
username String DEFAULT NULL,
PRIMARY KEY (groupId, userId),
INDEX uname (username)
) ENGINE = MergeTree()
ORDER BY (groupId, userId);
It converted the types correctly, explaining why it chose each type, etc. But unfortunately, the end result produces a syntax error in ClickHouse, as additional indices are not as easy to create as ChatGPT thinks.
I asked ChatGPT to fix its error, but it just dropped the index from the first DDL and tried to create it in a separate statement, which also didn't work.
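For reference, a version that ClickHouse actually accepts declares the index with an explicit type and granularity; `bloom_filter` here is just one plausible choice for a string column, not what the customer necessarily needs:

```sql
CREATE TABLE users (
    groupId Int32,
    userId Int32,
    password_hash FixedString(16),
    username String,
    -- a data-skipping index needs a TYPE and GRANULARITY in ClickHouse
    INDEX uname username TYPE bloom_filter GRANULARITY 4
) ENGINE = MergeTree()
ORDER BY (groupId, userId);
```

The separate-statement variant works too, but only with the same syntax: `ALTER TABLE users ADD INDEX uname username TYPE bloom_filter GRANULARITY 4;`.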
It also claimed that ClickHouse does not enforce NOT NULL on non-nullable columns. I have no idea where that came from, as ClickHouse definitely enforces it.
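You can check the enforcement yourself in two statements (the table name is made up):

```sql
CREATE TABLE t (x Int32) ENGINE = MergeTree ORDER BY x;

-- x is not Nullable, so this NULL is not stored as NULL:
INSERT INTO t VALUES (NULL);
```

Depending on the `input_format_null_as_default` setting, ClickHouse either rejects the insert with an error or substitutes the column's default value, but it never stores a NULL in a non-Nullable column. To allow NULLs you must opt in explicitly by declaring the column as `Nullable(Int32)`.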
Ergo, even with those good explanations, you should be careful: not everything may be correct, yet it sounds quite convincing, as if it were right (sounds like some humans you may know…).
Ok, so we've seen that ChatGPT can convert DDL from one dialect to another with some restrictions, but what about real queries?
We asked some of our customers whether they could provide us with real queries they have to migrate. First of all, we got an example query from one of our customers which runs on BigQuery (shown here as an excerpt):
WHEN sq_2__search_console__period_0__ctr >= 0
AND sq_2__search_console__period_0__ctr < 5 THEN '0000_from_0_to_5'
WHEN sq_2__search_console__period_0__ctr >= 5
AND sq_2__search_console__period_0__ctr < 10 THEN '0001_from_5_to_10'
WHEN sq_2__search_console__period_0__ctr >= 10
AND sq_2__search_console__period_0__ctr < 20 THEN '0002_from_10_to_20'
WHEN sq_2__search_console__period_0__ctr >= 20
AND sq_2__search_console__period_0__ctr < 40 THEN '0003_from_20_to_40'
WHEN sq_2__search_console__period_0__ctr >= 40
AND sq_2__search_console__period_0__ctr < 70 THEN '0004_from_40_to_70'
WHEN sq_2__search_console__period_0__ctr >= 70 THEN '0005_from_70_to_'
IFNULL(SUM(sq_2__search_console__period_0__count_clicks), 0) f_1
(SELECT sq_2.search_console__period_0__count_clicks sq_2__search_console__period_0__count_clicks,
(SELECT ((SUM(clicks) / SUM(impressions)) * 100) search_console__period_0__ctr,
WHERE date BETWEEN '2023-01-28' AND '2023-02-26' )
GROUP BY url_keyword_hash) sq_2
WHERE http_code <> 0 ) sq_1 ON sq_2.url_hash = sq_1.url_hash
WHERE NOT(sq_1.compliant__is_compliant) )
GROUP BY f_0
As you can see, it's already quite complex, but this is the real world, and we want to see if ChatGPT can help us during a migration process, not just for fun, right? So what did we get:
WHEN sq_2__search_console__period_0__ctr >= 0 AND sq_2__search_console__period_0__ctr < 5 THEN '0000_from_0_to_5'
WHEN sq_2__search_console__period_0__ctr >= 5 AND sq_2__search_console__period_0__ctr < 10 THEN '0001_from_5_to_10'
WHEN sq_2__search_console__period_0__ctr >= 10 AND sq_2__search_console__period_0__ctr < 20 THEN '0002_from_10_to_20'
WHEN sq_2__search_console__period_0__ctr >= 20 AND sq_2__search_console__period_0__ctr < 40 THEN '0003_from_20_to_40'
WHEN sq_2__search_console__period_0__ctr >= 40 AND sq_2__search_console__period_0__ctr < 70 THEN '0004_from_40_to_70'
WHEN sq_2__search_console__period_0__ctr >= 70 THEN '0005_from_70_to_'
END AS f_0,
COALESCE(SUM(sq_2__search_console__period_0__count_clicks), 0) AS f_1
sq_2.search_console__period_0__count_clicks AS sq_2__search_console__period_0__count_clicks,
sq_2.search_console__period_0__ctr AS sq_2__search_console__period_0__ctr
(sum(clicks) / sum(impressions)) * 100 AS search_console__period_0__ctr,
anyLast(url_hash) AS url_hash,
sum(clicks) AS search_console__period_0__count_clicks,
WHERE date BETWEEN '2023-01-28' AND '2023-02-26'
GROUP BY url_keyword_hash
) AS sq_2
ANY FULL OUTER JOIN
WHERE http_code <> 0
) AS sq_1
ON sq_2.url_hash = sq_1.url_hash
WHERE NOT sq_1.compliant__is_compliant
GROUP BY f_0
In the end, it just translated the logic, nothing more. I asked ChatGPT if it could optimize the query, and it gave me some useful hints, again with good explanations for beginners to learn from.
Then I tried to trick it a little. Our customer sent me a list of queries, but the tool they used applied some text encoding, so the query was no longer easy to read for the human eye:
SELECT CASE WHEN sq_1__nb_words \u003e\u003d 1 AND sq_1__nb_words \u003c 4 THEN \u00270000_from_1_to_4\u0027 WHEN sq_1__nb_words \u003e\u003d 4 AND sq_1__nb_words \u003c 7 THEN \u00270001_from_4_to_7\u0027 WHEN sq_1__nb_words \u003e\u003d 7 THEN \u00270002_from_7_to_\u0027 END f_0, IFNULL(SUM(sq_1__search_console__period_0__count_clicks), 0) f_1 FROM (SELECT sq_1.nb_words sq_1__nb_words, sq_1.search_console__period_0__count_clicks sq_1__search_console__period_0__count_clicks FROM (SELECT ANY_VALUE(branded) branded, ANY_VALUE(nb_words) nb_words, ANY_VALUE(segments__pagetype__value) segments__pagetype__value, ANY_VALUE(url_hash) url_hash, SUM(clicks) search_console__period_0__count_clicks, url_keyword_hash FROM (SELECT * FROM `production_52366.search_console_flat` WHERE date BETWEEN \u00272023-04-10\u0027 AND \u00272023-05-07\u0027) WHERE (STARTS_WITH(segments__pagetype__value, \u0027category/\u0027) OR segments__pagetype__value \u003d \u0027category\u0027) AND NOT(STARTS_WITH(segments__pagetype__value, \u0027category/results-page/\u0027) OR segments__pagetype__value \u003d \u0027category/results-page\u0027) AND NOT(branded) GROUP BY url_keyword_hash) sq_1) GROUP BY f_0 LIMIT 20000
ChatGPT could easily fix the encoding and gave me a query, but it was only partially in ClickHouse dialect. For whatever reason, this time it didn't convert the ANY_VALUE function, which it had already done in the previous example.
I pointed ChatGPT to its error, and the next iteration of the query worked.
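The fix it needed is a one-to-one function mapping: BigQuery's `ANY_VALUE` corresponds to ClickHouse's `any()` aggregate. Sketched with the table and column names from the customer query above:

```sql
-- BigQuery
SELECT ANY_VALUE(branded) AS branded, SUM(clicks) AS clicks_total
FROM search_console_flat
GROUP BY url_keyword_hash;

-- ClickHouse
SELECT any(branded) AS branded, sum(clicks) AS clicks_total
FROM search_console_flat
GROUP BY url_keyword_hash;
```

Mappings like this are exactly the mechanical part of a migration that one would hope an AI assistant applies consistently, yet it only did so after being prompted.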
So in the end, ChatGPT can help you translate queries for some PoCs, but you still need to optimize for the best performance yourself; a fully automatic conversion process won't work.
Speaking of PoCs: one of the hardest things in a new project is always the beginning. Let's find out if ChatGPT can help there. Can it create some basic tables for me?
As an example, I thought of some log data to store in ClickHouse. I therefore installed collectd on a test machine and let it generate some data in its JSON format.
Then I took one row of the data and asked ChatGPT to provide me with an efficient table structure based on that single JSON row. I gave it some context: that it is collected log data and that I want to store it as efficiently as possible.
CREATE TABLE my_table (
    -- (column definitions suggested by ChatGPT; omitted in this excerpt)
) ENGINE = MergeTree()
ORDER BY time;
Actually, I think it's quite impressive what it can create out of so little information, but you can easily see there is little to no optimization. Typically, you would use LowCardinality for most of the String columns, as you're unlikely to have millions of different hosts or plugins. Also, the chosen ORDER BY time is suboptimal for storing a lot of log data.
It explains that this ORDER BY facilitates efficient querying of time-series data. Still, ChatGPT forgets to mention that it also leads to bad data locality and therefore poor compression.
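A hand-optimized version, along the lines a database expert would suggest, might look like this (a sketch; the column names are assumptions based on typical collectd output, not ChatGPT's actual proposal):

```sql
CREATE TABLE collectd_metrics (
    -- Delta + ZSTD compresses monotonically increasing timestamps well
    time   DateTime CODEC(Delta, ZSTD),
    -- few distinct values expected, so dictionary-encode them
    host   LowCardinality(String),
    plugin LowCardinality(String),
    type   LowCardinality(String),
    value  Float64
) ENGINE = MergeTree()
ORDER BY (host, plugin, time);
```

Putting `time` last in the ORDER BY keeps rows for the same host and plugin adjacent on disk, which improves both data locality for typical filters and compression ratios.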
Even when explicitly asked for a more optimized version, it only gave me general hints on what could be done, not specific enough to be useful for automation.
So in the end, one must say ChatGPT is a fantastic tool for a fast ramp-up, but it definitely lacks the experience of a database expert. You can use it for early prototypes or for giving you some valuable information, but it's still far from being usable as an automated way to generate your database architecture or to translate optimized queries from one dialect to another.