
Can ChatGPT do ClickHouse?

Written By: Stefan Kaeser, DoubleCloud Solution Architect

When we talk to potential customers, it's typically not very hard to convince them that ClickHouse is one of the best OLAP solutions on the market. You can show a lot of benchmarks, demonstrate fancy queries, and explain the fun of working with ClickHouse.

So, when someone wants to start a new product utilizing a fast OLAP DBMS, ClickHouse is often the first choice.

But most customers don’t start a new product. They come to us because they already have a well-running product, but it has outgrown their current solution, or it has become too expensive to keep running on easy-to-start systems like BigQuery.

For these customers, it’s typically not so easy to see why ClickHouse would be the better solution, because they have to think not only about the technology but also about code migration.

And contrary to what you might think, most of the work involved in migrating to another DBMS is not migrating the data itself. Those tasks are quite similar for most customers, so there is already plenty of documentation and there are many examples of how to move multiple TB (or even PB) of data from one system to another.

No, the bulk of the work always falls back on the engineering department. The existing application might have hundreds of queries which need to be changed.

Even though there is a standard for SQL, it has become quite a habit for database creators to invent their own dialect of SQL. And especially when query performance matters, keeping your queries close to the SQL standard may result in working but badly performing queries. That’s why experts in a given system can always outperform generic query builders.

In the end, this leads to queries optimized for a specific DBMS which won’t run at all, or will run with worse performance, on another system.

And this is where migrations to different DBMSs get hard. Now, with the current hype about ChatGPT and other AI solutions, we were pondering: Can ChatGPT help with those tasks?

Don’t worry, fellow DB experts: for now, your job is safe!

Asking for simple things

To find out whether ChatGPT is able to help with DB migrations, let’s start with something simple. When migrating data from one system to another, we typically begin with the basic table schemas and try to recreate the structure in the new system.

So let’s take a simple DDL for a MySQL table, and ask ChatGPT if it can convert it into ClickHouse dialect:

CREATE TABLE `datatypes` (
  `a` int NOT NULL,
  `b_bool` tinyint(1) DEFAULT NULL,
  `c_bit` bit(1) DEFAULT NULL,
  PRIMARY KEY (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

Of course this is just a simple structure, but it’s fine for a start. Here is what I got from ChatGPT:

CREATE TABLE datatypes (
  a Int32 NOT NULL,
  b_bool UInt8 DEFAULT NULL,
  c_bit UInt8 DEFAULT NULL,
  PRIMARY KEY (a)
) ENGINE = MergeTree()
ORDER BY a;

It worked well: not only did it give us a DDL, it also explained why it converted the signed column b_bool into an unsigned UInt8:

Because tinyint (1) is typically used for boolean values, while in ClickHouse those are represented by UInt8. The same goes for the bit (1) type.

So technically the new table structure is incorrect, as it cannot store the same range of values as the old one, but semantically the solution is correct.
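For reference, a conversion that could store the same values as the MySQL table would need signed and Nullable types. Here is a sketch of such a correction (my own, not ChatGPT’s output):

```sql
-- tinyint(1) is signed in MySQL, so Int8 is the faithful counterpart,
-- and DEFAULT NULL requires a Nullable() wrapper in ClickHouse.
CREATE TABLE datatypes (
    a Int32,
    b_bool Nullable(Int8),
    c_bit Nullable(UInt8)
) ENGINE = MergeTree()
ORDER BY a;
```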

Let’s try a second example:

CREATE TABLE `users` (
  `groupId` int NOT NULL,
  `userId` int NOT NULL,
  `password_hash` binary(16) DEFAULT NULL,
  `username` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`groupId`,`userId`),
  KEY `uname` (`username`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

This is the result:

CREATE TABLE users (
  groupId Int32 NOT NULL,
  userId Int32 NOT NULL,
  password_hash FixedString(16) DEFAULT NULL,
  username String DEFAULT NULL,
  PRIMARY KEY (groupId, userId),
  INDEX uname (username)
) ENGINE = MergeTree()
ORDER BY (groupId, userId);

It again converted the types correctly, explaining why it chose each type. But unfortunately, the end result produces a syntax error in ClickHouse, as secondary indexes are not as easy to create as ChatGPT thinks.

I asked ChatGPT to fix its error, but it just skipped the index in the first DDL and tried to create it in a separate statement, which also didn’t work.

It also explained that ClickHouse does not enforce NOT NULL on non-nullable columns. I have no idea how it came up with that, as ClickHouse definitely enforces it.

So even with those good explanations, you should be careful: not everything is correct, yet it all sounds quite convincing. (Sounds like some humans you may know…)
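For comparison, secondary indexes in ClickHouse are data skipping indexes and need a TYPE and a GRANULARITY clause. A working version of the users DDL could look roughly like this (the bloom filter index type and granularity are my assumptions and would need tuning for the actual data):

```sql
CREATE TABLE users (
    groupId Int32,
    userId Int32,
    password_hash Nullable(FixedString(16)),
    username Nullable(String),
    -- skipping index: lets ClickHouse skip granules that cannot match a username filter
    INDEX uname username TYPE bloom_filter GRANULARITY 4
) ENGINE = MergeTree()
ORDER BY (groupId, userId);
```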


Query dialects

Ok, so we saw that ChatGPT can convert DDL from one dialect to another with some restrictions, but what about real queries?

We asked some of our customers to provide us with real queries they have to migrate. First, we got an example query from one of our customers which runs on BigQuery:

SELECT CASE
          WHEN sq_2__search_console__period_0__ctr >= 0
                AND sq_2__search_console__period_0__ctr < 5 THEN '0000_from_0_to_5'
          WHEN sq_2__search_console__period_0__ctr >= 5
                AND sq_2__search_console__period_0__ctr < 10 THEN '0001_from_5_to_10'
          WHEN sq_2__search_console__period_0__ctr >= 10
                AND sq_2__search_console__period_0__ctr < 20 THEN '0002_from_10_to_20'
          WHEN sq_2__search_console__period_0__ctr >= 20
                AND sq_2__search_console__period_0__ctr < 40 THEN '0003_from_20_to_40'
          WHEN sq_2__search_console__period_0__ctr >= 40
                AND sq_2__search_console__period_0__ctr < 70 THEN '0004_from_40_to_70'
          WHEN sq_2__search_console__period_0__ctr >= 70 THEN '0005_from_70_to_'
      END f_0,
      IFNULL(SUM(sq_2__search_console__period_0__count_clicks), 0) f_1
FROM
  (SELECT sq_2.search_console__period_0__count_clicks sq_2__search_console__period_0__count_clicks,
          sq_2.search_console__period_0__ctr sq_2__search_console__period_0__ctr
  FROM
    (SELECT ((SUM(clicks) / SUM(impressions)) * 100) search_console__period_0__ctr,
            ANY_VALUE(url_hash) url_hash,
            SUM(clicks) search_console__period_0__count_clicks,
            url_keyword_hash
      FROM
        (SELECT *
        FROM `production_64302.search_console_flat`
        WHERE date BETWEEN '2023-01-28' AND '2023-02-26' )
      GROUP BY url_keyword_hash) sq_2
  FULL
      OUTER JOIN
    (SELECT *
      FROM `.production_64302.crawl_1160774_1222833`
      WHERE http_code <> 0 ) sq_1 ON sq_2.url_hash = sq_1.url_hash
  WHERE NOT(sq_1.compliant__is_compliant) )
GROUP BY f_0
LIMIT 20000

As you can see, it’s already quite complex, but this is the real world, and we want to see if ChatGPT can help during an actual migration, not only for fun, right? So what did we get?

SELECT
  CASE
      WHEN sq_2__search_console__period_0__ctr >= 0 AND sq_2__search_console__period_0__ctr < 5 THEN '0000_from_0_to_5'
      WHEN sq_2__search_console__period_0__ctr >= 5 AND sq_2__search_console__period_0__ctr < 10 THEN '0001_from_5_to_10'
      WHEN sq_2__search_console__period_0__ctr >= 10 AND sq_2__search_console__period_0__ctr < 20 THEN '0002_from_10_to_20'
      WHEN sq_2__search_console__period_0__ctr >= 20 AND sq_2__search_console__period_0__ctr < 40 THEN '0003_from_20_to_40'
      WHEN sq_2__search_console__period_0__ctr >= 40 AND sq_2__search_console__period_0__ctr < 70 THEN '0004_from_40_to_70'
      WHEN sq_2__search_console__period_0__ctr >= 70 THEN '0005_from_70_to_'
  END AS f_0,
  COALESCE(SUM(sq_2__search_console__period_0__count_clicks), 0) AS f_1
FROM
(
  SELECT
      sq_2.search_console__period_0__count_clicks AS sq_2__search_console__period_0__count_clicks,
      sq_2.search_console__period_0__ctr AS sq_2__search_console__period_0__ctr
  FROM
  (
      SELECT
          (sum(clicks) / sum(impressions)) * 100 AS search_console__period_0__ctr,
          anyLast(url_hash) AS url_hash,
          sum(clicks) AS search_console__period_0__count_clicks,
          url_keyword_hash
      FROM
      (
          SELECT *
          FROM search_console_flat
          WHERE date BETWEEN '2023-01-28' AND '2023-02-26'
      )
      GROUP BY url_keyword_hash
  ) AS sq_2
  ANY FULL OUTER JOIN
  (
      SELECT *
      FROM crawl_1160774_1222833
      WHERE http_code <> 0
  ) AS sq_1
  ON sq_2.url_hash = sq_1.url_hash
  WHERE NOT sq_1.compliant__is_compliant
)
GROUP BY f_0
LIMIT 20000;

In the end, it just translated the logic, nothing more. I asked ChatGPT if it could optimize the query, and it gave me some useful hints, again with good explanations for beginners to learn from.

Then I tried to trick it a little bit. Our customer sent me a list of queries, but the tool they used applied some text encoding, so the queries were no longer easy to read for the human eye:

SELECT CASE WHEN sq_1__nb_words \u003e\u003d 1 AND sq_1__nb_words \u003c 4 THEN \u00270000_from_1_to_4\u0027 WHEN sq_1__nb_words \u003e\u003d 4 AND sq_1__nb_words \u003c 7 THEN \u00270001_from_4_to_7\u0027 WHEN sq_1__nb_words \u003e\u003d 7 THEN \u00270002_from_7_to_\u0027 END f_0, IFNULL(SUM(sq_1__search_console__period_0__count_clicks), 0) f_1 FROM (SELECT sq_1.nb_words sq_1__nb_words, sq_1.search_console__period_0__count_clicks sq_1__search_console__period_0__count_clicks FROM (SELECT ANY_VALUE(branded) branded, ANY_VALUE(nb_words) nb_words, ANY_VALUE(segments__pagetype__value) segments__pagetype__value, ANY_VALUE(url_hash) url_hash, SUM(clicks) search_console__period_0__count_clicks, url_keyword_hash FROM (SELECT * FROM `production_52366.search_console_flat` WHERE date BETWEEN \u00272023-04-10\u0027 AND \u00272023-05-07\u0027) WHERE (STARTS_WITH(segments__pagetype__value, \u0027category/\u0027) OR segments__pagetype__value \u003d \u0027category\u0027) AND NOT(STARTS_WITH(segments__pagetype__value, \u0027category/results-page/\u0027) OR segments__pagetype__value \u003d \u0027category/results-page\u0027) AND NOT(branded) GROUP BY url_keyword_hash) sq_1) GROUP BY f_0 LIMIT 20000

ChatGPT easily fixed the encoding and gave me a query, but it was only partially in ClickHouse dialect. For whatever reason, this time it didn’t convert the ANY_VALUE function, which it had handled in the previous example.

I pointed ChatGPT at its error, and the next iteration of the query worked.
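For reference, the mapping ChatGPT missed here is straightforward: BigQuery’s ANY_VALUE corresponds to ClickHouse’s any() (or anyLast()) aggregate function, as its own earlier translation showed:

```sql
-- BigQuery:  SELECT ANY_VALUE(branded) FROM t GROUP BY url_keyword_hash
-- ClickHouse equivalent:
SELECT any(branded) FROM t GROUP BY url_keyword_hash;
```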

So, in the end, ChatGPT can help you translate queries for some PoCs, but you still need to optimize for the best performance yourself; a fully automatic conversion process won’t work.

Build table schema

Speaking of PoCs: one of the hardest things in a new project is always the beginning. Let’s find out if ChatGPT can help there. Can it create some basic tables for me?

As an example, I thought of some log data to store in ClickHouse. I therefore installed collectd on a test machine and let it generate some data in its JSON format.

Then I took one row of the data and asked ChatGPT to provide me with an efficient table structure based on that single JSON row. I gave it some context: that this is collected log data and that I want to store it as efficiently as possible.

The sample data I used looked like this:

[{"values":[0],"dstypes":["gauge"],"dsnames":["value"],"time":1684941133.135,"interval":10.000,"host":"loki","plugin":"cpu","plugin_instance":"5","type":"percent","type_instance":"interrupt"}]

ChatGPT provided the following table structure:

CREATE TABLE my_table (
  values Array(Float64),
  dstypes Array(String),
  dsnames Array(String),
  time Float64,
  interval Float64,
  host String,
  plugin String,
  plugin_instance String,
  type String,
  type_instance String
) ENGINE = MergeTree()
ORDER BY time;

Actually, I think it’s quite impressive what it can create out of so little information, but you can easily see there is little to no optimization. Typically you would use LowCardinality for most of the String columns, as you’re unlikely to have millions of different hosts or plugins. Also, the chosen ORDER BY time is suboptimal for storing large amounts of log data.

It explains that this ORDER BY facilitates efficient querying of time series data. Still, ChatGPT forgets to mention that it will also lead to poor data locality and therefore poor compression.

Even when explicitly asking for a more optimized version, it just gave me general hints on what could be done, but not specific enough to be useful for automation.
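To illustrate the kind of optimization a human expert would apply here, a sketch of a hand-tuned version (the table name, codable types, partitioning, and sorting key are my choices, not ChatGPT’s output, and would depend on the actual query patterns):

```sql
CREATE TABLE collectd_log (
    values Array(Float64),
    dstypes Array(LowCardinality(String)),
    dsnames Array(LowCardinality(String)),
    time DateTime64(3),                     -- instead of a raw Float64 timestamp
    interval Float64,
    host LowCardinality(String),
    plugin LowCardinality(String),
    plugin_instance LowCardinality(String),
    type LowCardinality(String),
    type_instance LowCardinality(String)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(time)
ORDER BY (host, plugin, time);              -- groups similar rows together for better compression
```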

Conclusion

In the end, one must say ChatGPT is a fantastic tool for fast ramp-ups, but it definitely lacks the experience of a database expert. You can use it for early prototypes or to get some valuable information, but it’s still far from being an automatic way to generate your database architecture or translate optimized queries from one dialect to another.
