Neat ClickHouse features more people need to know about

Written By: Stefan Kaeser, DoubleCloud Senior Solution Architect

As ClickHouse’s popularity has increased over the years, more and more people have learned about its core features, such as its incredible performance, high compression ratio, and extensive capabilities of reading and writing nearly all data formats.

But ClickHouse also has many hidden gems that can help with your day-to-day work, which many people don’t know about. Even though most of them are well documented in the official documentation, you will not find them if you don’t know what you are looking for.

In this blog post, I’ll mention some of my favorite small features that more people should know about.

SELECT Modifiers

In a column-oriented DBMS like ClickHouse, queries like SELECT * FROM table should generally be avoided. At least, that’s true for regular queries issued by your application. However, the typical day-to-day work of a DBA or database developer often includes these types of queries, as otherwise it would take a huge amount of time to type out all X columns of a table manually.

But what if you want to run a lot of operations on some columns? Imagine you have the following table:

CREATE TABLE customer (
customerId UInt64,
custom_num_1 UInt64,
custom_num_2 UInt64,
...
custom_num_50 UInt64,
custom_string_1 String,
custom_string_2 String,
...
custom_string_50 String
);

When optimizing your table, you might be interested in the average length of the String columns and the maximum value of your numeric columns. If you were to handwrite a query to collect this information, you would have to write the correct function for 100 columns.
In some databases (MySQL, but also ClickHouse), you can use the INFORMATION_SCHEMA.COLUMNS table to build such a query, as sketched below. This can be convenient for a lot of DBAs, as they might already be familiar with the method. However, ClickHouse provides an even faster way to achieve your goal: SELECT modifiers.
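
For comparison, here is a rough sketch of the INFORMATION_SCHEMA approach (assuming the customer table lives in the current database). It generates one aggregate expression per column, which you would then paste into a handwritten query:

-- emit max(...) for numeric columns and max(length(...)) for strings
SELECT concat(
    CASE
        WHEN data_type = 'String' THEN 'max(length(' || column_name || '))'
        ELSE 'max(' || column_name || ')'
    END, ',') AS expr
FROM INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'customer'
  AND table_schema = currentDatabase()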

By utilizing a combination of modifiers, our task comes down to a simple query:

SELECT
  COLUMNS('.*num.*') APPLY max,
  COLUMNS('.*string.*') APPLY length APPLY max
FROM customer

We use the COLUMNS modifier to apply a regex to the column names, filtering for columns with names containing “num” or “string”. We then apply the max function for numeric columns. For string columns, we first apply the length function and then the max function. This method gives us the desired results, and takes way less time than building a query via the information schema, or manually writing down 100 columns.
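
By the way, APPLY is not the only SELECT modifier ClickHouse offers: EXCEPT and REPLACE are also worth knowing. Two illustrative queries against the same customer table:

-- select everything except two columns
SELECT * EXCEPT (custom_string_1, custom_string_2) FROM customer;

-- transform a column on the fly while keeping its name
SELECT * REPLACE (customerId + 1 AS customerId) FROM customer;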

Manual test data in various formats

Often, when I help customers or ClickHouse users in the open Slack channel, there are cases where I need to provide examples in a test table. Sometimes, writing the table definition for such a test table and filling it with test data can involve more code than you want to show in the first place.

As you saw in the previous section, the table I used to explain SELECT modifiers is longer than the actual code I wanted to present (and it was already abbreviated).

But ClickHouse offers a way to work directly on data as if it were already stored in a table: the format table function.

With this function, you can specify which format you want to use (like JSONEachRow) and then directly enter the data instead of inserting it into a table first:

SELECT *
FROM format(JSONEachRow, '{"customerId":1,"type":1,"custom_num1":4711}\n{"customerId":2, "type":2,"custom_ips":["127.0.0.1","127.0.0.2"]}')


+-customerId-+-type-+-custom_num1-+-custom_ips----------------+
|          1 |    1 |        4711 | []                        |
|          2 |    2 |        NULL | ['127.0.0.1','127.0.0.2'] |
+------------+------+-------------+---------------------------+

The function generates a result set with two rows, and even complex types like Arrays are doable. Additionally, you can use a large variety of data formats.
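
The same works with other formats, for example CSV; schema inference then names the columns c1, c2, and so on:

SELECT *
FROM format(CSV, '1,"127.0.0.1"\n2,"127.0.0.2"')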

Generate Series with the numbers table-function

Some DBMSs can generate number sequences using table functions. ClickHouse is no exception, providing the numbers() table function for this purpose. But did you know that you can also use it to easily generate date or time sequences?

ClickHouse allows simple arithmetic functions on Date and DateTime data types, allowing you to generate sequences of dates or timestamps:

SELECT
  number,
  now() - number AS previousTimes,
  toDate(now()) + number AS nextDays
FROM numbers(10)


+-number-+-------previousTimes-+---nextDays-+
|      0 | 2024-04-25 13:04:30 | 2024-04-25 |
|      1 | 2024-04-25 13:04:29 | 2024-04-26 |
|      2 | 2024-04-25 13:04:28 | 2024-04-27 |
|      3 | 2024-04-25 13:04:27 | 2024-04-28 |
|      4 | 2024-04-25 13:04:26 | 2024-04-29 |
|      5 | 2024-04-25 13:04:25 | 2024-04-30 |
|      6 | 2024-04-25 13:04:24 | 2024-05-01 |
|      7 | 2024-04-25 13:04:23 | 2024-05-02 |
|      8 | 2024-04-25 13:04:22 | 2024-05-03 |
|      9 | 2024-04-25 13:04:21 | 2024-05-04 |
+--------+---------------------+------------+

By applying multiplication, you can also introduce step sizes. And as the resulting data type of the columns is still Date or DateTime, you can be sure that only valid dates are generated.
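
For example, multiplying number by 7 turns the day sequence into a week sequence:

SELECT
  number,
  toDate(now()) + number * 7 AS nextWeeks
FROM numbers(4)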

Data formatting in custom columns

Sometimes you need to format only part of your data on output. For example, when inserting data into a streaming service like Kafka, some columns map directly to target columns, while data from others needs to be combined into a payload column in a given format (typically JSON).

Of course, you can do that in other DBMSs as well, by using string concatenation to build your JSON manually, by using specific JSON functions like toJSONString, or by creating JSON objects manually. ClickHouse has you covered with the function formatRowNoNewline(), which lets you format an arbitrary number of columns into any of the output formats ClickHouse has to offer.

And, of course, you can also use SELECT modifiers to specify which columns to format:

SELECT
  customerId,
  formatRowNoNewline('JSONEachRow', COLUMNS('.*num.*')) AS payload
FROM customer
LIMIT 10

+-customerId-+-payload---------------------------------------------------------------------------------+
|         20 | {"custom_num_1":"4503644724578621668","custom_num_2":"156","custom_num_50":"32624"}    |
|        111 | {"custom_num_1":"9395348731023764754","custom_num_2":"4","custom_num_50":"8919"}       |
|        187 | {"custom_num_1":"4410745110154105282","custom_num_2":"67","custom_num_50":"19015"}     |
|        231 | {"custom_num_1":"8206799308850873985","custom_num_2":"151","custom_num_50":"43260"}    |
|        262 | {"custom_num_1":"14904510309594397590","custom_num_2":"83","custom_num_50":"2802"}     |
|        375 | {"custom_num_1":"14468162691130275987","custom_num_2":"13","custom_num_50":"6910"}     |
|        388 | {"custom_num_1":"15191470301382236130","custom_num_2":"110","custom_num_50":"39490"}   |
|        434 | {"custom_num_1":"11648353314185268442","custom_num_2":"211","custom_num_50":"52725"}   |
|        439 | {"custom_num_1":"8430391284904487000","custom_num_2":"31","custom_num_50":"43376"}     |
|        468 | {"custom_num_1":"11265143749859112447","custom_num_2":"41","custom_num_50":"58748"}    |
+------------+-----------------------------------------------------------------------------------------+

Querying the whole cluster

Sometimes, querying a single node is not enough. Imagine that you are looking for queries that run longer than a specific threshold. You can find the information within ClickHouse’s system.query_log table, but you would have to check all hosts separately.

Once again, ClickHouse has you covered. The table-function clusterAllReplicas allows you to execute a query on all nodes of a cluster and gives you the combined result as if it were a local table:

SELECT
  user,
  substring(query, 1, 15) AS query_part,
  query_duration_ms
FROM clusterAllReplicas('mytestcluster', system, query_log)
WHERE query_duration_ms > 50
LIMIT 3

+-user----+-query_part------+-query_duration_ms-+
| default | INSERT INTO col |                52 |
| default | INSERT INTO col |                55 |
| default | INSERT INTO col |                51 |
+---------+-----------------+-------------------+
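
If you also want to know which node a result row came from, the hostName() function helps; for example, to count slow queries per replica:

SELECT
  hostName() AS node,
  count() AS slow_queries
FROM clusterAllReplicas('mytestcluster', system, query_log)
WHERE query_duration_ms > 50
GROUP BY node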

Bonus: Working with AggregateStates

Working with Materialized Views and the possibilities of AggregateFunctions could fill multiple blog posts on their own (for example, this one about the performance impact of Materialized Views). Therefore, I only want to briefly mention some functionality not everyone knows about, but which can come in handy.

Let’s assume we have the following two tables, which count unique customers per hour and per day:

CREATE TABLE customers_hourly (
  eventDate Date,
  eventHour UInt8,
  uniqueCustomers AggregateFunction(uniq, UInt64)
) ENGINE=AggregatingMergeTree
ORDER BY (eventDate, eventHour);

CREATE TABLE customers_daily (
  eventDate Date,
  uniqueCustomers AggregateFunction(uniq, UInt64)
) ENGINE=AggregatingMergeTree
ORDER BY (eventDate);
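
Such tables are typically fed by a Materialized View. As a minimal sketch, assuming a hypothetical source table events(customerId UInt64, eventTime DateTime), the hourly table could be filled like this:

CREATE MATERIALIZED VIEW customers_hourly_mv TO customers_hourly AS
SELECT
  toDate(eventTime) AS eventDate,
  toHour(eventTime) AS eventHour,
  uniqState(customerId) AS uniqueCustomers
FROM events
GROUP BY eventDate, eventHour;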

Initialize Aggregation

Filling those tables is quite easy with Materialized Views, as sketched above. But what if you want to insert a row manually? For example, for testing purposes, you may want to insert the test customerId 4711 at 3 different hours of the same day.

As uniqueCustomers is an AggregateFunction, you cannot directly insert the customerId into that column, so something like this doesn’t work:

INSERT INTO customers_hourly
SELECT eventDate, eventHour, customerId
FROM format(JSONEachRow,
$${"customerId":4711,"eventDate":"2024-04-26","eventHour":1}
{"customerId":4711,"eventDate":"2024-04-26","eventHour":2}
{"customerId":4711,"eventDate":"2024-04-26","eventHour":3}$$)

Of course, it’s possible to generate an aggregation state by using window functions:

INSERT INTO customers_hourly
SELECT eventDate, eventHour,
  uniqState(toUInt64(assumeNotNull(customerId))) OVER ()
FROM format(JSONEachRow,
$${"customerId":4711,"eventDate":"2024-04-26","eventHour":1}
{"customerId":4711,"eventDate":"2024-04-26","eventHour":2}
{"customerId":4711,"eventDate":"2024-04-26","eventHour":3}$$)

However, if you have multiple arguments for the function, it could be hard to read. So, you can just use the initializeAggregation function to directly insert into the table:

INSERT INTO customers_hourly
SELECT eventDate, eventHour,
  initializeAggregation('uniqState', toUInt64(assumeNotNull(customerId)))
FROM format(JSONEachRow,
$${"customerId":4711,"eventDate":"2024-04-26","eventHour":1}
{"customerId":4711,"eventDate":"2024-04-26","eventHour":2}
{"customerId":4711,"eventDate":"2024-04-26","eventHour":3}$$)

Finalize Aggregation

Now that we know how to write directly into an aggregation state, how can we read from it? Of course, we can use uniqMerge to get the final result, but as that is an aggregate function, it requires a GROUP BY and so on. If we want to see the individual intermediate states we added to the table, we would again need a window function to prevent the rows from collapsing. Alternatively, we can use finalizeAggregation to make it easier:

SELECT
  eventDate,
  eventHour,
  finalizeAggregation(uniqueCustomers)
FROM customers_hourly


   ┌──eventDate─┬─eventHour─┬─finalizeAggregation(uniqueCustomers)─┐
1. │ 2024-04-26 │         1 │                                    1 │
2. │ 2024-04-26 │         2 │                                    1 │
3. │ 2024-04-26 │         3 │                                    1 │
   └────────────┴───────────┴──────────────────────────────────────┘
   ┌──eventDate─┬─eventHour─┬─finalizeAggregation(uniqueCustomers)─┐
4. │ 2024-04-26 │         1 │                                    1 │
5. │ 2024-04-26 │         2 │                                    1 │
6. │ 2024-04-26 │         3 │                                    1 │
   └────────────┴───────────┴──────────────────────────────────────┘

-MergeState

How do we go from the hourly aggregation to the daily one? In cases where you can simply sum up the hourly results, that is easy to achieve. Unfortunately, with the uniq function you cannot sum up the intermediate results, as the same customer could have been active during multiple hours; the daily result is not the sum of all hourly results. I’ve seen customers resort to recalculating the daily table from the base table, instead of simply continuing the aggregation by using uniqMergeState:

INSERT INTO customers_daily
SELECT eventDate, uniqMergeState(uniqueCustomers)
FROM customers_hourly
GROUP BY eventDate

The logic is as easy as it sounds: uniqMergeState merges the intermediate states, but instead of returning the merged result, it returns the new internal state, which is then stored in the daily aggregation table.
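
To read the final daily numbers, you then merge the stored states as usual (or use finalizeAggregation, as shown above):

SELECT
  eventDate,
  uniqMerge(uniqueCustomers) AS uniqueCustomers
FROM customers_daily
GROUP BY eventDate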

There you have 5 (+3 bonus) small features I think are good to know. If you have any more neat features or topics you want to read about, feel free to contact me via Slack or LinkedIn.

Open source thrives on its community!
