Here are the steps of distributed query execution in ClickHouse:
1. Query analysis: To understand the query structure, identify the relevant tables and sections, and determine the query dependencies.
2. Query parsing: To divide it into fragments in an optimal way based on the analysis results.
3. Query optimization: To minimize data movement across the cluster and maximize query performance by selecting the most efficient execution plan.
4. Query routing: To route the query fragments to the relevant nodes that hold the required data.
5. Parallel query execution: ClickHouse executes query fragments in parallel for each node and collects the results.
6. Data Aggregation: Once the query fragments are processed, ClickHouse performs the necessary merging of intermediate results data and produces the final query result.
The distributed query execution approach brings faster data processing. This happens as each node in the cluster can separately process its assigned query part in parallel mode. Also, the multiple nodes and clusters structure increases database throughput, scalability, and fault tolerance.
Materialized views and aggregations
Materialized views are objects of a database. They store precalculated query results obtained from the underlying data. The resulting data for a materialized view is stored in a separate table. Each table refers to a certain executed query.
Materialized views are exceptionally useful for scenarios with frequently accessed queries. They also provide excellent performance results for complicated queries as they reduce the need for repetitive and resource-intensive calculations.
The materialized views approach is based on query operations like joins and data aggregation. There are lots of aggregation functions in ClickHouse. Here are some of them:
- COUNT: Calculates the number of rows or non-null values in a column.
- SUM: computes the sum of the values in a column.
- AVG: calculates the average (mean) of the values in a column.
- MIN: returns the minimum value in a column.
- MAX: returns the maximum value in a column.
- GROUP_ARRAY: aggregates values into an array within each group.
- GROUP_CONCAT: concatenates values into a string within each group.
- ANY: returns an arbitrary non-null value from a group.
- DISTINCT: removes duplicate values from a column or group.
- HLL_COUNT.MERGE: merges HyperLogLog counters to estimate unique values in a column.
- MEDIAN: computes the median value within a group.
- TOPK: returns the top-K values based on a specified criterion.
- QUANTILE: calculates the quantile value within a group.
- ARRAY JOIN: unrolls arrays and performs subsequent aggregations on the expanded data.
- argMin/argMax: returns the argument (value or index) associated with the minimum or maximum value in a group.
It is also possible to use combinators for aggregation functions. They help to perform additional calculations, filtering, or transformations on the aggregated results.
On top of that, ClickHouse supports user-defined aggregate functions (UDAFs). These functions allow custom aggregation logic to support specific requirements.
Different versions of ClickHouse support different aggregation functions' availability and syntax. To get the most up-to-date and full information please refer to ClickHouse documentation.
Vectorized query execution
Vectorized query execution is an important optimization technique in ClickHouse. It contributes remarkably to ClickHouse’s speed and helps to improve performance. Vectorized query execution operates on data in batches or chunks and processes several rows or columns as a single unit.
A more standard row-by-row process of reading data works on an individual row basis. It operates on only one row and a single query at a time. Only after it gets the results for all rows of data can it provides the final query result. In comparison, ClickHouse vectorized query execution can process the calculations for multiple rows at once.
With its parallel approach, the ClickHouse database provides higher query processing for large datasets or complex queries. This leads to improved processing speed.
What do these features allow ClickHouse to do?
ClickHouse has a number of technical solutions that make it really fast, even for complicated and resource-intensive analytical queries. Its features also make ClickHouse a very scalable database. It can be expanded according to the specific business needs.
Parallel query execution.
Thanks to vectorized and distributed query execution, it is possible to process query parts in parallel and to operate on many data batches at once. Also, several data analytics can process several competitive queries at one time without losing speed.
Prearranged data operations.
Data aggregation together with data compression and encoding allows faster calculations for big queries with numerous parameters. It is also a great feature for often repeated queries or queries with equal scenarios.
The online analytical processing is based on preliminary query results. That means that at least part of query processing is already done and data analytics can get the final query result in a very short time.
Fault tolerance and load balancing.
ClickHouse can sensibly distribute queries to less loaded cluster nodes. This helps avoid overloading of specific nodes or usage of a node that, for some reason, becomes unavailable.
Columnar storage architecture, query execution distributed by multiple nodes, and data compression algorithms allow storing very large datasets with a reasonable place on the disk used. It means that you are able to store and supplement as much data as you need according to the business specifically. You don’t have to delete data, saving disk space.
To sum up, thanks to its approach to data structure and optimization features, ClickHouse is a very fast, reliable, and modern data storage system. It is great for companies with large and growing data arrays. It also performs well with analytical tasks that have complex and numerous queries.
ClickHouse performance example
If you don’t have several different databases where the same data is stored, it will be difficult to test the query performance speed for different databases on your own. The good news is you can easily find plenty of the results of already organized performance speed tests.
Here is the latest example from DoubleCloud. This time we compared ClickHouse and PostgreSQL. Processing the same query through the same dataset, we got the query result more than 13 times faster with ClickHouse!