New product launch | Managed Airflow is now generally available. Learn more →

From Elasticsearch to ClickHouse: solving performance challenges in text analysis

Highlights

Achieved sub-second aggregations, down from 8 seconds in Elastic.

Cluster costs were reduced by 60%.

Overview

Recently, our team received a challenging task from a client. The AdTech player was facing the need to analyze a vast amount of social media and web posts, along with the requirement for fast aggregation of all the results. This required processing a large volume of text within tight timelines. Additionally, their product demanded the ability to perform real-time generation of Ads suggestions. To achieve this, it was crucial to analyze the massive amount of textual information as quickly as possible in order to build further recommendations.

The problem

Initially, the client had been using Elasticsearch (ES) for text processing and analytics, as is commonly done by many organizations. However, as the volume of the text increased, they encountered performance issues. Queries were taking too long, reaching up to 10 seconds, which did not meet their requirements.

In their search for a solution, the client discovered ClickHouse®, a column-oriented database specifically designed for handling enormous data volumes with high performance and low latency. However, ClickHouse lacked built-in text processing capabilities that were available in Elasticsearch.

That’s when they reached out to DoubleCloud, seeking assistance in adapting ClickHouse to their specific case of text data processing.

Architecture of the solution

The client had initially set up a robust infrastructure for their needs. They built an Elasticsearch cluster consisting of 10 nodes, each equipped with 16 cores and 64GB of RAM.

The intention was to leverage the power of the Elasticsearch cluster for text processing tasks and then extract a set of post IDs to be utilized for aggregation purposes in the ClickHouse cluster.

For the ClickHouse setup on DoubleCloud, they deployed a single node with 16 cores and 64GB of RAM, and they began their testing phase.

The ClickHouse table they used (simplified here for you) looked roughly like this:

CREATE TABLE posts (
  postId String, 
  location String, 
  infos Array(String)
) ENGINE = MergeTree 
ORDER BY 
  postId

An example of a typical ClickHouse query they would have been running would look like this:

Get postIds 
from 
  ES 
where 
  posttext matches DoubleCloud 
SELECT 
  COUNT(*), 
  location 
FROM 
  posts 
WHERE 
  postId IN ({ES PostIds}) 
GROUP BY 
  location 
ORDER BY 
  COUNT(*) DESC 
LIMIT 3

Their initial results with a generated dataset showed promising performance, as the aggregations on ClickHouse met their expectations in terms of speed. However, their optimism quickly faded when they began testing with real datasets.

While searching for a specific word in Elasticsearch performed admirably and lived up to their hopes, dealing with approximately 200 million posts in the actual dataset presented significant challenges. The search for those specific words yielded a list of over 5 million post IDs.

Naturally, transferring 5 million post IDs over their network, considering that post IDs were strings with varying lengths, introduced considerable latency. Moreover, incorporating these post IDs into a query for analysis also incurred additional processing time.

Although the solution was functional, the amount of time it took to complete the process exceeded their expectations. They were unable to achieve a performance below 10 seconds overall, which proved to be far too long for their client’s production use case.

What did we do to improve the speed of the queries?

Together with our client, we looked into the secondary indexes in ClickHouse. These indexes were intended to streamline data retrieval by skipping the reading of unnecessary information. Intrigued by the possibilities, our client discovered that ClickHouse also supported Bloom Filters, a probabilistic data structure that seemed to align perfectly with their requirements (although its effectiveness may vary depending on the specific use case).

Motivated by the desire to eliminate the back-and-forth movement of PostIDs between ES and ClickHouse, they decided to store the complete post information directly in ClickHouse. This decision aimed to eliminate the “round trip” that had been causing considerable time overhead.

That made the new table look a little bit like this:

CREATE TABLE posts (
  postId String, 
  location String, 
  postText String, 
  infos Array(String), 
  INDEX idx_post postText TYPE bloom_filter(0.35) GRANULARITY 2
) ENGINE = MergeTree 
ORDER BY 
  postId

The client dedicated significant effort to fine-tuning the index settings, hoping to achieve a notable improvement in query performance. Although the example queries did exhibit faster execution, the results were still taking several seconds to return, falling far short of their sub-second goal.

The main obstacle proved to be the lack of a sorting mechanism within the posts table. Normally this sorting would enable efficient utilization of the skipping index. Consequently, ClickHouse still had to read a significant number of granules from the disk, hindering the desired speed improvements.

In an attempt to optimize the logic, we introduced a sharding approach to expedite the process. However, despite our efforts, the query response time remained in the range of a few seconds, falling short of the desired performance level.

Solution found: denormalization

In our pursuit of a viable solution, our Solution Architects explored various approaches, each showing promise—until they encountered the real dataset.

One of the most significant challenges they faced was the inability to reduce the amount of scanned data due to the lack of additional metadata. However, fueled by our passion for tackling intricate problems, we finally built a rather unconventional solution (and we are proud of it).

The breakthrough came in the form of denormalizing the posts table based on the words contained within each post. This approach enabled local scans of posts sharing common words, leveraging the skipping index as a primary key. This ingenious solution allowed for efficient data retrieval and made the best possible use of the available resources.

That meant the posts text got tokenized and the rows were multiplied by the amount of words within, like this:

CREATE TABLE posts (
  word String,
  postIdHash Int64,
  postId String,
  location String,
  postText String,
  infos Array(String)
) ENGINE=MergeTree
ORDER BY word, postIdHash

With this structure, we could then easily query ClickHouse directly and do all the logic within one Query and table:

SELECT 
  COUNT(*), 
  location 
FROM 
  posts 
WHERE 
  word = ‘DoubleCloud’ 
GROUP BY 
  location 
ORDER BY 
  COUNT(*) DESC 
LIMIT 
  3

The outcome

During the development of our client’s product, it became evident that the existing solution based on Elasticsearch required reevaluation. ES is undoubtedly the industry standard for handling large volumes of textual data. However, its limitations hindered the ability to analyze data in near real-time, a crucial requirement of our client.

In this context, ClickHouse stood out with its unparalleled speed, allowing us to overcome the architectural bottlenecks and propel the business forward. ClickHouse enables the analysis of vast data volumes almost in real-time, addressing the performance, scalability, and cost challenges our client faced with Elasticsearch.

Furthermore, ClickHouse is an open-source solution, providing cost-efficiency compared to vendor-lock alternatives. At DoubleCloud, we provide managed open-source ClickHouse, ensuring our customers avoid any vendor lock-in. We also prioritize transparent and predictable pricing. Our DoubleCloud platform and team of experienced Solution Architects are ready to assist you in enhancing your existing solution or migrating from your current platform to tackle performance, scalability, and cost issues.

By leveraging Managed ClickHouse on DoubleCloud in conjunction with our experts' expertise, you can create an efficient solution that surpasses limitations and elevates your platform to new heights.

Get started with DoubleCloud

ClickHouse® is a trademark of ClickHouse, Inc. https://Clickhouse.com