
DoubleCloud’s Managed ClickHouse on Google Cloud Platform - Get more performance for your money

Written By: Stefan Kaeser, DoubleCloud Senior Solution Architect

After a long period of development, testing, evaluation and planning, DoubleCloud has finally beta-released support for its second public cloud provider: Google Cloud Platform (GCP).

But what does this mean for our customers? Is it just the same as before, only with a different logo, or are there any differences?

Long story short: As a customer, you get better performance for nearly the same price!

Heading to a new cloud provider

We at DoubleCloud really live the promise of open source to stay open for everyone, and that should also be reflected in the choice of where your services run. Supporting only a single cloud provider like AWS somewhat contradicts that value. This means, of course, that we want our services to be available on many different providers.

But we also need to be able to provide the same services on every provider we offer, meaning there are some requirements that have to be met before we can add another provider like Google Cloud:

  • Multiple high availability zones in regions on different continents

  • Easy scalability of elastic disks

  • An object storage solution like S3 for backups and hybrid storage

  • A variety of instance types, with scalable CPU and RAM, and the ability to deploy them in most regions

  • Tools to set up and access this infrastructure automatically

As a managed service provider, you typically want similar setups for all your customers, so you don’t run into issues simply because of different infrastructure. Therefore, points 1 and 4 play strongly together. For example, some instance types only exist in some availability zones but not in all of them.

To avoid split-brain scenarios in high-availability setups, a region and instance type can only be used for a managed ClickHouse installation if that instance type is available in at least three availability zones of the region.

This already limits the list of possible combinations we needed to evaluate, but there were still a lot of options to investigate further.

Once we had completed the basic tests covering the operating system, load balancing, networking, etc., we could directly compare the different instance types on our shortlist.

In the end, we ran a bunch of tests with instance types like N2, T2D, N2D, C3, etc. Once all those tests were done, we also needed to consider pricing options. We are not targeting the fastest possible results, but the best performance per dollar spent.

Considering all the points we checked, N2D instance types proved to give the best value for our customers' most typical workloads.


Compute tests

So we have decided to go with N2D instances, but what can you as a customer expect from these new GCP clusters on DoubleCloud compared to our existing AWS-based x64 or ARM setups?

Let’s test it with one of my favorite datasets: the weather data since 1900 (the same one I used in my blog post about denormalization).

The main table looks like this (I omit fields not needed for this test):

CREATE TABLE sensor_data
(
    `station_id` LowCardinality(String),
    `date` Date32 CODEC(Delta(4), ZSTD(1)),
    `tempAvg` Nullable(Int32),
    `tempMax` Nullable(Int32),
    `tempMin` Nullable(Int32),
    …
)
ENGINE = MergeTree
PARTITION BY toYear(date)
ORDER BY (station_id, date)
SETTINGS index_granularity = 8192

There are around 1.08 billion rows in the table. For ClickHouse, that’s not really a big number, so we can use a rather small setup: a single node with only 2 vCPUs and 8 GB of RAM.
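If you want to verify numbers like this yourself, a quick sanity check against system.parts would look something like the sketch below (the weather_new database name is taken from the query further down; adjust it to your own setup):

-- Total rows and on-disk size of all active parts of the table
SELECT
    sum(rows) AS total_rows,
    formatReadableSize(sum(bytes_on_disk)) AS size_on_disk
FROM system.parts
WHERE database = 'weather_new'
  AND table = 'sensor_data'
  AND active;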

First, we start with a very simple aggregation query to get the average temperature per decade:

SELECT ROUND(AVG(tempAvg) / 10, 1) AS avgTemp, 
    TRUNCATE(YEAR(date), -1) AS dekade
FROM weather_new.sensor_data
WHERE date > '1970-01-01'
GROUP BY dekade
ORDER BY dekade;
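As a side note, TRUNCATE(YEAR(date), -1) simply rounds the year down to a multiple of ten. An equivalent way to bucket by decade (just a sketch, not the query we benchmarked) would be:

-- Equivalent decade bucketing using integer division
SELECT ROUND(AVG(tempAvg) / 10, 1) AS avgTemp,
    intDiv(YEAR(date), 10) * 10 AS dekade
FROM weather_new.sensor_data
WHERE date > '1970-01-01'
GROUP BY dekade
ORDER BY dekade;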

Please note that we limit our query to dates from 1970 onwards to make sure the selected parts and columns fit into the file system cache, so that only compute power impacts the speed, not networking or disk:

Type          Time
DC AWS x64    6.2 sec
DC AWS arm    5.3 sec
DC GCP x64    3.7 sec

So the new instance type on GCP is 2.5 seconds faster for this query, which looks promising. Let’s check a more complicated one:

SELECT year, 
    ROUND(avgMaxTemp, 1) AS avgMaxTemp, 
    ROUND(rolling, 1) AS rolling
FROM (
  SELECT year, avgMaxTemp, 
    AVG(avgMaxTemp) OVER (
      ORDER BY year ROWS BETWEEN 15 PRECEDING AND CURRENT ROW
    ) AS rolling
  FROM (
    SELECT AVG(tempMax) / 10 AS avgMaxTemp, YEAR(date) AS year 
    FROM sensor_data 
    WHERE date > '1970-01-01' 
    GROUP BY year
  ) AS yearlyAverage
) 
ORDER BY year;

This query uses a window function to get a rolling average over the current and the 15 preceding years. Let’s look at the timings:

Type          Time
DC AWS x64    7.9 sec
DC AWS arm    6.3 sec
DC GCP x64    4.9 sec

Again, we could shave 3 seconds off the 7.9 seconds by using our new GCP instances.

Disk tests

Our next topic before we can come to a conclusion: Which disk types should we use? Cloud providers offer a bunch of different disk types that can be attached to your instances. In AWS there are GP2 and GP3, provisioned-IOPS volumes, and even magnetic disks.

At DoubleCloud we used GP2 in the past and upgraded all of our clusters to GP3 a few months back. For small instances, that means a minimum of 3,000 IOPS and a minimum throughput of 125 MB/s, with higher numbers for bigger nodes (based on disk size) and the option to increase the values manually via a support ticket if really needed.

Based on our tests, GCP’s closest equivalent to AWS GP3 is the pd-balanced disk type, with a minimum of 3,000 IOPS (up to 16,000 depending on disk and instance size) and a throughput of 149 MB/s.

So again, we gain some performance boost without the need to invest more money.

But measuring disk performance alone doesn’t tell us much for our use case. ClickHouse uses compression in nearly all cases, so raw disk performance on its own means little, as all write and read operations involve the CPU as well.
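To get an impression of how much compression is involved, you could look at per-column compressed vs. uncompressed sizes; a sketch of such a check (again assuming the weather_new database) might look like this:

-- Per-column compression ratio for the sensor_data table
SELECT
    name,
    formatReadableSize(data_compressed_bytes) AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio
FROM system.columns
WHERE database = 'weather_new'
  AND table = 'sensor_data'
ORDER BY data_compressed_bytes DESC;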

For a real-world test, I chose a copy test: create a new table, select all of the data from the original table, and insert it into the new one.
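A minimal sketch of what such a copy test looks like in SQL (the target table name sensor_data_copy is just illustrative):

-- Create an empty copy with the same structure and engine settings
CREATE TABLE sensor_data_copy AS sensor_data;

-- Copy all data; every inserted block is compressed again on write
INSERT INTO sensor_data_copy SELECT * FROM sensor_data;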

This way we utilize the disk, but compression also keeps the CPU busy, just as in a real write-heavy workload. These are our results:

Type                      Time
DC AWS x64 GP3            770 sec
DC AWS arm GP3            502 sec
DC GCP x64 pd-balanced    418 sec

As you can see when comparing the AWS ARM and x64 values, the CPU makes a huge difference to write throughput in ClickHouse.

But even so, with GCP we were again around 20% faster than the fastest AWS-based machine.

Summary

Of course, we tested a lot more queries and combinations before deciding what to use, but the results all pointed in the same direction: almost everywhere, execution was between 30% and 60% faster in the setup we chose.

As for our own margin, our pricing is based directly on infrastructure cost, which means the savings of GCP over AWS base infrastructure (typically around 10-15%) are passed on directly to our customers.

All in all, that means you’ll get around a 30-40% boost in price/performance, or even more than 50% in some cases, when choosing our new GCP-based clusters.

That’s worth a try, isn’t it? To request access, simply fill out the form that appears when you select the Google Cloud option while creating a new cluster here.

