We at DoubleCloud truly live the promise of open source: staying open for everyone. That should also be reflected in where your services can run, and supporting only a single cloud provider like AWS contradicts that value. So, of course, we want our services to be available on many different providers.
But we also need to be able to offer the same services on every provider, which means a few requirements must be met before we can add another provider like Google Cloud:
1. Multiple high-availability zones in regions on different continents
2. Easy scalability of elastic disks
3. An object storage solution like S3 for backups and hybrid storage
4. A variety of instance types, with scalable CPU and RAM and the ability to deploy them in most regions
5. Tools to set up and access this infrastructure automatically
As a managed service provider, you typically want similar setups for all your customers, so you don't run into issues simply because of differing infrastructure. Points 1 and 4 therefore play strongly together: for example, some instance types exist only in certain availability zones, not in all of them.
To avoid split-brain scenarios in high-availability setups, a region and instance type can only be used for a managed ClickHouse installation if that instance type is available in at least three availability zones of the region.
This already limited the list of possible combinations we needed to evaluate, but there were still plenty of options to investigate further.
Once we had worked through the basic tests with the operating system, load balancing, networking, etc., we could directly compare the different instance types on our final list.
In the end, we ran a series of tests with instance types like N2, T2D, N2D, C3, etc. Once all those tests were done, we also needed to consider pricing. We are not targeting the fastest possible results but the best performance per dollar spent.
Across all the points we checked, N2D instance types proved to offer the best value for our customers' most typical workloads.
There are around 1.08 billion rows in the table. For ClickHouse, that's not really a big number, so we can use a rather small setup: a single node with only 2 vCPUs and 8 GB of RAM.
First, we start with a very simple aggregation query to get the average temperature per decade:
```sql
SELECT ROUND(AVG(tempAvg) / 10, 1) AS avgTemp,
       TRUNCATE(YEAR(date), -1) AS dekade
FROM weather  -- hypothetical table name; the original post does not show the FROM clause
WHERE date > '1970-01-01'
GROUP BY dekade
ORDER BY dekade;
```
Please note that we limit the query to dates from 1970 onwards to make sure the selected parts and columns fit into the file system cache, so that only compute power affects the speed, not networking or disk:
(Results table: query timings on DC AWS x64, DC AWS arm, and DC GCP x64.)
So the new instance type on GCP is 2.5 seconds faster for this query, which looks promising. Let's check a more complicated one:
```sql
SELECT year,
       ROUND(avgMaxTemp, 1) AS avgMaxTemp,
       ROUND(rolling, 1) AS rolling
FROM (
    SELECT year, avgMaxTemp,
           AVG(avgMaxTemp) OVER (
               ORDER BY year ROWS BETWEEN 15 PRECEDING AND CURRENT ROW
           ) AS rolling
    FROM (
        SELECT AVG(tempMax) / 10 AS avgMaxTemp, YEAR(date) AS year
        FROM weather  -- hypothetical table name; the original post does not show the FROM clause
        WHERE date > '1970-01-01'
        GROUP BY year
    ) AS yearlyAverage
)
ORDER BY year;
```
This query uses a window function to compute a rolling average over the last 15 years. Let's look at the timings:
(Results table: window-function query timings on DC AWS x64, DC AWS arm, and DC GCP x64.)
Again, the new GCP instances shaved about 3 seconds off the 7.9-second runtime.
Our next topic before we can come to a conclusion: which disk types should we use? Cloud providers offer a variety of disk types that can be attached to your instances. In AWS, there are GP2 and GP3, provisioned-IOPS volumes, and even magnetic disks.
At DoubleCloud we used GP2 in the past and upgraded all of our clusters to GP3 a few months back. For small instances, that means a minimum of 3,000 IOPS and a minimum throughput of 125 MB/s, with higher numbers for bigger nodes (based on disk size), and the option to increase the values manually via support ticket if really needed.
Based on our tests, GCP's closest equivalent to AWS GP3 is the pd-balanced disk type, with a minimum of 3,000 IOPS (up to 16,000 depending on disk and instance size) and a throughput of 149 MB/s.
So again, we gain some performance boost without the need to invest more money.
But measuring disk performance alone doesn't tell us much about our use case. ClickHouse compresses data in nearly all situations, so raw disk performance on its own means little: every write and read operation involves the CPU as well.
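To get a feel for how much compression is in play, you can compare compressed and uncompressed sizes in ClickHouse's system tables; this is a standard query against `system.parts`:

```sql
-- Per-table compression ratio: on-disk (compressed) vs. raw (uncompressed) bytes.
SELECT table,
       formatReadableSize(sum(data_compressed_bytes))   AS compressed,
       formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
       round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
FROM system.parts
WHERE active
GROUP BY table
ORDER BY sum(data_compressed_bytes) DESC;
```

The higher the ratio, the more CPU work is hidden behind every read and write.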
For a real-world test, I chose the copy test: create a new table, select all of the data from the original table, and insert it into the new one.
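Sketched in ClickHouse SQL, with hypothetical table names for illustration, the copy test looks like this:

```sql
-- Clone the structure (including engine and sort key) of the source table.
-- "weather" and "weather_copy" are hypothetical names, not from the original post.
CREATE TABLE weather_copy AS weather;

-- Read, decompress, recompress, and write every row back out,
-- exercising disk throughput and CPU at the same time.
INSERT INTO weather_copy SELECT * FROM weather;
```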
That way we utilize the disk, but compression also taxes the CPU, just as it would in a real write-heavy workload. These are our results:
(Results table: copy-test timings on DC AWS x64 GP3, DC AWS arm GP3, and DC GCP x64 pd-balanced.)
As you can see when comparing the AWS Arm and x64 values, the CPU makes a huge difference in ClickHouse's write throughput.
Still, with GCP we were another 20% faster than the fastest AWS-based machine.
Of course, we tested many more queries and combinations before deciding what to use, but the results all pointed in the same direction: almost everywhere, execution was between 30% and 60% faster in the setup we finally chose.
As for our own margin, our pricing is based directly on infrastructure cost, which means the savings of GCP over comparable AWS infrastructure (typically around 10-15%) are passed straight on to our customers.
All in all, that means you get around a 30-40% boost in price/performance, or even more than 50% in some cases, when choosing our new GCP-based clusters.
That’s worth a try, isn’t it? To request access, simply fill out the form that appears when you select the Google Cloud option while creating a new cluster here.