Is hybrid storage the way forward for handling big data?
Written By: Stefan Kaeser, DoubleCloud Solution Architect
November 2, 2022
When Mark Litwintschik wrote his blog post about 1.1 Billion Taxi Rides in ClickHouse on DoubleCloud, we were flattered to see that DoubleCloud was the fastest cloud offering in his benchmarks. But after Mark’s blog went live, we received several questions about how this ‘gamechanger’ hybrid storage works, and what it is in the first place.
In the days before cloud was a thing, people already knew about storage and how to use it. Storage could be very fast like RAM disks, very portable like floppy disks, reliable and usable like hard drives, or very big and cheap (in price per GB) like tapes.
Then, as now, you had to decide whether you wanted fast storage, which is more expensive, or big storage, which is cheaper but also slower.
Times have changed, however. Instead of tapes we now have object storage like Amazon S3, instead of floppy disks we have USB sticks, and whilst hard drives still exist, there’s a greater variety now when you count SSDs and NVMe drives; only memory has stayed the same (very fast and expensive).
But what didn’t change is the logic behind the different types of storage.
You still have to choose between spending more money and losing speed. And another thing did not change: you have to manage which data lives on which kind of storage. In the past this could mean employing a person to change the tapes, writing jobs to move data from one system to another, or changing your application to access archived data differently.
This is where hybrid storage comes into play.
Hybrid storage means that different types of storage are integrated into one, giving you the best of both worlds.
Hybrid storage promises to save costs while still being fast most of the time, handling the data movement in the background.
For hard drives, there are hybrid drives called SSHDs. They combine a normal spinning hard drive with a fast SSD in the same case. The on-board controller moves data between the two parts of the hardware based on rules such as how often the data is accessed or how long ago it was last accessed.
You as a user don’t have to care about it and just use the drive as one storage.
Hybrid storage on DoubleCloud works much the same way, just on a bigger scale.
It combines cheap but slower S3 object storage with fast but expensive local gp2 (EBS) storage.
That way you as a user can just write your data into a table and don’t have to care about moving old data to S3, changing your application to have different access patterns etc.
In our benchmark, when the cache can be used, queries on data stored in S3 tend to be 1.5 to 3 times slower.
However, when caches are cold, query times can increase by as much as a factor of 10!
Now is this a good result or not?
Of course it depends on your workload. If you query old data all the time, then a factor of 3 or more will be critical.
But that’s not the use case hybrid storage is meant for.
In our benchmark, if you limit your queries to the current year (2015 in this case), you don’t see ANY slowdown in query speed. You might even get a small speedup, as queries on old data no longer push recent data out of your filesystem caches.
Given that S3 object storage is around five times cheaper than EBS storage, using a hybrid storage solution is an easy way to save money.
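To make the cost argument concrete, here is a minimal sketch of the blended price per GB when only part of the data stays on EBS. The function name, the 0.10 per GB-month EBS price and the 20% hot share are placeholders chosen for illustration, not DoubleCloud figures; only the "around five times cheaper" ratio comes from the text above.

```python
def blended_cost_per_gb(hot_fraction: float,
                        ebs_price: float = 0.10,   # placeholder price per GB-month
                        s3_ratio: float = 5.0) -> float:
    """Average monthly price per GB when `hot_fraction` of the data stays
    on EBS and the rest lives in S3, with S3 `s3_ratio` times cheaper."""
    if not 0.0 <= hot_fraction <= 1.0:
        raise ValueError("hot_fraction must be between 0 and 1")
    s3_price = ebs_price / s3_ratio
    return hot_fraction * ebs_price + (1.0 - hot_fraction) * s3_price


all_on_ebs = blended_cost_per_gb(1.0)
hybrid = blended_cost_per_gb(0.2)        # assumed: 20% of the data is "hot"
savings = 1.0 - hybrid / all_on_ebs

print(f"all on EBS: {all_on_ebs:.3f} per GB-month")
print(f"hybrid:     {hybrid:.3f} per GB-month")
print(f"savings:    {savings:.0%}")
```

With these assumed numbers the storage bill drops by roughly two thirds; the smaller the hot share, the closer the saving gets to the full factor of five.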
As you can set up hybrid storage on a table level and configure different TTLs for different tables, you can easily adjust the setting to your use cases.
And you don’t have to change your application at all; you just optimize your monthly bill by changing settings in your clusters.
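As a rough sketch of what per-table TTLs can look like, the helper below builds a ClickHouse `ALTER TABLE ... MODIFY TTL ... TO DISK` statement for each table. The table names, date columns and retention periods are hypothetical, and the disk name `object_storage` is an assumption; check the storage policy of your own cluster for the actual disk or volume name.

```python
def move_ttl_statement(table: str, date_column: str, keep_local_days: int,
                       disk: str = "object_storage") -> str:
    """Build a ClickHouse statement that moves rows older than
    `keep_local_days` to the given disk. The default disk name is an
    assumption, not a confirmed DoubleCloud setting."""
    if keep_local_days <= 0:
        raise ValueError("keep_local_days must be positive")
    return (f"ALTER TABLE {table} MODIFY TTL "
            f"{date_column} + INTERVAL {keep_local_days} DAY TO DISK '{disk}'")


# Hypothetical tables with different retention needs on fast storage.
tables = {
    "trips": ("pickup_date", 365),    # keep one year of rides local
    "app_logs": ("event_date", 30),   # logs go cold much faster
}

for name, (column, days) in tables.items():
    print(move_ttl_statement(name, column, days))
```

The statements only change table settings, so the application keeps querying the same tables as before.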
But of course hybrid storage is no silver bullet.
If you have to access the majority of your data all the time, then a speed reduction by a factor of 3 might not be worth it.
However, as most access patterns on real data tend to hit active data 99% of the time and the rest of the data only 1% of the time, it’s worth waiting a little longer on that 1% of queries if it saves you a significant amount of money.
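The trade-off can be put into numbers with a small back-of-the-envelope calculation. The sketch below assumes all queries cost about the same on fast storage and simply weights the slowdown factors mentioned in this post (3 with warm caches, 10 with cold caches) by the share of queries that touch old data; the function name is made up for illustration.

```python
def expected_slowdown(cold_share: float, cold_factor: float) -> float:
    """Average relative query time when `cold_share` of queries hit old
    data and run `cold_factor` times slower; all others are unchanged."""
    if not 0.0 <= cold_share <= 1.0:
        raise ValueError("cold_share must be between 0 and 1")
    return (1.0 - cold_share) * 1.0 + cold_share * cold_factor


# 1% of queries hit old data: the average barely moves.
print(f"{expected_slowdown(0.01, 3):.2f}")    # about 1.02
print(f"{expected_slowdown(0.01, 10):.2f}")   # about 1.09

# Half of all queries hit old data: the average doubles.
print(f"{expected_slowdown(0.50, 3):.2f}")    # about 2.00
```

Under these assumptions a 99% / 1% access pattern costs a few percent of average query time, while a workload that constantly reads old data pays the full penalty.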