Is hybrid storage the way forward for handling big data?
Written By: Stefan Kaeser, DoubleCloud Solution Architect
When Mark Litwintschik wrote his Blogpost about 1.1 Billion Taxi Rides in ClickHouse on DoubleCloud we were flattered to see that Doublecloud were the fastest in his benchmarks regarding cloud offerings. But after Mark’s blog went live we received several questions about how this ‘gamechanger’ hybrid storage works, and what it was in the first place.
What Does Hybrid Storage Mean?
Let’s start a little bit in the past.
In the days before cloud was a thing, people already knew about storage and how they used it. Storage could be very fast like Ramdisks, very flexible to move like floppy disks, reliable and usable like hard drives or very big and cheap (regarding price per GB) when using tapes.
In the past (and today) you had to decide whether you wanted fast storage which is more expensive, or big storage which is cheaper but also slower.
Times have changed however. Instead of tapes we now have object storage like Amazon S3, instead of floppy disks we have usb sticks and whilst hard drives still exist, there’s a greater variety now when you count SSDs and NVMEs; only memory has stayed the same (very fast and expensive).
But what didn’t change is the logic behind the different types of storage.
You still have to decide whether you need to spend more money, or you lose speed. And another thing did not change: You have to manage what kind of data uses what kind of storage. In the past this could mean you have to have a person who has to change the tapes, you have to write jobs to move data from one system to another or you have to change your application to access archived data differently and so on.
This is where hybrid storage comes into play.
Hybrid storage means that different types of storage will be integrated into one, giving you the best of both worlds.
Hybrid promises you to save cost, while still being fast most of the time, handling the data movement in the background.
When using harddrives, there are some hybrid drives called SSHD. They combine a normal spinning harddrive with a fast ssd within the same case. The controller on board moves the data between the different parts of the hardware based on rules like how often it is accessed or how long since the last access etc.
You as a user don’t have to care about it and just use the drive as one storage.
Kinda the same but of course on a bigger level is hybrid storage on DoubleCloud.
It combines cheap but slower S3 object storage with fast but expensive GP2 local storage.
That way you as a user can just write your data into a table and don’t have to care about moving old data to S3, changing your application to have different access patterns etc.
DoubleCloud takes care of it.
Setup Hybrid Storage For Your Tables
As Mark’s blog gave us a good heads up about the data he loves to use for benchmarks, we’ll continue with his data and table definitions. Therefore I’ll just give a quick reminder about his setup:
For comparison we will also use the s1-c32-m128 instance type with 512GB GP2 storage in the region Frankfurt.
We’ll also make use of his test dataset of 1.1 billion rows of taxi rides and the table structure he shared with us:
CREATE TABLE ebs.trips_hybrid (
trip_id UInt32,
vendor_id vendor_id AS Enum8('1' = 1,
'2' = 2,
'CMT' = 3,
'VTS' = 4,
'DDS' = 5,
'B02512' = 10,
'B02598' = 11,
'B02617' = 12,
'B02682' = 13,
'B02764' = 14),
pickup_datetime DateTime,
dropoff_datetime Nullable(DateTime),
store_and_fwd_flag Nullable(FixedString(1)),
rate_code_id Nullable(UInt8),
pickup_longitude Nullable(Float64),
pickup_latitude Nullable(Float64),
dropoff_longitude Nullable(Float64),
dropoff_latitude Nullable(Float64),
passenger_count Nullable(UInt8),
trip_distance Nullable(Float64),
fare_amount Nullable(Float32),
extra Nullable(Float32),
mta_tax Nullable(Float32),
tip_amount Nullable(Float32),
tolls_amount Nullable(Float32),
ehail_fee Nullable(Float32),
improvement_surcharge Nullable(Float32),
total_amount Nullable(Float32),
payment_type Nullable(String),
trip_type Nullable(UInt8),
pickup Nullable(String),
dropoff Nullable(String),
cab_type Enum8('yellow' = 1, 'green' = 2))
AS cab_type,
precipitation Nullable(Int8),
snow_depth Nullable(Int8),
snowfall Nullable(Int8),
max_temperature Nullable(Int8),
min_temperature Nullable(Int8),
average_wind_speed Nullable(Int8),
pickup_nyct2010_gid Nullable(Int8),
pickup_ctlabel Nullable(String),
pickup_borocode Nullable(Int8),
pickup_boroname Nullable(String),
pickup_ct2010 Nullable(String),
pickup_boroct2010 Nullable(String),
pickup_cdeligibil Nullable(FixedString(1)),
pickup_ntacode Nullable(String),
pickup_ntaname Nullable(String),
pickup_puma Nullable(String),
dropoff_nyct2010_gid Nullable(UInt8),
dropoff_ctlabel Nullable(String),
dropoff_borocode Nullable(UInt8),
dropoff_boroname Nullable(String),
dropoff_ct2010 Nullable(String),
dropoff_boroct2010 Nullable(String),
dropoff_cdeligibil Nullable(String),
dropoff_ntacode Nullable(String),
dropoff_ntaname Nullable(String),
dropoff_puma Nullable(String)
) ENGINE = MergeTree
PARTITION BY toYYYYMM(pickup_date)
ORDER BY (pickup_datetime)
;
Making use of the hybrid storage in DoubleCloud is easy and just requires you to change one setting:
ALTER TABLE hybrid.trips_hybrid MODIFY SETTING storage_policy = 'hybrid_storage';
Let’s assume we already imported all the data into the trips_hybrid table, now what data is on S3 and what data resides on local storage?
Currently all data will be on local storage, because right now ClickHouse doesn’t know which data could be considered as ‘old’.
That’s the last thing we have to configure ourselves.
As the data contains records for the years 2009 to 2015, we decided that the cut off should be on December 31st 2014.
In real case scenarios, you most likely won’t have a fixed date but something like older than 90 days which is of course possible as well.
So all we have to do now, is inform ClickHouse that data before 2015 should be move to object storage:
ALTER TABLE hybrid.trips_hybrid
MODIFY TTL pickup_date + toIntervalDay(dateDiff('day',
toDate('2014-12-31'), now()))
TO DISK 'object_storage'
And that’s everything.
Only two settings you have to adjust and ClickHouse and DoubleCloud take care of everything under the hood.
You don’t have to change a single query in your application, as all that is done behind the curtain.
Benchmarks
Ok, so now as we have parts of our data in S3, we can redo Marks benchmark queries on our hybrid storage set:
Query 1:
SELECT cab_type, count(*)
FROM ebs.trips_mergetree
GROUP BY cab_type;
Query 2:
SELECT passenger_count,
avg(total_amount)
FROM ebs.trips_mergetree
GROUP BY passenger_count;
Query 3:
SELECT passenger_count,
toYear(pickup_date) AS year,
count(*)
FROM ebs.trips_mergetree
GROUP BY passenger_count,
year;
Query 4:
SELECT passenger_count,
toYear(pickup_date) AS year,
round(trip_distance) AS distance,
count(*)
FROM ebs.trips_mergetree
GROUP BY passenger_count,
year,
distance
ORDER BY year,
count(*) DESC;
With data on S3, cold queries will take significantly longer, this is the tradeoff for using object storage instead of local storage.
But if you run the same query twice, ClickHouse will use internal caching to speed up the subsequent queries.
So what are the final results?
Query |
Local |
Hybrid Cold |
|
Q1 |
0.347 |
4.581 |
1.077 |
Q2 |
1.1 |
14.768 |
3.254 |
Q3 |
1.389 |
4.925 |
2.392 |
Q4 |
2.935 |
14.633 |
4.472 |
As you can see when the cache can be used, queries tend to be 1.5 — 3 slower.
However when caches are cold, you can even see a query time increase by a factor of 10!
Now is this a good result or not?
Of course it depends on your workload. If you query old data all the time, then a factor of 3 or more will be critical.
But that’s not the use case hybrid storage is meant for.
In our benchmark, if you just limit your queries to the current year (2015 in this case) you don’t have ANY slow down in query speed. You might even get a small speed up, as queries on old data would not mess up with your filesystem caches.
Final thoughts
Regarding the fact that S3 object storage costs are around a factor of five times cheaper than EBS storage, using a hybrid storage solution is an easy way to save you money.
As you can set up hybrid storage on a table level and configure different TTLs for different tables, you can easily adjust the setting to your use cases.
And you don’t have to change your application at all, you’ll just optimize your monthly bill by changing settings in your clusters.
But of course hybrid storage is no silver bullet.
If you have to access the majority of your data all the time, then speed reduction by a factor of 3 might not be worth it.
However, as most access patterns on real data tend to use active data 99% of the time and the rest of the data only 1% of the time, it’s worth it for those 1% to wait a little longer if you can spare a big amount of money.
In this article, we’ll talk about: