Is Hybrid Storage The Way Forward for Handling BigData?

November 2, 2022
15 mins to read

Written By: Stefan Kaeser, DoubleCloud Solution Architect

When Mark Litwintschik wrote his Blogpost about 1.1 Billion Taxi Rides in ClickHouse on DoubleCloud we were flattered to see that Doublecloud were the fastest in his benchmarks regarding cloud offerings. But after Mark’s blog went live we received several questions about how this ‘gamechanger’ hybrid storage works, and what it was in the first place.

What Does Hybrid Storage Mean?

Let’s start a little bit in the past.
In the days before cloud was a thing, people already knew about storage and how they used it. Storage could be very fast like Ramdisks, very flexible to move like floppy disks, reliable and usable like hard drives or very big and cheap (regarding price per GB) when using tapes.
In the past (and today) you had to decide whether you wanted fast storage which is more expensive, or big storage which is cheaper but also slower.
Times have changed however. Instead of tapes we now have object storage like Amazon S3, instead of floppy disks we have usb sticks and whilst hard drives still exist, there’s a greater variety now when you count SSDs and NVMEs; only memory has stayed the same (very fast and expensive).

But what didn’t change is the logic behind the different types of storage.
You still have to decide whether you need to spend more money, or you lose speed. And another thing did not change: You have to manage what kind of data uses what kind of storage. In the past this could mean you have to have a person who has to change the tapes, you have to write jobs to move data from one system to another or you have to change your application to access archived data differently and so on.

This is where hybrid storage comes into play.
Hybrid storage means that different types of storage will be integrated into one, giving you the best of both worlds.
Hybrid promises you to save cost, while still being fast most of the time, handling the data movement in the background.
When using harddrives, there are some hybrid drives called SSHD. They combine a normal spinning harddrive with a fast ssd within the same case. The controller on board moves the data between the different parts of the hardware based on rules like how often it is accessed or how long since the last access etc.
You as a user don’t have to care about it and just use the drive as one storage.
Kinda the same but of course on a bigger level is hybrid storage on DoubleCloud.
It combines cheap but slower S3 object storage with fast but expensive GP2 local storage.
That way you as a user can just write your data into a table and don’t have to care about moving old data to S3, changing your application to have different access patterns etc.
DoubleCloud takes care of it.

Setup Hybrid Storage For Your Tables

As Mark’s blog gave us a good heads up about the data he loves to use for benchmarks, we’ll continue with his data and table definitions. Therefore I’ll just give a quick reminder about his setup:

For comparison we will also use the s1-c32-m128 instance type with 512GB GP2 storage in the region Frankfurt.
We’ll also make use of his test dataset of 1.1 billion rows of taxi rides and the table structure he shared with us:

CREATE TABLE ebs.trips_hybrid (
    trip_id                 UInt32,
    vendor_id               vendor_id AS Enum8('1' = 1,
                                '2' = 2,
                                'CMT' = 3,
                                'VTS' = 4,
                                'DDS' = 5,
                                'B02512' = 10,
                                'B02598' = 11,
                                'B02617' = 12,
                                'B02682' = 13,
                                'B02764' = 14),
    pickup_datetime         DateTime,
    dropoff_datetime        Nullable(DateTime),
    store_and_fwd_flag      Nullable(FixedString(1)),
    rate_code_id            Nullable(UInt8),
    pickup_longitude        Nullable(Float64),
    pickup_latitude         Nullable(Float64),
    dropoff_longitude       Nullable(Float64),
    dropoff_latitude        Nullable(Float64),
    passenger_count         Nullable(UInt8),
    trip_distance           Nullable(Float64),
    fare_amount             Nullable(Float32),
    extra                   Nullable(Float32),
    mta_tax                 Nullable(Float32),
    tip_amount              Nullable(Float32),
    tolls_amount            Nullable(Float32),
    ehail_fee               Nullable(Float32),
    improvement_surcharge   Nullable(Float32),
    total_amount            Nullable(Float32),
    payment_type            Nullable(String),
    trip_type               Nullable(UInt8),
    pickup                  Nullable(String),
    dropoff                 Nullable(String),

    cab_type                Enum8('yellow' = 1, 'green' = 2))
                                AS cab_type,

    precipitation           Nullable(Int8),
    snow_depth              Nullable(Int8),
    snowfall                Nullable(Int8),
    max_temperature         Nullable(Int8),
    min_temperature         Nullable(Int8),
    average_wind_speed      Nullable(Int8),

    pickup_nyct2010_gid     Nullable(Int8),
    pickup_ctlabel          Nullable(String),
    pickup_borocode         Nullable(Int8),
    pickup_boroname         Nullable(String),
    pickup_ct2010           Nullable(String),
    pickup_boroct2010       Nullable(String),
    pickup_cdeligibil       Nullable(FixedString(1)),
    pickup_ntacode          Nullable(String),
    pickup_ntaname          Nullable(String),
    pickup_puma             Nullable(String),

    dropoff_nyct2010_gid    Nullable(UInt8),
    dropoff_ctlabel         Nullable(String),
    dropoff_borocode        Nullable(UInt8),
    dropoff_boroname        Nullable(String),
    dropoff_ct2010          Nullable(String),
    dropoff_boroct2010      Nullable(String),
    dropoff_cdeligibil      Nullable(String),
    dropoff_ntacode         Nullable(String),
    dropoff_ntaname         Nullable(String),
    dropoff_puma            Nullable(String)
) ENGINE = MergeTree
PARTITION BY toYYYYMM(pickup_date)
ORDER BY (pickup_datetime)
;

Making use of the hybrid storage in DoubleCloud is easy and just requires you to change one setting:

ALTER TABLE hybrid.trips_hybrid MODIFY SETTING storage_policy = 'hybrid_storage';

Let’s assume we already imported all the data into the trips_hybrid table, now what data is on S3 and what data resides on local storage?
Currently all data will be on local storage, because right now ClickHouse doesn’t know which data could be considered as ‘old’.
That’s the last thing we have to configure ourselves.
As the data contains records for the years 2009 to 2015, we decided that the cut off should be on December 31st 2014.
In real case scenarios, you most likely won’t have a fixed date but something like older than 90 days which is of course possible as well.
So all we have to do now, is inform ClickHouse that data before 2015 should be move to object storage:

ALTER TABLE hybrid.trips_hybrid
MODIFY TTL pickup_date + toIntervalDay(dateDiff('day',
       toDate('2014-12-31'), now())) 
   TO DISK 'object_storage'

And that’s everything.
Only two settings you have to adjust and ClickHouse and DoubleCloud take care of everything under the hood.
You don’t have to change a single query in your application, as all that is done behind the curtain.

Benchmarks

Ok, so now as we have parts of our data in S3, we can redo Marks benchmark queries on our hybrid storage set:

Query 1:

SELECT cab_type, count(*)
FROM ebs.trips_mergetree
GROUP BY cab_type;

Query 2:

SELECT passenger_count,
       avg(total_amount)
FROM ebs.trips_mergetree
GROUP BY passenger_count;

Query 3:

SELECT passenger_count,
       toYear(pickup_date) AS year,
       count(*)
FROM ebs.trips_mergetree
GROUP BY passenger_count,
         year;

Query 4:

SELECT passenger_count,
       toYear(pickup_date) AS year,
       round(trip_distance) AS distance,
       count(*)
FROM ebs.trips_mergetree
GROUP BY passenger_count,
         year,
         distance
ORDER BY year,
         count(*) DESC;

With data on S3, cold queries will take significantly longer, this is the tradeoff for using object storage instead of local storage.
But if you run the same query twice, ClickHouse will use internal caching to speed up the subsequent queries.
So what are the final results?

Query

Local

Hybrid Cold

Hybrid Cached [1],[2]

Q1

0.347

4.581

1.077

Q2

1.1

14.768

3.254

Q3

1.389

4.925

2.392

Q4

2.935

14.633

4.472

As you can see when the cache can be used, queries tend to be 1.5 — 3 slower.
However when caches are cold, you can even see a query time increase by a factor of 10!
Now is this a good result or not?
Of course it depends on your workload. If you query old data all the time, then a factor of 3 or more will be critical.
But that’s not the use case hybrid storage is meant for.
In our benchmark, if you just limit your queries to the current year (2015 in this case) you don’t have ANY slow down in query speed. You might even get a small speed up, as queries on old data would not mess up with your filesystem caches.

Final thoughts

Regarding the fact that S3 object storage costs are around a factor of five times cheaper than EBS storage, using a hybrid storage solution is an easy way to save you money.
As you can set up hybrid storage on a table level and configure different TTLs for different tables, you can easily adjust the setting to your use cases.

And you don’t have to change your application at all, you’ll just optimize your monthly bill by changing settings in your clusters.
But of course hybrid storage is no silver bullet.
If you have to access the majority of your data all the time, then speed reduction by a factor of 3 might not be worth it.
However, as most access patterns on real data tend to use active data 99% of the time and the rest of the data only 1% of the time, it’s worth it for those 1% to wait a little longer if you can spare a big amount of money.