Be fast & reduce cost when DoubleCloud handles your sensor data
Written By: Stefan Kaeser, DoubleCloud Solution Architect
Building and using sensors is all about collecting data.
But… storing and handling data is expensive, and let's be honest: you want to invest your time and money in getting value from the data, not spend it on data management.
That's where DoubleCloud can help sensor providers be as cost-efficient as possible.
The Multi-Trillion Data Generator: Power Sensors
With photovoltaic sensors in most private households, energy prices going through the roof, and people thinking more about going green and saving money, power sensors are popping up in more and more places.
You plug them into your sockets, into your PC or fridge, or into your newly bought electric vehicle to check the energy it's really consuming.
All those sensors collect data.
Even if there are just 1,000 sensors producing data every few seconds, you can come up with billions of data points very quickly.
Setting
Let’s start with the setting and fill in some concrete numbers.
You as a company sell sensors which measure the amount of current running through them and send those readings to your database for later analysis.
The data is collected every five seconds, there are 1,000 sensors and you have to keep track of the data for at least five years.
That means you'll end up collecting more than 31 billion data points (1,000 sensors × 17,280 readings per day × 365 days × 5 years ≈ 31.5 billion)!
It doesn't end there, though! There are other requirements besides just collecting the data.
Users want to see the current state of their sensors (for example they are starting video editing and want to see power consumption while rendering).
Others want to see how much power was produced today while watching the sun shining on their photovoltaic panels.
Another person wants to check their power consumption on a year-over-year basis because they've invested money in a new, energy-efficient fridge.
All those requests differ in the way the data needs to be prepared, and in the past they would have required different systems to handle each of them.
But with managed ClickHouse® by DoubleCloud you can fulfill all those requests within the same tool, and save money by doing so.
Data Ingestion
In our scenario this is the easiest part.
ClickHouse can handle millions of rows inserted per second, so 1,000 rows every five seconds (a mere 200 rows per second) doesn't bother it at all.
We just need a simple table to insert the data into, and then we throw it in.
To keep the examples simple we'll just store three values per data point:
- sensor_id: a unique identifier for each sensor. We assume an alphanumeric string of variable length here.
- measure_time: the timestamp at which the sensor collected the data, rounded to seconds.
- wh: the amount of watt-hours measured by the sensor since the last data point. We allow negative numbers here so we can differentiate between produced and consumed energy.
The final table would then look like this:
CREATE TABLE sensor_raw
(
`sensor_id` String,
`measure_time` DateTime,
`wh` Int32
)
ENGINE = MergeTree
ORDER BY sensor_id;
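Ingesting data is then just a plain INSERT. Here's a minimal sketch with made-up sensor IDs, timestamps, and values (in practice you would batch many readings into a single INSERT):
INSERT INTO sensor_raw (sensor_id, measure_time, wh) VALUES
    ('sensor_a1', '2023-06-01 12:00:05', 12),   -- consumption: positive wh
    ('sensor_a1', '2023-06-01 12:00:10', 11),
    ('sensor_b7', '2023-06-01 12:00:05', -8);   -- production (e.g. photovoltaic): negative wh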
Automatic Aggregation ‑ Materialized Views
While inserting millions of rows per second is not a problem for ClickHouse, it doesn't make sense to scan billions of rows just to answer simple questions like how much power was produced last Monday.
Of course it's possible, and ClickHouse can handle queries over billions of rows, but it doesn't make sense to fetch all those rows from disk and read them every time you need an answer on a daily basis.
As past data can't change, this is where aggregation comes in. Of course you could build your own aggregation pipeline, like you would have done in MySQL for example, but ClickHouse has built-in features to help you here.
You just have to define your aggregation tables and a so-called materialized view that does the aggregation for you whenever you ingest new data into the source table.
And those materialized views can even build on one another, so you can have complete aggregation pipelines within ClickHouse with no effort in your application.
So let's say we want to aggregate the data into five-minute buckets, as this should be detailed enough for complex analyses of power consumption, but we also want a daily aggregation, as most requests will be per day. We define two aggregation tables for this:
CREATE TABLE sensor_5min
(
`sensor_id` String,
`measure_time` DateTime,
`wh` Int32
)
ENGINE = SummingMergeTree
PARTITION BY (toYYYYMM(measure_time))
ORDER BY (sensor_id, measure_time);
CREATE TABLE sensor_daily
(
`sensor_id` String,
`measure_date` Date,
`wh` Int32
)
ENGINE = SummingMergeTree
PARTITION BY (toYYYYMM(measure_date))
ORDER BY (sensor_id, measure_date);
There are some things worth mentioning here:
- ENGINE = SummingMergeTree: This is a specialized engine exactly for our use case of aggregating data. In the background it combines rows with the same primary key (the ORDER BY) and sums up the values of all columns not in the primary key.
- PARTITION BY toYYYYMM(): As we want to keep data for a long time, but old data will not change much, we partition by month here. This leads to warm and cold data files on the file system and optimizes the use of file system caches (among other things).
So now we must tell ClickHouse to fill the tables when new data is inserted:
CREATE MATERIALIZED VIEW sensor_rawTo5min
TO sensor_5min
AS SELECT
sensor_id,
toStartOfFiveMinutes(measure_time) AS measure_time,
sum(wh) AS wh
FROM sensor_raw
GROUP BY sensor_id, measure_time
ORDER BY sensor_id, measure_time;
This is the first trigger. It aggregates all new data inserted into sensor_raw, groups it into chunks of five minutes, and stores the aggregation result in sensor_5min. As sensor_5min is a SummingMergeTree, it will automatically add the values for wh if there are already aggregate values in the table for that specific sensor and time. Note that there are a lot of functions for handling Date and DateTime values, so however you want to group your data, you will find a suitable function with ease.
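To give an idea of how this pays off, here is a sketch of an intraday query against the five-minute table; the sensor ID is made up, and we still use sum() with GROUP BY because SummingMergeTree only collapses rows with the same key during background merges:
SELECT
    measure_time,
    sum(wh) AS wh
FROM sensor_5min
WHERE sensor_id = 'sensor_a1'              -- example sensor
  AND measure_time >= toStartOfDay(now())  -- today's data only
GROUP BY measure_time
ORDER BY measure_time;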
Now let’s just create another aggregation. Each time data is ingested into the 5min table, it should also update the daily aggregation:
CREATE MATERIALIZED VIEW sensor_5minTodaily
TO sensor_daily
AS SELECT
sensor_id,
toDate(measure_time) AS measure_date,
sum(wh) AS wh
FROM sensor_5min
GROUP BY sensor_id, measure_date
ORDER BY sensor_id, measure_date;
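With the daily table in place, the year-over-year fridge comparison from the introduction becomes a tiny query. A sketch, again with an example sensor ID:
SELECT
    toYear(measure_date) AS year,
    sum(wh) / 1000 AS kwh    -- convert watt-hours to kilowatt-hours
FROM sensor_daily
WHERE sensor_id = 'sensor_a1'
GROUP BY year
ORDER BY year;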
Current State — Materialized Views
Nice.
Now we have aggregation tables which will speed up intraday queries as well as queries over long periods of time… but what about the current state of a sensor?
If a user wants to see how much power is consumed in his house while he's at a garden party, he's not interested in long-term analytics or intraday graphs; he just wants to see the state right now.
Of course you could just get the last data points out of your sensor_raw table, but this would still read through a lot of data, because you don't know the exact time.
This is where materialized views can help you as well. We just create another table that keeps the current state of all sensors, and whenever a new data point is ingested into the raw table, it should update the state in that table:
CREATE TABLE sensor_current
(
`sensor_id` String,
`measure_time_state` DateTime,
`wh_state` Int32
)
ENGINE = ReplacingMergeTree
PARTITION BY tuple()
ORDER BY (sensor_id);
CREATE MATERIALIZED VIEW sensor_raw2current
TO sensor_current
AS SELECT
sensor_id,
max(measure_time) AS measure_time_state,
argMax(wh, measure_time) AS wh_state
FROM sensor_raw
GROUP BY sensor_id;
Unfortunately there is no fast UPDATE in ClickHouse, but you can use ReplacingMergeTree in this case.
Whenever a new entry with the same primary key arrives, the old entry will be removed and replaced with the new one during background merges. When querying the data, you have to use SELECT … FROM sensor_current FINAL to make sure the replacement has taken place.
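A current-state lookup then only touches one tiny table. A sketch with an example sensor ID:
SELECT
    measure_time_state,
    wh_state
FROM sensor_current FINAL    -- FINAL forces deduplication of not-yet-merged rows
WHERE sensor_id = 'sensor_a1';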
Saving Money ‑ Hybrid Storage & TTL
Now we’ve done everything to answer all types of requests in a fast way.
But there’s still one thing we have to tackle… Storing data costs money.
Of course this is especially true for old data: it's rarely needed, but it still takes up the same amount of space as fresh data while producing much less value.
Fortunately there are easy ways to solve this problem with managed ClickHouse on DoubleCloud…
Removing Data With TTL
First of all we have to ask ourselves if we really need all the data we have.
The cheapest way to store data is to get rid of it. As we automatically aggregate our data into five-minute and daily aggregations, the question is whether we really need a data granularity of a few seconds.
When was the last time you checked how much power your fridge consumed at 7:00:05 am? So we can assume that having a data point every five minutes is more than enough for long-term analytics.
Still, in case of errors it might be interesting to have near-second data for some days.
So let's get rid of the raw data after seven days. This can easily be done in ClickHouse by adding a TTL:
ALTER TABLE sensor_raw MODIFY TTL measure_time + toIntervalDay(7) DELETE;
This is enough to get rid of old data automatically after seven days.
Note that aggregated data is not affected by the deletion of data in the source table: aggregation only takes place at insert time.
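By the way, if you know the retention requirement up front, the TTL can also be declared directly in the table definition; a sketch of how sensor_raw could have been created with the TTL built in:
CREATE TABLE sensor_raw
(
    `sensor_id` String,
    `measure_time` DateTime,
    `wh` Int32
)
ENGINE = MergeTree
ORDER BY sensor_id
TTL measure_time + toIntervalDay(7) DELETE;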
Hybrid Storage Or Where To Keep Old Data
It's nice that we could reduce the amount of data by nearly a factor of 60 through aggregation and TTL.
But we still have roughly half a billion rows to store, assuming that five-minute granularity is needed for special analytical requests.
Most of the time, though, we'll only need a small fraction of that data.
Most people will look at graphs for today or yesterday.
Maybe they'll occasionally look a few days into the past (how much power does a Christmas tree consume?), but five-minute granularity for data older than a month won't be needed that often.
Still, we have to keep that data for those cases, and it will cost the same amount of money as the data from the last two days.
Or does it?
Fortunately ClickHouse has a solution for that as well: tiered storage.
You can define within a table whether data is regarded as warm or cold. Warm data is stored as it is, while cold data can be moved to S3, where it sits around a lot more cheaply and can still be accessed when needed.
When using ClickHouse with DoubleCloud this is easy to set up. First we change the storage policy for our 5min table to enable different storage tiers:
ALTER TABLE sensor_5min MODIFY SETTING storage_policy = 'hybrid_storage';
Now we just have to decide what kind of data we consider cold.
Let's say we want to keep 40 days of data as 'warm' data; then it's just another TTL we need to add, this time telling ClickHouse not to delete the data but to move it to S3:
ALTER TABLE sensor_5min MODIFY TTL measure_time + toIntervalDay(40) TO DISK 'object_storage';
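If you also want to enforce the five-year retention limit from our setting, both rules can be combined into one TTL clause. A sketch (the five-year DELETE is an assumption based on the retention requirement mentioned above, not part of the setup shown so far):
ALTER TABLE sensor_5min MODIFY TTL
    measure_time + toIntervalDay(40) TO DISK 'object_storage',  -- move warm data to S3 after 40 days
    measure_time + toIntervalYear(5) DELETE;                    -- drop data older than five years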
Conclusion
As you can see, there are a lot of easy ways to use ClickHouse on DoubleCloud to manage billions of data points. And that's just the tip of the iceberg.
The possibilities are nearly endless, so just create an account and use our free trial, and you won't want to miss it in the future!