Advantages of Integrating ClickHouse® with Hybrid S3 Storage

December 26, 2022
15 mins to read

ClickHouse® integrates with object storage systems to extend its capabilities. After ClickHouse processes large volumes of table data, you need a fast, accessible, and cost-effective way to store that data for future use. Integrating ClickHouse with a standard storage service or, better still, with hybrid S3 object storage is an effective way to do that.

What is S3 Storage?

S3 refers to Amazon Simple Storage Service, a scalable and secure object storage service. Amazon S3 provides high durability and storage performance for a variety of use cases, including big data analytics, mobile applications, data lakes, IoT devices, enterprise applications, backup and restore, websites, and more. Authorized users can access data stored on S3 over the internet from anywhere.

Key Features of S3 Object Storage:

  • Buckets: On S3, data is stored in buckets that can hold an unlimited amount of data.

  • Expandable scalability: S3 places no limit on the total amount of data you store, although an individual object can be at most 5 TB in size.

  • Data structure: The flexible structure of S3 object storage makes it easy to identify and retrieve objects using unique keys or metadata.

  • Accessibility: Data stored on the Amazon Simple Storage Service can be reached over the internet from anywhere, and owners can share objects with specific users or make them publicly available to download and use.

  • Permissions: S3 users can ensure only authorized access to their data by assigning permissions at the bucket or object level.

  • API: The S3 API, provided as REST and SOAP interfaces, is what lets users integrate other tools such as ClickHouse for seamless use.
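As an illustration of the bucket-level permissions mentioned in the list above, here is a minimal sketch that builds an S3 bucket policy document granting a single IAM principal read access to the objects in one bucket. The bucket name, account ID, and user name are hypothetical placeholders, and the helper function is not part of any AWS library; it only assembles the JSON.

```python
import json


def read_only_bucket_policy(bucket: str, principal_arn: str) -> dict:
    """Build a bucket policy that lets one IAM principal read every object in a bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowReadForOnePrincipal",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": ["s3:GetObject"],
                # The trailing /* scopes the statement to objects inside the bucket.
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            }
        ],
    }


# Hypothetical bucket and principal, used only for illustration.
policy = read_only_bucket_policy(
    "example-clickhouse-data",
    "arn:aws:iam::123456789012:user/analyst",
)
print(json.dumps(policy, indent=2))
```

The printed JSON could then be attached to the bucket through the console or the API; anyone not named in a policy like this stays without access unless another rule grants it.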

Advantages Of Using S3 For ClickHouse Data Storage

As organizations produce ever larger quantities of data, a robust cloud storage strategy becomes imperative. Whatever the velocity, volume, or variety of the data, Amazon Simple Storage Service (S3) provides scalable, secure, and low-latency storage. The advantages of integrating S3 with ClickHouse include the following:

  • Data Security: As previously mentioned, Amazon S3 buckets are secure and accessible only to their creators and to those the creators grant permission. AWS Identity and Access Management (IAM) controls access permissions for each object or bucket in S3 storage.

  • Availability: As a storage system built by Amazon, S3 offers users the same fast, accessible, and reliable infrastructure that Amazon uses to run its own global network. S3 Standard-IA and S3 Standard are designed for 99.9% and 99.99% availability, respectively.

  • Cost-Effective: With S3 storage you pay only for what you use, and the rates are low, in the range of $0.0125 to $0.022 per GB. Amazon S3 Glacier, an archive tier for infrequently accessed data that objects can be migrated to automatically, is even more cost-effective.

  • Ease of Data Migration: Amazon S3 offers three types of data migration options: rsync, the S3 command line interface, and the Glacier command line interface. These options are inexpensive and allow fast, automated transfer of large quantities of data to and from any device.

  • User-Friendly: S3 removes the hassle of storing, securing, managing, and transferring data between systems. Because the service is configurable, users can define their own lifecycle policies and replication rules. Users can also customize request metrics and storage classes with filters that give a clearer view of their stored data.
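To put the pay-for-what-you-use point from the list above in concrete terms, the following sketch multiplies a data volume by the two per-GB rates quoted there. It assumes those rates are charged per GB per month, which the text does not state explicitly, and it ignores request, retrieval, and transfer fees; the 10,000 GB volume is a made-up example.

```python
def monthly_storage_cost(size_gb: float, price_per_gb_month: float) -> float:
    """Return the storage-only cost for one month, ignoring request and transfer fees."""
    return size_gb * price_per_gb_month


# Rates taken from the text above; treating them as monthly prices is an assumption.
RATES_USD_PER_GB = {"higher quoted rate": 0.022, "lower quoted rate": 0.0125}

size_gb = 10_000  # hypothetical data volume
for label, rate in RATES_USD_PER_GB.items():
    cost = monthly_storage_cost(size_gb, rate)
    print(f"{label}: ${cost:,.2f} per month for {size_gb:,} GB")
```

Swapping in current prices for your region and storage class gives a quick first estimate before any retrieval or request charges are added.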

How S3 Works

Data stored on S3 is managed through the AWS Management Console or the S3 API. S3 stores data as objects, which are placed on disk drives in physical data centers. By using distributed file systems and other distributed infrastructure, Amazon S3 data centers are able to scale storage elastically.

Data migrated to S3 is automatically stored and distributed across multiple locations and disk drives in a way that ensures seamless access and supports version control. The terms below are commonly associated with S3 storage.

S3 Bucket

An S3 bucket is a logical container in which data is stored. There is no fixed limit on the amount of data or the number of objects you can store in a bucket, but each individual object can be at most 5 TB in size.

S3 Key

Every object uploaded into a bucket has a unique key for easy access. The key is a string that mimics a directory path, and an object is fully identified by the combination of bucket name, key, and version ID. The bucket name and key together form a unique URL that can be used to access the object from anywhere on the web.

Example:

{BUCKET-NAME}.s3.amazonaws.com/{OBJECT-KEY}
s3.amazonaws.com/{BUCKET-NAME}/{OBJECT-KEY}
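The two URL layouts in the example can be produced with a couple of small helpers. This is a minimal sketch: the function names, bucket, and key are hypothetical, the https scheme is assumed, and the region-less amazonaws.com endpoint is used as in the example (many buckets are addressed through a region-specific endpoint instead).

```python
from urllib.parse import quote


def virtual_hosted_url(bucket: str, key: str) -> str:
    """Bucket name in the host name: {BUCKET-NAME}.s3.amazonaws.com/{OBJECT-KEY}."""
    return f"https://{bucket}.s3.amazonaws.com/{quote(key)}"


def path_style_url(bucket: str, key: str) -> str:
    """Bucket name in the path: s3.amazonaws.com/{BUCKET-NAME}/{OBJECT-KEY}."""
    return f"https://s3.amazonaws.com/{bucket}/{quote(key)}"


# Hypothetical bucket and key; quote() percent-encodes the space but keeps the slashes.
print(virtual_hosted_url("example-bucket", "data/2022/hits table.csv"))
print(path_style_url("example-bucket", "data/2022/hits table.csv"))
```

Because the key only mimics a directory path, the slashes are part of the key itself, which is why they are left unescaped here.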

Amazon S3 Storage Classes

S3 Standard
The S3 Standard tier is designed for 99.999999999% object durability, achieved by storing copies of objects across multiple data centers, and for 99.99% availability. Data in this class can be encrypted both in transit, using SSL/TLS, and at rest.

S3 Standard-Infrequent Access
The S3 Standard-IA tier is an excellent choice for rarely accessed data. It is less expensive than the Standard tier, but it charges a data retrieval fee. You still get the same performance and latency as the Standard tier and the same 99.999999999% durability across multiple storage sites, with a design target of 99.9% availability.
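The gap between the 99.99% and 99.9% availability figures is easier to judge as allowed downtime. The sketch below converts each percentage into hours per year, assuming a 365-day year and treating the percentage as a simple yearly allowance, which is a simplification of how availability is actually measured.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming a non-leap year


def max_downtime_hours(availability_percent: float) -> float:
    """Hours per year a service may be unavailable while still meeting the given percentage."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for tier, availability in [("S3 Standard", 99.99), ("S3 Standard-IA", 99.9)]:
    hours = max_downtime_hours(availability)
    print(f"{tier}: {availability}% -> about {hours:.2f} hours ({hours * 60:.0f} minutes) per year")
```

Under these assumptions the Standard tier allows a little under an hour of unavailability per year and Standard-IA a little under nine hours, a tenfold difference for the lower storage price.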

S3 Storage Archive
Glacier and Glacier Deep Archive are storage classes designed for archived data that is not accessed regularly but is kept for future use. They also cost less than the S3 Standard-IA storage class.

S3 Table Function

Table functions are ClickHouse's method of integrating with external storage systems to import and export data easily. ClickHouse table functions allow data transfer from various sources, including MySQL, PostgreSQL, URLs, and, more recently, S3. The s3 table function provides a table-like interface to files in the Amazon S3 ecosystem. Its basic syntax is as follows:

s3(path, [aws_access_key_id, aws_secret_access_key,] format, structure, [compression])

where "path" is the bucket URL including the path to the file, followed by optional AWS credentials, the data format, the table structure, and an optional compression parameter (for example, gzip).
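To show how the arguments fit together, here is a minimal sketch that assembles an s3(...) call as a SQL string in the argument order given above. The helper function, bucket URL, and column list are hypothetical, and the sketch only builds the text of the query; it does not connect to ClickHouse or S3.

```python
from typing import Optional


def sql_literal(value: str) -> str:
    """Quote a Python string as a single-quoted SQL string literal."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def s3_table_function(
    path: str,
    fmt: str,
    structure: str,
    access_key_id: Optional[str] = None,
    secret_access_key: Optional[str] = None,
    compression: Optional[str] = None,
) -> str:
    """Assemble s3(path, [key_id, secret,] format, structure, [compression])."""
    args = [sql_literal(path)]
    # Credentials are optional and only meaningful as a pair.
    if access_key_id is not None and secret_access_key is not None:
        args += [sql_literal(access_key_id), sql_literal(secret_access_key)]
    args += [sql_literal(fmt), sql_literal(structure)]
    if compression is not None:
        args.append(sql_literal(compression))
    return "s3(" + ", ".join(args) + ")"


# Hypothetical public file, so no credentials are passed.
query = "SELECT count() FROM " + s3_table_function(
    "https://example-bucket.s3.amazonaws.com/data/hits.csv.gz",
    "CSV",
    "id UInt64, url String",
    compression="gzip",
)
print(query)
```

The resulting string can be pasted into a ClickHouse client; for a private bucket the access key pair would be supplied as the second and third arguments.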

Amplifying ClickHouse Capacity with Hybrid Storage

The Amazon S3 table function may be great for importing and exporting data, but it cannot handle some advanced workloads. For those, a hybrid approach that integrates S3 more closely with ClickHouse is necessary to achieve both cost savings and storage performance.

Hybrid storage is a solution for faster access, higher performance, and larger cloud storage volume. By combining local disk drives with S3 storage capabilities, hybrid storage makes it easy to merge data and store it on S3. Whether you want to read data faster, merge efficiently, or improve storage performance, hybrid storage is a strong option for your project.

The hybrid storage we built combines the affordability, security, and reliability of S3 with our fast SSD disks to remove the limitations on ClickHouse data operations. This lets users shorten operation times while taking advantage of unlimited storage capacity at a significantly lower cost than they would pay for virtual machines.
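The general idea of pairing fast local disks with S3 can be illustrated with a toy tiering rule. This is only a sketch of the concept, not a description of how the hybrid storage described here is implemented: the Part class, the assign_tiers function, and the sizes are all hypothetical, and the rule simply keeps the newest data on SSD until a capacity budget is used up.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Part:
    name: str
    size_gb: float
    age_days: int


def assign_tiers(parts: List[Part], ssd_capacity_gb: float) -> Dict[str, str]:
    """Greedy rule: place the newest parts on SSD while they fit, send the rest to S3."""
    placement: Dict[str, str] = {}
    used_gb = 0.0
    for part in sorted(parts, key=lambda p: p.age_days):  # newest first
        if used_gb + part.size_gb <= ssd_capacity_gb:
            placement[part.name] = "ssd"
            used_gb += part.size_gb
        else:
            placement[part.name] = "s3"
    return placement


# Hypothetical monthly data parts and a 100 GB SSD budget.
parts = [Part("2022_12", 40, 5), Part("2022_11", 60, 35), Part("2022_10", 80, 65)]
placement = assign_tiers(parts, ssd_capacity_gb=100)
print(placement)
```

In this example the two most recent parts fill the SSD budget and the oldest one goes to S3, which matches the intuition that recent data is queried most often while older data mainly needs cheap capacity.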

Final Thoughts

At DoubleCloud, we understand the need for speed and efficiency in data storage. The Amazon Simple Storage Service has given the industry what it needs in terms of affordability and storage capacity. Our S3‑based hybrid storage takes this innovation further, enabling users to make optimal use of ClickHouse features whether or not free local disk space is available.

* ClickHouse® is a trademark of ClickHouse, Inc. https://clickhouse.com