Scaling ClickHouse Using Amazon S3 as Storage


Handling growing data can be challenging, especially when disk space becomes a limitation. ClickHouse offers robust horizontal scaling capabilities, but scaling processing power isn't always the primary concern. In some cases, the need arises for ample storage while maintaining CPU and RAM efficiency. Instead of investing in additional disks, one can explore the option of using Amazon S3 to store part or all of the data.

ClickHouse can read data directly from CSV, TSV, Parquet, and other file formats stored on Amazon S3, and the S3 table engine supports both reading and writing S3 files. This is useful for scenarios where the data should remain on S3; the ClickHouse documentation on the S3 table engine and table function covers this in more detail.
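As a quick sketch of reading a file directly, the s3 table function can query a file in place without creating a table (the bucket URL, credentials, file name, and column list below are placeholders, not values from this setup):

```sql
-- Query a CSV file on S3 ad hoc; no table needs to exist
-- (URL, credentials, and schema are hypothetical placeholders)
SELECT *
FROM s3(
    'https://clickhousetests.s3.eu-central-1.amazonaws.com/data/events.csv',
    'your_access_key',
    'your_secret_key',
    'CSV',
    'id UInt32, msg String, ts DateTime'
)
LIMIT 10;
```

This is convenient for one-off exploration, but every query re-reads the file from S3, which is why MergeTree-backed S3 storage is the better fit for repeated workloads.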

While this may suffice in simple cases, leveraging the full power of MergeTree tables requires the use of S3 disks.

ClickHouse seamlessly supports the S3 protocol at all levels, including storage. To store a table on an S3 bucket, the process begins with declaring a disk and policy in the ClickHouse configuration file. It is advisable to create a new s3.xml file under /etc/clickhouse-server/config.d instead of modifying the default configuration.

Assuming the existence of a bucket named clickhousetests in the eu-central-1 region, accessible with a given key and secret, a new disk and policy can be declared:

<clickhouse>
    <storage_configuration>
        <disks>
            <s3>
                <type>s3</type>
                <endpoint>https://clickhousetests.s3.eu-central-1.amazonaws.com/db/</endpoint>
                <access_key_id>your_access_key</access_key_id>
                <secret_access_key>your_secret_key</secret_access_key>
                <data_cache_enabled>1</data_cache_enabled>
            </s3>
        </disks>
        <policies>
            <s3_main>
                <volumes>
                    <main>
                        <disk>s3</disk>
                    </main>
                </volumes>
            </s3_main>
        </policies>
    </storage_configuration>
</clickhouse>

Ensure caching is enabled for potential query performance improvements.

Restart ClickHouse to apply the changes:

sudo clickhouse restart
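After the restart, the system tables can confirm that the new disk and policy were picked up (the names shown will match whatever was declared in s3.xml):

```sql
-- The s3 disk should appear here with type 's3'
SELECT name, path, type FROM system.disks;

-- The s3_main policy should list the s3 disk
SELECT policy_name, volume_name, disks FROM system.storage_policies;
```

If either entry is missing, check the server log for configuration errors before creating tables against the policy.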

With the S3 storage configured, a MergeTree table can be created to utilize S3:

CREATE TABLE test (
    `id` UInt32,
    `msg` String,
    `ts` DateTime
)
ENGINE = MergeTree ORDER BY id
SETTINGS storage_policy = 's3_main';

Using the s3_main policy indicates to ClickHouse that the data should be stored on S3. Now, the table can be interacted with like any standard ClickHouse table.

Let's insert 10 million rows for testing purposes:

INSERT INTO test
SELECT number, randomPrintableASCII(10), now() - (rand() % 86400)
FROM numbers(10000000);

The operation takes approximately 16 seconds, and the S3 browser shows the db folder, automatically created by ClickHouse, occupying around 300 MB of space.

When the cache is enabled, reading data can be quick if ClickHouse can reuse cached data. Otherwise, queries might take longer to process.
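A simple way to observe the cache effect is to run the same aggregation twice; the first run pulls data from S3, while the second should be noticeably faster once the data is cached locally (exact timings depend on instance size and network):

```sql
-- Cold read: data fetched from S3
SELECT count(), uniq(msg) FROM test WHERE ts > now() - INTERVAL 1 DAY;

-- Warm read: served from the local cache
SELECT count(), uniq(msg) FROM test WHERE ts > now() - INTERVAL 1 DAY;
```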

In a production environment, maintaining ClickHouse performance with S3 storage may be a concern. A common strategy is the hot-cold storage approach. This involves storing the recent part (hot) of a table on high-speed devices (local SSD or NVMe) and the historical part (cold) on slower devices (HDD or S3).

The beauty is that ClickHouse manages this tiered storage automatically. Specify storage tiers in the policies block to implement this strategy:

<hot_cold>
    <volumes>
        <hot>
            <disk>default</disk>
        </hot>
        <cold>
            <disk>s3</disk>
        </cold>
    </volumes>
    <move_factor>0.1</move_factor>
</hot_cold>

This configuration directs ClickHouse to write new table data to the local disk first (the hot volume). When free space on the local disk falls below 10% (as specified by move_factor), the oldest data parts are automatically moved to S3 (the cold volume) to free up space.

Tables can be created with this policy:

CREATE TABLE hot_cold_table (
    `id` UInt32,
    ...
)
ENGINE = MergeTree ORDER BY id
SETTINGS storage_policy = 'hot_cold';

ClickHouse will seamlessly manage storage operations.
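To see which tier each data part currently lives on, query system.parts; a part can also be pushed to the cold tier ahead of the move_factor trigger (the table and disk names match the examples above, but the part name 'all_1_1_0' is a placeholder for whatever the first query returns):

```sql
-- Where is each active part of hot_cold_table stored?
SELECT name, disk_name, formatReadableSize(bytes_on_disk) AS size
FROM system.parts
WHERE table = 'hot_cold_table' AND active;

-- Optionally move a specific part to S3 manually
ALTER TABLE hot_cold_table MOVE PART 'all_1_1_0' TO DISK 's3';
```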

ClickHouse's native support for S3 as a storage device enables the efficient handling of large MergeTree tables stored on Amazon S3. For real-world scenarios, a hot-cold storage strategy proves effective, where recent data is stored locally for performance, while historical data is moved to S3 for scalability.

ClickHouse storage policies can also spread a table across multiple local disks, scaling capacity with the local devices available.
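As a sketch, a policy spanning two hypothetical local disks (disk1 and disk2 would first need to be declared in the disks block, like the s3 disk earlier) could look like:

```xml
<jbod_policy>
    <volumes>
        <jbod>
            <disk>disk1</disk>
            <disk>disk2</disk>
        </jbod>
    </volumes>
</jbod_policy>
```

With several disks in one volume, ClickHouse distributes newly written parts across them, so total capacity grows with each disk added.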
