Scaling ClickHouse Using Amazon S3 as Storage


Handling growing data can be challenging, especially when disk space becomes a limitation. ClickHouse offers robust horizontal scaling capabilities, but scaling processing power isn't always the primary concern. In some cases, the need arises for ample storage while maintaining CPU and RAM efficiency. Instead of investing in additional disks, one can explore the option of using Amazon S3 to store part or all of the data.

ClickHouse can read data directly from CSV, TSV, Parquet, and other file formats stored on Amazon S3, and the S3 table engine supports both reading and writing S3 files. This is useful for scenarios where the data should remain on S3; the ClickHouse documentation on the S3 table engine and table function covers this in more detail.
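As a quick sketch of reading a file directly, the s3 table function can query a file in place without creating a table (the bucket URL, credentials, file name, and column list below are placeholders, not values from this setup):

```sql
-- Query a CSV file on S3 ad hoc; no table needs to exist
-- (URL, credentials, and schema are hypothetical placeholders)
SELECT *
FROM s3(
    'https://clickhousetests.s3.eu-central-1.amazonaws.com/data/events.csv',
    'your_access_key',
    'your_secret_key',
    'CSV',
    'id UInt32, msg String, ts DateTime'
)
LIMIT 10;
```

This is convenient for one-off exploration, but every query re-reads the file from S3, which is why MergeTree-backed S3 storage is the better fit for repeated workloads.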

While this may suffice in simple cases, leveraging the full power of MergeTree tables requires the use of S3 disks.

ClickHouse seamlessly supports the S3 protocol at all levels, including storage. To store a table on an S3 bucket, the process begins with declaring a disk and policy in the ClickHouse configuration file. It is advisable to create a new s3.xml file under /etc/clickhouse-server/config.d instead of modifying the default configuration.

Assuming the existence of a bucket named clickhousetests in the eu-central-1 region, accessible with a given key and secret, a new disk and policy can be declared:

<clickhouse>
    <storage_configuration>
        <disks>
            <s3>
                <type>s3</type>
                <endpoint>https://clickhousetests.s3.eu-central-1.amazonaws.com/db/</endpoint>
                <access_key_id>your_access_key</access_key_id>
                <secret_access_key>your_secret_key</secret_access_key>
                <data_cache_enabled>1</data_cache_enabled>
            </s3>
        </disks>
        <policies>
            <s3_main>
                <volumes>
                    <main>
                        <disk>s3</disk>
                    </main>
                </volumes>
            </s3_main>
        </policies>
    </storage_configuration>
</clickhouse>

Ensure caching is enabled for potential query performance improvements.

Restart ClickHouse to apply the changes:

sudo clickhouse restart
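After the restart, the system tables can confirm that the new disk and policy were picked up (the names shown will match whatever was declared in s3.xml):

```sql
-- The s3 disk should appear here with type 's3'
SELECT name, path, type FROM system.disks;

-- The s3_main policy should list the s3 disk
SELECT policy_name, volume_name, disks FROM system.storage_policies;
```

If either entry is missing, check the server log for configuration errors before creating tables against the policy.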

With the S3 storage configured, a MergeTree table can be created to utilize S3:

CREATE TABLE test (
    `id` UInt32,
    `msg` String,
    `ts` DateTime
)
ENGINE = MergeTree ORDER BY id
SETTINGS storage_policy = 's3_main';

Using the s3_main policy indicates to ClickHouse that the data should be stored on S3. Now, the table can be interacted with like any standard ClickHouse table.

Let's insert 10 million rows for testing purposes:

INSERT INTO test
SELECT number, randomPrintableASCII(10), now() - (rand() % 86400)
FROM numbers(10000000);

The operation takes approximately 16 seconds, and the S3 browser shows the db folder, automatically created by ClickHouse, occupying around 300 MB of space.

When the cache is enabled, reading data can be quick if ClickHouse can reuse cached data. Otherwise, queries might take longer to process.
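A simple way to observe the cache effect is to run the same aggregation twice; the first run pulls data from S3, while the second should be noticeably faster once the data is cached locally (exact timings depend on instance size and network):

```sql
-- Cold read: data fetched from S3
SELECT count(), uniq(msg) FROM test WHERE ts > now() - INTERVAL 1 DAY;

-- Warm read: served from the local cache
SELECT count(), uniq(msg) FROM test WHERE ts > now() - INTERVAL 1 DAY;
```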

In a production environment, maintaining ClickHouse performance with S3 storage may be a concern. A common strategy is the hot-cold storage approach. This involves storing the recent part (hot) of a table on high-speed devices (local SSD or NVMe) and the historical part (cold) on slower devices (HDD or S3).

The beauty is that ClickHouse manages this tiered storage automatically. Specify storage tiers in the policies block to implement this strategy:

<hot_cold>
    <volumes>
        <hot>
            <disk>default</disk>
        </hot>
        <cold>
            <disk>s3</disk>
        </cold>
    </volumes>
    <move_factor>0.1</move_factor>
</hot_cold>

This configuration directs ClickHouse to write new table data to the local disk first (the hot volume). When free space on the local disk falls below 10% (as specified by move_factor), the oldest data parts are automatically moved to S3 (the cold volume) to free up space.

Tables can be created with this policy:

CREATE TABLE hot_cold_table (
    `id` UInt32,
    ...
)
ENGINE = MergeTree ORDER BY id
SETTINGS storage_policy = 'hot_cold';

ClickHouse will seamlessly manage storage operations.
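To see which tier each data part currently lives on, query system.parts; a part can also be pushed to the cold tier ahead of the move_factor trigger (the table and disk names match the examples above, but the part name 'all_1_1_0' is a placeholder for whatever the first query returns):

```sql
-- Where is each active part of hot_cold_table stored?
SELECT name, disk_name, formatReadableSize(bytes_on_disk) AS size
FROM system.parts
WHERE table = 'hot_cold_table' AND active;

-- Optionally move a specific part to S3 manually
ALTER TABLE hot_cold_table MOVE PART 'all_1_1_0' TO DISK 's3';
```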

ClickHouse's native support for S3 as a storage device enables the efficient handling of large MergeTree tables stored on Amazon S3. For real-world scenarios, a hot-cold storage strategy proves effective, where recent data is stored locally for performance, while historical data is moved to S3 for scalability.

ClickHouse storage policies can also spread a table across multiple local disks, scaling capacity with the local devices available.
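As a sketch, a policy spanning two hypothetical local disks (disk1 and disk2 would first need to be declared in the disks block, like the s3 disk earlier) could look like:

```xml
<jbod_policy>
    <volumes>
        <jbod>
            <disk>disk1</disk>
            <disk>disk2</disk>
        </jbod>
    </volumes>
</jbod_policy>
```

With several disks in one volume, ClickHouse distributes newly written parts across them, so total capacity grows with each disk added.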
