
Practical session with JSON in Clickhouse
Table of Contents
- Working with JSON in Clickhouse
- Storing JSON data
- JSON Validating and checking keys
- Extracting values
- Using JSON key values for indexing
- JSON keys or separate columns?
- Experimental JSON object type
- Summary
Working with JSON in Clickhouse
Dealing with dynamic data structures can be challenging, especially when the set of features varies over time or under different circumstances. JSON provides a flexible solution for handling dynamic data, and ClickHouse offers convenient tools for working with it.
Storing JSON data
The simplest (and single as of now) way to store JSON object is to create String
typed column and store JSON text representation inthere:
CREATE TABLE test\_string ( \`t\` DateTime, **\`v\` String** )
ENGINE = MergeTree ORDER BY t
We’re going to save JSON data under v
column:
INSERT INTO test\_string VALUES(now(), **'{"name":"Joe","age":95}'**)
Clickhouse provides several functions to work with JSON data, including validating JSON data itself.
JSON Validating and checking keys
We can simply check if JSON is valid using isValidJSON
function:
SELECT v, isValidJSON(v) FROM test\_string
Which returns 1
(or 0
) if JSON is valid (or not):
We can also check if object contains certain key (or any of given keys) which is useful during data cleansing procedures:
SELECT
**JSONHas(v, 'name') AND JSONHas(v, 'rating')** AS is\_valid,
count(\*)
FROM test\_string GROUP BY **is\_valid**
Here we require our JSON field to have name
and rating
keys defined to be marked as valid:
As we can see, our table contains 2 invalid values.
Extracting values
In most cases we want to operate with JSON attributes (keys) values, which can be done in multiple ways. First of all there is a function to extract typed values based on key names:
SELECT
**JSONExtract**(v, 'name', 'String'),
**JSONExtract**(v, 'rating', 'UInt32')
FROM test\_string LIMIT 5
Here we extract name
key value as string and rating
key value as [unsigned 4 bytes] integer from v
column value (which stores JSON):
We can extract scalar types (like String
, Int/UInt*
or Float*
) or complex structures, like Array
or Tuple
(or combinations of any):
SELECT JSONExtract('{"val": \[1,2,3,4\]}', 'val', 'Array(UInt8)')**\[2\]**
Which will return second integer value from val
array (which is a JSON key):
Another option to extract value (without having to think about types) is to use JSON_VALUE
function:
SELECT **JSON\_VALUE(v, '$.name')** FROM test\_string LIMIT 5
Which, again, will simply list value of name
key of v
JSON column:
Using JSON key values for indexing
As we know, Clickhouse allows using function to build sorting keys based on. That’s why we can use JSON extraction functions in indexes to optimize certain queries:
CREATE TABLE test\_index ( \`t\` Int64, \`v\` String )
ENGINE = MergeTree **ORDER BY JSONExtractUInt(v, 'rating')**
Here, we’ve used extracted rating
key value as a sorting key, so the relevant queries will be quite efficient:
SELECT count(\*) FROM test\_index
WHERE **JSONExtractUInt(v, 'rating') = 140970729**
Clickhouse will answer in the very efficient way in this case since it uses index for that:
On the contrary, if we filter on the JSON key which is not under index, full table scan follows. For example, the same query on the same table/data, but with a different sort index results in the following:
250x rows scanned, 250x RAM used and 35x time spent. Not good.
JSON keys or separate columns?
If you know your JSON field is going to have exactly certain keys of exactly certain types, it is better to move them out of JSON value into separate columns before writing to Clickhouse. Doing so will save space (because of saving data using relevant types instead of strings) and improve performance (since no JSON extraction overhead will happen).
Experimental JSON object type
Clickhouse has experimental JSON type which is too early to consider for production, but let’s take a look at that. Since the feature is experimental we have to:
SET allow\_experimental\_object\_type = 1
Now we can use JSON
type while creating tables:
CREATE TABLE test\_json ( \`t\` DateTime, **\`v\` JSON** )
ENGINE = MergeTree ORDER BY t
First cool thing about this type, is that we can directly use object notation in queries:
SELECT **v.name, v.rating** FROM test\_json LIMIT 5
And Clickhouse knows what to do:
Second cool thing about this type is that it takes less space (~30%
saved in our case) than String
field usage:
But one downside here is the speed of inserting data. If we compare insertion speed of the same data into String
and JSON
columns we can see almost 10 fold difference:
Anyway, this is still experimental, so hopefully Clickhouse team will document (in details) and release this soon.
Summary
Working with JSON in Clickhouse is as simple as using String
field for data and then using set of JSON*
functions to check and extract key values from JSON fields:
SELECT JSONExtract(json_col, 'key', 'String') FROM table
We can use extraction functions as indexes (or part of indexes) to improve query performance, but consider moving those to separate columns instead and leave only dynamic set of keys for storing in JSON columns.
Comments (0)
What are your thoughts on "Practical session with JSON in Clickhouse"?
You need to create an account to comment on this post.