Feature: Optimize performance for semi-structured JSON #5925

shaeqahmed · 2022-06-12T21:18:14Z

Awesome project! Really nice to see support for semi-structured data here. I'm evaluating using this project for a platform to analyse security data, where logs often follow a loose dynamic schema (e.g. AWS Cloudtrail). Right now, I believe storing this in a JSON column for a schemaless approach will kill any performance benefits (need to load whole column, parse text). Clickhouse recently added native support for Object('JSON') type that abstracts away the details and allows ingesting arbitrary JSON values with full performance, by automatically evolving the schema:

https://clickhouse.com/docs/en/guides/developer/working-with-json/json-semi-structured/#overview

The JSON Object type is advantageous when dealing with complex nested structures, which are subject to change. The type automatically infers the columns from the structure during insertion and merges these into the existing table schema. By storing JSON keys and their values as columns and dynamic subcolumns, ClickHouse can exploit the same optimizations used for structured data and thus provide comparable performance. The user is also provided with an intuitive path syntax for column selection. Furthermore, a table can contain a JSON object column with a flexible schema and more strict conventional columns with predefined types.

Would be a killer feature to have this incorporated into Databend.

Note: In Clickhouse's approach the issue arises of incompatible schemas (int -> string supported, but not int -> Array etc.). I think a better approach to this would be the "dynamic typing" approach of Redshift SUPER and Rockset.

What do you think?

sundy-li · 2022-06-13T01:05:41Z

I believe storing this in a JSON column for a schemaless approach will kill any performance benefits (need to load whole column, parse text).

That's definitely right, thanks for your advices.

Current JSON is the first simple implementation to support this feature, we did know this approach would perform poorly. Besides Redshift SUPER and Rockset, we also discovered sneller which is based on ion.

The main problem in databend is that the schema of table is fixed(due to parquet), thus we can't store different schemas in different data parts if we just infer the schema from input data.

We will keep investigating these to develop a new version of JSON format to optimize the performance. Any new ideas are all welcome.

shaeqahmed · 2022-06-13T12:14:34Z

Understood. Iceberg table format brings nested schema evolution (struct) to parquet data lake by keeping track of column identifiers in metadata files. I guess we first need to add similar support to Databend for Nested and other datatypes, then we can explore the automatic schema evolution feature for JSON objects? Trying to understand if this is something I could take a stab at myself. Thanks

ZhiHanZ · 2022-06-13T13:09:11Z

Pretty nice idea! we have discuessed several implementations for further optimization on Nested datatypes, https://github.com/datafuselabs/databend/pull/4320/files, I think you could at first take a look through those discussed researches and promote some new designs and solutions.

shaeqahmed added the C-feature Category: feature label Jun 12, 2022

b41sh self-assigned this Jun 17, 2022

b41sh mentioned this issue Aug 5, 2022

Feature: Tracking issue for JSON optimization #6994

Closed

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature: Optimize performance for semi-structured JSON #5925

Feature: Optimize performance for semi-structured JSON #5925

shaeqahmed commented Jun 12, 2022

sundy-li commented Jun 13, 2022 •

edited

Loading

shaeqahmed commented Jun 13, 2022

ZhiHanZ commented Jun 13, 2022

Feature: Optimize performance for semi-structured JSON #5925

Feature: Optimize performance for semi-structured JSON #5925

Comments

shaeqahmed commented Jun 12, 2022

sundy-li commented Jun 13, 2022 • edited Loading

shaeqahmed commented Jun 13, 2022

ZhiHanZ commented Jun 13, 2022

sundy-li commented Jun 13, 2022 •

edited

Loading