Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Optimize performance for semi-structured JSON #5925

Open
shaeqahmed opened this issue Jun 12, 2022 · 3 comments
Open

Feature: Optimize performance for semi-structured JSON #5925

shaeqahmed opened this issue Jun 12, 2022 · 3 comments
Assignees
Labels
C-feature Category: feature

Comments

@shaeqahmed
Copy link

Awesome project! Really nice to see support for semi-structured data here. I'm evaluating using this project for a platform to analyse security data, where logs often follow a loose dynamic schema (e.g. AWS Cloudtrail). Right now, I believe storing this in a JSON column for a schemaless approach will kill any performance benefits (need to load whole column, parse text). Clickhouse recently added native support for Object('JSON') type that abstracts away the details and allows ingesting arbitrary JSON values with full performance, by automatically evolving the schema:

https://clickhouse.com/docs/en/guides/developer/working-with-json/json-semi-structured/#overview

The JSON Object type is advantageous when dealing with complex nested structures, which are subject to change. The type automatically infers the columns from the structure during insertion and merges these into the existing table schema. By storing JSON keys and their values as columns and dynamic subcolumns, ClickHouse can exploit the same optimizations used for structured data and thus provide comparable performance. The user is also provided with an intuitive path syntax for column selection. Furthermore, a table can contain a JSON object column with a flexible schema and more strict conventional columns with predefined types.

Would be a killer feature to have this incorporated into Databend.

Note: In Clickhouse's approach the issue arises of incompatible schemas (int -> string supported, but not int -> Array etc.). I think a better approach to this would be the "dynamic typing" approach of Redshift SUPER and Rockset.

What do you think?

@shaeqahmed shaeqahmed added the C-feature Category: feature label Jun 12, 2022
@sundy-li
Copy link
Member

sundy-li commented Jun 13, 2022

I believe storing this in a JSON column for a schemaless approach will kill any performance benefits (need to load whole column, parse text).

That's definitely right, thanks for your advices.

Current JSON is the first simple implementation to support this feature, we did know this approach would perform poorly. Besides Redshift SUPER and Rockset, we also discovered sneller which is based on ion.

The main problem in databend is that the schema of table is fixed(due to parquet), thus we can't store different schemas in different data parts if we just infer the schema from input data.

We will keep investigating these to develop a new version of JSON format to optimize the performance. Any new ideas are all welcome.

@shaeqahmed
Copy link
Author

Understood. Iceberg table format brings nested schema evolution (struct) to parquet data lake by keeping track of column identifiers in metadata files. I guess we first need to add similar support to Databend for Nested and other datatypes, then we can explore the automatic schema evolution feature for JSON objects? Trying to understand if this is something I could take a stab at myself. Thanks

@ZhiHanZ
Copy link
Collaborator

ZhiHanZ commented Jun 13, 2022

Pretty nice idea! we have discuessed several implementations for further optimization on Nested datatypes, https://github.com/datafuselabs/databend/pull/4320/files, I think you could at first take a look through those discussed researches and promote some new designs and solutions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-feature Category: feature
Projects
None yet
Development

No branches or pull requests

4 participants