Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow to specify structure hints in schema inference #40068

Merged
merged 2 commits into from
Aug 18, 2022

Conversation

Avogar
Copy link
Member

@Avogar Avogar commented Aug 10, 2022

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Add new setting schema_inference_hints that allows to specify structure hints in schema inference for specific columns. Closes #39569

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

@robot-ch-test-poll robot-ch-test-poll added the pr-feature Pull request with new product feature label Aug 10, 2022
@kssenii kssenii self-assigned this Aug 10, 2022
}
catch (...)
{
return false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why there can be an exception? Is some tryLogCurrentException needed? or at least LOG_TRACE?

Copy link
Member Author

@Avogar Avogar Aug 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can throw an exception when given structure is a valid columns declaration list from the parser's point of view, but contains invalid types. Examples: x Tuple, x Decimal(999)
I don't think we need to log this exception, as this tryParse* function suppose to work with invalid structures as well

Copy link
Member

@kssenii kssenii Aug 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then may be it makes sense to add tryGetColumnsDescription (not getColumnsDescription) in this case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, I will check if it's possible

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked getColumnsDescription implementation and I don't think it's worth rewriting it to implement tryGetColumnsDescription, because it will make this function more complex and in some places it will require try/catch construction. Let's keep at as is.

@@ -104,6 +115,26 @@ NamesAndTypesList IRowSchemaReader::readSchema()
"Most likely setting input_format_max_rows_to_read_for_schema_inference is set to 0");

DataTypes data_types = readRowAndGetDataTypes();
/// If column names weren't set, use default names 'c1', 'c2', ...
if (column_names.empty())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In which case it is empty or non-empty?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's non-empty in two cases:

  1. Format contains names and they were explicitely set using setColumnNames method. For example in formats with suffix WithNames:
  2. Column names were set using special settig column_names_for_schema_inference.

@Avogar
Copy link
Member Author

Avogar commented Aug 17, 2022

@kssenii Can we merge it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-feature Pull request with new product feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow users to specify hints in schema inference
4 participants