refactor how we create listing tables #4227

Merged
merged 14 commits into from
Nov 17, 2022


timvw
Contributor

@timvw timvw commented Nov 15, 2022

Which issue does this PR close?

Closes #.

Rationale for this change

Currently the creation of external tables is somewhat strange: some types are hardcoded, while others are only supported via TableProviderFactories.
This PR tries to simplify the logic by always using a TableProviderFactory.
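
To illustrate the idea of routing every table type through a factory, here is a minimal std-only sketch. The `Table`, `TableFactory`, `ListingFactory`, `default_factories` and `create_external_table` names are hypothetical stand-ins invented for this example and are not DataFusion's actual types or signatures; the list of built-in format names is also an assumption.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for a table; not DataFusion's `TableProvider`.
#[derive(Debug, PartialEq)]
pub struct Table {
    pub format: String,
    pub location: String,
}

/// Hypothetical stand-in for the factory abstraction discussed in this PR.
pub trait TableFactory {
    fn create(&self, location: &str) -> Result<Table, String>;
}

/// One factory type that covers the built-in listing formats.
pub struct ListingFactory {
    format: &'static str,
}

impl TableFactory for ListingFactory {
    fn create(&self, location: &str) -> Result<Table, String> {
        Ok(Table {
            format: self.format.to_string(),
            location: location.to_string(),
        })
    }
}

/// Built-in formats are registered up front, so they use the same lookup
/// as any factory added later. The format names here are assumptions.
pub fn default_factories() -> HashMap<String, Arc<dyn TableFactory>> {
    let mut factories: HashMap<String, Arc<dyn TableFactory>> = HashMap::new();
    for format in ["CSV", "PARQUET", "JSON", "AVRO"] {
        factories.insert(format.to_string(), Arc::new(ListingFactory { format }));
    }
    factories
}

/// Single code path: no `match` on hardcoded file types.
pub fn create_external_table(
    factories: &HashMap<String, Arc<dyn TableFactory>>,
    file_type: &str,
    location: &str,
) -> Result<Table, String> {
    let factory = factories
        .get(file_type)
        .ok_or_else(|| format!("no table factory registered for file type {}", file_type))?;
    factory.create(location)
}
```

With this shape, an unknown file type fails at the lookup rather than in a hardcoded branch, and supporting a new type only requires inserting another entry into the map.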

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Nov 15, 2022
@timvw
Contributor Author

timvw commented Nov 15, 2022

@avantgardnerio Is it possible that "with_listing_schema_provider" implementation/test is missing something?

@timvw
Contributor Author

timvw commented Nov 15, 2022

I have updated the semantics of catalog.type so that it indicates the file format.

The only difference is that we now assume by default that CSV files do not have a header line.
The with_listing_schema_provider test uses the tests/tpch-csv data, which does have a header line.
Ideally we would extend the catalog.XXX options so that this can be provided as a parameter as well?
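
As a rough sketch of what such a parameter could look like, the snippet below reads a format and a header flag from string key/value options. The `ListingCatalogOptions` struct, the `parse_catalog_options` function and the `catalog.has_header` key are assumptions made up for illustration; only `catalog.type` and the "no header by default" behaviour come from this thread.

```rust
use std::collections::HashMap;

/// Hypothetical settings for tables discovered by a listing catalog.
#[derive(Debug, PartialEq)]
pub struct ListingCatalogOptions {
    /// File format name, taken from `catalog.type`.
    pub format: String,
    /// Whether CSV files start with a header line.
    pub has_header: bool,
}

/// Reads the options from string key/value pairs.
/// `catalog.has_header` is an assumed key name, used only for illustration.
pub fn parse_catalog_options(
    opts: &HashMap<String, String>,
) -> Result<ListingCatalogOptions, String> {
    let format = opts
        .get("catalog.type")
        .ok_or_else(|| "catalog.type is required".to_string())?
        .clone();
    let has_header = match opts.get("catalog.has_header") {
        // Default discussed in this thread: assume there is no header line.
        None => false,
        Some(raw) => raw
            .parse::<bool>()
            .map_err(|e| format!("invalid catalog.has_header value {:?}: {}", raw, e))?,
    };
    Ok(ListingCatalogOptions { format, has_header })
}
```

A test that reads data with a header line would then set the header option explicitly instead of relying on the default.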

@avantgardnerio
Contributor

Nice work, ty @timvw !

@avantgardnerio
Contributor

> Is it possible that "with_listing_schema_provider" implementation/test is missing something?

Definitely possible... what makes you ask?

@avantgardnerio
Contributor

> Ideally we would extend the catalog.XXX options so that this can be provided as a parameter as well?

That's what I was thinking...

@avantgardnerio
Contributor

Just because it's not obvious, I'd like to leave a note here to say that the real purpose of the factories is to allow other repos to register TableProviders that we don't even know about at compile time: delta-io/delta-rs#892

@timvw
Contributor Author

timvw commented Nov 16, 2022

@avantgardnerio

It is still possible to register other/unknown TableProviderFactories. This happens, for example, in the test sql::create_drop::create_external_table_with_ddl. Because the file type is passed in uppercase to CreateExternalTableCommand, the factory must be registered under an uppercase name (example).
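
The uppercase requirement can be sketched with a plain map lookup. `FactoryRegistry` and `lookup_from_ddl` are hypothetical names for this example, `DELTATABLE` is only an illustrative file type, and the strings stand in for real factory objects; the sketch assumes the lookup is an exact, case-sensitive match on the uppercased name.

```rust
use std::collections::HashMap;

/// Hypothetical registry mapping a file type name to a factory label.
/// A real registry would store factory objects; strings keep the sketch small.
pub struct FactoryRegistry {
    factories: HashMap<String, String>,
}

impl FactoryRegistry {
    pub fn new() -> Self {
        FactoryRegistry {
            factories: HashMap::new(),
        }
    }

    /// Registers a factory under exactly the name given (case-sensitive).
    pub fn register(&mut self, file_type: &str, factory: &str) {
        self.factories
            .insert(file_type.to_string(), factory.to_string());
    }

    /// Exact, case-sensitive lookup.
    pub fn get(&self, file_type: &str) -> Option<&str> {
        self.factories.get(file_type).map(String::as_str)
    }
}

/// Mimics the behaviour described above: the file type reaches the lookup
/// in uppercase, whatever case was used in the SQL text.
pub fn lookup_from_ddl<'a>(
    registry: &'a FactoryRegistry,
    file_type_in_sql: &str,
) -> Option<&'a str> {
    registry.get(&file_type_in_sql.to_uppercase())
}
```

Under these assumptions, a factory registered as `deltatable` is never found, while one registered as `DELTATABLE` is found regardless of how the type was spelled in the statement.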

There is a slight change in how ListingSchemaProvider behaves (this is very new functionality anyway): it now requires a file format and an optional has_header value.

(In a next step I'd like to move the has_header, delimiter, etc. values into the options of the CreateExternalTable command.)

@timvw timvw marked this pull request as ready for review November 16, 2022 10:43
@alamb
Contributor

alamb commented Nov 17, 2022

I am looking forward to reviewing this tomorrow -- thank you @timvw

Contributor

@alamb alamb left a comment

I think this looks great -- thank you so much for the cleanup @timvw

     let object_store_provider = DatafusionCliObjectStoreProvider {};
     let object_store_registry =
         ObjectStoreRegistry::new_with_provider(Some(Arc::new(object_store_provider)));
     let rn_config = RuntimeConfig::new()
-        .with_object_store_registry(Arc::new(object_store_registry))
-        .with_table_factories(table_factories);
+        .with_object_store_registry(Arc::new(object_store_registry));
Contributor

❤️

@@ -59,18 +61,24 @@ impl ListingSchemaProvider {
/// `path`: The root path that contains subfolders which represent tables
/// `factory`: The `TableProviderFactory` to use to instantiate tables for each subfolder
/// `store`: The `ObjectStore` containing the table data
/// `format`: The `FileFormat` of the tables
Contributor

I am not sure it matters, but the doc comment says FileFormat while the actual argument is a String.

Member

@andygrove andygrove left a comment

LGTM. Thanks @timvw

@andygrove andygrove merged commit e18f7ba into apache:master Nov 17, 2022
@ursabot

ursabot commented Nov 17, 2022

Benchmark runs are scheduled for baseline = a0581dc and contender = e18f7ba. e18f7ba is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@alamb
Contributor

alamb commented Nov 17, 2022

🚀
