Support reading from CSV, Avro and Json files that have mergeable/compatible, but not identical schemas #1669

alamb · 2022-01-24T20:12:29Z

@thinkharderdev implemented schema merging functionality for Parquet files in #1622. However, this logic only applies to Parquet, and @tustvold noted that it would likely also be useful for CSV, Avro and JSON files, so that DataFusion could read from files in those formats that have compatible but not identical schemas.

Specifically, the logic in read_partition might be extracted into some sort of SchemaAdapter, akin to PartitionColumnProjector. This would allow the logic to be reused with other file formats, e.g. JSON or CSV, whilst also allowing it to be tested in isolation.

Originally posted by @tustvold in #1622 (comment)
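
For readers unfamiliar with the idea, here is a minimal, std-only sketch of what such an adapter might do. It is not DataFusion's API: `Field`, `merge_schemas`, `map_to_file` and `adapt_row` are hypothetical names invented for illustration, and the real SchemaAdapter from #1709 works on Arrow schemas and record batches rather than these toy types. The sketch assumes columns are matched by name, and that a column missing from a file is filled with nulls.

```rust
// Toy stand-ins for Arrow types; hypothetical, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    Utf8,
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

type Schema = Vec<Field>;

fn field(name: &str, data_type: DataType) -> Field {
    Field { name: name.to_string(), data_type }
}

/// Merge per-file schemas into one table schema: the union of all fields,
/// matched by name, in first-seen order. A type conflict is an error.
fn merge_schemas(schemas: &[Schema]) -> Result<Schema, String> {
    let mut merged: Schema = Vec::new();
    for schema in schemas {
        for f in schema {
            // Look up the type already recorded for this column name, if any.
            let existing = merged
                .iter()
                .find(|m| m.name == f.name)
                .map(|m| m.data_type.clone());
            match existing {
                Some(dt) if dt != f.data_type => {
                    return Err(format!("conflicting types for column '{}'", f.name));
                }
                Some(_) => {} // same name and type: already present
                None => merged.push(f.clone()),
            }
        }
    }
    Ok(merged)
}

/// For each table column, the index of the matching column in the file
/// schema, or None if this particular file does not contain it.
fn map_to_file(table: &Schema, file: &Schema) -> Vec<Option<usize>> {
    table
        .iter()
        .map(|t| file.iter().position(|f| f.name == t.name))
        .collect()
}

/// Rearrange one row read from a file into the table layout, padding
/// columns the file lacks with None (standing in for null).
fn adapt_row<T: Clone>(mapping: &[Option<usize>], file_row: &[Option<T>]) -> Vec<Option<T>> {
    mapping
        .iter()
        .map(|idx| idx.and_then(|i| file_row[i].clone()))
        .collect()
}
```

Because nothing in this mapping depends on how the file was decoded, the same step could in principle sit behind a Parquet, CSV, Avro or JSON reader, which is the reuse the comment above is asking for.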


thinkharderdev · 2022-01-26T22:00:15Z

Since I made this mess I feel duty-bound to clean it up :) I can take this one.

alamb · 2022-01-27T11:11:52Z

Since I made this mess I feel duty-bound to clean it up :) I can take this one.

I wouldn't describe this as a mess! The ability to merge multiple Parquet files without the exact same schema is a great addition. This will just be the icing on the cake, as it were.

alamb mentioned this issue Jan 24, 2022

Handle merging of evolved schemas in ParquetExec #1622

Merged

thinkharderdev mentioned this issue Jan 30, 2022

Create SchemaAdapter trait to map table schema to file schemas #1709

Merged

alamb closed this as completed in #1709 Jan 31, 2022

alamb added the enhancement (New feature or request) and datafusion (Changes in the datafusion crate) labels Feb 10, 2022

tustvold mentioned this issue Mar 24, 2022

RFC: More Granular File Operators #2079

Closed
