Files without .parquet, .csv extension inferred as having no schema
#1736
Changed this description to be a bug, which I think better reflects what is going on.
@tustvold may I ask how you wrote the tmp parquet file? IMHO all parquet data files should end in .parquet.
@Ted-Jiang Removing this line in the SQL benchmark https://github.com/apache/arrow-datafusion/pull/1738/files#diff-d1dbff8af63c3a3fe4d918432f982181b40fa4b7e1641522a6a48904f521fc89R143 was what caused the issues. I don't feel strongly whether the
I agree -- an error about "can not infer schema" seems much better than silently ignoring
I plan on fixing this for 14.0.0
Some notes from investigating this. This is the relevant code:
// if no files need to be read, return an `EmptyExec`
if partitioned_file_lists.is_empty() {
    let schema = self.schema();
    let projected_schema = project_schema(&schema, projection.as_ref())?;
    return Ok(Arc::new(EmptyExec::new(false, projected_schema)));
}
The problem is that I am registering a CSV data source and the filename does not end with the expected extension. I think we just need to add an error check for this case and prompt the user to specify the file extension they want rather than use the default.
I propose that the error check we add is that at least one file exists with the specified extension.
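The proposed check can be sketched without any DataFusion types. This is only a minimal sketch, assuming the listing step keeps the paths that end with the configured extension; `select_files` and `check_extension_match` are hypothetical names for illustration, not DataFusion APIs.

```rust
/// Keep only the paths that end with `extension`.
/// Assumption: this mirrors the suffix filter applied when listing files.
fn select_files<'a>(paths: &[&'a str], extension: &str) -> Vec<&'a str> {
    paths
        .iter()
        .copied()
        .filter(|p| p.ends_with(extension))
        .collect()
}

/// Hypothetical check: return an error when no file matches,
/// instead of silently continuing with an empty file list.
fn check_extension_match<'a>(
    paths: &[&'a str],
    extension: &str,
) -> Result<Vec<&'a str>, String> {
    let selected = select_files(paths, extension);
    if selected.is_empty() {
        return Err(format!(
            "no files found with extension '{}'; specify the file extension explicitly",
            extension
        ));
    }
    Ok(selected)
}
```

With this shape, a directory containing only `data/part-0` and an expected extension of `.csv` yields an `Err` rather than an empty table.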
Makes sense to me
I also ran into this issue. The problem seems to be that even if only a single file is provided, we still try to match by extension. If my use case always uses a single CSV file for each table, would a reasonable workaround be setting the file extension to an empty string, as in the following?
ctx.register_csv(
    name,
    path,
    CsvReadOptions::new().file_extension("")
).await?
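A quick way to see why an empty extension could act as a catch-all, assuming files are selected by a plain suffix match on the path (an assumption about the listing logic, not something confirmed in this thread): in Rust, every string ends with the empty string. `matches_extension` is a hypothetical stand-in for that filter.

```rust
/// Hypothetical stand-in for the extension filter: a plain suffix match.
/// Assumption: the real filter compares the end of the path to the extension.
fn matches_extension(path: &str, extension: &str) -> bool {
    // `str::ends_with` returns true for an empty pattern,
    // so an empty extension accepts every path.
    path.ends_with(extension)
}
```

Under that assumption, `matches_extension("data/lineitem.tbl", "")` is true while `matches_extension("data/lineitem.tbl", ".csv")` is false, which is the behavior the workaround relies on.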
Maybe |
I tried this use case with the fix in #6147 from @aprimadi and it works great.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I wrote a test approximating
This would result in "Invalid identifier" errors, effectively claiming the column didn't exist. I verified the file existed, had the correct columns, etc... I was very confused 😆
Eventually I tracked this down to the schema being inferred as empty if the extension is not ".parquet". This feels unexpected.
Describe the solution you'd like
Either register_parquet should return an error if the extension is missing, or FileFormat::infer_schema should be more agnostic to file extensions.