Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Files without .parquet, .csv extension inferred as having no schema #1736

Closed
tustvold opened this issue Feb 3, 2022 · 11 comments · Fixed by #6274
Closed

Files without .parquet, .csv extension inferred as having no schema #1736

tustvold opened this issue Feb 3, 2022 · 11 comments · Fixed by #6274
Assignees
Labels
bug Something isn't working enhancement New feature or request
Milestone

Comments

@tustvold
Copy link
Contributor

tustvold commented Feb 3, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I wrote a test approximating

let file = tempfile::tempfile();

// ... write parquet data ...

let mut context = ExecutionContext::new();
context.register_parquet("t", file.path().as_str())
context.sql("select column from t");

This would result in "Invalid identifier" errors, effectively claiming the column didn't exist. I verified the file existed, had the correct columns, etc... I was very confused 😆

Eventually I tracked this down to the schema being inferred as empty if the extension is not ".parquet", this feels unexpected

Describe the solution you'd like

Either register_parquet should return an error if the extension is missing, or FileFormat::infer_schema should be more agnostic to file extensions.

@tustvold tustvold added the enhancement New feature or request label Feb 3, 2022
@alamb alamb changed the title File Extension Agnostic Schema Inference Parquet files without .parquet extension inferred as having no schema Feb 3, 2022
@alamb alamb added the bug Something isn't working label Feb 3, 2022
@alamb
Copy link
Contributor

alamb commented Feb 3, 2022

Change this description to be a bug, which I think better reflects what is going on

@Ted-Jiang
Copy link
Member

@tustvold may i ask you how to write a tmp parquet file, IMHP i think all parquet data file should end .parquet.

@tustvold
Copy link
Contributor Author

tustvold commented Feb 7, 2022

@Ted-Jiang Remove this line in the SQL benchmark https://github.com/apache/arrow-datafusion/pull/1738/files#diff-d1dbff8af63c3a3fe4d918432f982181b40fa4b7e1641522a6a48904f521fc89R143 and that was what caused issues.

I don't feel strongly whether the .parquet file should be mandatory or not, but I do think silently inferring an empty schema makes for a pretty poor UX 😅

@alamb
Copy link
Contributor

alamb commented Feb 7, 2022

but I do think silently inferring an empty schema makes for a pretty poor UX 😅

I agree -- an error about "can not infer schema" seems much better than silently ignoring

@tustvold tustvold changed the title Parquet files without .parquet extension inferred as having no schema Files without .parquet, .csv extension inferred as having no schema Jul 17, 2022
@andygrove andygrove added this to the 14.0.0 milestone Oct 30, 2022
@andygrove andygrove self-assigned this Oct 30, 2022
@andygrove
Copy link
Member

I plan on fixing this for 14.0.0

@andygrove
Copy link
Member

andygrove commented Oct 30, 2022

Some notes from investigating this.

This is where the EmptyExec gets introduced. I ran into this issue in Ballista as well in apache/datafusion-ballista#414

        // if no files need to be read, return an `EmptyExec`
        if partitioned_file_lists.is_empty() {
            let schema = self.schema();
            let projected_schema = project_schema(&schema, projection.as_ref())?;
            return Ok(Arc::new(EmptyExec::new(false, projected_schema)));
        }

The problem is I am registering a CSV data source and the filename does not end with .csv and we have hard-coded assumptions that CSV files must end with ".csv", so we cannot support .tsv or .tbl (without the user explicitly specifying this, and it is not possible to specify this in SQL as far as I know).

I think we just need to add an error check for this case and prompt the user to specify the file extension they want rather than use the default.

@andygrove
Copy link
Member

I propose that the error check we add is that at least one file exists with the specified extension.

@alamb
Copy link
Contributor

alamb commented Oct 31, 2022

I propose that the error check we add is that at least one file exists with the specified extension.

Makes sense to me

@liningpan
Copy link

I also ran into this issue. The problem seems to be even if only a single file is provided, we still try to match by extension.

https://github.com/apache/arrow-datafusion/blob/0e5f6df2c4fb2d647874102a19c74eaaf7f34d98/datafusion/core/src/datasource/listing/url.rs#L143-L175

If my use case always uses a single csv file for each table, would a reasonable workaround be setting file extension to an empty string in CsvReadOptions? Are there any problems?

ctx.register_csv(
    name,
    path,
    CsvReadOptions::new().file_extension("")
).await?

@andygrove
Copy link
Member

Maybe file_extension could be made an Option<String>, defaulting to None?

@alamb
Copy link
Contributor

alamb commented May 8, 2023

I tried this usecase with the fix in #6147 from @aprimadi and it works great :

(arrow_dev) alamb@MacBook-Pro-8:~/Software/arrow-datafusion/datafusion-cli$ head /tmp/test/customer
1|Customer#000000001|IVhzIApeRb ot,c,E|15|25-989-741-2988|711.56|BUILDING|to the even, regular platelets. regular, ironic epitaphs nag e|
2|Customer#000000002|XSTf4,NCwDVaWNe6tEgvwfmRchLXak|13|23-768-687-3665|121.65|AUTOMOBILE|l accounts. blithely ironic theodolites integrate boldly: caref|
3|Customer#000000003|MG9kdTD2WBHm|1|11-719-748-3364|7498.12|AUTOMOBILE| deposits eat slyly ironic, even instructions. express foxes detect slyly. blithely even accounts abov|
4|Customer#000000004|XxVSJsLAGtn|4|14-128-190-5944|2866.83|MACHINERY| requests. final, regular ideas sleep final accou|
5|Customer#000000005|KvpyuHCplrB84WgAiGV6sYpZq7Tj|3|13-750-942-6364|794.47|HOUSEHOLD|n accounts will have to unwind. foxes cajole accor|
6|Customer#000000006|sKZz0CsnMD7mp4Xd0YrBvx,LREYKUWAh yVn|20|30-114-968-4951|7638.57|AUTOMOBILE|tions. even deposits boost according to the slyly bold packages. final accounts cajole requests. furious|
7|Customer#000000007|TcGe5gaZNgVePxU5kRrvXBfkasDTea|18|28-190-982-9759|9561.95|AUTOMOBILE|ainst the ironic, express theodolites. express, even pinto beans among the exp|
8|Customer#000000008|I0B10bB0AymmC, 0PrRYBCP1yGJ8xcBPmWhl5|17|27-147-574-9335|6819.74|BUILDING|among the slyly regular theodolites kindle blithely courts. carefully even theodolites haggle slyly along the ide|
9|Customer#000000009|xKiAFTjUsCuxfeleNqefumTrjS|8|18-338-906-3675|8324.07|FURNITURE|r theodolites according to the requests wake thinly excuses: pending requests haggle furiousl|
10|Customer#000000010|6LrEaV6KR6PLVcgl2ArL Q3rqzLzcT1 v2|5|15-741-346-9870|2753.54|HOUSEHOLD|es regular deposits haggle. fur|
(arrow_dev) alamb@MacBook-Pro-8:~/Software/arrow-datafusion/datafusion-cli$ CARGO_TARGET_DIR=/Users/alamb/Software/target-df cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.29s
     Running `/Users/alamb/Software/target-df/debug/datafusion-cli`
DataFusion CLI v23.0.0
❯ CREATE EXTERNAL TABLE customer STORED AS CSV DELIMITER '|' LOCATION  '/tmp/test/customer';
0 rows in set. Query took 0.290 seconds.
❯ select * from customer limit 1;
+----------+--------------------+-------------------+----------+-----------------+----------+----------+----------------------------------------------------------------+----------+
| column_1 | column_2           | column_3          | column_4 | column_5        | column_6 | column_7 | column_8                                                       | column_9 |
+----------+--------------------+-------------------+----------+-----------------+----------+----------+----------------------------------------------------------------+----------+
| 1        | Customer#000000001 | IVhzIApeRb ot,c,E | 15       | 25-989-741-2988 | 711.56   | BUILDING | to the even, regular platelets. regular, ironic epitaphs nag e |          |
+----------+--------------------+-------------------+----------+-----------------+----------+----------+----------------------------------------------------------------+----------+
1 row in set. Query took 0.065 seconds.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request
Projects
None yet
5 participants