Improving the CSV schema inference #2580

bezbac · 2022-08-24T18:17:44Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

There's an open issue in the datafusion repository concerning CSV schema inference. The current implementation in arrow infers Int64 for any numeric column whose values have no decimal point and don't match a date format. This causes problems when the CSV is read later if a value overflows the Int64 data type.

Here's the datafusion issue: apache/datafusion#3174

Describe the solution you'd like
Maybe arrow could try to support the UInt64 and Decimal128 datatypes as well when it notices that the values inside the CSV are too large for Int64. It could even fall back to String when these types are also too small, so that the CSV can always be read without problems.
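To make the idea concrete, here is a minimal sketch of the widening order described above, using only the standard library. It is not arrow's implementation: `InferredType`, `infer_integer_cell`, and `infer_integer_column` are hypothetical names, and the only fact borrowed from arrow is that Decimal128 holds at most 38 decimal digits.

```rust
/// Hypothetical candidate types, ordered from narrowest to widest.
/// The derived `Ord` follows declaration order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum InferredType {
    Int64,
    UInt64,
    Decimal128,
    Utf8,
}

/// Maximum number of decimal digits a Decimal128 value can hold.
const DECIMAL128_MAX_PRECISION: usize = 38;

/// Picks the narrowest candidate type that can represent one integer-looking cell.
fn infer_integer_cell(cell: &str) -> InferredType {
    // Drop one optional sign and require that only ASCII digits remain.
    let digits = cell
        .strip_prefix('-')
        .or_else(|| cell.strip_prefix('+'))
        .unwrap_or(cell);
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return InferredType::Utf8;
    }
    if cell.parse::<i64>().is_ok() {
        InferredType::Int64
    } else if cell.parse::<u64>().is_ok() {
        InferredType::UInt64
    } else if digits.trim_start_matches('0').len() <= DECIMAL128_MAX_PRECISION {
        InferredType::Decimal128
    } else {
        // Too many digits even for Decimal128: keep the raw text.
        InferredType::Utf8
    }
}

/// Picks one type for a whole column by taking the widest per-cell result.
/// An empty column defaults to Int64 in this sketch.
fn infer_integer_column<'a>(cells: impl IntoIterator<Item = &'a str>) -> InferredType {
    let mut widest = InferredType::Int64;
    let mut saw_negative = false;
    for cell in cells {
        saw_negative |= cell.starts_with('-');
        widest = widest.max(infer_integer_cell(cell));
    }
    // UInt64 cannot hold negative values, so a column that mixes negative
    // numbers with values above i64::MAX has to widen once more.
    if widest == InferredType::UInt64 && saw_negative {
        InferredType::Decimal128
    } else {
        widest
    }
}
```

For example, `infer_integer_column(["1", "18446744073709551615"])` would yield `InferredType::UInt64`, while adding a `"-1"` to the same column would widen it to `InferredType::Decimal128`.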

Describe alternatives you've considered
Alternatively, I imagine the column's type could be "upgraded" while reading the CSV whenever a parsing error is caused by an overflow. I imagine this would require casting all previously parsed values, which could hopefully be avoided given better inference results.
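A rough sketch of what this alternative might look like, again with only the standard library and hypothetical names (`IntColumn`, `read_with_upgrade`). Here `i128` merely stands in for a wider column type; the point is to show the re-cast of already parsed values that the upgrade would require.

```rust
use std::num::ParseIntError;

/// Hypothetical column storage: starts narrow and may be upgraded mid-read.
#[derive(Debug, PartialEq)]
enum IntColumn {
    Int64(Vec<i64>),
    Wide(Vec<i128>),
}

/// Parses cells as i64 and switches to i128 on the first failure.
/// Any i64 parse failure triggers the upgrade; input that is not an
/// integer at all then fails again as i128 and is returned as an error.
fn read_with_upgrade(cells: &[&str]) -> Result<IntColumn, ParseIntError> {
    let mut narrow: Vec<i64> = Vec::with_capacity(cells.len());
    for (idx, cell) in cells.iter().enumerate() {
        match cell.parse::<i64>() {
            Ok(value) => narrow.push(value),
            Err(_) => {
                // The upgrade: every value parsed so far has to be cast.
                let mut wide: Vec<i128> = narrow.iter().map(|&v| i128::from(v)).collect();
                // Re-parse the failing cell and all remaining cells as i128.
                for rest in &cells[idx..] {
                    wide.push(rest.parse::<i128>()?);
                }
                return Ok(IntColumn::Wide(wide));
            }
        }
    }
    Ok(IntColumn::Int64(narrow))
}
```

With input such as `&["1", "9223372036854775808", "3"]`, the first value is parsed as `i64`, then cast to `i128` once the second value overflows. That extra pass over earlier values is the cost that better up-front inference would avoid.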

Additional context
I'd be open to implementing this change. My naive approach would be something like this: 4b3104e. If anyone here has suggestions on how to improve it, I would be very happy to hear them.


tustvold · 2022-08-25T07:28:02Z

The approach you describe seems sensible to me. I would perhaps caution that the Decimal support within arrow is fairly limited at the moment, but perhaps that is a separate problem to solve 😄

bezbac added the enhancement label Aug 24, 2022

tustvold mentioned this issue Oct 23, 2022

Support reading DecimalArray from JSON data #2900

Closed

tustvold mentioned this issue Apr 26, 2023

[arrow_json]infer_json_schema() infers u64::MAX as type Int64 #4134

Open

haohuaijin mentioned this issue Mar 17, 2024

[question]when using the datafusion reading csv in rust project, it went wrong apache/datafusion#9652

Closed

CookiePieWw mentioned this issue Sep 30, 2024

fix: check overflow numbers while inferring type for csv files #6481

Merged
