Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: check overflow numbers while inferring type for csv files #6481

Merged
merged 2 commits into from
Oct 2, 2024

Conversation

CookiePieWw
Copy link
Contributor

@CookiePieWw CookiePieWw commented Sep 30, 2024

Which issue does this PR close?

Related to #2580. Also related to apache/datafusion#3174

Rationale for this change

Currently we use regex to infer types in .csv files. The regex for Int64 is (^-?(\d+)$) which accepts all numbers even overflow (this caused apache/datafusion#3174). Initially I think we can use a regex that match the numbers in range, but the regex will be too long (more than 300 chars as I tried.

We can turn to a function trying to parse the string to i64, which is simple and flexible. The original regex could be kept or changed to more effective funtions if needed.

What changes are included in this PR?

Change the regex mentioned above to funtions.

I only changed the boolean and i64 to functions since it's obvious. The regex of decimal is extended to accept overflowing numbers. Other regex is kept. I also add a TODO for further improvements. (I'd like to try to change it later if following questions are addressed)

Some questions here:

  1. The regex of timestamp is ^\d{4}-\d\d-\d\d[T ]\d\d:\d\d:\d\d(?:[^\d\.].*)?$ which accept some illegal timestamps like 1000-00-00T11:11:11(adewoifas). I wonder if it's alright to use chrono::NaiveDateTime::parse_from_str(s, "%Y-%m-%d %H:%M:%S").is_ok() to replace it.
  2. The tests are performed over the file uk_cities.csv which has meaningful content. I wonder if it's alright to add some meaningless strings to it for testing.

Are there any user-facing changes?

The numbers that overflow now will be inferred as decimal type instead of int64.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Sep 30, 2024
@tustvold
Copy link
Contributor

tustvold commented Sep 30, 2024

I worry this will significantly regress performance as well as regress functionality. The change to use NaiveDatetime will break timestamp inference for timestamps with timezones, for example. Moving away from RegexSet to serially checking Regex expressions will significantly slow down performance.

Perhaps we could take a step back and determine what it is we're trying to solve here? Inferring large integers as decimals is not really a like for like change

Edit: one less disruptive way to achieve this might be to on detecting Int64 from the RegexSet then try to parse the i64 if the string has enough characters that it could overflow. This would avoid regressing the majority of current workloads

@CookiePieWw
Copy link
Contributor Author

Edit: one less disruptive way to achieve this might be to on detecting Int64 from the RegexSet then try to parse the i64 if the string has enough characters that it could overflow. This would avoid regressing the majority of current workloads

Got you. I've changed it to a check for overflow now :)

Perhaps we could take a step back and determine what it is we're trying to solve here? Inferring large integers as decimals is not really a like for like change

We may have to find a type for large integers however, maybe utf8's better?

@CookiePieWw CookiePieWw changed the title refactor: use functions to infer types in csv fix: check overflow numbers while inferring type for csv files Sep 30, 2024
@tustvold
Copy link
Contributor

We may have to find a type for large integers however, maybe utf8's better?

Yeah I think utf8 is probably the safest option, float runs the risk of silent truncation, whereas decimal comes with its own complexities and can itself overflow.

I think all this needs now is some tests.

I don't think this is a breaking change, as the fact it inferred Int64 for values larger than fit in an Int64 could reasonably be considered a bug.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @CookiePieWw and @tustvold

The code now looks good to me -- it and it seems the overhead of checking the string length will be relatively minor compared to regex matching

I looked for benchmarks, but the only thing I see in https://github.com/apache/arrow-rs/blob/b2458bd686e5bc75397fde4a25f3a8b6c42ab064/arrow/benches/csv_reader.rs is for the actual reading (not type inference)

@tustvold tustvold merged commit 4389cf9 into apache:master Oct 2, 2024
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants