read_csv: consistent parsing across types (possible bug?) #10344
Comments
Is it only dates that have this problem?
@orlp I spent some hours on this because parsing/converting inside TLDR: in its current state I would advise against using
Some explanations

One of the most difficult bugs to find are silent errors. You think everything is fine, but it is not. This is especially true for data processing pipelines. A simple example where you read inventory data at specific dates:

from io import StringIO

import polars as pl

DATA_INVENTORY = '''
date,inventory
2019-01-01,10
2019-21-02,20
2019-01-03
2019-01-04,400
,50
2019-01-06,60
XXX,700
2019-01-08,80
2019-01-09,90
'''

pl.read_csv(
    source=StringIO(DATA_INVENTORY),
    dtypes={'date': pl.Date, 'inventory': pl.Int8},
    ignore_errors=True,
)

┌────────────┬───────────┐
│ date       ┆ inventory │
│ ---        ┆ ---       │
│ date       ┆ i8        │
╞════════════╪═══════════╡
│ 2019-01-01 ┆ 10        │
│ null       ┆ 20        │
│ 2019-01-03 ┆ null      │
│ 2019-01-04 ┆ null      │
│ null       ┆ 50        │
│ 2019-01-06 ┆ 60        │
│ null       ┆ null      │
│ 2019-01-08 ┆ 80        │
│ 2019-01-09 ┆ 90        │
└────────────┴───────────┘

In the first step we set

If we instead read everything as strings and parse the data afterwards, we get completely different behavior:

pl.read_csv(
    source=StringIO(DATA_INVENTORY),
    infer_schema_length=0,
).with_columns(
    pl.col("date").str.to_date(),  # ComputeError: strict date parsing failed for 2 value(s) (2 unique): ["2019-21-02", "XXX"]
    pl.col("inventory").cast(pl.Int8),  # ComputeError: strict conversion from `str` to `i8` failed for value(s) ["700", "400"]
)

This is what we expect! Not sure why

Hereafter, a small overview of what I discovered so far:

Integers
Thank you for your detailed investigation. I'll look at it later to see which things need to be changed.
This is because some dtypes are directly supported in our reader, whilst others are cast after parsing (from another dtype, e.g. Utf8 or I32). I shall ensure we respect the
Problem description
Consistent parsing across types
This is my biggest gripe currently, to be honest, and this might even be a bug/unintentional because I could not find it explained in the docs ^^
Reason
currently:
problems:
ignore_errors=False should crash independent of type if a conversion fails
proposal:
benefits:
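
To make the requested behaviour concrete, here is a minimal standard-library sketch of "read everything as strings, then convert strictly". It is not how polars implements its reader; `convert_column`, `to_date` and `to_int8` are hypothetical helpers invented for this illustration, and the only thing borrowed from the thread is the inventory data and the idea that a failed conversion should raise for every column type unless errors are explicitly ignored.

```python
import csv
from datetime import date
from io import StringIO

# Same rows as the inventory example in this thread.
DATA_INVENTORY = """date,inventory
2019-01-01,10
2019-21-02,20
2019-01-03
2019-01-04,400
,50
2019-01-06,60
XXX,700
2019-01-08,80
2019-01-09,90
"""


def to_date(text):
    """Hypothetical converter: ISO date string -> datetime.date."""
    return date.fromisoformat(text)


def to_int8(text):
    """Hypothetical converter: string -> int that must fit into a signed 8-bit range."""
    value = int(text)
    if not -128 <= value <= 127:
        raise ValueError(f"{value} does not fit in i8")
    return value


def convert_column(values, converter, ignore_errors=False):
    """Convert a column of strings with the same error rule for every type.

    Missing or empty fields become None. Values that fail to convert raise a
    ValueError listing them, unless ignore_errors=True, in which case they
    become None as well.
    """
    converted, failed = [], []
    for text in values:
        if text is None or text == "":
            converted.append(None)
            continue
        try:
            converted.append(converter(text))
        except ValueError:
            failed.append(text)
            converted.append(None)
    if failed and not ignore_errors:
        raise ValueError(f"conversion failed for value(s) {failed}")
    return converted


# Read every field as a string first; short rows get None for missing fields.
rows = list(csv.DictReader(StringIO(DATA_INVENTORY)))
dates = [row["date"] for row in rows]
inventory = [row["inventory"] for row in rows]

# Strict mode: both columns raise, regardless of their target type.
for name, values, converter in [("date", dates, to_date), ("inventory", inventory, to_int8)]:
    try:
        convert_column(values, converter)
    except ValueError as err:
        print(f"{name}: {err}")
```

With `ignore_errors=True` the same helper returns the null-filled columns shown in the first table, so the lenient and the strict result differ only by an explicit flag and not by the column's dtype.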