read_csv: consistent parsing across types (possible bug?) #10344

Closed
Julian-J-S opened this issue Aug 7, 2023 · 4 comments · Fixed by #10641
Labels: enhancement (New feature or an improvement of an existing feature)

@Julian-J-S (Contributor)

Problem description

Consistent parsing across types

To be honest, this is currently my biggest gripe, and it might even be a bug/unintentional, because I could not find it explained in the docs ^^

Reason

  • I should be able to do the following things when reading:
    • read everything as str
    • specify types and try to parse; if a value is invalid, use null
    • specify types and force parsing; if a value is invalid, crash ->> this is not possible for all types and is really useful in production ETL (see the sketch below)
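
Roughly, these three modes map onto read_csv parameters like this (a minimal sketch; infer_schema_length=0 reads every column as str):

import polars as pl
from io import StringIO

csv = 'a\n1'

# 1) read everything as str (disable schema inference)
pl.read_csv(StringIO(csv), infer_schema_length=0)

# 2) specify types; parsing failures become null
pl.read_csv(StringIO(csv), dtypes={'a': pl.Int64}, ignore_errors=True)

# 3) specify types; parsing failures should crash
#    (the mode this issue reports as broken for some dtypes)
pl.read_csv(StringIO(csv), dtypes={'a': pl.Int64}, ignore_errors=False)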

currently:

import polars as pl
from io import StringIO

DTYPES = {
    'int': pl.Int64,
    'float': pl.Float64,
    'date': pl.Date,
}

DATA_VALID = 'int,float,date\n3,3.4,2023-01-31'
DATA_INVALID_INT = 'int,float,date\nX,3.4,2023-01-31'
DATA_INVALID_FLOAT = 'int,float,date\n3,X,2023-01-31'
DATA_INVALID_DATE = 'int,float,date\n3,3.4,X'

# Consistent: on error -> null
pl.read_csv(
    source=StringIO(DATA_VALID),            # ok
    # source=StringIO(DATA_INVALID_INT),    # ok; int: null
    # source=StringIO(DATA_INVALID_FLOAT),  # ok; float: null
    # source=StringIO(DATA_INVALID_DATE),   # ok; date: null
    dtypes=DTYPES,
    ignore_errors=True,  # Ignore problems; use null instead
)

# NOT consistent! On error: int/float -> Error; temporal types -> null !?!?
pl.read_csv(
    source=StringIO(DATA_VALID),            # ok
    # source=StringIO(DATA_INVALID_INT),    # ComputeError: Could not parse `X` as dtype `i64` at column 'int' (column number 1)
    # source=StringIO(DATA_INVALID_FLOAT),  # ComputeError: Could not parse `X` as dtype `f64` at column 'float' (column number 2)    
    # source=StringIO(DATA_INVALID_DATE),   # ok with date: null <<<<<<< PROBLEM! WHY??
    dtypes=DTYPES,
    ignore_errors=False,  # Do NOT ignore problems; please crash (not working for temporal types...)
)

problems:

  • this behaviour seems very unintuitive and unexpected. Is this intended? A bug? I can't find any explanation in the docs
  • in my opinion ignore_errors=False should crash independent of the type if a conversion fails
  • there is absolutely no way to differentiate if a date value was empty or could not be parsed

proposal:

pl.read_csv(
    source=StringIO(DATA_VALID),            # ok
    # source=StringIO(DATA_INVALID_INT),    # ComputeError: Could not parse `X` as dtype `i64` at column 'int' (column number 1)
    # source=StringIO(DATA_INVALID_FLOAT),  # ComputeError: Could not parse `X` as dtype `f64` at column 'float' (column number 2)
    # source=StringIO(DATA_INVALID_DATE),   # ComputeError: ... <<<<<<<< Yes, consistent Error!
    dtypes=DTYPES,
    ignore_errors=False,
)

benefits:

  • consistent parsing of all dtypes
  • intuitive behaviour
  • being able to differentiate between empty date values and parsing errors
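
For the production-ETL use case, a hypothetical usage sketch of the proposed strict behaviour (reusing DTYPES and DATA_INVALID_DATE from above; assuming the failure surfaces as pl.ComputeError, matching the error messages quoted earlier):

try:
    df = pl.read_csv(
        source=StringIO(DATA_INVALID_DATE),
        dtypes=DTYPES,
        ignore_errors=False,
    )
except pl.ComputeError:
    # fail the pipeline loudly instead of propagating silent nulls
    raise
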
Julian-J-S added the enhancement label on Aug 7, 2023
@orlp (Collaborator) commented on Aug 8, 2023

Is it only dates that have this problem?

@Julian-J-S (Contributor, Author)

@orlp I spent some hours on this because parsing/converting inside read_csv together with ignore_errors=False seems to be broken in many ways (inconsistent with itself, inconsistent with other methods, and it does not do what it says).

TLDR: in its current state I would advise against using read_csv in combination with dtypes + ignore_errors=False, because it is very error-prone and you might get silent errors. Instead, read everything as strings and parse the data afterwards.

  • even if you set ignore_errors=False you still get many silent errors! (null if a value cannot be parsed)
  • no way to distinguish between missing values and errors
  • parsing logic is not consistent with cast, str.strptime, str.to_date and str.to_datetime
  • parsing logic is not consistent within itself; overflow error on Int32 but ignored on Int8
  • ...

Some explanations

One of the hardest bugs to find is a silent error: you think everything is fine, but it is not. This is especially true for data processing pipelines.
This can happen with polars, even with what is probably its most-used method: read_csv.

Simple example where you read inventory data at specific dates:

DATA_INVENTORY = '''
date,inventory
2019-01-01,10
2019-21-02,20
2019-01-03
2019-01-04,400
,50
2019-01-06,60
XXX,700
2019-01-08,80
2019-01-09,90
'''

pl.read_csv(
    source=StringIO(DATA_INVENTORY),
    dtypes={'date': pl.Date, 'inventory': pl.Int8},
    ignore_errors=True,
)
┌────────────┬───────────┐
│ date       ┆ inventory │
│ ---        ┆ ---       │
│ date       ┆ i8        │
╞════════════╪═══════════╡
│ 2019-01-01 ┆ 10        │
│ null       ┆ 20        │
│ 2019-01-03 ┆ null      │
│ 2019-01-04 ┆ null      │
│ null       ┆ 50        │
│ 2019-01-06 ┆ 60        │
│ null       ┆ null      │
│ 2019-01-08 ┆ 80        │
│ 2019-01-09 ┆ 90        │
└────────────┴───────────┘

In the first step we set ignore_errors=True because we might expect some errors in the data.
If, on the other hand, we expect the data to be correct, we can set ignore_errors=False to raise an error when it is not.
However, in this case that does absolutely nothing: the data is read without raising, and we get silent errors!
This is a huge problem because we cannot validate the data this way and might assume all null values are missing values rather than errors!
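
To make this concrete, the same call with ignore_errors=False (per the behaviour reported here) returns the exact same frame instead of raising:

pl.read_csv(
    source=StringIO(DATA_INVENTORY),
    dtypes={'date': pl.Date, 'inventory': pl.Int8},
    ignore_errors=False,  # reportedly has no effect here: same frame, no error
)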

If we instead read everything as strings and parse the data afterwards, we get completely different behavior:

pl.read_csv(
    source=StringIO(DATA_INVENTORY),
    infer_schema_length=0,
).with_columns(
    pl.col("date").str.to_date(), # ComputeError: strict date parsing failed for 2 value(s) (2 unique): ["2019-21-02", "XXX"]
    pl.col("inventory").cast(pl.Int8), # ComputeError: strict conversion from `str` to `i8` failed for value(s) ["700", "400"]
)

This is what we expect! Not sure why read_csv seems to use a different parsing logic?!?

Below is a small overview of what I have discovered so far:

Integers

Int8

  • read_csv with ignore_errors=False
    • 128, -129: silently converted to null 🛑
    • xxx: ComputeError 🟢
  • conversion on str column
    • 128, -129, xxx: ComputeError 🟢
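
A minimal repro sketch of the Int8 behaviour above (as reported, on the polars version current at the time of this issue):

csv = 'x\n128'  # 128 overflows Int8

pl.read_csv(StringIO(csv), dtypes={'x': pl.Int8}, ignore_errors=False)
# reportedly returns x = null instead of raising

pl.read_csv(StringIO(csv), infer_schema_length=0).with_columns(
    pl.col('x').cast(pl.Int8)
)
# ComputeError: strict conversion from `str` to `i8` failed ...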

UInt8

  • read_csv with ignore_errors=False
    • -1, 256: silently converted to null 🛑
    • xxx: ComputeError 🟢
  • conversion on str column
    • -1, 256, xxx: ComputeError 🟢

Int32 (All good; Why is this different from Int8?)

  • read_csv with ignore_errors=False
    • 2147483649, -2147483649, xxx: ComputeError 🟢
  • conversion on str column
    • 2147483649, -2147483649, xxx: ComputeError 🟢

Temporal

Date

  • read_csv with ignore_errors=False
    • 2023-01-32, 2023-00-01, xxx: silently converted to null 🛑
  • conversion on str column
    • 2023-01-32, 2023-00-01, xxx: ComputeError 🟢
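
A minimal repro sketch of the Date behaviour above (same assumptions):

csv = 'd\n2023-01-32'  # invalid day of month

pl.read_csv(StringIO(csv), dtypes={'d': pl.Date}, ignore_errors=False)
# reportedly returns d = null instead of raising

pl.read_csv(StringIO(csv), infer_schema_length=0).with_columns(
    pl.col('d').str.to_date()
)
# ComputeError: strict date parsing failed ...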

Datetime

  • read_csv with ignore_errors=False
    • 2023-01-31: falsely converted to null 🛑
    • -1000-01-01 00:00:00.000: falsely converted to null! (but negative Date works...) 🛑
    • xxx: silently converted to null 🛑
    • 2023-01-31 24:56:78: somehow works with date "2023-01-31 00:00:00.000000" ?? 🟧
  • conversion on str column
    • 2023-01-31, -1000-01-01 00:00:00.000: works fine! 🟢
    • 2023-01-31 24:56:78: somehow works with date "2023-01-31 00:00:00.000000" ?? 🟧

@orlp (Collaborator) commented on Aug 11, 2023

Thank you for your detailed investigation. I'll look at it later to see which things need to be changed.

@ritchie46 (Member) commented on Aug 21, 2023

This is due to the fact that some dtypes are directly supported in our reader, whilst others are cast after parsing (from another dtype, e.g. Utf8 or Int32).

I shall ensure we respect the ignore_errors flag during this cast.
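
Conceptually, the fix could amount to threading the flag into that secondary cast. An illustrative Python-level sketch only (the actual reader is implemented in Rust; cast_after_parse is a hypothetical helper, not polars API):

import polars as pl

def cast_after_parse(s: pl.Series, target: pl.DataType, ignore_errors: bool) -> pl.Series:
    # dtypes the reader cannot parse natively are first read as another
    # dtype and then cast; strict=True raises on failed conversions,
    # strict=False turns them into null, so the cast should honor the flag
    return s.cast(target, strict=not ignore_errors)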
