read_csv: consistent parsing across types (possible bug?) #10344

Closed
Julian-J-S opened this issue Aug 7, 2023 · 4 comments · Fixed by #10641
Labels: enhancement (New feature or an improvement of an existing feature)

@Julian-J-S (Contributor)

Problem description

Consistent parsing across types

To be honest, this is currently my biggest gripe, and it might even be a bug/unintentional, because I could not find it explained in the docs ^^

Reason

  • I should be able to do the following things when reading:
    • read everything as str
    • specify types and try to parse; if a value is invalid, use null
    • specify types and force parsing; if a value is invalid, crash ->> this is not possible for all types and is really useful in production ETL (see the sketch below)
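
Roughly, these three modes map onto read_csv parameters like this (a minimal sketch; infer_schema_length=0 reads every column as str):

import polars as pl
from io import StringIO

csv = 'a\n1'

# 1) read everything as str (disable schema inference)
pl.read_csv(StringIO(csv), infer_schema_length=0)

# 2) specify types; parsing failures become null
pl.read_csv(StringIO(csv), dtypes={'a': pl.Int64}, ignore_errors=True)

# 3) specify types; parsing failures should crash
#    (the mode this issue reports as broken for some dtypes)
pl.read_csv(StringIO(csv), dtypes={'a': pl.Int64}, ignore_errors=False)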

currently:

import polars as pl
from io import StringIO

DTYPES = {
    'int': pl.Int64,
    'float': pl.Float64,
    'date': pl.Date,
}

DATA_VALID = 'int,float,date\n3,3.4,2023-01-31'
DATA_INVALID_INT = 'int,float,date\nX,3.4,2023-01-31'
DATA_INVALID_FLOAT = 'int,float,date\n3,X,2023-01-31'
DATA_INVALID_DATE = 'int,float,date\n3,3.4,X'

# Consistent: on error -> null
pl.read_csv(
    source=StringIO(DATA_VALID),            # ok
    # source=StringIO(DATA_INVALID_INT),    # ok; int: null
    # source=StringIO(DATA_INVALID_FLOAT),  # ok; float: null
    # source=StringIO(DATA_INVALID_DATE),   # ok; date: null
    dtypes=DTYPES,
    ignore_errors=True,  # Ignore problems; use null instead
)

# NOT consistent! On error: int/float -> Error; temporal types -> null !?!?
pl.read_csv(
    source=StringIO(DATA_VALID),            # ok
    # source=StringIO(DATA_INVALID_INT),    # ComputeError: Could not parse `X` as dtype `i64` at column 'int' (column number 1)
    # source=StringIO(DATA_INVALID_FLOAT),  # ComputeError: Could not parse `X` as dtype `f64` at column 'float' (column number 2)    
    # source=StringIO(DATA_INVALID_DATE),   # ok with date: null <<<<<<< PROBLEM! WHY??
    dtypes=DTYPES,
    ignore_errors=False,  # Do NOT ignore problems; please crash (not working for temporal types...)
)

problems:

  • this behaviour seems very unintuitive and unexpected. Is this intended? A bug? I can't find any explanation in the docs
  • in my opinion ignore_errors=False should crash independent of the type if a conversion fails
  • there is absolutely no way to differentiate if a date value was empty or could not be parsed

proposal:

pl.read_csv(
    source=StringIO(DATA_VALID),            # ok
    # source=StringIO(DATA_INVALID_INT),    # ComputeError: Could not parse `X` as dtype `i64` at column 'int' (column number 1)
    # source=StringIO(DATA_INVALID_FLOAT),  # ComputeError: Could not parse `X` as dtype `f64` at column 'float' (column number 2)
    # source=StringIO(DATA_INVALID_DATE),   # ComputeError: ... <<<<<<<< Yes, consistent Error!
    dtypes=DTYPES,
    ignore_errors=False,
)

benefits:

  • consistent parsing of all dtypes
  • intuitive behaviour
  • being able to differentiate between empty date values and parsing errors
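
For the production-ETL use case, a hypothetical usage sketch of the proposed strict behaviour (reusing DTYPES and DATA_INVALID_DATE from above; assuming the failure surfaces as pl.ComputeError, matching the error messages quoted earlier):

try:
    df = pl.read_csv(
        source=StringIO(DATA_INVALID_DATE),
        dtypes=DTYPES,
        ignore_errors=False,
    )
except pl.ComputeError:
    # fail the pipeline loudly instead of propagating silent nulls
    raise
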
Julian-J-S added the enhancement label on Aug 7, 2023
@orlp (Collaborator) commented on Aug 8, 2023

Is it only dates that have this problem?

@Julian-J-S (Contributor, Author)

@orlp I spent some hours on this because parsing/converting inside read_csv together with ignore_errors=False seems to be broken in many ways (inconsistent with itself, inconsistent with other methods, and it does not do what it says).

TLDR: in its current state I would advise against using read_csv in combination with dtypes + ignore_errors=False, because it is very error-prone and you might get silent errors. Instead, read everything as strings and parse the data afterwards.

  • even if you set ignore_errors=False you still get many silent errors! (null if a value cannot be parsed)
  • no way to distinguish between missing values and errors
  • parsing logic is not consistent with cast, str.strptime, str.to_date and str.to_datetime
  • parsing logic is not consistent within itself; overflow error on Int32 but ignored on Int8
  • ...

Some explanations

One of the hardest bugs to find is a silent error: you think everything is fine, but it is not. This is especially true for data processing pipelines.
This can happen with polars, even with what is probably its most-used method: read_csv.

Simple example where you read inventory data at specific dates:

DATA_INVENTORY = '''
date,inventory
2019-01-01,10
2019-21-02,20
2019-01-03
2019-01-04,400
,50
2019-01-06,60
XXX,700
2019-01-08,80
2019-01-09,90
'''

pl.read_csv(
    source=StringIO(DATA_INVENTORY),
    dtypes={'date': pl.Date, 'inventory': pl.Int8},
    ignore_errors=True,
)
┌────────────┬───────────┐
│ date       ┆ inventory │
│ ---        ┆ ---       │
│ date       ┆ i8        │
╞════════════╪═══════════╡
│ 2019-01-01 ┆ 10        │
│ null       ┆ 20        │
│ 2019-01-03 ┆ null      │
│ 2019-01-04 ┆ null      │
│ null       ┆ 50        │
│ 2019-01-06 ┆ 60        │
│ null       ┆ null      │
│ 2019-01-08 ┆ 80        │
│ 2019-01-09 ┆ 90        │
└────────────┴───────────┘

In the first step we set ignore_errors=True because we might expect some errors in the data.
If, on the other hand, we expect the data to be correct, we can set ignore_errors=False to raise an error when it is not.
However, in this case that does absolutely nothing: the data is read without raising, and we get silent errors!
This is a huge problem because we cannot validate the data this way and might assume all null values are missing values rather than errors!
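
To make this concrete, the same call with ignore_errors=False (per the behaviour reported here) returns the exact same frame instead of raising:

pl.read_csv(
    source=StringIO(DATA_INVENTORY),
    dtypes={'date': pl.Date, 'inventory': pl.Int8},
    ignore_errors=False,  # reportedly has no effect here: same frame, no error
)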

If we instead read everything as strings and parse the data afterwards, we get completely different behavior:

pl.read_csv(
    source=StringIO(DATA_INVENTORY),
    infer_schema_length=0,
).with_columns(
    pl.col("date").str.to_date(), # ComputeError: strict date parsing failed for 2 value(s) (2 unique): ["2019-21-02", "XXX"]
    pl.col("inventory").cast(pl.Int8), # ComputeError: strict conversion from `str` to `i8` failed for value(s) ["700", "400"]
)

This is what we expect! Not sure why read_csv seems to use a different parsing logic?!?

Below is a small overview of what I have discovered so far:

Integers

Int8

  • read_csv with ignore_errors=False
    • 128, -129: silently converted to null 🛑
    • xxx: ComputeError 🟢
  • conversion on str column
    • 128, -129, xxx: ComputeError 🟢
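
A minimal repro sketch of the Int8 behaviour above (as reported, on the polars version current at the time of this issue):

csv = 'x\n128'  # 128 overflows Int8

pl.read_csv(StringIO(csv), dtypes={'x': pl.Int8}, ignore_errors=False)
# reportedly returns x = null instead of raising

pl.read_csv(StringIO(csv), infer_schema_length=0).with_columns(
    pl.col('x').cast(pl.Int8)
)
# ComputeError: strict conversion from `str` to `i8` failed ...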

UInt8

  • read_csv with ignore_errors=False
    • -1, 256: silently converted to null 🛑
    • xxx: ComputeError 🟢
  • conversion on str column
    • -1, 256, xxx: ComputeError 🟢

Int32 (All good; Why is this different from Int8?)

  • read_csv with ignore_errors=False
    • 2147483649, -2147483649, xxx: ComputeError 🟢
  • conversion on str column
    • 2147483649, -2147483649, xxx: ComputeError 🟢

Temporal

Date

  • read_csv with ignore_errors=False
    • 2023-01-32, 2023-00-01, xxx: silently converted to null 🛑
  • conversion on str column
    • 2023-01-32, 2023-00-01, xxx: ComputeError 🟢
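
A minimal repro sketch of the Date behaviour above (same assumptions):

csv = 'd\n2023-01-32'  # invalid day of month

pl.read_csv(StringIO(csv), dtypes={'d': pl.Date}, ignore_errors=False)
# reportedly returns d = null instead of raising

pl.read_csv(StringIO(csv), infer_schema_length=0).with_columns(
    pl.col('d').str.to_date()
)
# ComputeError: strict date parsing failed ...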

Datetime

  • read_csv with ignore_errors=False
    • 2023-01-31: falsely converted to null 🛑
    • -1000-01-01 00:00:00.000: falsely converted to null! (but negative Date works...) 🛑
    • xxx: silently converted to null 🛑
    • 2023-01-31 24:56:78: somehow works with date "2023-01-31 00:00:00.000000" ?? 🟧
  • conversion on str column
    • 2023-01-31, -1000-01-01 00:00:00.000: works fine! 🟢
    • 2023-01-31 24:56:78: somehow works with date "2023-01-31 00:00:00.000000" ?? 🟧

@orlp (Collaborator) commented on Aug 11, 2023

Thank you for your detailed investigation. I'll look at it later to see which things need to be changed.

@ritchie46 (Member) commented on Aug 21, 2023

This is due to the fact that some dtypes are directly supported in our reader, whilst others are cast after parsing (from another dtype, e.g. Utf8 or Int32).

I shall ensure we respect the ignore_errors flag during this cast.
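
Conceptually, the fix could amount to threading the flag into that secondary cast. An illustrative Python-level sketch only (the actual reader is implemented in Rust; cast_after_parse is a hypothetical helper, not polars API):

import polars as pl

def cast_after_parse(s: pl.Series, target: pl.DataType, ignore_errors: bool) -> pl.Series:
    # dtypes the reader cannot parse natively are first read as another
    # dtype and then cast; strict=True raises on failed conversions,
    # strict=False turns them into null, so the cast should honor the flag
    return s.cast(target, strict=not ignore_errors)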
