Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(rust,python): address multiple issues caused by implicit casting of is_in values to the column dtype being searched #11427

Merged

Conversation

alexander-beedie
Copy link
Collaborator

@alexander-beedie alexander-beedie commented Sep 29, 2023

Closes #11394.

  • The linked issue shows that we were casting is_in values to string if the target series was Utf8:

    from datetime import date
    import polars as pl
    
    df = pl.DataFrame({
        "a": ["1","2"],
        "b": [3,4],
    })
    df.filter( pl.col("a").is_in([1,2]) )  # << filtering string by int64 🚫
    # ┌─────┬─────┐
    # │ a   ┆ b   │
    # │ --- ┆ --- │
    # │ str ┆ i64 │
    # ╞═════╪═════╡
    # │ 1   ┆ 3   │
    # │ 2   ┆ 4   │
    # └─────┴─────┘
  • This led down a surprisingly deep rabbit-hole, as various other incompatible dtype pairs were being allowed via implicit cast, leading to incorrect or weird/misleading results - for example, the following filters silently cast and drop all results:

    df = pl.DataFrame({
        "a": ["1","2"],
        "b": [3,4],
    })
    df.filter( pl.col("a").is_in([date.today()]) )  # << filtering string by date 🚫
    # ┌─────┬─────┐
    # │ a   ┆ b   │
    # │ --- ┆ --- │
    # │ str ┆ i64 │
    # ╞═════╪═════╡
    # └─────┴─────┘
    
    df.filter( pl.col("b").is_in([date.today()]) )  # << filtering int64 by date 🚫
    # ┌─────┬─────┐
    # │ a   ┆ b   │
    # │ --- ┆ --- │
    # │ str ┆ i64 │
    # ╞═════╪═════╡
    # └─────┴─────┘

    I've resolved this by raising InvalidOperationError where the dtype of the is_in values is not a reasonable match for the dtype of the target Series. We should fail with a clear error here; if the user wants to cast, they can do so explicitly without any trouble.

    In the cases above this means we now raise the following exceptions:

    InvalidOperationError: 
      `is_in` cannot check for Int64 values in Utf8 data
      `is_in` cannot check for Date values in Int64 data
    
  • Also spotted that we could return false positives on datetime/duration matches by allowing is_in values with a more granular time_unit than the target Series (as we don't cast to supertype); this could also lead to incorrect filtering:

    # data has millisecond precision
    df = pl.DataFrame({
        "d": [datetime(2022,10,20,10,30,45,123000)],
    }, schema_overrides={"d": pl.Datetime("ms")})
    
    # filter value has microsecond precision and should NOT match!
    df.filter( 
        pl.col("d").is_in([datetime(2022,10,20,10,30,45,123456)])  # (ms > μs) 🚫
    )
    
    # but... :(
    # ┌─────────────────────────┐
    # │ d                       │
    # │ ---                     │
    # │ datetime[ms]            │
    # ╞═════════════════════════╡
    # │ 2022-10-20 10:30:45.123 │
    # └─────────────────────────┘

    Solved this by validating that the relative time_unit values cannot result in silent truncation/rounding. The above example now raises the following:

    InvalidOperationError: 
      `is_in` cannot check for Microseconds precision values in Milliseconds Datetime data
    

(Note that we still allow all of the obvious/expected casts between basic int/float numeric dtypes, etc - only clearly mismatched dtypes -as above- will now raise an exception).

@github-actions github-actions bot added fix Bug fix python Related to Python Polars rust Related to Rust Polars labels Sep 29, 2023
@alexander-beedie alexander-beedie marked this pull request as draft September 29, 2023 22:38
@alexander-beedie alexander-beedie force-pushed the fix-invalid-is-in-dtype-casts branch 3 times, most recently from 2bbc535 to d953a64 Compare September 29, 2023 23:18
@alexander-beedie alexander-beedie marked this pull request as ready for review September 30, 2023 07:59
@alexander-beedie alexander-beedie changed the title fix(rust,python): address multiple issues caused by casting is_in values to the column type being searched fix(rust,python): address multiple issues caused by casting is_in values to the column dtype being searched Sep 30, 2023
@alexander-beedie alexander-beedie changed the title fix(rust,python): address multiple issues caused by casting is_in values to the column dtype being searched fix(rust,python): address multiple issues caused by implicit casting of is_in values to the column dtype being searched Sep 30, 2023
@ritchie46
Copy link
Member

Nice one thanks!

@ritchie46 ritchie46 merged commit 626b7a6 into pola-rs:main Oct 1, 2023
25 checks passed
@alexander-beedie alexander-beedie deleted the fix-invalid-is-in-dtype-casts branch October 1, 2023 10:43
romanovacca pushed a commit to romanovacca/polars that referenced this pull request Oct 1, 2023
…of `is_in` values to the column dtype being searched (pola-rs#11427)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix Bug fix python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

inconsistency with .is_in when filtering string digits
2 participants