fix(rust,python): address multiple issues caused by implicit casting of `is_in` values to the column dtype being searched
#11427
Closes #11394.
The linked issue shows that we were casting `is_in` values to string if the target Series was `Utf8`.

This led down a surprisingly deep rabbit hole, as various other incompatible dtype pairs were being allowed via implicit cast, leading to incorrect or weird/misleading results. For example, the following filters silently cast and drop all results:
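
As a rough illustration of this failure mode, here is a plain-Python model (not the Polars implementation; `lossy_cast` and `naive_is_in` are hypothetical names invented for this sketch). It assumes a non-strict cast that yields a null on failure, and shows how casting the lookup values to the column's type can either produce matches that were never really requested or leave nothing to match at all:

```python
from typing import Any, Optional


def lossy_cast(value: Any, target: type) -> Optional[Any]:
    """Hypothetical non-strict cast: returns None instead of raising on failure."""
    try:
        return target(value)
    except (TypeError, ValueError):
        return None


def naive_is_in(column: list, values: list, target: type) -> list:
    """Model of an is_in that implicitly casts `values` to the column type."""
    cast_values = {lossy_cast(v, target) for v in values}
    cast_values.discard(None)  # failed casts become nulls and never match
    return [item in cast_values for item in column]


# Integers cast to strings: the string "1" now matches the integer 1.
string_hits = naive_is_in(["1", "2", "x"], [1, 2], str)

# Strings cast to integers: every cast fails, so the filter drops all rows.
int_hits = naive_is_in([1, 2, 3], ["a", "b"], int)

print(string_hits)  # [True, True, False]
print(int_hits)     # [False, False, False]
```

Neither call raises, which is the core of the problem: the caller gets a plausible-looking boolean mask with no hint that a cast took place.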
I've resolved this by raising `InvalidOperationError` where the dtype of the `is_in` values is not a reasonable match for the dtype of the target Series. We should fail with a clear error here; if the user wants to cast, they can do so explicitly without any trouble.

In the cases above this means we now raise the following exceptions:
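
The idea of failing early on a clear dtype mismatch can be sketched in plain Python. This is only a model under stated assumptions: the `InvalidOperationError` class below is a local stand-in, and `dtype_group`, `strict_is_in`, the grouping rules, and the error message are all hypothetical and do not reproduce the checks or wording in the actual change.

```python
class InvalidOperationError(Exception):
    """Local stand-in for the error type named above (sketch only)."""


# Hypothetical compatibility groups: int/float may mix, strings only match strings.
COMPATIBLE = {
    "numeric": (int, float),
    "string": (str,),
}


def dtype_group(value) -> str:
    # bool is a subclass of int in Python, so give it its own group.
    if isinstance(value, bool):
        return "boolean"
    for name, types in COMPATIBLE.items():
        if isinstance(value, types):
            return name
    return type(value).__name__


def strict_is_in(column: list, values: list) -> list:
    """Raise instead of casting when the value and column types clearly differ."""
    column_groups = {dtype_group(item) for item in column}
    value_groups = {dtype_group(v) for v in values}
    if column_groups and value_groups and column_groups != value_groups:
        raise InvalidOperationError(
            f"cannot check for {sorted(value_groups)} values "
            f"in {sorted(column_groups)} data"
        )
    lookup = set(values)
    return [item in lookup for item in column]


# Mixing int and float is still allowed.
numeric_hits = strict_is_in([1, 2, 3], [2.0, 3])

# Looking up integers in string data now fails loudly.
try:
    strict_is_in(["a", "b"], [1, 2])
    raised = False
except InvalidOperationError:
    raised = True

print(numeric_hits)  # [False, True, True]
print(raised)        # True
```

A user who really does want the cast can still apply it to the values explicitly before calling the lookup, which keeps the conversion visible at the call site.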
I also spotted that we could return false positives on datetime/duration matches by allowing `is_in` values with a more granular `time_unit` than the target Series (as we don't cast to the supertype); this could also lead to incorrect filtering:

I solved this by validating that the relative `time_unit` values cannot result in silent truncation/rounding. The above example now raises the following:

(Note that we still allow all of the obvious/expected casts between basic int/float numeric dtypes, etc. Only clearly mismatched dtypes, as above, will now raise an exception.)
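
The `time_unit` false positive can be worked through with plain integer arithmetic. The sketch below is a model only: `truncate_to` and `check_time_units` are hypothetical helpers, timestamps are represented as bare integers, and a `ValueError` stands in for whatever error the real validation raises.

```python
# Nanoseconds per unit; a larger number means a coarser unit.
NS_PER_UNIT = {"ns": 1, "us": 1_000, "ms": 1_000_000}


def truncate_to(value: int, from_unit: str, to_unit: str) -> int:
    """Convert an integer timestamp between units, dropping any remainder."""
    ns = value * NS_PER_UNIT[from_unit]
    return ns // NS_PER_UNIT[to_unit]


def check_time_units(column_unit: str, values_unit: str) -> None:
    """Reject lookup values that are finer-grained than the column."""
    if NS_PER_UNIT[values_unit] < NS_PER_UNIT[column_unit]:
        raise ValueError(
            f"values with time_unit '{values_unit}' may be truncated "
            f"when compared against '{column_unit}' data"
        )


# Column holds exactly one second, stored in milliseconds.
column_ms = [1_000]

# Lookup value is one second plus one nanosecond, stored in nanoseconds.
value_ns = 1_000_000_001

# Truncating to the column's unit discards the extra nanosecond,
# so two different instants compare as equal.
false_positive = truncate_to(value_ns, "ns", "ms") in column_ms

# The validation catches the risky direction and allows the safe one.
try:
    check_time_units(column_unit="ms", values_unit="ns")
    rejected = False
except ValueError:
    rejected = True
check_time_units(column_unit="ns", values_unit="ms")  # coarser values are fine

print(false_positive)  # True
print(rejected)        # True
```

Only the finer-to-coarser direction is rejected here, since converting coarser values to a finer unit is a lossless multiplication.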