-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REF: Fix maybe_promote #25425
REF: Fix maybe_promote #25425
Conversation
@h-vetinari pls make it bite-sized and piecemeal.These giant PR's very likely won't be merged as they take too much review time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a brief glance shows this as way too complicated. This must be split up function wise. So as I said on a prior PR this would need to be a separate module with supporting functions.
I split off the tests into #23982 already. This PR is just refactoring the method (~200LoC).
I'm not sure you were addressing me with that (or on which PR) - don't know which modularisation you mean...? In any case, I already made an attempt at modularising things, by splitting the scalar and array case into separate methods. The method itself is not very complicated, it just has lots of branches (in steps 2 & 4 below) to deal with all the possible inputs:
|
Codecov Report
@@ Coverage Diff @@
## master #25425 +/- ##
===========================================
- Coverage 91.73% 41.69% -50.05%
===========================================
Files 173 173
Lines 52856 52932 +76
===========================================
- Hits 48490 22072 -26418
- Misses 4366 30860 +26494
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #25425 +/- ##
==========================================
+ Coverage 91.85% 91.99% +0.13%
==========================================
Files 180 180
Lines 50765 50850 +85
==========================================
+ Hits 46631 46777 +146
+ Misses 4134 4073 -61
Continue to review full report at Codecov.
|
@TomAugspurger @jbrockmendel |
@h-vetinari I'll take a look at this. My schedule is unusually hectic between now and Tuesday, so it might take a few days. |
Thanks! |
pandas/core/dtypes/cast.py
Outdated
else: | ||
fill_type = type(fill_value) | ||
raise ValueError('fill_value must either be scalar, or a Series / ' | ||
'Index / np.ndarray; received {}'.format(fill_type)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are/should EAs be supported?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a design decision, but IMO yes. I've suggested #24246, and would then dispatch in maybe_promote_with_array
pandas/core/dtypes/cast.py
Outdated
# ndarray, but too high-dimensional | ||
fill_value = fill_value.ravel() | ||
elif not isinstance(fill_value, (ABCSeries, ABCIndexClass)): | ||
fill_type = type(fill_value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
usually we use type(foo).__name__
. Any particular reason to not use the .__name__
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, will adapt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel
Thanks for the review. A few small changes incoming.
pandas/core/dtypes/cast.py
Outdated
else: | ||
fill_type = type(fill_value) | ||
raise ValueError('fill_value must either be scalar, or a Series / ' | ||
'Index / np.ndarray; received {}'.format(fill_type)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a design decision, but IMO yes. I've suggested #24246, and would then dispatch in maybe_promote_with_array
pandas/core/dtypes/cast.py
Outdated
# ndarray, but too high-dimensional | ||
fill_value = fill_value.ravel() | ||
elif not isinstance(fill_value, (ABCSeries, ABCIndexClass)): | ||
fill_type = type(fill_value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, will adapt.
@h-vetinari it looks like maybe_promote is only used in 8-10 places outside of the tests. is the ndarray fill_value case even needed? I find that makes the function much harder to reason about. |
The low number of call-sites is what made me think that this can be fixed at all. I'm quite sure that the array-case was necessary, otherwise all those gymnastics could have been avoided. Should be simple enough to test - I'll just take out the array-branch from |
@jbrockmendel Taking a step back, in #23823 I said
The culprit at the time was the array-codepath in The downside to that is that it would be harder to keep the various places where promotion logic is defined in sync. Ultimately, I think there's an even larger clean-up necessary, involving Line 1952 in 74cba56
Don't know if or when I ever get around to that, my more immediate goals had been #23192 and #22812 (or rather, not so immediate anymore after over a year, haha ;-)). |
Thanks for tracking down the history behind the ndarray support; I'll read up on that. You've seen #28561 and #28564; let's try to find other parts of this that you can break off into similarly sized/scoped pieaces. e.g. carving out something like maybe_promote_scalar. Anything else come to mind? Based on a local branch, taking the ndarray cases out of the existing maybe_promote tests simplifies them a ton. Whether to re-implement the ndarray cases separately or just get rid of them depends on whether we can drop the ndarray case completely. Many of the failing cases can be fixed by changing
to
Same for the Timestamp case. That said, I think that instead of Any other ideas? I can keep pushing at this if you're not interested, but I'm hoping you'll keep going just in smaller bits. |
elif fill_value.ndim > 1: | ||
# ndarray, but too high-dimensional | ||
fill_value = fill_value.ravel() | ||
elif not isinstance(fill_value, (ABCSeries, ABCIndexClass)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need Series/Index? ATM we just have ndarray right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see no harm in permitting them (if arrays are permitted at all), as they they fit into the code without extra effort (and later uses of maybe_upcast_putmask
might well plop a Series in there).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Until we have a compelling use case, let's restrict the inputs to 1D, non-empty ndarray
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
Yeah, this is what I meant above. If we keep everything about
I tried to stay as close as possible to existing behaviour, but since we're (potentially) rewriting the method, it's completely fair to question the tests I've written. One choice I made is that I think strings should not be cast to TD/DT automatically. I've tried to comment such decisions in the implementation resp. the test module as well. iNaT is used quite ubiquitously, so not sure how easy it is to get rid of it, I just considered it in the same category as the other NA-values. |
This reverts commit d5aa77b.
@h-vetinari can you rebase? Any plans to do small-pieces PRs for this in the near future? If not, I'm going to keep trying to chip away at this. |
Sorry for the delay, will try to get to merging soon (should be latest on Sunday) |
@h-vetinari are you planning to run with this? Its fine if not, but I think you have a better idea of whats needed here than i do |
@jbrockmendel I overlooked your response there - sorry. I'm a bit swamped at the moment, but I'll try to carve out a PR from this one that adds
I'll have to have a look at how the current
Right now, the only case in the testing suite I'm aware of are the tests for #23823. Not sure if some other code is using the array-path... |
@h-vetinari can you rebase |
Merge remote-tracking branch 'upstream/master' into fix_maybe_promote
Added a separate function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments
pandas/core/dtypes/cast.py
Outdated
|
||
>>> maybe_promote_with_array(np.dtype('datetime64[ns]'), | ||
... fill_value=np.array([None])) | ||
(dtype('<M8[ns]'), -9223372036854775808) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: fix this
@@ -654,15 +591,16 @@ def test_maybe_promote_any_with_datetime64( | |||
) | |||
|
|||
|
|||
# override parametrization due to to many xfails; see GH 23982 / 25425 | |||
@pytest.mark.parametrize("box", [(True, object)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel
here we something fell through the cracks in #23982 - this tests never ran with box=False
.
@@ -682,8 +620,6 @@ def test_maybe_promote_datetimetz_with_any_numpy_dtype( | |||
) | |||
|
|||
|
|||
# override parametrization due to to many xfails; see GH 23982 / 25425 | |||
@pytest.mark.parametrize("box", [(True, None), (True, object)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel
here we something fell through the cracks in #23982 - this tests never ran with box=False
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it sounds like some of the changes in this test file are valid independent of the changes in the other file. is that correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, all removed xfails are due to diverting the array-path through maybe_promote_with_array
instead of maybe_promote
. For test_maybe_promote_datetimetz_with_any_numpy_dtype
and test_maybe_promote_datetimetz_with_datetimetz
however, I needed to add xfails for the box=False
case because those are not working within maybe_promote
yet.
@jbrockmendel The good part about having the unified testing module ( |
@@ -462,6 +501,281 @@ def maybe_promote(dtype, fill_value=np.nan): | |||
return dtype, fill_value | |||
|
|||
|
|||
def maybe_promote_with_array(dtype, fill_value=np.nan): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a huge additional to the technical debt. I am -1 on adding this at all. It is not at all clear whether this logic is correct and/or tested. more to the point, what is the purpose of all of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback you can pretty much ignore this PR; I'm asking @h-vetinari to keep it rebased for reference as we identify parts that are worthwhile to break off into bite-size pieces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more to the point, what is the purpose of all of this?
There are a handful of places where we call maybe_promote where we could have fill_value
that is an ndarray. Part of the plan for this is to identify in which of those cases we can rule out ndarray.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i that’s fine
happy to pick off good changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback: this is a huge additional to the technical debt. I am -1 on adding this at all. It is not at all clear whether this logic is correct and/or tested. more to the point, what is the purpose of all of this?
Please read this comment, not just skim over it.
The array-path in maybe_promote
is broken and both you and @jbrockmendel were excited to rip it out. At the same time, there's several potential or future use-cases for the array-case, and so I asked twice how this should be handled.
Having a separate method is IMO the least invasive change, and would eventually still allow to rip out the array-path from maybe_promote
. And more importantly, the logic is tested with the same promotion tests, which was the whole point of the tests/dtypes/cast/test_promote.py
-module. Lastly, since it's a private method, there's no technical debt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lastly, since it's a private method, there's no technical debt.
@h-vetinari i don’t even know what to say anymore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the misunderstanding, I meant as in API debt.
The technical debt is already there, in the array-path of maybe_promote
. I'm trying to fix it. Feel free to address any of the comments or questions I've raised about this. But if you come into an ancient PR and - without regard for any of the existing context - assert that it must be garbage ("It is not at all clear whether this logic is correct and/or tested [it is]. more to the point, what's the purpose of all of this?"), then I'm gonna respond in kind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@h-vetinari pls don't respond like this. it is not helpful to anyone.
I can and will come into every PR and make comments. My purpose is to avoid cluttering pandas with technical debt. This PR just adds to it.
|
||
# comparison mechanics are broken above _int64_max; | ||
# use greater equal instead of equal | ||
if fill_max >= _int64_max + 1 or fill_min <= _int64_min - 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use the can_cast machinery machinery currently in the scalar function? or even just dispatch to the scalar function in some cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dispatching to the scalar case is IMO out of the question for performance reasons until this whole code is cythonized (or the logic somehow unified with lib.maybe_convert_object
).
See Also | ||
-------- | ||
maybe_promote_with_array : underlying method for array case | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A PR with just (most of) this docstring would be a good start
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will try to do that.
@@ -462,6 +501,281 @@ def maybe_promote(dtype, fill_value=np.nan): | |||
return dtype, fill_value | |||
|
|||
|
|||
def maybe_promote_with_array(dtype, fill_value=np.nan): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback: this is a huge additional to the technical debt. I am -1 on adding this at all. It is not at all clear whether this logic is correct and/or tested. more to the point, what is the purpose of all of this?
Please read this comment, not just skim over it.
The array-path in maybe_promote
is broken and both you and @jbrockmendel were excited to rip it out. At the same time, there's several potential or future use-cases for the array-case, and so I asked twice how this should be handled.
Having a separate method is IMO the least invasive change, and would eventually still allow to rip out the array-path from maybe_promote
. And more importantly, the logic is tested with the same promotion tests, which was the whole point of the tests/dtypes/cast/test_promote.py
-module. Lastly, since it's a private method, there's no technical debt.
elif fill_value.ndim > 1: | ||
# ndarray, but too high-dimensional | ||
fill_value = fill_value.ravel() | ||
elif not isinstance(fill_value, (ABCSeries, ABCIndexClass)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
|
||
# comparison mechanics are broken above _int64_max; | ||
# use greater equal instead of equal | ||
if fill_max >= _int64_max + 1 or fill_min <= _int64_min - 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dispatching to the scalar case is IMO out of the question for performance reasons until this whole code is cythonized (or the logic somehow unified with lib.maybe_convert_object
).
@@ -134,7 +134,7 @@ def _check_promote( | |||
# box_dtype; the expected value returned from maybe_promote is the | |||
# missing value marker for the returned dtype. | |||
fill_array = np.array([fill_value], dtype=box_dtype) | |||
result_dtype, result_fill_value = maybe_promote(dtype, fill_array) | |||
result_dtype, result_fill_value = maybe_promote_with_array(dtype, fill_array) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback @jbrockmendel
This is the point which diverts the testing of all array-paths in this module to maybe_promote_with_array
.
@@ -682,8 +620,6 @@ def test_maybe_promote_datetimetz_with_any_numpy_dtype( | |||
) | |||
|
|||
|
|||
# override parametrization due to to many xfails; see GH 23982 / 25425 | |||
@pytest.mark.parametrize("box", [(True, None), (True, object)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, all removed xfails are due to diverting the array-path through maybe_promote_with_array
instead of maybe_promote
. For test_maybe_promote_datetimetz_with_any_numpy_dtype
and test_maybe_promote_datetimetz_with_datetimetz
however, I needed to add xfails for the box=False
case because those are not working within maybe_promote
yet.
See Also | ||
-------- | ||
maybe_promote_with_array : underlying method for array case | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will try to do that.
I am happy for @jbrockmendel to pick off parts of this. But this PR will not be merged in any way like this. closing. |
git diff upstream/master -u -- "*.py" | flake8 --diff
This PR is the culmination of ongoing work since the start of November, and is therefore a bit on the bigger side, with several notes to make.
Things started out with me wanting to unify
.update
for Series/DF (#22358), resp. aiming towards a beefed-upupdate
/combine_first
/coalesce
(#22812). While tackling the former (#23192), I encountered some problems withdf.update
upcasting stuff unnecessarily (#23606), and while trying to fix it, I ran into problems withmaybe_upcast_putmask
(#23823), which were directly caused by the utterly broken (and completely untested)maybe_promote
(#23833).I started with writing some tests (#23982), which turned out to be not so trivial, because there's a lot of complexity, and the correct behaviour wasn't alwasy immediate (also encountered some fun numpy bugs in the process: e.g. numpy/numpy#12525, numpy/numpy#12550)
I set out to write out a PR to fix those tests then, with the obvious goal of getting the test suite to pass - already that required a full rewrite of the method. I cracked my own tests after a while, but the test suite eluded me. As it turns out,
maybe_promote
mixes two very different behaviours - scalar values get cast to the new dtype, whereas arrays return their missing value marker. I tried kludging around this for a while, and decided it wasn't possible without creating a franken-solution.The next step was to separate these two different behaviours into different functions,
maybe_promote_with_scalar
andmaybe_promote_with_array
, wheremaybe_promote
is then just a thin wrapper that switches between the two. Actually alsomaybe_promote_with_scalar
is just a fairly thin wrapper aroundmaybe_promote_with_array
, so that the actual many-cased promotion logic does not have to be implemented twice.Often, the call-sites in the code just need the one or the other, and this could later be broken up correspondingly.
I updated the tests in #23982 (taking care to fully capture all the xfails there) and based this PR on that. This should give already an overview of what changed. In many cases, the current behaviour is broken, but I did make a few design decisions worth noting:
maybe_promote_with_array
consistently returns the missing value marker for the updated dtype. Since integer dtypes (plus bools and bytes) cannot holdnp.nan
, these cases now returnNone
.fill_value
is of type [x] as well, where x is one of (datetime, timedelta, bool, bytes). Datetimetz must additionally match the timezone.fill_values
now truly get cast to the updated dtype (before there were lots of ambiguities around int/float/complex/datetime/timedelta subtypes)iNaT
is considered a missing value from the POV ofmaybe_promote_with_array
in all situations. This takes one single integer out of the usableint64
-range, but I think this is much cleaner.There's still a few issues with
lib.infer_dtype
(e.g. #23554, of which I already fixed the complex case #25382), most notably that it cannot inferdatetime64tz
yet. Actually, through this PR, I'm learning how broken that method is as well, but fixing that will have to wait for some other time. Among other things, it currently faceplants forPeriodArray
/IntervalArray
(#23553). I haven't added tests for these types here, but ~9000 tests is already better than nothing, I hope. ;-)Another point that could/should be considered is how EAs should deal with this (#24246).