REF: Fix maybe_promote #25425

Closed
wants to merge 37 commits into from

h-vetinari
Contributor

@h-vetinari h-vetinari commented Feb 24, 2019

This PR is the culmination of ongoing work since the start of November, and is therefore a bit on the bigger side, with several notes to make.

Things started out with me wanting to unify .update for Series/DF (#22358), resp. aiming towards a beefed-up update/combine_first/coalesce (#22812). While tackling the former (#23192), I encountered some problems with df.update upcasting stuff unnecessarily (#23606), and while trying to fix it, I ran into problems with maybe_upcast_putmask (#23823), which were directly caused by the utterly broken (and completely untested) maybe_promote (#23833).

I started with writing some tests (#23982), which turned out to be not so trivial, because there's a lot of complexity, and the correct behaviour wasn't always immediately obvious (I also encountered some fun numpy bugs in the process, e.g. numpy/numpy#12525, numpy/numpy#12550).

I set out to write out a PR to fix those tests then, with the obvious goal of getting the test suite to pass - already that required a full rewrite of the method. I cracked my own tests after a while, but the test suite eluded me. As it turns out, maybe_promote mixes two very different behaviours - scalar values get cast to the new dtype, whereas arrays return their missing value marker. I tried kludging around this for a while, and decided it wasn't possible without creating a franken-solution.

The next step was to separate these two different behaviours into different functions, maybe_promote_with_scalar and maybe_promote_with_array, where maybe_promote is then just a thin wrapper that switches between the two. Actually also maybe_promote_with_scalar is just a fairly thin wrapper around maybe_promote_with_array, so that the actual many-cased promotion logic does not have to be implemented twice.

Often, the call-sites in the code just need the one or the other, and this could later be broken up correspondingly.
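
The layering described above can be sketched roughly as follows. This is only a minimal illustration and not the code from this PR: the `sketch_*` names are hypothetical, and `np.promote_types` (promotion by dtype, numeric inputs only) stands in for the real, many-cased promotion logic.

```python
import numpy as np


def sketch_promote_with_array(dtype, fill_array):
    """Array case: return the new dtype and its missing value marker."""
    # Placeholder logic (assumption): promote by dtype, numeric inputs only.
    new_dtype = np.promote_types(dtype, fill_array.dtype)
    # Floats and complex numbers can hold NaN; integers and bools cannot,
    # so this sketch returns None for them.
    na_marker = np.nan if new_dtype.kind in "fc" else None
    return new_dtype, na_marker


def sketch_promote_with_scalar(dtype, fill_value):
    """Scalar case: reuse the array logic, then cast the scalar itself."""
    new_dtype, _ = sketch_promote_with_array(dtype, np.array([fill_value]))
    return new_dtype, new_dtype.type(fill_value)


def sketch_promote(dtype, fill_value):
    """Thin wrapper that only switches between the two cases."""
    if isinstance(fill_value, np.ndarray):
        return sketch_promote_with_array(dtype, fill_value)
    return sketch_promote_with_scalar(dtype, fill_value)


# Same dtype and same fill data, but different second return values:
scalar_result = sketch_promote(np.dtype("int64"), 1.5)
array_result = sketch_promote(np.dtype("int64"), np.array([1.5]))
print(scalar_result)  # new dtype plus the scalar cast to that dtype
print(array_result)   # new dtype plus the missing value marker
```

The point of the sketch is only the shape of the call graph: the promotion rules live in one place, and the scalar wrapper differs solely in what it returns as the second element.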

I updated the tests in #23982 (taking care to fully capture all the xfails there) and based this PR on that. This should already give an overview of what changed. In many cases, the current behaviour is broken, but I did make a few design decisions worth noting:

  • maybe_promote_with_array consistently returns the missing value marker for the updated dtype. Since integer dtypes (plus bools and bytes) cannot hold np.nan, these cases now return None.
  • all promotion logic is as conservative as possible, also within subtypes. For arrays, promotion always goes by value, and never by dtype. That means that, for example:
    >>> maybe_promote(np.dtype('uint8'), fill_value=np.iinfo('uint8').max + 1)
    (dtype('uint16'), 256)
    >>> maybe_promote(np.dtype('uint8'), fill_value=np.array([-1], dtype='int64'))
    (dtype('int16'), None)
  • all promotion logic is as type-safe as possible, which means that [x] only stays [x] if the fill_value is of type [x] as well, where x is one of (datetime, timedelta, bool, bytes). Datetimetz must additionally match the timezone.
  • all scalar fill_values now truly get cast to the updated dtype (before there were lots of ambiguities around int/float/complex/datetime/timedelta subtypes)
  • I have changed the behavior that strings get interpreted for datetimes/timedeltas. Since this is an untested private method, and the test suite still passes just fine, I think this is actually a good thing, because it's too much in one method. String to datetime/timedelta should need an explicit cast, IMO.
    >>> # master
    >>> maybe_promote(np.dtype('datetime64[ns]'), '2018-01-01')
    (dtype('<M8[ns]'), 1514764800000000000)
    >>> # PR
    >>> maybe_promote(np.dtype('datetime64[ns]'), '2018-01-01')
    (dtype('O'), '2018-01-01')
    >>> # master
    >>> maybe_promote(np.dtype('timedelta64[ns]'), '1 day')
    (dtype('<m8[ns]'), 86400000000000)
    >>> # PR
    >>> maybe_promote(np.dtype('timedelta64[ns]'), '1 day')
    (dtype('O'), '1 day')
  • iNaT is considered a missing value from the POV of maybe_promote_with_array in all situations. This takes one single integer out of the usable int64-range, but I think this is much cleaner.
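
To make the "by value, never by dtype" rule for integers more concrete, here is a rough sketch of how the smallest sufficient integer dtype could be picked. It is an assumption-laden illustration, not the implementation in this PR: the helper names are hypothetical, and only plain integer inputs are handled (no missing values, floats, or other types).

```python
import numpy as np

# Candidates from narrowest to widest; unsigned comes first at each width so
# that non-negative data keeps an unsigned dtype where possible.
_CANDIDATES = [
    np.dtype(name)
    for name in ("uint8", "int8", "uint16", "int16",
                 "uint32", "int32", "uint64", "int64")
]


def smallest_int_dtype(dtype, fill_min, fill_max):
    """Smallest integer dtype holding both dtype's range and the fill range."""
    info = np.iinfo(dtype)
    lo = min(int(info.min), int(fill_min))
    hi = max(int(info.max), int(fill_max))
    for candidate in _CANDIDATES:
        cand = np.iinfo(candidate)
        if cand.min <= lo and hi <= cand.max:
            return candidate
    # Nothing fits (e.g. negative values combined with uint64): use object.
    return np.dtype(object)


def promote_int_scalar(dtype, fill_value):
    """Scalar case: the fill value is cast to the promoted dtype."""
    new_dtype = smallest_int_dtype(dtype, fill_value, fill_value)
    if new_dtype.kind == "O":
        return new_dtype, fill_value
    return new_dtype, new_dtype.type(fill_value)


def promote_int_array(dtype, fill_array):
    """Array case: only min/max of the values matter, not the array's dtype."""
    new_dtype = smallest_int_dtype(dtype, fill_array.min(), fill_array.max())
    # Integer dtypes cannot hold np.nan, so the marker is None.
    return new_dtype, None


scalar_dtype, scalar_fill = promote_int_scalar(np.dtype("uint8"), 256)
array_dtype, array_fill = promote_int_array(
    np.dtype("uint8"), np.array([-1], dtype="int64")
)
print(scalar_dtype, int(scalar_fill))  # uint16 256
print(array_dtype, array_fill)         # int16 None
```

Note that the int64 dtype of the array in the second call plays no role; only the value -1 forces the switch from unsigned to signed.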

There's still a few issues with lib.infer_dtype (e.g. #23554, of which I already fixed the complex case #25382), most notably that it cannot infer datetime64tz yet. Actually, through this PR, I'm learning how broken that method is as well, but fixing that will have to wait for some other time. Among other things, it currently faceplants for PeriodArray / IntervalArray (#23553). I haven't added tests for these types here, but ~9000 tests is already better than nothing, I hope. ;-)

Another point that could/should be considered is how EAs should deal with this (#24246).

@h-vetinari h-vetinari changed the title Fix maybe promote REF: Fix maybe_promote Feb 24, 2019
@jreback
Contributor

jreback commented Feb 24, 2019

@h-vetinari pls make it bite-sized and piecemeal. These giant PRs very likely won't be merged as they take too much review time.

Contributor

@jreback jreback left a comment

just a brief glance shows this as way too complicated. This must be split up function wise. So as I said on a prior PR this would need to be a separate module with supporting functions.

@h-vetinari
Contributor Author

@jreback: @h-vetinari pls make it bite-sized and piecemeal. These giant PRs very likely won't be merged as they take too much review time.

I split off the tests into #23982 already. This PR is just refactoring the method (~200LoC).

just a brief glance shows this as way too complicated. This must be split up function wise. So as I said on a prior PR this would need to be a separate module with supporting functions.

I'm not sure you were addressing me with that (or on which PR) - don't know which modularisation you mean...? In any case, I already made an attempt at modularising things, by splitting the scalar and array case into separate methods.

The method itself is not very complicated, it just has lots of branches (in steps 2 & 4 below) to deal with all the possible inputs:

  1. determine scalar/array case
  2. check if array is empty or all-na
  3. infer dtype
  4. handle promotion logic
  5. (scalar case only) handle casting of fill_value
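
As an illustration of step 2, a check for "empty or all-na" could look roughly like the sketch below. The function name is hypothetical and the code uses only numpy; inside pandas one would presumably rely on `isna` instead, so treat this as a simplified stand-in rather than the code in this PR.

```python
import numpy as np


def is_empty_or_all_na(fill_array):
    """Return True if the array has no elements or only missing values."""
    if fill_array.size == 0:
        return True
    kind = fill_array.dtype.kind
    if kind in "fc":
        # float / complex: NaN is the missing value marker
        return bool(np.isnan(fill_array).all())
    if kind in "mM":
        # timedelta64 / datetime64: NaT is the missing value marker
        return bool(np.isnat(fill_array).all())
    if kind == "O":
        # None, or a float NaN (NaN is the only float unequal to itself)
        return all(
            x is None or (isinstance(x, float) and x != x)
            for x in fill_array.ravel()
        )
    # integers, bools, strings, bytes: no native missing value marker
    return False


print(is_empty_or_all_na(np.array([np.nan, np.nan])))            # True
print(is_empty_or_all_na(np.array([None, 1], dtype=object)))     # False
```

Each dtype kind needs its own branch here, which is a small-scale version of why steps 2 and 4 end up with so many cases.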

@codecov

codecov bot commented Feb 24, 2019

Codecov Report

Merging #25425 into master will decrease coverage by 50.04%.
The diff coverage is 47.65%.

@@             Coverage Diff             @@
##           master   #25425       +/-   ##
===========================================
- Coverage   91.73%   41.69%   -50.05%     
===========================================
  Files         173      173               
  Lines       52856    52932       +76     
===========================================
- Hits        48490    22072    -26418     
- Misses       4366    30860    +26494
Flag Coverage Δ
#multiple ?
#single 41.69% <47.65%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/dtypes/cast.py 48.22% <47.65%> (-39.95%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/core/categorical.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.35%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-95.46%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.17%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.15%) ⬇️
... and 131 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3855a27...6792a54. Read the comment docs.

@codecov

codecov bot commented Feb 24, 2019

Codecov Report

Merging #25425 into master will increase coverage by 0.13%.
The diff coverage is 93.7%.

@@            Coverage Diff             @@
##           master   #25425      +/-   ##
==========================================
+ Coverage   91.85%   91.99%   +0.13%     
==========================================
  Files         180      180              
  Lines       50765    50850      +85     
==========================================
+ Hits        46631    46777     +146     
+ Misses       4134     4073      -61
Flag Coverage Δ
#multiple 90.63% <93.7%> (+0.14%) ⬆️
#single 41.83% <47.24%> (-0.12%) ⬇️
Impacted Files Coverage Δ
pandas/core/dtypes/cast.py 90.88% <93.7%> (-0.18%) ⬇️
pandas/io/gbq.py 88.88% <0%> (-11.12%) ⬇️
pandas/core/arrays/integer.py 96.3% <0%> (-1.32%) ⬇️
pandas/core/internals/construction.py 95.95% <0%> (-0.8%) ⬇️
pandas/core/internals/blocks.py 94.38% <0%> (-0.72%) ⬇️
pandas/core/dtypes/concat.py 96.58% <0%> (-0.46%) ⬇️
pandas/core/internals/concat.py 96.48% <0%> (-0.37%) ⬇️
pandas/core/arrays/sparse.py 94.19% <0%> (-0.31%) ⬇️
pandas/core/internals/managers.py 96% <0%> (-0.22%) ⬇️
... and 30 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e9f9ca1...321f08d. Read the comment docs.

@gfyoung gfyoung added Bug Dtype Conversions Unexpected or buggy dtype conversions Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Feb 25, 2019
@h-vetinari
Contributor Author

@TomAugspurger @jbrockmendel
Care to take a look here or in #23982 please?

@jbrockmendel
Member

@h-vetinari I'll take a look at this. My schedule is unusually hectic between now and Tuesday, so it might take a few days.

@h-vetinari
Contributor Author

@jbrockmendel: @h-vetinari I'll take a look at this.

Thanks!

else:
    fill_type = type(fill_value)
    raise ValueError('fill_value must either be scalar, or a Series / '
                     'Index / np.ndarray; received {}'.format(fill_type))
Member

are/should EAs be supported?

Contributor Author

That's a design decision, but IMO yes. I've suggested #24246, and would then dispatch in maybe_promote_with_array

    # ndarray, but too high-dimensional
    fill_value = fill_value.ravel()
elif not isinstance(fill_value, (ABCSeries, ABCIndexClass)):
    fill_type = type(fill_value)
Member

usually we use type(foo).__name__. Any particular reason to not use the .__name__ here?

Contributor Author

No, will adapt.
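
For reference, the difference the reviewer is pointing at is only in how the type shows up in the message. A minimal sketch (the helper name is hypothetical; the message text is taken from the snippet under review):

```python
def describe_fill_type(fill_value):
    """Build the error message using the bare class name."""
    return (
        "fill_value must either be scalar, or a Series / "
        "Index / np.ndarray; received {}".format(type(fill_value).__name__)
    )


# Formatting the type object itself includes the "<class ...>" wrapper:
print("{}".format(type({})))      # <class 'dict'>
# Using __name__ gives the shorter, more readable form:
print(describe_fill_type({}))     # ...; received dict
```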

Contributor Author

@h-vetinari h-vetinari left a comment

@jbrockmendel
Thanks for the review. A few small changes incoming.

@jbrockmendel
Member

@h-vetinari it looks like maybe_promote is only used in 8-10 places outside of the tests. is the ndarray fill_value case even needed? I find that makes the function much harder to reason about.

@h-vetinari
Contributor Author

@h-vetinari it looks like maybe_promote is only used in 8-10 places outside of the tests. is the ndarray fill_value case even needed? I find that makes the function much harder to reason about.

The low number of call-sites is what made me think that this can be fixed at all. I'm quite sure that the array-case was necessary, otherwise all those gymnastics could have been avoided. Should be simple enough to test - I'll just take out the array-branch from maybe_promote and see if the CI passes.

@h-vetinari
Contributor Author

@jbrockmendel
So, at first glance a failing CI would show the necessity of the array-case, but on second glance, all the failures come from #25431 resp. #23823.

Taking a step back, in #23823 I said

In the context of #23192 (and #23604 / #23606), I want to use pandas.core.dtypes.cast.maybe_upcast_putmask, because it solves exactly the problem I need it to solve.
Unfortunately, it does not work as advertised (and I already found the culprit).
The docstring says:

def maybe_upcast_putmask(result, mask, other):
    """
    A safe version of putmask that potentially upcasts the result
    [...]

The culprit at the time was the array-codepath in maybe_promote, but with the last commit above, it seems that the code-base does not rely anymore on the fact that maybe_promote can consume arrays. As such, it would be possible to change the implementation of maybe_upcast_putmask to use something else (e.g. the array-code I have already, or something else entirely) or just temporarily skip/xfail the tests from #25431, and then have a much simpler maybe_promote replacement that only needs to handle the scalar case.

The downside to that is that it would be harder to keep the various places where promotion logic is defined in sync. Ultimately, I think there's an even larger clean-up necessary, involving maybe_promote, maybe_upcast_putmask, lib.infer_dtype, maybe_convert_objects, etc., which - IMO/AFAICT - should all share similar promotion logic, based for example on what already exists with the Seen-objects used e.g. in

def maybe_convert_objects(ndarray[object] objects, bint try_float=0,

Don't know if or when I ever get around to that, my more immediate goals had been #23192 and #22812 (or rather, not so immediate anymore after over a year, haha ;-)).

@jbrockmendel
Member

Thanks for tracking down the history behind the ndarray support; I'll read up on that.

You've seen #28561 and #28564; let's try to find other parts of this that you can break off into similarly sized/scoped pieces, e.g. carving out something like maybe_promote_scalar. Anything else come to mind?

Based on a local branch, taking the ndarray cases out of the existing maybe_promote tests simplifies them a ton. Whether to re-implement the ndarray cases separately or just get rid of them depends on whether we can drop the ndarray case completely.

Many of the failing cases can be fixed by changing

fill_value = Timedelta(fill_value).value

to

try:
    fv = Timedelta(fill_value)
except (TypeError, ValueError):
    dtype = np.object_
else:
    fill_value = fv.value

Same for the Timestamp case.

That said, I think that instead of .value these should be to_timedelta64() and to_datetime64(), respectively (except for the NaT case, which is another hassle). Try to get away from using iNaT, which is ambiguous. This change should be relatively late in the process.

Any other ideas? I can keep pushing at this if you're not interested, but I'm hoping you'll keep going just in smaller bits.

elif fill_value.ndim > 1:
    # ndarray, but too high-dimensional
    fill_value = fill_value.ravel()
elif not isinstance(fill_value, (ABCSeries, ABCIndexClass)):
Member

do we need Series/Index? ATM we just have ndarray right?

Contributor Author

I see no harm in permitting them (if arrays are permitted at all), as they fit into the code without extra effort (and later uses of maybe_upcast_putmask might well plop a Series in there).

Member

Until we have a compelling use case, let's restrict the inputs to 1D, non-empty ndarray

Contributor Author

OK

@h-vetinari
Contributor Author

Based on a local branch, taking the ndarray cases out of the existing maybe_promote tests simplifies them a ton. Whether to re-implement the ndarray cases separately or just get rid of them depends on whether we can drop the ndarray case completely.

Yeah, this is what I meant above. If we keep everything about maybe_promote in the scalar case, then both the code and the tests could be substantially reduced, but would need refactoring. I'm thinking this should be an entirely separate PR, that might serve as a different goal to aim for (and to cut individual chunks off).

That said, I think that instead of .value these should be to_timedelta64() and to_datetime64(), respectively (except for the NaT case, which is another hassle). Try to get away from using iNaT, which is ambiguous. This change should be relatively late in the process.

I tried to stay as close as possible to existing behaviour, but since we're (potentially) rewriting the method, it's completely fair to question the tests I've written. One choice I made is that I think strings should not be cast to TD/DT automatically. I've tried to comment such decisions in the implementation resp. the test module as well.

iNaT is used quite ubiquitously, so not sure how easy it is to get rid of it, I just considered it in the same category as the other NA-values.

@jbrockmendel
Member

@h-vetinari can you rebase? Any plans to do small-pieces PRs for this in the near future? If not, I'm going to keep trying to chip away at this.

@h-vetinari
Contributor Author

Sorry for the delay, will try to get to merging soon (should be latest on Sunday)

@jbrockmendel
Member

@h-vetinari are you planning to run with this? It's fine if not, but I think you have a better idea of what's needed here than I do.

@h-vetinari
Contributor Author

@jbrockmendel
The main question I had (see this comment) was whether it's desired to support the array-case in maybe_promote at all (since that seemed to be up in the air).

I overlooked your response there - sorry. I'm a bit swamped at the moment, but I'll try to carve out a PR from this one that adds maybe_promote_ndarray. Hopefully should have time on the weekend.

@jbrockmendel: Do you have fixes in mind for the remaining xfailed cases? bite-sized PRs for those would be welcome.

I'll have to have a look at how the current maybe_promote works (I haven't kept up with all your PRs). The array case will probably need to be handled separately anyway. One thing I wanted to avoid initially is to duplicate the promotion logic in two places (since looping over the scalar maybe_promote for the array-case is not an option performance-wise). But I think I'll have to start like that rather than replacing the whole method in one go.

@jbrockmendel: AFAICT the ndarray use cases of maybe_promote are a) pathological object-dtype cases where an ndarray is being treated like a scalar, and b) future use cases you describe. Am I missing any ways a user could get there at the moment?

Right now, the only case in the testing suite I'm aware of are the tests for #23823. Not sure if some other code is using the array-path...

@jbrockmendel
Member

@h-vetinari can you rebase

Merge remote-tracking branch 'upstream/master' into fix_maybe_promote
@h-vetinari
Contributor Author

@jbrockmendel: @h-vetinari can you rebase

Added a separate function maybe_promote_with_array that takes care of the array-path (and as such, the corresponding testing directly).

Contributor Author

@h-vetinari h-vetinari left a comment

Some comments


>>> maybe_promote_with_array(np.dtype('datetime64[ns]'),
...                          fill_value=np.array([None]))
(dtype('<M8[ns]'), -9223372036854775808)
Contributor Author

TODO: fix this
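
For readers unfamiliar with the number in that output: it appears to be iNaT, i.e. the int64 bit pattern that numpy uses for NaT, which coincides with the smallest int64 value. A short numpy-only check (variable names are just for illustration):

```python
import numpy as np

# View the raw 64-bit storage of NaT as an integer.
nat_as_int = int(np.array(["NaT"], dtype="datetime64[ns]").view("int64")[0])
int64_min = int(np.iinfo(np.int64).min)

print(nat_as_int)               # -9223372036854775808
print(nat_as_int == int64_min)  # True

# Reading the integer back as datetime64[ns] yields NaT again.
round_trip = np.array([int64_min], dtype="int64").view("datetime64[ns]")
print(bool(np.isnat(round_trip)[0]))  # True
```

This is also why treating iNaT as missing removes exactly one value from the usable int64 range.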

@@ -654,15 +591,16 @@ def test_maybe_promote_any_with_datetime64(
)


# override parametrization due to to many xfails; see GH 23982 / 25425
@pytest.mark.parametrize("box", [(True, object)])
Contributor Author

@jbrockmendel
here something fell through the cracks in #23982 - these tests never ran with box=False.

@@ -682,8 +620,6 @@ def test_maybe_promote_datetimetz_with_any_numpy_dtype(
)


# override parametrization due to to many xfails; see GH 23982 / 25425
@pytest.mark.parametrize("box", [(True, None), (True, object)])
Contributor Author

@jbrockmendel
here something fell through the cracks in #23982 - these tests never ran with box=False.

Member

it sounds like some of the changes in this test file are valid independent of the changes in the other file. is that correct?

Contributor Author

Nope, all removed xfails are due to diverting the array-path through maybe_promote_with_array instead of maybe_promote. For test_maybe_promote_datetimetz_with_any_numpy_dtype and test_maybe_promote_datetimetz_with_datetimetz however, I needed to add xfails for the box=False case because those are not working within maybe_promote yet.

@h-vetinari
Contributor Author

@jbrockmendel
This is updated and green. It externalises essentially all array-paths of maybe_promote to a new function (which would allow to rip it out of maybe_promote and replace it with the array-version where necessary).

The good part about having the unified testing module (box fixture and all) is that the code can still check uniformity of the results, even though the methods have completely separate implementations.

@@ -462,6 +501,281 @@ def maybe_promote(dtype, fill_value=np.nan):
return dtype, fill_value


def maybe_promote_with_array(dtype, fill_value=np.nan):
Contributor

this is a huge addition to the technical debt. I am -1 on adding this at all. It is not at all clear whether this logic is correct and/or tested. more to the point, what is the purpose of all of this?

Member

@jreback you can pretty much ignore this PR; I'm asking @h-vetinari to keep it rebased for reference as we identify parts that are worthwhile to break off into bite-size pieces.

Member

more to the point, what is the purpose of all of this?

There are a handful of places where we call maybe_promote where we could have fill_value that is an ndarray. Part of the plan for this is to identify in which of those cases we can rule out ndarray.

Contributor

that’s fine

happy to pick off good changes

Contributor Author

@jreback: this is a huge addition to the technical debt. I am -1 on adding this at all. It is not at all clear whether this logic is correct and/or tested. more to the point, what is the purpose of all of this?

Please read this comment, not just skim over it.

The array-path in maybe_promote is broken and both you and @jbrockmendel were excited to rip it out. At the same time, there's several potential or future use-cases for the array-case, and so I asked twice how this should be handled.

Having a separate method is IMO the least invasive change, and would eventually still allow to rip out the array-path from maybe_promote. And more importantly, the logic is tested with the same promotion tests, which was the whole point of the tests/dtypes/cast/test_promote.py-module. Lastly, since it's a private method, there's no technical debt.

Contributor

Lastly, since it's a private method, there's no technical debt.

@h-vetinari i don’t even know what to say anymore

Contributor Author

Sorry for the misunderstanding, I meant as in API debt.

The technical debt is already there, in the array-path of maybe_promote. I'm trying to fix it. Feel free to address any of the comments or questions I've raised about this. But if you come into an ancient PR and - without regard for any of the existing context - assert that it must be garbage ("It is not at all clear whether this logic is correct and/or tested [it is]. more to the point, what's the purpose of all of this?"), then I'm gonna respond in kind.

Contributor

@h-vetinari pls don't respond like this. it is not helpful to anyone.

I can and will come into every PR and make comments. My purpose is to avoid cluttering pandas with technical debt. This PR just adds to it.


# comparison mechanics are broken above _int64_max;
# use greater equal instead of equal
if fill_max >= _int64_max + 1 or fill_min <= _int64_min - 1:
Member

can you use the can_cast machinery currently in the scalar function? or even just dispatch to the scalar function in some cases?

Contributor Author

@h-vetinari h-vetinari Oct 30, 2019

Dispatching to the scalar case is IMO out of the question for performance reasons until this whole code is cythonized (or the logic somehow unified with lib.maybe_convert_objects).

See Also
--------
maybe_promote_with_array : underlying method for array case
"""
Member

A PR with just (most of) this docstring would be a good start

Contributor Author

Will try to do that.

@@ -134,7 +134,7 @@ def _check_promote(
         # box_dtype; the expected value returned from maybe_promote is the
         # missing value marker for the returned dtype.
         fill_array = np.array([fill_value], dtype=box_dtype)
-        result_dtype, result_fill_value = maybe_promote(dtype, fill_array)
+        result_dtype, result_fill_value = maybe_promote_with_array(dtype, fill_array)
Contributor Author

@jreback @jbrockmendel
This is the point which diverts the testing of all array-paths in this module to maybe_promote_with_array.

@jreback
Contributor

jreback commented Oct 30, 2019

I am happy for @jbrockmendel to pick off parts of this. But this PR will not be merged in any way like this. closing.

@jreback jreback closed this Oct 30, 2019
@pandas-dev pandas-dev locked and limited conversation to collaborators Oct 30, 2019
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Dtype Conversions Unexpected or buggy dtype conversions
Development

Successfully merging this pull request may close these issues.

BUG/Internals: maybe_promote
5 participants