-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REF: make _aggregate_series_pure_python extraction behave like the cython version #29641
Conversation
Hello @jbrockmendel! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-11-17 16:02:37 UTC |
Thanks for the explanation! Can you add the examples I gave in the other PR as test cases? |
Good idea, will do.
No idea. |
pandas/core/groupby/ops.py
Outdated
if len(res) == 1: | ||
# e.g. test_agg_lambda_with_timezone lambda e: e.head(1) | ||
# FIXME: are we potentially losing import res.index info? | ||
res = getattr(res, "_values", res) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you just do
res = np.array(res)[0]
? or
res = res.item()
(though I think we have deprecated .item()
, but are bringing it back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np.array
won't work because itll lose the timezone. im thinking next(iter(res))
with a comment saying its often equivalent to res[0]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think .item()
would be good here as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.item()
will be good if/when we un-deprecate it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the current next(iter(res))
acceptable for the time being? extract_array(res)[0]
would also work.
I'm eager to see this go in because I think we can get rid of a bunch more _try_cast calls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep just mark it with the issue (or fix me) so it’s clear
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
commented
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
thanks. are these incorrect on 0.25.3? if so, need a release note, pls add in a followup. |
will do |
|
||
# TODO: use `.item()` if/when we un-deprecate it. | ||
# For non-Series we could just do `res[0]` | ||
res = next(iter(res)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this actually needed? For any later iteration, the len-1 res
is just assigned below with result[label] = res
, so why does it need to be unpacked for the first group?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might be able to get rid of this, but at this stage the goal is just to make the behavior match libreduction._extract_result
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But in the cython version that calls _extract_result
, this is done for each group, not just the first (so in that sense it still doesn't match that)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But in the cython version that calls _extract_result, this is done for each group, not just the first (so in that sense it still doesn't match that)
This is correct. The more closely matching behavior is that only the first group is checked for array-like (there's also a discrepancy in what types of arraylikes are checked) (there's also^2 a discrepancy in that Reducer.get_result does a res = res.values
check that is similar to _extract_result but not quite the same)
Could you then check it? |
After some investigation it turns out that the DTA casting in
_try_cast
is made necessary because we incorrectly pass datetime64tz to_aggregate_series_fast
, which calls the cythonlibreduction.SeriesGrouper
, which casts the input tondarray
, losing the timezone. By not going through the cython path for datetime64tz, we avoid the need to re-cast.That in turn surfaces a new problem, which is that _aggregate_series_pure_python checks the first group's result slightly differently than the cython version does (see libreduction._extract_result). This changes the way the pure-python version does it to more closely match the cython version. I intend to make these match more precisely in an upcoming pass.
If merged, makes #29589 unnecessary.
cc @jreback @jorisvandenbossche @WillAyd