Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF: make _aggregate_series_pure_python extraction behave like the cython version #29641

Merged
merged 9 commits into from
Nov 18, 2019

Conversation

jbrockmendel
Copy link
Member

After some investigation it turns out that the DTA casting in _try_cast is made necessary because we incorrectly pass datetime64tz to _aggregate_series_fast, which calls the cython libreduction.SeriesGrouper, which casts the input to ndarray, losing the timezone. By not going through the cython path for datetime64tz, we avoid the need to re-cast.

That in turn surfaces a new problem, which is that _aggregate_series_pure_python checks the first group's result slightly differently than the cython version does (see libreduction._extract_result). This changes the way the pure-python version does it to more closely match the cython version. I intend to make these match more precisely in an upcoming pass.

If merged, makes #29589 unnecessary.

cc @jreback @jorisvandenbossche @WillAyd

@pep8speaks
Copy link

pep8speaks commented Nov 15, 2019

Hello @jbrockmendel! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-11-17 16:02:37 UTC

@jorisvandenbossche
Copy link
Member

Thanks for the explanation!
What is the impact on performance of not using the cython path for datetimetz ?

Can you add the examples I gave in the other PR as test cases?

@jbrockmendel
Copy link
Member Author

Can you add the examples I gave in the other PR as test cases?

Good idea, will do.

What is the impact on performance of not using the cython path for datetimetz ?

No idea.

if len(res) == 1:
# e.g. test_agg_lambda_with_timezone lambda e: e.head(1)
# FIXME: are we potentially losing import res.index info?
res = getattr(res, "_values", res)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you just do
res = np.array(res)[0]? or
res = res.item() (though I think we have deprecated .item(), but are bringing it back.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np.array won't work because itll lose the timezone. im thinking next(iter(res)) with a comment saying its often equivalent to res[0]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think .item() would be good here as well

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.item() will be good if/when we un-deprecate it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the current next(iter(res)) acceptable for the time being? extract_array(res)[0] would also work.

I'm eager to see this go in because I think we can get rid of a bunch more _try_cast calls

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep just mark it with the issue (or fix me) so it’s clear

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commented

@jreback jreback added Groupby Refactor Internal refactoring of code labels Nov 16, 2019
@jreback jreback added this to the 1.0 milestone Nov 16, 2019
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@jreback jreback merged commit e1cadfa into pandas-dev:master Nov 18, 2019
@jreback
Copy link
Contributor

jreback commented Nov 18, 2019

thanks.

are these incorrect on 0.25.3? if so, need a release note, pls add in a followup.

@jbrockmendel
Copy link
Member Author

are these incorrect on 0.25.3? if so, need a release note, pls add in a followup

will do

@jbrockmendel jbrockmendel deleted the faster-gb7 branch November 18, 2019 01:35

# TODO: use `.item()` if/when we un-deprecate it.
# For non-Series we could just do `res[0]`
res = next(iter(res))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this actually needed? For any later iteration, the len-1 res is just assigned below with result[label] = res, so why does it need to be unpacked for the first group?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might be able to get rid of this, but at this stage the goal is just to make the behavior match libreduction._extract_result

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But in the cython version that calls _extract_result, this is done for each group, not just the first (so in that sense it still doesn't match that)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But in the cython version that calls _extract_result, this is done for each group, not just the first (so in that sense it still doesn't match that)

This is correct. The more closely matching behavior is that only the first group is checked for array-like (there's also a discrepancy in what types of arraylikes are checked) (there's also^2 a discrepancy in that Reducer.get_result does a res = res.values check that is similar to _extract_result but not quite the same)

Reksbril pushed a commit to Reksbril/pandas that referenced this pull request Nov 18, 2019
@jorisvandenbossche
Copy link
Member

What is the impact on performance of not using the cython path for datetimetz ?

No idea.

Could you then check it?

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
tonywu1999 pushed a commit to tonywu1999/pandas that referenced this pull request Jan 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Refactor Internal refactoring of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants