BUG:Time Grouper bug fix when applied for list groupers #17587
Conversation
Hello @ruiann! Thanks for updating the PR. Cheers! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on October 01, 2017 at 15:43 Hours UTC
The reported PEP8 errors are not caused by my code, and they are hard to change.
pandas/tests/test_resample.py
Outdated
@@ -3274,3 +3274,17 @@ def test_aggregate_with_nat(self):

         # if NaT is included, 'var', 'std', 'mean', 'first','last'
         # and 'nth' doesn't work yet
+
+    def test_scalar_call_versus_list_call(self):
+        data_frame = pd.DataFrame({
this will not lint properly
add a comment with the issue number.
this goes in the groupby tests
actually ok with the file location of the tests, but there is a section in test_resample.py that does a groupby & a resample, pls put after that
Which test? I think putting it in TestTimeGrouper is OK?
@@ -1736,13 +1730,14 @@ class BaseGrouper(object):
     """

     def __init__(self, axis, groupings, sort=True, group_keys=True,
-                 mutated=False):
+                 mutated=False, indexer=None):
         self._filter_empty_groups = self.compressed = len(groupings) != 1
can you add a doc-string explaining params (I know you just added 1 but good time)
I'm sorry, I don't really understand all of the parameters; I've added the ones I know 😂
pandas/core/groupby.py
Outdated
@@ -2288,11 +2283,13 @@ def generate_bins_generic(values, binner, closed):

 class BinGrouper(BaseGrouper):

-    def __init__(self, bins, binlabels, filter_empty=False, mutated=False):
+    def __init__(self, bins, binlabels, filter_empty=False, mutated=False,
same
pandas/core/groupby.py
Outdated
@@ -2536,8 +2532,11 @@ def ngroups(self):

     @cache_readonly
     def indices(self):
-        values = _ensure_categorical(self.grouper)
-        return values._reverse_indexer()
+        if isinstance(self.grouper, BaseGrouper):
when is this hit?
@ruiann I saw some comments that you had trouble installing the env.
Codecov Report
@@ Coverage Diff @@
## master #17587 +/- ##
==========================================
- Coverage 91.22% 91.19% -0.03%
==========================================
Files 163 163
Lines 49625 49620 -5
==========================================
- Hits 45270 45252 -18
- Misses 4355 4368 +13
Codecov Report
@@ Coverage Diff @@
## master #17587 +/- ##
==========================================
- Coverage 91.25% 91.23% -0.03%
==========================================
Files 163 163
Lines 49779 49777 -2
==========================================
- Hits 45428 45413 -15
- Misses 4351 4364 +13
pandas/core/groupby.py
Outdated
        if isinstance(self.grouper, BaseGrouper):
            labels, _, _ = self.grouper.group_info
            uniques = self.grouper.result_index
            if self.grouper.indexer is not None:
refactor to make a method on the BaseGrouper itself and override in BinGrouper; same for indices
I'm afraid that approach won't work here. Grouping uses group_info for the unsorted axis, while some other code paths call group_info to group a sorted axis. I think it's better to keep group_info sorted and reorder it to get the label sequence for the unsorted axis.
that's fine, but not my point. I don't want these if/else in the properties, rather they should simply be overridden methods on the grouper type
OK I get your point.
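The sorted-versus-unsorted point above can be illustrated with a minimal pure-Python sketch. This is not the pandas implementation: it assumes `indexer[i]` holds the original position of the i-th row after sorting, and the helper name `reorder_labels` is hypothetical.

```python
def reorder_labels(sorted_labels, indexer):
    """Map labels computed on a sorted axis back to the original row order.

    Assumption: indexer[i] is the original position of the i-th sorted row.
    """
    if indexer is None:
        # The axis was never reordered, so the labels are already aligned.
        return list(sorted_labels)
    original = [None] * len(sorted_labels)
    for sorted_pos, original_pos in enumerate(indexer):
        original[original_pos] = sorted_labels[sorted_pos]
    return original


# Rows arrive out of time order: day 2, day 0, day 1, day 0.
days = [2, 0, 1, 0]
# Stable argsort of the axis, standing in for the sort a time-based grouper does.
indexer = sorted(range(len(days)), key=days.__getitem__)
# Labels assigned on the sorted axis (one bin per day in this toy case).
sorted_labels = [days[i] for i in indexer]

# Labels aligned with the original, unsorted rows.
unsorted_labels = reorder_labels(sorted_labels, indexer)
```

Keeping the sorted labels as the single source of truth and deriving the unsorted view on demand means callers that work on the sorted axis are unaffected.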
pandas/core/groupby.py
Outdated
some grouper (TimeGrouper eg) will sort its axis and the
group_info of BinGrouper is also sorted
can use the indexer to reorder as the unsorted axis
add other params (ok to not document but list them)
pandas/core/groupby.py
Outdated
binlabels : the label list
indexer: the indexer created by Grouper
some grouper (TimeGrouper eg) will sort its axis and the
group_info of BinGrouper is also sorted
too much info just the first line is good
index is an intp array
@jreback
pandas/core/groupby.py
Outdated

Parameters
----------
axis : the axis to group
on doc-strings, list the parameter type, then on the next line what it does, e.g.
axis : int
    the axis to group
...
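As a sketch of the docstring layout requested above, the following uses a hypothetical class (`ExampleGrouper` is not part of pandas, and the `sort` description is only illustrative). Each parameter gets a `name : type` line, followed by an indented description on the next line.

```python
class ExampleGrouper(object):
    """
    Hypothetical class used only to show the docstring layout.

    Parameters
    ----------
    axis : int
        the axis to group
    sort : bool, default True
        whether the result should be sorted
    """

    def __init__(self, axis, sort=True):
        self.axis = axis
        self.sort = sort
```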
pandas/core/groupby.py
Outdated
@@ -1888,6 +1897,15 @@ def group_info(self):
         comp_ids = _ensure_int64(comp_ids)
         return comp_ids, obs_group_ids, ngroups

+    # 17530
put a comment / doc-string inside the function on what this is doing (don't need to add the issue number)
pandas/core/groupby.py
Outdated

then the group_info is
(array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4]), array([0, 1, 2, 3, 4]), 5)
can you add a tiny bit more expl, pretend you are a new reader and have no idea what this does
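For a new reader, a small pure-Python sketch may help show where a tuple like the one quoted above could come from. It is not the pandas code: it assumes `bins` holds the exclusive end position of each group on an already sorted axis, and the function name `group_info_from_bins` is hypothetical.

```python
def group_info_from_bins(bins):
    """Return (comp_ids, obs_group_ids, ngroups) for a list of bin edges.

    Assumption: bins[k] is the exclusive end position of group k on the
    sorted axis, so group k covers positions bins[k - 1] .. bins[k] - 1.
    """
    ngroups = len(bins)
    comp_ids = []  # group number of every row on the sorted axis
    start = 0
    for group_id, stop in enumerate(bins):
        comp_ids.extend([group_id] * (stop - start))
        start = stop
    obs_group_ids = list(range(ngroups))  # position of each label in the label list
    return comp_ids, obs_group_ids, ngroups


# Ten rows split into five bins of two rows each.
comp_ids, obs_group_ids, ngroups = group_info_from_bins([2, 4, 6, 8, 10])
```

Under these assumptions the three values read as: the group each row belongs to, the index of each group label, and the number of groups.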
pandas/core/groupby.py
Outdated
@@ -2536,8 +2582,12 @@ def ngroups(self):

     @cache_readonly
     def indices(self):
-        values = _ensure_categorical(self.grouper)
-        return values._reverse_indexer()
+        # for the situation of groupby list of groupers
need to move this to the base class and inherit on the subclass
I'm sorry, but the Grouping class inherits from object. I don't really understand which method should be moved to the base class.
you need to call a (private) method on the grouper here, which will then dispatch if its a BinGrouper or the BaseGrouper as appropriate. same as below.
I'm not sure whether I have misunderstood you. I guess your suggestion is something like this:
    def indices(self):
        # for the situation of groupby list of groupers
        if self.grouper.type:
            return self.grouper.indices
        else:
            values = _ensure_categorical(self.grouper)
            return values._reverse_indexer()
and then BaseGrouper and BinGrouper provide a property like:
    @property
    def type(self):
        return 'BaseGrouper_or_BinGrouper'
But what should happen if the grouper is neither a BaseGrouper nor a BinGrouper instance? Such groupers have no defined class that could provide this property or method. I don't think it's necessary to provide this method for every possible grouper; using isinstance for BaseGrouper is good enough. In fact, isinstance is used many times in groupby.py...
yes you are misunderstanding
    def indices(self):
        return self.grouper.get_indexer(self.grouper)  # maybe better name

    class BaseGrouper:
        def get_indexer(self, grouper):
            values = _ensure_categorical(grouper)
            return values._reverse_indexer()

    class BinGrouper:
        def get_indexer(self, grouper):
            # grouper is self here
            labels = self.label_info
            uniques = self.result_index
            ...
this is just multiple dispatch
But this raises another problem: grouper is not guaranteed to be a BaseGrouper or BinGrouper instance. It may be just a label ndarray, to which I think it is better not to add such a method.
ok, fine for now. I think we should fix this though. We are a bit inconsistent where a Grouping object reassigns .grouper to a Base/Bin Grouper if needed. So maybe we should have a class which we can wrap up and hide details like these.
I think the better way is to convert every groupby call to the form [pd.Grouper, pd.TimeGrouper], so that the .grouper of a Grouping instance is guaranteed to be a Base/Bin Grouper instance and is easy to handle. But currently there are a variety of ways to call groupby, so I think this refactor may take some time.
sure, you can create an issue about this, or send a PR when convenient (after this one!)
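The trade-off discussed in this thread can be sketched in plain Python. Every class and function name below is a hypothetical stand-in, not pandas API: grouper objects override an `indices` property, while a wrapper still needs a fallback for the case where `.grouper` is just a sequence of labels.

```python
from collections import defaultdict


def reverse_indexer(labels):
    """Map each distinct label to the list of positions where it occurs."""
    positions = defaultdict(list)
    for pos, label in enumerate(labels):
        positions[label].append(pos)
    return dict(positions)


class SketchBaseGrouper(object):
    """Hypothetical stand-in for a grouper that already knows its labels."""

    def __init__(self, labels):
        self._labels = list(labels)

    @property
    def indices(self):
        return reverse_indexer(self._labels)


class SketchBinGrouper(SketchBaseGrouper):
    """Hypothetical subclass that overrides `indices` for binned data."""

    def __init__(self, bins, binlabels):
        self.bins = list(bins)
        self.binlabels = list(binlabels)

    @property
    def indices(self):
        result, start = {}, 0
        for label, stop in zip(self.binlabels, self.bins):
            if start < stop:  # skip empty bins
                result[label] = list(range(start, stop))
            start = stop
        return result


class SketchGrouping(object):
    """Hypothetical wrapper whose `.grouper` may be an object or raw labels."""

    def __init__(self, grouper):
        self.grouper = grouper

    @property
    def indices(self):
        # Grouper objects dispatch through their own (overridden) property.
        if isinstance(self.grouper, SketchBaseGrouper):
            return self.grouper.indices
        # A plain label sequence has no such method, hence the fallback.
        return reverse_indexer(self.grouper)
```

The single `isinstance` check in the wrapper is the part the follow-up refactor would aim to remove, for example by always wrapping raw labels in a grouper object.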
pandas/core/groupby.py
Outdated
self.grouper, sort=self.sort)
uniques = Index(uniques, name=self.name)
# for the situation of groupby list of groupers
if isinstance(self.grouper, BaseGrouper):
same
pandas/tests/test_resample.py
Outdated
@@ -3300,3 +3300,20 @@ def test_aggregate_with_nat(self):

         # if NaT is included, 'var', 'std', 'mean', 'first','last'
         # and 'nth' doesn't work yet
+
+    # Issue: 17530
move the comment inside the function, move to pandas/tests/groupby/test_timegrouper.py
couple comments, ping on green.
can you add a whatsnew note, otherwise lgtm.
whatsnew note
@ruiann just waiting for a whatsnew note
@jreback
need a 1-line entry in
thanks @ruiann ! as discussed above, happy to have a PR to try to clean up some of the internal logic ! |
git diff upstream/master -u -- "*.py" | flake8 --diff