Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG:Time Grouper bug fix when applied for list groupers #17587

Merged
merged 10 commits into from
Oct 1, 2017

Conversation

ruiann
Copy link
Contributor

@ruiann ruiann commented Sep 19, 2017

  • save the axis indexer for Base/BinGrouper. As BinGrouper has get the sorted labels, need to use indexer to reorder them into right place for unsorted axis values
  • add label_info property to get unsorted labels
  • enable Grouping to use Base/BinGrouper instance generated by _get_grouper. Get label_info and group_info to get right labels and label indexes from Base/BinGrouper instance
  • add test for BUG: time grouper differs when passes as a list and as a scalar #17530
  • add comment

@pep8speaks
Copy link

pep8speaks commented Sep 19, 2017

Hello @ruiann! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on October 01, 2017 at 15:43 Hours UTC

@ruiann
Copy link
Contributor Author

ruiann commented Sep 19, 2017

The PEP8 errors reported are not caused by my code, and also hard to change.

@@ -3274,3 +3274,17 @@ def test_aggregate_with_nat(self):

# if NaT is included, 'var', 'std', 'mean', 'first','last'
# and 'nth' doesn't work yet

def test_scalar_call_versus_list_call(self):
data_frame = pd.DataFrame({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will not lint properly

add a comment with the issue number.

this goes in the groupby tests

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually ok with the file location of the tests, but there is a section in test_resample.py that does a groupby & a resample, pls put after that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which test? I think put it in TestTimeGrouper is ok?

@@ -1736,13 +1730,14 @@ class BaseGrouper(object):
"""

def __init__(self, axis, groupings, sort=True, group_keys=True,
mutated=False):
mutated=False, indexer=None):
self._filter_empty_groups = self.compressed = len(groupings) != 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a doc-string explaining params (I know you just added 1 but good time)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry I don't really understand all parameters, I've added those I know😂

@@ -2288,11 +2283,13 @@ def generate_bins_generic(values, binner, closed):

class BinGrouper(BaseGrouper):

def __init__(self, bins, binlabels, filter_empty=False, mutated=False):
def __init__(self, bins, binlabels, filter_empty=False, mutated=False,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

@@ -2536,8 +2532,11 @@ def ngroups(self):

@cache_readonly
def indices(self):
values = _ensure_categorical(self.grouper)
return values._reverse_indexer()
if isinstance(self.grouper, BaseGrouper):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when is this hit?

@jreback
Copy link
Contributor

jreback commented Sep 19, 2017

@ruiann I saw some comments that you had trouble installing the env.

  • there is a bug in the docs w.r.t the recent addition of moto; simply remove this from the requirements.txt when conda install -f requirements.txt -c conda-forge, and then pip install moto; there is an issue to fix the docs
  • upgraded pytest, we require >= 3.1 (this is in the docs / requirements)
  • the flake errors are in your code

@codecov
Copy link

codecov bot commented Sep 20, 2017

Codecov Report

Merging #17587 into master will decrease coverage by 0.02%.
The diff coverage is 95.65%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17587      +/-   ##
==========================================
- Coverage   91.22%   91.19%   -0.03%     
==========================================
  Files         163      163              
  Lines       49625    49620       -5     
==========================================
- Hits        45270    45252      -18     
- Misses       4355     4368      +13
Flag Coverage Δ
#multiple 88.98% <95.65%> (-0.01%) ⬇️
#single 40.19% <8.69%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/core/resample.py 96.26% <100%> (+0.09%) ⬆️
pandas/core/groupby.py 92% <95.23%> (-0.23%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.77% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0e85ca7...026223a. Read the comment docs.

@codecov
Copy link

codecov bot commented Sep 20, 2017

Codecov Report

Merging #17587 into master will decrease coverage by 0.02%.
The diff coverage is 96.15%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17587      +/-   ##
==========================================
- Coverage   91.25%   91.23%   -0.03%     
==========================================
  Files         163      163              
  Lines       49779    49777       -2     
==========================================
- Hits        45428    45413      -15     
- Misses       4351     4364      +13
Flag Coverage Δ
#multiple 89.02% <96.15%> (-0.01%) ⬇️
#single 40.32% <15.38%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/core/resample.py 96.14% <100%> (+0.08%) ⬆️
pandas/core/groupby.py 92.04% <95.83%> (-0.22%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.73% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7d4a260...950de20. Read the comment docs.

if isinstance(self.grouper, BaseGrouper):
labels, _, _ = self.grouper.group_info
uniques = self.grouper.result_index
if self.grouper.indexer is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactor to make a method on the BaseGrouper itself and override in BinGrouper; same for indices

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid cannot use this way. Grouping use the group_info for unsorted axis, while some other scenarios call the group_info to group sorted axis, I think it's better to keep the group_info as sorted, and reorder to get the unsorted label sequence.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's fine, but not my point. I don't want these if/else in the properties, rather they should simply be overriden methods on the grouper type

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I get your point.

some grouper (TimeGrouper eg) will sort its axis and the
group_info of BinGrouper is also sorted
can use the indexer to reorder as the unsorted axis

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add other params (ok to not document but list them)

binlabels : the label list
indexer: the indexer created by Grouper
some grouper (TimeGrouper eg) will sort its axis and the
group_info of BinGrouper is also sorted
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

too much info just the first line is good
index is a intp array

@ruiann
Copy link
Contributor Author

ruiann commented Sep 25, 2017

@jreback
There is a test failure in plotting/test_misc.py, I've tried this test but success. So what to do now?


Parameters
----------
axis : the axis to group
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on doc-strings, list the parameter type, then on the next line what it does, e.g

axis : int
    the axis to group
...

@@ -1888,6 +1897,15 @@ def group_info(self):
comp_ids = _ensure_int64(comp_ids)
return comp_ids, obs_group_ids, ngroups

# 17530
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put a comment / doc-string inside the function on what this is doing(don't need to add the issue number)


then the group_info is
(array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4]), array([0, 1, 2, 3, 4]), 5)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a tiny bit more expl, pretend you are a new reader and have no idea what this does

@@ -2536,8 +2582,12 @@ def ngroups(self):

@cache_readonly
def indices(self):
values = _ensure_categorical(self.grouper)
return values._reverse_indexer()
# for the situation of groupby list of groupers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to move this to the base class and inherit on the subclass

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry but Grouping class inherits from object class. I don't really understand which method should be moved to base class

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to call a (private) method on the grouper here, which will then dispatch if its a BinGrouper or the BaseGrouper as appropriate. same as below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if I were misunderstanding your meaning, I guess your suggestion is like below

    def indices(self):
        # for the situation of groupby list of groupers
        if self.grouper.type:
            return self.grouper.indices
        else:
            values = _ensure_categorical(self.grouper)
            return values._reverse_indexer()

and then in BaseGrouper and BinGrouper gives a property like:

    @property
    def type(self):
        return 'BaseGrouper_or_BinGrouper'

But what to do if the grouper is neither BaseGrouper instance nor BinGrouper instance? They don't have defined class to provide this property or method. I don't think it's necessary to provide this method for all possible grouper, use isinstance for BaseGrouper is good enough, in fact in groupby.py isinstance method occurs many times...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes you are misuderstanding

def indices(self):
     return self.grouper.get_indexer(self.grouper) # maybe better name
class BaserGrouper:
      def get_indexer(self, grouper):
             values = _ensure_categorical(self, grouper)
              return values._reverse_indexer()


class BinGrouper:
       def get_indexer(self, grouper):
           # grouper is self here
           labels = self.label_info
           uniques = self.result_index
           ...

this is just multiple dispatch

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But here comes another problem, grouper is not guaranteed to be a BaseGrouper or BinGrouper instance, it may be just a label ndarray, which I think is better not to add such a method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, fine for now. I think we should fix this though. We are a bit inconsistent where a Grouping object reassigns .grouper to a Base/Bin Grouper if needed. So Maybe should have a class which we can wrap up and hide details like these.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the better way is to convert all groupby call to the format of [pd.Grouper, pd.TimeGrouper] so that the .grouper of Grouping instance is guaranteed to be Base/Bin Grouper instance, and easy to handle. But currently there are variety of ways to do the groupby scenario, I think it may cost time to do this refactor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, you can create an issue about this, or send a PR when convenient (after this one!)

self.grouper, sort=self.sort)
uniques = Index(uniques, name=self.name)
# for the situation of groupby list of groupers
if isinstance(self.grouper, BaseGrouper):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

@@ -2536,8 +2582,12 @@ def ngroups(self):

@cache_readonly
def indices(self):
values = _ensure_categorical(self.grouper)
return values._reverse_indexer()
# for the situation of groupby list of groupers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to call a (private) method on the grouper here, which will then dispatch if its a BinGrouper or the BaseGrouper as appropriate. same as below.

@@ -3300,3 +3300,20 @@ def test_aggregate_with_nat(self):

# if NaT is included, 'var', 'std', 'mean', 'first','last'
# and 'nth' doesn't work yet

# Issue: 17530
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move the comment inside the function, move to pandas/tests/groupy/test_timegrouper.py

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple comments, ping on green.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a whatsnew new, otherwise lgtm.

@jreback
Copy link
Contributor

jreback commented Sep 28, 2017

whatsnew note

@ruiann ruiann changed the title Time Grouper bug fix when applied for list groupers BUG:Time Grouper bug fix when applied for list groupers Sep 29, 2017
@jreback
Copy link
Contributor

jreback commented Sep 29, 2017

@ruiann just waiting for a whatsnew note

@ruiann
Copy link
Contributor Author

ruiann commented Sep 29, 2017

@jreback
I'm sorry but I don't know how to write a good one.😂 I've list all my changes in whatsnew entry.

@jreback
Copy link
Contributor

jreback commented Sep 29, 2017

need a 1-line entry in pandas/source/doc/whatsnew/v0.21.0.txt its a Bug Fix. in the grouping sub-section.

@jreback jreback added the Bug label Sep 29, 2017
@jreback jreback added this to the 0.21.0 milestone Sep 29, 2017
@jreback jreback added the Resample resample method label Sep 29, 2017
@jreback jreback merged commit cdbbf80 into pandas-dev:master Oct 1, 2017
@jreback
Copy link
Contributor

jreback commented Oct 1, 2017

thanks @ruiann !

as discussed above, happy to have a PR to try to clean up some of the internal logic !

ghost pushed a commit to reef-technologies/pandas that referenced this pull request Oct 2, 2017
* 'master' of github.com:pandas-dev/pandas: (188 commits)
  Separate out _convert_datetime_to_tsobject (pandas-dev#17715)
  DOC: remove whatsnew note for xref pandas-dev#17131
  BUG: Regression in .loc accepting a boolean Index as an indexer (pandas-dev#17738)
  DEPR: Deprecate cdate_range and merge into bdate_range (pandas-dev#17691)
  CLN: replace %s syntax with .format in pandas.core: categorical, common, config, config_init (pandas-dev#17735)
  Fixed the memory usage explanation of categorical in gotchas from O(nm) to O(n+m) (pandas-dev#17736)
  TST: add backward compat for offset testing for pickles (pandas-dev#17733)
  remove unused time conversion funcs (pandas-dev#17711)
  DEPR: Deprecate convert parameter in take (pandas-dev#17352)
  BUG:Time Grouper bug fix when applied for list groupers (pandas-dev#17587)
  BUG: Fix some PeriodIndex resampling issues (pandas-dev#16153)
  BUG: Fix unexpected sort in groupby (pandas-dev#17621)
  DOC: Fixed typo in documentation for 'pandas.DataFrame.replace' (pandas-dev#17731)
  BUG: Fix series rename called with str altering name rather index (GH17407) (pandas-dev#17654)
  DOC: Add examples for MultiIndex.get_locs + cleanups (pandas-dev#17675)
  Doc improvements for IntervalIndex and Interval (pandas-dev#17714)
  BUG: DataFrame sort_values and multiple "by" columns fails to order NaT correctly
  Last of the timezones funcs (pandas-dev#17669)
  Add missing file to _pyxfiles, delete commented-out (pandas-dev#17712)
  update imports of DateParseError, remove unused imports from tslib (pandas-dev#17713)
  ...
@ruiann ruiann deleted the time_grouper branch October 6, 2017 06:36
alanbato pushed a commit to alanbato/pandas that referenced this pull request Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: time grouper differs when passes as a list and as a scalar
3 participants