BUG: time grouper differs when passes as a list and as a scalar #17530

ruiann · 2017-09-14T09:36:37Z

Code Sample, a copy-pastable example if possible

data_frame.groupby(pd.TimeGrouper('D', convention='start', key=time_column)).count()
data_frame.groupby([pd.TimeGrouper('D', convention='start', key=time_column)]).count()

give below two different outputs

            location  price  profit  type
time                                     
2017-07-23         1      1       1     1
2017-07-24         0      0       0     0
2017-07-25         1      1       1     1
2017-07-26         1      1       1     1
2017-07-27         3      3       3     3
2017-07-28         6      6       6     6

            location  price  profit  type
time                                     
2017-07-23         1      1       1     1
2017-07-25         1      1       1     1
2017-07-26         1      1       1     1
2017-07-27         3      3       3     3
2017-07-28         6      6       6     6

Problem description

The origin data frame doesn't contain a record in 2017-07-24, you can see if I use time grouper with convention='start' lonely, the 2017-07-24 item has been filled, but if by group list, that item hasn't been filled

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.2.0
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

jreback · 2017-09-14T16:30:59Z

need a copy-pastable example

ruiann · 2017-09-15T02:52:02Z

df = pd.DataFrame(np.random.randn(3), index=pd.Series(['2017-08-09 13:32:23', '2017-08-11 23:23:15', '2017-08-11 22:23:15'], dtype='datetime64[ns]'))

grouper = pd.TimeGrouper('D')
grouped = df.groupby(grouper)
data = grouped.count()
grouped = df.groupby([grouper])
data = grouped.count()

you can see the result is different

Sorry that convention parameter doesn't influence this scenario, but the result is what you can see, there is no way to resample and group by another categorical columns in the same time I think?

ruiann · 2017-09-15T04:35:33Z

    def _get_binner_for_grouping(self, obj):
        # return an ordering of the transformed group labels,
        # suitable for multi-grouping, e.g the labels for
        # the resampled intervals
        binner, grouper, obj = self._get_grouper(obj)

        l = []
        for key, group in grouper.get_iterator(self.ax):
            l.extend([key] * len(group))
        pdb.set_trace()
        if isinstance(self.ax, PeriodIndex):
            grouper = binner.__class__(l, freq=binner.freq, name=binner.name)
        else:
            # resampling causes duplicated values, specifying freq is invalid
            grouper = binner.__class__(l, name=binner.name)

        # since we may have had to sort
        # may need to reorder groups here
        if self.indexer is not None:
            indexer = self.indexer.argsort(kind='quicksort')
            grouper = grouper.take(indexer)
        return grouper

the second grouper generated by this method, while the first way generate grouper by

binner, grouper, obj = self._get_grouper(obj)

it seems that the second way get the same grouper of TimeGrouper as the first way and then sample it once more, which only uses the existed time slices in group axis by

l.extend([key] * len(group))

and then do some other operations

ruiann · 2017-09-15T08:00:48Z

So what is the expected behavior? If follow the SQL rule, the missing time slices shouldn't be filled for TimeGrouper, only the resample method can get a complete time axis index. As the resample method call the TimeGrouper to implement the time axis index, may a new parameter needed to control TimeGrouper's behavior?

My target is to group by a categorical column and then resample to get a complete statistic in time dimension, I cannot do this in a groupby operation.

jreback · 2017-09-15T10:09:35Z

ok, this is a duplicate of #16746, but I like your example better, slightly formatted:

should all result the same [6] and [8] are correct, [7] is the bug.

In [5]: df = pd.DataFrame(np.random.randn(3), index=pd.Series(['2017-08-09 13:32:23', '2017-08-11 23:23:15', '2017-08-11 22:23:15'], dtype='datetime64[ns]'))
   ...: 
   ...: grouper = pd.TimeGrouper('D')
   ...: grouped1 = df.groupby(grouper)
   ...: data1 = grouped1.count()
   ...: grouped2 = df.groupby([grouper])
   ...: data2 = grouped2.count()
   ...: 

In [6]: data1
Out[6]: 
            0
2017-08-09  1
2017-08-10  0
2017-08-11  2

In [7]: data2
Out[7]: 
            0
2017-08-09  1
2017-08-11  2

In [8]: df.resample('D').count()
Out[8]: 
            0
2017-08-09  1
2017-08-10  0
2017-08-11  2

Should be straightforward to fix if you'd like to dig in.

ruiann · 2017-09-15T10:22:04Z

@jreback I think 7 is also correct if follow the SQL rule, missed data can not be queried, maybe a new parameter needed? After hours debug, I think this situation is very complex and hard to fix, when by scalar TimeGrouper returns a BinGrouper, while by list returns a BaseGrouper that may have other grouping. I'll take some effort in this bug but can't promise to fix it out.

jreback · 2017-09-15T10:36:11Z

you are diving too deep. look more around here

[7] is not a resample, rather a straight groupby by date

In [42]: df.groupby(df.index.normalize()).count()
Out[42]: 
            0
2017-08-09  1
2017-08-11  2

you could add this example to the groupby docs, but another parameter is not needed.

ruiann · 2017-09-16T03:11:38Z

I think the situation is complex than your thought.
For list groupers, groupings are built for each grouper to make a BaseGrouper, and Grouping class only get binner by _get_binner_for_grouping of each grouper, the group_info of BaseGrouper property generated by the labels and group_index of its groupings, and each grouping get them by the binner, which is the label list of each row.

# the labels and group_index of Grouping class, with them the BaseGrouper generate group_info
    @property
    def labels(self):
        if self._labels is None:
            self._make_labels()
        return self._labels

    @property
    def group_index(self):
        if self._group_index is None:
            self._make_labels()
        return self._group_index

    def _make_labels(self):
        if self._labels is None or self._group_index is None:
            pdb.set_trace()
            # self.grouper is the label of each row
            labels, uniques = algorithms.factorize(
                self.grouper, sort=self.sort)
            uniques = Index(uniques, name=self.name)
            self._labels = labels
            self._group_index = uniques

But for BinGrouper, the group_info property generated in different way, which directly map the labels and group_index.

    def group_info(self):
        pdb.set_trace()
        ngroups = self.ngroups
        obs_group_ids = np.arange(ngroups)
        rep = np.diff(np.r_[0, self.bins])

        rep = _ensure_platform_int(rep)
        if ngroups == len(self.bins):
            comp_ids = np.repeat(np.arange(ngroups), rep)
        else:
            comp_ids = np.repeat(np.r_[-1, np.arange(ngroups)], rep)

        return comp_ids.astype('int64', copy=False), \
            obs_group_ids.astype('int64', copy=False), ngroups

So the bug occurs, the binner can only provide the labels of each row in selected axis, and then unique them to get unique labels and group_index, so the missed date time cannot be generated.

I think the way to fix this bug is to provide group_info in _get_binner_for_grouping of grouper to get more info? I think this logic change may be bigger? I'll take further look.

jreback · 2017-09-17T14:07:33Z

@ruiann again, you can simply have a len-1 list-like grouper case turn into the same case as the scalar passed case. its just going down the wrong path.

ruiann · 2017-09-17T14:20:12Z

Then how about the result of ['categorical_axis', TimeGrouper('D', key='time_axis')]? It can not get all time slices for each category item.
In fact I think the

    def _get_binner_for_grouping(self, obj):
        # return an ordering of the transformed group labels,
        # suitable for multi-grouping, e.g the labels for
        # the resampled intervals
        binner, grouper, obj = self._get_grouper(obj)

        l = []
        for key, group in grouper.get_iterator(self.ax):
            l.extend([key] * len(group))
        pdb.set_trace()
        if isinstance(self.ax, PeriodIndex):
            grouper = binner.__class__(l, freq=binner.freq, name=binner.name)
        else:
            # resampling causes duplicated values, specifying freq is invalid
            grouper = binner.__class__(l, name=binner.name)

        # since we may have had to sort
        # may need to reorder groups here
        if self.indexer is not None:
            indexer = self.indexer.argsort(kind='quicksort')
            grouper = grouper.take(indexer)
        return grouper

method of BaseGrouper (and BinGrouper) class works wrong, the group_info can be get from the BaseGrouper/BinGrouper instance generated by _get_grouper method, but not to read and sort them again item by item.

I'm trying to return the BaseGrouper/BinGrouper instance generated by _get_grouper method if the Grouping instance is given by Grouper instance, and get labels, group_index from the group_info, but there are still some bugs reported by test case, I'm now trying to fix them.

At last, I think groupby related logics need some refactor (I mean the name, alias, call stack sometimes confuse me, also some duplicated calls happened, maybe convert them all to list groupers could be better, with which only one implement need to deal).

jreback · 2017-09-17T15:00:01Z

At last, I think groupby related logics need some refactor (I mean the name, alias, call stack sometimes confuse me, also some duplicated calls happened, maybe convert them all to list groupers could be better, with which only one implement need to deal).

If you want to attempt that, sure. Pls do so w/o adding any features / fixing any bugs, (though you can add xfailing tests). Try to make things pass everything that currently works, then in different commits / PR add things.

ruiann · 2017-09-19T03:31:28Z

@jreback
I've fixed this bug in this PR

The fix is to allow the Grouping to use grouper generated by _get_grouper of passed TimeGrouper\OtherGrouper item of the list directly, and provide indexer for BinGrouper to reorder them correctly for the values of the grouped axis

There are some PEP8 errors, which not caused by my change, and hard to change 😢

Also the Contributing guide doesn't work for me
When I try to use conda to build the development env

conda create -n pandas_dev --file ci/requirements_dev.txt

shell said

Fetching package metadata .........

PackageNotFoundError: Package missing in current osx-64 channels: 
  - moto

Close matches found; did you mean one of these?

    moto: boto

When I build env by setup.py and run all tests

pytest pandas

same bug occurs in some test cases

pandas/tests/io/test_excel.py:2403: in <module>
    pytest.param('xlwt',
E   AttributeError: 'module' object has no attribute 'param'

closes #17530

…7587) closes pandas-dev#17530

jreback added the Can't Repro label Sep 14, 2017

jreback added Bug Groupby Resample resample method Difficulty Intermediate and removed Can't Repro labels Sep 15, 2017

jreback added this to the Next Major Release milestone Sep 15, 2017

jreback changed the title ~~time grouper with convention doesn't work for groupby list situtation~~ BUG: time grouper differs when passes as a list and as a scalar Sep 15, 2017

jreback mentioned this issue Sep 15, 2017

Grouper using frequency has inconsistent behavior inside and outside a list #16746

Closed

ruiann mentioned this issue Sep 19, 2017

BUG:Time Grouper bug fix when applied for list groupers #17587

Merged

4 tasks

jreback modified the milestones: Next Major Release, 0.21.0 Sep 28, 2017

jreback closed this as completed in #17587 Oct 1, 2017

jreback pushed a commit that referenced this issue Oct 1, 2017

BUG:Time Grouper bug fix when applied for list groupers (#17587)

cdbbf80

closes #17530

jreback mentioned this issue Oct 8, 2017

BUG: Timegrouping not mirroring resample when groupby is a list indexer #10084

Closed

alanbato pushed a commit to alanbato/pandas that referenced this issue Nov 10, 2017

BUG:Time Grouper bug fix when applied for list groupers (pandas-dev#1…

99a46c2

…7587) closes pandas-dev#17530

No-Stream pushed a commit to No-Stream/pandas that referenced this issue Nov 28, 2017

BUG:Time Grouper bug fix when applied for list groupers (pandas-dev#1…

01d1820

…7587) closes pandas-dev#17530

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: time grouper differs when passes as a list and as a scalar #17530

BUG: time grouper differs when passes as a list and as a scalar #17530

ruiann commented Sep 14, 2017

INSTALLED VERSIONS

jreback commented Sep 14, 2017

ruiann commented Sep 15, 2017

ruiann commented Sep 15, 2017

ruiann commented Sep 15, 2017

jreback commented Sep 15, 2017

ruiann commented Sep 15, 2017

jreback commented Sep 15, 2017

ruiann commented Sep 16, 2017

jreback commented Sep 17, 2017

ruiann commented Sep 17, 2017 •

edited

Loading

jreback commented Sep 17, 2017

ruiann commented Sep 19, 2017

BUG: time grouper differs when passes as a list and as a scalar #17530

BUG: time grouper differs when passes as a list and as a scalar #17530

Comments

ruiann commented Sep 14, 2017

Code Sample, a copy-pastable example if possible

Problem description

Output of pd.show_versions()

INSTALLED VERSIONS

jreback commented Sep 14, 2017

ruiann commented Sep 15, 2017

ruiann commented Sep 15, 2017

ruiann commented Sep 15, 2017

jreback commented Sep 15, 2017

ruiann commented Sep 15, 2017

jreback commented Sep 15, 2017

ruiann commented Sep 16, 2017

jreback commented Sep 17, 2017

ruiann commented Sep 17, 2017 • edited Loading

jreback commented Sep 17, 2017

ruiann commented Sep 19, 2017

Output of `pd.show_versions()`

ruiann commented Sep 17, 2017 •

edited

Loading