-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: time grouper differs when passes as a list and as a scalar #17530
Comments
need a copy-pastable example |
df = pd.DataFrame(np.random.randn(3), index=pd.Series(['2017-08-09 13:32:23', '2017-08-11 23:23:15', '2017-08-11 22:23:15'], dtype='datetime64[ns]'))
grouper = pd.TimeGrouper('D')
grouped = df.groupby(grouper)
data = grouped.count()
grouped = df.groupby([grouper])
data = grouped.count() you can see the result is different Sorry that convention parameter doesn't influence this scenario, but the result is what you can see, there is no way to resample and group by another categorical columns in the same time I think? |
def _get_binner_for_grouping(self, obj):
# return an ordering of the transformed group labels,
# suitable for multi-grouping, e.g the labels for
# the resampled intervals
binner, grouper, obj = self._get_grouper(obj)
l = []
for key, group in grouper.get_iterator(self.ax):
l.extend([key] * len(group))
pdb.set_trace()
if isinstance(self.ax, PeriodIndex):
grouper = binner.__class__(l, freq=binner.freq, name=binner.name)
else:
# resampling causes duplicated values, specifying freq is invalid
grouper = binner.__class__(l, name=binner.name)
# since we may have had to sort
# may need to reorder groups here
if self.indexer is not None:
indexer = self.indexer.argsort(kind='quicksort')
grouper = grouper.take(indexer)
return grouper the second grouper generated by this method, while the first way generate grouper by binner, grouper, obj = self._get_grouper(obj) it seems that the second way get the same grouper of TimeGrouper as the first way and then sample it once more, which only uses the existed time slices in group axis by l.extend([key] * len(group)) and then do some other operations |
So what is the expected behavior? If follow the SQL rule, the missing time slices shouldn't be filled for TimeGrouper, only the resample method can get a complete time axis index. As the resample method call the TimeGrouper to implement the time axis index, may a new parameter needed to control TimeGrouper's behavior? My target is to group by a categorical column and then resample to get a complete statistic in time dimension, I cannot do this in a groupby operation. |
ok, this is a duplicate of #16746, but I like your example better, slightly formatted: should all result the same [6] and [8] are correct, [7] is the bug.
Should be straightforward to fix if you'd like to dig in. |
@jreback I think 7 is also correct if follow the SQL rule, missed data can not be queried, maybe a new parameter needed? After hours debug, I think this situation is very complex and hard to fix, when by scalar TimeGrouper returns a BinGrouper, while by list returns a BaseGrouper that may have other grouping. I'll take some effort in this bug but can't promise to fix it out. |
you are diving too deep. look more around here [7] is not a resample, rather a straight groupby by date
you could add this example to the groupby docs, but another parameter is not needed. |
I think the situation is complex than your thought. # the labels and group_index of Grouping class, with them the BaseGrouper generate group_info
@property
def labels(self):
if self._labels is None:
self._make_labels()
return self._labels
@property
def group_index(self):
if self._group_index is None:
self._make_labels()
return self._group_index
def _make_labels(self):
if self._labels is None or self._group_index is None:
pdb.set_trace()
# self.grouper is the label of each row
labels, uniques = algorithms.factorize(
self.grouper, sort=self.sort)
uniques = Index(uniques, name=self.name)
self._labels = labels
self._group_index = uniques But for BinGrouper, the group_info property generated in different way, which directly map the labels and group_index. def group_info(self):
pdb.set_trace()
ngroups = self.ngroups
obs_group_ids = np.arange(ngroups)
rep = np.diff(np.r_[0, self.bins])
rep = _ensure_platform_int(rep)
if ngroups == len(self.bins):
comp_ids = np.repeat(np.arange(ngroups), rep)
else:
comp_ids = np.repeat(np.r_[-1, np.arange(ngroups)], rep)
return comp_ids.astype('int64', copy=False), \
obs_group_ids.astype('int64', copy=False), ngroups So the bug occurs, the binner can only provide the labels of each row in selected axis, and then unique them to get unique labels and group_index, so the missed date time cannot be generated. I think the way to fix this bug is to provide group_info in _get_binner_for_grouping of grouper to get more info? I think this logic change may be bigger? I'll take further look. |
@ruiann again, you can simply have a len-1 list-like grouper case turn into the same case as the scalar passed case. its just going down the wrong path. |
Then how about the result of ['categorical_axis', TimeGrouper('D', key='time_axis')]? It can not get all time slices for each category item. def _get_binner_for_grouping(self, obj):
# return an ordering of the transformed group labels,
# suitable for multi-grouping, e.g the labels for
# the resampled intervals
binner, grouper, obj = self._get_grouper(obj)
l = []
for key, group in grouper.get_iterator(self.ax):
l.extend([key] * len(group))
pdb.set_trace()
if isinstance(self.ax, PeriodIndex):
grouper = binner.__class__(l, freq=binner.freq, name=binner.name)
else:
# resampling causes duplicated values, specifying freq is invalid
grouper = binner.__class__(l, name=binner.name)
# since we may have had to sort
# may need to reorder groups here
if self.indexer is not None:
indexer = self.indexer.argsort(kind='quicksort')
grouper = grouper.take(indexer)
return grouper method of BaseGrouper (and BinGrouper) class works wrong, the group_info can be get from the BaseGrouper/BinGrouper instance generated by _get_grouper method, but not to read and sort them again item by item. I'm trying to return the BaseGrouper/BinGrouper instance generated by _get_grouper method if the Grouping instance is given by Grouper instance, and get labels, group_index from the group_info, but there are still some bugs reported by test case, I'm now trying to fix them. At last, I think groupby related logics need some refactor (I mean the name, alias, call stack sometimes confuse me, also some duplicated calls happened, maybe convert them all to list groupers could be better, with which only one implement need to deal). |
If you want to attempt that, sure. Pls do so w/o adding any features / fixing any bugs, (though you can add xfailing tests). Try to make things pass everything that currently works, then in different commits / PR add things. |
@jreback The fix is to allow the Grouping to use grouper generated by _get_grouper of passed TimeGrouper\OtherGrouper item of the list directly, and provide indexer for BinGrouper to reorder them correctly for the values of the grouped axis There are some PEP8 errors, which not caused by my change, and hard to change 😢 Also the Contributing guide doesn't work for me
shell said
When I build env by setup.py and run all tests
same bug occurs in some test cases
|
Code Sample, a copy-pastable example if possible
give below two different outputs
Problem description
The origin data frame doesn't contain a record in 2017-07-24, you can see if I use time grouper with convention='start' lonely, the 2017-07-24 item has been filled, but if by group list, that item hasn't been filled
Output of
pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.2.0
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None
The text was updated successfully, but these errors were encountered: