BUG: Metadata propagation for groupby iterator #37461

Japanuspus · 2020-10-28T07:30:05Z

Addresses part of #28283

This PR ensures that __finalize__ is called for objects returned when iterating over .groupby results.
Based on the discussion and benchmarks in PR #35688 (see #35688 (comment)), __finalize__ is not called for intermediate objects used by .apply (and maybe other code paths).

The implemented solution is the safe choice with regard to performance. An alternative would be to call __finalize__ immediately after initialization (in DataSplitter._chop()).
The downside of doing so would be a performance hit from the overhead of finalizing intermediate objects in .apply (and maybe other code paths). In the benchmarks referenced above, this amounted to around 4%.
The upside of finalizing immediately after initialization would be reduced complexity, and it would allow .apply (and any other code paths using DataSplitter._chop) to access metadata.
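
To make the user-visible behavior concrete, here is a minimal sketch of the scenario from #37343. The class name `TaggedFrame` and the attribute `tag` are made up for illustration; only `_metadata`, `_constructor`, and `groupby` are pandas API. On a pandas version that includes this fix, every group yielded by the iterator should carry the parent's `tag`.

```python
import pandas as pd


class TaggedFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass carrying one custom attribute."""

    # Names listed in _metadata are copied from the parent by __finalize__.
    _metadata = ["tag"]

    @property
    def _constructor(self):
        # Make pandas operations return TaggedFrame instead of DataFrame.
        return TaggedFrame


df = TaggedFrame({"key": ["a", "a", "b"], "value": [1, 2, 3]})
df.tag = "experiment-1"

# Record the custom attribute seen on each group yielded by the iterator.
seen = {name: getattr(group, "tag", None) for name, group in df.groupby("key")}
print(seen)
```

On pandas 1.1.0 through 1.1.4, the regression reported in #37343 would be expected to show up here as missing attributes on the groups.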

closes BUG: groupby __iter__ on pandas 1.1.x not propagating _metadata on DataFrame subclasses #37343
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Japanuspus · 2020-10-28T08:26:56Z

@jorisvandenbossche You asked about this PR. Do you have any position on the performance question? Would you be able to review?

I will have a look at the failing tests.

simonjayhawkins · 2020-10-29T16:24:49Z

Thanks @Japanuspus for the PR.

Addresses part of #28283

This PR may also fix #37343, see #37343 (comment)

If you could add a test for that issue, that'll be great. Also if you could keep the changes in this PR minimal as we would likely want to backport this.

Japanuspus · 2020-10-30T09:22:57Z

Thanks @simonjayhawkins!

This PR may also fix #37343, see #37343 (comment)

If you could add a test for that issue, that'll be great. Also if you could keep the changes in this PR minimal as we would likely want to backport this.

Yes, this was exactly the issue that I saw in my own code as well after the 1.1 update. I will add a test. It appears I need to merge 2414c75 to get passing tests: will rebase and force push to keep things clean for backport.

Japanuspus · 2020-10-30T10:25:05Z

Still seeing unrelated test failures. Will try waiting for things to settle down once 1.1.4 is released.

jreback

can you also update the metadata issue in the 1.2 whatsnew (just add the issue number on)

pandas/tests/groupby/test_grouping.py

jbrockmendel · 2020-10-30T18:07:49Z

Does something similar need to be done for the cython path?

Japanuspus · 2020-11-02T08:54:49Z

Does something similar need to be done for the cython path?

I believe not: The functional change is in BaseGrouper.get_iterator which does not appear to have a cython alternative path.

This solution does not call `__finalize__` immediately on object creation in the `DataSplitter` instance. This is to allow `.apply` to avoid the overhead of calling `__finalize__` for intermediate objects.
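
The design choice described above can be illustrated with a toy model. This is not the pandas implementation: `Piece`, `ToySplitter`, and `get_iterator` below are hypothetical stand-ins, and `finalize` only loosely mimics `__finalize__`. The point is that the internal chopping step stays cheap, and metadata is copied only for objects handed to the user through the iterator.

```python
class Piece:
    """Stand-in for a DataFrame slice; `tag` plays the role of metadata."""

    def __init__(self, rows):
        self.rows = rows
        self.tag = None

    def finalize(self, other):
        # Copy metadata from the parent object and return self.
        self.tag = other.tag
        return self


class ToySplitter:
    """Stand-in for a splitter that cuts a parent into pieces."""

    def __init__(self, parent, slices):
        self.parent = parent
        self.slices = slices

    def chop(self, slice_obj):
        # Internal path (think: intermediate objects in .apply).
        # No finalize call here, so no per-piece metadata overhead.
        return Piece(self.parent.rows[slice_obj])

    def __iter__(self):
        for slice_obj in self.slices:
            yield self.chop(slice_obj)


def get_iterator(parent, slices):
    # User-facing path: finalize each piece just before yielding it.
    for piece in ToySplitter(parent, slices):
        yield piece.finalize(parent)


parent = Piece([1, 2, 3, 4])
parent.tag = "hello"
slices = [slice(0, 2), slice(2, 4)]

user_tags = [p.tag for p in get_iterator(parent, slices)]
internal_tags = [p.tag for p in ToySplitter(parent, slices)]
print(user_tags, internal_tags)
```

Moving the `finalize` call into `chop` would make both paths see the metadata, which is the alternative weighed in the PR description, at the cost of running it for every intermediate object.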

See https://pandas.pydata.org/pandas-docs/stable/development/extending.html#override-constructor-properties

simonjayhawkins · 2020-11-02T15:17:54Z

@jreback are you OK with backporting this?

jreback · 2020-11-02T21:23:37Z

@jreback are you OK with backporting this?

actually this is fine to backport.

@Japanuspus can you update the whatsnew note to put in 1.1.5

jreback

lgtm if you can move the note and merge master. ping on green-ish.

doc/source/whatsnew/v1.2.0.rst

simonjayhawkins · 2020-11-04T11:24:12Z

lgtm if you can move the note and merge master. ping on green-ish.

32 bit failure unrelated. (caused by recent merge to master of #36842)

jreback · 2020-11-04T13:22:59Z

thanks @Japanuspus

simonjayhawkins · 2020-11-04T13:26:35Z

@meeseeksdev backport 1.1.x

lumberbot-app · 2020-11-04T13:27:06Z

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

Checkout backport branch and update it.

$ git checkout 1.1.x
$ git pull

Cherry-pick the first parent branch of this PR on top of the older branch:

$ git cherry-pick -m1 cc9c646463d4a93abdc7c61bbb47e7d2ccf2fc4b

You will likely have some merge/cherry-pick conflict here, fix them and commit:

$ git commit -am 'Backport PR #37461: BUG: Metadata propagation for groupby iterator'

Push to a named branch:

$ git push YOURFORK 1.1.x:auto-backport-of-pr-37461-on-1.1.x

Create a PR against branch 1.1.x, I would have named this PR:

"Backport PR #37461 on branch 1.1.x"

And apply the correct labels and milestones.

Congratulations, you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

If these instructions are inaccurate, feel free to suggest an improvement.


ryandvmartin mentioned this pull request Oct 29, 2020

BUG: groupby __iter__ on pandas 1.1.x not propagating _metadata on DataFrame subclasses #37343

Closed

3 tasks

Japanuspus force-pushed the groupby_iter_finalize branch from 9d8b79b to eeaf14d on October 30, 2020 09:25

jreback requested changes Oct 30, 2020

pandas/tests/groupby/test_grouping.py (outdated, resolved)

jreback added the metadata _metadata, .attrs label Oct 30, 2020

Japanuspus added 2 commits November 2, 2020 12:58

Call __finalize__ in groupby iterator

afd8491

This solution does not call `__finalize__` immediately on object creation in the `DataSplitter` instance. This is to allow `.apply` to avoid the overhead of calling `__finalize__` for intermediate objects.

Add test for pandas-devGH-37343

e733ba4

See https://pandas.pydata.org/pandas-docs/stable/development/extending.html#override-constructor-properties

Japanuspus force-pushed the groupby_iter_finalize branch 2 times, most recently from 03eaaf0 to b5286ea on November 2, 2020 13:48

jreback modified the milestones: 1.2, 1.1.5 Nov 2, 2020

jreback requested changes Nov 2, 2020

doc/source/whatsnew/v1.2.0.rst (outdated, resolved)

has2k1 mentioned this pull request Nov 3, 2020

group_by broken has2k1/plydata#23

Closed

Japanuspus force-pushed the groupby_iter_finalize branch from b5286ea to e733ba4 on November 4, 2020 10:24

Japanuspus added 2 commits November 4, 2020 11:25

Merge branch 'master' into groupby_iter_finalize

91ac7ec

Add 1.1.5 whatsnew entry

f7d0a4b

Japanuspus requested a review from jreback November 4, 2020 12:30

jreback approved these changes Nov 4, 2020

jreback merged commit cc9c646 into pandas-dev:master Nov 4, 2020

lumberbot-app bot added the Still Needs Manual Backport label Nov 4, 2020

simonjayhawkins pushed a commit to simonjayhawkins/pandas that referenced this pull request Nov 4, 2020

Backport PR pandas-dev#37461 on branch 1.1.x: BUG: Metadata propagati…

bf91185

…on for groupby iterator

simonjayhawkins mentioned this pull request Nov 4, 2020

Backport PR #37461 on branch 1.1.x: BUG: Metadata propagation for groupby iterator #37628

Merged

simonjayhawkins removed the Still Needs Manual Backport label Nov 4, 2020

simonjayhawkins added a commit that referenced this pull request Nov 4, 2020

Backport PR #37461 on branch 1.1.x: BUG: Metadata propagation for gro…

d601416

…upby iterator (#37628) Co-authored-by: Janus <janus@insignificancegalore.net>

nhoening mentioned this pull request Dec 11, 2020

Tests fail with pandas==1.1(metadata lost on groupby) SeitaBV/timely-beliefs#26

Closed

jorisvandenbossche mentioned this pull request Jan 14, 2022

BUG: correctly instantiate subclassed DataFrame/Series in groupby apply #45363

Merged

1 task
