DOC: update pandas.core.groupby.DataFrameGroupBy.resample docstring. #20374

@pandres pandres commented Mar 15, 2018

Checklist for the pandas documentation sprint (ignore this if you are doing
an unrelated PR):

  • PR title is "DOC: update the docstring"
  • The validation script passes: scripts/validate_docstrings.py <your-function-or-method>
  • The PEP8 style check passes: git diff upstream/master -u -- "*.py" | flake8 --diff
  • The html version looks good: python doc/make.py --single <your-function-or-method>
  • It has been proofread on language by another sprint participant

Please include the output of the validation script below between the "```" ticks:

################################################################################
########## Docstring (pandas.core.groupby.DataFrameGroupBy.resample)  ##########
################################################################################

Provide resampling when using a TimeGrouper.

Given a grouper the function resamples it according to a string
"string" -> "frequency".

See the :ref:`frequency aliases <timeseries.offset-aliases>`
documentation for more details.

Parameters
----------
rule : str or Offset
    The offset string or object representing target grouper conversion.
*args, **kwargs : [closed, label, loffset]
    For compatibility with other groupby methods. See below for some
    example parameters.
closed : {‘right’, ‘left’}
    Which side of bin interval is closed.
label : {‘right’, ‘left’}
    Which bin edge label to label bucket with.
loffset : timedelta
    Adjust the resampled time labels.

Returns
-------
Grouper
    Return a new grouper with our resampler appended.

Examples
--------
Start by creating a length-4 DataFrame with minute frequency.

>>> idx = pd.date_range('1/1/2000', periods=4, freq='T')
>>> df = pd.DataFrame(data=4 * [range(2)],
...                   index=idx,
...                   columns=['a', 'b'])
>>> df.iloc[2, 0] = 5
>>> df
                     a  b
2000-01-01 00:00:00  0  1
2000-01-01 00:01:00  0  1
2000-01-01 00:02:00  5  1
2000-01-01 00:03:00  0  1

Downsample the DataFrame into 3 minute bins and sum the values of
the timestamps falling into a bin.

>>> df.groupby('a').resample('3T').sum()
                         a  b
a
0   2000-01-01 00:00:00  0  2
    2000-01-01 00:03:00  0  1
5   2000-01-01 00:00:00  5  1

Upsample the series into 30 second bins.

>>> df.groupby('a').resample('30S').sum()
                         a  b
a
0   2000-01-01 00:00:00  0  1
    2000-01-01 00:00:30  0  0
    2000-01-01 00:01:00  0  1
    2000-01-01 00:01:30  0  0
    2000-01-01 00:02:00  0  0
    2000-01-01 00:02:30  0  0
    2000-01-01 00:03:00  0  1
5   2000-01-01 00:02:00  5  1

Resample by month. Values are assigned to the month of the period.

>>> df.groupby('a').resample('M').sum()
                a  b
a
0   2000-01-31  0  3
5   2000-01-31  5  1

Downsample the series into 3 minute bins as above, but close the right
side of the bin interval.

>>> df.groupby('a').resample('3T', closed='right').sum()
                         a  b
a
0   1999-12-31 23:57:00  0  1
    2000-01-01 00:00:00  0  2
5   2000-01-01 00:00:00  5  1

Downsample the series into 3 minute bins and close the right side of
the bin interval, but label each bin using the right edge instead of
the left.

>>> df.groupby('a').resample('3T', closed='right', label='right').sum()
                         a  b
a
0   2000-01-01 00:00:00  0  1
    2000-01-01 00:03:00  0  2
5   2000-01-01 00:03:00  5  1

Add an offset of twenty seconds.

>>> df.groupby('a').resample('3T', loffset='20s').sum()
                         a  b
a
0   2000-01-01 00:00:20  0  2
    2000-01-01 00:03:20  0  1
5   2000-01-01 00:00:20  5  1


See also
--------
pandas.Series.groupby
pandas.DataFrame.groupby
pandas.Panel.groupby

################################################################################
################################## Validation ##################################
################################################################################

Errors found:
	Use only one blank line to separate sections or paragraphs
	Errors in parameters section
		Parameters {'args', 'kwargs'} not documented
		Unknown parameters {'*args, **kwargs', 'loffset', 'label', 'closed'}

args and kwargs are described with asterisks as requested, even if the validator does not recognize them.

The validator complains about duplication in the See Also section if it is left as it is in the committed docstring. That is not shown above.

@jreback jreback added Docs Resample resample method labels Mar 16, 2018
Return a new grouper with our resampler appended
Provide resampling when using a TimeGrouper.

Given a grouper the function resamples it according to a string and an
Contributor

"string" -> "frequency" and link to the frequency aliases:

:ref:`frequency aliases <timeseries.offset_aliases>`

Provide resampling when using a TimeGrouper.

Given a grouper the function resamples it according to a string and an
optional list and dictionary of parameters. Returns a new grouper with
Contributor

"optional list and dictionary of parameters" doesn't mean much to me as a user. I'd say remove that bit.


Parameters
----------
rule : str
Contributor

rule : str or Offset I think. @jorisvandenbossche does that look OK for Offset, instead of pandas.core.offsets.Offset? We used just Offset in a couple places.

----------
rule : str
The offset string or object representing target grouper conversion.
*args
Contributor

This feels a bit too specific to the implementation to me. Do you have any specific args / kwargs that the user would care about? Otherwise, I'd just write it as

*args, **kwargs
    For compatibility with other groupby methods.

Contributor

Some keywords to think about (explicitly document and show examples for)?

closed : {'right', 'left'}
label : {'right', 'left'}
loffset : timedelta

See pandas.core.generic.NDFrame.resample for some others that may or may not have an effect. I'm not sure.
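
For reference, a minimal sketch of what `closed` and `label` change, shown on a plain Series rather than a groupby so the bins are easy to read. The data and variable names are made up for illustration, and the `'min'` alias is used in place of `'T'`, which newer pandas versions deprecate.

```python
import pandas as pd

# Four one-minute observations, each with value 1 (illustrative data).
idx = pd.date_range("2000-01-01", periods=4, freq="min")
s = pd.Series([1, 1, 1, 1], index=idx)

# Default: bins are closed on the left and labelled with the left edge.
left = s.resample("3min").sum()

# closed='right': the right edge belongs to the bin, so 00:00 falls into
# the bin that starts at 23:57 on the previous day.
right = s.resample("3min", closed="right").sum()

# label='right': same bins as above, but each one is named by its right edge.
right_labelled = s.resample("3min", closed="right", label="right").sum()

print(left, right, right_labelled, sep="\n\n")
```

Only the bin membership changes with `closed`; `label` only changes which edge names the bin.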

-------
Grouper
Return a new grouper with our resampler appended.

Contributor

See Also
--------
pandas.Grouper : specify a frequency to resample with when
    grouping by a key.
DatetimeIndex.resample : Frequency conversion and resampling of
    time series.


Examples
--------
Start by creating a DataFrame with 9 one minute timestamps.
Contributor

"9 one" is a bit awkward. Maybe "a length-9 DataFrame with minute frequency."

Start by creating a DataFrame with 9 one minute timestamps.

>>> idx = pd.date_range('1/1/2000', periods=9, freq='T')
>>> df = pd.DataFrame(data=9*[range(4)],
Contributor

PEP8: space around *.

>>> df = pd.DataFrame(data=9*[range(4)],
...                   index=idx,
...                   columns=['a', 'b', 'c', 'd'])
>>> df.iloc[[6], [0]] = 5 # change a value for grouping
Contributor

Don't need the inner lists, just df.iloc[6, 0] = 5. Don't need the comment.
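
A quick check that the two spellings set the same cell, as a sketch using the frame from the docstring. Variable names are illustrative, and `'min'` stands in for the `'T'` alias.

```python
import pandas as pd

idx = pd.date_range("2000-01-01", periods=9, freq="min")
df_lists = pd.DataFrame([list(range(4))] * 9, index=idx,
                        columns=["a", "b", "c", "d"])
df_scalar = df_lists.copy()

# List indexers select a 1x1 block and assign the scalar to it.
df_lists.iloc[[6], [0]] = 5

# Scalar indexers address the single cell directly.
df_scalar.iloc[6, 0] = 5

same = df_lists.equals(df_scalar)
print(same)
```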

codecov bot commented Apr 3, 2018

Codecov Report

Merging #20374 into master will decrease coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #20374      +/-   ##
==========================================
- Coverage   92.24%   92.24%   -0.01%     
==========================================
  Files         161      161              
  Lines       51318    51317       -1     
==========================================
- Hits        47339    47338       -1     
  Misses       3979     3979
Flag        Coverage Δ
#multiple   90.63% <ø> (-0.01%) ⬇️
#single     42.3% <ø> (-0.01%) ⬇️

Impacted Files                   Coverage Δ
pandas/core/groupby/groupby.py   96.5% <ø> (-0.01%) ⬇️
pandas/core/dtypes/common.py     94.37% <0%> (ø) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@pandres pandres force-pushed the docstring_pandas.core.groupby.DataFrameGroupBy.resample branch from 822c90c to 92825c6 Compare April 4, 2018 19:50
Contributor Author

pandres commented Apr 4, 2018

################################################################################
########### Docstring (pandas.core.groupby.groupby.GroupBy.resample) ###########
################################################################################

Provide resampling when using a TimeGrouper.

Given a grouper the function resamples it according to a string
"string" -> "frequency".

See the :ref:`frequency aliases <timeseries.offset-aliases>`
documentation for more details.

Parameters
----------
rule : str or Offset
    The offset string or object representing target grouper conversion.
*args, **kwargs
    For compatibility with other groupby methods. See below for some
    example parameters.
closed : {‘right’, ‘left’}
    Which side of bin interval is closed.
label : {‘right’, ‘left’}
    Which bin edge label to label bucket with.
loffset : timedelta
    Adjust the resampled time labels.

Returns
-------
Grouper
    Return a new grouper with our resampler appended.

Examples
--------
Start by creating a length-9 DataFrame with minute frequency.

>>> idx = pd.date_range('1/1/2000', periods=9, freq='T')
>>> df = pd.DataFrame(data=9 * [range(4)],
...                   index=idx,
...                   columns=['a', 'b', 'c', 'd'])
>>> df.iloc[6, 0] = 5
>>> df
                     a  b  c  d
2000-01-01 00:00:00  0  1  2  3
2000-01-01 00:01:00  0  1  2  3
2000-01-01 00:02:00  0  1  2  3
2000-01-01 00:03:00  0  1  2  3
2000-01-01 00:04:00  0  1  2  3
2000-01-01 00:05:00  0  1  2  3
2000-01-01 00:06:00  5  1  2  3
2000-01-01 00:07:00  0  1  2  3
2000-01-01 00:08:00  0  1  2  3

Downsample the DataFrame into 3 minute bins and sum the values of
the timestamps falling into a bin.

>>> df.groupby('a').resample('3T').sum()
                         a  b  c  d
a
0   2000-01-01 00:00:00  0  3  6  9
    2000-01-01 00:03:00  0  3  6  9
    2000-01-01 00:06:00  0  2  4  6
5   2000-01-01 00:06:00  5  1  2  3

Upsample the series into 30 second bins.

>>> df.groupby('a').resample('30S').sum()
                         a  b  c  d
a
0   2000-01-01 00:00:00  0  1  2  3
    2000-01-01 00:00:30  0  0  0  0
    2000-01-01 00:01:00  0  1  2  3
    2000-01-01 00:01:30  0  0  0  0
    2000-01-01 00:02:00  0  1  2  3
    2000-01-01 00:02:30  0  0  0  0
    2000-01-01 00:03:00  0  1  2  3
    2000-01-01 00:03:30  0  0  0  0
    2000-01-01 00:04:00  0  1  2  3
    2000-01-01 00:04:30  0  0  0  0
    2000-01-01 00:05:00  0  1  2  3
    2000-01-01 00:05:30  0  0  0  0
    2000-01-01 00:06:00  0  0  0  0
    2000-01-01 00:06:30  0  0  0  0
    2000-01-01 00:07:00  0  1  2  3
    2000-01-01 00:07:30  0  0  0  0
    2000-01-01 00:08:00  0  1  2  3
5   2000-01-01 00:06:00  5  1  2  3

Resample by month. Values are assigned to the month of the period.

>>> df.groupby('a').resample('M').sum()
                a  b   c   d
a
0   2000-01-31  0  8  16  24
5   2000-01-31  5  1   2   3

Downsample the series into 3 minute bins as above, but close the right
side of the bin interval.

>>> df.groupby('a').resample('3T', closed='right').sum()
                         a  b  c  d
a
0   1999-12-31 23:57:00  0  1  2  3
    2000-01-01 00:00:00  0  3  6  9
    2000-01-01 00:03:00  0  2  4  6
    2000-01-01 00:06:00  0  2  4  6
5   2000-01-01 00:03:00  5  1  2  3

Downsample the series into 3 minute bins and close the right side of
the bin interval, but label each bin using the right edge instead of
the left.

>>> df.groupby('a').resample('3T', closed='right', label='right').sum()
                         a  b  c  d
a
0   2000-01-01 00:00:00  0  1  2  3
    2000-01-01 00:03:00  0  3  6  9
    2000-01-01 00:06:00  0  2  4  6
    2000-01-01 00:09:00  0  2  4  6
5   2000-01-01 00:06:00  5  1  2  3

Add an offset of twenty seconds.

>>> df.groupby('a').resample('3T', loffset='20s').sum()
                         a  b  c  d
a
0   2000-01-01 00:00:20  0  3  6  9
    2000-01-01 00:03:20  0  3  6  9
    2000-01-01 00:06:20  0  2  4  6
5   2000-01-01 00:06:20  5  1  2  3

See Also
--------
pandas.Grouper : specify a frequency to resample with when
    grouping by a key.
DatetimeIndex.resample : Frequency conversion and resampling of
    time series.

################################################################################
################################## Validation ##################################
################################################################################

Errors found:
	Use only one blank line to separate sections or paragraphs
	Errors in parameters section
		Parameters {'args', 'kwargs'} not documented
		Unknown parameters {'loffset', 'label', 'closed', '*args, **kwargs'}
		Parameter "*args, **kwargs" has no type

@pandres pandres force-pushed the docstring_pandas.core.groupby.DataFrameGroupBy.resample branch from 92825c6 to deea3c7 Compare April 4, 2018 19:56
Contributor Author

pandres commented Apr 4, 2018

I'm confused about the 'See also' section. It auto-generates the message:

See also
--------
pandas.Series.groupby
pandas.DataFrame.groupby
pandas.Panel.groupby

With the errors:
Missing description for See Also "pandas.Series.groupby" reference
Missing description for See Also "pandas.DataFrame.groupby" reference
Missing description for See Also "pandas.Panel.groupby" reference

I've replaced it with the See Also section suggested in the review comments above.

label : {‘right’, ‘left’}
Which bin edge label to label bucket with.
loffset : timedelta
Adjust the resampled time labels.
Member

These parameters are not in the signature, are they the possible kwargs? If that's the case, we can add them as a list in the kwargs description.

Contributor Author

OK, they are just some samples of the kwargs, as requested.

pandas/core/groupby/groupby.py Outdated
2000-01-01 00:05:00  0  1  2  3
2000-01-01 00:06:00  5  1  2  3
2000-01-01 00:07:00  0  1  2  3
2000-01-01 00:08:00  0  1  2  3
Member

Do you think it'd be possible to use a more compact example? Maybe 4 rows of 1 minute intervals, that can be downsampled to 2 rows of 30 seconds? Also, I think 2 columns should be enough.
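
A compact frame along these lines might look like the sketch below: four one-minute rows and two columns. The column `b` is selected before resampling so the result does not depend on how a given pandas version treats the grouping column, and `'min'` is used instead of `'T'`. The exact values are illustrative.

```python
import pandas as pd

# Four rows at one-minute frequency, two columns; one row gets a different
# value in 'a' so that the groupby produces two groups.
idx = pd.date_range("2000-01-01", periods=4, freq="min")
df = pd.DataFrame({"a": [0, 0, 5, 0], "b": [1, 1, 1, 1]}, index=idx)

# Resample 'b' into 3-minute bins within each group of 'a'.
per_group = df.groupby("a")["b"].resample("3min").sum()
print(per_group)
```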

@datapythonista datapythonista self-assigned this Jul 22, 2018
@pep8speaks

Hello @pandres! Thanks for updating the PR.

Member

@datapythonista datapythonista left a comment

Pushed some minor fixes.

@jbrockmendel can you take a quick look and merge on green if you're happy?


Parameters
----------
rule : str or Offset
Member

Offset --> DateOffset

Which side of bin interval is closed.
* label : {'right', 'left'}
Which bin edge label to label bucket with.
* loffset : timedelta
Member

Is "loffset" right? I don't know this section of the code all that well.

Member

Not an expert myself, I guess the type should be DateOffset, tseries.offsets, timedelta, or str?

From what I can see, valid keywords should be how, fill_method, limit, kind and on (in get_resampler_for_grouping), closed, label, how, axis, fill_method, limit, loffset, kind, convention, base (in TimeGrouper), and key, level, freq, axis, sort (in Grouper).

What I'd do is to add the ones from get_resampler_for_grouping as explicit arguments, and then document that **kwargs will be passed to TimeGrouper.

@jreback is it ok to change the signature?
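
The proposed split could look roughly like the sketch below. It is a toy illustration, not pandas code: `resample_sketch` and `time_grouper_stub` are hypothetical names, and the stub only records the arguments it receives.

```python
def time_grouper_stub(freq, **kwargs):
    """Hypothetical stand-in for TimeGrouper: records what it was given."""
    return {"freq": freq, **kwargs}


def resample_sketch(rule, how=None, fill_method=None, limit=None,
                    kind=None, on=None, **kwargs):
    """Hypothetical signature: explicit resampler arguments, rest forwarded."""
    resampler_args = {"how": how, "fill_method": fill_method,
                      "limit": limit, "kind": kind, "on": on}
    # Anything not named explicitly (closed, label, loffset, ...) is passed
    # through to the grouper untouched.
    grouper = time_grouper_stub(freq=rule, **kwargs)
    return resampler_args, grouper


resampler_args, grouper = resample_sketch("3T", limit=2, closed="right")
print(resampler_args)
print(grouper)
```

With a signature like this, the explicit arguments can be documented one by one, and the `**kwargs` entry only needs to say where the remaining keywords go.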

Contributor

level, axis, freq, key, sort are all part of the grouper and not args to .resample() or any aggregation function.

Member

@pandres can you replace the description by something like:

Possible arguments are `how`, `fill_method`, `limit`, `kind` and `on`, and other arguments of `TimeGrouper`.

We can improve that later in a separate PR, but I think we can merge all the rest of the changes for now.

Thanks!

@jbrockmendel
Member

@datapythonista I gave this a read and made some comments, but don't know this section of the code well enough to form an informed opinion.

Member

@mroeschke mroeschke left a comment

LGTM. Just rebased on master. I think this should be good to merge on green.

@TomAugspurger TomAugspurger merged commit 5a04e6e into pandas-dev:master Nov 14, 2018
@TomAugspurger
Contributor

Thanks @pandres!

Contributor Author

pandres commented Nov 14, 2018

Thank you guys!

thoo added a commit to thoo/pandas that referenced this pull request Nov 15, 2018
* upstream/master: (25 commits)
  DOC: Delete trailing blank lines in docstrings. (pandas-dev#23651)
  DOC: Change release and whatsnew (pandas-dev#21599)
  DOC: Fix format of the See Also descriptions (pandas-dev#23654)
  DOC: update pandas.core.groupby.DataFrameGroupBy.resample docstring. (pandas-dev#20374)
  ENH: Allow export of mixed columns to Stata strl (pandas-dev#23692)
  CLN: Remove unnecessary code (pandas-dev#23696)
  Pin flake8-rst version (pandas-dev#23699)
  Implement _most_ of the EA interface for DTA/TDA (pandas-dev#23643)
  CI: raise clone depth limit on CI
  BUG: Fix Series/DataFrame.rank(pct=True) with more than 2**24 rows (pandas-dev#23688)
  REF: Move Excel names parameter handling to CSV (pandas-dev#23690)
  DOC: Accessing files from a S3 bucket. (pandas-dev#23639)
  Fix errorbar visualization (pandas-dev#23674)
  DOC: Surface / doc mangle_dupe_cols in read_excel (pandas-dev#23678)
  DOC: Update is_sparse docstring (pandas-dev#19983)
  BUG: Fix read_excel w/parse_cols & empty dataset (pandas-dev#23661)
  Add to_flat_index method to MultiIndex (pandas-dev#22866)
  CLN: Move to_excel to generic.py (pandas-dev#23656)
  TST: IntervalTree.get_loc_interval should return platform int (pandas-dev#23660)
  CI: Allow to compile docs with ipython 7.11 pandas-dev#22990 (pandas-dev#23655)
  ...