-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DOC: improve doc string for .aggregate and .transform #22641
Conversation
Hello @topper-123! Thanks for submitting the PR.
|
dc30d7a
to
b9d0dd3
Compare
Codecov Report
@@ Coverage Diff @@
## master #22641 +/- ##
==========================================
+ Coverage 92.16% 92.16% +<.01%
==========================================
Files 169 169
Lines 50708 50716 +8
==========================================
+ Hits 46734 46742 +8
Misses 3974 3974
Continue to review full report at Codecov.
|
pandas/core/series.py
Outdated
|
||
@Appender(_transform_doc) | ||
@Appender(generic._shared_docs['transform'] % _shared_doc_kwargs) | ||
def transform(self, func, axis=0, *args, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm generally not sure its worth changing actual implementation for docstrings. If this is solely to isolate the various Examples I'd think it preferable to just have one shared Example docstring that covers Series and DataFrame rather than making code changes like this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The doc string is currently inherited from NDFrame, but it's not pretty IMO, see here
The issues are:
- The current doc string discussed NDFrames, not Series, which is confusing,
- the links in the SeeAlso don't work,
Series.transform
's signature is in master different from the signature forSeries.agg
andSeries.apply
in that it misses theaxis
parameter that the other two have. This may/may not be a problem (I actually couldn't produce a bug caused by this), but the signature should be consistent amongst the three methods, and I agree with the current design foragg
andapply
(i.e. to have an axis parameter).
Thoughts? I can remove this if there's not consensus.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say just use substitution for the class name (i.e. Series or DataFrame) and update the See Also links to point to both the Series and DataFrame methods. One of those will obviously be self referencing, but we've done this in other places as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I see what you mean. I've made a new commit with common examples but the SeeAlso can actually be made correct if Series gets it's own transform method (which is needed for the signature issue to be resolved).
pandas/core/generic.py
Outdated
func : function, string, list of string/functions or dictionary | ||
Function to use for transforming the data. If a function, must either | ||
work when passed a %(klass)s or when passed to %(klass)s.apply. | ||
The function (or each function in a list/dict) must return an |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO would be better served in a Notes section along with description of what happens if something is returned that doesn't meet this requirement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, it is already described in the Raises section, so I could just drop this part here...
abd633a
to
27bbb72
Compare
@Appender(generic._shared_docs['transform'] % _shared_doc_kwargs) | ||
def transform(self, func, axis=0, *args, **kwargs): | ||
# Validate the axis parameter | ||
self._get_axis_number(axis) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the point of this statement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This checks that the value passed to axis is 0 or “index”, else an exception is raised. So, a minor check for consistency.
27bbb72
to
59ca537
Compare
59ca537
to
650f639
Compare
all tests have passed. Is this ok? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@datapythonista can you take a look?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes look great. I added some comments, mainly related to the original docstring, that I think would make them follow our standards and be more clear.
Besides the comments, I think the previous examples were failing the doctests, and yours pass, so we should be able to stop skipping these docstrings in ci/doctest.sh
.
Also, did you run ./scripts/validate_docstrings.py pandas.Series.transform
...?
pandas/core/generic.py
Outdated
@@ -4545,17 +4545,16 @@ def pipe(self, func, *args, **kwargs): | |||
|
|||
Parameters | |||
---------- | |||
func : function, string, dictionary, or list of string/functions | |||
func : function, string, list of string/functions or dictionary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We try to use only Python types in this row, and be consistent with the format. func : function, str, list or dict
would be the preferred format (or specifying the list types, like list of str, list of function
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, but the list can contain bot functions and strings ([np.exp, 'sqrt']
). Maybe func : function, string, list of functions and/or strings or dict
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At some point I'd like to validate that all the types provided in that line are from a subset, to avoid typos and inconsistencies. I understand your point about mixing both, but I'd prefer function, str, list or dict
(note str
, the Python type, instead of string
), and then provide the details about mixing strings and everything else in the description.
pandas/core/generic.py
Outdated
@@ -4581,38 +4580,61 @@ def pipe(self, func, *args, **kwargs): | |||
|
|||
Parameters | |||
---------- | |||
func : callable, string, dictionary, or list of string/callables | |||
To apply to column | |||
func : function, string, list of string/functions or dictionary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as before
pandas/core/generic.py
Outdated
*args | ||
Positional arguments to pass to `func`. | ||
**kwargs | ||
Keyword arguments to pass to `func`. | ||
|
||
Returns | ||
------- | ||
transformed : %(klass)s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The transformed
name doesn't add a lot of value. Can you just leave the type here, and add a description in the next line. Same for aggregate
.
pandas/core/generic.py
Outdated
|
||
Returns | ||
------- | ||
transformed : %(klass)s | ||
|
||
Raises | ||
------ | ||
ValueError: if the returned %(klass)s has a different length than self. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if sphinx requires a space before the colon. Did you render the docstring? Also, if you can capitalize the sentence If the returned...
.
Then, I don't find the description very clear on what the user did wrong, and how they should fix the problem. Do you think you can add a bit more information?
pandas/core/generic.py
Outdated
1 1.000000 2.718282 | ||
2 1.414214 7.389056 | ||
3 1.732051 20.085537 | ||
4 2.000000 54.598150 | ||
|
||
See also |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See Also
should be placed before the examples, and should have a capital A
in Also
.
pandas/core/generic.py
Outdated
2000-01-10 -1.366388 -0.614710 0.005378 | ||
A B | ||
2000-01-01 -1.143001 1.143001 | ||
2000-01-02 -0.889001 0.889001 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this example could be simplified, so users don't need to spend a lot of time understanding what's going on. I think 3 or 4 rows should be enough to illustrate transform
. Also, standardizing the values may be a more real-world example, but I don't think anybody is able to do the mental math, to compare what they think the function is doing, with what we show here. Also, I would use the default index, as using dates eems to have a meaning, and it's misleading.
So, I'd do:
- A much shorter DataFrame (e.g. 3 rows)
- A much simpler function (e.g.
lambda x: x + 1
) - Use the default index
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's a good point, changed. For the Series, I've also shortened it, but kept s.transform([np.sqrt, np.exp])
, as I think that also is quite simple.
pandas/core/generic.py
Outdated
|
||
It is only required for the axis specified in the ``axis`` parameter | ||
to have the same length for output and for self. The other axis may have a | ||
different length: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite understand what you mean here. We're not specifying the axis
parameter in the example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code-wise transform
is the same as aggregate
, except there's a check that the result has the same length as self. I try to bring out this requirement for transform
.
I've tried to word the doc string differently to highlight this better.
@datapythonista , I'm not familiar with Wrt. other stuff, I've made various changes according to the comments (great comments BTW). EDIT: BTW, I ran |
@topper-123 at the moment we don't do the validation of docstrings in the CI (there are too many failing, and the script also reports false positives). the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, just added few comments about the formatting.
pandas/core/generic.py
Outdated
@@ -4564,7 +4563,7 @@ def pipe(self, func, *args, **kwargs): | |||
|
|||
Returns | |||
------- | |||
aggregated : %(klass)s | |||
pandas.%(klass)s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we prefix Series
and DataFrame
with pandas
. I'd just leave with the class for consistency.
pandas/core/generic.py
Outdated
- dict of column names -> functions (or list of functions). | ||
- string function name | ||
- function | ||
- list of functions and/or function names |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may be you can add the example you wrote in the comments? I think that would make much clearer that we can mix both
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may be you can add the example you wrote in the comment?
Not sure I understand whst you mean here, could you expand?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I meant that we could have an example like [np.exp, 'sqrt']
that you mentioned before, so it's easier to see that you can use both strings and functions together in the same list.
pandas/core/generic.py
Outdated
|
||
.. versionadded:: 0.20.0 | ||
|
||
Parameters | ||
---------- | ||
func : callable, string, dictionary, or list of string/callables | ||
To apply to column | ||
func : function, string, list of functions and/or strings or dict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
str
instead of string
*args | ||
Positional arguments to pass to `func`. | ||
**kwargs | ||
Keyword arguments to pass to `func`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you prefer, numpydoc also accepts having both in one line:
*args, **kwargs
Arguments to pass to `func`.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I did that, though my own preference would be to not use star arguments and use args=None, kwargs=None
instead to better have a distinction what gets passed to func
and what doesn't get passed on. But that's for a whole another discussion, and may be too late now :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just found out that scripts/validate_docstrings.py
doesnt accept putting those on the same line, so I've reverted to the previous style.
pandas/core/generic.py
Outdated
|
||
Returns | ||
------- | ||
transformed : %(klass)s | ||
pandas.%(klass)s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as before.
pandas/core/generic.py
Outdated
Examples | ||
Raises | ||
------ | ||
ValueError : if the returned %(klass)s has a different length than self. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Capital I
in if
.
pandas/core/generic.py
Outdated
|
||
See also | ||
pandas.%(klass)s.agg : only perform aggregating type operations | ||
pandas.%(klass)s.apply : Invoke function on a Series |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for pandas.
. Capital O
in only
. Descriptions finishing with period.
pandas/core/generic.py
Outdated
pandas.%(klass)s.aggregate | ||
pandas.%(klass)s.apply | ||
>>> df = pd.DataFrame({'A': range(3), 'B': range(1, 4)}) | ||
>>> df.transform(lambda x: x + 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would display df
first before the transform. So, it's immediate to compare the original data and the transformed data, even without reading the constructor of the DataFrame.
pandas/core/generic.py
Outdated
%(klass)s, it is possible to provide several input functions: | ||
|
||
>>> s = pd.Series(range(3)) | ||
>>> s.transform([np.sqrt, np.exp]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd also show s
before the transform.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, I think the new docstrings are very clear and informative. There are couple of minor things regarding conventions we follow, but otherwise lgtm.
pandas/core/generic.py
Outdated
|
||
.. versionadded:: 0.20.0 | ||
|
||
Parameters | ||
---------- | ||
func : callable, string, dictionary, or list of string/callables | ||
To apply to column | ||
func : function, str, list of functions and/or strings or dict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you leave it as the previous?
pandas/core/generic.py
Outdated
|
||
Returns | ||
------- | ||
aggregated : %(klass)s | ||
%(klass)s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you don't mind adding a description to what is returned, that would be great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return type is actually a bit more complex than expected:
>>> df = pd.DataFrame({'A': range(5), 'B': 5})
>>> df.agg(['sum', 'mean']) -> a DataFrame
>>>df.agg('sum') -> a Series
>>> df.A.agg(['sum', 'mean']) -> a Series
>>>df.A.agg('sum') -> a scalar
I suggest we should do a %(return_value)s
in the Returns section and add as appropriate in Series.agg and DataFrame.agg appenders.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd simply have DataFrame, Series or scalar
as the type, and then in the return description explain a bit better what is being returned in each case. May be there are better options, but this is what I've been doing, and I personally prefer to not overcomplicate the docstrings with variables, unless it really adds significant value. But that's my opinion, it's you call.
pandas/core/generic.py
Outdated
|
||
See also | ||
%(klass)s.agg : Only perform aggregating type operations | ||
%(klass)s.apply : Invoke function on a Series |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you finish these descriptions with a period? I think we need to add this validation to the script, but it's what we've been doing.
f33e2f9
to
81ca449
Compare
81ca449
to
fbe270c
Compare
I think all is done now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. Thanks for all the work on this @topper-123
thanks @topper-123 |
Since #21224, operations using
axis=1
in df.aggregate and df.transform now work the same as when axis=0.This PR updates the methods' doc strings to reflect the new reality. For example, we can now pass a dict to DataFrame.agg/transform when
axis=1
also, andDataFrame.transform
now has anaxis
parameter.There's a minor API change, as
Series.transform
should have aaxis=0
parameter to have the same API asSeries.aggregate
.Also some related minor clarifications.