DOC: improve doc string for .aggregate and .transform #22641

topper-123 · 2018-09-08T21:31:34Z

Since #21224, operations using axis=1 in df.aggregate and df.transform now work the same as when axis=0.

This PR updates the methods' doc strings to reflect the new reality. For example, we can now pass a dict to DataFrame.agg/transform when axis=1 also, and DataFrame.transform now has an axis parameter.

There's a minor API change, as Series.transform should have an axis=0 parameter to have the same API as Series.aggregate.

Also some related minor clarifications.
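As a rough illustration of the behaviour described above (a sketch, not code from this PR; the frame `df` and its values are made up), aggregating and transforming along axis=1 might look like this on a pandas version that includes #21224:

```python
import pandas as pd

# Hypothetical data, chosen only for illustration.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# Aggregate each row instead of each column.
row_sums = df.agg('sum', axis=1)

# With axis=1, dict keys refer to row labels rather than column names.
picked = df.agg({0: 'sum', 2: 'max'}, axis=1)

# transform also accepts axis; the result keeps the shape of df.
shifted = df.transform(lambda x: x + 1, axis=1)
```

With axis=0 the same dict would have to use the column names 'A' and 'B' as keys.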

pep8speaks · 2018-09-08T21:31:39Z

Hello @topper-123! Thanks for submitting the PR.

There are no PEP8 issues in the file pandas/core/frame.py !
There are no PEP8 issues in the file pandas/core/generic.py !
There are no PEP8 issues in the file pandas/core/series.py !

codecov · 2018-09-08T23:00:10Z

Codecov Report

Merging #22641 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #22641      +/-   ##
==========================================
+ Coverage   92.16%   92.16%   +<.01%     
==========================================
  Files         169      169              
  Lines       50708    50716       +8     
==========================================
+ Hits        46734    46742       +8     
  Misses       3974     3974

Flag	Coverage Δ
#multiple	`90.57% <100%> (ø)`	⬆️
#single	`42.35% <77.77%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/frame.py	`97.19% <100%> (ø)`	⬆️
pandas/core/series.py	`93.76% <100%> (+0.03%)`	⬆️
pandas/core/generic.py	`96.67% <100%> (ø)`	⬆️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 996f361...b9d0dd3.

WillAyd · 2018-09-11T01:28:56Z

pandas/core/series.py

+
+    @Appender(_transform_doc)
+    @Appender(generic._shared_docs['transform'] % _shared_doc_kwargs)
+    def transform(self, func, axis=0, *args, **kwargs):


Hmm, generally not sure it's worth changing the actual implementation for docstrings. If this is solely to isolate the various Examples, I'd think it preferable to just have one shared Example docstring that covers Series and DataFrame rather than making code changes like this.

The doc string is currently inherited from NDFrame, but it's not pretty IMO, see here

The issues are:

The current doc string discusses NDFrames, not Series, which is confusing,

the links in the See Also section don't work,

In master, Series.transform's signature differs from those of Series.agg and Series.apply: it lacks the axis parameter that the other two have. This may or may not be a problem (I couldn't actually produce a bug caused by it), but the signature should be consistent among the three methods, and I agree with the current design for agg and apply (i.e. having an axis parameter).

Thoughts? I can remove this if there's no consensus.

I'd say just use substitution for the class name (i.e. Series or DataFrame) and update the See Also links to point to both the Series and DataFrame methods. One of those will obviously be self referencing, but we've done this in other places as well

Ok, I see what you mean. I've made a new commit with common examples, but the See Also can actually be made correct if Series gets its own transform method (which is needed for the signature issue to be resolved).
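The class-name substitution suggested above relies on plain %-formatting of a shared template, as in the `generic._shared_docs['transform'] % _shared_doc_kwargs` line of the diff. A minimal sketch of that idea, where the template wording and the names `_shared_transform_doc` and `render_doc` are hypothetical and not taken from pandas:

```python
# Hypothetical shared template; the wording is illustrative only.
_shared_transform_doc = """
Call ``func`` on self producing a %(klass)s with transformed values.

See Also
--------
%(klass)s.agg : Only perform aggregating type operations.
%(klass)s.apply : Invoke function on a %(klass)s.
"""


def render_doc(klass):
    """Fill the class name into the shared template."""
    return _shared_transform_doc % {'klass': klass}


series_doc = render_doc('Series')
frame_doc = render_doc('DataFrame')
```

Each class then gets See Also entries that point at its own methods, which is the part that a single unsubstituted docstring cannot provide.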

WillAyd · 2018-09-11T01:29:48Z

pandas/core/generic.py

+    func : function, string, list of string/functions or dictionary
+        Function to use for transforming the data. If a function, must either
+        work when passed a %(klass)s or when passed to %(klass)s.apply.
+        The function (or each function in a list/dict) must return an


IMO would be better served in a Notes section along with description of what happens if something is returned that doesn't meet this requirement

Hmm, it is already described in the Raises section, so I could just drop this part here...

WillAyd · 2018-09-13T23:42:57Z

pandas/core/series.py

+    @Appender(generic._shared_docs['transform'] % _shared_doc_kwargs)
+    def transform(self, func, axis=0, *args, **kwargs):
+        # Validate the axis parameter
+        self._get_axis_number(axis)


What's the point of this statement?

This checks that the value passed to axis is 0 or “index”, else an exception is raised. So, a minor check for consistency.
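Seen from the caller's side, the check described above could be exercised roughly as follows. This is a sketch that assumes a pandas version in which Series.transform has the axis parameter added here; the Series `s` is made up:

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# Both spellings of the only valid Series axis are accepted.
by_number = s.transform(lambda x: x + 1, axis=0)
by_name = s.transform(lambda x: x + 1, axis='index')

# Any other value is rejected with a ValueError.
try:
    s.transform(lambda x: x + 1, axis=1)
    rejected = False
except ValueError:
    rejected = True
```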

topper-123 · 2018-09-15T19:00:36Z

all tests have passed. Is this ok?

WillAyd

@datapythonista can you take a look?

datapythonista

The changes look great. I added some comments, mainly related to the original docstring, that I think would make them follow our standards and be more clear.

Besides the comments, I think the previous examples were failing the doctests, and yours pass, so we should be able to stop skipping these docstrings in ci/doctest.sh.

Also, did you run ./scripts/validate_docstrings.py pandas.Series.transform...?

datapythonista · 2018-09-16T10:16:08Z

pandas/core/generic.py

@@ -4545,17 +4545,16 @@ def pipe(self, func, *args, **kwargs):

    Parameters
    ----------
-    func : function, string, dictionary, or list of string/functions
+    func : function, string, list of string/functions or dictionary


We try to use only Python types in this row, and be consistent with the format. func : function, str, list or dict would be the preferred format (or specifying the list types, like list of str, list of function).

Alright, but the list can contain both functions and strings ([np.exp, 'sqrt']). Maybe func : function, string, list of functions and/or strings or dict?

At some point I'd like to validate that all the types provided in that line are from a subset, to avoid typos and inconsistencies. I understand your point about mixing both, but I'd prefer function, str, list or dict (note str, the Python type, instead of string), and then provide the details about mixing strings and everything else in the description.

datapythonista · 2018-09-16T10:17:59Z

pandas/core/generic.py

@@ -4581,38 +4580,61 @@ def pipe(self, func, *args, **kwargs):

    Parameters
    ----------
-    func : callable, string, dictionary, or list of string/callables
-        To apply to column
+    func : function, string, list of string/functions or dictionary


same as before

datapythonista · 2018-09-16T10:19:25Z

pandas/core/generic.py

+    *args
+        Positional arguments to pass to `func`.
+    **kwargs
+        Keyword arguments to pass to `func`.

    Returns
    -------
    transformed : %(klass)s


The transformed name doesn't add a lot of value. Can you just leave the type here, and add a description in the next line. Same for aggregate.

datapythonista · 2018-09-16T10:22:08Z

pandas/core/generic.py


    Returns
    -------
    transformed : %(klass)s

+    Raises
+    ------
+    ValueError: if the returned %(klass)s has a different length than self.


Not sure if sphinx requires a space before the colon. Did you render the docstring? Also, if you can capitalize the sentence If the returned....

Then, I don't find the description very clear on what the user did wrong, and how they should fix the problem. Do you think you can add a bit more information?
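To make the failure mode concrete, here is a small sketch of what a user might do to trigger the ValueError. The data is made up, and the exact error message is deliberately not shown because it is not quoted in this thread:

```python
import pandas as pd

df = pd.DataFrame({'A': range(3), 'B': range(1, 4)})

# An element-wise function keeps the original length, so it is accepted.
ok = df.transform(lambda x: x + 1)

# 'sum' reduces each column to a single value, which transform rejects.
try:
    df.transform('sum')
    raised = False
except ValueError:
    raised = True
```

The fix on the user's side is either to pass a function that returns one value per input row, or to call agg instead when a reduction is what they want.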

datapythonista · 2018-09-16T10:24:02Z

pandas/core/generic.py

+    1  1.000000   2.718282
+    2  1.414214   7.389056
+    3  1.732051  20.085537
+    4  2.000000  54.598150

    See also


See Also should be placed before the examples, and should have a capital A in Also.

datapythonista · 2018-09-16T10:30:19Z

pandas/core/generic.py

-    2000-01-10 -1.366388 -0.614710  0.005378
+                       A         B
+    2000-01-01 -1.143001  1.143001
+    2000-01-02 -0.889001  0.889001


I think this example could be simplified, so users don't need to spend a lot of time understanding what's going on. I think 3 or 4 rows should be enough to illustrate transform. Also, standardizing the values may be a more real-world example, but I don't think anybody is able to do the mental math to compare what they think the function is doing with what we show here. Also, I would use the default index, as using dates seems to have a meaning, and it's misleading.

So, I'd do:

A much shorter DataFrame (e.g. 3 rows)

A much simpler function (e.g. lambda x: x + 1)

Use the default index

Yeah, that's a good point, changed. For the Series, I've also shortened it, but kept s.transform([np.sqrt, np.exp]), as I think that also is quite simple.

datapythonista · 2018-09-16T10:31:46Z

pandas/core/generic.py

+
+    It is only required for the axis specified in the ``axis`` parameter
+    to have the same length for output and for self. The other axis may have a
+    different length:


I don't quite understand what you mean here. We're not specifying the axis parameter in the example.

Code-wise transform is the same as aggregate, except there's a check that the result has the same length as self. I try to bring out this requirement for transform.

I've tried to word the doc string differently to highlight this better.
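A short sketch of the requirement being discussed, using made-up data: the result must keep the length of self along the transformed axis, while the other axis is free to grow.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(3), 'B': range(1, 4)})

# Two functions per column: the index is unchanged, the columns double.
result = df.transform([np.sqrt, np.exp])

same_rows = result.index.equals(df.index)
n_columns = result.shape[1]
```

Here `result` has three rows like `df`, but four columns (one per column/function pair).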

topper-123 · 2018-09-16T23:45:11Z

@datapythonista , I'm not familiar with ci/doctest.sh. Are you saying I should remove the two -transform parameters from that file?

Wrt. other stuff, I've made various changes according to the comments (great comments BTW).

EDIT: BTW, I ran ./scripts/validate_docstrings.py pandas.Series.transform and it caught a few issues, which I've corrected in the latest commit. So if ci/doctest.sh catches issues like that, I assume I should delete the -transform- parameters, right? (Haven't done it atm.)

datapythonista · 2018-09-17T09:21:03Z

@topper-123 at the moment we don't validate docstrings in the CI (there are too many failing, and the script also reports false positives). ci/doctest.sh just runs the doctests in the examples from the CI (so far for Series and DataFrame methods only, I think). As there are still many failing, they are skipped in doctest.sh. So, yes, my point was that as you have now fixed the doctests in those docstrings, we don't need to skip them anymore, and for any future change to the examples, the CI will validate that we don't break them.
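For readers unfamiliar with doctests, the general mechanism can be sketched with the standard library alone. This is not how ci/doctest.sh is implemented (its contents are not shown in this thread); `add_one` is a hypothetical function used only to show that the example in a docstring is executed and compared with the expected output:

```python
import doctest


def add_one(values):
    """
    Add one to every element.

    Examples
    --------
    >>> add_one([0, 1, 2])
    [1, 2, 3]
    """
    return [v + 1 for v in values]


# Parse the docstring into a doctest and run it.
parser = doctest.DocTestParser()
test = parser.get_doctest(
    add_one.__doc__, {'add_one': add_one}, 'add_one', None, 0
)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
```

`results.failed` is zero when the printed output in the docstring matches what the code really returns, which is the property the CI protects once a docstring is no longer skipped.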

datapythonista

Looks great, just added a few comments about the formatting.

datapythonista · 2018-09-17T09:22:51Z

pandas/core/generic.py

@@ -4564,7 +4563,7 @@ def pipe(self, func, *args, **kwargs):

    Returns
    -------
-    aggregated : %(klass)s
+    pandas.%(klass)s


I don't think we prefix Series and DataFrame with pandas. I'd just leave with the class for consistency.

datapythonista · 2018-09-17T09:23:24Z

pandas/core/generic.py

-        - dict of column names -> functions (or list of functions).
+        - string function name
+        - function
+        - list of functions and/or function names


Maybe you can add the example you wrote in the comments? I think that would make it much clearer that we can mix both.

Maybe you can add the example you wrote in the comment?

Not sure I understand what you mean here, could you expand?

Sorry, I meant that we could have an example like [np.exp, 'sqrt'] that you mentioned before, so it's easier to see that you can use both strings and functions together in the same list.
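A sketch of the mixed list mentioned above, with a made-up Series; the string 'sqrt' and the function np.exp are used side by side:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0])

# A function object and a function name in the same list.
result = s.transform([np.exp, 'sqrt'])
```

The result is a DataFrame with one column per entry in the list, named after the function.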

datapythonista · 2018-09-17T09:23:55Z

pandas/core/generic.py


    .. versionadded:: 0.20.0

    Parameters
    ----------
-    func : callable, string, dictionary, or list of string/callables
-        To apply to column
+    func : function, string, list of functions and/or strings or dict


str instead of string

datapythonista · 2018-09-17T09:24:51Z

pandas/core/generic.py

+    *args
+        Positional arguments to pass to `func`.
+    **kwargs
+        Keyword arguments to pass to `func`.


If you prefer, numpydoc also accepts having both in one line:

*args, **kwargs Arguments to pass to `func`.

Alright, I did that, though my own preference would be to not use star arguments and use args=None, kwargs=None instead, to better distinguish what gets passed to func and what doesn't. But that's a whole other discussion, and may be too late now :-)

Just found out that scripts/validate_docstrings.py doesn't accept putting those on the same line, so I've reverted to the previous style.

datapythonista · 2018-09-17T09:25:11Z

pandas/core/generic.py


    Returns
    -------
-    transformed : %(klass)s
+    pandas.%(klass)s


Same as before.

datapythonista · 2018-09-17T09:25:34Z

pandas/core/generic.py

-    Examples
+    Raises
+    ------
+    ValueError : if the returned %(klass)s has a different length than self.


Capital I in if.

datapythonista · 2018-09-17T09:26:05Z

pandas/core/generic.py

-
-    See also
+    pandas.%(klass)s.agg : only perform aggregating type operations
+    pandas.%(klass)s.apply : Invoke function on a Series


No need for pandas.. Capital O in only. Descriptions finishing with period.

datapythonista · 2018-09-17T09:27:23Z

pandas/core/generic.py

-    pandas.%(klass)s.aggregate
-    pandas.%(klass)s.apply
+    >>> df = pd.DataFrame({'A': range(3), 'B': range(1, 4)})
+    >>> df.transform(lambda x: x + 1)


I would display df first before the transform. So, it's immediate to compare the original data and the transformed data, even without reading the constructor of the DataFrame.
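Written out as a script, the suggested ordering (show the input, then the transformed output) would look roughly like this, using the DataFrame from the diff above:

```python
import pandas as pd

df = pd.DataFrame({'A': range(3), 'B': range(1, 4)})
print(df)
#    A  B
# 0  0  1
# 1  1  2
# 2  2  3

result = df.transform(lambda x: x + 1)
print(result)
#    A  B
# 0  1  2
# 1  2  3
# 2  3  4
```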

datapythonista · 2018-09-17T09:27:55Z

pandas/core/generic.py

+    %(klass)s, it is possible to provide several input functions:
+
+    >>> s = pd.Series(range(3))
+    >>> s.transform([np.sqrt, np.exp])


I'd also show s before the transform.

datapythonista

Looks great, I think the new docstrings are very clear and informative. There are a couple of minor things regarding conventions we follow, but otherwise lgtm.

datapythonista · 2018-09-17T13:24:57Z

pandas/core/generic.py


    .. versionadded:: 0.20.0

    Parameters
    ----------
-    func : callable, string, dictionary, or list of string/callables
-        To apply to column
+    func : function, str, list of functions and/or strings or dict


can you leave it as the previous?

datapythonista · 2018-09-17T13:25:18Z

pandas/core/generic.py


    Returns
    -------
-    aggregated : %(klass)s
+    %(klass)s


If you don't mind adding a description to what is returned, that would be great.

The return type is actually a bit more complex than expected:

>>> df = pd.DataFrame({'A': range(5), 'B': 5})
>>> df.agg(['sum', 'mean'])  -> a DataFrame
>>> df.agg('sum')  -> a Series
>>> df.A.agg(['sum', 'mean'])  -> a Series
>>> df.A.agg('sum')  -> a scalar

I suggest we should do a %(return_value)s in the Returns section and add as appropriate in Series.agg and DataFrame.agg appenders.

I'd simply have DataFrame, Series or scalar as the type, and then in the return description explain a bit better what is being returned in each case. Maybe there are better options, but this is what I've been doing, and I personally prefer to not overcomplicate the docstrings with variables, unless it really adds significant value. But that's my opinion, it's your call.
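The four return types listed in this thread can be checked with a short script. This is a sketch for illustration; the variable names are made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(5), 'B': 5})

frame_result = df.agg(['sum', 'mean'])         # list on a DataFrame
series_from_frame = df.agg('sum')              # single name on a DataFrame
series_result = df['A'].agg(['sum', 'mean'])   # list on a Series
scalar_result = df['A'].agg('sum')             # single name on a Series
```

This is why a single fixed type in the Returns section cannot describe every case.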

datapythonista · 2018-09-17T13:26:09Z

pandas/core/generic.py

-
-    See also
+    %(klass)s.agg : Only perform aggregating type operations
+    %(klass)s.apply : Invoke function on a Series


Can you finish these descriptions with a period? I think we need to add this validation to the script, but it's what we've been doing.

pandas/core/frame.py

topper-123 · 2018-09-18T09:24:26Z

I think all is done now.

datapythonista

lgtm. Thanks for all the work on this @topper-123

jreback · 2018-09-18T12:18:11Z

thanks @topper-123

topper-123 force-pushed the transform_docs branch 2 times, most recently from dc30d7a to b9d0dd3 Compare September 8, 2018 21:47

WillAyd requested changes Sep 11, 2018

WillAyd added the Docs label Sep 11, 2018

topper-123 force-pushed the transform_docs branch 3 times, most recently from abd633a to 27bbb72 Compare September 13, 2018 22:48

WillAyd reviewed Sep 13, 2018

topper-123 force-pushed the transform_docs branch from 27bbb72 to 59ca537 Compare September 14, 2018 19:20

topper-123 added 2 commits September 14, 2018 22:23

improve doc string for df.aggregate and df.transform

23609c3

adjusted for comments

650f639

topper-123 force-pushed the transform_docs branch from 59ca537 to 650f639 Compare September 14, 2018 21:24

WillAyd approved these changes Sep 15, 2018

datapythonista requested changes Sep 16, 2018

adjust for comments

4bd8490

datapythonista reviewed Sep 17, 2018

topper-123 force-pushed the transform_docs branch 2 times, most recently from f33e2f9 to 81ca449 Compare September 17, 2018 18:35

topper-123 commented Sep 17, 2018

adjust for more comments

fbe270c

topper-123 force-pushed the transform_docs branch from 81ca449 to fbe270c Compare September 17, 2018 18:40

datapythonista approved these changes Sep 18, 2018

jreback added this to the 0.24.0 milestone Sep 18, 2018

jreback merged commit 654ff52 into pandas-dev:master Sep 18, 2018

aeltanawy pushed a commit to aeltanawy/pandas that referenced this pull request Sep 20, 2018

DOC: improve doc string for .aggregate and .transform (pandas-dev#22641)

bbf119d

topper-123 deleted the transform_docs branch September 20, 2018 21:12

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

DOC: improve doc string for .aggregate and .transform (pandas-dev#22641)

8f144bb
