
DOC: Updated the DataFrame.assign docstring #21917

Merged 88 commits into pandas-dev:master on Sep 22, 2018

Conversation

aeltanawy
Contributor

Updated the DataFrame.assign docstring example to use np.arange instead of np.random.randn to pass the validation test.

Member
@jschendel jschendel left a comment

Thanks, lgtm aside from one small comment

@@ -3353,38 +3353,39 @@ def assign(self, **kwargs):

Examples
--------
>>> df = pd.DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
>>> df = pd.DataFrame({'A': range(1, 11),
... 'B':np.arange(-1.0, 2.0, 0.3)})
Member

needs a space between 'B': and np.arange

Contributor Author

Thanks @jschendel for spotting this! I'll send a fix right away.

@jschendel jschendel added the Docs label Jul 15, 2018
@jschendel jschendel added this to the 0.24.0 milestone Jul 15, 2018
@codecov

codecov bot commented Jul 15, 2018

Codecov Report

Merging #21917 into master will increase coverage by <.01%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master   #21917      +/-   ##
==========================================
+ Coverage   92.18%   92.18%   +<.01%     
==========================================
  Files         169      169              
  Lines       50804    50810       +6     
==========================================
+ Hits        46833    46839       +6     
  Misses       3971     3971
Flag       Coverage Δ
#multiple  90.6% <ø> (ø) ⬆️
#single    42.37% <ø> (+0.04%) ⬆️

Impacted Files                      Coverage Δ
pandas/core/frame.py                97.2% <ø> (ø) ⬆️
pandas/core/generic.py              96.67% <0%> (ø) ⬆️
pandas/io/formats/excel.py          97.4% <0%> (ø) ⬆️
pandas/core/arrays/datetimelike.py  95.53% <0%> (+0.02%) ⬆️
pandas/io/parquet.py                73.72% <0%> (+0.68%) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0480f4c...ecfaf47. Read the comment docs.

@datapythonista
Member

Thanks for the fix, that's much better than using random.

What would you think about simplifying the example even further? I don't think we need 10 rows to show what assign does. And using a "simpler" function than log (e.g. lambda x: x ** 2) would probably help users understand the example faster.

@jorisvandenbossche
Member

I agree with the number of rows, but np.log is also nice in that it does not use "lambda" IMO

@aeltanawy
Contributor Author

Thanks @datapythonista and @jorisvandenbossche! I'll shorten the table(s) to hold only 2 rows.

@jreback
Contributor

jreback commented Jul 16, 2018

lgtm. @aeltanawy normally don't put 2 commits like this together (e.g. the statsmodels change is already in master)

@@ -3353,38 +3353,23 @@ def assign(self, **kwargs):

Examples
--------
>>> df = pd.DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
>>> df = pd.DataFrame({'A': range(1, 3),
... 'B': np.arange(-1.0, 2.0, 1.5)})
Member

Now that there are only a few items, I think using something like [1, 2, 3] would be clearer than range(1, 3). And the same for the other parameter.
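
A minimal sketch of the suggested constructor, using the same values the range/np.arange calls above produce:

>>> df = pd.DataFrame({'A': [1, 2],
...                    'B': [-1.0, 0.5]})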

9 10 -0.758542 2.302585
A B ln_A
0 1 -1.0 0.000000
1 2 0.5 0.693147
Member

I agree with @jorisvandenbossche that not using lambda would be better. But I don't think that's possible, given that the column from the dataframe needs to be specified. So, unless there is an easy way to avoid lambda, I'd prefer a function where the user can quickly compute the result mentally and understand faster what happened in the operation. And if it's a real example, even better; I think that makes things even simpler for the reader.

Not sure if this is too complex, but something like that would be better in my opinion:

import pandas as pd

df = pd.DataFrame([('liquid', 100.),
                   ('liquid', 356.73),
                   ('gas', -252.87)],
                  index=['water', 'mercury', 'hydrogen'],
                  columns=('state', 'boiling_point_c'))

df.assign(state=lambda x: pd.Categorical(x.state),
          boiling_point_f=lambda x: x.boiling_point_c * 9 / 5 + 32)
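
For reference, the final assign call in that snippet returns roughly the following DataFrame (boiling_point_f is just boiling_point_c * 9 / 5 + 32; exact float formatting may differ):

           state  boiling_point_c  boiling_point_f
water     liquid           100.00          212.000
mercury   liquid           356.73          674.114
hydrogen     gas          -252.87         -423.166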

@jorisvandenbossche
Member

I agree with @jorisvandenbossche that not using lambda would be better.

Ah, yes, missed that the np.log example was also using lambda .. Indeed not easy to avoid in this case.

@aeltanawy
Contributor Author

@aeltanawy normally don't put 2 commits like this together (e.g. the statsmodels change is already in master)

Ah! Sorry for that, I messed up my branch when fetching upstream then rebasing. Hopefully all sorted out now.

@datapythonista inspired by your example, how about using simply temperatures in Celsius and Fahrenheit indexed with cities?
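
A minimal sketch of that idea (city names and temperatures are placeholders; they happen to match the example that shows up later in the thread):

>>> df = pd.DataFrame({'temp_c': [17.0, 25.0]},
...                   index=['Portland', 'Berkeley'])
>>> df.assign(temp_f=lambda x: x['temp_c'] * 9 / 5 + 32)
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0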

@datapythonista
Member

Sure, I just gave an example to illustrate my point, and not leave an abstract comment. But feel free to use the example that you find more useful/clear for users.

@aeltanawy
Contributor Author

aeltanawy commented Jul 20, 2018

I apologize for adding already committed commits, again! Anyone willing to show me what I'm doing wrong? (please let me know if this is not the place to ask)

My workflow is as follows (see the command sketch below):

  1. edit the script in my local branch (doc)
  2. git add, then git commit
  3. then git pull origin doc
  4. then git push origin doc
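
For reference, that workflow written out as commands (assuming origin points at the contributor's fork and doc is the feature branch; the file path and commit message are illustrative):

git checkout doc                                        # 1. edit the script on the local branch
git add pandas/core/frame.py                            # 2. stage the change
git commit -m "DOC: update DataFrame.assign docstring"
git pull origin doc                                     # 3. usually unnecessary, see the reply below
git push origin doc                                     # 4. update the PR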

@jorisvandenbossche
Member

@aeltanawy no problem!

Looking at your workflow, the one thing that is normally not needed is the third step (git pull origin doc).

Now, I am not really sure how doing that ended up in having those extra commits, but let's try to solve it.

I would first try:

git pull upstream master
git push origin doc

and check if that fixes the commits here in the PR (the above updates the PR with changes that have happened in the meantime in the upstream (the pandas-dev/pandas repo) master branch).

If that didn't fix it, another approach is doing a rebase:

git fetch upstream
git rebase upstream/master
git push origin doc --force

(note here you will need to force push because rebasing means rewriting the history of the branch, not simply adding commits to the branch as merging does).

If you have more questions, feel free to ask!

@datapythonista
Member

datapythonista commented Jul 22, 2018

@aeltanawy I fixed the git problems for you. If you simply do a git pull in your branch, you should be able to continue working with the docstring.

@aeltanawy
Contributor Author

aeltanawy commented Jul 22, 2018

Thanks a lot @jorisvandenbossche and @datapythonista !! Now I'm able to push changes without git rejecting them and, hopefully for good, submitting only my own commits.

I have simplified the examples even further by including one column in the initial DataFrame.

Member
@jorisvandenbossche jorisvandenbossche left a comment

Looks very nice now!

Member
@datapythonista datapythonista left a comment

Looks quite good, just a couple of ideas to make the examples simpler.

@@ -3354,38 +3354,23 @@ def assign(self, **kwargs):

Examples
--------
>>> df = pd.DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
>>> df = pd.DataFrame({'temp_c': (17.0, 25.0)},
index=['portland', 'berkeley'])
Member

Do you mind displaying df content after creating it? I think it'll help users understand faster that a new column is created in the assign later on.

Also, do you think it could look better capitalizing the names of the cities?

Contributor Author

I second your thoughts! Will send an update soon.

>>> df.assign(temp_f=newcol)
temp_c temp_f
portland 17.0 62.6
berkeley 25.0 77.0
Member

Maybe we could manually create a Series with Kelvin degrees and assign it. I personally find it a bit confusing that both examples do the same thing, but in a different way. Or we should say so explicitly, if you think keeping both is better.

I think a single example:

>>> df.assign(temp_f=lambda...
>>>                 temp_k=kelvin_series)

could illustrate both cases in a simpler way.
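
A hedged sketch of what that single combined example could look like, where kelvin_series is a hypothetical, manually created Series as suggested above (output formatting approximate):

>>> kelvin_series = pd.Series([290.15, 298.15], index=['portland', 'berkeley'])
>>> df.assign(temp_f=lambda x: x['temp_c'] * 9 / 5 + 32,
...           temp_k=kelvin_series)
          temp_c  temp_f  temp_k
portland    17.0    62.6  290.15
berkeley    25.0    77.0  298.15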

Member
@jschendel jschendel Jul 24, 2018

Maybe expanding the exposition could clear this up a bit? E.g., "Where the value already exists and is inserted:" -> "Alternatively, the same behavior can be achieved by directly referencing an existing Series or list-like:".

Would be fine with something along the lines of your suggestion too though.

Contributor Author

I'm leaning more towards keeping this example but to change its exposition to: "Alternatively, the same behavior can be achieved by directly referencing an existing Series or list-like".

>>> df.assign(temp_f=newcol)
temp_c temp_f
portland 17.0 62.6
berkeley 25.0 77.0

Where the keyword arguments depend on each other
Member

I think this last example is not useful anymore, or do you think it is?

Member

This example is still necessary in some form, as it shows a non-obvious Python 3.6+ feature that was recently added (#18852). It could definitely be improved though.

Reversing the setup and starting from a DataFrame with Fahrenheit data:

In [2]: df = pd.DataFrame({'temp_f': [-40, 0, 32, 100, 212]})

In [3]: df
Out[3]:
   temp_f
0     -40
1       0
2      32
3     100
4     212

Then what this section is trying to illustrate is that, in Python 3.6+, you can create multiple columns within the same assign where one of the columns depends on another one defined within the same assign.

For example, if you wanted to add both Celsius and Kelvin columns, you could use a formula to build Celsius from Fahrenheit, then use the Celsius column to create the Kelvin one (instead of using a more complex formula referencing Fahrenheit), all from within the same assign:

In [4]: df.assign(temp_c=lambda df: 5 / 9 * (df['temp_f'] - 32), 
                  temp_k=lambda df: df['temp_c'] + 273.15)
Out[4]:
   temp_f      temp_c      temp_k
0     -40  -40.000000  233.150000
1       0  -17.777778  255.372222
2      32    0.000000  273.150000
3     100   37.777778  310.927778
4     212  100.000000  373.150000

There have also been some more generic questions about how to add multiple columns using assign (e.g. StackOverflow). It might be beneficial to first give an example similar to the one above, but where you create Kelvin from Fahrenheit, to simply show how to add multiple columns within the same assign. Then the 3.6+ behavior could be discussed using something similar to my example above.
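
A sketch of that simpler first example: both new columns computed directly from temp_f within one assign, with no dependence between the keyword arguments, producing the same values as Out[4] above:

In [5]: df.assign(temp_c=lambda df: 5 / 9 * (df['temp_f'] - 32),
                  temp_k=lambda df: 5 / 9 * (df['temp_f'] + 459.67))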

Contributor Author

Ah! I understand now why this example is needed. Let me add a different one that better portrays its usage.

@jreback
Contributor

jreback commented Aug 9, 2018

@datapythonista merge when satisfied

@aeltanawy
Contributor Author

Here are the recent changes to the DataFrame.assign docstring (a consolidated sketch follows):

  1. Showed the output of the initial DataFrame.
  2. Rewrote the third example to illustrate the Python 3.6+ feature of assigning multiple columns that depend on each other. The example is derived from @datapythonista's and @jschendel's comments.
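
A consolidated sketch of the resulting docstring example (abridged; exact data and output formatting are whatever the doctest produces):

>>> df = pd.DataFrame({'temp_c': [17.0, 25.0]},
...                   index=['Portland', 'Berkeley'])
>>> df
          temp_c
Portland    17.0
Berkeley    25.0
>>> df.assign(temp_f=lambda x: x['temp_c'] * 9 / 5 + 32,
...           temp_k=lambda x: (x['temp_f'] + 459.67) * 5 / 9)
          temp_c  temp_f  temp_k
Portland    17.0    62.6  290.15
Berkeley    25.0    77.0  298.15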

Member
@datapythonista datapythonista left a comment

Thanks for the changes, the example looks great, except for the formatting issues.

Can you run ./scripts/validate_docstrings.py pandas.DataFrame.assign after you perform the requested changes? That should tell you whether all the formatting problems I commented on are fixed.

@@ -3250,48 +3250,34 @@ def assign(self, **kwargs):

Examples
--------
>>> df = pd.DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
>>> df = pd.DataFrame({'temp_c': (17.0, 25.0)},
Member

Can you use a list instead of a tuple for the data? I think it's more conventional.

@@ -3250,48 +3250,34 @@ def assign(self, **kwargs):

Examples
--------
>>> df = pd.DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
>>> df = pd.DataFrame({'temp_c': (17.0, 25.0)},
index=['Portland', 'Berkeley'])
Member

This is missing ... at the same level as the >>>, and the indentation is not correct.
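
i.e., the quoted lines, formatted the way the doctest expects, would look roughly like:

>>> df = pd.DataFrame({'temp_c': (17.0, 25.0)},
...                   index=['Portland', 'Berkeley'])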

Alternatively, the same behavior can be achieved by directly
referencing an existing Series or list-like:
>>> newcol = df['temp_c'] * 9 / 5 + 32
>>> df.assign(temp_f=newcol)
Member

I'd use the expression directly in the assignment instead (i.e. >>> df.assign(temp_f=df['temp_c'] * 9 / 5 + 32)), but it's your call

Contributor Author

From my understanding, this example is meant to show that you can refer to an existing list or Series and assign it to the DataFrame. The example above it already uses direct assignment. Do you think it is not necessary to show this?

Member

In the previous example, the new column is assigned to a callable (which is run with the DataFrame as a parameter). In this example the new column is assigned to a Series. What I'm saying is that instead of saving the Series to newcol and then assigning the new column from the variable newcol, we can simply create the Series directly as a parameter.

where one of the columns depends on another one defined within the same
assign:
>>> df.assign(temp_f=lambda x: x['temp_c'] * 9 / 5 + 32,
temp_k=lambda x: (x['temp_f'] + 459.67) * 5 / 9)
Member

same as before regarding ... and indentation

Member
@datapythonista datapythonista left a comment

Besides the previous comments, can you remove the df in the Returns section (just leave DataFrame in the first line).

And also, add the ** to the kwargs parameter, and change the type, so the first line is **kwargs : dict of {str: callable or Series}

Thanks!
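
A sketch of what those two requested changes amount to in the docstring's numpydoc sections (only the two signature lines are the requested wording; the descriptions are illustrative):

Parameters
----------
**kwargs : dict of {str: callable or Series}
    The column names are keywords. If the values are callable,
    they are computed on the DataFrame and assigned to the new columns.

Returns
-------
DataFrame
    A new DataFrame with the new columns in addition to
    all the existing columns.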


@aeltanawy
Contributor Author

Do I need to rebase my changes against master?

@datapythonista
Member

It's the other way round, I'd say. You get the latest changes from pandas (git fetch upstream), then merge those changes into your branch (git merge upstream/master), so your branch has your changes on top of the latest pandas rather than the pandas of the time you started the changes, and then you update the PR with them (git push). Besides having your changes on the latest development version of pandas, any update to the PR will make the continuous integration run again, so the checks that failed (unrelated to your changes) will hopefully pass this time.
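
In commands, the suggested update is roughly (assuming the upstream remote is named upstream and the PR branch is doc):

git fetch upstream            # get the latest pandas development history
git merge upstream/master     # put your changes on top of the latest master
git push origin doc           # update the PR; CI runs again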

@aeltanawy
Contributor Author

Pretty sure my branch tree is messed up! For now, I have merged upstream master into my branch and pushed the changes. Waiting for your directions @datapythonista after the checks are finished.

@TomAugspurger
Contributor

Git changes look OK.

@aeltanawy could you remove -assign from the following line? Then the docstring will be tested on each PR.

-k"-assign -axes -combine -isin -itertuples -join -nlargest -nsmallest -nunique -pivot_table -quantile -query -reindex -reindex_axis -replace -round -set_index -stack -to_dict -to_stata"

@aeltanawy
Contributor Author

Looks like all checks passed.

@TomAugspurger, I'll create another pull request for removing -assign so as not to confuse this one which is entirely a DOC case. How does this sound?

@datapythonista
Member

@aeltanawy it's better if you do it in this one, as removing -assign will make the CI validate that your examples already pass the doctests.

@aeltanawy
Contributor Author

Ah! Removed -assign from pandas/ci/doctests.sh.

@datapythonista
Member

Thanks @aeltanawy, merging on green.

@jschendel jschendel merged commit fb784ca into pandas-dev:master Sep 22, 2018
@jschendel
Member

Thanks @aeltanawy!

@aeltanawy
Contributor Author

Time to celebrate my first contribution! Thanks to everyone :).

@jorisvandenbossche
Member

@aeltanawy Thanks a lot!

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018