
DOC: Updated the DataFrame.assign docstring #21917

Merged 88 commits into pandas-dev:master on Sep 22, 2018

Conversation

aeltanawy
Contributor

Updated the DataFrame.assign docstring example to use np.arange instead of np.random.randn to pass the validation test.

Member
@jschendel jschendel left a comment

Thanks, lgtm aside from one small comment

@@ -3353,38 +3353,39 @@ def assign(self, **kwargs):

Examples
--------
>>> df = pd.DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
>>> df = pd.DataFrame({'A': range(1, 11),
... 'B':np.arange(-1.0, 2.0, 0.3)})
Member

needs a space between 'B': and np.arange

Contributor Author

Thanks @jschendel for spotting this! I'll send a fix right away.

@jschendel jschendel added the Docs label Jul 15, 2018
@jschendel jschendel added this to the 0.24.0 milestone Jul 15, 2018
@codecov

codecov bot commented Jul 15, 2018

Codecov Report

Merging #21917 into master will increase coverage by <.01%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master   #21917      +/-   ##
==========================================
+ Coverage   92.18%   92.18%   +<.01%     
==========================================
  Files         169      169              
  Lines       50804    50810       +6     
==========================================
+ Hits        46833    46839       +6     
  Misses       3971     3971
Flag       Coverage Δ
#multiple  90.6% <ø> (ø) ⬆️
#single    42.37% <ø> (+0.04%) ⬆️

Impacted Files                      Coverage Δ
pandas/core/frame.py                97.2% <ø> (ø) ⬆️
pandas/core/generic.py              96.67% <0%> (ø) ⬆️
pandas/io/formats/excel.py          97.4% <0%> (ø) ⬆️
pandas/core/arrays/datetimelike.py  95.53% <0%> (+0.02%) ⬆️
pandas/io/parquet.py                73.72% <0%> (+0.68%) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0480f4c...ecfaf47. Read the comment docs.

@datapythonista
Member

Thanks for the fix, that's much better than using random.

What would you think about simplifying the example even further? I don't think we need 10 rows to show what assign does. And using a "simpler" function than log (e.g. lambda x: x ** 2) would probably help users understand the example faster.

@jorisvandenbossche
Member

I agree with the number of rows, but np.log is also nice in that it does not use "lambda" IMO

@aeltanawy
Contributor Author

Thanks @datapythonista and @jorisvandenbossche! I'll shorten the table(s) to hold only 2 rows.

@jreback
Contributor

jreback commented Jul 16, 2018

lgtm. @aeltanawy normally don't put 2 commits like this together (e.g. the statsmodels change is already in master)

@@ -3353,38 +3353,23 @@ def assign(self, **kwargs):

Examples
--------
>>> df = pd.DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
>>> df = pd.DataFrame({'A': range(1, 3),
... 'B': np.arange(-1.0, 2.0, 1.5)})
Member

Now that there are only a few items, I think using something like [1, 2, 3] would be clearer than range(1, 3). And the same for the other parameter.
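
A minimal sketch of the suggested constructor, using the same values the range/np.arange calls above produce:

>>> df = pd.DataFrame({'A': [1, 2],
...                    'B': [-1.0, 0.5]})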

9 10 -0.758542 2.302585
A B ln_A
0 1 -1.0 0.000000
1 2 0.5 0.693147
Member

I agree with @jorisvandenbossche that not using lambda would be better. But I don't think that's possible, given that the column from the dataframe needs to be specified. So, unless there is an easy way to avoid lambda, I'd prefer a function where the user can quickly compute the result mentally and understand faster what happened in the operation. And if it's a real example, even better; I think that makes things even simpler for the reader.

Not sure if this is too complex, but something like that would be better in my opinion:

import pandas as pd

df = pd.DataFrame([('liquid', 100.),
                   ('liquid', 356.73),
                   ('gas', -252.87)],
                  index=['water', 'mercury', 'hydrogen'],
                  columns=('state', 'boiling_point_c'))

df.assign(state=lambda x: pd.Categorical(x.state),
          boiling_point_f=lambda x: x.boiling_point_c * 9 / 5 + 32)
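
For reference, the final assign call in that snippet returns roughly the following DataFrame (boiling_point_f is just boiling_point_c * 9 / 5 + 32; exact float formatting may differ):

           state  boiling_point_c  boiling_point_f
water     liquid           100.00          212.000
mercury   liquid           356.73          674.114
hydrogen     gas          -252.87         -423.166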

@jorisvandenbossche
Member

I agree with @jorisvandenbossche that not using lambda would be better.

Ah, yes, missed that the np.log example was also using lambda .. Indeed not easy to avoid in this case.

@aeltanawy
Contributor Author

@aeltanawy normally don't put 2 commits like this together (e.g. the statsmodels change is already in master)

Ah! Sorry for that, I messed up my branch when fetching upstream then rebasing. Hopefully all sorted out now.

@datapythonista inspired by your example, how about using simply temperatures in Celsius and Fahrenheit indexed with cities?
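
A minimal sketch of that idea (city names and temperatures are placeholders; they happen to match the example that shows up later in the thread):

>>> df = pd.DataFrame({'temp_c': [17.0, 25.0]},
...                   index=['Portland', 'Berkeley'])
>>> df.assign(temp_f=lambda x: x['temp_c'] * 9 / 5 + 32)
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0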

@datapythonista
Member

Sure, I just gave an example to illustrate my point, and not leave an abstract comment. But feel free to use the example that you find more useful/clear for users.

@aeltanawy
Contributor Author

aeltanawy commented Jul 20, 2018

I apologize for adding already committed commits, again! Anyone willing to show me what I'm doing wrong? (please let me know if this is not the place to ask)

My workflow is as follows (see the command sketch below):

  1. edit the script in my local branch (doc)
  2. git add, then git commit
  3. then git pull origin doc
  4. then git push origin doc
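
For reference, that workflow written out as commands (assuming origin points at the contributor's fork and doc is the feature branch; the file path and commit message are illustrative):

git checkout doc                                        # 1. edit the script on the local branch
git add pandas/core/frame.py                            # 2. stage the change
git commit -m "DOC: update DataFrame.assign docstring"
git pull origin doc                                     # 3. usually unnecessary, see the reply below
git push origin doc                                     # 4. update the PR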

@jorisvandenbossche
Member

@aeltanawy no problem!

Looking at your workflow, the one thing that is normally not needed is the third step (git pull origin doc).

Now, I am not really sure how doing that ended up in having those extra commits, but let's try to solve it.

I would first try:

git pull upstream master
git push origin doc

and check if that fixes the commits here in the PR (the above updates the PR with changes that have happened in the meantime in the upstream (the pandas-dev/pandas repo) master branch).

If that didn't fix it, another approach is doing a rebase:

git fetch upstream
git rebase upstream/master
git push origin doc --force

(note here you will need to force push because rebasing means rewriting the history of the branch, not simply adding commits to the branch as merging does).

If you have more questions, feel free to ask!

@datapythonista
Member

datapythonista commented Jul 22, 2018

@aeltanawy I fixed the git problems for you. If you simply do a git pull in your branch, you should be able to continue working with the docstring.

@aeltanawy
Contributor Author

aeltanawy commented Jul 22, 2018

Thanks a lot @jorisvandenbossche and @datapythonista !! Now I'm able to push changes without git rejecting them and, hopefully for good, submitting only my own commits.

I have simplified the examples even further by including one column in the initial DataFrame.

Member
@jorisvandenbossche jorisvandenbossche left a comment

Looks very nice now!

Member
@datapythonista datapythonista left a comment

Looks quite good, just a couple of ideas to make the examples simpler.

@@ -3354,38 +3354,23 @@ def assign(self, **kwargs):

Examples
--------
>>> df = pd.DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
>>> df = pd.DataFrame({'temp_c': (17.0, 25.0)},
index=['portland', 'berkeley'])
Member

Do you mind displaying df content after creating it? I think it'll help users understand faster that a new column is created in the assign later on.

Also, do you think it could look better capitalizing the names of the cities?

Contributor Author

I second your thoughts! Will send an update soon.

>>> df.assign(temp_f=newcol)
temp_c temp_f
portland 17.0 62.6
berkeley 25.0 77.0
Member

Maybe we could manually create a Series with Kelvin degrees and assign it. I personally find it a bit confusing that both examples do the same thing, but in a different way. Or we should say so explicitly, if you think keeping both is better.

I think a single example:

>>> df.assign(temp_f=lambda...
>>>                 temp_k=kelvin_series)

could illustrate both cases in a simpler way.
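
A hedged sketch of what that single combined example could look like, where kelvin_series is a hypothetical, manually created Series as suggested above (output formatting approximate):

>>> kelvin_series = pd.Series([290.15, 298.15], index=['portland', 'berkeley'])
>>> df.assign(temp_f=lambda x: x['temp_c'] * 9 / 5 + 32,
...           temp_k=kelvin_series)
          temp_c  temp_f  temp_k
portland    17.0    62.6  290.15
berkeley    25.0    77.0  298.15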

Member
@jschendel jschendel Jul 24, 2018

Maybe expanding the exposition could clear this up a bit? E.g., "Where the value already exists and is inserted:" -> "Alternatively, the same behavior can be achieved by directly referencing an existing Series or list-like:".

Would be fine with something along the lines of your suggestion too though.

Contributor Author

I'm leaning more towards keeping this example but to change its exposition to: "Alternatively, the same behavior can be achieved by directly referencing an existing Series or list-like".

>>> df.assign(temp_f=newcol)
temp_c temp_f
portland 17.0 62.6
berkeley 25.0 77.0

Where the keyword arguments depend on each other
Member

I think this last example is not useful anymore, or do you think it is?

Member

This example is still necessary in some form, as it shows a non-obvious Python 3.6+ feature that was recently added (#18852). It could definitely be improved though.

Reversing the setup and starting from a DataFrame with Fahrenheit data:

In [2]: df = pd.DataFrame({'temp_f': [-40, 0, 32, 100, 212]})

In [3]: df
Out[3]:
   temp_f
0     -40
1       0
2      32
3     100
4     212

Then what this section is trying to illustrate is that, in Python 3.6+, you can create multiple columns within the same assign where one of the columns depends on another one defined within the same assign.

For example, if you wanted to add both Celsius and Kelvin columns, you could use a formula to build Celsius from Fahrenheit, then use the Celsius column to create the Kelvin one (instead of using a more complex formula referencing Fahrenheit), all from within the same assign:

In [4]: df.assign(temp_c=lambda df: 5 / 9 * (df['temp_f'] - 32), 
                  temp_k=lambda df: df['temp_c'] + 273.15)
Out[4]:
   temp_f      temp_c      temp_k
0     -40  -40.000000  233.150000
1       0  -17.777778  255.372222
2      32    0.000000  273.150000
3     100   37.777778  310.927778
4     212  100.000000  373.150000

There have also been some more generic questions about how to add multiple columns using assign (e.g. StackOverflow). It might be beneficial to first give an example similar to the one above, but where you create Kelvin from Fahrenheit, to simply show how to add multiple columns within the same assign. Then the 3.6+ behavior could be discussed using something similar to my example above.
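
A sketch of that simpler first example: both new columns computed directly from temp_f within one assign, with no dependence between the keyword arguments, producing the same values as Out[4] above:

In [5]: df.assign(temp_c=lambda df: 5 / 9 * (df['temp_f'] - 32),
                  temp_k=lambda df: 5 / 9 * (df['temp_f'] + 459.67))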

Contributor Author

Ah! I understand now why this example is needed. Let me add a different one that better portrays its usage.

@jreback
Contributor

jreback commented Aug 9, 2018

@datapythonista merge when satisfied

@aeltanawy
Contributor Author

Here are the recent changes to the DataFrame.assign docstring (a consolidated sketch follows):

  1. Showed the output of the initial DataFrame.
  2. Rewrote the third example to illustrate the Python 3.6+ feature of assigning multiple columns that depend on each other. The example is derived from @datapythonista's and @jschendel's comments.
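
A consolidated sketch of the resulting docstring example (abridged; exact data and output formatting are whatever the doctest produces):

>>> df = pd.DataFrame({'temp_c': [17.0, 25.0]},
...                   index=['Portland', 'Berkeley'])
>>> df
          temp_c
Portland    17.0
Berkeley    25.0
>>> df.assign(temp_f=lambda x: x['temp_c'] * 9 / 5 + 32,
...           temp_k=lambda x: (x['temp_f'] + 459.67) * 5 / 9)
          temp_c  temp_f  temp_k
Portland    17.0    62.6  290.15
Berkeley    25.0    77.0  298.15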

Member
@datapythonista datapythonista left a comment

Thanks for the changes, the example looks great, except for the formatting issues.

Can you run ./scripts/validate_docstrings.py pandas.DataFrame.assign after you perform the requested changes? That should tell you whether all the formatting problems I commented on are fixed.

@@ -3250,48 +3250,34 @@ def assign(self, **kwargs):

Examples
--------
>>> df = pd.DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
>>> df = pd.DataFrame({'temp_c': (17.0, 25.0)},
Member

Can you use a list instead of a tuple for the data? I think it's more conventional.

@@ -3250,48 +3250,34 @@ def assign(self, **kwargs):

Examples
--------
>>> df = pd.DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
>>> df = pd.DataFrame({'temp_c': (17.0, 25.0)},
index=['Portland', 'Berkeley'])
Member

This is missing ... at the same level as the >>>, and the indentation is not correct.
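
i.e., the quoted lines, formatted the way the doctest expects, would look roughly like:

>>> df = pd.DataFrame({'temp_c': (17.0, 25.0)},
...                   index=['Portland', 'Berkeley'])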

Alternatively, the same behavior can be achieved by directly
referencing an existing Series or list-like:
>>> newcol = df['temp_c'] * 9 / 5 + 32
>>> df.assign(temp_f=newcol)
Member

I'd use the expression directly in the assignment instead (i.e. >>> df.assign(temp_f=df['temp_c'] * 9 / 5 + 32)), but it's your call

Contributor Author

From my understanding, this example is meant to show that you can refer to an existing list or Series and assign it to the DataFrame. The example above it already uses direct assignment. Do you think it is not necessary to show this?

Member

In the previous example, the new column is assigned to a callable (which is run with the DataFrame as a parameter). In this example the new column is assigned to a Series. What I'm saying is that instead of saving the Series to newcol and then assigning the new column from the variable newcol, we can simply create the Series directly as a parameter.

where one of the columns depends on another one defined within the same
assign:
>>> df.assign(temp_f=lambda x: x['temp_c'] * 9 / 5 + 32,
temp_k=lambda x: (x['temp_f'] + 459.67) * 5 / 9)
Member

same as before regarding ... and indentation

Member
@datapythonista datapythonista left a comment

Besides the previous comments, can you remove the df in the Returns section (just leave DataFrame in the first line).

And also, add the ** to the kwargs parameter, and change the type, so the first line is **kwargs : dict of {str: callable or Series}

Thanks!
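
A sketch of what those two requested changes amount to in the docstring's numpydoc sections (only the two signature lines are the requested wording; the descriptions are illustrative):

Parameters
----------
**kwargs : dict of {str: callable or Series}
    The column names are keywords. If the values are callable,
    they are computed on the DataFrame and assigned to the new columns.

Returns
-------
DataFrame
    A new DataFrame with the new columns in addition to
    all the existing columns.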


@aeltanawy
Contributor Author

Do I need to rebase my changes against master?

@datapythonista
Member

It's the other way round, I'd say. You get the latest changes from pandas (git fetch upstream), then merge those changes into your branch (git merge upstream/master), so your branch has your changes on top of the latest pandas rather than the pandas of the time you started the changes, and then you update the PR with them (git push). Besides having your changes on the latest development version of pandas, any update to the PR will make the continuous integration run again, so the checks that failed (unrelated to your changes) will hopefully pass this time.
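
In commands, the suggested update is roughly (assuming the upstream remote is named upstream and the PR branch is doc):

git fetch upstream            # get the latest pandas development history
git merge upstream/master     # put your changes on top of the latest master
git push origin doc           # update the PR; CI runs again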

@aeltanawy
Contributor Author

Pretty sure my branch tree is messed up! For now, I have merged upstream master into my branch and pushed the changes. Waiting for your directions @datapythonista after the checks are finished.

@TomAugspurger
Contributor

Git changes look OK.

@aeltanawy could you remove -assign from the following line? Then the docstring will be tested on each PR.

-k"-assign -axes -combine -isin -itertuples -join -nlargest -nsmallest -nunique -pivot_table -quantile -query -reindex -reindex_axis -replace -round -set_index -stack -to_dict -to_stata"

@aeltanawy
Contributor Author

Looks like all checks passed.

@TomAugspurger, I'll create another pull request for removing -assign so as not to confuse this one which is entirely a DOC case. How does this sound?

@datapythonista
Member

@aeltanawy it's better if you do it in this one, as removing -assign will make the CI validate that your examples already pass the doctests.

@aeltanawy
Contributor Author

Ah! Removed -assign from pandas/ci/doctests.sh.

@datapythonista
Member

Thanks @aeltanawy, merging on green.

@jschendel jschendel merged commit fb784ca into pandas-dev:master Sep 22, 2018
@jschendel
Member

Thanks @aeltanawy!

@aeltanawy
Contributor Author

Time to celebrate my first contribution! Thanks to everyone :).

@jorisvandenbossche
Member

@aeltanawy Thanks a lot!

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018