-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DOC: Updated the DataFrame.assign docstring #21917
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, lgtm aside from one small comment
pandas/core/frame.py
Outdated
@@ -3353,38 +3353,39 @@ def assign(self, **kwargs): | |||
|
|||
Examples | |||
-------- | |||
>>> df = pd.DataFrame({'A': range(1, 11), 'B': np.random.randn(10)}) | |||
>>> df = pd.DataFrame({'A': range(1, 11), | |||
... 'B':np.arange(-1.0, 2.0, 0.3)}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs a space between 'B':
and np.arange
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jschendel for spotting this! I'll send a fix right away.
Codecov Report
@@ Coverage Diff @@
## master #21917 +/- ##
==========================================
+ Coverage 92.18% 92.18% +<.01%
==========================================
Files 169 169
Lines 50804 50810 +6
==========================================
+ Hits 46833 46839 +6
Misses 3971 3971
Continue to review full report at Codecov.
|
Thanks for the fix, that's much better than using random. What would you think about simplifying the example even further? I don't think we need 10 rows to show what assign does. And using a "simpler" function than log (e.g. |
I agree with the number of rows, but |
Thanks @datapythonista and @jorisvandenbossche! I'll shorten the table(s) to hold only 2 rows. |
lgtm. @aeltanawy normally don't put 2 commits like this together (e.g. the statsmodels change is already in master) |
pandas/core/frame.py
Outdated
@@ -3353,38 +3353,23 @@ def assign(self, **kwargs): | |||
|
|||
Examples | |||
-------- | |||
>>> df = pd.DataFrame({'A': range(1, 11), 'B': np.random.randn(10)}) | |||
>>> df = pd.DataFrame({'A': range(1, 3), | |||
... 'B': np.arange(-1.0, 2.0, 1.5)}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that there are only few items I think using something like [1, 2, 3]
would be clearer than range(1, 3)
. And same for the other parameter.
pandas/core/frame.py
Outdated
9 10 -0.758542 2.302585 | ||
A B ln_A | ||
0 1 -1.0 0.000000 | ||
1 2 0.5 0.693147 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @jorisvandenbossche that not using lambda
would be better. But I don't think that's possible, given that the column from the dataframe need to be specified. So, unless there is an easy way to avoid lambda, I'd prefer a function where the user can quickly compute the result mentally, and understand faster what happened in the operation. And if it's a real example, even better, I think that makes things even simpler for the reader.
Not sure if this is too complex, but something like that would be better in my opinion:
import pandas as pd
df = pd.DataFrame([('liquid', 100.),
('liquid', 356.73),
('gas', -252.87)],
index=['water', 'mercury', 'hydrogen'],
columns=('state', 'boiling_point_c'))
df.assign(state=lambda x: pd.Categorical(x.state),
boiling_point_f=lambda x: x.boiling_point_c * 9 / 5 + 32)
Ah, yes, missed that the |
Ah! Sorry for that, I messed up my branch when fetching upstream then rebasing. Hopefully all sorted out now. @datapythonista inspired by your example, how about using simply temperatures in Celsius and Fahrenheit indexed with cities? |
Sure, I just gave an example to illustrate my point, and not leave an abstract comment. But feel free to use the example that you find more useful/clear for users. |
I apologize for adding already committed commits, again! Anyone willing to show me what I'm doing wrong? (please let me know if this is not the place to ask) My workflow is as follows:
|
@aeltanawy no problem! Looking at your workflow, the one thing that is normally not needed is third step ( Now, I am not really sure how doing that ended up in having those extra commits, but let's try to solve it. I would first try:
and check if that fixes the commits here in the PR (the above updates the PR with changes that have happened in the meantime in the upstream (the pandas-dev/pandas repo) master branch). If that didn't fix it, another approach is doing a rebase:
(note here you will need to force push because rebasing means rewriting the history of the branch, not simply adding commits to the branch as merging does). If you have more questions, feel free to ask! |
@aeltanawy I fixed the git problems for you. If you simply do a |
Thanks a lot @jorisvandenbossche and @datapythonista !! Now I'm able to push changes without git rejecting it and, hopefully for good, just submitting my commits only. I have simplified the examples even further by including one column in the initial DataFrame. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks very nice now!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks quite good, just couple of ideas to make the examples simpler.
pandas/core/frame.py
Outdated
@@ -3354,38 +3354,23 @@ def assign(self, **kwargs): | |||
|
|||
Examples | |||
-------- | |||
>>> df = pd.DataFrame({'A': range(1, 11), 'B': np.random.randn(10)}) | |||
>>> df = pd.DataFrame({'temp_c': (17.0, 25.0)}, | |||
index=['portland', 'berkeley']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mind displaying df
content after creating it? I think it'll help users understand faster that a new column is created in the assign later on.
Also, do you think it could look better capitalizing the names of the cities?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I second your thoughts! Will send an update soon.
pandas/core/frame.py
Outdated
>>> df.assign(temp_f=newcol) | ||
temp_c temp_f | ||
portland 17.0 62.6 | ||
berkeley 25.0 77.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be we could manually create a Series
with Kelvin degrees and assign it. I personally find it a bit confusing that both examples do the same thing, but in a different way. Or we should say it if you think this is better.
I think a single example:
>>> df.assign(temp_f=lambda...
>>> temp_k=kelvin_series)
could illustrate both cases in a simpler way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe expanding the exposition could clear this up a bit? E.g, "Where the value already exists and is inserted:" -> "Alternatively, the same behavior can be achieved by directly referencing an existing Series or list-like:".
Would be fine with something along the lines of your suggestion too though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm leaning more towards keeping this example but to change its exposition to: "Alternatively, the same behavior can be achieved by directly referencing an existing Series or list-like".
pandas/core/frame.py
Outdated
>>> df.assign(temp_f=newcol) | ||
temp_c temp_f | ||
portland 17.0 62.6 | ||
berkeley 25.0 77.0 | ||
|
||
Where the keyword arguments depend on each other |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this last example is not useful anymore, or do you think it is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This example is still necessary in some form, as it shows a non-obvious Python 3.6+ feature that was recently added (#18852). It could definitely be improved though.
Reversing the setup and starting from a DataFrame with Fahrenheit data:
In [2]: df = pd.DataFrame({'temp_f': [-40, 0, 32, 100, 212]})
In [3]: df
Out[3]:
temp_f
0 -40
1 0
2 32
3 100
4 212
Then what this is section is trying to illustrate is that, in Python 3.6+, you can create multiple columns within the same assign
where one of the columns depends on another one defined within the same assign
.
For example, if you wanted to add both Celsius and Kelvin columns, you could use a formula to build Celsius from Fahrenheit, then use the Celsius column to create the Kelvin one (instead of using a more complex formula referencing Fahrenheit), all from within the same assign
:
In [4]: df.assign(temp_c=lambda df: 5 / 9 * (df['temp_f'] - 32),
temp_k=lambda df: df['temp_c'] + 273.15)
Out[4]:
temp_f temp_c temp_k
0 -40 -40.000000 233.150000
1 0 -17.777778 255.372222
2 32 0.000000 273.150000
3 100 37.777778 310.927778
4 212 100.000000 373.150000
There have also been some more generic questions about how to add multiple columns using assign
(e.g. StackOverflow). It might be beneficial to first give an example similar to the one above, but where you create Kelvin from Fahrenheit, to simply show how to add multiple columns within the same assign
. Then the 3.6+ behavior could be discussed using something similar to my example above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah! I understand now why this example is needed. Let me add a different one that better portrays its usage.
@datapythonista merge when satisfied |
Here are the recent changes to the DataFrame.assign docstring:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes, the example looks great, except for the formatting issues.
Can you run ./scripts/validate_docstrings.py pandas.DataFrame.assign
after you perform the requested changes? that should tell you if all the formatting problems I commented are fixed.
pandas/core/frame.py
Outdated
@@ -3250,48 +3250,34 @@ def assign(self, **kwargs): | |||
|
|||
Examples | |||
-------- | |||
>>> df = pd.DataFrame({'A': range(1, 11), 'B': np.random.randn(10)}) | |||
>>> df = pd.DataFrame({'temp_c': (17.0, 25.0)}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use a list instead of a tuple for the data? I think it's more conventional.
pandas/core/frame.py
Outdated
@@ -3250,48 +3250,34 @@ def assign(self, **kwargs): | |||
|
|||
Examples | |||
-------- | |||
>>> df = pd.DataFrame({'A': range(1, 11), 'B': np.random.randn(10)}) | |||
>>> df = pd.DataFrame({'temp_c': (17.0, 25.0)}, | |||
index=['Portland', 'Berkeley']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is missing ...
at the same level of the >>>
, and the indentation is not correct
pandas/core/frame.py
Outdated
Alternatively, the same behavior can be achieved by directly | ||
referencing an existing Series or list-like: | ||
>>> newcol = df['temp_c'] * 9 / 5 + 32 | ||
>>> df.assign(temp_f=newcol) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd use the expression directly in the assignment instead (i.e. >>> df.assign(temp_f=df['temp_c'] * 9 / 5 + 32)
), but it's your call
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my understanding, this example is to show that you can refer to an existing list or series and assign it to the df. The example above it is already using direct assignment. Do you think it is not necessary to show this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the previous example, the new column is assigned to a callable (which is run with the DataFrame as a parameter). In this example the new column is assigned to a Series. What I'm saying is that instead of saving the Series to newcol, and then assign the new column to the variable newcol, we can simply create the Series as a parameter.
pandas/core/frame.py
Outdated
where one of the columns depends on another one defined within the same | ||
assign: | ||
>>> df.assign(temp_f=lambda x: x['temp_c'] * 9 / 5 + 32, | ||
temp_k=lambda x: (x['temp_f'] + 459.67) * 5 / 9) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as before regarding ...
and indentation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides the previous comments, can you remove the df
in the Returns section (just leave DataFrame in the first line).
And also, add the **
to the kwargs
parameter, and change the type, so the first line is **kwargs : dict of {str: callable or Series}
Thanks!
pandas/core/frame.py
Outdated
Alternatively, the same behavior can be achieved by directly | ||
referencing an existing Series or list-like: | ||
>>> newcol = df['temp_c'] * 9 / 5 + 32 | ||
>>> df.assign(temp_f=newcol) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the previous example, the new column is assigned to a callable (which is run with the DataFrame as a parameter). In this example the new column is assigned to a Series. What I'm saying is that instead of saving the Series to newcol, and then assign the new column to the variable newcol, we can simply create the Series as a parameter.
…-dev#22293) * Fix bug #GH22092 * Update v0.24.0.txt * Update v0.24.0.txt * Update ops.py * Update test_operators.py * Update v0.24.0.txt * Update test_operators.py
Do I need to rebase my changes against master? |
It's like the other way round I'd say. You get the latest changes from pandas |
Pretty sure my branch tree is messed up! For now, I have merged upstream master to my branch and pushed the changes. Waiting for your directions @datapythonista after the checks are finished. |
Git changes look OK. @aeltanawy could you remove Line 24 in 0480f4c
|
Looks like all checks passed. @TomAugspurger, I'll create another pull request for removing -assign so as not to confuse this one which is entirely a DOC case. How does this sound? |
@aeltanawy it's better if you do it in this one, as removing the |
ah! |
Thanks @aeltanawy, merging on green. |
Thanks @aeltanawy! |
Time to celebrate my first contribution! Thanks to everyone :). |
@aeltanawy Thanks a lot! |
Updated the DataFrame.assign docstring example to use np.arange instead of np.random.randn to pass the validation test.