API: Add DataFrame.assign method #9239

Merged
merged 1 commit into from
Mar 1, 2015

TomAugspurger
Contributor

Closes #9229

signature: DataFrame.transform(**kwargs)

  • the keyword is the name of the new column (existing columns are overwritten if there's a name conflict, as in dplyr)
  • the value is either
    • called on self if it's callable. The callable should be a function of 1 argument, the DataFrame being called on.
    • inserted otherwise
In [7]: df.head()
Out[7]: 
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

In [8]: (df.query('species == "virginica"')
           .transform(sepal_ratio=lambda x: x.sepal_length / x.sepal_width)
           .head())
Out[8]: 
     sepal_length  sepal_width  petal_length  petal_width    species  \
100           6.3          3.3           6.0          2.5  virginica   
101           5.8          2.7           5.1          1.9  virginica   
102           7.1          3.0           5.9          2.1  virginica   
103           6.3          2.9           5.6          1.8  virginica   
104           6.5          3.0           5.8          2.2  virginica   

     sepal_ratio  
100     1.909091  
101     2.148148  
102     2.366667  
103     2.172414  
104     2.166667  

My question now is

  • How strict should we be on the shape of the transformed DataFrame? Should we do any kind of checking on the index or columns?

"""

"""
data = self.copy()
Member

This should do a shallow copy if possible.

Contributor

we never shallow copy

much more trouble than it's worth
only indexes are shallow

Contributor Author

I like the unintentional formatting on that :)

Contributor

hahha

the old iphone

Member

I misunderstood pandas's data model, presuming that you could do a shallow copy and then update a column without modifying the original (like a dict). I see now that I was mistaken :(.
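
The dict analogy in this comment can be made concrete with a plain-Python sketch. No pandas is involved here; the `frame` dict is a hypothetical stand-in for a table of columns, so this illustrates only the dict behaviour being referred to, not how pandas stores data internally.

```python
import copy

# Hypothetical stand-in for a table: column name -> list of values.
frame = {"a": [1, 2, 3], "b": [4, 5, 6]}

shallow = copy.copy(frame)

# Rebinding a key on the shallow copy leaves the original mapping alone...
shallow["c"] = [7, 8, 9]
assert "c" not in frame

# ...but the existing column lists are shared, so in-place edits leak through.
shallow["a"][0] = 100
assert frame["a"][0] == 100
```

Adding a key is safe on a shallow copy, while mutating a shared value is not, which is the distinction the thread is circling around.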

@jreback jreback added this to the 0.16.0 milestone Jan 13, 2015
@jreback
Contributor

jreback commented Jan 13, 2015

I like transform as the name; no-one in favor of mutate? any other possibilities?

The only minus against transform is that its use in groupby is somewhat different.

@shoyer
Member

shoyer commented Jan 13, 2015

I like mutate better because:

  1. we already have transform on groupby
  2. we might want mutate as grouped operation -- dplyr has it

@mrocklin
Contributor

If you intend to keep the function pure (as it is currently) then the term mutate might be misleading.

Neither transform nor mutate are very descriptive. It may make sense to seek out a better term, even if it means breaking from tradition.

@shoyer
Member

shoyer commented Jan 14, 2015

FWIW, dplyr's mutate is also pure, despite the misleading name.

@mrocklin
Contributor

le sigh

@TomAugspurger
Contributor Author

Augment? Enhance? I'll keep thinking.

@jreback
Contributor

jreback commented Jan 14, 2015

I think we should go a slightly different route here.

I would change the signature to df.update(*args, **kwargs). Then you can detect whether it's the 'new' mode or the 'original' mode (you can simply look at the args/kwargs and figure this out unambiguously).

I would vote to effectively deprecate the original .update mode (of course ATM this would just show a FutureWarning). This function is really not necessary except in very rare circumstances, and is not well implemented internally.

df.update(A=df.B/df.C) looks like a winner to me. (same idea in that it IS pure, but has a more suggestive name than transform/mutate. Further we can add the same method to groupby).

@mrocklin
Contributor

That sounds pretty clean.

@jorisvandenbossche
Member

I wanted to say that df.add_column() is a descriptive but ugly name... but I like update more!
Only, the combination with the existing use might be a bit confusing to users, I think (but I have to look in more detail).

@shoyer
Member

shoyer commented Jan 14, 2015

+1 for update.

I still vote for allowing multiple variables at once :).

@shoyer
Member

shoyer commented Jan 14, 2015

OK, another idea: df.assign?

Update is not very useful in pandas, but it is part of the standard mapping API, where it's known as a method that does an in-place operation.
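
For readers less familiar with the mapping API, here is a minimal plain-Python illustration of the in-place behaviour being described. It uses only the built-in `dict`; the variable names are arbitrary.

```python
d = {"a": 1}

# dict.update mutates the dict it is called on and returns None,
# so it cannot be used in the middle of a method chain.
result = d.update(b=2)

assert result is None
assert d == {"a": 1, "b": 2}
```

A pure method that returns a new object, as proposed in this PR, behaves differently on both counts, which is the source of the naming concern.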

@TomAugspurger
Contributor Author

Thanks for the input. My favorite is df.assign, it conveys the meaning well. I could live with update though. I only worry about the cognitive clash with dict.update(), which is inplace.

I'll also allow multiple variables I think. I'm going to do all of the calculations before the assignment so that we don't run into issues with one calculation depending on another, and having the success or failure of the call dependent upon the dict ordering.

More tests and docs coming soon.
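
The "compute everything first, then assign" idea above can be sketched with a toy model. This is not the code from the PR: `assign_sketch` is a hypothetical helper, and a dict of lists stands in for a DataFrame.

```python
def assign_sketch(columns, **kwargs):
    """Toy model of a pure, two-phase assign on a dict of column lists."""
    # Work on a copy so the caller's data is never modified.
    snapshot = {name: list(values) for name, values in columns.items()}

    # Phase 1: evaluate every value against the same untouched snapshot,
    # so no computation can see the result of another one.
    results = {
        name: value(snapshot) if callable(value) else value
        for name, value in kwargs.items()
    }

    # Phase 2: only now insert (or overwrite) the columns.
    new = dict(snapshot)
    new.update(results)
    return new


frame = {"a": [1, 2, 3]}
out = assign_sketch(
    frame,
    a=lambda d: [v * 10 for v in d["a"]],
    b=lambda d: [v + 1 for v in d["a"]],  # sees the original "a", not the new one
)
```

Because both callables run before anything is inserted, `b` is computed from the original `a` regardless of the order in which the keyword arguments are processed.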

@TomAugspurger
Contributor Author

Updated with some docs and handling multiple assigns. I wasn't sure about the best place for the docs so I threw it in basics.rst. I still need to build them to make sure everything looks good.

I am worried about people hitting subtle bugs with assigning multiple columns in one assign since the order won't be preserved. I've tried to document that.

I've got a few more things to clean up and then I'll ping for review.

@TomAugspurger TomAugspurger changed the title API: Add DataFrame.transform method API: Add DataFrame.assign method Jan 18, 2015
@TomAugspurger
Contributor Author

Ok, ready for feedback.

Just a summary,

  • I went with assign, but update could work
  • keyword arguments only (potentially multiple)
  • if the value is callable, it's called on self
  • if the value is not callable, it's inserted
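
As a rough illustration of the summary above, the following sketch assumes a pandas release that includes `DataFrame.assign`. The two rows are taken from the `df.head()` output earlier in the thread; the `source` column is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"sepal_length": [5.1, 4.9], "sepal_width": [3.5, 3.0]})

out = df.assign(
    # Callable: called with the DataFrame, and the result becomes the column.
    sepal_ratio=lambda x: x.sepal_length / x.sepal_width,
    # Not callable: inserted as-is (a scalar is broadcast to every row).
    source="iris",
)

# The original frame is left untouched; assign returns a new object.
assert "sepal_ratio" not in df.columns
```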

@sinhrks
Member

sinhrks commented Jan 18, 2015

Nice feature, and some points to be considered:

  • An inplace option like other functions have (though as a result we could not create a column named inplace)
  • Should Series also have assign, producing a DataFrame, for consistency?
  • Take care with partial string slicing when the DataFrame has a DatetimeIndex or PeriodIndex, which can output unexpected results.

@TomAugspurger
Contributor Author

  • I'm -1 on an inplace option. Overall, I think we're discouraging users from using inplace these days. It also kills what I think is the main use of assign: inside a chain of operations.
  • Series could have an assign. I didn't include it yet since that would necessarily involve transforming a Series to a DataFrame, which we have the to_frame method for.
  • For now I'm taking the approach that people need to be very careful when using assign. I'm not doing any checking of the results to ensure that your computation hasn't caused a reindexing that creates a bunch of NaNs.
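
The reindexing pitfall mentioned in the last point can be reproduced with a small made-up frame. This sketch assumes a pandas release that includes `DataFrame.assign`; the data and names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})                  # index labels 0, 1, 2
shifted = pd.Series([10, 20, 30], index=[1, 2, 3])   # index labels 1, 2, 3

# The Series is aligned on index labels, not on position: label 0 has no
# match and becomes NaN, while label 3 is silently dropped.
out = df.assign(b=shifted)
```

No error or warning is raised here, which is why the comment asks users to be careful.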

@jreback
Contributor

jreback commented Jan 18, 2015

I really think this should be called .update. Adding another function is just confusing.

@shoyer
Member

shoyer commented Jan 18, 2015

@jreback I'm all for deprecating .update, but I think it is clearer to give them a distinct name, given that the behavior is different and also different from the update method on dicts.

Another name possibly worth considering is .set.

@jreback
Contributor

jreback commented Jan 18, 2015

@shoyer we already have way too many update/set/filter etc methods. I think some consolidation is in order. I don't think adding another new method is useful at all.

And you don't actually need to deprecate the current functionality of .update (e.g. when other is dict-like).

@shoyer
Member

shoyer commented Jan 18, 2015

@jreback I agree on the problem but not the solution (in this specific case). Taking the deprecation and eventual removal of the current meaning of update as a given, I would rather have an assign method than an update method, because the latter will always have the confusing association with dict.update (the issue is similar to the name mutate but worse). But this is mostly bike shedding.

@shoyer
Member

shoyer commented Jan 18, 2015

@jreback To follow up on your edit, I am pretty opposed to functions that do very different things depending on how you call them, except as a transitional step. That is poor API design :).

@jreback
Contributor

jreback commented Jan 18, 2015

@shoyer fair enough, and I DO like .set :)

@TomAugspurger
Contributor Author

Travis is running. Will merge in a few hours, assuming no objections.

iris = read_csv('data/iris.data')
iris.head()

(iris.assign(sepal_ratio = iris['SepalWidth'] / iris['SepalLength'])
Contributor

you don't need the parens here

Contributor Author

I call .head on the next line.

@jreback
Contributor

jreback commented Feb 28, 2015

@TomAugspurger lgtm

  • some minor doc comments
  • pls open another issue (for 0.16.0 hopefully!) to add examples / docs for use of .assign with .groupby(), e.g. df.groupby('A').assign(B=lambda x: x.C+1).max(). I think you need the .assign to interact with the grouper (and make a copy of its internal self.obj), but that might be a bit trickier.

@TomAugspurger
Contributor Author

@jreback I'll need to think about how this interacts with groupby. E.G. say we wanted the group-wise mean of the ratio:

In [7]: gr.apply(lambda x: x.assign(r=x.sepal_width / x.sepal_length).mean())
Out[7]: 
            sepal_length  sepal_width  petal_length  petal_width         r
species                                                                   
setosa             5.006        3.428         1.462        0.246  0.684248
versicolor         5.936        2.770         4.260        1.326  0.467680
virginica          6.588        2.974         5.552        2.026  0.453396

Obviously not really what we want; assign returns the entire frame. But assigning and then grouping works fine:

In [10]: df.assign(r=lambda x: x.sepal_length / x.sepal_width).groupby('species').r.mean()
Out[10]: 
species
setosa        1.470188
versicolor    2.160402
virginica     2.230453
Name: r, dtype: float64

I suppose it'd be needed for operations where the assign depends on some group-wise computation. Like df.groupby('species').apply(r_ = lambda x: (x.sepal_length / x.sepal_width * len(x)))?
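
One possible way to get a group-dependent column without a dedicated groupby method is to broadcast the group-wise quantity back to row level first and then call `assign`. The sketch below is an assumption about how that could look, not something proposed in this thread: it uses the existing `groupby(...).transform("count")` to obtain the group size per row (equal to `len(x)` when there are no missing values), on made-up data.

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica"],
    "sepal_length": [5.0, 4.0, 6.0],
    "sepal_width": [2.5, 2.0, 3.0],
})

# Group size repeated for every row of the group, aligned to df's index.
group_size = df.groupby("species")["sepal_length"].transform("count")

# Row-wise ratio scaled by the size of the row's group.
out = df.assign(r_=lambda x: x.sepal_length / x.sepal_width * group_size)
```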

@TomAugspurger
Contributor Author

💣s away?

matplotlib.style.use('ggplot')
except AttributeError:
options.display.mpl_style = 'default'

Member

Are these imports still needed? You don't seem to use a plotting example anymore.

Contributor Author

Not needed now. Forgot to remove them.

@jorisvandenbossche
Member

I added two small doc remarks, for the rest, bombs away!

Creates a new method for DataFrame, based off dplyr's mutate.
Closes pandas-dev#9229
TomAugspurger pushed a commit that referenced this pull request Mar 1, 2015
API: Add DataFrame.assign method
@TomAugspurger TomAugspurger merged commit c88b0ba into pandas-dev:master Mar 1, 2015
@TomAugspurger
Contributor Author

Ok, thanks everyone. We can do follow-ups as needed.

@shoyer
Member

shoyer commented Mar 1, 2015

Woohoo! Well done 👍

@jreback
Contributor

jreback commented Mar 1, 2015

very nice @TomAugspurger

small issue: http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html

the plot seems 'too big' size-wise. is this controlled somewhere?

@jorisvandenbossche
Member

You can control this size in the image directive of rst (see http://docutils.sourceforge.net/docs/ref/rst/directives.html#image), the other option is just to include a smaller figure in the sources and include it as it is in the docs (this is what is done for the other images included in _static in the docs I think).

@TomAugspurger
Contributor Author

Thanks. I'll follow up tonight.

@jreback
Contributor

jreback commented Mar 9, 2015

@TomAugspurger I remember you did a PR for the size of the plot in the whatsnew: http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html. Did it get merged?

@TomAugspurger
Contributor Author

Yeah: #9575

Still having problems? I can change the actual image size if that pull didn't work.

@jreback
Contributor

jreback commented Mar 9, 2015

hmm still seems much larger than the rest of the page to me

@shoyer
Member

shoyer commented Mar 9, 2015

Yes, that didn't seem to work for some reason.

@TomAugspurger
Contributor Author

I opened up #9619 to fix this.

@TomAugspurger TomAugspurger deleted the dfTransform branch April 5, 2017 02:06
Successfully merging this pull request may close these issues.

API/ENH: Add mutate like method to DataFrames
7 participants