API: Add DataFrame.assign method #9239

TomAugspurger · 2015-01-13T13:32:24Z

signature: DataFrame.transform(**kwargs)

the keyword is the name of the new column (existing columns are overwritten if there's a name conflict, as in dplyr)
the value is either
- called on self if it's callable. The callable should be a function of 1 argument, the DataFrame being called on.
- inserted otherwise

In [7]: df.head()
Out[7]: 
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

In [8]: (df.query('species == "virginica"')
           .transform(sepal_ratio=lambda x: x.sepal_length / x.sepal_width)
           .head())
Out[8]: 
     sepal_length  sepal_width  petal_length  petal_width    species  \
100           6.3          3.3           6.0          2.5  virginica   
101           5.8          2.7           5.1          1.9  virginica   
102           7.1          3.0           5.9          2.1  virginica   
103           6.3          2.9           5.6          1.8  virginica   
104           6.5          3.0           5.8          2.2  virginica   

     sepal_ratio  
100     1.909091  
101     2.148148  
102     2.366667  
103     2.172414  
104     2.166667

My question now is

How strict should we be on the shape of the transformed DataFrame? Should we do any kind of checking on the index or columns?

shoyer · 2015-01-13T18:54:20Z

pandas/core/frame.py

+        """
+
+        """
+        data = self.copy()


This should do a shallow copy if possible.

we never shallow copy

much more trouble than it's worth
only indexes are shallow

I like the unintentional formatting on that :)

hahha

the old iphone

I misunderstood pandas's data model, presuming that you could do a shallow copy and then update a column without modifying the original (like a dict). I see now that I was mistaken :(.
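To make the copy discussion concrete, here is a minimal sketch of the behaviour a caller sees, whatever kind of copy is made internally: overwriting an existing column through the method leaves the original frame untouched. The toy frame is an assumption for illustration, and the method is called by its final name, assign.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Overwrite the existing column "a" in the returned frame only.
out = df.assign(a=lambda x: x["a"] * 10)

# The original frame still holds the old values.
original_values = df["a"].tolist()
new_values = out["a"].tolist()
```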

jreback · 2015-01-13T23:43:14Z

I like transform as the name; no-one in favor of mutate? any other possibilities?

The only minus against transform is that its use in groupby is somewhat different.

shoyer · 2015-01-13T23:49:57Z

I like mutate better because:

- we already have transform on groupby
- we might want mutate as a grouped operation -- dplyr has it

mrocklin · 2015-01-14T00:24:32Z

If you intend to keep the function pure (as it is currently) then the term mutate might be misleading.

Neither transform nor mutate is very descriptive. It may make sense to seek out a better term, even if it means breaking from tradition.

shoyer · 2015-01-14T00:31:38Z

FWIW, dplyr's mutate is also pure, despite the misleading name.

mrocklin · 2015-01-14T00:32:07Z

le sigh

TomAugspurger · 2015-01-14T01:18:19Z

Augment? Enhance? I'll keep thinking.

jreback · 2015-01-14T14:23:54Z

I think we should go a slightly different route here.

I would change the signature to df.update(*args, **kwargs). Then you can detect if it's the 'new' mode or the 'original' mode (you can simply look at the args/kwargs and figure this out unambiguously).

I would vote to effectively deprecate the original .update mode (of course, ATM this would just show a FutureWarning). This function is really not necessary except in very rare circumstances, and is not well implemented internally.

df.update(A=df.B/df.C) looks like a winner to me. (same idea in that it IS pure, but has a more suggestive name than transform/mutate. Further we can add the same method to groupby).

mrocklin · 2015-01-14T15:19:09Z

That sounds pretty clean.

jorisvandenbossche · 2015-01-14T16:03:13Z

I wanted to say that df.add_column() is descriptive, but ugly name .. but I like update more!
Only, the combination with the existing use seems a bit confusing to users I think (but have to look in more detail)

shoyer · 2015-01-14T16:58:21Z

+1 for update.

I still vote for allowing multiple variables at once :).

shoyer · 2015-01-14T17:45:01Z

OK, another idea: df.assign?

Update is not very useful in pandas, but it is part of the standard mapping API, where it's known as a method that does an in-place operation.

TomAugspurger · 2015-01-16T23:37:26Z

Thanks for the input. My favorite is df.assign, it conveys the meaning well. I could live with update though. I only worry about the cognitive clash with dict.update(), which is inplace.

I'll also allow multiple variables I think. I'm going to do all of the calculations before the assignment so that we don't run into issues with one calculation depending on another, and having the success or failure of the call dependent upon the dict ordering.
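A minimal sketch of the "compute everything first, then assign" idea described above. The helper name assign_sketch is hypothetical and this is not the code from the pull request. It only illustrates why no value can depend on another value passed in the same call.

```python
import pandas as pd


def assign_sketch(df, **kwargs):
    """Hypothetical helper: evaluate every value first, then insert."""
    data = df.copy()
    # Step 1: compute all results against the original frame, so no
    # result can see a column created by another keyword.
    results = {}
    for name, value in kwargs.items():
        results[name] = value(df) if callable(value) else value
    # Step 2: only now insert the results into the copy.
    for name, value in results.items():
        data[name] = value
    return data


df = pd.DataFrame({"a": [1, 2, 3]})
out = assign_sketch(df, b=lambda x: x["a"] * 2, c=lambda x: x["a"] + 1)
```

With this approach, a callable that refers to a column created in the same call (for example c=lambda x: x["b"]) fails with a KeyError regardless of keyword order, which is the predictability being aimed for.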

More tests and docs coming soon.

TomAugspurger · 2015-01-17T14:29:53Z

Updated with some docs and handling multiple assigns. I wasn't sure about the best place for the docs so I threw it in basics.rst. I still need to build them to make sure everything looks good.

I am worried about people hitting subtle bugs with assigning multiple columns in one assign since the order won't be preserved. I've tried to document that.

I've got a few more things to clean up and then I'll ping for review.

TomAugspurger · 2015-01-18T03:20:28Z

Ok, ready for feedback.

Just a summary,

- I went with assign, but update could work
- keyword arguments only (potentially multiple)
- if the value is callable, it's called on self
- if the value is not callable, it's inserted
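The two value rules in this summary can be sketched as follows. The small frame and the column names sepal_ratio and source are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"sepal_length": [5.1, 4.9], "sepal_width": [3.5, 3.0]})

out = df.assign(
    # Callable: called with the DataFrame, its result becomes the column.
    sepal_ratio=lambda x: x["sepal_length"] / x["sepal_width"],
    # Not callable: inserted as-is (a scalar is broadcast to every row).
    source="iris",
)
```

The returned frame has both new columns, and df itself is unchanged.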

sinhrks · 2015-01-18T13:09:18Z

Nice feature, and some points to be considered:

- An inplace option like other functions (though that would mean we cannot create a column named inplace).
- Should Series also have assign, to make a DataFrame, for consistency?
- Better to take care with partial string slicing if the DataFrame has a DatetimeIndex or PeriodIndex, which can output unexpected results.

TomAugspurger · 2015-01-18T13:21:40Z

- I'm -1 on an inplace option. Overall, I think we're discouraging users from using inplace these days. It also kills what I think is the main use of assign: inside a chain of operations.
- Series could have an assign. I didn't include it yet since that would necessarily involve transforming a Series to a DataFrame, which we have the to_frame method for.
- For now I'm taking the approach that people need to be very careful when using assign. I'm not doing any checking of the results to ensure that your computation hasn't caused a reindexing that creates a bunch of NaNs.
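A small sketch of the reindexing pitfall mentioned in the last point, using made-up data: when the assigned value is a Series whose index only partly overlaps the frame's index, it is aligned on the index, and rows without a match silently become NaN.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 2])
# A Series whose index only partly overlaps with df's index.
other = pd.Series([10, 20, 30], index=[1, 2, 3])

# Alignment happens on the index: row 0 has no match in `other`,
# and the value labelled 3 is dropped.
out = df.assign(b=other)
missing = out["b"].isna().tolist()
```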

jreback · 2015-01-18T20:44:58Z

I really think this should be called .update. Adding another function is just confusing.

shoyer · 2015-01-18T21:30:13Z

@jreback I'm all for deprecating .update, but I think it is clearer to give them a distinct name, given that the behavior is different and also different from the update method on dicts.

Another name possibly worth considering is .set.

jreback · 2015-01-18T21:43:50Z

@shoyer we already have way too many update/set/filter etc methods. I think some consolidation is in order. I don't think adding another new method is useful at all.

And you don't actually need to deprecate the current functionality of .update (e.g. when other is a dict-like).

shoyer · 2015-01-18T21:51:14Z

@jreback I agree on the problem but not the solution (in this specific case). Taking the deprecation and eventual removal of the current meaning of update as a given, I would rather have an assign method than an update method, because the latter will always have the confusing association with dict.update (the issue is similar to the name mutate but worse). But this is mostly bikeshedding.

shoyer · 2015-01-18T21:53:57Z

@jreback To follow up on your edit, I am pretty opposed to functions that do very different things depending on how you call them, except as a transitional step. That is poor API design :).

jreback · 2015-01-18T21:54:00Z

@shoyer fair enough, and I DO like .set :)

TomAugspurger · 2015-02-28T13:55:25Z

Travis is running. Will merge in a few hours, assuming no objections.

jreback · 2015-02-28T15:14:45Z

doc/source/dsintro.rst

+   iris = read_csv('data/iris.data')
+   iris.head()
+
+   (iris.assign(sepal_ratio = iris['SepalWidth'] / iris['SepalLength'])


you don't need the parens here

I call .head on the next line.

jreback · 2015-02-28T15:22:10Z

@TomAugspurger lgtm

- some minor doc comments
- pls open another issue (for 0.16.0 hopefully!) to add examples / docs for use of .assign with .groupby(), e.g. df.groupby('A').assign(B = lambda x: x.C+1).max(). I think you need the .assign to interact (and make a copy of the internal self.obj of the grouper), but that might be a bit trickier.

TomAugspurger · 2015-02-28T16:05:40Z

@jreback I'll need to think about how this interacts with groupby. E.g. say we wanted the group-wise mean of the ratio:

In [7]: gr.apply(lambda x: x.assign(r=x.sepal_width / x.sepal_length).mean())
Out[7]: 
            sepal_length  sepal_width  petal_length  petal_width         r
species                                                                   
setosa             5.006        3.428         1.462        0.246  0.684248
versicolor         5.936        2.770         4.260        1.326  0.467680
virginica          6.588        2.974         5.552        2.026  0.453396

Obviously not really what we want; assign returns the entire frame. But assigning and then grouping works fine:

In [10]: df.assign(r=lambda x: x.sepal_length / x.sepal_width).groupby('species').r.mean()
Out[10]: 
species
setosa        1.470188
versicolor    2.160402
virginica     2.230453
Name: r, dtype: float64

I suppose it'd be needed for operations where the assign depends on some group-wise computation. Like df.groupby('species').apply(r_ = lambda x: (x.sepal_length / x.sepal_width * len(x)))?
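One way to sketch an assign that depends on a group-wise computation with existing tools is to compute the group-wise quantity with groupby(...).transform and use it inside a plain DataFrame.assign. This is only an illustration on made-up data, not the groupby-level assign being discussed.

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica"],
    "sepal_length": [5.0, 4.0, 6.0],
    "sepal_width": [2.5, 2.0, 3.0],
})

# Group-wise size, broadcast back to one value per original row.
group_size = df.groupby("species")["sepal_length"].transform("size")

# Ratio scaled by the size of each row's group.
out = df.assign(r=lambda x: x["sepal_length"] / x["sepal_width"] * group_size)
```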

TomAugspurger · 2015-02-28T17:47:04Z

💣s away?

jorisvandenbossche · 2015-03-01T12:59:37Z

doc/source/basics.rst

+      matplotlib.style.use('ggplot')
+   except AttributeError:
+      options.display.mpl_style = 'default'
+


Are these imports still needed? You don't seem to use a plotting example anymore.

Not needed now. Forgot to remove them.

jorisvandenbossche · 2015-03-01T13:06:13Z

I added two small doc remarks, for the rest, bombs away!

Creates a new method for DataFrame, based off dplyr's mutate. Closes pandas-dev#9229

API: Add DataFrame.assign method

TomAugspurger · 2015-03-01T14:51:23Z

Ok, thanks everyone. We can do follow-ups as needed.

shoyer · 2015-03-01T17:54:47Z

Woohoo! Well done 👍

jreback · 2015-03-01T23:04:01Z

very nice @TomAugspurger

small issue: http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html

the plot seems 'too big' size-wise. is this controlled somewhere?

jorisvandenbossche · 2015-03-02T09:02:25Z

You can control this size in the image directive of rst (see http://docutils.sourceforge.net/docs/ref/rst/directives.html#image), the other option is just to include a smaller figure in the sources and include it as it is in the docs (this is what is done for the other images included in _static in the docs I think).

TomAugspurger · 2015-03-02T13:48:33Z

Thanks. I'll follow up tonight.

jreback · 2015-03-09T10:08:34Z

@TomAugspurger I remember you did a PR for the size of the plot in the whatsnew: http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html. Did it get merged?

TomAugspurger · 2015-03-09T13:00:23Z

Yeah: #9575

Still having problems? I can change the actual image size if that pull didn't work.

jreback · 2015-03-09T13:41:34Z

hmm still seems much larger than the rest of the page to me

shoyer · 2015-03-09T17:18:39Z

Yes, that didn't seem to work for some reason.

TomAugspurger · 2015-03-09T17:57:04Z

I opened up #9619 to fix this.

TomAugspurger mentioned this pull request Jan 13, 2015

API/ENH: Add mutate like method to DataFrames #9229

Closed

shoyer reviewed Jan 13, 2015

jreback added the API Design label Jan 13, 2015

jreback added this to the 0.16.0 milestone Jan 13, 2015

TomAugspurger force-pushed the dfTransform branch from 3b40b32 to 24a055f Compare January 17, 2015 14:26

TomAugspurger changed the title ~~API: Add DataFrame.transform method~~ API: Add DataFrame.assign method Jan 18, 2015

TomAugspurger force-pushed the dfTransform branch from 24a055f to d70bf60 Compare January 18, 2015 03:14

jreback reviewed Feb 28, 2015

TomAugspurger force-pushed the dfTransform branch from e9266f5 to 7137212 Compare February 28, 2015 15:57

jorisvandenbossche reviewed Mar 1, 2015

ENH: Add assign method to DataFrame

6a5bd89

Creates a new method for DataFrame, based off dplyr's mutate. Closes pandas-dev#9229

TomAugspurger force-pushed the dfTransform branch from 7137212 to 6a5bd89 Compare March 1, 2015 13:47

TomAugspurger pushed a commit that referenced this pull request Mar 1, 2015

Merge pull request #9239 from TomAugspurger/dfTransform

c88b0ba

API: Add DataFrame.assign method

TomAugspurger merged commit c88b0ba into pandas-dev:master Mar 1, 2015

TomAugspurger deleted the dfTransform branch April 5, 2017 02:06
