Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: fix DataFrame constructor w named Series #9237

Conversation

TomAugspurger
Copy link
Contributor

closes #7893

Closes #9232

Problem was passing Series w/ a name to DataFrame w/ the columns kwarg.

Before:

In [55]: x = pd.Series(range(5), name=1)

In [56]: y = pd.Series(range(5), name=0)

In [57]: pd.DataFrame(x, columns=[0])
Out[57]:
Empty DataFrame
Columns: [0]
Index: []

In [58]: pd.DataFrame(x, columns=[1])
Out[58]:
   1
0  0
1  1
2  2
3  3
4  4

In [59]: pd.DataFrame(y, columns=[0])
Out[59]:
   0
0  0
1  1
2  2
3  3
4  4

In [60]: pd.DataFrame(y, columns=[1])
Out[60]:
   1
0  0
1  1
2  2
3  3
4  4

after

In [1]: x = pd.Series(range(5), name=1)

In [2]: y = pd.Series(range(5), name=0)

In [4]: pd.DataFrame(x, columns=[0])
Out[4]: 
   0
0  0
1  1
2  2
3  3
4  4

In [5]: pd.DataFrame(y, columns=[1])
Out[5]: 
   1
0  0
1  1
2  2
3  3
4  4

There were two intertwined problems

  1. we checked if getattr(data, 'name', None):, which returned False when data.name was Falseish (like 0). I now compare it directly against None.
  2. If data has a name and the columns kwarg is specified, the constructor returned an Empty DataFrame w/ the column specified in columns. Now, we do what's documented:

The result will be a DataFrame with the same index as the input Series, and with one column whose name is the original name of the Series (only if no other column name provided).

@jreback
Copy link
Contributor

jreback commented Jan 13, 2015

this is an api change

i will have to look at this

@TomAugspurger
Copy link
Contributor Author

Yep, I wasn't sure where to document it in whatsnew. I can expand it.

I think the docs are pretty clear that this was the intended behavior..

@jorisvandenbossche
Copy link
Member

How I always understood it, it is the y case (pd.DataFrame(y, columns=[1]) not giving an empty dataframe) that was buggy and not the x case (pd.DataFrame(x, columns=[0]) giving an empty frame). It was sometimes odd or surpising behaviour, but giving a series/frame to DataFrame is regarded as a reindex.

I wanted to say that this is in line with how a DataFrame itself is treated when given to DataFrame, but it appears this is still somewhat different (not giving an empty frame, but one with NaNs):

In [42]: x = pd.Series(range(5), name=1)

In [43]: pd.DataFrame(x.to_frame(), columns=[0])
Out[43]: 
    0
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN

@jreback
Copy link
Contributor

jreback commented Jan 13, 2015

see more discusion on #7893

this AFAIK was a longstanding behavior which I think I broke in 0.12-13 . I agree with @jorisvandenbossche here

This should return a column of NaN if the passed name in the columns does not match, e.g. its a reindex. (the bug is its returning an empty frame). In fact the undelrying code should do exactly to_frame() as this handles the edge cases.

@jreback jreback added API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jan 13, 2015
@jreback jreback added this to the 0.16.0 milestone Jan 13, 2015
@jreback
Copy link
Contributor

jreback commented Jan 13, 2015

I think we should have a mini section in basics.rst to show conversions of series -> dataframe (and obviously in the 0.16.0 whatsnew)

@TomAugspurger
Copy link
Contributor Author

OK, so I'll repurpose this PR to make to_frame() and the constructor consistent (both return a column of NaNs) and rewrite the docs.

@jorisvandenbossche
Copy link
Member

Just for the docs, if you want to give a series as input to DataFrame, but do want to overwrite the series name (which is a valid use case), what is the recommended way to do? As first changing the series name, or afterwards renaming the dataframe is somewhat cumbersome.
I suppose giving it as a dict:

In [59]: s = pd.Series([1,2,3], index=[10,11,12], name='test')

In [60]: pd.DataFrame(s)
Out[60]: 
    test
10     1
11     2
12     3

In [61]: pd.DataFrame(s, columns=['col'])
Out[61]: 
Empty DataFrame
Columns: [col]
Index: []

In [62]: pd.DataFrame({'col': s})
Out[62]: 
    col
10    1
11    2
12    3

@jorisvandenbossche
Copy link
Member

OK, but of course for just this, using to_frame is the simplest:

In [63]: s.to_frame(name='col')
Out[63]: 
    col
10    1
11    2
12    3

@jreback
Copy link
Contributor

jreback commented Jan 18, 2015

@TomAugspurger can you update the top section. I don't think this is correct. The columns argument will cause a reindex of the input (after the column name is set by name). So if they different you would get a column of nans. What I think is here is inconsistent with that.

@jreback
Copy link
Contributor

jreback commented Mar 5, 2015

@TomAugspurger can you rebase this. and confirm (just change the top section) of the existing and new behavior (may also need an example in the api breaking section of the same).

@jreback
Copy link
Contributor

jreback commented Mar 7, 2015

@TomAugspurger ?

@TomAugspurger
Copy link
Contributor Author

@jreback so the expected behavior is

In [24]: y = pd.Series(range(5), name=0)

In [25]: pd.DataFrame(y, columns=[1])
Out[25]: 
   1
0  NaN
1  NaN
2  NaN
3  NaN
4  NaN

?

@TomAugspurger
Copy link
Contributor Author

IMO, it should be identical to what you get from a dict

In [29]: pd.DataFrame({y.name: y.values}, columns=[1])  # y.name is 0
Out[29]: 
Empty DataFrame
Columns: [1]
Index: []

So that should also be NaNs? Or no since I didn't provide an index?

@jreback
Copy link
Contributor

jreback commented Mar 8, 2015

@TomAugspurger

I think [24/25] is the correct return. (0.15.2 does what you have for 24/25). [29] is just wrong (same in master).

The canonical way to do this is: y.to_frame(name=1) which in effect is: DataFrame({ name : y }), where name is y.name or the new_name if provided.

This has to be a reindex, e.g. construct the dict using the current name THEN reindex. (its done like this for a dictionary construction, but it is not entirely correct; the [1] column in your example should still exist, eg. [24/25] and NOT [29])

Imagine this scenario:

In [1]: DataFrame({'y1' : Series(range(5),name='foo'), 'y2' : Series(range(5),name='bar') }, columns=['y2'])
Out[1]: 
   y2
0   0
1   1
2   2
3   3
4   4

So the columns are selecting (e.g. reindexing) on the passed dictionary keys, if you JUST pass a single Series, then it must be { s.name : s } so the same thing should happen

TL;DR; but makes sense?

@TomAugspurger
Copy link
Contributor Author

Yep, it definitely is a reindex. We'll go with [24]/[25].

@TomAugspurger
Copy link
Contributor Author

And I'm comfortable saying that

In [16]: pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, columns=["C"])

Returning an Empty DataFrame instead of NaNs was a bug (so this isn't an API change)



- Fixed bug with DataFrame constructor when passed a Series with a
name and the `columns` keyword argument.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue number here

@jreback
Copy link
Contributor

jreback commented Mar 8, 2015

@TomAugspurger agree with your [16]. thxs

@TomAugspurger
Copy link
Contributor Author

This is breaking some tests unfortunately. When constructing from a dict of {col: Series} with differing indicies and specifying columns. Here's what I do now

In [2]: s1 = pd.Series([1, 2, 3], name='a')

In [3]: s2 = pd.Series([4, 5, 6], name='b', index=[2, 3, 4])

In [5]: pd.DataFrame({'a': s1, 'b': s2}, columns=['a'])
Out[5]: 
    a
0   1
1   2
2   3
3 NaN
4 NaN

Before it was

In [5 master]: pd.DataFrame({'a': s1, 'b': s2}, columns=['a'])
Out[5 master]: 
    a
0   1
1   2
2   3

Since we filtered out b before determining what the index was. I think it's good that we clarifying all these constructors and make them consistent, but it's feeling pretty disrupting.

@jreback
Copy link
Contributor

jreback commented Mar 8, 2015

@TomAugspurger hmm, I see what you mean. I don't think this should change though. We are only reindexing on the columns (as opposed to creating the frame first from the ENTIRE dict, then reindexing)

Hmm I guess that is why the [29] is the way it is.

Ok, why don't you investigate and see what makes the most sense.

@TomAugspurger
Copy link
Contributor Author

Is it cool to push this all off to 0.17? I don't have much time today and I need to squash a weird bug in the plot accessors. Is there anything that is obviously broken and should be fixed today? I'd say that this is obviously wrong.

In [1]: y = pd.Series(range(5), name=0)

In [2]: z = pd.Series(range(5), name=1)

In [3]: pd.DataFrame(y, columns=[2])
Out[3]: 
   2
0  0
1  1
2  2
3  3
4  4

In [4]: pd.DataFrame(z, columns=[2])
Out[4]: 
Empty DataFrame
Columns: [2]
Index: []

I jsut don't want to make anything harder down the road.

@jreback
Copy link
Contributor

jreback commented Mar 8, 2015

sure let's push thx!

@jreback jreback modified the milestones: Next Major Release, 0.16.0 Mar 8, 2015
@jreback
Copy link
Contributor

jreback commented May 9, 2015

@TomAugspurger ought to re-examine this for 0.17.0. pls rebase when you have a chance

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.17.0, Next Major Release Jun 2, 2015
@jreback
Copy link
Contributor

jreback commented Jul 28, 2015

@TomAugspurger what's status?

@TomAugspurger
Copy link
Contributor Author

Just reread through. I'll have something by the end of next week.

@jreback jreback modified the milestones: Next Major Release, 0.17.0 Aug 26, 2015
@jreback
Copy link
Contributor

jreback commented Aug 26, 2015

@TomAugspurger pushing, but if you get to it next week or 2 can put in.

@jreback
Copy link
Contributor

jreback commented Oct 18, 2015

@TomAugspurger status?

@TomAugspurger
Copy link
Contributor Author

Won't really have time to work on this in the short term. Better to close or leave open?

@jreback
Copy link
Contributor

jreback commented Oct 19, 2015

let's close for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
3 participants