BUG: fix DataFrame constructor w named Series #9237
this is an api change i will have to look at this |
Yep, I wasn't sure where to document it in whatsnew. I can expand it. I think the docs are pretty clear that this was the intended behavior. |
How I always understood it, it is the I wanted to say that this is in line with how a DataFrame itself is treated when given to
|
See more discussion on #7893. This AFAIK was a longstanding behavior which I think I broke in 0.12-13. I agree with @jorisvandenbossche here. This should return a column of |
I think we should have a mini section in basics.rst to show conversions of series -> dataframe (and obviously in the 0.16.0 whatsnew) |
OK, so I'll repurpose this PR to make to_frame() and the constructor consistent (both return a column of NaNs) and rewrite the docs. |
Just for the docs, if you want to give a series as input to
|
OK, but of course for just this, using
|
@TomAugspurger can you update the top section. I don't think this is correct. The columns argument will cause a reindex of the input (after the column name is set by name). So if they differ you would get a column of |
@TomAugspurger can you rebase this. and confirm (just change the top section) of the existing and new behavior (may also need an example in the api breaking section of the same). |
@jreback so the expected behavior is In [24]: y = pd.Series(range(5), name=0)
In [25]: pd.DataFrame(y, columns=[1])
Out[25]:
1
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN ? |
IMO, it should be identical to what you get from a dict In [29]: pd.DataFrame({y.name: y.values}, columns=[1]) # y.name is 0
Out[29]:
Empty DataFrame
Columns: [1]
Index: [] So that should also be NaNs? Or no since I didn't provide an index? |
I think [24/25] is the correct return. (0.15.2 does what you have for 24/25). [29] is just wrong (same in master). The canonical way to do this is: This has to be a reindex, e.g. construct the dict using the current name THEN reindex. (it's done like this for a dictionary construction, but it is not entirely correct; the [1] column in your example should still exist, e.g. [24/25] and NOT [29]) Imagine this scenario:
So the columns are selecting (e.g. reindexing) on the passed dictionary keys, if you JUST pass a single Series, then it must be TL;DR; but makes sense? |
Yep, it definitely is a reindex. We'll go with [24]/[25]. |
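The [24]/[25] result agreed on above can also be produced explicitly, without relying on how the constructor treats `columns`. The following is a minimal sketch, not code from this PR: it assumes only the public `Series.to_frame` and `DataFrame.reindex` methods, and the variable names `frame` and `result` are illustrative.

```python
import pandas as pd

y = pd.Series(range(5), name=0)

# to_frame() uses the Series name (0) as the single column label.
frame = y.to_frame()

# Reindexing the columns to [1] drops column 0 and adds column 1.
# Column 1 has no data, so it is filled with NaN; the row index is kept.
result = frame.reindex(columns=[1])
print(result)
```

Spelling out the two steps makes the "set the column name first, THEN reindex" order visible, which is the semantics the thread settles on for the constructor.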
And I'm comfortable saying that In [16]: pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, columns=["C"]) Returning an Empty DataFrame instead of NaNs was a bug (so this isn't an API change) |
- Fixed bug with DataFrame constructor when passed a Series with a
  name and the `columns` keyword argument.
add the issue number here
@TomAugspurger agree with your [16]. thxs |
This is breaking some tests unfortunately. When constructing from a dict of In [2]: s1 = pd.Series([1, 2, 3], name='a')
In [3]: s2 = pd.Series([4, 5, 6], name='b', index=[2, 3, 4])
In [5]: pd.DataFrame({'a': s1, 'b': s2}, columns=['a'])
Out[5]:
a
0 1
1 2
2 3
3 NaN
4 NaN Before it was In [5 master]: pd.DataFrame({'a': s1, 'b': s2}, columns=['a'])
Out[5 master]:
a
0 1
1 2
2 3 Since we filtered out |
@TomAugspurger hmm, I see what you mean. I don't think this should change though. We are only reindexing on the columns (as opposed to creating the frame first from the ENTIRE dict, then reindexing) Hmm I guess that is why the [29] is the way it is. Ok, why don't you investigate and see what makes the most sense. |
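The two orderings being contrasted here (build the frame from the ENTIRE dict and then reindex, versus keep only the wanted keys before building) can be written out explicitly. This is a hedged sketch using only public pandas calls; it does not show what the constructor does internally, and the names `full`, `aligned_then_selected`, and `filtered_first` are illustrative.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], name="a")
s2 = pd.Series([4, 5, 6], name="b", index=[2, 3, 4])
data = {"a": s1, "b": s2}
wanted = ["a"]

# Ordering 1: build from the whole dict, then reindex the columns.
# Rows are aligned on the union of both indexes (0..4), so "a" gets
# NaN at labels 3 and 4 even though "b" is dropped afterwards.
full = pd.DataFrame(data)
aligned_then_selected = full.reindex(columns=wanted)

# Ordering 2: keep only the wanted keys, then build.
# Only s1 contributes to the row index, so it stays 0..2.
filtered_first = pd.DataFrame({k: v for k, v in data.items() if k in wanted})

print(aligned_then_selected)
print(filtered_first)
```

The only difference between the two results is whether the dropped Series still influences the row index, which is the behavior change that broke the tests.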
Is it cool to push this all off to 0.17? I don't have much time today and I need to squash a weird bug in the plot accessors. Is there anything that is obviously broken and should be fixed today? I'd say that this is obviously wrong. In [1]: y = pd.Series(range(5), name=0)
In [2]: z = pd.Series(range(5), name=1)
In [3]: pd.DataFrame(y, columns=[2])
Out[3]:
2
0 0
1 1
2 2
3 3
4 4
In [4]: pd.DataFrame(z, columns=[2])
Out[4]:
Empty DataFrame
Columns: [2]
Index: [] I just don't want to make anything harder down the road. |
sure let's push thx! |
Pushing for now [ci skip] (d4e02d9 to 12d39c6)
@TomAugspurger ought to re-examine this for 0.17.0. pls rebase when you have a chance |
@TomAugspurger what's status? |
Just reread through. I'll have something by the end of next week. |
@TomAugspurger pushing, but if you get to it next week or 2 can put in. |
@TomAugspurger status? |
Won't really have time to work on this in the short term. Better to close or leave open? |
let's close for now. |
closes #7893
Closes #9232
Problem was passing a Series w/ a name to DataFrame w/ the `columns` kwarg.
Before:
After:
There were two intertwined problems
- `if getattr(data, 'name', None):`, which returned False when `data.name` was Falseish (like 0). I now compare it directly against None.
- When `data` has a name and the `columns` kwarg is specified, the constructor returned an Empty DataFrame w/ the column specified in `columns`. Now, we do what's documented:
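As an aside on the first of the two problems, the truthiness pitfall can be reproduced without pandas. The sketch below is an illustration under stated assumptions, not the PR's actual code: `Named`, `has_name_truthy`, and `has_name_explicit` are hypothetical names introduced only for this example.

```python
class Named:
    """Hypothetical stand-in for any object carrying a ``name`` attribute."""

    def __init__(self, name=None):
        self.name = name


def has_name_truthy(obj):
    # Buggy check: a falsy name such as 0 or "" is treated as "no name".
    return bool(getattr(obj, "name", None))


def has_name_explicit(obj):
    # Explicit check: only a missing attribute or None counts as "no name".
    return getattr(obj, "name", None) is not None


for name in (0, 1, "", None):
    obj = Named(name)
    print(repr(name), has_name_truthy(obj), has_name_explicit(obj))
```

A truthiness check like this would explain why a Series named 0 and a Series named 1 could take different code paths, as in the `y`/`z` example earlier in the thread.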