Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ERR: setting a column with a scalar and no index should raise #16823

Closed
jreback opened this issue Jul 4, 2017 · 11 comments · Fixed by #16968
Closed

ERR: setting a column with a scalar and no index should raise #16823

jreback opened this issue Jul 4, 2017 · 11 comments · Fixed by #16968
Labels
Error Reporting Incorrect or improved errors from pandas Indexing Related to indexing on series/frames, not to indexes themselves Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@jreback
Copy link
Contributor

jreback commented Jul 4, 2017

xref #10017 (comment)

This should raise like [5] below.

In [2]: df = DataFrame()

In [3]: df['foo'] = 1

In [4]: df
Out[4]: 
Empty DataFrame
Columns: [foo]
Index: []
In [5]: DataFrame({'foo': 1})
ValueError: If using all scalar values, you must pass an index
I
n [17]: df.loc[1] = 1
\ValueError: cannot set a frame with no defined columns

Even though this is technically valid code (in [4]), IOW the scalar is reindexed to a 0-len index, this immediately loses information This should raise an error.

@jreback jreback added Difficulty Intermediate Error Reporting Incorrect or improved errors from pandas Indexing Related to indexing on series/frames, not to indexes themselves Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jul 4, 2017
@jreback jreback added this to the Next Major Release milestone Jul 4, 2017
@jreback
Copy link
Contributor Author

jreback commented Jul 4, 2017

cc @AwasthiMaddy

@toobaz
Copy link
Member

toobaz commented Jul 17, 2017

I'm not convinced this is a bug... as @jreback says, [3] is valid code, and the boundary with "losing information" is not so obvious. What about

In [2]: s = pd.Series(range(4))

In [3]: s.loc[5:] = 1

? I'm pretty sure we want to allow it. On the other hand, in [5] you are initializing, so it makes sense to require more explicit "instructions" (as in pd.DataFrame({'foo' : 1}, index=[]), which works fine).

I see instead the incoherence with [17], but I think it can be justified (because a DataFrame has a better defined meaning than one with no columns; and because setting an entire column to a scalar is a more usual operation than setting an entire row to a scalar) - and anyway, I would ideally prefer to fix [17] (which should expand the index) than to forbid [3].

@toobaz
Copy link
Member

toobaz commented Oct 12, 2017

@jreback I invite you to reconsider this.

"Fixing" this has really brought no advantage. I wonder whether any real example of "lost information" can be provided, and on the other hand there is an evident loss of generality. It's the equivalent of stating that the expression a=b**0 should produce an error because it always assigns 1 and hence you are "loosing information".
The "fix" has brought us a slightly more complicated code path, a slightly less predictable behavior (as in "I need more words to explain how the thing works"), an incoherence with examples like pd.Series(range(4)).loc[5:] = veryimportantnumber, which doesn't raise, but also like pd.DataFrame(index=[2]).loc[2] = veryimportantnumber, which doesn't raise either, and a violation of the principle of least surprise.

Perhaps most importantly, this can break perfectly valid code written in the past, as in:

df = pd.DataFrame()
for var in ['a', 'b']: # The list can come from some generating function
    df[var] = -1 # For the moment, just create the column

# Add rows and additional columns from existing frame
df = df.join(pd.DataFrame([[1,2], [3,4]], columns=['c', 'd']), how='outer').fillna(-1))

# Replace "-1" with real data...

I probably never wrote such code, but there is no reason to punish any user who did. A warning would be acceptable, but then the examples above should raise a warning too for coherence. Anyway, we have had a full FutureWarning/DeprecationWarning cycle for changes more obvious than this.

@jreback
Copy link
Contributor Author

jreback commented Oct 14, 2017

@toobaz I disagree. This is a clear error, expecting a column to spring into existance and none does. A deprecation cycle is for something that is not necessarily a problem but an api change. this is an error.

@toobaz
Copy link
Member

toobaz commented Oct 15, 2017

@toobaz I disagree. This is a clear error, expecting a column to spring into existance and none does.

It does instead! It is just empty because the index is empty, but the columns attribute does change, and this is precisely what should happen when assigning a scalar to a non-existing column: a column is added, and all existing rows are affected (=none, in this case).

Again, what is the "error" in the snippet I posted above?! Also consider the example in which pandas is used to automatically process every day some DataFrame of daily data, add a constant "date" column and append to a table in HDF5: this should not fail unexpectedly just because on some day there are 0 rows.

This really reminds me of other cases in which it was decided to do things which were more complicated and less general but apparently more "friendly", just to regret it later (I'm clearly referring to the project in general, not to anybody's decisions in particular).

@toobaz
Copy link
Member

toobaz commented Oct 15, 2017

... and anyway, if we really made a rule that any expression of the form

df.*[*] = 3

with 3 not actually stored somewhere is a "loss of data" and should raise, then there would be many places we should "fix" (including the ones I already posted).

@jorisvandenbossche
Copy link
Member

I agree with @toobaz general feeling of "why was this actually needed" without seeing a real benefit.

We can certainly discuss the validity of the original code example that was changed to raise an error (should one be allowed to assign into an empty series/frame?), but about the current merged PR, IMO it:

  • it introduces inconsistencies: why raise for df = pd.DataFrame(); df['a'] = 1, but not for pd.DataFrame(); df.loc[: 'a'] = 1 or for s = pd.Series(); s[:] = 1, which both also 'loose' information.

  • it makes that code does a bit less generalize, and you need to start explicitly checking for emptiness (as @toobaz above gave an example with processing daily data). Assume the following code example:

    df = pd.DataFrame({'a': np.random.randn(4)})
    # make two subset dataframes and add indicator column
    high = df[df['a'] > 3]
    low = df[df['a'] <= 3]
    high['b'] = 'high'
    low['b'] = 'low'
    

    Maybe not the best 'pandonic' code, but is this is something we should disallow? Or require people to start checking for emptiness before doing such things?

@jreback
Copy link
Contributor Author

jreback commented Oct 16, 2017

it introduces inconsistencies: why raise for df = pd.DataFrame(); df['a'] = 1, but not for pd.DataFrame(); df.loc[: 'a'] = 1 or for s = pd.Series(); s[:] = 1, which both also 'loose' information.

these should raise as well.

I am not against your example, but its non-idiomatic. I don't see why it should work (aside from the fact that you are also setting on a copy)

Fundamentally I agree we should align behavior.

So I would be willing to reverse this if we then allow scalars in the constructor w/o an Index. But I am afraid that will lead to many errors and is a worse case.

@toobaz
Copy link
Member

toobaz commented Oct 16, 2017

So I would be willing to reverse this if we then allow scalars in the constructor w/o an Index.

I think that it would be better than nothing (also because, all else equal, for backward compatibility we prefer to allow than to disallow something new).

But frankly speaking, to me things seem relatively simple:

  • at initialization, a DataFrame needs to have an index. You can avoid providing one expliclty only if it can be automatically built for the values you pass (i.e. 1-dimensional objects of the same length, or a single 2-dimensional block of data). Scalars clearly do not satisfy this requirement, so the constructor will raise if passed only scalars (but pd.DataFrame({'A' : range(3), 'B' : 23}) works, which is cool).
  • at assignment, there is already an index, and in particular, when assigning a(n entire) column you know you'll never alter the index. More specifically, when you assign a scalar to a column, you know it will alter all existing rows, which means "none" if the index is empty.

In both cases, scalars/empty indexes represent no exception to the general behavior.

If we take your path, we have to also fix df = pd.DataFrame([[0], [2], [4]]); df.loc[(df[0] % 2).astype(bool), 0] = 3 and countless (literally) other cases in which the number on the right is not stored anywhere. If we take my path, we might have to just fix (allow) pd.DataFrame().loc[1] = 1 (for consistency - certainly not because it is a relevant issue).

@jreback
Copy link
Contributor Author

jreback commented Oct 16, 2017

can u create an issue (and repeat your last there)
to avoid this getting lost

@toobaz
Copy link
Member

toobaz commented Oct 16, 2017

can u create an issue (and repeat your last there)

See #17894 (and #17895 )

jreback added a commit to jreback/pandas that referenced this issue Oct 17, 2017
jreback added a commit that referenced this issue Oct 18, 2017
…h no index ( #16823) (#16968)" (#17902)

* Revert "ERR: Raise ValueError when setting scalars in a dataframe with no index ( #16823) (#16968)"

This reverts commit f9ba6fe.

* TST: expicit test on setting scalars on empty frame

closes #17894
yeemey pushed a commit to yeemey/pandas that referenced this issue Oct 20, 2017
…h no index ( pandas-dev#16823) (pandas-dev#16968)" (pandas-dev#17902)

* Revert "ERR: Raise ValueError when setting scalars in a dataframe with no index ( pandas-dev#16823) (pandas-dev#16968)"

This reverts commit f9ba6fe.

* TST: expicit test on setting scalars on empty frame

closes pandas-dev#17894
alanbato pushed a commit to alanbato/pandas that referenced this issue Nov 10, 2017
…h no index ( pandas-dev#16823) (pandas-dev#16968)" (pandas-dev#17902)

* Revert "ERR: Raise ValueError when setting scalars in a dataframe with no index ( pandas-dev#16823) (pandas-dev#16968)"

This reverts commit f9ba6fe.

* TST: expicit test on setting scalars on empty frame

closes pandas-dev#17894
No-Stream pushed a commit to No-Stream/pandas that referenced this issue Nov 28, 2017
…h no index ( pandas-dev#16823) (pandas-dev#16968)" (pandas-dev#17902)

* Revert "ERR: Raise ValueError when setting scalars in a dataframe with no index ( pandas-dev#16823) (pandas-dev#16968)"

This reverts commit f9ba6fe.

* TST: expicit test on setting scalars on empty frame

closes pandas-dev#17894
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas Indexing Related to indexing on series/frames, not to indexes themselves Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants