ERR: setting a column with a scalar and no index should raise #16823

jreback · 2017-07-04T12:41:53Z

This should raise like [5] below.

In [2]: df = DataFrame()

In [3]: df['foo'] = 1

In [4]: df
Out[4]: 
Empty DataFrame
Columns: [foo]
Index: []

In [5]: DataFrame({'foo': 1})
ValueError: If using all scalar values, you must pass an index
I
n [17]: df.loc[1] = 1
\ValueError: cannot set a frame with no defined columns

Even though this is technically valid code (in [4]), IOW the scalar is reindexed to a 0-len index, this immediately loses information This should raise an error.

The text was updated successfully, but these errors were encountered:

jreback · 2017-07-04T12:42:13Z

cc @AwasthiMaddy

toobaz · 2017-07-17T05:28:05Z

I'm not convinced this is a bug... as @jreback says, [3] is valid code, and the boundary with "losing information" is not so obvious. What about

In [2]: s = pd.Series(range(4))

In [3]: s.loc[5:] = 1

? I'm pretty sure we want to allow it. On the other hand, in [5] you are initializing, so it makes sense to require more explicit "instructions" (as in pd.DataFrame({'foo' : 1}, index=[]), which works fine).

I see instead the incoherence with [17], but I think it can be justified (because a DataFrame has a better defined meaning than one with no columns; and because setting an entire column to a scalar is a more usual operation than setting an entire row to a scalar) - and anyway, I would ideally prefer to fix [17] (which should expand the index) than to forbid [3].

…ex ( #16823) (#16968)

toobaz · 2017-10-12T16:30:11Z

@jreback I invite you to reconsider this.

"Fixing" this has really brought no advantage. I wonder whether any real example of "lost information" can be provided, and on the other hand there is an evident loss of generality. It's the equivalent of stating that the expression a=b**0 should produce an error because it always assigns 1 and hence you are "loosing information".
The "fix" has brought us a slightly more complicated code path, a slightly less predictable behavior (as in "I need more words to explain how the thing works"), an incoherence with examples like pd.Series(range(4)).loc[5:] = veryimportantnumber, which doesn't raise, but also like pd.DataFrame(index=[2]).loc[2] = veryimportantnumber, which doesn't raise either, and a violation of the principle of least surprise.

Perhaps most importantly, this can break perfectly valid code written in the past, as in:

df = pd.DataFrame()
for var in ['a', 'b']: # The list can come from some generating function
    df[var] = -1 # For the moment, just create the column

# Add rows and additional columns from existing frame
df = df.join(pd.DataFrame([[1,2], [3,4]], columns=['c', 'd']), how='outer').fillna(-1))

# Replace "-1" with real data...

I probably never wrote such code, but there is no reason to punish any user who did. A warning would be acceptable, but then the examples above should raise a warning too for coherence. Anyway, we have had a full FutureWarning/DeprecationWarning cycle for changes more obvious than this.

jreback · 2017-10-14T21:47:12Z

@toobaz I disagree. This is a clear error, expecting a column to spring into existance and none does. A deprecation cycle is for something that is not necessarily a problem but an api change. this is an error.

toobaz · 2017-10-15T07:11:45Z

@toobaz I disagree. This is a clear error, expecting a column to spring into existance and none does.

It does instead! It is just empty because the index is empty, but the columns attribute does change, and this is precisely what should happen when assigning a scalar to a non-existing column: a column is added, and all existing rows are affected (=none, in this case).

Again, what is the "error" in the snippet I posted above?! Also consider the example in which pandas is used to automatically process every day some DataFrame of daily data, add a constant "date" column and append to a table in HDF5: this should not fail unexpectedly just because on some day there are 0 rows.

This really reminds me of other cases in which it was decided to do things which were more complicated and less general but apparently more "friendly", just to regret it later (I'm clearly referring to the project in general, not to anybody's decisions in particular).

toobaz · 2017-10-15T07:31:46Z

... and anyway, if we really made a rule that any expression of the form

df.*[*] = 3

with 3 not actually stored somewhere is a "loss of data" and should raise, then there would be many places we should "fix" (including the ones I already posted).

jorisvandenbossche · 2017-10-16T09:04:09Z

I agree with @toobaz general feeling of "why was this actually needed" without seeing a real benefit.

We can certainly discuss the validity of the original code example that was changed to raise an error (should one be allowed to assign into an empty series/frame?), but about the current merged PR, IMO it:

it introduces inconsistencies: why raise for df = pd.DataFrame(); df['a'] = 1, but not for pd.DataFrame(); df.loc[: 'a'] = 1 or for s = pd.Series(); s[:] = 1, which both also 'loose' information.
it makes that code does a bit less generalize, and you need to start explicitly checking for emptiness (as @toobaz above gave an example with processing daily data). Assume the following code example:
```
df = pd.DataFrame({'a': np.random.randn(4)})
# make two subset dataframes and add indicator column
high = df[df['a'] > 3]
low = df[df['a'] <= 3]
high['b'] = 'high'
low['b'] = 'low'
```
Maybe not the best 'pandonic' code, but is this is something we should disallow? Or require people to start checking for emptiness before doing such things?

jreback · 2017-10-16T12:09:13Z

it introduces inconsistencies: why raise for df = pd.DataFrame(); df['a'] = 1, but not for pd.DataFrame(); df.loc[: 'a'] = 1 or for s = pd.Series(); s[:] = 1, which both also 'loose' information.

these should raise as well.

I am not against your example, but its non-idiomatic. I don't see why it should work (aside from the fact that you are also setting on a copy)

Fundamentally I agree we should align behavior.

So I would be willing to reverse this if we then allow scalars in the constructor w/o an Index. But I am afraid that will lead to many errors and is a worse case.

toobaz · 2017-10-16T12:58:22Z

So I would be willing to reverse this if we then allow scalars in the constructor w/o an Index.

I think that it would be better than nothing (also because, all else equal, for backward compatibility we prefer to allow than to disallow something new).

But frankly speaking, to me things seem relatively simple:

at initialization, a DataFrame needs to have an index. You can avoid providing one expliclty only if it can be automatically built for the values you pass (i.e. 1-dimensional objects of the same length, or a single 2-dimensional block of data). Scalars clearly do not satisfy this requirement, so the constructor will raise if passed only scalars (but pd.DataFrame({'A' : range(3), 'B' : 23}) works, which is cool).
at assignment, there is already an index, and in particular, when assigning a(n entire) column you know you'll never alter the index. More specifically, when you assign a scalar to a column, you know it will alter all existing rows, which means "none" if the index is empty.

In both cases, scalars/empty indexes represent no exception to the general behavior.

If we take your path, we have to also fix df = pd.DataFrame([[0], [2], [4]]); df.loc[(df[0] % 2).astype(bool), 0] = 3 and countless (literally) other cases in which the number on the right is not stored anywhere. If we take my path, we might have to just fix (allow) pd.DataFrame().loc[1] = 1 (for consistency - certainly not because it is a relevant issue).

jreback · 2017-10-16T13:04:53Z

can u create an issue (and repeat your last there)
to avoid this getting lost

…ex ( pandas-dev#16823) (pandas-dev#16968)

toobaz · 2017-10-16T15:07:12Z

can u create an issue (and repeat your last there)

See #17894 (and #17895 )

…h no index ( pandas-dev#16823) (pandas-dev#16968)" This reverts commit f9ba6fe.

…h no index ( #16823) (#16968)" (#17902) * Revert "ERR: Raise ValueError when setting scalars in a dataframe with no index ( #16823) (#16968)" This reverts commit f9ba6fe. * TST: expicit test on setting scalars on empty frame closes #17894

…h no index ( pandas-dev#16823) (pandas-dev#16968)" (pandas-dev#17902) * Revert "ERR: Raise ValueError when setting scalars in a dataframe with no index ( pandas-dev#16823) (pandas-dev#16968)" This reverts commit f9ba6fe. * TST: expicit test on setting scalars on empty frame closes pandas-dev#17894

…ex ( pandas-dev#16823) (pandas-dev#16968)

…h no index ( pandas-dev#16823) (pandas-dev#16968)" (pandas-dev#17902) * Revert "ERR: Raise ValueError when setting scalars in a dataframe with no index ( pandas-dev#16823) (pandas-dev#16968)" This reverts commit f9ba6fe. * TST: expicit test on setting scalars on empty frame closes pandas-dev#17894

…ex ( pandas-dev#16823) (pandas-dev#16968)

…h no index ( pandas-dev#16823) (pandas-dev#16968)" (pandas-dev#17902) * Revert "ERR: Raise ValueError when setting scalars in a dataframe with no index ( pandas-dev#16823) (pandas-dev#16968)" This reverts commit f9ba6fe. * TST: expicit test on setting scalars on empty frame closes pandas-dev#17894

jreback added Difficulty Intermediate Error Reporting Incorrect or improved errors from pandas Indexing Related to indexing on series/frames, not to indexes themselves Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jul 4, 2017

jreback added this to the Next Major Release milestone Jul 4, 2017

alanbato mentioned this issue Jul 15, 2017

ERR: Raise ValueError when setting scalars in a dataframe with no index ( #16823) #16968

Merged

4 tasks

jreback modified the milestones: 0.21.0, Next Major Release Jul 15, 2017

jreback modified the milestones: 0.21.0, Next Major Release Sep 23, 2017

jreback modified the milestones: Next Major Release, 0.21.0 Oct 2, 2017

jreback closed this as completed in #16968 Oct 8, 2017

jreback pushed a commit that referenced this issue Oct 8, 2017

ERR: Raise ValueError when setting scalars in a dataframe with no ind…

f9ba6fe

…ex ( #16823) (#16968)

toobaz mentioned this issue Oct 11, 2017

Using assign to place values from a dict into an empty dataframe adds the column names, but no values #17847

Closed

ghost pushed a commit to reef-technologies/pandas that referenced this issue Oct 16, 2017

ERR: Raise ValueError when setting scalars in a dataframe with no ind…

16ee0a7

…ex ( pandas-dev#16823) (pandas-dev#16968)

toobaz mentioned this issue Oct 16, 2017

Allow setting a column with a scalar and no index (revert #16823) #17894

Closed

jreback added a commit to jreback/pandas that referenced this issue Oct 17, 2017

Revert "ERR: Raise ValueError when setting scalars in a dataframe wit…

c51064e

…h no index ( pandas-dev#16823) (pandas-dev#16968)" This reverts commit f9ba6fe.

jreback mentioned this issue Oct 17, 2017

Revert "ERR: Raise ValueError when setting scalars in a dataframe with no index ( #16823) (#16968)" #17902

Merged

alanbato added a commit to alanbato/pandas that referenced this issue Nov 10, 2017

ERR: Raise ValueError when setting scalars in a dataframe with no ind…

1c4d72a

…ex ( pandas-dev#16823) (pandas-dev#16968)

No-Stream pushed a commit to No-Stream/pandas that referenced this issue Nov 28, 2017

ERR: Raise ValueError when setting scalars in a dataframe with no ind…

12befd0

…ex ( pandas-dev#16823) (pandas-dev#16968)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ERR: setting a column with a scalar and no index should raise #16823

ERR: setting a column with a scalar and no index should raise #16823

jreback commented Jul 4, 2017

jreback commented Jul 4, 2017

toobaz commented Jul 17, 2017

toobaz commented Oct 12, 2017 •

edited

Loading

jreback commented Oct 14, 2017

toobaz commented Oct 15, 2017

toobaz commented Oct 15, 2017

jorisvandenbossche commented Oct 16, 2017

jreback commented Oct 16, 2017

toobaz commented Oct 16, 2017 •

edited

Loading

jreback commented Oct 16, 2017

toobaz commented Oct 16, 2017

ERR: setting a column with a scalar and no index should raise #16823

ERR: setting a column with a scalar and no index should raise #16823

Comments

jreback commented Jul 4, 2017

jreback commented Jul 4, 2017

toobaz commented Jul 17, 2017

toobaz commented Oct 12, 2017 • edited Loading

jreback commented Oct 14, 2017

toobaz commented Oct 15, 2017

toobaz commented Oct 15, 2017

jorisvandenbossche commented Oct 16, 2017

jreback commented Oct 16, 2017

toobaz commented Oct 16, 2017 • edited Loading

jreback commented Oct 16, 2017

toobaz commented Oct 16, 2017

toobaz commented Oct 12, 2017 •

edited

Loading

toobaz commented Oct 16, 2017 •

edited

Loading