BUG: ArrayManager not respecting copy keyword #44889

jbrockmendel · 2021-12-14T23:08:56Z

closes #xxxx
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

jreback

lgtm @jorisvandenbossche if any comments

jorisvandenbossche

Can you explain the changes a bit? Nothing in the non-test code changes looks AM-specific in itself, so how is this fixing only an AM bug, and is it not changing behaviour for BM?

jorisvandenbossche · 2021-12-17T21:08:46Z

pandas/core/internals/construction.py

        arrays = [arr if not isinstance(arr, Index) else arr._data for arr in arrays]
-        arrays = [
-            arr if not is_datetime64tz_dtype(arr) else arr.copy() for arr in arrays
-        ]


Is this related to the other changes (this is not behind an if copy check), or just an additional copy that is no longer needed nowadays?

this can be removed independently without breaking any tests.

This actually causes a change in behaviour:

    In [1]: dti = pd.date_range("2012", periods=3, tz="Europe/Brussels")

    In [2]: df = pd.DataFrame({"a": dti}, copy=False)

    In [3]: df.iloc[0, 0] = df.iloc[2, 0]

    In [4]: df
    Out[4]:
                              a
    0 2012-01-03 00:00:00+01:00
    1 2012-01-02 00:00:00+01:00
    2 2012-01-03 00:00:00+01:00

    In [5]: dti
    Out[5]:
    DatetimeIndex(['2012-01-01 00:00:00+01:00', '2012-01-02 00:00:00+01:00',
                   '2012-01-03 00:00:00+01:00'],
                  dtype='datetime64[ns, Europe/Brussels]', freq='D')

In the above, the dti is not mutated, but with this branch it will be.
Now, since I explicitly passed copy=False, I suppose this is a bug fix :) For other Index objects it will also mutate the original index.

When the original PR (#24096) added this line of code to protect DatetimeIndex from getting mutated, the copy behaviour of DataFrame(..) might still have been different.
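A rough way to probe this from the outside, using only public API: snapshot the index, mutate the frame, and compare. This is a sketch, not part of the PR; the helper name `index_survives_mutation` is made up for illustration, and `tz="UTC"` is swapped in for `"Europe/Brussels"` only to avoid depending on a timezone database. What `copy=False` returns depends on the pandas version and the manager in use, so no expected output is stated for it.

```python
import pandas as pd


def index_survives_mutation(copy: bool) -> bool:
    """Return True if mutating the DataFrame leaves the source index unchanged.

    Hypothetical helper for illustration only.
    """
    dti = pd.date_range("2012", periods=3, tz="UTC")
    before = list(dti)  # Timestamps are immutable, so this is a safe snapshot

    df = pd.DataFrame({"a": dti}, copy=copy)
    df.iloc[0, 0] = df.iloc[2, 0]  # same mutation as in the session above

    return list(dti) == before


# With copy=True the frame owns its data, so the index should be untouched.
print("copy=True :", index_survives_mutation(copy=True))
# With copy=False the answer is version-dependent (this is what the thread is about).
print("copy=False:", index_survives_mutation(copy=False))
```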

jorisvandenbossche · 2021-12-17T21:13:30Z

pandas/core/internals/construction.py


    if copy:
-        # arrays_to_mgr (via form_blocks) won't make copies for EAs


This is no longer the case?

it is the case, it just isn't relevant

jorisvandenbossche · 2021-12-17T21:20:17Z

pandas/tests/frame/indexing/test_setitem.py

-            mark = pytest.mark.xfail(
-                reason="Both 'A' columns get set with 3 instead of 0 and 3"
-            )
-            request.node.add_marker(mark)


Is this fixed by the code changes?

I looked into why this actually fixes the test. It turns out that copy=True, which is now honored, is masking another bug.
Inside dict_to_mgr, if there are columns not present in the data, we create all-NA arrays for them with arrays.loc[missing] = [val] * missing.sum().

But that puts the identical array object in each of those columns, so if you then mutate one column, all of them get set unless the columns are copied. With the default of copy=True we now copy the arrays before creating the DataFrame, but with copy=False you still get the wrong behaviour.

Will open a separate issue for this.
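The aliasing described above comes from plain Python list repetition, which repeats references rather than values. A minimal sketch, with lists standing in for the all-NA arrays; both function names are hypothetical and this is not the pandas code:

```python
def build_missing_shared(n_missing, n_rows):
    """Mimic `[val] * n`: every missing column is the SAME object."""
    val = [None] * n_rows      # one all-NA "array"
    return [val] * n_missing   # n references to that single object


def build_missing_independent(n_missing, n_rows):
    """Build a fresh all-NA "array" per missing column."""
    return [[None] * n_rows for _ in range(n_missing)]


shared = build_missing_shared(2, 3)
shared[0][0] = 3               # set one column...
print(shared)                  # [[3, None, None], [3, None, None]]

independent = build_missing_independent(2, 3)
independent[0][0] = 3
print(independent)             # [[3, None, None], [None, None, None]]
```

A later copy of each column hides the problem, which matches the observation that copy=True masks it while copy=False does not.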

jorisvandenbossche · 2021-12-17T21:25:57Z

pandas/tests/frame/test_constructors.py

+        if (
+            using_array_manager
+            and not copy
+            and not (any_numpy_dtype in (tm.STRING_DTYPES + tm.BYTES_DTYPES))


So is it correct to conclude from this remaining skip that this PR fixed the copy=True case but not yet the copy=False case?

jorisvandenbossche · 2021-12-17T21:38:06Z

pandas/core/internals/construction.py

-            for x in arrays
-        ]
-        # TODO: can we get rid of the dt64tz special case above?
+        arrays = [x if not hasattr(x, "dtype") else x.copy() for x in arrays]


If I understand correctly, this will now copy all arrays that were present in the dict (if copy=True), instead of only EA-dtype arrays. This ensures that we now honor the copy keyword for AM (i.e. the actual fix in the PR).
But doesn't that also mean that this causes an extra copy for BM for numpy arrays?

good catch, will update
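One way to avoid the extra copy is to choose what to copy based on the manager type. The sketch below is an assumption about the shape of such a fix, not the code that was merged: the function name `copy_inputs` and the `manager` argument are hypothetical, and it assumes (per the comment removed in the diff above) that the block path copies NumPy arrays later on its own but not extension arrays.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray


def copy_inputs(arrays, copy: bool, manager: str):
    """Copy constructor inputs up front only where nothing later will.

    Hypothetical helper; `manager` is "block" or "array".
    """
    if not copy:
        return list(arrays)
    if manager == "block":
        # Assumption: NumPy arrays get copied later on this path,
        # so copying them here would be a second, wasted copy.
        return [x.copy() if isinstance(x, ExtensionArray) else x for x in arrays]
    # Array path: nothing downstream copies, so copy anything array-like.
    # The dtype check skips objects such as range.
    return [x.copy() if hasattr(x, "dtype") else x for x in arrays]


np_arr = np.arange(3)
ea = pd.array([1, 2, 3], dtype="Int64")
rng = range(3)

block_out = copy_inputs([np_arr, ea, rng], copy=True, manager="block")
array_out = copy_inputs([np_arr, ea, rng], copy=True, manager="array")

print(block_out[0] is np_arr)                    # True: left for the later copy
print(np.shares_memory(array_out[0], np_arr))    # False: copied here
```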

jreback · 2021-12-23T22:57:44Z

@jbrockmendel ok here?

jbrockmendel · 2021-12-23T23:45:29Z

i still need to respond to joris's comments

jreback · 2021-12-31T16:49:48Z

looks reasonable, cc @jorisvandenbossche

jreback · 2022-01-08T15:23:21Z

@jorisvandenbossche if any comments

BUG: ArrayManager not respecting copy keyword

ac23879

jreback added this to the 1.4 milestone Dec 16, 2021

jreback added the ArrayManager label Dec 16, 2021

jreback approved these changes Dec 16, 2021

jorisvandenbossche requested changes Dec 17, 2021

jorisvandenbossche reviewed Dec 17, 2021

Merge branch 'master' into fixmes-arraymanager

a9b5c6b

jreback removed this from the 1.4 milestone Dec 27, 2021

jbrockmendel added 5 commits December 27, 2021 11:17

Merge branch 'master' into fixmes-arraymanager

a605885

Merge branch 'master' into fixmes-arraymanager

0895b0e

avoid double-copy

01c6ad2

Merge branch 'master' into fixmes-arraymanager

fe0dc04

copy conditionally

1bfdcc6

jreback added this to the 1.4 milestone Dec 31, 2021

jorisvandenbossche approved these changes Jan 14, 2022

jorisvandenbossche merged commit 83ea173 into pandas-dev:main Jan 14, 2022

meeseeksmachine mentioned this pull request Jan 14, 2022

Backport PR #44889 on branch 1.4.x (BUG: ArrayManager not respecting copy keyword) #45368

Closed

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Jan 14, 2022

Backport PR pandas-dev#44889: BUG: ArrayManager not respecting copy k…

2ec85c2

…eyword

jorisvandenbossche modified the milestones: 1.4, 1.5 Jan 14, 2022

jorisvandenbossche mentioned this pull request Jan 14, 2022

BUG: DataFrame constructor with copy=False and missing columns creates columns that are views of each other #45369

Closed

jbrockmendel deleted the fixmes-arraymanager branch January 14, 2022 23:59
