-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: ArrayManager not respecting copy keyword #44889
BUG: ArrayManager not respecting copy keyword #44889
Conversation
jbrockmendel
commented
Dec 14, 2021
- closes #xxxx
- tests added / passed
- Ensure all linting tests pass, see here for how to run them
- whatsnew entry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm @jorisvandenbossche if any comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain the changes a bit? Nothing in the non-test code changes look AM-specific in itself, so how is this only fixing an AM bug, and is it not changing behaviour for BM?
arrays = [arr if not isinstance(arr, Index) else arr._data for arr in arrays] | ||
arrays = [ | ||
arr if not is_datetime64tz_dtype(arr) else arr.copy() for arr in arrays | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this related to the other changes (this is not behind an if copy
check), or just an additional copy that is no longer needed nowadays?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be removed independently without breaking any tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This actually causes a change in behaviour:
In [1]: dti = pd.date_range("2012", periods=3, tz="Europe/Brussels")
In [2]: df = pd.DataFrame({"a": dti}, copy=False)
In [3]: df.iloc[0, 0] = df.iloc[2, 0]
In [4]: df
Out[4]:
a
0 2012-01-03 00:00:00+01:00
1 2012-01-02 00:00:00+01:00
2 2012-01-03 00:00:00+01:00
In [5]: dti
Out[5]:
DatetimeIndex(['2012-01-01 00:00:00+01:00', '2012-01-02 00:00:00+01:00',
'2012-01-03 00:00:00+01:00'],
dtype='datetime64[ns, Europe/Brussels]', freq='D')
In the above, the dti
is not mutated, but with this branch it will be.
Now, since I explicitly passed copy=False
, I suppose this is a bug fix :) For other Index objects it will also mutate the original index.
When the original PR (#24096) added this line of code to protect DatetimeIndex from getting mutated, the copy
behaviour of DataFrame(..) might still have been different.
|
||
if copy: | ||
# arrays_to_mgr (via form_blocks) won't make copies for EAs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is no longer the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is the case, it just isn't relevant
mark = pytest.mark.xfail( | ||
reason="Both 'A' columns get set with 3 instead of 0 and 3" | ||
) | ||
request.node.add_marker(mark) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this fixed by the code changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I looked into this to understand why it was actually fixing this. And it's actually because the copy=True
that gets honored is now masking another bug.
Inside dict_to_mgr
, if there are columns not present in the data, we create all-NA arrays for them with arrays.loc[missing] = [val] * missing.sum()
.
But so that is putting an identical array for each columns. And so if you then mutate one columns, all columns get set, if the columns are not copied. With the default of copy=True
we will now copy the arrays before creating the DataFrame. But with copy=False
you still have the wrong behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will open a separate issue for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-> #45369
if ( | ||
using_array_manager | ||
and not copy | ||
and not (any_numpy_dtype in (tm.STRING_DTYPES + tm.BYTES_DTYPES)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So is it correct to conclude from this remaining skip that this PR fixed the copy=True
case but not yet the copy=False
case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
correct
for x in arrays | ||
] | ||
# TODO: can we get rid of the dt64tz special case above? | ||
arrays = [x if not hasattr(x, "dtype") else x.copy() for x in arrays] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, this will now copy all arrays that were present in the dict (if copy=True
), instead of only EAdtype arrays. This ensures that we now honor the copy keyword for AM (i.e. the actual fix in the PR).
But doesn't that also mean that this causes an extra copy for BM for numpy arrays?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch, will update
@jbrockmendel ok here? |
i still need to respond to joris's comments |
looks reasonable, cc @jorisvandenbossche |
@jorisvandenbossche if any comments |