Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: ArrayManager not respecting copy keyword #44889

Merged
merged 7 commits into from
Jan 14, 2022

Conversation

jbrockmendel
Copy link
Member

  • closes #xxxx
  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

@jreback jreback added this to the 1.4 milestone Dec 16, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm @jorisvandenbossche if any comments

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain the changes a bit? Nothing in the non-test code changes look AM-specific in itself, so how is this only fixing an AM bug, and is it not changing behaviour for BM?

arrays = [arr if not isinstance(arr, Index) else arr._data for arr in arrays]
arrays = [
arr if not is_datetime64tz_dtype(arr) else arr.copy() for arr in arrays
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this related to the other changes (this is not behind an if copy check), or just an additional copy that is no longer needed nowadays?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be removed independently without breaking any tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually causes a change in behaviour:

In [1]: dti = pd.date_range("2012", periods=3, tz="Europe/Brussels")

In [2]: df = pd.DataFrame({"a": dti}, copy=False)

In [3]: df.iloc[0, 0] = df.iloc[2, 0]

In [4]: df
Out[4]: 
                          a
0 2012-01-03 00:00:00+01:00
1 2012-01-02 00:00:00+01:00
2 2012-01-03 00:00:00+01:00

In [5]: dti
Out[5]: 
DatetimeIndex(['2012-01-01 00:00:00+01:00', '2012-01-02 00:00:00+01:00',
               '2012-01-03 00:00:00+01:00'],
              dtype='datetime64[ns, Europe/Brussels]', freq='D')

In the above, the dti is not mutated, but with this branch it will be.
Now, since I explicitly passed copy=False, I suppose this is a bug fix :) For other Index objects it will also mutate the original index.

When the original PR (#24096) added this line of code to protect DatetimeIndex from getting mutated, the copy behaviour of DataFrame(..) might still have been different.


if copy:
# arrays_to_mgr (via form_blocks) won't make copies for EAs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer the case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is the case, it just isn't relevant

mark = pytest.mark.xfail(
reason="Both 'A' columns get set with 3 instead of 0 and 3"
)
request.node.add_marker(mark)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this fixed by the code changes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I looked into this to understand why it was actually fixing this. And it's actually because the copy=True that gets honored is now masking another bug.
Inside dict_to_mgr, if there are columns not present in the data, we create all-NA arrays for them with arrays.loc[missing] = [val] * missing.sum().

But so that is putting an identical array for each columns. And so if you then mutate one columns, all columns get set, if the columns are not copied. With the default of copy=True we will now copy the arrays before creating the DataFrame. But with copy=False you still have the wrong behaviour.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will open a separate issue for this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-> #45369

if (
using_array_manager
and not copy
and not (any_numpy_dtype in (tm.STRING_DTYPES + tm.BYTES_DTYPES))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is it correct to conclude from this remaining skip that this PR fixed the copy=True case but not yet the copy=False case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct

for x in arrays
]
# TODO: can we get rid of the dt64tz special case above?
arrays = [x if not hasattr(x, "dtype") else x.copy() for x in arrays]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, this will now copy all arrays that were present in the dict (if copy=True), instead of only EAdtype arrays. This ensures that we now honor the copy keyword for AM (i.e. the actual fix in the PR).
But doesn't that also mean that this causes an extra copy for BM for numpy arrays?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, will update

@jreback
Copy link
Contributor

jreback commented Dec 23, 2021

@jbrockmendel ok here?

@jbrockmendel
Copy link
Member Author

i still need to respond to joris's comments

@jreback jreback removed this from the 1.4 milestone Dec 27, 2021
@jreback jreback added this to the 1.4 milestone Dec 31, 2021
@jreback
Copy link
Contributor

jreback commented Dec 31, 2021

looks reasonable, cc @jorisvandenbossche

@jreback
Copy link
Contributor

jreback commented Jan 8, 2022

@jorisvandenbossche if any comments

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants