Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: extraneous copy of extension arrays in v1.3.0 #42501

Closed
gsakkis opened this issue Jul 12, 2021 · 12 comments · Fixed by #42823
Closed

BUG: extraneous copy of extension arrays in v1.3.0 #42501

gsakkis opened this issue Jul 12, 2021 · 12 comments · Fixed by #42823
Assignees
Labels
Copy / view semantics Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays. Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@gsakkis
Copy link

gsakkis commented Jul 12, 2021

Code Sample, a copy-pastable example

import pandas as pd
from pandas.core.arrays.integer import coerce_to_array


class IntegerArrayNoCopy(pd.core.arrays.IntegerArray):

    @classmethod
    def _from_sequence(cls, scalars, *, dtype=None, copy=False):
        values, mask = coerce_to_array(scalars, dtype=dtype, copy=copy)
        return IntegerArrayNoCopy(values, mask)

    def copy(self):
        raise NotImplementedError


class Int16DtypeNoCopy(pd.Int16Dtype):
    @classmethod
    def construct_array_type(cls):
        return IntegerArrayNoCopy


if __name__ == '__main__':
    df = pd.DataFrame({"col": [1, 4, None, 5]}, dtype=object)
    print(df.dtypes)
    df = df.astype({"col": Int16DtypeNoCopy()}, copy=False)
    print(df.dtypes)
    print(df)

Problem description

In 1.3.0, astype attempts to create an extension array copy even when explicitly passed copy=False:

Traceback (most recent call last):
  File "test_astype.py", line 24, in <module>
    df = df.astype({"col": Int16DtypeNoCopy()}, copy=False)
  File "/home/gsk/miniconda/envs/pdenv/lib/python3.7/site-packages/pandas/core/generic.py", line 5814, in astype
    result = concat(results, axis=1, copy=False)
  File "/home/gsk/miniconda/envs/pdenv/lib/python3.7/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/gsk/miniconda/envs/pdenv/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 307, in concat
    return op.get_result()
  File "/home/gsk/miniconda/envs/pdenv/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 508, in get_result
    df = cons(data, index=index)
  File "/home/gsk/miniconda/envs/pdenv/lib/python3.7/site-packages/pandas/core/frame.py", line 614, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "/home/gsk/miniconda/envs/pdenv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 458, in dict_to_mgr
    for x in arrays
  File "/home/gsk/miniconda/envs/pdenv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 458, in <listcomp>
    for x in arrays
  File "/home/gsk/miniconda/envs/pdenv/lib/python3.7/site-packages/pandas/core/generic.py", line 5924, in copy
    data = self._mgr.copy(deep=deep)
  File "/home/gsk/miniconda/envs/pdenv/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 595, in copy
    res = self.apply("copy", deep=deep)
  File "/home/gsk/miniconda/envs/pdenv/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 327, in apply
    applied = getattr(b, f)(**kwargs)
  File "/home/gsk/miniconda/envs/pdenv/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 651, in copy
    values = values.copy()
  File "test_astype.py", line 13, in copy
    raise NotImplementedError
NotImplementedError

Expected Output

1.2.5 works as expected (at least for this example):

col    object
dtype: object
col    Int16
dtype: object
    col
0     1
1     4
2  <NA>
3     5

Output of pd.show_versions()

INSTALLED VERSIONS

commit : f00ed8f
python : 3.7.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.0-1029-oem
Version : #30-Ubuntu SMP Fri May 28 23:53:50 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.0
numpy : 1.20.2
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 49.6.0.post20210108
Cython : 0.29.22
pytest : 6.2.2
hypothesis : 6.7.0
jinja2 : 2.11.3
IPython : 7.20.0
fsspec : 2021.04.0
fastparquet : 0.5.0
matplotlib : 3.4.1
pyarrow : 4.0.1
s3fs : 2021.04.0
scipy : 1.6.0
xarray : 0.17.0
numba : 0.53.1

@gsakkis gsakkis added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 12, 2021
@gsakkis
Copy link
Author

gsakkis commented Jul 12, 2021

AFAICT it's caused by (or related to) #38939

@mzeitlin11
Copy link
Member

Thanks for reporting and investigating this @gsakkis! Like you suggested, bisection indicates this behavior changed with #38939. Further investigations welcome

@mzeitlin11 mzeitlin11 added Copy / view semantics Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays. Regression Functionality that used to work in a prior pandas version and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 12, 2021
@mzeitlin11 mzeitlin11 added this to the 1.3.1 milestone Jul 12, 2021
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Jul 16, 2021
@jmcomie
Copy link
Contributor

jmcomie commented Jul 16, 2021

take

@jmcomie
Copy link
Contributor

jmcomie commented Jul 16, 2021

I'm still learning how to infer work owner -- @simonjayhawkins if you're working on this please take it back from me. :)

@jmcomie
Copy link
Contributor

jmcomie commented Jul 16, 2021

It seems that #38939 fixed the inverse test case, where copy is expressly true, but broke this test case. Current hypothesis is that this is a straightforward issue with argument propagation. I have a draft code change that fixes our test case here. I'll continue exploring the code.

@simonjayhawkins
Copy link
Member

I'm still learning how to infer work owner -- @simonjayhawkins if you're working on this please take it back from me. :)

@jmcomie my commit that referenced this issue was just confirming commit that caused regression. carry on!

@simonjayhawkins
Copy link
Member

AFAICT it's caused by (or related to) #38939

cc @jbrockmendel

@simonjayhawkins simonjayhawkins modified the milestones: 1.3.1, 1.3.2 Jul 24, 2021
@jmcomie
Copy link
Contributor

jmcomie commented Jul 26, 2021

Looking through the stack, astype calls concat with copy=False, but in the is_series axis=1 block in the get_result method in reshape/concat.py a DataFrame constructor is called without this copy value being passed along. Instrumenting the code with a few print statements reveals that we are setting copy to a value of True in this path, in the code added in #38939. So an apparent fix is an update to the get_result method to pass on the provided copy value to this DataFrame constructor. This does resolve the test case provided -- code:

diff --git a/pandas/core/reshape/concat.py b/pandas/core/reshape/concat.py
index d908638c47..10936b6003 100644
--- a/pandas/core/reshape/concat.py
+++ b/pandas/core/reshape/concat.py
@@ -504,7 +504,7 @@ class _Concatenator:
                 cons = sample._constructor_expanddim
 
                 index, columns = self.new_axes
-                df = cons(data, index=index)
+                df = cons(data, index=index, copy=self.copy)
                 df.columns = columns
                 return df.__finalize__(self, method="concat")

However this change breaks the test case in tests/extension/decimal/test_decimal.py::test_astype_dispatches[True]. I paused after an initial review of the test_astype_dispatches behavior and am planning to pick up that investigation this week. Let me know if I'm missing something or if there's a better angle to pursue here.

@jbrockmendel
Copy link
Member

@jmcomie
Copy link
Contributor

jmcomie commented Jul 26, 2021

could removing the special-casing for EAs here (https://github.com/pandas-dev/pandas/pull/38939/files#diff-47a4d23478486a3722569045c05137aec72bc6030df19443c454872a7cdf90d4R445) be helpful?

Removing the special-casing fixes the test case of this issue but causes other failures. Separately, I wonder if it's worth refactoring that block into a lower level call, since it seems to act on undocumented knowledge of the lower level calls.

@jmcomie
Copy link
Contributor

jmcomie commented Jul 26, 2021

The test_astype_dispatches failure is due to a parameterized expected failure succeeding after the change so it might work to change the test to expect both parameterized inputs to work.

@jmcomie
Copy link
Contributor

jmcomie commented Jul 26, 2021

The expected failure is pd.concat call inside NDFrame.astype reverts the dtype, which seems to be the issue we're trying to address here, at least indirectly. Why does the dtype get reverted on copy generally?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Copy / view semantics Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays. Regression Functionality that used to work in a prior pandas version
Projects
None yet
5 participants