BUG: DataFrame.insert() fails to insert a 2D python list when pandas doesn't #5531

Closed
3 tasks done
dchigarev opened this issue Jan 12, 2023 · 0 comments · Fixed by #5555
Labels
bug 🦗 Something isn't working P1 Important tasks that we should complete soon pandas concordance 🐼 Functionality that does not match pandas

Comments

@dchigarev
Collaborator

Modin version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest released version of Modin.

  • I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

import modin.pandas as pd
import pandas

value = [[1, 2], [3, 4]]

md_df = pd.DataFrame({"a": [1, 2]})
pd_df = md_df._to_pandas()

pd_df["new_col"] = value
print(pd_df)
#    a new_col
# 0  1  [1, 2]
# 1  2  [3, 4]

md_df["new_col"] = value  # ValueError, raised once the frame is materialized (see the traceback below)
print(md_df)

Issue Description

It appears that pandas always treats a Python list as a 1D object (even if it is naturally 2D), which allows such objects to be inserted as a column. Modin, on the other hand, applies an optimization (#5226) that converts a list-like value to a NumPy array before insertion, which speeds up column deserialization significantly.

if is_list_like(value) and not isinstance(value, np.ndarray):
    value = np.array(value)

This misbehaves when value is a 2D Python list (remember, pandas sees it as a 1D object): np.array(value) converts the value into a literal 2D matrix, so pandas can now see that the value is 2D and aborts the insertion with an error.
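
To make the shape difference concrete, here is a minimal sketch that uses only NumPy (no Modin or pandas). The variable names `as_matrix` and `as_column` are illustrative and do not come from the Modin code base.

```python
import numpy as np

value = [[1, 2], [3, 4]]

# Plain conversion: NumPy infers a 2D integer matrix from the nested list.
as_matrix = np.array(value)
print(as_matrix.shape)  # (2, 2)
print(as_matrix.ndim)   # 2

# Filling a preallocated object array keeps one Python list per row,
# so the result stays 1D.
as_column = np.empty(len(value), dtype=object)
for i, item in enumerate(value):
    as_column[i] = item
print(as_column.shape)  # (2,)
print(as_column[0])     # [1, 2]
```

The first array has the (2, 2) shape that the ValueError reports; the second is a 1D array whose elements are lists.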

A simple workaround is to manually create a proper 1D NumPy array of dtype object and insert that into the frame, which avoids the mistaken conversion:

import modin.pandas as pd
import numpy as np

value = [[1, 2], [3, 4]]

md_df = pd.DataFrame({"a": [1, 2]})

new_value = np.empty(len(value), dtype=object)
new_value[:] = value

md_df["new_col"] = new_value
print(md_df)
#    a new_col
# 0  1  [1, 2]
# 1  2  [3, 4]
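
The same workaround could be wrapped in a small helper so that flat lists still take the fast NumPy path. This is only a sketch under assumptions: `to_1d_array` is a hypothetical name, it only special-cases nested lists and tuples, and it is not necessarily how the fix in #5555 was implemented.

```python
import numpy as np


def to_1d_array(value):
    """Hypothetical helper: convert a list-like into an array that is always 1D."""
    value = list(value)
    # Nested lists/tuples would become a 2D matrix under np.array(),
    # so store each of them as a single object element instead.
    if any(isinstance(item, (list, tuple)) for item in value):
        out = np.empty(len(value), dtype=object)
        for i, item in enumerate(value):
            out[i] = item
        return out
    # Flat input: the regular conversion already yields a 1D array.
    return np.array(value)


nested = to_1d_array([[1, 2], [3, 4]])
flat = to_1d_array([1, 2, 3])
print(nested.shape, nested.dtype)  # (2,) object
print(flat.shape)                  # (3,)
```

The element-by-element loop is used instead of slice assignment so the result does not depend on how NumPy broadcasts a nested list into an object array.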

Expected Behavior

The insertion should succeed and produce the same result as pandas.

Error Logs

Traceback (most recent call last):
  File "t2.py", line 13, in <module>
    print(md_df)
  File "/localdisk/dchigare/repos/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/localdisk/dchigare/repos/modin/modin/pandas/base.py", line 3846, in __str__
    return repr(self)
  File "/localdisk/dchigare/repos/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/localdisk/dchigare/repos/modin/modin/pandas/dataframe.py", line 250, in __repr__
    result = repr(self._build_repr_df(num_rows, num_cols))
  File "/localdisk/dchigare/repos/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/localdisk/dchigare/repos/modin/modin/pandas/base.py", line 227, in _build_repr_df
    return self.iloc[indexer]._query_compiler.to_pandas()
  File "/localdisk/dchigare/repos/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/localdisk/dchigare/repos/modin/modin/core/storage_formats/pandas/query_compiler.py", line 268, in to_pandas
    return self._modin_frame.to_pandas()
  File "/localdisk/dchigare/repos/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/localdisk/dchigare/repos/modin/modin/core/dataframe/pandas/dataframe/dataframe.py", line 126, in run_f_on_minimally_updated_metadata
    result = f(self, *args, **kwargs)
  File "/localdisk/dchigare/repos/modin/modin/core/dataframe/pandas/dataframe/dataframe.py", line 3220, in to_pandas
    df = self._partition_mgr_cls.to_pandas(self._partitions)
  File "/localdisk/dchigare/repos/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/localdisk/dchigare/repos/modin/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 644, in to_pandas
    retrieved_objects = cls.get_objects_from_partitions(partitions.flatten())
  File "/localdisk/dchigare/repos/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/localdisk/dchigare/repos/modin/modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition_manager.py", line 117, in get_objects_from_partitions
    return RayWrapper.materialize(
  File "/localdisk/dchigare/repos/modin/modin/core/execution/ray/common/engine_wrapper.py", line 92, in materialize
    return ray.get(obj_id)
  File "/localdisk/dchigare/miniconda3/envs/modin/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/localdisk/dchigare/miniconda3/envs/modin/lib/python3.8/site-packages/ray/_private/worker.py", line 2309, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::_deploy_ray_func() (pid=3487892, ip=10.241.129.69)
  File "/localdisk/dchigare/repos/modin/modin/core/execution/ray/implementations/pandas_on_ray/partitioning/virtual_partition.py", line 560, in _deploy_ray_func
    result = deployer(axis, f_to_deploy, f_args, f_kwargs, *args, **kwargs)
  File "/localdisk/dchigare/repos/modin/modin/core/dataframe/pandas/partitioning/axis_partition.py", line 181, in deploy_axis_func
    result = func(dataframe, *f_args, **f_kwargs)
  File "/localdisk/dchigare/repos/modin/modin/core/storage_formats/pandas/query_compiler.py", line 2385, in insert
    df.insert(internal_idx, column, value)
  File "/localdisk/dchigare/miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/frame.py", line 4819, in insert
    self._mgr.insert(loc, column, value)
  File "/localdisk/dchigare/miniconda3/envs/modin/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 1407, in insert
    raise ValueError(
ValueError: Expected a 1D array, got an array with shape (2, 2)

Installed Versions

Replace this line with the output of pd.show_versions()

@dchigarev dchigarev added bug 🦗 Something isn't working P1 Important tasks that we should complete soon labels Jan 12, 2023
@anmyachev anmyachev added the pandas concordance 🐼 Functionality that does not match pandas label Jan 13, 2023
dchigarev added a commit to dchigarev/modin that referenced this issue Jan 18, 2023
…nto a frame

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
YarShev pushed a commit that referenced this issue Jan 19, 2023
…5555)

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
vnlitvinov pushed a commit that referenced this issue Jan 24, 2023
…5555)

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

2 participants