Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX-#5531: Fix failure when inserting a 2D python list into a frame #5555

Merged
merged 2 commits into from
Jan 19, 2023

Conversation

dchigarev
Copy link
Collaborator

@dchigarev dchigarev commented Jan 18, 2023

Signed-off-by: Dmitry Chigarev dmitry.chigarev@intel.com

What do these changes do?

Previously, all of the values to .insert() were explicitly converted to a numpy array thus causing some inputs accidentally to change their shape:

>>> value = [[1, 2], [3, 4]] # 1D value from pandas perspective, totally okay to insert it as a column
>>> df.insert(0, "new_col1", value)
>>> df
  new_col1  a  b   c
0   [1, 2]  1  2  10
1   [3, 4]  2  3  20
>>> value = np.array(value)
>>> value.shape # numpy is smart enough to see that it's a 2D matrix
(2, 2)
>>> df.insert(0, "new_col2", value) # pandas now see this as well and restrict the insertion
ValueError: Expected a 1D array, got an array with shape (2, 2)

It was decided to surrender to pandas and to pass a raw value to partitions to make pandas process it the way it wants.

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves BUG: DataFrame.insert() fails to insert a 2D python list when pandas doesn't #5531
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

…nto a frame

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Comment on lines -2361 to -2365
if is_list_like(value) and not isinstance(value, np.ndarray):
value = np.array(value)
elif is_scalar(value):
value = [value] * len(self.index)

Copy link
Collaborator Author

@dchigarev dchigarev Jan 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value is only used to broadcast it to the corresponding partition and then to be inserted natively using pandas. Let's pass the raw value and not pretend we can over-smart pandas in preprocessing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we were initially converting to numpy?
This was introduced in #5226. Before that PR all inputs were converted to a python list which led to an enormous time for putting the value into ray storage. It was then suggested that python lists are generally much slower compared with NumPy arrays when it comes to serializing so it was proposed that we should always convert to NumPy.

However, now making a closer look into this it appeared that list() serialization is actually quite fast when the list consists of python built-in types (int, float, str, list(2d list), ...) and the observed behavior of slow serialization only appears when a list suddenly consists of non-native types (np.int64, np.float, ...):

'bad' list dtype: <class 'numpy.int64'>
'good' list dtype: <class 'int'>
Time for putting a numpy array: 0.05478150397539139
Time for putting a bad python list: 4.355332792038098
Time for putting a good python list: 0.026673782034777105
Reproducer
import ray
import numpy as np
from timeit import default_timer as timer

ray.init()

numpy_array = np.random.randint(0, 10, size=1_000_000)
python_list_bad = list(numpy_array)
python_list_good = numpy_array.tolist()

print(f"'bad' list dtype: {type(python_list_bad[0])}")
print(f"'good' list dtype: {type(python_list_good[0])}")

t1 = timer()
ray.put(numpy_array)
print(f"Time for putting a numpy array: {timer() - t1}") # 0.05s

t1 = timer()
ray.put(python_list_bad)
print(f"Time for putting a bad python list: {timer() - t1}") # 4.35s

t1 = timer()
ray.put(python_list_good)
print(f"Time for putting a good python list: {timer() - t1}") # 0.02s

And it's now clear that the 'bad' case is exactly what was happening before #5226. Previously the values were deliberately converted to a list like list(value) and then it accidentally generated a 'bad' case where a list consists of non-native types:

>>> ls = list(np.array([1, 2]))
>>> type(ls[0])
<class 'numpy.int32'>

In a real-world scenario, it's unlikely (I hope) that we would meet a list with numpy ints in it, thus I believe that we can safely remove all the value preprocessing and pass a raw value to ray's object store without losing anything in terms of performance.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the cases when we want to broadcast a value to the partitions, we should create an API at the Modin Dataframe layer to preprocess the raw value (put it into the object store) in order for it to be retrieved from the object store in remote tasks rather than copying it directly to the tasks. Of course, that should be done as part of a separate issue.

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
@dchigarev dchigarev marked this pull request as ready for review January 18, 2023 18:26
@dchigarev dchigarev requested a review from a team as a code owner January 18, 2023 18:26
Copy link
Collaborator

@anmyachev anmyachev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@YarShev YarShev merged commit 5c20ed0 into modin-project:master Jan 19, 2023
vnlitvinov pushed a commit that referenced this pull request Jan 24, 2023
…5555)

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: DataFrame.insert() fails to insert a 2D python list when pandas doesn't
3 participants