FIX-#5531: Fix failure when inserting a 2D python list into a frame #5555

dchigarev · 2023-01-18T13:36:18Z

Signed-off-by: Dmitry Chigarev dmitry.chigarev@intel.com

What do these changes do?

Previously, all of the values to .insert() were explicitly converted to a numpy array thus causing some inputs accidentally to change their shape:

>>> value = [[1, 2], [3, 4]] # 1D value from pandas perspective, totally okay to insert it as a column
>>> df.insert(0, "new_col1", value)
>>> df
  new_col1  a  b   c
0   [1, 2]  1  2  10
1   [3, 4]  2  3  20
>>> value = np.array(value)
>>> value.shape # numpy is smart enough to see that it's a 2D matrix
(2, 2)
>>> df.insert(0, "new_col2", value) # pandas now see this as well and restrict the insertion
ValueError: Expected a 1D array, got an array with shape (2, 2)

It was decided to surrender to pandas and to pass a raw value to partitions to make pandas process it the way it wants.

first commit message and PR title follow format outlined here

NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with git commit -s
Resolves BUG: DataFrame.insert() fails to insert a 2D python list when pandas doesn't #5531
tests added and passing
module layout described at docs/development/architecture.rst is up-to-date

…nto a frame Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

dchigarev · 2023-01-18T14:14:30Z

modin/core/storage_formats/pandas/query_compiler.py

-        if is_list_like(value) and not isinstance(value, np.ndarray):
-            value = np.array(value)
-        elif is_scalar(value):
-            value = [value] * len(self.index)
-


The value is only used to broadcast it to the corresponding partition and then to be inserted natively using pandas. Let's pass the raw value and not pretend we can over-smart pandas in preprocessing.

Why we were initially converting to numpy?
This was introduced in #5226. Before that PR all inputs were converted to a python list which led to an enormous time for putting the value into ray storage. It was then suggested that python lists are generally much slower compared with NumPy arrays when it comes to serializing so it was proposed that we should always convert to NumPy.

However, now making a closer look into this it appeared that list() serialization is actually quite fast when the list consists of python built-in types (int, float, str, list(2d list), ...) and the observed behavior of slow serialization only appears when a list suddenly consists of non-native types (np.int64, np.float, ...):

'bad' list dtype: <class 'numpy.int64'> 'good' list dtype: <class 'int'> Time for putting a numpy array: 0.05478150397539139 Time for putting a bad python list: 4.355332792038098 Time for putting a good python list: 0.026673782034777105

Reproducer

import ray import numpy as np from timeit import default_timer as timer ray.init() numpy_array = np.random.randint(0, 10, size=1_000_000) python_list_bad = list(numpy_array) python_list_good = numpy_array.tolist() print(f"'bad' list dtype: {type(python_list_bad[0])}") print(f"'good' list dtype: {type(python_list_good[0])}") t1 = timer() ray.put(numpy_array) print(f"Time for putting a numpy array: {timer() - t1}") # 0.05s t1 = timer() ray.put(python_list_bad) print(f"Time for putting a bad python list: {timer() - t1}") # 4.35s t1 = timer() ray.put(python_list_good) print(f"Time for putting a good python list: {timer() - t1}") # 0.02s

And it's now clear that the 'bad' case is exactly what was happening before #5226. Previously the values were deliberately converted to a list like list(value) and then it accidentally generated a 'bad' case where a list consists of non-native types:

>>> ls = list(np.array([1, 2])) >>> type(ls[0]) <class 'numpy.int32'>

In a real-world scenario, it's unlikely (I hope) that we would meet a list with numpy ints in it, thus I believe that we can safely remove all the value preprocessing and pass a raw value to ray's object store without losing anything in terms of performance.

For the cases when we want to broadcast a value to the partitions, we should create an API at the Modin Dataframe layer to preprocess the raw value (put it into the object store) in order for it to be retrieved from the object store in remote tasks rather than copying it directly to the tasks. Of course, that should be done as part of a separate issue.

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

anmyachev

LGTM!

…5555) Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

FIX-modin-project#5531: Fix failure when inserting a 2D python list i…

ec50916

…nto a frame Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

dchigarev commented Jan 18, 2023

View reviewed changes

Fix test

7bbbf05

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

dchigarev marked this pull request as ready for review January 18, 2023 18:26

dchigarev requested a review from a team as a code owner January 18, 2023 18:26

anmyachev approved these changes Jan 19, 2023

View reviewed changes

YarShev approved these changes Jan 19, 2023

View reviewed changes

YarShev merged commit 5c20ed0 into modin-project:master Jan 19, 2023

vnlitvinov pushed a commit that referenced this pull request Jan 24, 2023

FIX-#5531: Fix failure when inserting a 2D python list into a frame (#…

f0236d9

…5555) Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FIX-#5531: Fix failure when inserting a 2D python list into a frame #5555

FIX-#5531: Fix failure when inserting a 2D python list into a frame #5555

dchigarev commented Jan 18, 2023 •

edited

Loading

dchigarev Jan 18, 2023 •

edited

Loading

dchigarev Jan 18, 2023

YarShev Jan 19, 2023

anmyachev left a comment

FIX-#5531: Fix failure when inserting a 2D python list into a frame #5555

FIX-#5531: Fix failure when inserting a 2D python list into a frame #5555

Conversation

dchigarev commented Jan 18, 2023 • edited Loading

What do these changes do?

dchigarev Jan 18, 2023 • edited Loading

Choose a reason for hiding this comment

dchigarev Jan 18, 2023

Choose a reason for hiding this comment

YarShev Jan 19, 2023

Choose a reason for hiding this comment

anmyachev left a comment

Choose a reason for hiding this comment

dchigarev commented Jan 18, 2023 •

edited

Loading

dchigarev Jan 18, 2023 •

edited

Loading