FIX-#5531: Fix failure when inserting a 2D python list into a frame #5555
Conversation
if is_list_like(value) and not isinstance(value, np.ndarray):
    value = np.array(value)
elif is_scalar(value):
    value = [value] * len(self.index)
The value is only used to broadcast it to the corresponding partition, where it is then inserted natively using pandas. Let's pass the raw value and not pretend we can outsmart pandas with our preprocessing.
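For illustration (this snippet is not from the PR itself), here is how the removed np.array conversion changes the shape of a 2D python list, which is exactly the kind of input the original failure involved:

```python
import numpy as np

# A 2D python list: one inner list per row of a 2-row frame.
value = [[1, 2], [3, 4]]

# pandas would treat this as two row values of object dtype, but
# np.array() turns it into a genuine 2D array, which no longer looks
# like "one value per row" to downstream shape checks.
arr = np.array(value)
print(arr.shape)  # (2, 2)
```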
Why were we initially converting to NumPy?
This was introduced in #5226. Before that PR, all inputs were converted to a python list, which made putting the value into Ray's object store enormously slow. It was then suggested that python lists are generally much slower to serialize than NumPy arrays, so it was proposed that we should always convert to NumPy.
However, taking a closer look into this, it appears that list serialization is actually quite fast when the list consists of python built-in types (int, float, str, nested lists, ...); the observed slow serialization only shows up when a list consists of non-native types (np.int64, np.float64, ...):
'bad' list dtype: <class 'numpy.int64'>
'good' list dtype: <class 'int'>
Time for putting a numpy array: 0.05478150397539139
Time for putting a bad python list: 4.355332792038098
Time for putting a good python list: 0.026673782034777105
Reproducer
import ray
import numpy as np
from timeit import default_timer as timer
ray.init()
numpy_array = np.random.randint(0, 10, size=1_000_000)
python_list_bad = list(numpy_array)
python_list_good = numpy_array.tolist()
print(f"'bad' list dtype: {type(python_list_bad[0])}")
print(f"'good' list dtype: {type(python_list_good[0])}")
t1 = timer()
ray.put(numpy_array)
print(f"Time for putting a numpy array: {timer() - t1}") # 0.05s
t1 = timer()
ray.put(python_list_bad)
print(f"Time for putting a bad python list: {timer() - t1}") # 4.35s
t1 = timer()
ray.put(python_list_good)
print(f"Time for putting a good python list: {timer() - t1}") # 0.02s
And it's now clear that the 'bad' case is exactly what was happening before #5226. Previously the values were deliberately converted with list(value), which accidentally produced the 'bad' case of a list consisting of non-native types:
>>> ls = list(np.array([1, 2]))
>>> type(ls[0])
<class 'numpy.int32'>
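A quick way to see the cost difference without Ray is plain pickle: each NumPy scalar carries its own reconstruction overhead when serialized, while .tolist() converts elements to native python ints. A small sketch illustrating the point above:

```python
import pickle
import numpy as np

arr = np.arange(1000)
bad = list(arr)       # elements are NumPy scalars (e.g. np.int64)
good = arr.tolist()   # elements are native python ints

print(type(bad[0]).__name__)   # e.g. 'int64'
print(type(good[0]).__name__)  # 'int'

# The list of NumPy scalars is far bulkier to serialize.
print(len(pickle.dumps(bad)) > len(pickle.dumps(good)))  # True
```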
In a real-world scenario, it's unlikely (I hope) that we would meet a list with NumPy ints in it, so I believe we can safely remove all the value preprocessing and pass the raw value to Ray's object store without losing anything in terms of performance.
For the cases where we want to broadcast a value to the partitions, we should create an API at the Modin Dataframe layer that preprocesses the raw value (puts it into the object store) so that remote tasks retrieve it from the object store rather than having it copied directly into each task. Of course, that should be done as part of a separate issue.
LGTM!
Signed-off-by: Dmitry Chigarev dmitry.chigarev@intel.com
What do these changes do?
Previously, all values passed to .insert() were explicitly converted to a numpy array, causing some inputs to accidentally change their shape. It was decided to surrender to pandas and pass the raw value to the partitions so that pandas can process it the way it wants.
- passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
- passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
- signed commit with git commit -s
- Resolves DataFrame.insert() fails to insert a 2D python list when pandas doesn't #5531
- docs/development/architecture.rst is up-to-date
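As a sanity check of the behavior this PR aligns with, plain pandas (not Modin) accepts a 2D python list in .insert() because it treats each inner list as a single cell value in an object-dtype column:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
# pandas stores each inner list as one cell of an object column;
# no shape error is raised for a 2D python list.
df.insert(1, "b", [[1, 2], [3, 4]])
print(df["b"].tolist())  # [[1, 2], [3, 4]]
```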