BUG: #5904

LudsteckJ · 2023-03-30T07:41:15Z

Modin version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest released version of Modin.
I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

import numpy as np

import ray
ray.shutdown()
ray.init()

# pdm: MODIN pandas
import modin.pandas as pdm
# pdo: ORIGINAL pandas
import pandas as pdo

a = np.random.randint(0, high=100, size=(50000000, 4))
# do: ORIGINAL pandas DataFrame
do = pdo.DataFrame(a, columns=['x'+str(i+1) for i in range(4)])
# dm: MODIN pandas DataFrame
dm = pdm.DataFrame(a, columns=['x'+str(i+1) for i in range(4)])

# This works as expected:
%time ho = do.groupby('x1').apply(lambda g: (g.x1 + g.x2).mean())

# This produces only error messages but no output:
%time hm = dm.groupby('x1').apply(lambda g: (g.x1 + g.x2).mean())

# speed issues:
%time do['s'] = do.groupby(['x1']).x2.transform('sum')
# This is considerably SLOWER than the standard pandas version above
# on a windows server with 20 CPUs!!!
%time dm['s'] = dm.groupby(['x1']).x2.transform('sum')

%time ho = do[['x1', 'x2', 'x3']].sum()

%time hm = dm[['x1', 'x2', 'x3']].sum()

Issue Description

The following simple apply operation on a Modin DataFrame
dm.groupby('x1').apply(lambda g: (g.x1 + g.x2).mean())
fails with error messages and returns no result. This must be a bug, since the same call produces sensible results for a standard pandas DataFrame.

Furthermore I perfo

Expected Behavior

The apply() above should yield the same result for a Modin DataFrame as for a pandas DataFrame. Instead, it fails.

Error Logs

2023-03-30 09:40:02,073 WARNING worker.py:1866 -- Traceback (most recent call last):
File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\ray_private\import_thread.py", line 204, in fetch_and_execute_function_to_run
function({"worker": self.worker})
File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\modin\engines\ray\utils.py", line 84, in import_pandas
import pandas # noqa F401
File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas_init.py", line 50, in
from pandas.core.api import (
File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\api.py", line 48, in
from pandas.core.groupby import (
File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\groupby_init_.py", line 1, in
from pandas.core.groupby.generic import (
File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\groupby\generic.py", line 73, in
from pandas.core.frame import DataFrame
File "", line 1004, in _find_and_load
File "", line 158, in enter
File "", line 103, in acquire
_frozen_importlib._DeadlockError: deadlock detected by _ModuleLock('pandas.core.frame') at 1689683077104

(apply_func pid=21260) 2023-03-30 09:40:02,104 ERROR serialization.py:371 -- cannot import name 'DataFrame' from partially initialized module 'pandas.core.frame' (most likely due to a circular import) (C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\frame.py)
(apply_func pid=21260) Traceback (most recent call last):
(apply_func pid=21260) File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\ray_private\serialization.py", line 369, in deserialize_objects
(apply_func pid=21260) obj = self._deserialize_object(data, metadata, object_ref)
(apply_func pid=21260) File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\ray_private\serialization.py", line 252, in _deserialize_object
(apply_func pid=21260) return self._deserialize_msgpack_data(data, metadata_fields)
(apply_func pid=21260) File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\ray_private\serialization.py", line 207, in _deserialize_msgpack_data
(apply_func pid=21260) python_objects = self.deserialize_pickle5_data(pickle5_data)
(apply_func pid=21260) File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\ray_private\serialization.py", line 195, in deserialize_pickle5_data
(apply_func pid=21260) obj = pickle.loads(in_band, buffers=buffers)
(apply_func pid=21260) File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\frame.py", line 129, in
(apply_func pid=21260) from pandas.core import (
(apply_func pid=21260) File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\generic.py", line 108, in
(apply_func pid=21260) from pandas.core import (
(apply_func pid=21260) File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\indexing.py", line 46, in
(apply_func pid=21260) import pandas.core.common as com
(apply_func pid=21260) File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas_init.py", line 50, in
(apply_func pid=21260) from pandas.core.api import (
(apply_func pid=21260) File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\api.py", line 48, in
(apply_func pid=21260) from pandas.core.groupby import (
(apply_func pid=21260) File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\groupby_init.py", line 1, in
(apply_func pid=21260) from pandas.core.groupby.generic import (
(apply_func pid=21260) File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\groupby\generic.py", line 73, in
(apply_func pid=21260) from pandas.core.frame import DataFrame
(apply_func pid=21260) ImportError: cannot import name 'DataFrame' from partially initialized module 'pandas.core.frame' (most likely due to a circular import) (C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\frame.py)
2023-03-30 09:40:03,120 ERROR worker.py:399 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::apply_func() (pid=28736, ip=127.0.0.1)
File "python\ray_raylet.pyx", line 528, in ray._raylet.raise_if_dependency_failed
ray.exceptions.RayTaskError: ray::deploy_ray_func() (pid=28736, ip=127.0.0.1)
File "python\ray_raylet.pyx", line 528, in ray._raylet.raise_if_dependency_failed
ray.exceptions.RayTaskError: ray::apply_func() (pid=21260, ip=127.0.0.1)
File "python\ray_raylet.pyx", line 528, in ray._raylet.raise_if_dependency_failed
ray.exceptions.RaySystemError: System error: cannot import name 'DataFrame' from partially initialized module 'pandas.core.frame' (most likely due to a circular import) (C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\frame.py)
traceback: Traceback (most recent call last):
File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\ray_private\serialization.py", line 369, in deserialize_objects
obj = self._deserialize_object(data, metadata, object_ref)
File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\ray_private\serialization.py", line 252, in _deserialize_object
return self._deserialize_msgpack_data(data, metadata_fields)
File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\ray_private\serialization.py", line 207, in _deserialize_msgpack_data
python_objects = self._deserialize_pickle5_data(pickle5_data)
File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\ray_private\serialization.py", line 195, in _deserialize_pickle5_data
obj = pickle.loads(in_band, buffers=buffers)
File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\frame.py", line 129, in
from pandas.core import (

...

Installed Versions

INSTALLED VERSIONS

commit : 945c9ed766a61c7d2c0a7cbb251b6edebf9cb7d5
python : 3.9.7.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.14393
machine : AMD64
processor : Intel64 Family 6 Model 85 Stepping 4, GenuineIntel
byteorder : little
LC_ALL : None
LANG : de
LOCALE : de_DE.cp1252

pandas : 1.3.4
numpy : 1.21.5
pytz : 2021.3
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.0.4
Cython : 0.29.25
pytest : None
hypothesis : None
sphinx : 4.4.0
blosc : None
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.9.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.31.1
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.2
fsspec : 2022.01.0
fastparquet : None
gcsfs : None
matplotlib : 3.5.1
numexpr : 2.8.1
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.3
sqlalchemy : 1.4.27
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
numba : None

mvashishtha · 2023-03-30T13:07:17Z

@LudsteckJ thank you for reporting this issue. I get a different ray error ending with ValueError: An application is trying to access a Ray object whose owner is unknown(00ffffffffffffffffffffffffffffffffffffff0100000001000000). Please make sure that all Ray objects you are trying to access are part of the current Ray session[...]

I believe the ray errors you are seeing come from a bug in ray that Modin added a workaround for in #4603. Modin will automatically initialize ray with runtime_env={'env_vars': {'__MODIN_AUTOIMPORT_PANDAS__': '1'}} to work around this bug. If I remove the ray.init() from your script so that Modin initializes ray for you, it works. Will that work for you?
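For setups where calling ray.init() manually cannot be avoided, one possible sketch is to pass the same runtime_env that Modin sets on its own. The helper names below (modin_ray_init_kwargs, init_ray_for_modin) are hypothetical, and whether supplying the variable by hand is sufficient for a given Modin/Ray combination is an assumption that this thread does not confirm. The confirmed fix remains removing ray.init() from the script.

```python
# Sketch only: reuse the env var quoted above when ray.init() must be
# called by user code. Helper names are hypothetical, not Modin API.

MODIN_AUTOIMPORT_ENV = {"__MODIN_AUTOIMPORT_PANDAS__": "1"}


def modin_ray_init_kwargs(**extra):
    """Return ray.init() kwargs that include Modin's auto-import env var.

    Any runtime_env / env_vars passed by the caller are preserved and
    merged with the Modin variable.
    """
    kwargs = dict(extra)
    runtime_env = dict(kwargs.get("runtime_env", {}))
    env_vars = dict(runtime_env.get("env_vars", {}))
    env_vars.update(MODIN_AUTOIMPORT_ENV)
    runtime_env["env_vars"] = env_vars
    kwargs["runtime_env"] = runtime_env
    return kwargs


def init_ray_for_modin(**extra):
    """Start Ray with the merged kwargs unless it is already running."""
    import ray  # imported lazily so the helper above works without Ray

    if not ray.is_initialized():
        ray.init(**modin_ray_init_kwargs(**extra))
```

Under these assumptions, init_ray_for_modin(num_cpus=20) would be called before the first Modin DataFrame is created, in place of the bare ray.init() in the reproducer.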

About the performance, let's follow up on #5905.

LudsteckJ · 2023-03-31T10:48:16Z

Dear Mahesh,

thank you for the prompt reply. Your solution works.

May I proceed here with another issue? It concerns speed. Original pandas is considerably faster for the apply:

%time ho = do.groupby('x1').apply(lambda g: (g.x1 + g.x2).mean())
Wall time: 13.1 s

%time hm = dm.groupby('x1').apply(lambda g: (g.x1 + g.x2).mean())
Wall time: 23.2 s

This is strange, since I work (alone!) on a virtual Windows server with 20 (sic!) CPUs.

Look at the next simple transform command, which should be quite well suited for parallelization. Here is original pandas at work, taking 4.11 seconds:

%time do['s'] = do.groupby(['x1']).x2.transform('sum')
Wall time: 4.11 s

And Modin takes 6.48 seconds:

%time dm['s'] = dm.groupby(['x1']).x2.transform('sum')
Wall time: 6.48 s

It would be extremely helpful to have benchmarks for the most important transformations like groupby(), apply(), transform(), aggregate(), sum() etc.

Thank you and best regards,
Johannes

mvashishtha · 2023-03-31T20:44:03Z

I'm happy to hear that the fix worked, @LudsteckJ.

Let's follow up in #5905 about the performance. Unfortunately Modin can be slower than pandas in some scenarios. Luckily for #5905 we have some promising work in progress that may help this scenario quite a lot.

It would be extremely helpful to have benchmarks for the most important transformations like groupby(), apply(), transform(), aggregate(), sum() etc.

Thank you for this feedback. We do have some benchmarks vs pandas here and documented here, but I don't see benchmarks for groupby's apply and transform. It would be good to add benchmarks for those as well.
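A rough timing harness for the two operations discussed in this thread might look like the sketch below. It is not part of Modin's benchmark suite. The function names best_wall_time and compare_groupby are hypothetical, and the default row count is an arbitrary choice that is much smaller than the 50,000,000 rows in the reproducer.

```python
import time


def best_wall_time(fn, repeats=3):
    """Call fn() `repeats` times and return the smallest wall time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def compare_groupby(n_rows=1_000_000, repeats=3):
    """Time groupby.apply and groupby.transform on pandas and on Modin.

    Imports are local so that best_wall_time() stays usable without
    pandas or Modin installed. Modin is left to initialize its own engine.
    """
    import numpy as np
    import pandas
    import modin.pandas as mpd

    a = np.random.randint(0, high=100, size=(n_rows, 4))
    cols = ["x" + str(i + 1) for i in range(4)]
    frames = {
        "pandas": pandas.DataFrame(a, columns=cols),
        "modin": mpd.DataFrame(a, columns=cols),
    }
    ops = {
        "groupby.apply": lambda df: df.groupby("x1").apply(
            lambda g: (g.x1 + g.x2).mean()
        ),
        "groupby.transform": lambda df: df.groupby("x1").x2.transform("sum"),
    }
    results = {}
    for op_name, op in ops.items():
        for lib, df in frames.items():
            # The lambda is invoked immediately, so op and df are bound correctly.
            results[(op_name, lib)] = best_wall_time(lambda: op(df), repeats)
    return results
```

Calling compare_groupby() returns a dict keyed by (operation, library) with the best wall time for each pair. One caveat: Modin may return before all remote work has finished, so its timings can look optimistic unless the result is materialized inside the timed call.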

I'll close this issue for now because we've found a fix for your bug.

LudsteckJ added the labels "bug 🦗" (Something isn't working) and "Triage 🩹" (Issues that need triage) on Mar 30, 2023

mvashishtha mentioned this issue Mar 30, 2023

PERF: slow SeriesGroupBy.sum() and DataFrameGroupBy.apply() #5905

Closed

mvashishtha added the label "Needs more information ❔" (Issues that require more information from the reporter) and removed the label "Triage 🩹" (Issues that need triage) on Mar 30, 2023

mvashishtha closed this as completed Mar 31, 2023
