BUG: #5904

Closed
3 tasks done
LudsteckJ opened this issue Mar 30, 2023 · 3 comments
Labels
bug 🦗 Something isn't working Needs more information ❔ Issues that require more information from the reporter

Comments

@LudsteckJ

Modin version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest released version of Modin.

  • I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

import numpy as np

import ray
ray.shutdown()
ray.init()

# pdm: MODIN pandas
import modin.pandas as pdm
# pdo: ORIGINAL pandas
import pandas as pdo

a = np.random.randint(0, high=100, size=(50000000, 4))
# do: ORIGINAL pandas DataFrame
do = pdo.DataFrame(a, columns=['x'+str(i+1) for i in range(4)])
# dm: MODIN pandas DataFrame
dm = pdm.DataFrame(a, columns=['x'+str(i+1) for i in range(4)])

# This works as expected:
%time ho = do.groupby('x1').apply(lambda g: (g.x1 + g.x2).mean())

# This produces only error messages but no output:
%time hm = dm.groupby('x1').apply(lambda g: (g.x1 + g.x2).mean())

# speed issues:
%time do['s'] = do.groupby(['x1']).x2.transform('sum')
# This is considerably SLOWER than the standard pandas version above
# on a windows server with 20 CPUs!!!
%time dm['s'] = dm.groupby(['x1']).x2.transform('sum')

%time ho = do[['x1', 'x2', 'x3']].sum()

%time hm = dm[['x1', 'x2', 'x3']].sum()

Issue Description

The following simple apply operation on a Modin DataFrame fails with error messages and produces no output:
dm.groupby('x1').apply(lambda g: (g.x1 + g.x2).mean())
This looks like a bug, since the same call produces sensible results on a standard pandas DataFrame.

Furthermore I perfo

Expected Behavior

The apply() above should yield the same result for a Modin DataFrame as for a pandas DataFrame. Instead, it fails with the errors below.

Error Logs

2023-03-30 09:40:02,073 WARNING worker.py:1866 -- Traceback (most recent call last):
File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\ray_private\import_thread.py", line 204, in fetch_and_execute_function_to_run
function({"worker": self.worker})
File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\modin\engines\ray\utils.py", line 84, in import_pandas
import pandas # noqa F401
File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\__init__.py", line 50, in
from pandas.core.api import (
File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\api.py", line 48, in
from pandas.core.groupby import (
File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\groupby\__init__.py", line 1, in
from pandas.core.groupby.generic import (
File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\groupby\generic.py", line 73, in
from pandas.core.frame import DataFrame
File "", line 1004, in _find_and_load
File "", line 158, in enter
File "", line 103, in acquire
_frozen_importlib._DeadlockError: deadlock detected by _ModuleLock('pandas.core.frame') at 1689683077104

(apply_func pid=21260) 2023-03-30 09:40:02,104 ERROR serialization.py:371 -- cannot import name 'DataFrame' from partially initialized module 'pandas.core.frame' (most likely due to a circular import) (C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\frame.py)
(apply_func pid=21260) Traceback (most recent call last):
(apply_func pid=21260) File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\ray_private\serialization.py", line 369, in deserialize_objects
(apply_func pid=21260) obj = self._deserialize_object(data, metadata, object_ref)
(apply_func pid=21260) File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\ray_private\serialization.py", line 252, in _deserialize_object
(apply_func pid=21260) return self._deserialize_msgpack_data(data, metadata_fields)
(apply_func pid=21260) File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\ray_private\serialization.py", line 207, in _deserialize_msgpack_data
(apply_func pid=21260) python_objects = self._deserialize_pickle5_data(pickle5_data)
(apply_func pid=21260) File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\ray_private\serialization.py", line 195, in _deserialize_pickle5_data
(apply_func pid=21260) obj = pickle.loads(in_band, buffers=buffers)
(apply_func pid=21260) File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\frame.py", line 129, in
(apply_func pid=21260) from pandas.core import (
(apply_func pid=21260) File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\generic.py", line 108, in
(apply_func pid=21260) from pandas.core import (
(apply_func pid=21260) File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\indexing.py", line 46, in
(apply_func pid=21260) import pandas.core.common as com
(apply_func pid=21260) File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\__init__.py", line 50, in
(apply_func pid=21260) from pandas.core.api import (
(apply_func pid=21260) File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\api.py", line 48, in
(apply_func pid=21260) from pandas.core.groupby import (
(apply_func pid=21260) File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\groupby\__init__.py", line 1, in
(apply_func pid=21260) from pandas.core.groupby.generic import (
(apply_func pid=21260) File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\groupby\generic.py", line 73, in
(apply_func pid=21260) from pandas.core.frame import DataFrame
(apply_func pid=21260) ImportError: cannot import name 'DataFrame' from partially initialized module 'pandas.core.frame' (most likely due to a circular import) (C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\frame.py)
2023-03-30 09:40:03,120 ERROR worker.py:399 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::apply_func() (pid=28736, ip=127.0.0.1)
File "python\ray_raylet.pyx", line 528, in ray._raylet.raise_if_dependency_failed
ray.exceptions.RayTaskError: ray::deploy_ray_func() (pid=28736, ip=127.0.0.1)
File "python\ray_raylet.pyx", line 528, in ray._raylet.raise_if_dependency_failed
ray.exceptions.RayTaskError: ray::apply_func() (pid=21260, ip=127.0.0.1)
File "python\ray_raylet.pyx", line 528, in ray._raylet.raise_if_dependency_failed
ray.exceptions.RaySystemError: System error: cannot import name 'DataFrame' from partially initialized module 'pandas.core.frame' (most likely due to a circular import) (C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\frame.py)
traceback: Traceback (most recent call last):
File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\ray_private\serialization.py", line 369, in deserialize_objects
obj = self._deserialize_object(data, metadata, object_ref)
File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\ray_private\serialization.py", line 252, in _deserialize_object
return self._deserialize_msgpack_data(data, metadata_fields)
File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\ray_private\serialization.py", line 207, in _deserialize_msgpack_data
python_objects = self._deserialize_pickle5_data(pickle5_data)
File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\ray_private\serialization.py", line 195, in _deserialize_pickle5_data
obj = pickle.loads(in_band, buffers=buffers)
File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\frame.py", line 129, in
from pandas.core import (

...

Installed Versions

INSTALLED VERSIONS

commit : 945c9ed766a61c7d2c0a7cbb251b6edebf9cb7d5
python : 3.9.7.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.14393
machine : AMD64
processor : Intel64 Family 6 Model 85 Stepping 4, GenuineIntel
byteorder : little
LC_ALL : None
LANG : de
LOCALE : de_DE.cp1252

pandas : 1.3.4
numpy : 1.21.5
pytz : 2021.3
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.0.4
Cython : 0.29.25
pytest : None
hypothesis : None
sphinx : 4.4.0
blosc : None
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.9.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.31.1
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.2
fsspec : 2022.01.0
fastparquet : None
gcsfs : None
matplotlib : 3.5.1
numexpr : 2.8.1
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.3
sqlalchemy : 1.4.27
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
numba : None

@mvashishtha
Collaborator

@LudsteckJ thank you for reporting this issue. I get a different ray error ending with ValueError: An application is trying to access a Ray object whose owner is unknown(00ffffffffffffffffffffffffffffffffffffff0100000001000000). Please make sure that all Ray objects you are trying to access are part of the current Ray session[...]

I believe the ray errors you are seeing come from a bug in ray that Modin added a workaround for in #4603. Modin will automatically initialize ray with runtime_env={'env_vars': {'__MODIN_AUTOIMPORT_PANDAS__': '1'}} to work around this bug. If I remove the ray.init() from your script so that Modin initializes ray for you, it works. Will that work for you?
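For anyone who still needs to call ray.init() themselves (for example to pass other options), here is a rough sketch of what the workaround above might look like. It assumes that passing the same runtime_env quoted in this comment to your own ray.init() call has the same effect as letting Modin initialize Ray; that is not confirmed in this thread. The helper name init_ray_for_modin is made up for illustration.

```python
import importlib.util

# The runtime_env quoted in the comment above. Assumption: passing it to a
# manual ray.init() behaves like Modin's own automatic initialization.
MODIN_RUNTIME_ENV = {"env_vars": {"__MODIN_AUTOIMPORT_PANDAS__": "1"}}


def init_ray_for_modin():
    """Hypothetical helper: start Ray with the env var Modin would set.

    Returns True if Ray is installed and initialized, False if Ray is missing.
    """
    if importlib.util.find_spec("ray") is None:
        return False  # Ray is not installed; nothing to initialize.

    import ray

    # Do not call ray.shutdown()/ray.init() unconditionally as in the
    # original script; only initialize when no session is running yet.
    if not ray.is_initialized():
        ray.init(runtime_env=MODIN_RUNTIME_ENV)
    return True
```

The simpler option, and the one confirmed to work in this thread, is to drop the ray.shutdown() and ray.init() lines entirely and let `import modin.pandas` set up Ray.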

About the performance, let's follow up on #5905.

@mvashishtha mvashishtha added Needs more information ❔ Issues that require more information from the reporter and removed Triage 🩹 Issues that need triage labels Mar 30, 2023
@LudsteckJ
Author

LudsteckJ commented Mar 31, 2023 via email

@mvashishtha
Collaborator

I'm happy to hear that the fix worked, @LudsteckJ.

Let's follow up in #5905 about the performance. Unfortunately Modin can be slower than pandas in some scenarios. Luckily for #5905 we have some promising work in progress that may help this scenario quite a lot.

It would be extremely helpful to have benchmarks for the most important transformations like groupby(), apply(), transform(), aggregate(), sum() etc.

Thank you for this feedback. We do have some benchmarks vs pandas here and documented here, but I don't see benchmarks for groupby's apply and transform. It would be good to add benchmarks for those as well.
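Until such benchmarks exist, a small timing helper can stand in for the %time magic used in the report when running outside IPython. This is only a sketch: time_call is a hypothetical name, and the demo times a plain Python sum so it runs without pandas or Modin installed.

```python
import time


def time_call(fn, *args, repeat=3, **kwargs):
    """Call fn(*args, **kwargs) `repeat` times.

    Returns (best_seconds, result_of_last_call).
    """
    best = float("inf")
    result = None
    for _ in range(repeat):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best, result


# Stand-in workload so the sketch runs without pandas or Modin.
best, total = time_call(sum, range(1_000_000))
print(f"best of 3: {best:.4f}s")

# With the frames from the report, the calls might look like:
#   time_call(lambda: do.groupby("x1").x2.transform("sum"))
#   time_call(lambda: dm.groupby("x1").x2.transform("sum"))
```

One caveat: Modin may run some operations asynchronously, so timing the call alone could understate the real cost. Forcing the result to materialize inside the timed function (for example by converting it to pandas) should give a fairer comparison.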

I'll close this issue for now because we've found a fix for your bug.
