BUG: #5904
Comments
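@LudsteckJ thank you for reporting this issue. I get a different ray error ending with ValueError: An application is trying to access a Ray object whose owner is unknown(00ffffffffffffffffffffffffffffffffffffff0100000001000000). Please make sure that all Ray objects you are trying to access are part of the current Ray session[...]
I believe the ray errors you are seeing come from a bug in ray that Modin added a workaround for in #4603. Modin will automatically initialize ray with runtime_env={'env_vars': {'__MODIN_AUTOIMPORT_PANDAS__': '1'}} to work around this bug. If I remove the ray.init() from your script so that Modin initializes ray for you, it works. Will that work for you?
About the performance, let's follow up on #5905.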
Dear Mahesh,
thank you for the prompt reply. Your solution works.
May I proceed here with another issue? It concerns speed.
Original pandas is considerably faster for the apply:
%time ho = do.groupby('x1').apply(lambda g: (g.x1 + g.x2).mean())
Wall time: 13.1 s
%time hm = dm.groupby('x1').apply(lambda g: (g.x1 + g.x2).mean())
Wall time: 23.2 s
This is strange, since I work (alone!) on a virtual Windows server with 20 (sic!) CPUs.
Look at the next simple transform command, which should be quite well suited to parallelization.
Here is original pandas at work, taking 4.11 seconds:
%time do['s'] = do.groupby(['x1']).x2.transform('sum')
Wall time: 4.11 s
And Modin takes 6.48 seconds:
%time dm['s'] = dm.groupby(['x1']).x2.transform('sum')
Wall time: 6.48 s
It would be extremely helpful to have benchmarks for the most important transformations, like
groupby(), apply(), transform(), aggregate(), sum(), etc.
Thank you and best regards,
Johannes
From: Mahesh Vashishtha ***@***.***>
Sent: Thursday, 30 March 2023 15:07
To: modin-project/modin ***@***.***>
Cc: Ludsteck Johannes ***@***.***>; Mention ***@***.***>
Subject: Re: [modin-project/modin] BUG: (Issue #5904)
@LudsteckJ thank you for reporting this issue. I get a different ray error ending with ValueError: An application is trying to access a Ray object whose owner is unknown(00ffffffffffffffffffffffffffffffffffffff0100000001000000). Please make sure that all Ray objects you are trying to access are part of the current Ray session[...]
I believe the ray errors you are seeing come from a bug in ray that Modin added a workaround for in #4603. Modin will automatically initialize ray with runtime_env={'env_vars': {'__MODIN_AUTOIMPORT_PANDAS__': '1'}} to work around this bug. If I remove the ray.init() from your script so that Modin initializes ray for you, it works. Will that work for you?
About the performance, let's follow up on #5905.
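For readers who need to call ray.init() themselves (for example, to pass other options), a minimal sketch follows. It assumes that passing the same runtime_env Modin uses is enough to get the workaround; the thread only confirms that removing ray.init() works, so treat this as untested. The helper name init_ray_like_modin is hypothetical.

```python
import importlib.util

# The runtime_env quoted in the comment above. Modin sets it on its own
# when it initializes Ray for you.
MODIN_RUNTIME_ENV = {"env_vars": {"__MODIN_AUTOIMPORT_PANDAS__": "1"}}


def init_ray_like_modin(**ray_kwargs):
    """Hypothetical helper: start Ray with the runtime_env Modin would use.

    Returns the ray module, or None if Ray is not installed.
    Assumption: supplying this runtime_env manually mirrors Modin's workaround.
    """
    if importlib.util.find_spec("ray") is None:
        return None
    import ray

    # Only initialize once; a second ray.init() call would raise an error.
    if not ray.is_initialized():
        ray.init(runtime_env=MODIN_RUNTIME_ENV, **ray_kwargs)
    return ray
```

The simpler option, and the one confirmed in this thread, is to drop the explicit ray.init() call and let Modin start Ray.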
I'm happy to hear that the fix worked, @LudsteckJ. Let's follow up in #5905 about the performance. Unfortunately Modin can be slower than pandas in some scenarios. Luckily for #5905 we have some promising work in progress that may help this scenario quite a lot.
Thank you for this feedback. We do have some benchmarks vs pandas here and documented here, but I don't see benchmarks for groupby's apply and transform. It would be good to add benchmarks for those as well. I'll close this issue for now because we've found a fix for your bug.
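A benchmark for the two groupby operations discussed in this thread could look roughly like the sketch below. It is not part of Modin's benchmark suite; the function names, data sizes, and the extra "key" column are assumptions made for illustration. The "key" column duplicates x1 so that g.x1 remains available inside apply() on pandas versions that drop the grouping column from each group.

```python
import time

import numpy as np
import pandas as pd


def make_frame(pd_module, n_rows=1_000_000, n_groups=10_000, seed=0):
    """Build a synthetic frame; the sizes are arbitrary, not the reporter's data."""
    rng = np.random.default_rng(seed)
    x1 = rng.integers(0, n_groups, size=n_rows)
    return pd_module.DataFrame({"key": x1, "x1": x1, "x2": rng.random(n_rows)})


def best_time(func, repeat=3):
    """Return the fastest wall-clock time of `repeat` calls, in seconds."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def benchmark(pd_module, n_rows=1_000_000, n_groups=10_000, repeat=3):
    """Time groupby.apply and groupby.transform for a pandas-like module."""
    df = make_frame(pd_module, n_rows, n_groups)
    return {
        "groupby.apply": best_time(
            lambda: df.groupby("key").apply(lambda g: (g.x1 + g.x2).mean()),
            repeat,
        ),
        "groupby.transform": best_time(
            lambda: df.groupby("key").x2.transform("sum"),
            repeat,
        ),
    }
```

Calling benchmark(pd) times plain pandas; passing modin.pandas instead should time Modin on the same synthetic data. Because Modin may execute work asynchronously, its timings may need the result to be materialized before the clock stops to be comparable.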
Modin version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest released version of Modin.
I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)
Reproducible Example
Issue Description
The following simple apply operation on a Modin DataFrame
dm.groupby('x1').apply(lambda g: (g.x1 + g.x2).mean())
fails with error messages. This must be a bug, since the same method produces sensible results for a standard pandas DataFrame.
Furthermore I perfo
Expected Behavior
The apply() above should yield the same result for a Modin DataFrame as for a pandas DataFrame. Instead, it fails.
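To make the expected result concrete, here is a small sketch on hypothetical data (the reporter's original frame is not shown on this page). It computes the same per-group mean of x1 + x2 without apply(), which avoids depending on whether the grouping column is passed to the applied function.

```python
import pandas as pd

# Hypothetical data, chosen only so the result is easy to check by hand.
do = pd.DataFrame({"x1": [1, 1, 2], "x2": [1.0, 3.0, 5.0]})


def group_mean_of_sum(df):
    """Per-group mean of x1 + x2, grouped by x1.

    Intended to match df.groupby('x1').apply(lambda g: (g.x1 + g.x2).mean()).
    """
    return (df["x1"] + df["x2"]).groupby(df["x1"]).mean()


expected = group_mean_of_sum(do)
# Group 1: mean(1 + 1.0, 1 + 3.0) = 3.0; group 2: 2 + 5.0 = 7.0
```

A Modin DataFrame built from the same data would be expected to return the same values.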
Error Logs
2023-03-30 09:40:02,073 WARNING worker.py:1866 -- Traceback (most recent call last):
  File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\ray\_private\import_thread.py", line 204, in fetch_and_execute_function_to_run
    function({"worker": self.worker})
  File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\modin\engines\ray\utils.py", line 84, in import_pandas
    import pandas  # noqa F401
  File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\__init__.py", line 50, in <module>
    from pandas.core.api import (
  File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\api.py", line 48, in <module>
    from pandas.core.groupby import (
  File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\groupby\__init__.py", line 1, in <module>
    from pandas.core.groupby.generic import (
  File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\groupby\generic.py", line 73, in <module>
    from pandas.core.frame import DataFrame
  File "<frozen importlib._bootstrap>", line 1004, in _find_and_load
  File "<frozen importlib._bootstrap>", line 158, in __enter__
  File "<frozen importlib._bootstrap>", line 103, in acquire
_frozen_importlib._DeadlockError: deadlock detected by _ModuleLock('pandas.core.frame') at 1689683077104
(apply_func pid=21260) 2023-03-30 09:40:02,104 ERROR serialization.py:371 -- cannot import name 'DataFrame' from partially initialized module 'pandas.core.frame' (most likely due to a circular import) (C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\frame.py)
(apply_func pid=21260) Traceback (most recent call last):
(apply_func pid=21260)   File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\ray\_private\serialization.py", line 369, in deserialize_objects
(apply_func pid=21260)     obj = self._deserialize_object(data, metadata, object_ref)
(apply_func pid=21260)   File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\ray\_private\serialization.py", line 252, in _deserialize_object
(apply_func pid=21260)     return self._deserialize_msgpack_data(data, metadata_fields)
(apply_func pid=21260)   File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\ray\_private\serialization.py", line 207, in _deserialize_msgpack_data
(apply_func pid=21260)     python_objects = self._deserialize_pickle5_data(pickle5_data)
(apply_func pid=21260)   File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\ray\_private\serialization.py", line 195, in _deserialize_pickle5_data
(apply_func pid=21260)     obj = pickle.loads(in_band, buffers=buffers)
(apply_func pid=21260)   File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\frame.py", line 129, in <module>
(apply_func pid=21260)     from pandas.core import (
(apply_func pid=21260)   File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\generic.py", line 108, in <module>
(apply_func pid=21260)     from pandas.core import (
(apply_func pid=21260)   File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\indexing.py", line 46, in <module>
(apply_func pid=21260)     import pandas.core.common as com
(apply_func pid=21260)   File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\__init__.py", line 50, in <module>
(apply_func pid=21260)     from pandas.core.api import (
(apply_func pid=21260)   File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\api.py", line 48, in <module>
(apply_func pid=21260)     from pandas.core.groupby import (
(apply_func pid=21260)   File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\groupby\__init__.py", line 1, in <module>
(apply_func pid=21260)     from pandas.core.groupby.generic import (
(apply_func pid=21260)   File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\groupby\generic.py", line 73, in <module>
(apply_func pid=21260)     from pandas.core.frame import DataFrame
(apply_func pid=21260) ImportError: cannot import name 'DataFrame' from partially initialized module 'pandas.core.frame' (most likely due to a circular import) (C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\frame.py)
2023-03-30 09:40:03,120 ERROR worker.py:399 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::apply_func() (pid=28736, ip=127.0.0.1)
  File "python\ray\_raylet.pyx", line 528, in ray._raylet.raise_if_dependency_failed
ray.exceptions.RayTaskError: ray::deploy_ray_func() (pid=28736, ip=127.0.0.1)
  File "python\ray\_raylet.pyx", line 528, in ray._raylet.raise_if_dependency_failed
ray.exceptions.RayTaskError: ray::apply_func() (pid=21260, ip=127.0.0.1)
  File "python\ray\_raylet.pyx", line 528, in ray._raylet.raise_if_dependency_failed
ray.exceptions.RaySystemError: System error: cannot import name 'DataFrame' from partially initialized module 'pandas.core.frame' (most likely due to a circular import) (C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\frame.py)
traceback: Traceback (most recent call last):
  File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\ray\_private\serialization.py", line 369, in deserialize_objects
    obj = self._deserialize_object(data, metadata, object_ref)
  File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\ray\_private\serialization.py", line 252, in _deserialize_object
    return self._deserialize_msgpack_data(data, metadata_fields)
  File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\ray\_private\serialization.py", line 207, in _deserialize_msgpack_data
    python_objects = self._deserialize_pickle5_data(pickle5_data)
  File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\ray\_private\serialization.py", line 195, in _deserialize_pickle5_data
    obj = pickle.loads(in_band, buffers=buffers)
  File "C:\Anwendungen\Anaconda\envs\hans\lib\site-packages\pandas\core\frame.py", line 129, in <module>
    from pandas.core import (
...
Installed Versions
INSTALLED VERSIONS
commit : 945c9ed766a61c7d2c0a7cbb251b6edebf9cb7d5
python : 3.9.7.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.14393
machine : AMD64
processor : Intel64 Family 6 Model 85 Stepping 4, GenuineIntel
byteorder : little
LC_ALL : None
LANG : de
LOCALE : de_DE.cp1252
pandas : 1.3.4
numpy : 1.21.5
pytz : 2021.3
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.0.4
Cython : 0.29.25
pytest : None
hypothesis : None
sphinx : 4.4.0
blosc : None
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.9.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.31.1
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.2
fsspec : 2022.01.0
fastparquet : None
gcsfs : None
matplotlib : 3.5.1
numexpr : 2.8.1
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.3
sqlalchemy : 1.4.27
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
numba : None