FIX-#4564: Workaround import issues in Ray: auto-import pandas on python start if env var is set #4603

vnlitvinov · 2022-06-24T13:10:51Z

What do these changes do?

Based on what I found in #4564 (comment), I think the real root cause of #4564 (and of other similar issues where pandas cannot be imported) is the Python bug https://bugs.python.org/issue38884, which arises from the fact that they seem to have dropped the global import lock in favor of per-module locks in Python 3.3+.

This PR aims to work around that by doing the following:

Create a special .pth file that ships with the package. On installation it is placed under site-packages automatically and, when processed by the Python interpreter, it imports pandas if a special environment variable is set
Change setup.py so that this file is part of both source and binary distributions
Initialize Ray so that it sets this environment variable on each worker
If Ray was pre-initialized, check its configuration and warn the user to set the variables if they don't match
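
For illustration only, the .pth mechanism from the first item can be sketched with the standard library alone. The variable name `__DEMO_AUTOIMPORT_WORKAROUND__`, the file name `demo-autoimport.pth`, and the use of `colorsys` as a stand-in for pandas are assumptions made for this sketch, not what the PR ships:

```python
import os
import site
import sys
import tempfile

ENV_VAR = "__DEMO_AUTOIMPORT_WORKAROUND__"  # hypothetical name, for this demo only
TARGET = "colorsys"  # small stdlib module standing in for pandas


def run_pth_demo(env_value):
    """Write a .pth file, let `site` process it, and report whether TARGET was imported."""
    sys.modules.pop(TARGET, None)  # start from a clean state
    if env_value is None:
        os.environ.pop(ENV_VAR, None)
    else:
        os.environ[ENV_VAR] = env_value

    saved_path = list(sys.path)
    try:
        with tempfile.TemporaryDirectory() as sitedir:
            pth_path = os.path.join(sitedir, "demo-autoimport.pth")
            with open(pth_path, "w") as f:
                # `site` executes any .pth line that starts with "import ".
                f.write(
                    f"import os; os.environ.get('{ENV_VAR}') == '1' "
                    f"and __import__('{TARGET}')\n"
                )
            # Process the directory the same way site-packages is processed at startup.
            site.addsitedir(sitedir)
    finally:
        sys.path[:] = saved_path  # undo the sys.path entry added by addsitedir
    return TARGET in sys.modules


if __name__ == "__main__":
    print("variable set:  ", run_pth_demo("1"))
    print("variable unset:", run_pth_demo(None))
```

In a real installation the interpreter processes site-packages once at startup, which is why the variable has to be present in the worker's environment before the worker process starts.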

commit message follows format outlined here
passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with git commit -s
Resolves BUG: ray 1.13 breaks CI, including test_binary and test_pickle, with circular import #4564
tests added and passing
module layout described at docs/development/architecture.rst is up-to-date
added (Issue Number: PR title (PR Number)) and github username to release notes for next major release

Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>

codecov · 2022-06-24T13:31:42Z

Codecov Report

Merging #4603 (01682b0) into master (7181569) will increase coverage by 3.13%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #4603      +/-   ##
==========================================
+ Coverage   86.56%   89.69%   +3.13%     
==========================================
  Files         230      231       +1     
  Lines       18470    18734     +264     
==========================================
+ Hits        15988    16803     +815     
+ Misses       2482     1931     -551

Impacted Files	Coverage Δ
modin/utils.py	`93.66% <ø> (+13.93%)`	⬆️
modin/config/envvars.py	`89.10% <100.00%> (+3.46%)`	⬆️
modin/core/execution/ray/common/utils.py	`95.23% <100.00%> (-1.64%)`	⬇️
modin/experimental/batch/test/test_pipeline.py	`100.00% <0.00%> (ø)`
modin/pandas/base.py	`95.30% <0.00%> (+0.08%)`	⬆️
...mentations/pandas_on_ray/partitioning/partition.py	`93.57% <0.00%> (+1.83%)`	⬆️
...tations/pandas_on_python/partitioning/partition.py	`93.75% <0.00%> (+2.08%)`	⬆️
...entations/pandas_on_dask/partitioning/partition.py	`91.46% <0.00%> (+2.43%)`	⬆️
modin/pandas/__init__.py	`69.69% <0.00%> (+3.03%)`	⬆️
... and 15 more

devin-petersohn

This looks pretty good, although it introduces minor friction for users who want to get started and initialize Ray themselves. We probably need extra documentation or something, because we recommend that users start the engine themselves.

Can you add runtime env to an already running Ray cluster?

prutskov

How does it work if I haven't installed modin? Suppose I'm working in the modin root folder.

vnlitvinov · 2022-06-24T13:48:40Z

Can you add runtime env to an already running Ray cluster?

You can set the runtime env even on a per-task basis, but in this case it won't have any particular effect, as the interpreter in each worker has already been initialized.

As for docs... I'm open to suggestions on where to put things.

How does it work if I haven't installed modin? Suppose I'm working in the modin root folder.

Obviously, it won't have the workaround effect, but TBH this isn't the main use case, plus I personally never saw this issue locally. If someone is really bothered by this issue, one could pip install . to enable the workaround.

Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>

vnlitvinov · 2022-06-24T13:54:22Z

To hint users on how to initialize Ray, this is what the current (in this PR) warning looks like:

>>> import modin.pandas as pd
UserWarning: The pandas version installed 1.4.2 does not match the supported pandas version in Modin 1.4.3. This may cause undesired side effects!
>>> df=pd.DataFrame({})
UserWarning: Ray execution environment not yet initialized. Initializing...
To remove this warning, run the following python code before doing dataframe operations:

    import ray
    ray.init(runtime_env={'env_vars': {'__MODIN_AUTOIMPORT_PANDAS_WORKAROUND__': '1'}})

UserWarning: Distributing <class 'dict'> object. This may take some time.

Garra1980 · 2022-06-24T14:28:12Z

Whoa, poor user, so much extra info...

vnlitvinov · 2022-06-24T15:22:17Z

It was like this for quite a while, but before this PR it simply stated ray.init() without parameters:

UserWarning: The pandas version installed 1.4.2 does not match the supported pandas version in Modin 1.4.3. This may cause undesired side effects!
UserWarning: Ray execution environment not yet initialized. Initializing...
To remove this warning, run the following python code before doing dataframe operations:

    import ray
    ray.init()

UserWarning: Distributing <class 'dict'> object. This may take some time.

YarShev · 2022-06-27T15:31:03Z

What do you think about leaving the warning as is (just import ray; ray.init()) for users who initialize Ray themselves, and putting a paragraph about the race condition on the troubleshooting page?

Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>

…ports

vnlitvinov · 2022-06-27T16:39:04Z

I don't think adding a little bit of info to our warning would do any harm, but I do think it would harm the user experience if users ran into spurious failures before realizing that there is a troubleshooting guide covering them...

Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>

prutskov · 2022-06-27T17:07:25Z

modin/core/execution/ray/common/utils.py

@@ -138,17 +87,16 @@ def initialize_ray(
                include_dashboard=False,
                ignore_reinit_error=True,
                _redis_password=redis_password,
+                **extra_init_kw,


Did you check this case? I can't find any indication that extra parameters can be provided in the case of an existing cluster: https://github.com/ray-project/ray/blob/master/python/ray/_private/worker.py#L1400-L1412

I hadn't, but apparently extra ones would just get ignored, and, if in the future we'll add more arguments than runtime_env it would be useful.

As for the runtime environment being different, there is code later on that checks the variables. It should give a warning to the user.

I hadn't, but apparently extra ones would just get ignored, and, if in the future we'll add more arguments than runtime_env it would be useful.

Can you elaborate why it would be useful?

As for the runtime environment being different, there is code later on that checks the variables. It should give a warning to the user.

Which code do you mean?

Can you elaborate why it would be useful?

This creates one central place to add more arguments to ray.init() instead of copy-pasting them in several different places.
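
A rough sketch of that idea, assuming a stand-in `fake_init` in place of `ray.init()` so that it runs without Ray installed. The helper names `init_cluster` and `init_local` and the shared dict `EXTRA_INIT_KW` are hypothetical and only illustrate the shape of the approach:

```python
# Hypothetical stand-in for ray.init(); it only records the keyword arguments it got.
def fake_init(**kwargs):
    return kwargs


# One shared dict holds every argument that all init call sites must receive.
EXTRA_INIT_KW = {
    "runtime_env": {
        "env_vars": {"__MODIN_AUTOIMPORT_PANDAS_WORKAROUND__": "1"},
    },
}


def init_cluster(address):
    # Call site 1: connecting to an existing cluster.
    return fake_init(address=address, ignore_reinit_error=True, **EXTRA_INIT_KW)


def init_local(num_cpus):
    # Call site 2: starting a local instance.
    return fake_init(num_cpus=num_cpus, ignore_reinit_error=True, **EXTRA_INIT_KW)


if __name__ == "__main__":
    print(init_cluster("auto"))
    print(init_local(4))
```

Adding another shared argument then means editing `EXTRA_INIT_KW` once instead of touching every call site.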

Which code do you mean?

modin/modin/core/execution/ray/common/utils.py

Lines 170 to 177 in e426ce0

    else:  # ray is already initialized, check runtime env config
        env_vars = ray.get_runtime_context().runtime_env.get("env_vars", {})
        for varname, varvalue in extra_init_kw["runtime_env"]["env_vars"].items():
            if str(env_vars.get(varname, "")) != str(varvalue):
                ErrorMessage.single_warning(
                    "If initialising Ray yourself, please ensure its runtime env "
                    + f"sets environment variable {varname} to {varvalue}"
                )
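
The check quoted above boils down to a string comparison between the variables the runtime env reports and the ones Modin expects. A minimal sketch with plain dicts, no Ray required; `find_mismatched_env_vars` is a hypothetical helper name, not a function from Modin:

```python
import warnings


def find_mismatched_env_vars(actual_env_vars, expected_env_vars):
    """Return the expected variable names whose value is missing or different.

    Values are compared as strings, so 1 and "1" count as equal.
    """
    return [
        name
        for name, value in expected_env_vars.items()
        if str(actual_env_vars.get(name, "")) != str(value)
    ]


def warn_on_mismatch(actual_env_vars, expected_env_vars):
    # Emit one warning per variable that the pre-initialized runtime env lacks.
    for name in find_mismatched_env_vars(actual_env_vars, expected_env_vars):
        warnings.warn(
            "If initialising Ray yourself, please ensure its runtime env "
            f"sets environment variable {name} to {expected_env_vars[name]}"
        )


if __name__ == "__main__":
    expected = {"__MODIN_AUTOIMPORT_PANDAS_WORKAROUND__": "1"}
    warn_on_mismatch({}, expected)  # warns: variable missing
    warn_on_mismatch({"__MODIN_AUTOIMPORT_PANDAS_WORKAROUND__": 1}, expected)  # silent
```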

We don't hit this branch in case of cluster init.

Huh?.. we either call ray.init() or hit this else: branch.

We hit ray.init() at line 84 (under if cluster:) or at line 158 (in else branch of that if cluster).

I was just wondering if we should check the runtime environment after ray.init() in case Modin itself initializes Ray? Because of this.

Did you check this case? I can't find that extra parameters could be provided in case of existing cluster https://github.com/ray-project/ray/blob/master/python/ray/_private/worker.py#L1400-L1412

I hadn't, but apparently extra ones would just get ignored

good point!

modin/core/execution/ray/common/utils.py

Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>

setup.py

YarShev · 2022-06-27T19:28:22Z

As for docs... I'm open to suggestions on where to put things.

What about modin.config - https://modin.readthedocs.io/en/stable/flow/modin/config.html?

YarShev · 2022-07-04T16:24:28Z

@vnlitvinov, that is fine with me. Please also take into account my previous comments, which are still unresolved.

devin-petersohn

@vnlitvinov looks great! There are a couple of comments here about cluster-related handling.

RehanSD

Left a quick change as well as a question

modin/core/execution/ray/common/utils.py

RehanSD · 2022-07-07T07:46:03Z

modin/core/execution/ray/common/utils.py

-    )
-    ray.worker.global_worker.run_function_on_all_workers(_import_pandas)
+    else:  # ray is already initialized, check runtime env config
+        env_vars = ray.get_runtime_context().runtime_env.get("env_vars", {})


This only checks that the head node's env vars are set correctly, right? Don't we need to set the environment variable on all of the workers/nodes, and if so, shouldn't we be checking that the env vars are set correctly on all of the workers? We could probably do that using the Ray function that runs on all workers.

No, .runtime_env is the thing which tells Ray which environment variables to set for its workers. It should set these variables for the workers automatically.

Co-authored-by: Rehan Sohail Durrani <rdurrani@berkeley.edu>

Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>

docs/getting_started/troubleshooting.rst

Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>

…ports

YarShev · 2022-07-07T17:02:03Z

@vnlitvinov, the build docs job failed because of AttributeError: partially initialized module 'pandas' has no attribute 'core' (most likely due to a circular import). Do you know why that happened?

vnlitvinov · 2022-07-07T17:13:25Z

Huh, when I look at the CI of the last commit (887a5f4) docs are built fine. Am I missing things?

YarShev · 2022-07-07T17:23:13Z

Here is the job https://github.com/modin-project/modin/runs/7237231595?check_suite_focus=true.

vnlitvinov · 2022-07-07T17:32:53Z

Thanks. I've missed an empty line after .. code-block:: thingy, which broke sphinx. Fixed.

Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>

YarShev

@vnlitvinov, LGTM, thanks!

Garra1980 · 2022-07-07T20:53:14Z

So green light for python 3.10 support? :)

vnlitvinov · 2022-07-08T15:57:01Z

Technically yes, though note that Ray 1.13 still has not been published on conda-forge...

vnlitvinov added 4 commits June 24, 2022 16:09

FIX-modin-project#4564: Add auto-import pandas if env variable set

e2fe3c5

Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>

Pass __MODIN_AUTOIMPORT_PANDAS_WORKAROUND__=1 when initializing Ray

cddfba1

Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>

Remove pandas import workaround

a8d568d

Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>

Unpin Ray version

848d8b2

Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>

vnlitvinov requested a review from a team as a code owner June 24, 2022 13:10

Add release note

843dbbb

Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>

vnlitvinov changed the title ~~Workaround import issues in Ray: auto-import pandas on python start if env var is set~~ FIX-#4564: Workaround import issues in Ray: auto-import pandas on python start if env var is set (#4603) Jun 24, 2022

vnlitvinov changed the title ~~FIX-#4564: Workaround import issues in Ray: auto-import pandas on python start if env var is set (#4603)~~ FIX-#4564: Workaround import issues in Ray: auto-import pandas on python start if env var is set Jun 24, 2022

devin-petersohn reviewed Jun 24, 2022

prutskov reviewed Jun 24, 2022

Fix ErrorMessage import

965c5b9

Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>

vnlitvinov added the Ready for review label Jun 24, 2022

fishbone reviewed Jun 24, 2022

YarShev reviewed Jun 25, 2022

YarShev reviewed Jun 27, 2022

vnlitvinov added 2 commits June 27, 2022 19:35

Simplify env variable name, drop stdlib workaround

2eb728f

Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>

Merge remote-tracking branch 'upstream/master' into workaround-ray-im…

158f0f3

…ports

Use public Ray API

2373716

Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>

prutskov reviewed Jun 27, 2022

vnlitvinov added 2 commits June 27, 2022 20:24

Remove unneeded import

23bdcec

Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>

Merge branch 'master' into workaround-ray-imports

e426ce0

YarShev reviewed Jun 27, 2022

mvashishtha mentioned this pull request Jul 5, 2022

BUG: ray 1.13 breaks CI, including test_binary and test_pickle, with circular import #4564

Closed

devin-petersohn reviewed Jul 5, 2022

RehanSD reviewed Jul 7, 2022

vnlitvinov and others added 3 commits July 7, 2022 16:53

Update modin/core/execution/ray/common/utils.py

0ac7c82

Co-authored-by: Rehan Sohail Durrani <rdurrani@berkeley.edu>

Add troubleshooting section

995febd

Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>

Use variable for extra files to ship

49029e2

Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>

vnlitvinov force-pushed the workaround-ray-imports branch from 74ebc29 to 49029e2 Compare July 7, 2022 15:01

YarShev reviewed Jul 7, 2022

vnlitvinov added 2 commits July 7, 2022 19:02

Fix doc, always check ray runtime_env

2a8b4f7

Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>

Merge remote-tracking branch 'upstream/master' into workaround-ray-im…

887a5f4

…ports

Fix sphinx docs build

01682b0

Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>

YarShev approved these changes Jul 7, 2022

YarShev merged commit 05933a5 into modin-project:master Jul 7, 2022

vnlitvinov deleted the workaround-ray-imports branch July 8, 2022 15:46

mvashishtha mentioned this pull request Jul 14, 2022

silence warnings completely #4596

Closed

YarShev mentioned this pull request Sep 19, 2022

FIX-#3599: Get rid of redundant function calls on Ray workers #3600

Closed

7 tasks

mvashishtha mentioned this pull request Oct 12, 2022

Get rid of redundant function calls on Ray workers #3599

Closed

vnlitvinov mentioned this pull request Oct 18, 2022

BUG: UserWarning or ImportError when using Modin on a pre-initialized Ray Cluster #5131

Open

3 tasks

vnlitvinov mentioned this pull request Oct 26, 2022

[PERF] Why is the first read_csv call slower than subsequent read_csv calls? #5157

Closed

YarShev mentioned this pull request Dec 17, 2022

[BUG] AttributeError: module 'pandas' has no attribute 'core' #5466

Closed

mvashishtha mentioned this pull request Mar 30, 2023

BUG: #5904

Closed

3 tasks

pcmoritz mentioned this pull request Jun 8, 2023

[Core][deprecate run_function_on_all_workers 3/n] delete run_function_on_all_workers ray-project/ray#30895

Merged

7 tasks
