FEAT-#7021: Implement to/from_dask functions #7022
Conversation
Signed-off-by: Kirill Suvorov <kirill.suvorov@intel.com>
modin/core/io/io.py
Outdated
@@ -140,6 +140,30 @@ def from_ray_dataset(cls, ray_obj):
            "Modin Dataframe can only be converted to a Ray Dataset if Modin uses a Ray engine."
        )

    @classmethod
    def from_dask_dataframe(cls, dask_obj):
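Judging by the from_ray_dataset error message just above it, the base-class default added here likely just raises until an engine-specific IO class overrides it. A minimal sketch under that assumption (the class name and message text are illustrative, not Modin's exact code):

```python
# Sketch of a base-IO default for an engine-specific conversion.
# BaseIO and the error message are assumptions mirroring from_ray_dataset;
# the Dask engine's IO class would override this method.
class BaseIO:
    @classmethod
    def from_dask_dataframe(cls, dask_obj):
        raise RuntimeError(
            "Modin DataFrame can only be created from a Dask DataFrame "
            "if Modin uses a Dask engine."
        )
```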
Let's add a test that catches these exceptions (for example, when Ray is being used).
No, in a previous PR I implemented the same test for Ray, but it was requested to be removed (you can see that discussion here).
import dask.dataframe as dd
from distributed.client import default_client
Let's use DaskWrapper to interact with the Dask API.
Are you sure? As I see it, we use the public API without wrapping it in a Wrapper, for both Dask and Ray.
Also, DaskWrapper has the annotation "The class responsible for execution of remote operations.", but creating a Dask DataFrame does not fit that description.
Considering your wishes, I can suggest changing import dask.dataframe as dd to from dask.dataframe import from_delayed. What do you think about it?
Are you sure? As I see it, we use the public API without wrapping it in a Wrapper, for both Dask and Ray.
Also, DaskWrapper has the annotation "The class responsible for execution of remote operations.", but creating a Dask DataFrame does not fit that description.
Yes, I'm sure. It's okay to change the description because that's what these classes were designed for. All places where the API is used directly need to be updated to work through a wrapper.
Ok, I see the reasons for this, so I just want to clarify. In this case, we just create the same method in DaskWrapper that proxies the call to the Dask DataFrame API, like this:
@classmethod
def from_delayed(cls, partitions):
    return dd.from_delayed(partitions)
Is that correct?
Do I need to create an issue to refactor the other public API calls?
Do I need to create an issue to refactor the other public API calls?
It would be great.
In this case, we just create the same method in DaskWrapper that proxies the call to the Dask DataFrame API, like this:
We discussed it offline, this can also be done separately.
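To summarize the thread, the agreed shape of the proxy (a sketch, not Modin's actual DaskWrapper) would forward to the public Dask API, return the result, and can defer the import so dask.dataframe is only needed when the method is actually called:

```python
class DaskWrapper:
    """Sketch of a wrapper that routes Dask API calls through one place."""

    @classmethod
    def from_delayed(cls, partitions):
        # Deferred import: dask.dataframe is only required at call time.
        from dask.dataframe import from_delayed

        # Note the return, which the snippet in the thread was missing.
        return from_delayed(partitions)
```

Routing all such calls through one wrapper class is what makes the later refactoring issue (moving every direct Dask API use behind the wrapper) tractable.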
@@ -31,7 +31,7 @@
 from modin import pandas as pd
 from modin.error_message import ErrorMessage
 from modin.logging import ClassLogger
-from modin.pandas.io import to_ray_dataset
+from modin.pandas.io import to_dask, to_ray_dataset
Check notice (Code scanning / CodeQL): Cyclic import in modin.pandas.io
setup.py
Outdated
@@ -5,7 +5,7 @@
 with open("README.md", "r", encoding="utf-8") as fh:
     long_description = fh.read()

-dask_deps = ["dask>=2.22.0", "distributed>=2.22.0"]
+dask_deps = ["dask[dataframe]>=2.22.0", "distributed>=2.22.0"]
I think we can leave setup.py without changes to keep the dependencies set minimal for modin-dask and update/align env files with a single dask package. dask[dataframe]/dask-dataframe should be enough for all files. If not, then we should use dask[complete]/dask-complete
No, your suggestion will not help if the library is installed from source or from modin[dask]. This is also important for our CI.
Our env files (environment-dev.yml, requirements-dev.txt) don't need any updates, because they include all the required dependencies (the dask package for conda and dask[complete] for pip).
In this case, dask[dataframe] is the best option with the fewest dependencies needed to work with Dask DataFrames successfully.
Our env files (environment-dev.yml, requirements-dev.txt) don't need any updates, because they include all required dependencies (dask package for conda and dask[complete] for pip).
I think we should align these files in terms of the dask dependencies. The package names should match between the files. We can do that in this PR or in a separate one.
No, your suggestion will not help if the library is installed from source or from modin[dask]. This is also important for our CI.
Our env files (environment-dev.yml, requirements-dev.txt) don't need any updates, because they include all required dependencies (dask package for conda and dask[complete] for pip).
from_dask and to_dask do not look like integral methods of Modin. If these methods require an extra package, I think we can consider it an optional dependency. As for our CI, we can install the required package directly in CI.
f554aad to edc59f7 (Compare)
Please also take a look at CI, which is failing.
@@ -171,9 +171,6 @@ def to_dask(cls, modin_obj):
         dask.dataframe.DataFrame or dask.dataframe.Series
             Converted object with type depending on input.
         """
-        import_optional_dependency(
-            "dask-expr", "dask-expr is required to create a Dask DataFrame."
-        )
The default message from dask.dataframe is more user-friendly, so I suggest not using this method.
I also think that we should not check the version for dask-expr. Yes, it has problems in version 1.0.1, but that is a problem in Dask, not in our compatibility with it. In addition, this library releases new versions very quickly and, in my opinion, users are unlikely to encounter this problem in the future.
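For context, the helper being discussed follows the pandas-style optional-dependency pattern; a simplified, generic stand-in (not Modin's or pandas' actual implementation, and without the version check being debated here) looks like:

```python
import importlib


def import_optional_dependency(name, extra_message=""):
    """Import an optional package, raising a readable error if it is absent."""
    try:
        return importlib.import_module(name)
    except ImportError:
        raise ImportError(
            f"Missing optional dependency '{name}'. {extra_message}"
        ) from None
```

Dropping such a call, as the diff above does, simply lets the underlying import error from dask.dataframe surface instead.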
What do these changes do?
- flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
- black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
- git commit -s
- docs/development/architecture.rst is up-to-date