FEAT-#7021: Implement to/from_dask functions #7022

Merged: 10 commits merged into modin-project:master, Mar 18, 2024

Conversation

Retribution98 (Collaborator)

What do these changes do?

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Implement to/from_dask_dataframe functions #7021
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

Signed-off-by: Kirill Suvorov <kirill.suvorov@intel.com>
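For orientation, here is a minimal usage sketch of the converters this PR adds. to_dask is visible in a diff below; the from_dask name and the modin.pandas.io import path are assumptions, not confirmed by this page.

import modin.pandas as pd
from modin.pandas.io import from_dask, to_dask  # from_dask path assumed

# Round-trip a small frame between Modin and Dask (requires Modin's Dask engine).
modin_df = pd.DataFrame({"a": [1, 2, 3]})
dask_df = to_dask(modin_df)      # Modin -> dask.dataframe.DataFrame
round_trip = from_dask(dask_df)  # Dask -> Modin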
modin/pandas/test/test_io.py: code scanning alert fixed
modin/pandas/accessor.py: code scanning alert fixed
@@ -140,6 +140,30 @@ def from_ray_dataset(cls, ray_obj):
"Modin Dataframe can only be converted to a Ray Dataset if Modin uses a Ray engine."
)

@classmethod
def from_dask_dataframe(cls, dask_obj):
anmyachev (Collaborator), Mar 8, 2024:

Let's add a test that catches these exceptions (for example, when Ray is being used).

Retribution98 (Collaborator, Author):

No, in a previous PR I implemented the same test for Ray, but I was asked to remove it (you can see that discussion here).
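For reference, the kind of test being discussed might look like the sketch below; the from_dask entry point, the engine check, and the exception type are assumptions, not code from this PR.

import pytest

from modin.config import Engine


def test_from_dask_raises_on_non_dask_engine():
    # Hypothetical: the converter should raise when Modin runs on
    # an engine other than Dask (e.g. Ray).
    if Engine.get() == "Dask":
        pytest.skip("only meaningful on non-Dask engines")
    from modin.pandas.io import from_dask  # assumed entry point

    with pytest.raises(RuntimeError):  # exception type assumed
        from_dask(None)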

Comment on lines 16 to 17
import dask.dataframe as dd
from distributed.client import default_client
Collaborator:

Let's use DaskWrapper to interact with the Dask API.

Retribution98 (Collaborator, Author):

Are you sure? As I see it, we use the public API without wrapping it in a wrapper, for both Dask and Ray.
Also, DaskWrapper's docstring says "The class responsible for execution of remote operations.", but creating a Dask DataFrame does not fall under that.

Considering your wishes, I can suggest changing import dask.dataframe as dd to from dask.dataframe import from_delayed. What do you think?

Collaborator:

> Are you sure? As I see it, we use the public API without wrapping it in a wrapper, for both Dask and Ray. Also, DaskWrapper's docstring says "The class responsible for execution of remote operations.", but creating a Dask DataFrame does not fall under that.

Yes, I'm sure. It's okay to change the description, because that's what these classes were designed for. All places where the API is used directly need to be updated to work through a wrapper.

Retribution98 (Collaborator, Author):

OK, I see the reasons for this, so I just want to clarify. In this case, we just create a method in DaskWrapper that proxies the call to dask.dataframe, like this:

import dask.dataframe as dd

@classmethod
def from_delayed(cls, partitions):
    # Thin proxy so callers go through DaskWrapper instead of dask directly.
    return dd.from_delayed(partitions)

Is that correct?

Do I need to create an issue to refactor the other direct public API calls?

Collaborator:

> Do I need to create an issue to refactor the other direct public API calls?

It would be great.

> In this case, we just create a method in DaskWrapper that proxies the call to dask.dataframe.

We discussed it offline; this can also be done separately.
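With such a proxy in place, a call site could look like the sketch below. The DaskWrapper import path is an assumption, and DaskWrapper.from_delayed is the method proposed above, not existing API.

import pandas
from dask import delayed

from modin.core.execution.dask.common import DaskWrapper  # import path assumed

# Build a Dask DataFrame from delayed pandas partitions without the caller
# ever importing dask.dataframe directly.
parts = [delayed(pandas.DataFrame)({"a": [i]}) for i in range(3)]
dask_df = DaskWrapper.from_delayed(parts)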

modin/pandas/accessor.py (resolved review thread, outdated)
@@ -31,7 +31,7 @@
from modin import pandas as pd
from modin.error_message import ErrorMessage
from modin.logging import ClassLogger
from modin.pandas.io import to_ray_dataset
from modin.pandas.io import to_dask, to_ray_dataset

Code scanning / CodeQL notice: Cyclic import. Import of module modin.pandas.io begins an import cycle.
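A common way to break such a cycle, shown here only as a general pattern and not necessarily what this PR did, is to defer the import into the method body:

class ModinAPI:  # simplified stand-in for the accessor class
    def __init__(self, data):
        self._data = data

    def to_dask(self):
        # Importing at call time means modin.pandas.io is loaded only
        # when the method runs, so no module-level cycle is created.
        from modin.pandas.io import to_dask

        return to_dask(self._data)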
YarShev changed the title from "FEAT-#7021: Implement to/from_dask_dataframe functions" to "FEAT-#7021: Implement to/from_dask functions" on Mar 13, 2024
modin/pandas/test/test_io.py (resolved review threads, outdated)
setup.py (outdated)
@@ -5,7 +5,7 @@
with open("README.md", "r", encoding="utf-8") as fh:
long_description = fh.read()

dask_deps = ["dask>=2.22.0", "distributed>=2.22.0"]
dask_deps = ["dask[dataframe]>=2.22.0", "distributed>=2.22.0"]
Collaborator:

I think we can leave setup.py unchanged to keep the dependency set minimal for modin-dask, and instead update/align the env files on a single dask package. dask[dataframe] / dask-dataframe should be enough for all files; if not, then we should use dask[complete] / dask-complete.

Retribution98 (Collaborator, Author):

No, your suggestion will not help if the library is installed from source or from modin[dask]. This is also important for our CI.
Our env files (environment-dev.yml, requirements-dev.txt) don't need any updates, because they already include all required dependencies (the dask package for conda and dask[complete] for pip).

In this case, dask[dataframe] is the best option with the fewest dependencies needed to work with Dask DataFrames.

Collaborator:

> Our env files (environment-dev.yml, requirements-dev.txt) don't need any updates, because they already include all required dependencies (the dask package for conda and dask[complete] for pip).

I think we should align these files in terms of the dask dependencies; the package names should match between the files. We can do that in this PR or in a separate one.

> No, your suggestion will not help if the library is installed from source or from modin[dask]. This is also important for our CI.

> Our env files (environment-dev.yml, requirements-dev.txt) don't need any updates, because they already include all required dependencies (the dask package for conda and dask[complete] for pip).

from_dask and to_dask do not look like integral methods of Modin. If these methods require an extra package, I think we can treat it as an optional dependency. As for our CI, we can install the required package directly in CI.

modin/pandas/test/test_io.py (resolved review thread, outdated)
docs/ecosystem.rst (resolved review threads, outdated)
YarShev (Collaborator) left a review:

Please also take a look at CI, which is failing.

modin/pandas/io.py (resolved review thread, outdated)
modin/pandas/test/test_io.py (resolved review thread, outdated)
@@ -171,9 +171,6 @@ def to_dask(cls, modin_obj):
dask.dataframe.DataFrame or dask.dataframe.Series
Converted object with type depending on input.
"""
import_optional_dependency(
"dask-expr", "dask-expr is required to create a Dask DataFrame."
)
Retribution98 (Collaborator, Author):

The default message from dask.dataframe is more user-friendly, so I suggest not using this method.
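In other words, the suggestion is to let dask raise its own ImportError instead of pre-checking. A rough sketch of that approach follows; the partition-wise details of the real implementation are omitted.

def to_dask(modin_obj):
    # If the dask.dataframe extras are missing, this import raises dask's
    # own error, which already tells the user what to install.
    import dask.dataframe as dd

    from modin.utils import to_pandas

    # Simplified: materialize to pandas, then hand off to Dask.
    # (The PR instead builds the Dask DataFrame from delayed partitions.)
    return dd.from_pandas(to_pandas(modin_obj), npartitions=1)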

Retribution98 (Collaborator, Author):

[screenshot of the resulting error message omitted]

Retribution98 (Collaborator, Author):

I also think that we should not check the version of dask-expr. Yes, version 1.0.1 has problems, but that is a problem in Dask, not in our compatibility with it. In addition, this library releases new versions very quickly and, in my opinion, users are unlikely to encounter this problem in the future.

YarShev merged commit 1de52f2 into modin-project:master on Mar 18, 2024.
53 checks passed
Successfully merging this pull request may close: Implement to/from_dask_dataframe functions #7021

3 participants