Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEAT-#0000: Implement DataFrame API standard #6928

Closed
wants to merge 8 commits into from

Conversation

anmyachev
Copy link
Collaborator

@anmyachev anmyachev commented Feb 12, 2024

What do these changes do?

This is an example of how DataFrame API protocol implementation might look like if it will be added directly to Modin, instead of https://github.com/data-apis/dataframe-api-compat.

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves #?
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

modin/dataframe_api_standard/column_object.py Dismissed Show dismissed Hide dismissed
modin/dataframe_api_standard/column_object.py Dismissed Show dismissed Hide dismissed
kwargs: dict[str, Any],
) -> None:
df = integer_dataframe_1(library)
ns = df.__dataframe_namespace__()

Check warning

Code scanning / CodeQL

Variable defined multiple times Warning test

This assignment to 'ns' is unnecessary as it is
redefined
before this value is used.
pytest.skip("TODO: enable for modin")
df1 = integer_dataframe_1(library)
df2 = integer_dataframe_2(library)
ns = df1.__dataframe_namespace__()

Check warning

Code scanning / CodeQL

Variable defined multiple times Warning test

This assignment to 'ns' is unnecessary as it is
redefined
before this value is used.

def test_column_to_array_object(library: BaseHandler) -> None:
col = integer_dataframe_1(library).col("a")
result = np.asarray(col.persist().to_array())

Check warning

Code scanning / CodeQL

Variable defined multiple times Warning test

This assignment to 'result' is unnecessary as it is
redefined
before this value is used.
df1 = df.filter(df.col("a") > 1)
df2 = df.filter(df.col("a") <= 1)

df1 = df1.persist()

Check notice

Code scanning / CodeQL

Unused local variable Note test

Variable df1 is not used.
import numpy as np
from pandas.api.types import is_extension_array_dtype

import modin.dataframe_api_standard

Check notice

Code scanning / CodeQL

Module is imported with 'import' and 'import from' Note

Module 'modin.dataframe_api_standard' is imported with both 'import' and 'import from'.
return df.shape # type: ignore[no-any-return]

def group_by(self, *keys: str) -> GroupBy:
from modin.dataframe_api_standard.group_by_object import GroupBy

Check notice

Code scanning / CodeQL

Cyclic import Note

Import of module
modin.dataframe_api_standard.group_by_object
begins an import cycle.

import modin.pandas as pd
from modin.dataframe_api_standard import Namespace
from modin.dataframe_api_standard.dataframe_object import DataFrame

Check notice

Code scanning / CodeQL

Cyclic import Note

Import of module
modin.dataframe_api_standard.dataframe_object
begins an import cycle.
failed_columns = self._df.columns.difference(result.columns)
if len(failed_columns) > 0: # pragma: no cover
msg = "Groupby operation could not be performed on columns "
+f"{failed_columns}. Please drop them before calling group_by."

Check notice

Code scanning / CodeQL

Statement has no effect Note

This statement has no effect.
@anmyachev anmyachev marked this pull request as ready for review February 12, 2024 22:54
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
@dchigarev
Copy link
Collaborator

Why we're merging this to our repo instead of https://github.com/data-apis/dataframe-api-compat? Is this already determined that the dataframe-api repo will be abandoned and won't accept contributions?

@dchigarev
Copy link
Collaborator

In particular, I'm asking about this PR (data-apis/dataframe-api-compat#71), I've seen some fresh comments there from you. Are you expecting this to be merged? If so, do we need the PR in modin?

@anmyachev
Copy link
Collaborator Author

@dchigarev Why we're merging this to our repo instead of https://github.com/data-apis/dataframe-api-compat?

I consider this change as temporary in order to be in time for the 0.27 release.

Is this already determined that the dataframe-api repo will be abandoned and won't accept contributions?

I don't think so.

In particular, I'm asking about this PR (data-apis/dataframe-api-compat#71), I've seen some fresh comments there from you. Are you expecting this to be merged? If so, do we need the PR in modin?

I’m still sure that it can be merged, but it definitely won’t make it in time for tomorrow’s release.

@dchigarev
Copy link
Collaborator

dchigarev commented Feb 13, 2024

I consider this change as temporary in order to be in time for the 0.27 release.

Why do we want this to be in 0.27 release? cc @YarShev
It looks like nothing pushes us to have a working df-api implementation? If someone is interested to try df-api on Modin, we could provide them @anmyachev's branch with the implementation

@anmyachev
Copy link
Collaborator Author

I consider this change as temporary in order to be in time for the 0.27 release.

why do we want this to be in 0.27 release? cc @YarShev

There is an opportunity to provide these changes earlier.

@anmyachev
Copy link
Collaborator Author

anmyachev commented Feb 13, 2024

It is also worth noting that:

  1. The protocol is still in beta and may change significantly. However, almost all the code can be reused from Pandas, given that the public API is the same.
  2. Even if we place the code in dataframe-api-compat repository, this does not mean that it will be fully supported by other developers. Support is still expected from the Modin developers.

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
@dchigarev
Copy link
Collaborator

dchigarev commented Feb 13, 2024

Even if we place the code in dataframe-api-compat repository, this does not mean that it will be fully supported by other developers.

Sure, I also thought the same. I'm just worried about how we're going to sync modin's repos with the PR to dataframes-api (data-apis/dataframe-api-compat#71). The diffs are massive, so applying review suggestions for the dataframes-api PR and possibly fixing bugs for the implementation in the Modin at the same time, could result into a severe unsynchrony between these two implementations.

Don't we want to instead spend our efforts on the dataframes-api PR data-apis/dataframe-api-compat#71 (making a review by our team/working on suggestions together) and the future support of modin's implementation there?

@anmyachev
Copy link
Collaborator Author

Closed in favour of #7196

@anmyachev anmyachev closed this Apr 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants