FIX-#1503: Proper implementation of Series.values #5469

Merged: 5 commits, Dec 19, 2022

Collaborator

@anmyachev anmyachev commented Dec 17, 2022

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

What do these changes do?

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves astype({"col": "category"}) should convert Series.values type to pandas.core.arrays.categorical.Categorical #1503
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
@anmyachev anmyachev marked this pull request as ready for review December 18, 2022 00:59
@anmyachev anmyachev requested a review from a team as a code owner December 18, 2022 00:59
Comment on lines 488 to 489
if not is_numeric_dtype(self.dtype):
    return self._default_to_pandas("values")
Collaborator

What are these non-numeric dtypes whose behavior deviates from a plain .to_numpy()? If it's only categoricals, can we just apply the same approach as for .ravel()?

modin/modin/pandas/series.py

Lines 1502 to 1505 in 3e314d8

if isinstance(self.dtype, pandas.CategoricalDtype):
    data = pandas.Categorical(data, dtype=self.dtype)
return data
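
For readers unfamiliar with the approach quoted above, here is a minimal sketch in plain pandas (not Modin) of what the re-wrapping does. The sample data is made up for illustration; only `to_numpy()` and the `pandas.Categorical(data, dtype=...)` call come from the snippet.

```python
import numpy as np
import pandas as pd

# Illustrative data, not taken from the PR.
series = pd.Series(["a", "b", "a"], dtype="category")

# to_numpy() drops the categorical wrapper and yields a plain ndarray.
raw = series.to_numpy()

# Re-wrapping with the original dtype restores a Categorical,
# which is what pandas returns from Series.values for this dtype.
restored = pd.Categorical(raw, dtype=series.dtype)

print(type(raw).__name__, type(restored).__name__)
```

This only covers the categorical case; other extension dtypes would need their own handling.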

Collaborator Author

Code snippet from pandas:

        Overview:

        dtype       | values        | _values       | array         |
        ----------- | ------------- | ------------- | ------------- |
        Numeric     | ndarray       | ndarray       | PandasArray   |
        Category    | Categorical   | Categorical   | Categorical   |
        dt64[ns]    | ndarray[M8ns] | DatetimeArray | DatetimeArray |
        dt64[ns tz] | ndarray[M8ns] | DatetimeArray | DatetimeArray |
        td64[ns]    | ndarray[m8ns] | TimedeltaArray| ndarray[m8ns] |
        Period      | ndarray[obj]  | PeriodArray   | PeriodArray   |
        Nullable    | EA            | EA            | EA            |

There seems to be a case with the Nullable type where EA (extended array?) would be returned.

Collaborator

EA = ExtensionArray
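
A few rows of the table above can be checked directly in plain pandas. This is a small sketch with made-up sample data, not part of the PR; it simply inspects the type of `Series.values` for each dtype.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray

# Illustrative samples, one per dtype family from the table.
samples = {
    "numeric": pd.Series([1, 2, 3]),
    "category": pd.Series(["a", "b", "a"], dtype="category"),
    "tz-aware datetime": pd.Series(
        pd.date_range("2022-12-17", periods=2, tz="UTC")
    ),
    "nullable Int64": pd.Series([1, None, 3], dtype="Int64"),
}

# Record the concrete type that .values returns for each sample.
kinds = {name: type(s.values) for name, s in samples.items()}

for name, kind in kinds.items():
    print(f"{name}: {kind.__name__}, is EA: {issubclass(kind, ExtensionArray)}")
```

Note that the tz-aware datetime series yields an ndarray from `.values`, while the nullable integer series yields an ExtensionArray, matching the `values` column of the table.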

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
anmyachev and others added 2 commits December 19, 2022 13:09
Co-authored-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Collaborator

@vnlitvinov vnlitvinov left a comment

LGTM!

@YarShev YarShev changed the title FIX-#1503: proper implementation of Series.values FIX-#1503: Proper implementation of Series.values Dec 19, 2022
@YarShev
Collaborator

YarShev commented Dec 19, 2022

@anmyachev, please check #4187 if this PR resolves it or maybe partially.

@anmyachev
Collaborator Author

@anmyachev, please check #4187 if this PR resolves it or maybe partially.

@YarShev yes, the PR will also fix #4187.

@YarShev YarShev merged commit b9a2cab into modin-project:master Dec 19, 2022
@anmyachev anmyachev deleted the issue1503 branch December 19, 2022 16:11
data = self.to_numpy()
if isinstance(self.dtype, pd.CategoricalDtype):
    data = pd.Categorical(data, dtype=self.dtype)
return data
Collaborator

This is still going to mess up a bunch of EA cases. The correct way to handle this is described in #4187 (comment).

Collaborator Author

You are right, I will continue to work on this separately. At the moment, Modin doesn't have an internal interface to implement your suggestion; I've tried that before.

Collaborator Author

@jbrockmendel Is there some simple condition by which we can determine that this is an EA case?

Collaborator

isinstance(self.dtype, ExtensionDtype)
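
As a rough illustration of the check suggested above, here is a sketch in plain pandas. The helper name `is_extension_case` is hypothetical and is not Modin's API or the code from the follow-up PR; it only wraps the `isinstance` test against pandas' public `ExtensionDtype` base class.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionDtype


def is_extension_case(dtype) -> bool:
    """Hypothetical helper: True when dtype is a pandas extension dtype."""
    return isinstance(dtype, ExtensionDtype)


# Illustrative dtypes covering numeric and several extension families.
checks = {
    "int64": is_extension_case(np.dtype("int64")),
    "category": is_extension_case(pd.CategoricalDtype()),
    "Int64": is_extension_case(pd.Int64Dtype()),
    "datetime64[ns, UTC]": is_extension_case(pd.DatetimeTZDtype(tz="UTC")),
}

for name, flag in checks.items():
    print(f"{name}: {flag}")
```

One caveat under this sketch: the check also flags tz-aware datetime dtypes, for which the pandas table quoted earlier lists an ndarray under `values`, so a real implementation may need to treat those separately.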

Collaborator Author

Thanks @jbrockmendel! Should be fixed in #5493.

vnlitvinov pushed a commit that referenced this pull request Jan 24, 2023
Co-authored-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>