Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX-#6227: Make sure Series.unique() with pyarrow dtype returns ArrowExtensionArray #7042

Merged
merged 17 commits into from
Apr 15, 2024

Conversation

anmyachev
Copy link
Collaborator

@anmyachev anmyachev commented Mar 7, 2024

What do these changes do?

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves BUG: Series.unique with pyarrow dtype #6227
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

…e returns 'ArrowExtensionArray'

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
@@ -1936,7 +1936,10 @@ def str_split(self, pat=None, n=-1, expand=False, regex=None):
def unique(self, keep="first", ignore_index=True, subset=None):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dchigarev when combining functions drop_duplicates and unique into one, did you take into account that they return different data types? It seems to me that because of this, using qc.unique in its current form is inconvenient. From my side, the best solution seems to be to split this function into several. What do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when #7091 drop_duplicates and unique into one, did you take into account that they return different data types

Initially, I haven't seen a problem with it, since the result of both .unique() and .drop_duplicates() was wrapped into a pandas dataframe anyway inside of partition class. I didn't want to add another unique-like qc method in order not to bloat QCs api (which is already quite big and sometimes redundant).

An ideal solution I see is to continue wrapping the results with a pandas dataframe inside of partitions and then make the front-end handle the conversion to the desired output type (for example, .drop_duplicates() returns a DataFrame(result_qc) and .unique() returns result_qc.to_values()). Arrow extension arrays seem to round-trip to pandas with no problems:

>>> arr = pd.array([1, 1, None], dtype="int64[pyarrow]")
>>> type(arr)
<class 'pandas.core.arrays.arrow.array.ArrowExtensionArray'>
>>> arr is pd.DataFrame(arr).squeeze().values
True

subset is None
and ignore_index
and len(self.columns) == 1
and keep is not False
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dchigarev If we do not make this change, then the string is stored in the variable (for example, "first"). Did you do this on purpose?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope, it's a bug, thanks for the fix!

Comment on lines 1964 to 1965
res = self.to_pandas().squeeze(axis=1)
return res.unique()
Copy link
Collaborator Author

@anmyachev anmyachev Apr 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dchigarev I measured the new approach on small data (using your script from #7091). It seems to work faster than the previous one. Do you have the ability to measure on big sizes? (maybe you still have your env)
image

Env: windows 11, ray, 8 cores

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it will always work faster, since the old full-column approach is basically '.to_pandas() + splitting + putting splits to ray storage', and in your approach it's only .to_pandas().

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it will always work faster, since the old full-column approach is basically '.to_pandas() + splitting + putting splits to ray storage', and in your approach it's only .to_pandas().

Yes, but I compare with range-partitioning implementation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running the same scripts on linux gives different results (range-partitioning is still faster in a lot of cases):

perf results for linux

MODIN_CPUS=8
image

MODIN_CPUS=16
image

I've been able to reproduce your results though when tried to run the same on windows:

perf results for windows

image

I don't think that the main reasons for the difference is OS here. I think that it's a hardware difference, since for windows runs I've used my own laptop instead of a dedicated machine.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dchigarev thank you very much for running the benchmark!

I don't think that the main reasons for the difference is OS here. I think that it's a hardware difference, since for windows runs I've used my own laptop instead of a dedicated machine.

Interesting detail.

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
@anmyachev anmyachev changed the title FIX-#6227: Make sure 'Series.unique()' with pyarrow dtype returns ArrowExtensionArray FIX-#6227: Make sure Series.unique() with pyarrow dtype returns ArrowExtensionArray Apr 9, 2024
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
@anmyachev anmyachev requested review from mvashishtha, RehanSD and a team as code owners April 13, 2024 13:52
modin_result = modin_series.unique()
pandas_result = pandas_series.unique()

df_equals(modin_result, pandas_result)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we use eval_general?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rewritten

return (
self.__constructor__(query_compiler=self._query_compiler.unique())
.modin.to_pandas()
._values
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a public method that returns the same?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I searched but didn't find it (values and array are not suitable).

Comment on lines +2049 to +2050
# return self.to_pandas().squeeze(axis=1).unique() works faster
# but returns pandas type instead of query compiler
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to perform this in a separate issue? I guess so.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Series.unique with pyarrow dtype
3 participants