FIX-#6227: Make sure `Series.unique()` with pyarrow dtype returns `ArrowExtensionArray` #7042

anmyachev · 2024-03-07T20:04:47Z

What do these changes do?

- first commit message and PR title follow format outlined here
  NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
- passes `flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py`
- passes `black --check modin/ asv_bench/benchmarks scripts/doc_checker.py`
- signed commit with `git commit -s`
- Resolves BUG: Series.unique with pyarrow dtype #6227
- tests added and passing
- module layout described at `docs/development/architecture.rst` is up-to-date

anmyachev · 2024-04-02T20:19:05Z

modin/core/storage_formats/pandas/query_compiler.py

@@ -1936,7 +1936,10 @@ def str_split(self, pat=None, n=-1, expand=False, regex=None):
    def unique(self, keep="first", ignore_index=True, subset=None):


@dchigarev when combining the functions `drop_duplicates` and `unique` into one, did you take into account that they return different data types? Because of this, using `qc.unique` in its current form seems inconvenient to me. From my side, the best solution seems to be to split this function into several. What do you think?

dchigarev · Apr 5, 2024

> when combining functions drop_duplicates and unique into one (#7091), did you take into account that they return different data types

Initially, I didn't see a problem with it, since the result of both `.unique()` and `.drop_duplicates()` was wrapped into a pandas DataFrame anyway inside the partition class. I didn't want to add another unique-like QC method, so as not to bloat the QC API (which is already quite big and sometimes redundant).

An ideal solution I see is to continue wrapping the results with a pandas dataframe inside of partitions and then make the front-end handle the conversion to the desired output type (for example, .drop_duplicates() returns a DataFrame(result_qc) and .unique() returns result_qc.to_values()). Arrow extension arrays seem to round-trip to pandas with no problems:

    >>> arr = pd.array([1, 1, None], dtype="int64[pyarrow]")
    >>> type(arr)
    <class 'pandas.core.arrays.arrow.array.ArrowExtensionArray'>
    >>> arr is pd.DataFrame(arr).squeeze().values
    True

anmyachev · 2024-04-02T20:20:30Z

modin/core/storage_formats/pandas/query_compiler.py

+            subset is None
+            and ignore_index
+            and len(self.columns) == 1
+            and keep is not False


@dchigarev Without this change, the variable ends up holding a string (for example, "first") rather than a boolean. Did you do this on purpose?

dchigarev · Apr 5, 2024

Nope, it's a bug, thanks for the fix!
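A small pure-Python sketch of why the `keep is not False` comparison matters. It assumes the earlier condition ended in a bare `and keep`, which the thread implies but does not show; both function names are hypothetical and not part of Modin.

```python
def fast_path_flag_buggy(subset, ignore_index, n_columns, keep):
    # `and` returns its last evaluated operand, so a truthy `keep`
    # such as "first" leaks through instead of a boolean.
    return subset is None and ignore_index and n_columns == 1 and keep


def fast_path_flag_fixed(subset, ignore_index, n_columns, keep):
    # Comparing against False explicitly always yields a real bool.
    return (
        subset is None
        and ignore_index
        and n_columns == 1
        and keep is not False
    )


buggy = fast_path_flag_buggy(None, True, 1, "first")
fixed = fast_path_flag_fixed(None, True, 1, "first")
print(repr(buggy))  # 'first'
print(repr(fixed))  # True
```

The buggy flag is still truthy, so an `if` on it behaves the same; the difference only shows up when the value is stored, compared, or passed on as if it were a boolean.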

anmyachev · 2024-04-02T20:25:51Z

modin/core/storage_formats/pandas/query_compiler.py

+            res = self.to_pandas().squeeze(axis=1)
+            return res.unique()


@dchigarev I measured the new approach on small data (using your script from #7091). It seems to work faster than the previous one. Are you able to measure on larger data? (Maybe you still have your environment.)

Env: Windows 11, Ray, 8 cores

dchigarev · Apr 5, 2024

I think it will always work faster, since the old full-column approach is basically '.to_pandas() + splitting + putting splits to ray storage', and in your approach it's only .to_pandas().

anmyachev · Apr 5, 2024

> I think it will always work faster, since the old full-column approach is basically '.to_pandas() + splitting + putting splits to ray storage', and in your approach it's only .to_pandas().

Yes, but I am comparing against the range-partitioning implementation.

dchigarev · Apr 5, 2024

Running the same scripts on Linux gives different results (range-partitioning is still faster in a lot of cases):

perf results for linux

MODIN_CPUS=8

MODIN_CPUS=16

I was able to reproduce your results, though, when I ran the same scripts on Windows:

perf results for windows

I don't think the OS is the main reason for the difference here. I think it's a hardware difference, since for the Windows runs I used my own laptop instead of a dedicated machine.

anmyachev · Apr 5, 2024

@dchigarev thank you very much for running the benchmark!

> I don't think the OS is the main reason for the difference here. I think it's a hardware difference, since for the Windows runs I used my own laptop instead of a dedicated machine.

Interesting detail.
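For readers unfamiliar with the `to_pandas().squeeze(axis=1).unique()` path measured in this thread, here is a minimal pandas-only sketch of what it does once the data is materialized. The frame and column name are invented for illustration; no Modin code is involved and nothing here says anything about relative performance.

```python
import pandas as pd

# Stand-in for the single-column frame that to_pandas() would return.
frame = pd.DataFrame({"a": [3, 1, 3, 2, 1]})

# squeeze(axis=1) turns the one-column frame into a Series. Passing the
# axis explicitly keeps a one-row frame from collapsing to a scalar.
column = frame.squeeze(axis=1)

# unique() preserves order of first appearance.
uniques = column.unique()
print(list(uniques))  # [3, 1, 2]

one_row = pd.DataFrame({"a": [7]}).squeeze(axis=1)
print(type(one_row).__name__)  # Series
```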

YarShev · 2024-04-15T10:42:20Z

modin/tests/pandas/test_series.py

+    modin_result = modin_series.unique()
+    pandas_result = pandas_series.unique()
+
+    df_equals(modin_result, pandas_result)


Why don't we use eval_general?

YarShev · 2024-04-15T10:43:16Z

modin/pandas/series.py

+        return (
+            self.__constructor__(query_compiler=self._query_compiler.unique())
+            .modin.to_pandas()
+            ._values


Is there a public method that returns the same?

anmyachev · Apr 15, 2024

I searched but didn't find one (`values` and `array` are not suitable).
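One possible reading of why `values` and `array` are not suitable, sketched with plain pandas. The thread does not spell out the reason, so treat this as an assumption: the goal is to match the container type that `Series.unique()` returns, and `_values` is a private attribute whose behavior may vary between pandas versions.

```python
import numpy as np
import pandas as pd

int_series = pd.Series([1, 1, 2])
tz_series = pd.Series(pd.date_range("2024-01-01", periods=2, tz="UTC"))

# NumPy-backed data: unique() gives an ndarray, but `.array` wraps the
# data in an extension-array container instead.
print(type(int_series.unique()).__name__, type(int_series.array).__name__)

# Timezone-aware data: unique() gives a DatetimeArray, but `.values`
# converts to a plain ndarray and drops the timezone.
print(type(tz_series.unique()).__name__, type(tz_series.values).__name__)

# `_values` matches the container type of unique() in both cases.
print(type(int_series._values).__name__, type(tz_series._values).__name__)
```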

YarShev · 2024-04-15T10:45:00Z

modin/core/storage_formats/pandas/query_compiler.py

+            # return self.to_pandas().squeeze(axis=1).unique() works faster
+            # but returns pandas type instead of query compiler


Do we want to perform this in a separate issue? I guess so.

anmyachev added 2 commits March 7, 2024 21:04

- 58c5874 FIX-modin-project#6227: Make sure 'Series.unique()' with pyarrow dtyp… …e returns 'ArrowExtensionArray' (Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>)
- 3f208df fixes (Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>)

anmyachev force-pushed the issue6227 branch from d62467a to 3f208df Compare March 8, 2024 16:10

anmyachev added 5 commits April 2, 2024 16:12

- 56335e5 Merge branch 'master' of https://github.com/modin-project/modin into … …issue6227
- bb425d7 fix (Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>)
- 103d771 fixes after merge (Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>)
- c11a43a use 'to_pandas().unique' (Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>)
- be324c5 fixs (Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>)

anmyachev commented Apr 2, 2024

- a001bd2 Merge branch 'master' of https://github.com/modin-project/modin into … …issue6227

anmyachev force-pushed the issue6227 branch from 9d0ef54 to bc58314 Compare April 8, 2024 11:52

- 6fe11c5 fix (Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>)

anmyachev force-pushed the issue6227 branch from bc58314 to 6fe11c5 Compare April 8, 2024 14:44

anmyachev added 3 commits April 9, 2024 13:19

- 2521b41 Merge branch 'master' of https://github.com/modin-project/modin into … …issue6227
- 80e54fd fix (Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>)
- 7120a12 cleanup (Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>)

anmyachev changed the title ~~FIX-#6227: Make sure 'Series.unique()' with pyarrow dtype returns ArrowExtensionArray~~ FIX-#6227: Make sure Series.unique() with pyarrow dtype returns ArrowExtensionArray Apr 9, 2024

anmyachev added 3 commits April 12, 2024 21:56

- a4d4c0f Merge branch 'master' of https://github.com/modin-project/modin into … …issue6227
- 658d391 fix hdk (Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>)
- 34a34ca fix docstring (Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>)

anmyachev marked this pull request as ready for review April 13, 2024 13:52

anmyachev requested review from aregm, gshimansky, ienkovich, Garra1980, YarShev, vnlitvinov, AndreyPavlenko, a team and devin-petersohn as code owners April 13, 2024 13:52

anmyachev requested review from mvashishtha, RehanSD and a team as code owners April 13, 2024 13:52

YarShev reviewed Apr 15, 2024

anmyachev mentioned this pull request Apr 15, 2024

Possibility of unique function optimization #7182

Open

anmyachev added 2 commits April 15, 2024 13:57

- 9638e81 address review comments (Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>)
- de2e13c Merge branch 'master' of https://github.com/modin-project/modin into … …issue6227

YarShev approved these changes Apr 15, 2024

YarShev merged commit c31feba into modin-project:master Apr 15, 2024
35 checks passed

anmyachev deleted the issue6227 branch April 15, 2024 12:31

korbit-ai bot mentioned this pull request Aug 13, 2024: furwellness/modin#20 (Closed)

local-dev-korbit-ai-mentor bot mentioned this pull request Aug 15, 2024: furwellness/modin#35 (Closed)

korbit-ai bot mentioned this pull request Aug 15, 2024: furwellness/modin#50 (Open)

local-dev-korbit-ai-mentor bot mentioned this pull request Aug 15, 2024: furwellness/modin#70 (Closed)
