
PERF-#4743: avoid partition.length() in the parquet dispatcher #4960

Closed
wants to merge 1 commit into master from anmyachev:issue4743

Conversation

@anmyachev (Collaborator) commented Sep 13, 2022

Signed-off-by: Myachev <anatoly.myachev@intel.com>

What do these changes do?

  • commit message follows format outlined here
  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves PERF: Consider whether to avoid partition.length() in the parquet dispatcher. #4743
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date
  • added (Issue Number: PR title (PR Number)) and github username to release notes for next major release

codecov bot commented Sep 13, 2022

Codecov Report

Merging #4960 (7871c7b) into master (b5f7ed3) will decrease coverage by 0.34%.
The diff coverage is 73.85%.

❗ The current head 7871c7b differs from the pull request's most recent head 122f4d9. Consider uploading reports for commit 122f4d9 to get more accurate results.

@@            Coverage Diff             @@
##           master    #4960      +/-   ##
==========================================
- Coverage   84.91%   84.56%   -0.35%     
==========================================
  Files         266      256      -10     
  Lines       19763    19345     -418     
==========================================
- Hits        16781    16359     -422     
- Misses       2982     2986       +4     
Impacted Files Coverage Δ
modin/_compat/core/py36/pandas_common.py 0.00% <0.00%> (ø)
modin/_compat/pandas_api/latest/window.py 100.00% <ø> (ø)
modin/_compat/pandas_api/py36/__init__.py 0.00% <0.00%> (ø)
modin/_compat/pandas_api/py36/base.py 0.00% <0.00%> (ø)
modin/_compat/pandas_api/py36/dataframe.py 0.00% <0.00%> (ø)
modin/_compat/pandas_api/py36/io.py 0.00% <0.00%> (ø)
modin/_compat/pandas_api/py36/resample.py 0.00% <0.00%> (ø)
modin/_compat/pandas_api/py36/series.py 0.00% <0.00%> (ø)
modin/_compat/pandas_api/py36/window.py 0.00% <ø> (ø)
...tations/pandas_on_python/partitioning/partition.py 90.00% <0.00%> (ø)
... and 64 more


@@ -557,15 +557,10 @@ def build_query_compiler(cls, dataset, columns, index_columns, **kwargs):
         )
         index, sync_index = cls.build_index(dataset, partition_ids, index_columns)
         remote_parts = cls.build_partition(partition_ids, column_widths)
-        if len(partition_ids) > 0:
-            row_lengths = [part.length() for part in remote_parts.T[0]]
@anmyachev (Collaborator, Author) commented:

@pyrito could you tell me why this needs to be done here, given that the length is computed automatically when needed?

@pyrito (Collaborator) commented:

It can be calculated later, but there might be some slight overhead in doing so. The proposal from @YarShev was to have build_index return the lengths, since we already have to materialize the index. I think we should do something like that instead.
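
A minimal sketch of that proposal (the helper name and signature are illustrative, not Modin's actual API): since the per-partition indices have to be materialized anyway, the row lengths fall out as a free by-product:

    import pandas

    def build_index_and_lengths(index_objs):
        """Combine per-partition indices, collecting lengths as a by-product."""
        row_lengths = [len(obj) for obj in index_objs]
        complete_index = index_objs[0].append(index_objs[1:])
        return complete_index, row_lengths

    # Example: two partitions of 3 and 2 rows.
    idx, lengths = build_index_and_lengths(
        [pandas.RangeIndex(3), pandas.RangeIndex(3, 5)]
    )
    assert lengths == [3, 2]
    assert len(idx) == 5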

@anmyachev (Collaborator, Author) commented:

@pyrito we can change the metadata getters to avoid the overhead in this and many other cases. What do you think?

diff --git a/modin/core/dataframe/pandas/dataframe/dataframe.py b/modin/core/dataframe/pandas/dataframe/dataframe.py
index 4bd3b538..36801541 100644
--- a/modin/core/dataframe/pandas/dataframe/dataframe.py
+++ b/modin/core/dataframe/pandas/dataframe/dataframe.py
@@ -247,12 +247,15 @@ class PandasDataframe(ClassLogger):
         """
         if self._row_lengths_cache is None:
             if len(self._partitions) > 0:
-                (
-                    index,
-                    self._row_lengths_cache,
-                ) = self._compute_axis_labels_and_lengths(0)
-                if self._index_cache is None:
-                    self._index_cache = index
+                row_parts = self._partitions.T[0]
+                if self._index_cache is not None:
+                    # do not do extra work to get an index that is already known
+                    self._row_lengths_cache = [part.length() for part in row_parts]
+                else:
+                    (
+                        self._index_cache,
+                        self._row_lengths_cache,
+                    ) = self._compute_axis_labels_and_lengths(0)
             else:
                 self._row_lengths_cache = []
         return self._row_lengths_cache
@@ -269,12 +272,15 @@ class PandasDataframe(ClassLogger):
         """
         if self._column_widths_cache is None:
             if len(self._partitions) > 0:
-                (
-                    columns,
-                    self._column_widths_cache,
-                ) = self._compute_axis_labels_and_lengths(1)
-                if self._columns_cache is None:
-                    self._columns_cache = columns
+                col_parts = self._partitions[0]
+                if self._columns_cache is not None:
+                    # do not do extra work to get columns that are already known
+                    self._column_widths_cache = [part.width() for part in col_parts]
+                else:
+                    (
+                        self._columns_cache,
+                        self._column_widths_cache,
+                    ) = self._compute_axis_labels_and_lengths(1)
             else:
                 self._column_widths_cache = []
         return self._column_widths_cache

@pyrito (Collaborator) commented:

@anmyachev I'm not sure I fully understand. Is the proposal to do all the calculations lazily? In other words, not to calculate the row or column lengths here?

@anmyachev (Collaborator, Author) commented:

No, my suggestion is that the remote functions should not return the indices (from which the lengths are computed) but rather the lengths themselves (just numbers). This greatly reduces the time spent serializing/deserializing the index in some cases (for example, with a MultiIndex). I don't have specific numbers, but the performance gain here seems obvious to me.
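
To illustrate the idea (a hedged sketch; this remote-task body is hypothetical, not Modin's actual code): a worker that returns only the row count avoids shipping a potentially large Index object back to the driver:

    import pandas

    # Hypothetical remote-task body: return the row count instead of the index.
    def read_chunk(path, columns):
        df = pandas.read_parquet(path, columns=columns)
        # A plain int serializes in a few bytes, while a MultiIndex with
        # millions of tuples can dominate the result transfer time.
        return df, len(df)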

@pyrito (Collaborator) commented:

@anmyachev Oh, I see. To be honest, I'm not completely sure. My suggestion would be to try it yourself and see whether you get good performance and correctness from not returning the index and building it lazily. I'm not sure if that would run into any issues.

@anmyachev (Collaborator, Author) commented:

@pyrito my suggestion for now is to calculate the lengths lazily (yes, we could lazily evaluate the index as well, but I would like to think about that separately). Once #4964 has been merged, there should no longer be any additional overhead.

@anmyachev (Collaborator, Author) commented:

@pyrito #4964 is merged. Could you review again?

@mvashishtha (Collaborator) commented Oct 3, 2022:
@anmyachev I'm confused about how this PR is helping. Here's my understanding. Right now we have two main cases for the parquet index:

  1. The index consists only of one or more RangeIndexes: we block on getting the index from metadata rather than reading the whole file:

         if range_index or (len(partition_ids) == 0 and len(column_names_to_read) != 0):
             complete_index = dataset.to_pandas_dataframe(
                 columns=column_names_to_read
             ).index

     I expect that to be cheap anyway.

  2. The index has at least one component that's not a RangeIndex: we block on materializing the entire index from the partitions:

         else:
             index_ids = [part_id[0][1] for part_id in partition_ids if len(part_id) > 0]
             index_objs = cls.materialize(index_ids)
             complete_index = index_objs[0].append(index_objs[1:])

Since even after this PR we will continue blocking on getting the whole index in case (2), I don't see how this PR is helping.

anmyachev marked this pull request as ready for review September 13, 2022 16:01
anmyachev requested a review from a team as a code owner September 13, 2022 16:01
@pyrito (Collaborator) left a comment:

@anmyachev could you rebase the PR so we can re-run CI and make sure there are no issues here?

@anmyachev (Collaborator, Author) commented:
@anmyachev could you rebase the PR so we can re-run CI and make sure there are no issues here?

done

@pyrito (Collaborator) left a comment:

I'm curious about the performance implications here. Is there a way you could check quickly, @anmyachev?

@anmyachev (Collaborator, Author) commented:

I'm curious about the performance implications here. Is there a way you could check quickly, @anmyachev?

This is part of the changes required for asynchronous execution. The performance difference should not be visible yet.

Without knowing that this is needed for asynchronous execution, these changes can be considered refactoring, since there is no immediate impact on performance. I can change the commit category (to REFACTOR) if that's better.

@vnlitvinov (Collaborator) commented:

We should label things PERF only if they're the PR that actually helps performance. If they're building the groundwork, they could be FEAT or REFACTOR, but not PERF.

Comment on lines -562 to -565:

    if len(partition_ids) > 0:
        row_lengths = [part.length() for part in remote_parts.T[0]]
    else:
        row_lengths = None
A collaborator commented:

I don't see how there could be any performance hit, as we initialize the partitions with their respective sizes here, so .length() should just return immediately:

    return np.array(
        [
            [
                cls.frame_partition_cls(
                    part_id[0],
                    length=part_id[2],
                    width=col_width,
                )
                for part_id, col_width in zip(part_ids, column_widths)
            ]
            for part_ids in partition_ids
        ]
    )

A collaborator commented:

I was thinking the same thing too. There was some discussion about this before, but we tabled it for another time. length should just be returning the object reference here, right? I don't think anything will be getting materialized.

@anmyachev (Collaborator, Author) commented:

We have a specific place where the length gets materialized:

    self._length_cache = RayWrapper.materialize(self._length_cache)
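
A simplified, runnable sketch of the caching behavior under discussion (an illustrative class, not Modin's actual partition type): the length passed at construction may itself be a future, so length() is only free once that future has been materialized:

    from concurrent.futures import Future

    class PartitionSketch:
        def __init__(self, data, length=None):
            self._data = data
            # May be a plain int (already known) or a Future (not yet computed).
            self._length_cache = length

        def length(self):
            if self._length_cache is None:
                # Fall back to computing the length from the local data.
                self._length_cache = len(self._data)
            if isinstance(self._length_cache, Future):
                # The materialization step: blocks until the value is ready,
                # analogous to RayWrapper.materialize above.
                self._length_cache = self._length_cache.result()
            return self._length_cache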

@anmyachev (Collaborator, Author) commented:

We should label things PERF only if they're the PR that actually helps performance. If they're building the groundwork, they could be FEAT or REFACTOR, but not PERF.

Apparently yes, I was wrong about that. Converting to draft.

anmyachev marked this pull request as draft October 5, 2022 21:18
@anmyachev (Collaborator, Author) commented:

No longer relevant.

anmyachev closed this Mar 1, 2023
anmyachev deleted the issue4743 branch March 24, 2023 12:10