Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF-#5268: Call get on all partitions at once in to_pandas #4776

Merged
merged 15 commits into from
Nov 27, 2022

Conversation

anmyachev
Copy link
Collaborator

@anmyachev anmyachev commented Aug 4, 2022

Signed-off-by: Myachev anatoly.myachev@intel.com

What do these changes do?

  • commit message follows format outlined here
  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Improve scaling of to_pandas; getting all objects from partitions at once #5268
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date
  • added (Issue Number: PR title (PR Number)) and github username to release notes for next major release

@codecov
Copy link

codecov bot commented Aug 4, 2022

Codecov Report

Merging #4776 (be44af7) into master (19ac8ff) will decrease coverage by 18.90%.
The diff coverage is 100.00%.

@@             Coverage Diff             @@
##           master    #4776       +/-   ##
===========================================
- Coverage   84.82%   65.92%   -18.91%     
===========================================
  Files         268      269        +1     
  Lines       19701    19986      +285     
===========================================
- Hits        16712    13176     -3536     
- Misses       2989     6810     +3821     
Impacted Files Coverage Δ
...dataframe/pandas/partitioning/partition_manager.py 89.31% <100.00%> (+3.94%) ⬆️
modin/core/execution/dask/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
modin/experimental/sklearn/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
modin/core/execution/dask/common/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
modin/experimental/xgboost/test/test_dmatrix.py 0.00% <0.00%> (-100.00%) ⬇️
modin/experimental/xgboost/test/test_xgboost.py 0.00% <0.00%> (-100.00%) ⬇️
modin/experimental/core/execution/ray/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
modin/core/execution/dask/common/engine_wrapper.py 0.00% <0.00%> (-100.00%) ⬇️
...odin/experimental/core/storage_formats/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
...in/core/execution/dask/implementations/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
... and 112 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@anmyachev anmyachev force-pushed the issue2814 branch 2 times, most recently from 23511c1 to cca6978 Compare November 22, 2022 18:36
@anmyachev anmyachev changed the title PERF-#2814: Call 'get' on all partitions at once in 'to_pandas' PERF-#2814: Call get on all partitions at once in to_pandas Nov 22, 2022
@anmyachev anmyachev marked this pull request as ready for review November 22, 2022 18:55
@anmyachev anmyachev requested a review from a team as a code owner November 22, 2022 18:55
@anmyachev
Copy link
Collaborator Author

@vnlitvinov could you take a look?

Copy link
Collaborator

@vnlitvinov vnlitvinov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got benchmarks for this change?

@anmyachev
Copy link
Collaborator Author

Got benchmarks for this change?

@vnlitvinov results by using asv:

the PR

          --                             cpus
          --------------- ---------------------------------
               shape          4          16          32    
          =============== ========== ========== ===========
            [5000, 5000]   164±10ms   283±10ms    449±30ms
           [1000000, 10]   60.2±4ms   74.6±3ms   84.9±10ms
          =============== ========== ========== ===========

Master branch

          --                             cpus
          --------------- ---------------------------------
               shape          4          16          32
          =============== ========== ========== ===========
            [5000, 5000]   167±4ms    368±10ms   1.01±0.1s
           [1000000, 10]   63.3±6ms   77.3±4ms    96.5±6ms
          =============== ========== ========== ===========

anmyachev and others added 7 commits November 24, 2022 14:55
…pandas'

Signed-off-by: Myachev <anatoly.myachev@intel.com>
Signed-off-by: Myachev <anatoly.myachev@intel.com>
Signed-off-by: Myachev <anatoly.myachev@intel.com>
Signed-off-by: Myachev <anatoly.myachev@intel.com>
Co-authored-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>
Signed-off-by: Myachev <anatoly.myachev@intel.com>
Signed-off-by: Myachev <anatoly.myachev@intel.com>
@anmyachev
Copy link
Collaborator Author

@vnlitvinov ready for review

Co-authored-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
anmyachev and others added 4 commits November 24, 2022 17:16
Co-authored-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Signed-off-by: Myachev <anatoly.myachev@intel.com>
Signed-off-by: Myachev <anatoly.myachev@intel.com>
Signed-off-by: Myachev <anatoly.myachev@intel.com>
anmyachev and others added 2 commits November 24, 2022 19:39
Co-authored-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>
Signed-off-by: Myachev <anatoly.myachev@intel.com>
vnlitvinov
vnlitvinov previously approved these changes Nov 25, 2022
Copy link
Collaborator

@vnlitvinov vnlitvinov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, minor questions left only

Signed-off-by: Myachev <anatoly.myachev@intel.com>
Copy link
Collaborator

@vnlitvinov vnlitvinov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@vnlitvinov
Copy link
Collaborator

@anmyachev one note - this PR is solving only half of problem stated in #2814 (getting things serially), but not the other (using N*N pandas.concat operations), maybe we should open a new issue for the N-square concat-s?

@anmyachev anmyachev changed the title PERF-#2814: Call get on all partitions at once in to_pandas PERF-#5268: Call get on all partitions at once in to_pandas Nov 25, 2022
@anmyachev
Copy link
Collaborator Author

anmyachev commented Nov 25, 2022

@anmyachev one note - this PR is solving only half of problem stated in #2814 (getting things serially), but not the other (using N*N pandas.concat operations), maybe we should open a new issue for the N-square concat-s?

I created a separate issue: #5268

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve scaling of to_pandas; getting all objects from partitions at once
4 participants