PERF-#4705: Improve perf of arithmetic operations between Series objects with shared .index #4689

Merged on Jul 26, 2022 (1 commit)

Collaborator

@jbrockmendel jbrockmendel commented Jul 20, 2022

closes #4705

import modin.config as cfg
cfg.BenchmarkMode.put(True)

import ray
ray.init()
import modin.pandas as pd
import numpy as np

arr = np.random.randn(100_000, 2)
df = pd.DataFrame(arr)
df.index = pd.MultiIndex.from_product([list('abcdefghij'), np.arange(10_000)])
df = pd.concat([df]*10)

In [4]: %timeit df[0] + df[1]
337 ms ± 6.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # <- master
264 ms ± 6.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # <- PR

In [5]: df = pd.concat([df]*10)

In [6]: %timeit df[0] + df[1]
3.29 s ± 132 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # <- master
2.63 s ± 334 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # <- PR
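
For anyone reproducing these numbers outside IPython, the sketch below approximates the "mean ± std. dev. of 7 runs, 1 loop each" measurement with the standard library only. The helper name `time_per_loop` is hypothetical, and the use of the population standard deviation is an assumption about how `%timeit` aggregates runs, so small differences from the figures above are to be expected.

```python
import statistics
import timeit


def time_per_loop(func, repeat=7, number=1):
    """Return (mean, std) of per-loop wall time in seconds over `repeat` runs."""
    # timeit.repeat yields one total per run; divide by `number` to get per-loop time.
    totals = timeit.repeat(func, repeat=repeat, number=number)
    per_loop = [total / number for total in totals]
    # Population std. dev. is an assumption about how %timeit aggregates runs.
    return statistics.mean(per_loop), statistics.pstdev(per_loop)


# Stand-in workload; in the benchmark above this would be `lambda: df[0] + df[1]`.
mean, std = time_per_loop(lambda: sum(range(100_000)))
print(f"{mean * 1e3:.3g} ms ± {std * 1e3:.3g} ms per loop")
```

With Modin, `cfg.BenchmarkMode.put(True)` should still be set before timing, as in the setup above, so that the measured call includes the actual computation.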

@jbrockmendel jbrockmendel requested a review from a team as a code owner July 20, 2022 15:08
@pyrito
Collaborator

pyrito commented Jul 20, 2022

Thanks for the PR @jbrockmendel !

Could you try running your performance measurements again with benchmark mode enabled?

import modin.config as cfg

# Enable benchmark mode
cfg.BenchmarkMode.put(True)

@jbrockmendel
Collaborator Author

With BenchmarkMode enabled I get:

In [4]: %timeit df[0] + df[1]
337 ms ± 6.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # <- master
264 ms ± 6.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # <- PR

@pyrito
Collaborator

pyrito commented Jul 21, 2022

@jbrockmendel could you try running the benchmarks for larger datasets and let us know how the performance looks there?

Also, it looks like the rest of the CI jobs aren't running because your commit message isn't formatted properly. You'll probably have to create a GitHub issue first and then link this PR to it. See the contributing guide for details: https://modin.readthedocs.io/en/stable/development/contributing.html

@codecov

codecov bot commented Jul 22, 2022

Codecov Report

Merging #4689 (1f06a3b) into master (cc713c5) will increase coverage by 4.59%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #4689      +/-   ##
==========================================
+ Coverage   85.25%   89.84%   +4.59%     
==========================================
  Files         259      260       +1     
  Lines       19211    19494     +283     
==========================================
+ Hits        16378    17515    +1137     
+ Misses       2833     1979     -854     
Impacted Files                                         Coverage Δ
modin/pandas/series.py                                 94.23% <100.00%> (+0.24%) ⬆️
modin/logging/config.py                                94.59% <0.00%> (-1.30%) ⬇️
modin/experimental/batch/test/test_pipeline.py         100.00% <0.00%> (ø)
modin/pandas/series_utils.py                           99.43% <0.00%> (+0.56%) ⬆️
...ns/pandas_on_ray/partitioning/partition_manager.py  82.19% <0.00%> (+1.36%) ⬆️
modin/core/io/text/excel_dispatcher.py                 93.33% <0.00%> (+1.66%) ⬆️
...tations/pandas_on_python/partitioning/partition.py  93.75% <0.00%> (+2.08%) ⬆️
modin/config/envvars.py                                89.10% <0.00%> (+3.46%) ⬆️
...dataframe/pandas/partitioning/partition_manager.py  90.09% <0.00%> (+3.71%) ⬆️
... and 31 more

@jbrockmendel
Collaborator Author

Updated OP with results on 10x larger DataFrame.
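
To put the two measurements side by side, here is a small worked calculation of the relative improvement, using only the mean timings quoted in the PR description. The row counts in the labels are inferred from the setup script (100,000 rows concatenated 10x, then 10x again), and the function name `summarize` is hypothetical. The spread on the larger run (± 334 ms on the PR side) is a sizable fraction of the gap, so the second figure should be read as approximate.

```python
# Mean timings in seconds, copied from the %timeit output in the PR description.
timings = {
    "1M rows": {"master": 0.337, "pr": 0.264},
    "10M rows": {"master": 3.29, "pr": 2.63},
}


def summarize(master, pr):
    """Return (fractional time reduction, speedup factor) of pr relative to master."""
    reduction = (master - pr) / master
    speedup = master / pr
    return reduction, speedup


for label, t in timings.items():
    reduction, speedup = summarize(t["master"], t["pr"])
    print(f"{label}: {reduction:.1%} less time, {speedup:.2f}x speedup")
# 1M rows: 21.7% less time, 1.28x speedup
# 10M rows: 20.1% less time, 1.25x speedup
```

On these means the improvement stays at roughly 20% for both sizes.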

pyrito previously approved these changes Jul 22, 2022
Collaborator

@pyrito pyrito left a comment

LGTM. @jbrockmendel could you please change the release notes?

Signed-off-by: Brock Mendel <jbrockmendel@gmail.com>
Collaborator

@mvashishtha mvashishtha left a comment

Thanks @jbrockmendel !

@YarShev YarShev changed the title from "PERF: df[0]+df[1]" to "PERF-#4705: Improve perf of arithmetic operations between Series objects with shared .index" on Jul 26, 2022
Collaborator

@YarShev YarShev left a comment

@jbrockmendel, LGTM, thanks!

@YarShev YarShev merged commit 49c0398 into modin-project:master Jul 26, 2022
Development

Successfully merging this pull request may close these issues.

PERF: Avoid deepcopy in Series arithmetic operations
4 participants