PERF: slow SeriesGroupBy.sum() and DataFrameGroupBy.apply() #5905

Closed
mvashishtha opened this issue Mar 30, 2023 · 5 comments
Labels
P1 Important tasks that we should complete soon pandas.groupby Performance 🚀 Performance related issues and pull requests. Ray ⚡ Issues related to the Ray engine

Comments

@mvashishtha
Collaborator

mvashishtha commented Mar 30, 2023

Following up from #5904.

import numpy as np
import modin.pandas as pd
import pandas

a = np.random.randint(0, high=100, size=(50_000_000, 4))
df = pd.DataFrame(a, columns=['x'+str(i+1) for i in range(4)])
pdf = pandas.DataFrame(a, columns=['x'+str(i+1) for i in range(4)])
# part 1: slow groupby.apply()
%time pdf2 = pdf.groupby('x1').apply(lambda g: (g.x1 + g.x2).mean())
%time df2 = df.groupby('x1').apply(lambda g: (g.x1 + g.x2).mean())
# part 2: slow groupby.transform()
%time pdf['s'] = pdf.groupby(['x1']).x2.transform('sum')
%time df['s'] = df.groupby(['x1']).x2.transform('sum')

First, groupby.apply() takes about 6.11 sec in pandas and 13.9 sec in Modin.
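As an aside, this particular reduction can be written without apply() at all. The sketch below is pandas-only, uses a small made-up frame rather than the 50M-row one, and does not address the Modin slowdown itself; it only shows that the row-wise sum can be computed once and then averaged per group. The plain-Python cross-check is there for illustration.

```python
import numpy as np
import pandas

# Small illustrative frame (assumption: same column names as the reproducer).
rng = np.random.default_rng(0)
pdf = pandas.DataFrame(rng.integers(0, 5, size=(1_000, 2)), columns=["x1", "x2"])

# Vectorized form of groupby('x1').apply(lambda g: (g.x1 + g.x2).mean()):
# add the columns once, then take the mean within each x1 group.
fast = (pdf["x1"] + pdf["x2"]).groupby(pdf["x1"]).mean()

# Reference values computed with plain Python, for cross-checking only.
slow = {
    int(key): float(np.mean([a + b for a, b in zip(pdf["x1"], pdf["x2"]) if a == key]))
    for key in sorted(pdf["x1"].unique())
}

for key, expected in slow.items():
    print(key, fast[key], expected)
```

Whether this form is also faster under Modin on Ray is not something this thread measures.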

Second, SeriesGroupBy.transform takes about 1.89 sec with pandas and 4.39 sec on Modin. I do see that Ray is spilling data from the object store, though, and that usually makes Modin on Ray very slow.
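For readers less familiar with the second operation, here is a minimal pandas-only sketch of what transform('sum') computes, on a tiny made-up frame: each row receives the sum of x2 over all rows that share its x1 value, so the result has the same length as the input.

```python
import pandas

# Tiny illustrative frame; the values are made up.
pdf = pandas.DataFrame({"x1": [0, 0, 1, 1, 1], "x2": [1, 2, 3, 4, 5]})

# One output value per input row: the x2 total of that row's x1 group.
pdf["s"] = pdf.groupby(["x1"]).x2.transform("sum")
print(pdf)
```

Because the output is as large as the input, the 50M-row reproducer has to materialize a full-length result column, unlike the apply() case, which reduces to one value per group.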

Marking for triage because I need to look at it a bit more to see what is slow.

@LudsteckJ How much RAM is available on your Windows machine, and what CPU do you have (model and number of cores)?

System Information
My laptop: MacBook Pro (16-inch, 2019), macOS Monterey version 12.4, 16 GB RAM, 2.3 GHz 8-core Intel CPU.

@mvashishtha mvashishtha added Performance 🚀 Performance related issues and pull requests. Ray ⚡ Issues related to the Ray engine pandas.groupby Triage 🩹 Issues that need triage labels Mar 30, 2023
@mvashishtha mvashishtha changed the title from "PERF: slow SeriesGroupBy.sum() a" to "PERF: slow SeriesGroupBy.sum() and DataFrameGroupBy.apply()" Mar 30, 2023
@mvashishtha mvashishtha mentioned this issue Mar 30, 2023
3 tasks
@mvashishtha mvashishtha added P1 Important tasks that we should complete soon and removed Triage 🩹 Issues that need triage labels Mar 30, 2023
@dchigarev
Collaborator

dchigarev commented Mar 30, 2023

#5867, which I'm working on, might fix this. I ran a slightly modified reproducer script on my branch for #5867 and got the following numbers:

| time (s) | pandas | modin master | modin experimental groupby |
|---|---|---|---|
| apply | 4.11 | 6.31 | 1.48 |
| transform | 2.01 | 2.86 | 2.08 |

Intel(R) Xeon(R) Gold 6238R CPU @ 2.20GHz, Num cores: 112; RAM: 192 GB

The experimental branch I used: https://github.com/dchigarev/modin/tree/issue_5867
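To make the numbers above easier to compare, the following sketch turns them into speedups relative to pandas (values above 1 mean faster than pandas). The timings are copied from this comment and assumed to be in seconds, since the script prints timeit.default_timer differences; the dictionary layout is just for illustration.

```python
# Timings copied from the comment above (assumed to be seconds).
timings = {
    "apply": {"pandas": 4.11, "modin master": 6.31, "modin experimental groupby": 1.48},
    "transform": {"pandas": 2.01, "modin master": 2.86, "modin experimental groupby": 2.08},
}

# Speedup relative to pandas: pandas time divided by the implementation's time.
speedups = {
    op: {
        impl: round(row["pandas"] / seconds, 2)
        for impl, seconds in row.items()
        if impl != "pandas"
    }
    for op, row in timings.items()
}

for op, row in speedups.items():
    for impl, ratio in row.items():
        print(f"{op:10s} {impl:28s} {ratio:.2f}x vs pandas")
```

On this machine the experimental groupby is roughly 2.8x faster than pandas for apply and roughly on par with pandas for transform, while master is slower than pandas for both.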

Exact script I used to measure this
import numpy as np
import modin.pandas as pd
import pandas
import ray
import modin.config as cfg
from timeit import default_timer as timer

# Benchmark mode makes Modin execute eagerly, so the timings cover the full computation.
cfg.BenchmarkMode.put(True)

# The experimental implementation only exists on the issue_5867 branch.
if hasattr(cfg, "ExperimentalGroupbyImpl"):
    cfg.ExperimentalGroupbyImpl.put(True)
    print("Using experimental groupby")
else:
    print("Using old groupby")

ray.init(num_cpus=cfg.CpuCount.get())

a = np.random.randint(0, high=100, size=(50_000_000, 4))
df = pd.DataFrame(a, columns=['x'+str(i+1) for i in range(4)])
pdf = pandas.DataFrame(a, columns=['x'+str(i+1) for i in range(4)])

# part 1: slow groupby.apply()
t1 = timer()
pdf2 = pdf.groupby('x1').apply(lambda g: (g.x1 + g.x2).mean())
print("pandas apply", timer() - t1)

t1 = timer()
df2 = df.groupby('x1').apply(lambda g: (g.x1 + g.x2).mean())
print("modin apply", timer() - t1)

# part 2: slow groupby.transform()
t1 = timer()
pdf.groupby(['x1'])[["x2"]].transform('sum')
print("pandas transform", timer() - t1)

t1 = timer()
res = df.groupby(['x1'])[["x2"]].transform('sum')
print("modin transform", timer() - t1)

I'd appreciate it if someone could run this script on my branch on different hardware and share the results. (Note that you need to call cfg.ExperimentalGroupbyImpl.put(True) to use the experimental groupby.)
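For anyone collecting numbers from several machines, the repeated t1 = timer() / print(...) pairs in the script can be wrapped in a small helper. This is only a sketch using the standard library; the name timed and the results dictionary are hypothetical and not part of Modin or the script above.

```python
from contextlib import contextmanager
from timeit import default_timer as timer


@contextmanager
def timed(label, results=None):
    """Print (and optionally record) the wall-clock time of the enclosed block."""
    start = timer()
    try:
        yield
    finally:
        elapsed = timer() - start
        if results is not None:
            results[label] = elapsed
        print(label, elapsed)


# Usage with a stand-in workload; in the benchmark the body would be
# something like df.groupby('x1').apply(...).
results = {}
with timed("toy workload", results):
    total = sum(range(100_000))
```

Recording the timings in a dictionary makes it easy to paste a complete set of results into a comment.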

@mvashishtha
Collaborator Author

@dchigarev thank you for running this benchmark. It looks promising! Unfortunately, when I try your branch on my laptop, Modin seems to slow the pandas transform down significantly: Ray appears to be spilling to disk while pandas is doing the transform. Here is what I get. It looks like Modin is helping, but I don't know whether this benchmark is valid on my machine:

your branch

pandas apply 7.739943127999998
modin apply 2.4368688879999993
(raylet) Spilled 2670 MiB, 30 objects, write throughput 315 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
(raylet) Spilled 4959 MiB, 291 objects, write throughput 417 MiB/s.
pandas transform 24.632068368
modin transform 15.550206659000004

master at 55ec621

pandas apply 11.576054658
modin apply 29.954450095000002
pandas transform 5.754427107000005
(raylet) Spilled 2121 MiB, 46 objects, write throughput 302 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
modin transform 9.946109193999987

Note that on Macs, Ray only gets 2 GB of memory for its object store due to a Ray bug.

@Garra1980
Collaborator

@dchigarev can you please check whether this one still persists?

@YarShev
Collaborator

YarShev commented Jan 11, 2024

@dchigarev, can we close this issue?

@dchigarev
Collaborator

On the current master the repro from the issue works as follows (with RangePartitioningGroupby enabled):
[image: benchmark results screenshot]

The only change I had to make was to the groupby.__getitem__ call:

- df.groupby(...)["x2"].transform(...)
+ df.groupby(...)[["x2"]].transform(...)

Range-partitioning groupby doesn't support the original form yet, but it should soon, once #5926 is resolved.
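The difference between the two forms is easy to see in plain pandas. In this sketch (a tiny made-up frame, pandas only), selecting with a string yields a SeriesGroupBy whose transform returns a Series, while selecting with a one-element list yields a DataFrameGroupBy whose transform returns a one-column DataFrame with the same values.

```python
import pandas

# Tiny illustrative frame; the values are made up.
pdf = pandas.DataFrame({"x1": [0, 0, 1], "x2": [1, 2, 3]})

as_series = pdf.groupby(["x1"])["x2"].transform("sum")   # SeriesGroupBy -> Series
as_frame = pdf.groupby(["x1"])[["x2"]].transform("sum")  # DataFrameGroupBy -> DataFrame

print(type(as_series).__name__, type(as_frame).__name__)

# Same numbers either way; pick the column out of the frame to assign it.
pdf["s"] = as_frame["x2"]
```

With the list form, the original assignment from the issue therefore needs one extra column selection, as in the last line.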
