PERF: slow SeriesGroupBy.sum() and DataFrameGroupBy.apply() #5905
The change I'm working on in #5867 might fix this. I ran a slightly modified reproducer script on my branch for #5867 and got the following numbers:
Intel(R) Xeon(R) Gold 6238R CPU @ 2.20GHz, Num cores: 112; RAM: 192 GB
The experimental branch I used: https://github.com/dchigarev/modin/tree/issue_5867
Exact script I used to measure this:
import numpy as np
import modin.pandas as pd
import pandas
import ray
import modin.config as cfg
cfg.BenchmarkMode.put(True) # to perform in eager mode
if hasattr(cfg, "ExperimentalGroupbyImpl"):
    cfg.ExperimentalGroupbyImpl.put(True)
    print("Using experimental groupby")
else:
    print("Using old groupby")
ray.init(num_cpus=cfg.CpuCount.get())
from timeit import default_timer as timer
a = np.random.randint(0, high=100, size=(50_000_000, 4))
df = pd.DataFrame(a, columns=['x'+str(i+1) for i in range(4)])
pdf = pandas.DataFrame(a, columns=['x'+str(i+1) for i in range(4)])
# part 1: slow groupby.apply()
t1 = timer()
pdf2 = pdf.groupby('x1').apply(lambda g: (g.x1 + g.x2).mean())
print("pandas apply", timer() - t1)
t1 = timer()
df2 = df.groupby('x1').apply(lambda g: (g.x1 + g.x2).mean())
print("modin apply", timer() - t1)
# part 2: slow groupby.transform()
t1 = timer()
pdf.groupby(['x1'])[["x2"]].transform('sum')
print("pandas transform", timer() - t1)
t1 = timer()
res = df.groupby(['x1'])[["x2"]].transform('sum')
print("modin transform", timer() - t1)
I would be really pleased if someone could run this script on my branch on different hardware and share their performance results. (note that you need to set the
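The script above times each operation once, so a single slow run (for example, one that coincides with object spilling) can dominate the result. A minimal sketch of a repeat-and-take-the-minimum timer is shown below; `best_of` is a hypothetical helper name, not part of Modin or pandas, and the workload in the usage line is only a stand-in. With Modin, the measurement is only meaningful if execution is eager, which is what `cfg.BenchmarkMode.put(True)` is used for in the script.

```python
from timeit import default_timer as timer


def best_of(fn, repeats=3):
    """Call fn() `repeats` times and return the fastest wall-clock time in seconds.

    Hypothetical helper for illustration; not part of Modin or pandas.
    """
    timings = []
    for _ in range(repeats):
        start = timer()
        fn()
        timings.append(timer() - start)
    return min(timings)


# Stand-in workload; in the reproducer this would be something like
# lambda: df.groupby(['x1'])[["x2"]].transform('sum')
elapsed = best_of(lambda: sum(range(100_000)))
print(f"best of 3: {elapsed:.6f} s")
```

Taking the minimum rather than the mean is a design choice: it reports the least disturbed run, which is usually what matters when comparing two implementations on a noisy machine.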
@dchigarev thank you for running this benchmark. It looks promising! Unfortunately, when I try your branch on my laptop, Modin seems to slow the pandas transform down significantly: it looks like Ray is spilling to disk while pandas is doing the transform. Anyway, this is what I get. It looks like Modin is helping, but I don't know if this benchmark is valid on my machine:
your branch
master at 55ec621
Note that macs only have 2 GB of RAM due to a ray bug.
@dchigarev can you please check whether this one still persists?
@dchigarev, can we close this issue?
On the current master the repro from the issue works as follows (with
The only thing I had to change to make it work was to modify:
- df.groupby(...)["x2"].transform(...)
+ df.groupby(...)[["x2"]].transform(...)
Range-partitioning groupby doesn't support the initial case for now; however, it should soon, once this issue is resolved (#5926).
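For readers unfamiliar with the difference between the two selections in the diff above, here is a small sketch in plain pandas (not Modin), on made-up data: selecting with `["x2"]` yields a SeriesGroupBy whose `transform` returns a Series, while `[["x2"]]` yields a DataFrameGroupBy whose `transform` returns a one-column DataFrame. The values are the same; only the container differs.

```python
import numpy as np
import pandas as pd

# Small synthetic frame with the same column names as the reproducer.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.integers(0, 100, size=(1_000, 4)),
    columns=["x1", "x2", "x3", "x4"],
)

# Single brackets: SeriesGroupBy, so transform returns a Series.
series_result = df.groupby(["x1"])["x2"].transform("sum")

# Double brackets: DataFrameGroupBy, so transform returns a DataFrame.
frame_result = df.groupby(["x1"])[["x2"]].transform("sum")

print(type(series_result).__name__, type(frame_result).__name__)

# Same numbers either way; the one-column frame just wraps the Series.
print(series_result.equals(frame_result["x2"]))
```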
Following up from #5904.
First, groupby.apply() takes about 6.11 sec in pandas and 13.9 sec in Modin.
SeriesGroupBy.transform takes about 1.89 sec with pandas and 4.39 sec with Modin. I do see that Ray is spilling data to the object store, though, and that usually makes Modin on Ray very slow.
Marking for triage because I need to look at it a bit more to see what is slow.
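As a side note, and not a fix for the slowdown itself: the particular lambda used in the reproducer, `(g.x1 + g.x2).mean()`, can also be written without `groupby.apply()` by computing the row-wise sum once and using the built-in `mean` aggregation. The sketch below uses plain pandas on small made-up data and checks the result against an explicit per-group loop; whether this form is faster in Modin is an assumption that would need to be measured.

```python
import numpy as np
import pandas as pd

# Small synthetic frame with the same column names as the reproducer.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.integers(0, 100, size=(10_000, 4)),
    columns=["x1", "x2", "x3", "x4"],
)

# Vectorized form: add the columns once, then group the result by x1.
vectorized = (df["x1"] + df["x2"]).groupby(df["x1"]).mean()

# Reference: the same per-group computation as the lambda, as an explicit loop.
reference = pd.Series(
    {key: (g["x1"] + g["x2"]).mean() for key, g in df.groupby("x1")}
)

print(np.allclose(vectorized.to_numpy(), reference.to_numpy()))
```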
@LudsteckJ How much RAM is available on your Windows machine, and what CPU do you have (and how many cores)?
System Information
My laptop (macOS Monterey version 12.4 with 16 GB RAM and a 2.3 GHz 8-core Intel CPU, MacBook Pro (16-inch, 2019))