PERF: slow SeriesGroupBy.sum() and DataFrameGroupBy.apply() #5905

mvashishtha · 2023-03-30T12:34:14Z

Following up from #5904.

import numpy as np
import modin.pandas as pd
import pandas

a = np.random.randint(0, high=100, size=(50_000_000, 4))
df = pd.DataFrame(a, columns=['x'+str(i+1) for i in range(4)])
pdf = pandas.DataFrame(a, columns=['x'+str(i+1) for i in range(4)])
# part 1: slow groupby.apply()
%time pdf2 = pdf.groupby('x1').apply(lambda g: (g.x1 + g.x2).mean())
%time df2 = df.groupby('x1').apply(lambda g: (g.x1 + g.x2).mean())
# part 2: slow groupby.transform()
%time pdf['s'] = pdf.groupby(['x1']).x2.transform('sum')
%time df['s'] = df.groupby(['x1']).x2.transform('sum')

First, groupby.apply() takes about 6.11 sec in pandas and 13.9 sec in modin.

SeriesGroupBy.transform takes about 1.89 sec with pandas and 4.39 sec with modin. I do see that ray is spilling objects from the object store to disk, though, and that usually makes modin on ray very slow.

Marking for triage because I need to look at it a bit more to see what is slow.

@LudsteckJ How much RAM is available on your Windows machine, what kind of CPU do you have, and how many cores does it have?

System Information
my laptop (macOS Monterey version 12.4 with 16 GB RAM and 2.3 GHz 8-core intel CPU on MacBook Pro (16-inch, 2019))


dchigarev · 2023-03-30T14:41:56Z

The work I'm doing for #5867 might fix this. I ran a slightly modified version of the reproducer script on my branch for #5867 and got the following numbers:

operation (time in s)   pandas   modin master   modin experimental groupby
apply                   4.11     6.31           1.48
transform               2.01     2.86           2.08

Intel(R) Xeon(R) Gold 6238R CPU @ 2.20GHz, Num cores: 112; RAM: 192 GB

The experimental branch I used: https://github.com/dchigarev/modin/tree/issue_5867
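
To make the comparison in the table easier to read, the timings can be turned into speedups relative to pandas. This is only a small arithmetic sketch over the numbers reported above; the dictionary layout and the `speedup_vs_pandas` helper are illustrative names, not part of modin.

```python
# Timings in seconds, copied from the table above.
timings = {
    "apply": {
        "pandas": 4.11,
        "modin master": 6.31,
        "modin experimental groupby": 1.48,
    },
    "transform": {
        "pandas": 2.01,
        "modin master": 2.86,
        "modin experimental groupby": 2.08,
    },
}


def speedup_vs_pandas(row):
    """Return pandas_time / other_time for every non-pandas entry.

    A value above 1 means that implementation was faster than pandas.
    """
    baseline = row["pandas"]
    return {name: baseline / t for name, t in row.items() if name != "pandas"}


speedups = {op: speedup_vs_pandas(row) for op, row in timings.items()}

for op, row in speedups.items():
    for name, ratio in row.items():
        print(f"{op:<10} {name:<28} {ratio:.2f}x vs pandas")
```

On these numbers, only the experimental groupby on apply comes out clearly ahead of pandas; the experimental transform is roughly on par, and master is slower in both cases.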

Exact script I used to measure this

import numpy as np
import modin.pandas as pd
import pandas
import ray
import modin.config as cfg

cfg.BenchmarkMode.put(True) # to perform in eager mode

if hasattr(cfg, "ExperimentalGroupbyImpl"):
    cfg.ExperimentalGroupbyImpl.put(True)
    print("Using experimental groupby")
else:
    print("Using old groupby")

ray.init(num_cpus=cfg.CpuCount.get())
from timeit import default_timer as timer

a = np.random.randint(0, high=100, size=(50_000_000, 4))
df = pd.DataFrame(a, columns=['x'+str(i+1) for i in range(4)])
pdf = pandas.DataFrame(a, columns=['x'+str(i+1) for i in range(4)])

# part 1: slow groupby.apply()
t1 = timer()
pdf2 = pdf.groupby('x1').apply(lambda g: (g.x1 + g.x2).mean())
print("pandas apply", timer() - t1)

t1 = timer()
df2 = df.groupby('x1').apply(lambda g: (g.x1 + g.x2).mean())
print("modin apply", timer() - t1)

# part 2: slow groupby.transform()
t1 = timer()
pdf.groupby(['x1'])[["x2"]].transform('sum')
print("pandas transform", timer() - t1)

t1 = timer()
res = df.groupby(['x1'])[["x2"]].transform('sum')
print("modin transform", timer() - t1)

I'd really appreciate it if someone could run this script on my branch on different hardware and share their performance results. (Note that you need to call cfg.ExperimentalGroupbyImpl.put(True) to run via the experimental groupby.)
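
For anyone adapting the script, the repeated `t1 = timer()` / `print(...)` pairs could be wrapped in a small context manager. This is a minimal sketch: `timed` is a hypothetical helper written for illustration (it is not a modin or pandas API), and the timed body here is a toy computation standing in for the groupby calls.

```python
from contextlib import contextmanager
from timeit import default_timer as timer


@contextmanager
def timed(label, results=None):
    """Print the wall-clock time of the enclosed block, in seconds.

    If a dict is passed as `results`, the elapsed time is also stored
    under `label` so the numbers can be compared afterwards.
    """
    start = timer()
    try:
        yield
    finally:
        elapsed = timer() - start
        if results is not None:
            results[label] = elapsed
        print(label, elapsed)


results = {}

# Toy workload; in the script above this would be a groupby call.
with timed("toy sum", results):
    total = sum(range(1_000_000))
```

In the benchmark itself, each block would become something like `with timed("modin apply", results): df2 = df.groupby('x1').apply(...)`. The measurement is only meaningful for modin when the work is executed eagerly, which is what the script's `cfg.BenchmarkMode.put(True)` line is there for.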

mvashishtha · 2023-03-30T15:03:49Z

@dchigarev thank you for running this benchmark. It looks promising! Unfortunately, when I try your branch on my laptop, it seems that Modin is causing the pandas transform to slow down significantly. It looks like ray is doing some disk spilling while pandas is doing the transform. Anyway, this is what I get. It looks like modin is helping, but I don't know if this benchmark is valid on my machine:

your branch

pandas apply 7.739943127999998
modin apply 2.4368688879999993
(raylet) Spilled 2670 MiB, 30 objects, write throughput 315 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
(raylet) Spilled 4959 MiB, 291 objects, write throughput 417 MiB/s.
pandas transform 24.632068368
modin transform 15.550206659000004

master at 55ec621

pandas apply 11.576054658
modin apply 29.954450095000002
pandas transform 5.754427107000005
(raylet) Spilled 2121 MiB, 46 objects, write throughput 302 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
modin transform 9.946109193999987

Note that on macs, ray only gets 2 GB of memory for its object store due to a ray bug.
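
For a rough sense of why a 2 GB limit would lead to spilling with this reproducer, the footprint of the input array can be estimated by hand. This is a back-of-the-envelope sketch under two assumptions: `np.random.randint` produces 64-bit integers (the default on macOS and Linux), and the 2 GB figure from the note above is treated as 2 GiB. It ignores any overhead or intermediate copies that modin and ray add on top.

```python
# Shape of the array from the reproducer: size=(50_000_000, 4).
ROWS, COLS = 50_000_000, 4

# Assumption: 8 bytes per element (int64).
BYTES_PER_INT64 = 8

# Assumption: the "2 GB" limit quoted above, read as 2 GiB.
LIMIT_BYTES = 2 * 1024**3


def array_footprint_bytes(rows, cols, itemsize=BYTES_PER_INT64):
    """Raw size of a dense rows x cols array with the given item size."""
    return rows * cols * itemsize


one_copy = array_footprint_bytes(ROWS, COLS)
copies_that_fit = LIMIT_BYTES // one_copy

print(f"one copy of the input: {one_copy / 1024**3:.2f} GiB")
print(f"full copies that fit under the limit: {copies_that_fit}")
```

Under these assumptions a single copy of the input already takes most of the limit, so any intermediate result of comparable size would not fit alongside it, which is consistent with the spill messages in the logs above.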

Garra1980 · 2023-07-04T17:36:55Z

@dchigarev can you please check whether this one still persists?

YarShev · 2024-01-11T21:46:30Z

@dchigarev, can we close this issue?

dchigarev · 2024-01-12T10:08:03Z

On the current master the repro from the issue works as follows (with RangePartitioningGroupby enabled):

The only thing I had to change to make it work is to modify groupby.__getitem__ call as follows:

- df.groupby(...)["x2"].transform(...)
+ df.groupby(...)[["x2"]].transform(...)

Range-partitioning groupby doesn't support the initial case for now; however, it should soon, once #5926 is resolved.
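
For readers unfamiliar with the difference between the two `__getitem__` forms in the diff above, here is a small illustration using plain pandas on a toy frame (not modin, and not the 50M-row reproducer). Selecting with a string yields a SeriesGroupBy and a Series result, while selecting with a list yields a DataFrameGroupBy and a one-column DataFrame holding the same values.

```python
import pandas as pd

# Toy data with the same column names as the reproducer.
df = pd.DataFrame({"x1": [0, 0, 1, 1], "x2": [1, 2, 3, 4]})

# String key: SeriesGroupBy, so transform returns a Series.
as_series = df.groupby("x1")["x2"].transform("sum")

# List key: DataFrameGroupBy, so transform returns a one-column DataFrame.
as_frame = df.groupby("x1")[["x2"]].transform("sum")

print(type(as_series).__name__, as_series.tolist())
print(type(as_frame).__name__, as_frame["x2"].tolist())

# The DataFrame form can still be assigned to a single column,
# as in the original `df['s'] = ...` line, by picking the column back out.
df["s"] = as_frame["x2"]
```

So the workaround changes the type of the result but not the computed values; code that expects a Series needs to select the column from the returned DataFrame.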

mvashishtha added the Performance 🚀 (Performance related issues and pull requests), Ray ⚡ (Issues related to the Ray engine), pandas.groupby, and Triage 🩹 (Issues that need triage) labels on Mar 30, 2023

mvashishtha changed the title ~~PERF: slow SeriesGroupBy.sum() a~~ PERF: slow SeriesGroupBy.sum() and DataFrameGroupBy.apply() Mar 30, 2023

mvashishtha mentioned this issue in BUG: #5904 (closed, 3 tasks) on Mar 30, 2023

mvashishtha added the P1 (Important tasks that we should complete soon) label and removed the Triage 🩹 (Issues that need triage) label on Mar 30, 2023

dchigarev closed this as completed Jan 12, 2024
