[SPARK-49640][PS] Apply reservoir sampling in SampledPlotBase
#48105
Conversation
```python
        F.monotonically_increasing_id().alias(id_col_name),
    )
    .sort(rand_col_name)
    .limit(max_rows + 1)
```
`sort` + `limit` is likely to be optimized into `TakeOrderedAndProject`, which outputs a single partition; the `coalesce` here is just used to guarantee the partitioning.
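For context, this optimization can be checked by inspecting the physical plan. A minimal sketch (the exact plan text varies across Spark versions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A sort followed by a limit is planned as a single top-k operation.
df = spark.range(1000)
df.sort("id").limit(10).explain()
# On recent Spark versions the printed physical plan should contain a
# TakeOrderedAndProject node, whose output is a single partition.
```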
```python
    .sort(rand_col_name)
    .limit(max_rows + 1)
    .coalesce(1)
    .sortWithinPartitions(id_col_name)
```
Using local sorting to avoid an unnecessary shuffle.
Will send a separate PR for the new DataFrame plotting.
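Putting the reviewed lines together, the sampling pattern under discussion looks roughly like this. This is a sketch with illustrative names (`max_rows`, the column aliases, and the `+ 1` rationale are assumptions), not the exact `SampledPlotBase` code:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.range(10000)      # stand-in for the input DataFrame

max_rows = 100
rand_col_name = "__rand__"    # illustrative alias
id_col_name = "__id__"        # illustrative alias

sampled = (
    sdf.select(
        "*",
        F.rand().alias(rand_col_name),
        F.monotonically_increasing_id().alias(id_col_name),
    )
    .sort(rand_col_name)      # order rows by a uniform random key ...
    .limit(max_rows + 1)      # ... keep max_rows (+1 presumably to detect
                              # that the input exceeded max_rows)
    .coalesce(1)              # guarantee a single partition (see comment above)
    .sortWithinPartitions(id_col_name)  # restore original row order locally,
                                        # without an extra shuffle
    .drop(rand_col_name, id_col_name)
)
```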
I’m wondering if this is considered a “user-facing change”, since the sampling directly affects the appearance of the plot.
Thank you @zhengruifeng!
hmm, I don't think this is a “user-facing change” because:
thanks, merged to master |
### What changes were proposed in this pull request?

Apply reservoir sampling in `SampledPlotBase`.

### Why are the changes needed?

The existing sampling approach has two drawbacks:

1. It needs two jobs to sample `max_rows` rows:
   - `df.count()` to compute `fraction = max_rows / count`
   - `df.sample(fraction).to_pandas()` to do the sampling
2. `df.sample` is based on Bernoulli sampling, which **cannot** guarantee that the sampled size equals the expected `max_rows`, e.g.

```
In [1]: df = spark.range(10000)

In [2]: [df.sample(0.01).count() for i in range(0, 10)]
Out[2]: [96, 97, 95, 97, 105, 105, 105, 87, 95, 110]
```

The size of the sampled data fluctuates around the target size 10000 * 0.01 = 100. This relative deviation cannot be ignored when the input dataset is large and the sampling fraction is small.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI and manual checks.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#48105 from zhengruifeng/ps_sampling.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
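As background on the PR title: reservoir sampling draws a uniform random sample of exactly `k` items in a single pass over data of unknown size, which addresses both drawbacks above. A minimal pure-Python sketch of the classic Algorithm R, for illustration only (it is not the PR's Spark-side implementation):

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Return a uniform random sample of exactly k items from `stream`
    in one pass (Algorithm R), assuming the stream has at least k items."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # uniform over [0, i], inclusive
            if j < k:
                reservoir[j] = item     # replace with decreasing probability
    return reservoir

# Unlike Bernoulli sampling, the output size is always exactly k:
print(len(reservoir_sample(range(10000), 100)))  # -> 100
```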