Optimize sample perf #4159
Conversation
Signed-off-by: Chong Gao <res_life@163.com>
Note: depends on CUDF issue rapidsai/cudf#9728.
build
Build failed because the CUDF PR is not merged yet.
Marking as draft because the CUDF JNI dependency is not in yet.
// copy row indexes to host buffer
var idx = 0
while (idx < rows.length) {
  hostBuffer.setInt(idx * intBytes, rows(idx))
  idx += 1
}
This is mostly for my own curiosity. You could have also done this with
withResource(cudf.ColumnVector.fromInts(rows: _*)) { gatherCv =>
withResource(GpuColumnVector.from(cb)) { table =>
// GPU gather
withResource(table.gather(gatherCv)) { gatheredTable =>
GpuColumnVector.from(gatheredTable, colTypes)
}
}
}
Did you try this? If so, was your current way more performant? I realize it might involve an extra memory copy because the ArrayBuffer needs to be transformed into an Array so it can be passed to fromInts. So if you didn't, I don't think it is worth spending too much time on it.
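To make the copy in question concrete, here is a toy sketch (illustrative only, not code from this PR; the buffer contents are invented):

import scala.collection.mutable.ArrayBuffer

// The selected row indexes accumulate in an ArrayBuffer on the host.
val rows = ArrayBuffer(0, 2, 5, 7)
// cudf.ColumnVector.fromInts takes varargs, so the buffer contents get
// materialized as an Array[Int] before the call, which is the extra host copy.
val rowArray: Array[Int] = rows.toArray
// withResource(ai.rapids.cudf.ColumnVector.fromInts(rowArray: _*)) { gatherCv => ... }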
Updated; the code is neater and more efficient.
    preservesPartitioning = true,
    seed)
} else {
  val useGpuToSample = new RapidsConf(conf).isFastSampleEnabled
nit: I think it would be cleaner if we had a GpuFastSampleExec and a GpuSampleExec; then we can select which one to use when we replace it. Part of this is because creating a RapidsConf is not cheap.
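A minimal sketch of that split, using illustrative names rather than the actual spark-rapids classes: the config is consulted once when the plan is replaced, so the per-partition code never has to build a RapidsConf.

// Illustrative only: two dedicated exec nodes, picked once at plan-replacement time.
sealed trait SampleLikeExec
case class GpuSampleExec(fraction: Double, withReplacement: Boolean, seed: Long)
  extends SampleLikeExec
case class GpuFastSampleExec(fraction: Double, withReplacement: Boolean, seed: Long)
  extends SampleLikeExec

def replaceSampleExec(
    fraction: Double,
    withReplacement: Boolean,
    seed: Long,
    fastSampleEnabled: Boolean): SampleLikeExec = {
  // The conf lookup happens exactly once, here, not inside the RDD/sampling code.
  if (fastSampleEnabled) {
    GpuFastSampleExec(fraction, withReplacement, seed)
  } else {
    GpuSampleExec(fraction, withReplacement, seed)
  }
}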
Yes, done.
  c
} else {
  withResource(GpuColumnVector.from(cb)) { table =>
    withResource(table.sample(numSampleRows, withReplacement, seed)) { sampled =>
nit: Because we use the same seed every time, I think all of the batches will be sampled the same way. I am not sure mathematically how that all works out. Could we at least set the seed to seed + index so each task does it slightly differently?
Updated; the seed is now derived per task the same way the CPU sampler does it.
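For context, the CPU path in Spark derives a distinct, reproducible seed for each partition from the single user-provided seed; a rough sketch of that idea (not the exact code in this PR):

import scala.util.Random

// Derive one seed per partition from the user seed, so partitions are not all
// sampled identically while the result stays reproducible for a given seed.
def perPartitionSeeds(seed: Long, numPartitions: Int): Array[Long] = {
  val rng = new Random(seed)
  Array.fill(numPartitions)(rng.nextLong())
}

// Example: each task would sample its batches with seeds(partitionIndex).
val seeds = perPartitionSeeds(seed = 42L, numPartitions = 4)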
Signed-off-by: Chong Gao <res_life@163.com>
build
@revans2 Please see the test results above and help review.
I get very different results, and I think your issue is because of Parquet reading. First I ran with a single task/thread; the numbers are the median of 3 runs, measured in ms.
It is also clear from the op time metrics in the UI for those runs. When I run with 12 tasks/threads I get
The regular GPU being slower is because I have my GPU semaphore set to 4 concurrent tasks, so it cannot fully utilize the CPU when running. If I change the query to include replacement
So at this point the only thing that is lacking here is some documentation to let people know that there is a crazy fast version that is not the same as the Spark version.
The following optimization makes counting a data frame very fast.
When querying a sum, the CPU is a little faster.
Yup, I got ahead of myself and did things too quickly. You are 100% correct. Could you file a follow-on issue to then look at what we can do to possibly speed up sampling on the GPU?
To be clear, the issue appears to be related to doing it without replacement; with replacement is really fast. This indicates that
build
build
Filed the follow-on issue: rapidsai/cudf#9834
Updated the docs, please review again @revans2
Signed-off-by: Chong Gao <res_life@163.com>
This fixes #4096.
Samples data on the GPU via the cuDF JNI to improve performance.
Added "spark.rapids.sql.fast.sample" as a switch to enable it.