Optimize sample perf #4159
Conversation
Signed-off-by: Chong Gao <res_life@163.com>
Note: depends on CUDF issue rapidsai/cudf#9728.
build
Build failed because the CUDF PR is not merged yet.
Marking as draft because the CUDF JNI dependency is not in yet.
// copy row indexes to host buffer
var idx = 0
while (idx < rows.length) {
  hostBuffer.setInt(idx * intBytes, rows(idx))
  idx += 1
}
This is mostly for my own curiosity. You could have also done this with
withResource(cudf.ColumnVector.fromInts(rows: _*)) { gatherCv =>
withResource(GpuColumnVector.from(cb)) { table =>
// GPU gather
withResource(table.gather(gatherCv)) { gatheredTable =>
GpuColumnVector.from(gatheredTable, colTypes)
}
}
}
Did you try this? If so, was your current way more performant? I realize it might involve an extra memory copy because the ArrayBuffer needs to be transformed into an Array so it can be passed to fromInts. So if you didn't, I don't think it is worth spending too much time on it.
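To make the copy in question concrete, here is a toy sketch (illustrative only, not code from this PR; the buffer contents are invented):

import scala.collection.mutable.ArrayBuffer

// The selected row indexes accumulate in an ArrayBuffer on the host.
val rows = ArrayBuffer(0, 2, 5, 7)
// cudf.ColumnVector.fromInts takes varargs, so the buffer contents get
// materialized as an Array[Int] before the call, which is the extra host copy.
val rowArray: Array[Int] = rows.toArray
// withResource(ai.rapids.cudf.ColumnVector.fromInts(rowArray: _*)) { gatherCv => ... }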
Updated; the code is neater and more efficient.
    preservesPartitioning = true,
    seed)
} else {
  val useGpuToSample = new RapidsConf(conf).isFastSampleEnabled
nit: I think it would be cleaner if we had a GpuFastSampleExec and a GpuSampleExec; then we can select which one to use when we replace it. Part of this is because creating a RapidsConf is not cheap.
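A minimal sketch of that split, using illustrative names rather than the actual spark-rapids classes: the config is consulted once when the plan is replaced, so the per-partition code never has to build a RapidsConf.

// Illustrative only: two dedicated exec nodes, picked once at plan-replacement time.
sealed trait SampleLikeExec
case class GpuSampleExec(fraction: Double, withReplacement: Boolean, seed: Long)
  extends SampleLikeExec
case class GpuFastSampleExec(fraction: Double, withReplacement: Boolean, seed: Long)
  extends SampleLikeExec

def replaceSampleExec(
    fraction: Double,
    withReplacement: Boolean,
    seed: Long,
    fastSampleEnabled: Boolean): SampleLikeExec = {
  // The conf lookup happens exactly once, here, not inside the RDD/sampling code.
  if (fastSampleEnabled) {
    GpuFastSampleExec(fraction, withReplacement, seed)
  } else {
    GpuSampleExec(fraction, withReplacement, seed)
  }
}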
Yes, done.
  c
} else {
  withResource(GpuColumnVector.from(cb)) { table =>
    withResource(table.sample(numSampleRows, withReplacement, seed)) { sampled =>
nit: Because we use the same seed every time, I think all of the batches will be sampled the same way. I am not sure mathematically how that all works out. Could we at least set the seed to seed + index so each task does it slightly differently?
Updated; the seed is now derived per task the same way the CPU sampler does it.
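For context, the CPU path in Spark derives a distinct, reproducible seed for each partition from the single user-provided seed; a rough sketch of that idea (not the exact code in this PR):

import scala.util.Random

// Derive one seed per partition from the user seed, so partitions are not all
// sampled identically while the result stays reproducible for a given seed.
def perPartitionSeeds(seed: Long, numPartitions: Int): Array[Long] = {
  val rng = new Random(seed)
  Array.fill(numPartitions)(rng.nextLong())
}

// Example: each task would sample its batches with seeds(partitionIndex).
val seeds = perPartitionSeeds(seed = 42L, numPartitions = 4)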
Signed-off-by: Chong Gao <res_life@163.com>
build
@revans2 Please see the test results above and help review.
I get very different results, and I think your issue is because of Parquet reading. First I ran with a single task/thread; the numbers are the median of 3 runs, measured in ms.
It is also clear from the op time metrics in the UI for those runs. When I run with 12 tasks/threads I get
The regular GPU being slower is because I have my GPU semaphore set to 4 concurrent tasks, so it cannot fully utilize the CPU when running. If I change the query to include replacement
So at this point the only thing that is lacking here is some documentation to let people know that there is a crazy fast version that is not the same as the Spark version.
The following optimization makes counting a data frame very fast.
When querying a sum, the CPU is a little faster.
Yup, I got ahead of myself and did things too quickly. You are 100% correct. Could you file a follow-on issue to then look at what we can do to possibly speed up sampling on the GPU?
To be clear, the issue appears to be related to doing it without replacement; with replacement is really fast. This indicates that
build
build
Filed the follow-on issue: rapidsai/cudf#9834
Updated the docs, please review again @revans2
Signed-off-by: Chong Gao <res_life@163.com>
This fixes #4096.
Samples data on the GPU via the cuDF JNI to improve performance.
Added "spark.rapids.sql.fast.sample" as a switch to enable it.