[cuda] Improve kernel return value performance when unified memory is available (#965)

* [cuda] Improve kernel return value performance when unified memory is available

* synchronize before fetching results
yuanming-hu authored May 13, 2020
1 parent bcd8560 commit 979ec63
14 changes: 8 additions & 6 deletions taichi/program/program.cpp
@@ -485,14 +485,16 @@ Kernel &Program::get_snode_writer(SNode *snode) {
 uint64 Program::fetch_result_uint64(int i) {
   uint64 ret;
   auto arch = config.arch;
+  synchronize();
   if (arch == Arch::cuda) {
-    // TODO: refactor
-    // We use a `memcpy_device_to_host` call here even if we have unified
-    // memory. This simplifies code. Also note that a unified memory (4KB) page
-    // fault is rather expensive for reading 4-8 bytes.
 #if defined(TI_WITH_CUDA)
-    CUDADriver::get_instance().memcpy_device_to_host(
-        &ret, (uint64 *)result_buffer + i, sizeof(uint64));
+    if (config.use_unified_memory) {
+      // More efficient than a cudaMemcpy call in practice
+      ret = ((uint64 *)result_buffer)[i];
+    } else {
+      CUDADriver::get_instance().memcpy_device_to_host(
+          &ret, (uint64 *)result_buffer + i, sizeof(uint64));
+    }
 #else
     TI_NOT_IMPLEMENTED;
 #endif

