
Use cuda virtual memory management and merge blocks #36189

Merged
44 commits merged into PaddlePaddle:develop on Nov 8, 2021
Conversation

Contributor

@wanghuancoder wanghuancoder commented Sep 28, 2021

PR types

New features

PR changes

Others

Describe

A new Allocator (AutoGrowthV2) written on top of the NVIDIA Virtual Memory Management (VMM) mechanism.
With VMM, GPU memory blocks requested from CUDA can be merged (the old AutoGrowth could not merge blocks).

  • The new Allocator is still auto-growing, but compared with the old AutoGrowth it greatly reduces fragmentation:
    • Tested with the speedyspeech model, GPU memory usage dropped from 8710 MiB to 3660 MiB.
  • Allocator interface performance
    • For allocating and freeing GPU memory, the new Allocator (AutoGrowthV2) has the same algorithmic complexity as the old AutoGrowth.
    • However, the new Allocator (AutoGrowthV2) holds fewer FreeBlocks, so lookups are slightly faster. Tested with standalone_executor_test, the maximum FreeBlocks count dropped from 185 to 166.
  • Does using VMM affect kernel compute performance?
    • NVIDIA engineers say it does not.
    • Tested with a Bert model, no noticeable performance drop from VMM was observed.

@zhiqiu zhiqiu closed this Oct 22, 2021
@zhiqiu zhiqiu reopened this Oct 22, 2021
Comment on lines 280 to 285
auto result =
    paddle::platform::dynload::cuDeviceGet(&device, p.GetDeviceId());
PADDLE_ENFORCE_EQ(
    result, CUDA_SUCCESS,
    platform::errors::Fatal("Call CUDA API cuDeviceGet failed, return %d.",
                            result));
Contributor
PADDLE_ENFORCE_CUDA_SUCCESS ?

Contributor Author

Thanks, fixed.

result, CUDA_SUCCESS,
platform::errors::Fatal(
    "Call CUDA API cuDeviceGetAttribute failed, return %d.", result));
} catch (...) {
Contributor

In which case may it raise an exception?

Contributor Author

@wanghuancoder wanghuancoder Oct 22, 2021

Contributor

I see, please add comments on that.

@@ -131,6 +131,20 @@ gpuError_t RecordedCudaMalloc(void **ptr, size_t size, int dev_id);
//! CudaFree with recorded info
void RecordedCudaFree(void *p, size_t size, int dev_id);

#ifdef PADDLE_WITH_CUDA
Contributor
It seems this is not needed.

Contributor Author

I was not sure whether CUDA_VERSION might be left undefined or garbage when building without CUDA, so I was being cautious.

Contributor

There is #ifdef PADDLE_WITH_CUDA at the beginning, so I think maybe it is duplicated.

Comment on lines 50 to 61
__macro(cuInit); \
__macro(cuDriverGetVersion); \
__macro(cuGetErrorString); \
__macro(cuModuleLoadData); \
__macro(cuModuleGetFunction); \
__macro(cuModuleUnload); \
__macro(cuOccupancyMaxActiveBlocksPerMultiprocessor); \
__macro(cuLaunchKernel); \
__macro(cuCtxCreate); \
__macro(cuCtxGetCurrent); \
__macro(cuDeviceGetCount); \
__macro(cuDevicePrimaryCtxGetState); \
Contributor

These duplicate the APIs in the #else branch; maybe the #else is not needed.

Contributor Author

Some of these APIs only exist in CUDA 10.2 and above, so they are guarded with a macro.

Contributor

I think you can always define the APIs that exist both in versions < 10.2 and >= 10.2 without the macro.

@@ -641,6 +646,30 @@ class RecordedCudaMallocHelper {

uint64_t LimitSize() const { return limit_size_; }

#ifdef PADDLE_WITH_CUDA
#if CUDA_VERSION >= 10020
CUresult cuMemCreate(CUmemGenericAllocationHandle *handle, size_t size,
Contributor

Better to name the member function like the others, for example "CreateMem" (starting with an upper-case letter).

Contributor Author

Done, thanks!

paddle::platform::CUDADeviceGuard guard(place.device);
PADDLE_ENFORCE_CUDA_SUCCESS(cudaMemGetInfo(&actual_avail, &actual_total));

virtual_mem_size_ = (actual_total + granularity_ - 1) & ~(granularity_ - 1);
Contributor

Why do this?

Contributor Author

This reserves a virtual address space whose size exactly equals the GPU's total memory, so the one-time virtual address reservation is guaranteed to be large enough.

Contributor Author

@wanghuancoder wanghuancoder Nov 5, 2021

virtual_mem_size_ = (actual_total + granularity_ - 1) & ~(granularity_ - 1);
Changed to use a function call instead.

Comment on lines 96 to 101
block--;
auto pre = block;
block++;
block++;
auto next = block;
block--;
Contributor

I cannot easily understand it...

Contributor Author

Isn't it clear enough with pre and next?

Contributor Author

Changed to use std::next and std::prev.

Contributor

@zhiqiu zhiqiu left a comment

LGTM

@phlrain phlrain self-requested a review November 8, 2021 11:07
@wanghuancoder wanghuancoder merged commit a1ec1d5 into PaddlePaddle:develop Nov 8, 2021