
Add new APIs for GPU memory monitoring (max_memory_allocated, max_memory_reserved, memory_allocated, memory_reserved) #38657

Merged: 16 commits into PaddlePaddle:develop on Mar 30, 2022

Conversation

@From00 (Contributor) commented Jan 1, 2022

PR types

New features

PR changes

APIs

Describe

Add 4 new APIs: paddle.device.cuda.max_memory_allocated, paddle.device.cuda.max_memory_reserved, paddle.device.cuda.memory_allocated and paddle.device.cuda.memory_reserved
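A minimal usage sketch of the four APIs (assumes a CUDA build of Paddle; the string form of the device argument follows the public docs, and the tensor shape here is arbitrary):

import paddle

# Only meaningful on a CUDA build of Paddle.
if paddle.device.is_compiled_with_cuda():
    paddle.set_device("gpu:0")
    x = paddle.rand([1024, 1024])  # allocate some GPU memory

    # Bytes currently / at peak handed out to tensors on gpu:0.
    print(paddle.device.cuda.memory_allocated("gpu:0"))
    print(paddle.device.cuda.max_memory_allocated("gpu:0"))

    # Bytes currently / at peak held by the allocator from the device;
    # reserved >= allocated because the allocator caches freed blocks.
    print(paddle.device.cuda.memory_reserved("gpu:0"))
    print(paddle.device.cuda.max_memory_reserved("gpu:0"))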


CN docs PR: PaddlePaddle/docs#4193

@paddle-bot-old (bot) commented Jan 1, 2022

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@paddle-bot-old (bot):

Sorry to inform you that the CI runs for commit 4d506ea passed more than 7 days ago. To prevent PR conflicts, you need to re-run all CIs manually.

@From00 From00 changed the title Add new API memory_reserved and max_memory_reserved Add new APIs for GPU memory monitoring (memory_reserved, memory_allocated, max_memory_reserved, max_memory_allocated) Jan 17, 2022
@From00 From00 changed the title Add new APIs for GPU memory monitoring (memory_reserved, memory_allocated, max_memory_reserved, max_memory_allocated) Add new APIs for GPU memory monitoring (max_memory_allocated, max_memory_reserved, memory_allocated, memory_reserved) Jan 17, 2022
if (platform::is_gpu_place(place)) {
  int dev_id = place.GetDeviceId();
  int64_t alloc_size =
      STAT_INT_ADD("STAT_gpu" + std::to_string(dev_id) + "_alloc_size",
Contributor:

Can allocating memory through AllocatorFacade here be made equivalent to obtaining the concrete Allocator and then returning Allocator->Allocate(size)? Tensor is planned to stop going through AllocatorFacade later and instead be handed the concrete Allocator directly.

Contributor Author:

This has been discussed separately. The memory-statistics collection logic here cannot be implemented inside a concrete Allocator, so it is not equivalent to PTen's plan of obtaining an Allocator object and then allocating directly; PTen's Alloc interface will likewise need similar data-collection logic after it obtains an Allocator and allocates memory. Some of the changes here do not fit well with the original design of Allocator and PTen. In the short term we will sync up and merge first so that related feature development is not blocked; once the people responsible for the PTen project have time, these issues will be discussed and reworked together. @phlrain @chenwhql @zhiqiu @jim19930609

Contributor:

Per the earlier consensus, the unified exit point of the Allocator allocation logic is https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/memory/allocation/allocator.h#L142. Due to schedule pressure, this merge is approved for now; @From00 will follow up on the remaining issues.
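To make the non-equivalence discussed above concrete, here is a minimal Python illustration (conceptual only; the class shapes and stat keys are assumptions, not Paddle's API): in the original approach the accounting lives on the facade path, so allocating through a concrete Allocator directly would bypass it.

class Allocator:
    """Stand-in for a concrete device allocator."""
    def allocate(self, size: int):
        return bytearray(size)

class AllocatorFacade:
    """Facade path: allocation plus statistics."""
    def __init__(self, allocator: Allocator, stats: dict):
        self.allocator = allocator
        self.stats = stats

    def alloc(self, device_id: int, size: int):
        allocation = self.allocator.allocate(size)
        # The accounting lives on the facade path only, so it is skipped
        # when a caller holds the concrete Allocator and calls
        # allocate(size) itself -- the non-equivalence discussed above.
        key = f"STAT_gpu{device_id}_alloc_size"
        self.stats[key] = self.stats.get(key, 0) + size
        return allocation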

@From00 From00 requested a review from zhiqiu January 20, 2022 06:10
Shixiaowei02 previously approved these changes Jan 20, 2022

@Shixiaowei02 (Contributor) left a comment:

Per the earlier consensus, the unified exit point of the Allocator allocation logic is https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/memory/allocation/allocator.h#L142. This PR currently diverges from that in its implementation. Due to schedule pressure, the merge is approved for now; @From00 will follow up on the remaining issues.

@paddle-bot-old (bot):

Sorry to inform you that the CI runs for commit fb04a61 passed more than 7 days ago. To prevent PR conflicts, you need to re-run all CIs manually.

@ZHUI (Collaborator) commented Mar 22, 2022

Is this PR planned to ship with the Paddle 2.3 release? @From00

void Update(int64_t increment) override {
  ThreadLocalStatType thread_local_stat =
      ThreadDataRegistry<ThreadLocalStatType>::GetInstance()
          .GetCurrentThreadData();
Contributor:

GetMutableCurrentThreadData is designed for read-and-write scenarios.

Contributor Author:

done, thx.
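As background on the exchange above, a minimal Python sketch of per-thread stat accumulation (illustrative only; Paddle's ThreadDataRegistry is C++, and the accessor here mirrors the GetMutableCurrentThreadData pattern the reviewer suggested for read-modify-write updates):

import threading

class ThreadDataRegistry:
    """Simplified stand-in for the C++ ThreadDataRegistry:
    one stat holder per thread."""
    _local = threading.local()

    @classmethod
    def get_mutable_current_thread_data(cls):
        # Returns the current thread's holder for in-place mutation,
        # creating it on first access; a read-only getter that returns
        # a copy could not be written through.
        if not hasattr(cls._local, "data"):
            cls._local.data = {"value": 0}
        return cls._local.data

def update(increment: int) -> None:
    # Read-modify-write on the current thread's data, as in Update().
    ThreadDataRegistry.get_mutable_current_thread_data()["value"] += increment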

@From00 (Contributor Author) commented Mar 26, 2022

> Per the earlier consensus, the unified exit point of the Allocator allocation logic is https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/memory/allocation/allocator.h#L142. This PR currently diverges from that in its implementation. Due to schedule pressure, the merge is approved for now; @From00 will follow up on the remaining issues.

The new design implements GPU memory statistics by wrapping a StatAllocator layer around the allocator. It no longer touches GetAllocator, the exit function of the Allocator allocation logic, so it no longer diverges from the design principles of PHI (formerly PTen) and is compatible with obtaining an Allocator object directly and allocating through it.
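A minimal Python sketch of the wrapper approach described above (conceptual only; the real implementation is C++, and the method names and stat keys here are assumptions):

class Allocator:
    """Base interface; concrete allocators implement allocate()."""
    def allocate(self, size: int):
        raise NotImplementedError

class RawGPUAllocator(Allocator):
    def allocate(self, size: int):
        return bytearray(size)  # stand-in for a real device allocation

class StatAllocator(Allocator):
    """Wraps any Allocator and records allocation statistics, so the
    accounting works no matter how callers obtained the allocator."""
    def __init__(self, underlying: Allocator, device_id: int, stats: dict):
        self.underlying = underlying
        self.device_id = device_id
        self.stats = stats

    def allocate(self, size: int):
        allocation = self.underlying.allocate(size)
        key = f"gpu{self.device_id}_alloc_size"
        self.stats[key] = self.stats.get(key, 0) + size
        peak = f"gpu{self.device_id}_max_alloc_size"
        self.stats[peak] = max(self.stats.get(peak, 0), self.stats[key])
        return allocation

# GetAllocator can now return the wrapped instance; direct callers
# still see a plain Allocator interface.
stats: dict = {}
allocator = StatAllocator(RawGPUAllocator(), device_id=0, stats=stats)
allocator.allocate(1024)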

@From00 (Contributor Author) commented Mar 26, 2022

> Is this PR planned to ship with the Paddle 2.3 release? @From00

Yes. Progress on this PR has been slow due to limited staffing, but we will try to ship it with the Paddle 2.3 release.

@From00 From00 requested a review from phlrain March 27, 2022 09:24
@tianshuo78520a (Contributor) left a comment:

Approved for the Coverage CI build-size check.

@zhiqiu (Contributor) left a comment:

LGTM

@TCChenlong (Contributor) left a comment:

LGTM

@XiaoguangHu01 (Contributor) left a comment:

LGTM for the API changes.

@From00 From00 merged commit afe02e9 into PaddlePaddle:develop Mar 30, 2022
@From00 From00 deleted the add-new-api-memory_reserved branch April 4, 2022 12:29
@luotao1 (Contributor) commented May 25, 2022

Explanation of the build-directory size increase and request for an exemption

  • PR content: implements, under the Paddle main framework's fluid/memory directory, a new set of basic components for monitoring model GPU memory, which collect and aggregate the underlying Allocator's GPU memory allocations in real time, and exposes the related Python APIs to users.
  • Impact: on Coverage CI the build directory grew from 136 GB to 140 GB; in a local build test it grew from 212 GB to 216 GB. The 4 GB increase exceeds the 3 GB cap enforced by CI.
  • Root cause: the Allocator code is extremely low-level and almost every module in Paddle depends on it, so any change or addition there affects the whole build. In local tests, after the feature code was added, the memory directory itself grew by about 0.4 GB (110 MB of which is new unit-test files), the phi, framework and operator directories each grew by about 1 GB, the framework directory by about 0.4 GB, and the imperative directory by about 0.1 GB, for a total build-directory increase of nearly 4 GB. The new size comes mainly from the paddle/fluid/memory/stats.h header, which implements the new memory-statistics components and is used in allocator and gpu_info to track in real time how much GPU memory the allocator hands to Tensors and how much is requested from the GPU device; allocator and gpu_info are in turn depended on by most files in the framework.
  • Other notes
    1. Why is this PR necessary? GPU memory problems affect a wide range of models and many users. Without memory monitoring, diagnosing memory problems during training is difficult and memory-oriented performance analysis and optimization is inconvenient, which badly hurts the user experience; this need has been raised repeatedly in user feedback.
    2. Why not reuse the framework's existing performance-monitoring code instead of introducing a new implementation? We first tried the existing monitor component for memory monitoring, but measurements showed its performance is poor; since training performs memory operations frequently and is very sensitive to the memory module's performance, using the existing code for memory monitoring caused a training-performance drop of more than 10 points on many models in the worst case, so a new high-performance statistics scheme had to be implemented.
    3. Can the implementation be optimized for compile size? We considered not putting so much into heavily-depended-on headers, but the code is hard to implement outside of headers. First, to support multiple kinds of statistics the data structures are class templates, which have to be defined in headers. Second, for runtime performance some mapping logic is implemented as macro functions resolved at compile time, and macros can only be used by other modules if they live in headers.
