[WIP] For printing more logs 6148 CI failures #21802
529ede7 to 6b7f15e
test=develop
@tianshuo78520a Could you help run this PR on PR_CI_Manylinux_Coverage with both WITH_GPU=ON and WITH_GPU=OFF?
6b7f15e to bc6372e
test=develop
b911979 to ddc6b88
@tianshuo78520a Could you please take the current commit and run it on b6148? Thanks a lot.
PR_CI_Manylinux_Coverage_CPU: http://ci.paddlepaddle.org/viewLog.html?buildId=249173&buildTypeId=Paddle_PaddleManylinux_PrCiManylinuxCoverageCpu&tab=buildResultsDiv
PR_CI_Manylinux_Coverage_GPU: http://ci.paddlepaddle.org/viewLog.html?tab=buildLog&buildTypeId=Paddle_PaddleManylinux_PrCiManylinuxCoverage&buildId=249175
If you push a new commit, you can rerun the CI jobs above.
test=develop
55f80a8 to d97e28f
test=develop
6d2a138 to 28a1e21
test=develop
This test is too big and always fails, because it loads two models (fp32 and int8) and compares their performance. I replaced the model with mobilenetv1.
Command to get the above table:
resnet50 QAT2 runs out of memory; vgg16 and vgg19 are "child killed". These models take more memory. The mechanism is that if some process fails with OOM, a random running process is chosen and killed ("Kill process or sacrifice child").
If overcommit_memory is 1, the kernel allows memory overcommit, which means every malloc() should succeed.
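To make the overcommit_memory remark concrete, here is a minimal Python sketch that reads the setting and describes it. The mode descriptions follow the Linux kernel's documented values for `vm.overcommit_memory`; the function names are hypothetical and were not used anywhere in this PR.

```python
from pathlib import Path

# Meanings of vm.overcommit_memory as documented for the Linux kernel.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (only obvious overcommits are refused)",
    1: "always overcommit (allocations are never refused up front)",
    2: "strict accounting (commit limit enforced, no overcommit)",
}


def describe_overcommit(mode: int) -> str:
    """Return a human-readable description of an overcommit_memory value."""
    return OVERCOMMIT_MODES.get(mode, f"unknown mode {mode}")


def read_overcommit_mode(path: str = "/proc/sys/vm/overcommit_memory"):
    """Read the current mode, or return None where the file is unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    mode = read_overcommit_mode()
    print("unavailable" if mode is None else describe_overcommit(mode))
```

With mode 1, allocations succeed up front and the shortage only shows up later, when pages are actually touched and the OOM killer steps in.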
I am investigating in:
c6594cf to bee812d
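@luotao1 After commit bee812d, in PR_CI_Manylinux_Coverage_GPU http://ci.paddlepaddle.org/viewLog.html?buildId=251980&buildTypeId=Paddle_PaddleManylinux_PrCiManylinuxCoverage&tab=buildLog qat_performance_test passed. Among the int8-related tests, only ... Is anyone fixing failures such as test_warpctc_op? If an earlier test fails, the test that uses the most memory among those running in parallel gets killed. Because the int8 tests load models, they generally use more memory than the tests running alongside them, so if anything is killed, they are killed first. What is the docker memory limit on this machine? I tried adding ... to cicheck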
83f70b4 to bee812d
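How was this conclusion reached? It implies that if one unit test fails, the unit tests running in parallel with it will certainly fail too, that is, whenever a unit test fails there should be at least two failures. But we often see only one unit test fail.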
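The print order is not necessarily the same as the execution order, is it? Printing can be delayed. The process with the highest oom_score is chosen to be killed; the memory management mechanism works like this: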
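As a rough illustration of the oom_score-based selection mentioned above, the following Python sketch ranks processes by `/proc/<pid>/oom_score`. It only approximates which process the kernel would pick at a given moment; the function names are hypothetical, and this is not the material originally attached to the comment.

```python
from pathlib import Path


def read_oom_scores(proc_root: str = "/proc") -> dict:
    """Map pid -> oom_score for every process readable under proc_root."""
    scores = {}
    root = Path(proc_root)
    if not root.is_dir():
        return scores
    for entry in root.iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            scores[int(entry.name)] = int((entry / "oom_score").read_text().strip())
        except (OSError, ValueError):
            continue  # process exited meanwhile or is not readable
    return scores


def likely_victim(scores: dict):
    """Return the pid with the highest oom_score, or None if there are none."""
    if not scores:
        return None
    return max(scores, key=scores.get)


if __name__ == "__main__":
    scores = read_oom_scores()
    print("highest oom_score pid:", likely_victim(scores))
```

Running this while the unit tests execute in parallel would show whether the model-loading int8 tests really carry the highest scores.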
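Is the memory given to nvidia-docker on 6148 too small? The memory shown in the 6148 log is as follows: only about 50G, while some single tests take 15G. (I changed the 15G one, qat_performance, in #21895, and it no longer fails, but there are still tests that take 12G.) Local measurements show that a large batch_size or a large model loads a lot at once, so memory usage is high, but batch_size cannot be lowered any further or the accuracy check will fail.
Could you check tomorrow what the docker MEM LIMIT is for the CI that has been running smoothly (5117?)? Use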
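For reference, one way to check a container's memory limit from inside the container is to read the cgroup files. This Python sketch is only an assumption about how such a check could look, not the command that was requested in this thread; the function names are hypothetical, and the two paths are the usual cgroup v1 and cgroup v2 locations.

```python
from pathlib import Path

# Usual locations of the memory limit: cgroup v1 first, then cgroup v2.
LIMIT_FILES = (
    "/sys/fs/cgroup/memory/memory.limit_in_bytes",
    "/sys/fs/cgroup/memory.max",
)


def parse_limit(text: str):
    """Parse a cgroup memory limit in bytes; 'max' (cgroup v2) means no limit."""
    value = text.strip()
    if value == "max":
        return None
    return int(value)


def container_memory_limit(paths=LIMIT_FILES):
    """Return the first readable limit in bytes, or None if unknown or unlimited."""
    for path in paths:
        try:
            return parse_limit(Path(path).read_text())
        except (OSError, ValueError):
            continue
    return None


if __name__ == "__main__":
    limit = container_memory_limit()
    if limit is None:
        print("no limit found")
    else:
        # On cgroup v1 an unlimited container reports a very large number.
        print(f"memory limit: {limit / 1024**3:.1f} GiB")
```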
@tianshuo78520a Could you use
5117:
test=develop
enhanced ops: conv2d, conv3d elementwise_pow: change to a reasonable shape
* add erf op and python interface. * add fp16 support for erf op. * add unitests for erf op and its python interface.
* fix grad clip, clip op belongs to Backward op when running in Parameter Server mode.
…22090) * Fix the global_step & continuous applying error in EMA test=develop * Fix for step 0 & add unit test, test=develop
…le#21577) * fix Variable's gradient api in framework.py, test=develop * remove namescope, test=develop
* add special way to add distribute vars, Update Pyramid hash op
* set esp as 1e-6 to solve elu unitest fail,test=develop
test=develop
test=develop
test=develop
test=develop
test=develop
test=develop
…Paddle into log-ci-failures
The memory problem seems to be solved. Now only timeout problems remain.
test=develop
Closing because #22147 is checking the log.
To print more logs related to #21594, could this PR be run on the 6148 CI machine, so that we can see the MKLDNN_VERBOSE logs and the memory status?
@wojtuss Please check if anything else to add.
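As a sketch of how the MKLDNN_VERBOSE logs mentioned in the description could be collected, the Python snippet below runs a command with the variable set in its environment. MKLDNN_VERBOSE is the MKL-DNN environment variable that turns on primitive execution logging; the helper names and the stand-in child command are hypothetical and are not taken from this PR's changes.

```python
import os
import subprocess
import sys


def verbose_env(level: int = 1, base=None) -> dict:
    """Copy an environment and set MKLDNN_VERBOSE to the given level."""
    env = dict(os.environ if base is None else base)
    env["MKLDNN_VERBOSE"] = str(level)
    return env


def run_with_verbose(cmd, level: int = 1) -> subprocess.CompletedProcess:
    """Run cmd with MKLDNN_VERBOSE set, capturing stdout and stderr as text."""
    return subprocess.run(cmd, env=verbose_env(level), capture_output=True, text=True)


if __name__ == "__main__":
    # Stand-in command: a child Python process that only echoes the variable.
    child = [sys.executable, "-c", "import os; print(os.environ['MKLDNN_VERBOSE'])"]
    print(run_with_verbose(child).stdout.strip())
```

In a real run the stand-in command would be replaced by the failing unit test, and the captured output kept alongside the memory status.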