
[WIP] For printing more logs 6148 CI failures #21802

Closed


lidanqing-intel (Contributor) commented Dec 18, 2019

To print more logs related to #21594, could this PR be run on the 6148 CI machine so that we can see the MKLDNN_VERBOSE logs and the memory status?
@wojtuss Please check whether there is anything else to add.

luotao1 (Contributor) commented Dec 18, 2019

@tianshuo78520a Could you help run this PR on PR_CI_Manylinux_Coverage, both with WITH_GPU=ON and WITH_GPU=OFF?

lidanqing-intel changed the title from "[WIP] For printing more logs in issue 21594" to "[WIP] For printing more logs 6148 CI failures" on Dec 18, 2019
lidanqing-intel (Contributor, Author):

@tianshuo78520a Can you please take the current commit and run it on b6148? Thanks a lot.

luotao1 (Contributor) commented Dec 19, 2019

test=develop
lidanqing-intel (Contributor, Author) commented Dec 20, 2019

Peak memory measured with /usr/bin/time (max resident set size, in KB):

qat_performance (ResNet50) | 14482424 KB (resolved)
int8_mobilenet_ssd | 12459172 KB

This is too big and the test always fails because it loads two models (FP32 and INT8) and compares their performance. I replaced ResNet50 with MobileNetV1.

Command used to get the numbers above:
    sudo /usr/bin/time ctest -R test_qat_int8_mobilenetv1 -V

The ResNet50 QAT2 test runs out of memory, and VGG16 and VGG19 are "child killed". These models take more memory. The mechanism is that when a process fails on OOM, the kernel picks one of the running processes and kills it ("Kill process or sacrifice child").

  1. Could you try to set the following and then run CI (see the sketch after this list for making it permanent with sysctl):
    echo 2 > /proc/sys/vm/overcommit_memory
    echo 80 > /proc/sys/vm/overcommit_ratio
    The reason is that if overcommit_memory is 1, memory overcommitting is allowed, which means every malloc() is expected to succeed.

  2. There are other options, like enabling swapping and increasing the swap/page file size, as in the links below. But I will confirm with Jacek first.
    https://plumbr.io/blog/memory-leaks/out-of-memory-kill-process-or-sacrifice-child
    https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/performance_tuning_guide/s-memory-captun

  3. Also, from the table, the Python tests take more memory. If none of the system-setting changes work, I will try to improve the Python code, but since, for example, the VGG19 model itself is 500 MB, I am not sure to what extent the code can be optimized.
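A minimal sketch for point 1 (assumes root access on the CI host; the values simply mirror the ones above):

    # apply the overcommit settings immediately
    sysctl -w vm.overcommit_memory=2
    sysctl -w vm.overcommit_ratio=80
    # persist them across reboots
    echo "vm.overcommit_memory = 2" >> /etc/sysctl.conf
    echo "vm.overcommit_ratio = 80" >> /etc/sysctl.conf
    sysctl -p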

lidanqing-intel (Contributor, Author):

I am investigating:

  1. the cgroups settings in docker
  2. the large difference between total-vm and anon-rss
  3. Python memory profilers: https://stackoverflow.com/questions/110259/which-python-memory-profiler-is-recommended
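A minimal sketch of these checks (assumes a cgroup v1 hierarchy inside the CI container and that memory_profiler is installed; the test script name is only illustrative):

    # memory limit, current usage and peak usage as seen inside the container (cgroup v1)
    cat /sys/fs/cgroup/memory/memory.limit_in_bytes
    cat /sys/fs/cgroup/memory/memory.usage_in_bytes
    cat /sys/fs/cgroup/memory/memory.max_usage_in_bytes

    # record the memory usage of one python test over time (script name is illustrative)
    pip install memory_profiler
    mprof run python test_qat_int8_resnet50_mkldnn.py
    mprof plot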

lidanqing-intel (Contributor, Author) commented Dec 22, 2019

@luotao1 After commit bee812d, the PR_CI_Manylinux_Coverage_GPU build http://ci.paddlepaddle.org/viewLog.html?buildId=251980&buildTypeId=Paddle_PaddleManylinux_PrCiManylinuxCoverage shows:

qat_performance_test passed. Among the INT8-related tests, only these failed:
test_slim_int8_mobilenet_mkldnn failed
test_qat_int8_resnet50_mkldnn failed
The last commit should make test_qat_int8_resnet50_mkldnn pass. I will clean up the code tomorrow.

Is anyone fixing the test_warpctc_op failures? Because when an earlier test fails, the test running in parallel that uses the most memory gets killed. Since the INT8 tests load models, they generally use more memory than the tests running alongside them, so when something gets killed it is them first. How much memory is the docker container limited to on this machine? I tried adding nvidia-docker stats --no-stream and nvidia-smi to the cicheck script, but neither shows up in the log. Can anyone help investigate?

luotao1 (Contributor) commented Dec 23, 2019

> Because when an earlier test fails, the test running in parallel that uses the most memory gets killed.

How did you reach this conclusion? It implies that whenever a unit test fails, the unit tests running in parallel with it will also be killed, so at least two unit tests should fail at a time, yet we often see only a single unit test fail.

lidanqing-intel (Contributor, Author) commented Dec 23, 2019

> > Because when an earlier test fails, the test running in parallel that uses the most memory gets killed.
>
> How did you reach this conclusion? It implies that whenever a unit test fails, the unit tests running in parallel with it will also be killed, so at least two unit tests should fail at a time, yet we often see only a single unit test fail.

The printing order is not necessarily the same as the execution order, is it? And printing is delayed, right? The OOM killer picks the process with the highest oom_score; the memory-management mechanism is described like this in the kernel source comment:
    /*
     * If any of p's children has a different mm and is eligible for kill,
     * the one with the highest oom_badness() score is sacrificed for its
     * parent. This attempts to lose the minimal amount of work done while
     * still freeing memory.
     */
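A minimal sketch for checking which process the OOM killer would sacrifice first (standard procfs files; to be run on the CI machine while the tests are executing):

    # print pid, oom_score and command name for every python process;
    # the process with the highest oom_score is killed first
    for pid in $(pgrep python); do
        echo "$pid $(cat /proc/$pid/oom_score) $(cat /proc/$pid/comm)"
    done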

Is the memory that nvidia-docker gets on the 6148 machine too small? The memory shown in the 6148 log (below) is only about 50 GB, while a single test can take 15 GB (I changed that 15 GB qat_performance test in #21895, so it no longer fails, but there are still tests taking 12 GB). Locally I measured that a larger batch_size or a larger model loads more data at once and therefore uses more memory, but the batch_size cannot easily be reduced further because the accuracy check would fail.

[12:51:47] :	 [Step 1/1] [2799340.836399] Memory cgroup out of memory: Kill process 84722 (python2.7) score 0 or sacrifice child
[12:51:47] :	 [Step 1/1] [2799340.836402] Killed process 70104 (python2.7) total-vm:21618072kB, anon-rss:13039280kB, file-rss:0kB
[Step 1/1] [3211594.813611] memory: usage 53545580kB, limit 9007199254740991kB, failcnt 0
[Step 1/1] [3211594.813612] memory+swap: usage 53545580kB, limit 9007199254740991kB, failcnt 0

For interpreting these cgroup numbers, see https://www.linuxjournal.com/content/everything-you-need-know-about-linux-containers-part-i-linux-control-groups-and-process

Could you check tomorrow what the docker MEM LIMIT is on the CI machine that has been running smoothly (5117?), using nvidia-docker stats --no-stream? On our machine in Poland it is 376 GB:

[lidanqin@aipg-igk-skx-01 /]$ nvidia-docker stats --no-stream
CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
8bc048ae0100        awesome_babbage     0.42%               2.855MiB / 376.4GiB   0.00%               8.81GB / 175MB      807kB / 7.7GB       3
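If the limit on 6148 really is too small, a minimal sketch of how the CI container could be started with an explicit limit (the image name and values are illustrative; nvidia-docker forwards these flags to docker run):

    # cap the container at 100 GB; --memory-swap equal to --memory disables extra swap
    nvidia-docker run --memory=100g --memory-swap=100g -it <paddle-ci-image> /bin/bash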

luotao1 (Contributor) commented Dec 24, 2019

@tianshuo78520a Could you use nvidia-docker stats --no-stream to check the memory limits on the 5117 and 6148 machines?

tianshuo78520a (Contributor):

> @tianshuo78520a Could you use nvidia-docker stats --no-stream to check the memory limits on the 5117 and 6148 machines?

5117:
CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
77c671752386 1568.12% 17.21GiB / 98.18GiB 17.52% 1.03GB / 44.1MB 49.3GB / 189GB 0
6148:
CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
cdd8b1eebdfc 0.00% 3.113MiB / 502.2GiB 0.00% 0B / 0B 31.8GB / 93GB 0

silingtong123 and others added 22 commits January 6, 2020 21:55
enhanced ops: conv2d, conv3d
elementwise_pow: change to a reasonable shape
* add erf op and python interface.

* add fp16 support for erf op.

* add unitests for erf op and its python interface.
* fix grad clip, clip op belongs to Backward op when running in Parameter Server mode.
…22090)

* Fix the global_step & continuous applying error in EMA

test=develop

* Fix for step 0 & add unit test, test=develop
…le#21577)

* fix Variable's gradient api in framework.py, test=develop

* remove namescope, test=develop
* add special way to add distribute vars, Update Pyramid hash op
* set esp as 1e-6 to solve elu unitest fail,test=develop
test=develop
lidanqing-intel (Contributor, Author):

The memory problem seems to be solved. Now only timeout problems remain.

lidanqing-intel (Contributor, Author):

Closing because #22147 is checking the logs.

lidanqing-intel deleted the log-ci-failures branch on June 16, 2021.