
[WIP] For printing more logs 6148 CI failures #21802

Closed


lidanqing-intel (Contributor) commented Dec 18, 2019

To print more logs related to #21594, could this PR be run on the 6148 CI machine so that we can see the MKLDNN_VERBOSE logs and the memory status?
@wojtuss Please check whether there is anything else to add.

luotao1 (Contributor) commented Dec 18, 2019

@tianshuo78520a Could you help run this PR on PR_CI_Manylinux_Coverage, both with WITH_GPU=ON and WITH_GPU=OFF?

lidanqing-intel changed the title from "[WIP] For printing more logs in issue 21594" to "[WIP] For printing more logs 6148 CI failures" on Dec 18, 2019
lidanqing-intel (Contributor, Author):

@tianshuo78520a Can you please take the current commit and run it on b6148? Thanks a lot.

luotao1 (Contributor) commented Dec 19, 2019

test=develop
lidanqing-intel (Contributor, Author) commented Dec 20, 2019

Peak memory measured with /usr/bin/time (max resident set size, in KB):

qat_performance (ResNet50) | 14482424 KB (resolved)
int8_mobilenet_ssd | 12459172 KB

This is too big and the test always fails because it loads two models (FP32 and INT8) and compares their performance. I replaced ResNet50 with MobileNetV1.

Command used to get the numbers above:
    sudo /usr/bin/time ctest -R test_qat_int8_mobilenetv1 -V

The ResNet50 QAT2 test runs out of memory, and VGG16 and VGG19 are "child killed". These models take more memory. The mechanism is that when a process fails on OOM, the kernel picks one of the running processes and kills it ("Kill process or sacrifice child").

  1. Could you try to set the following and then run CI (see the sketch after this list for making it permanent with sysctl):
    echo 2 > /proc/sys/vm/overcommit_memory
    echo 80 > /proc/sys/vm/overcommit_ratio
    The reason is that if overcommit_memory is 1, memory overcommitting is allowed, which means every malloc() is expected to succeed.

  2. There are other options, like enabling swapping and increasing the swap/page file size, as in the links below. But I will confirm with Jacek first.
    https://plumbr.io/blog/memory-leaks/out-of-memory-kill-process-or-sacrifice-child
    https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/performance_tuning_guide/s-memory-captun

  3. Also, from the table, the Python tests take more memory. If none of the system-setting changes work, I will try to improve the Python code, but since, for example, the VGG19 model itself is 500 MB, I am not sure to what extent the code can be optimized.
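A minimal sketch for point 1 (assumes root access on the CI host; the values simply mirror the ones above):

    # apply the overcommit settings immediately
    sysctl -w vm.overcommit_memory=2
    sysctl -w vm.overcommit_ratio=80
    # persist them across reboots
    echo "vm.overcommit_memory = 2" >> /etc/sysctl.conf
    echo "vm.overcommit_ratio = 80" >> /etc/sysctl.conf
    sysctl -p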

lidanqing-intel (Contributor, Author):

I am investigating:

  1. the cgroups settings in docker
  2. the large difference between total-vm and anon-rss
  3. Python memory profilers: https://stackoverflow.com/questions/110259/which-python-memory-profiler-is-recommended
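A minimal sketch of these checks (assumes a cgroup v1 hierarchy inside the CI container and that memory_profiler is installed; the test script name is only illustrative):

    # memory limit, current usage and peak usage as seen inside the container (cgroup v1)
    cat /sys/fs/cgroup/memory/memory.limit_in_bytes
    cat /sys/fs/cgroup/memory/memory.usage_in_bytes
    cat /sys/fs/cgroup/memory/memory.max_usage_in_bytes

    # record the memory usage of one python test over time (script name is illustrative)
    pip install memory_profiler
    mprof run python test_qat_int8_resnet50_mkldnn.py
    mprof plot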

lidanqing-intel (Contributor, Author) commented Dec 22, 2019

@luotao1 After commit bee812d, the PR_CI_Manylinux_Coverage_GPU build http://ci.paddlepaddle.org/viewLog.html?buildId=251980&buildTypeId=Paddle_PaddleManylinux_PrCiManylinuxCoverage shows:

qat_performance_test passed. Among the INT8-related tests, only these failed:
test_slim_int8_mobilenet_mkldnn failed
test_qat_int8_resnet50_mkldnn failed
The last commit should make test_qat_int8_resnet50_mkldnn pass. I will clean up the code tomorrow.

Is anyone fixing the test_warpctc_op failures? Because when an earlier test fails, the test running in parallel that uses the most memory gets killed. Since the INT8 tests load models, they generally use more memory than the tests running alongside them, so when something gets killed it is them first. How much memory is the docker container limited to on this machine? I tried adding nvidia-docker stats --no-stream and nvidia-smi to the cicheck script, but neither shows up in the log. Can anyone help investigate?

luotao1 (Contributor) commented Dec 23, 2019

> Because when an earlier test fails, the test running in parallel that uses the most memory gets killed.

How did you reach this conclusion? It implies that whenever a unit test fails, the unit tests running in parallel with it will also be killed, so at least two unit tests should fail at a time, yet we often see only a single unit test fail.

lidanqing-intel (Contributor, Author) commented Dec 23, 2019

> > Because when an earlier test fails, the test running in parallel that uses the most memory gets killed.
>
> How did you reach this conclusion? It implies that whenever a unit test fails, the unit tests running in parallel with it will also be killed, so at least two unit tests should fail at a time, yet we often see only a single unit test fail.

The printing order is not necessarily the same as the execution order, is it? And printing is delayed, right? The OOM killer picks the process with the highest oom_score; the memory-management mechanism is described like this in the kernel source comment:
    /*
     * If any of p's children has a different mm and is eligible for kill,
     * the one with the highest oom_badness() score is sacrificed for its
     * parent. This attempts to lose the minimal amount of work done while
     * still freeing memory.
     */
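A minimal sketch for checking which process the OOM killer would sacrifice first (standard procfs files; to be run on the CI machine while the tests are executing):

    # print pid, oom_score and command name for every python process;
    # the process with the highest oom_score is killed first
    for pid in $(pgrep python); do
        echo "$pid $(cat /proc/$pid/oom_score) $(cat /proc/$pid/comm)"
    done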

Is the memory that nvidia-docker gets on the 6148 machine too small? The memory shown in the 6148 log (below) is only about 50 GB, while a single test can take 15 GB (I changed that 15 GB qat_performance test in #21895, so it no longer fails, but there are still tests taking 12 GB). Locally I measured that a larger batch_size or a larger model loads more data at once and therefore uses more memory, but the batch_size cannot easily be reduced further because the accuracy check would fail.

[12:51:47] :	 [Step 1/1] [2799340.836399] Memory cgroup out of memory: Kill process 84722 (python2.7) score 0 or sacrifice child
[12:51:47] :	 [Step 1/1] [2799340.836402] Killed process 70104 (python2.7) total-vm:21618072kB, anon-rss:13039280kB, file-rss:0kB
[Step 1/1] [3211594.813611] memory: usage 53545580kB, limit 9007199254740991kB, failcnt 0
[Step 1/1] [3211594.813612] memory+swap: usage 53545580kB, limit 9007199254740991kB, failcnt 0

For interpreting these cgroup numbers, see https://www.linuxjournal.com/content/everything-you-need-know-about-linux-containers-part-i-linux-control-groups-and-process

Could you check tomorrow what the docker MEM LIMIT is on the CI machine that has been running smoothly (5117?), using nvidia-docker stats --no-stream? On our machine in Poland it is 376 GB:

[lidanqin@aipg-igk-skx-01 /]$ nvidia-docker stats --no-stream
CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
8bc048ae0100        awesome_babbage     0.42%               2.855MiB / 376.4GiB   0.00%               8.81GB / 175MB      807kB / 7.7GB       3
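If the limit on 6148 really is too small, a minimal sketch of how the CI container could be started with an explicit limit (the image name and values are illustrative; nvidia-docker forwards these flags to docker run):

    # cap the container at 100 GB; --memory-swap equal to --memory disables extra swap
    nvidia-docker run --memory=100g --memory-swap=100g -it <paddle-ci-image> /bin/bash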

luotao1 (Contributor) commented Dec 24, 2019

@tianshuo78520a Could you use nvidia-docker stats --no-stream to check the memory limits on the 5117 and 6148 machines?

tianshuo78520a (Contributor):

> @tianshuo78520a Could you use nvidia-docker stats --no-stream to check the memory limits on the 5117 and 6148 machines?

5117:
CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
77c671752386 1568.12% 17.21GiB / 98.18GiB 17.52% 1.03GB / 44.1MB 49.3GB / 189GB 0
6148:
CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
cdd8b1eebdfc 0.00% 3.113MiB / 502.2GiB 0.00% 0B / 0B 31.8GB / 93GB 0

silingtong123 and others added 22 commits January 6, 2020 21:55
enhanced ops: conv2d, conv3d
elementwise_pow: change to a reasonable shape
* add erf op and python interface.

* add fp16 support for erf op.

* add unitests for erf op and its python interface.
* fix grad clip, clip op belongs to Backward op when running in Parameter Server mode.
…22090)

* Fix the global_step & continuous applying error in EMA

test=develop

* Fix for step 0 & add unit test, test=develop
…le#21577)

* fix Variable's gradient api in framework.py, test=develop

* remove namescope, test=develop
* add special way to add distribute vars, Update Pyramid hash op
* set esp as 1e-6 to solve elu unitest fail,test=develop
test=develop
lidanqing-intel (Contributor, Author):

The memory problem seems to be solved. Now only timeout problems remain.

lidanqing-intel (Contributor, Author):

Closing because #22147 is checking the logs.

lidanqing-intel deleted the log-ci-failures branch on June 16, 2021.