
int8 unit-test fails on 6148 machine #21594

Closed

luotao1 opened this issue Dec 6, 2019 · 13 comments

luotao1 (Contributor) commented Dec 6, 2019

The PR_CI_Coverage job runs on a 5117 machine, and we added a 6148 machine for nightly jobs.

PR_CI_Manylinux_Coverage

This job uses cmake .. -DWITH_GPU=ON.

http://ci.paddlepaddle.org/viewLog.html?buildId=238851&buildTypeId=Paddle_PaddleManylinux_PrCiManylinuxCoverage

  • test_analyzer_int8_vgg16 (OTHER_FAULT)
[18:14:53]	[Step 1/1] I1205 18:14:37.423142 107913 analysis_predictor.cc:475] ======= optimize end =======
[18:14:53]	[Step 1/1] --- Running warmup iteration for quantization
[18:14:53]	[Step 1/1] W1205 18:14:37.528687 107913 naive_executor.cc:45] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
[18:14:53]	[Step 1/1] W1205 18:14:37.528708 107913 naive_executor.cc:47] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
[18:14:53]	[Step 1/1] W1205 18:14:37.528713 107913 naive_executor.cc:50] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
  • test_qat_int8_vgg19_mkldnn (Failed)
[18:19:10]	[Step 1/1] 94/98 Test #825: test_qat_int8_vgg19_mkldnn ................***Failed   23.54 sec
[18:19:10]	[Step 1/1] WARNING: OMP_NUM_THREADS set to 4, not 1. The computation speed will not be optimized if you use data parallel. It will fail if this PaddlePaddle binary is compiled with OpenBlas since OpenBlas does not support multi-threads.
[18:19:10]	[Step 1/1] PLEASE USE OMP_NUM_THREADS WISELY.
[18:19:10]	[Step 1/1] 2019-12-05 18:18:49,599-INFO: QAT FP32 & INT8 prediction run.
[18:19:10]	[Step 1/1] 2019-12-05 18:18:49,599-INFO: QAT model: /root/.cache/inference_demo/int8v2/VGG19_QAT/model
[18:19:10]	[Step 1/1] 2019-12-05 18:18:49,599-INFO: Dataset: /root/.cache/inference_demo/int8v2/data.bin
[18:19:10]	[Step 1/1] 2019-12-05 18:18:49,599-INFO: Batch size: 25
[18:19:10]	[Step 1/1] 2019-12-05 18:18:49,599-INFO: Batch number: 2
[18:19:10]	[Step 1/1] 2019-12-05 18:18:49,599-INFO: Accuracy drop threshold: 0.1.
[18:19:10]	[Step 1/1] 2019-12-05 18:18:49,599-INFO: --- QAT FP32 prediction start ---
[18:19:10]	[Step 1/1] Child killed
  • test_analyzer_qat_performance_benchmark (OTHER_FAULT)
[18:15:12]	[Step 1/1] W1205 18:14:41.332576 109699 naive_executor.cc:47] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
[18:15:12]	[Step 1/1] W1205 18:14:41.332579 109699 naive_executor.cc:50] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
[18:15:12]	[Step 1/1] W1205 18:15:05.995213 109699 naive_executor.cc:45] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
[18:15:12]	[Step 1/1] W1205 18:15:05.996029 109699 naive_executor.cc:47] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
[18:15:12]	[Step 1/1] W1205 18:15:05.996034 109699 naive_executor.cc:50] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
  • test_analyzer_int8_mobilenet_ssd (OTHER_FAULT)
[18:14:52]	[Step 1/1] --- Running analysis [ir_graph_to_program_pass]
[18:14:52]	[Step 1/1] I1205 18:14:36.681308 109630 analysis_predictor.cc:475] ======= optimize end =======
[18:14:52]	[Step 1/1] I1205 18:14:36.682399 109630 tester_helper.h:376] Thread 0, number of threads 1, run 1 times...
[18:14:52]	[Step 1/1] W1205 18:14:37.378950 109630 naive_executor.cc:45] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
[18:14:52]	[Step 1/1] W1205 18:14:37.378979 109630 naive_executor.cc:47] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
[18:14:52]	[Step 1/1] W1205 18:14:37.378983 109630 naive_executor.cc:50] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
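
The repeated NaiveExecutor warnings above suggest the test binaries were built without ON_INFER. A minimal rebuild sketch, not the CI's actual script, assuming the in-tree /paddle/build directory seen in the tracebacks later in this issue:

```bash
# Hedged sketch: re-run cmake with ON_INFER=ON, the flag the warning asks for,
# keeping the GPU setting this job already uses, then rebuild.
cd /paddle/build
cmake .. -DWITH_GPU=ON -DON_INFER=ON
make -j"$(nproc)"
```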

PR_CI_Manylinux_Coverage_CPU

This job uses cmake .. -DWITH_GPU=OFF.
http://ci.paddlepaddle.org/viewLog.html?buildId=238814&buildTypeId=Paddle_PaddleManylinux_PrCiManylinuxCoverageCpu

  • test_qat_int8_resnet101_mkldnn
[17:12:47]	[Step 1/1] 2019-12-05 17:11:55,736-INFO: --- QAT FP32 prediction start ---
[17:12:47]	[Step 1/1] 2019-12-05 17:12:09,631-INFO: batch 1, acc1: 0.9200, acc5: 0.9600, latency: 496.9521 ms, fps: 2.01
[17:12:47]	[Step 1/1] 2019-12-05 17:12:18,439-INFO: batch 2, acc1: 0.7200, acc5: 0.9200, latency: 338.0427 ms, fps: 2.96
[17:12:47]	[Step 1/1] 2019-12-05 17:12:18,725-INFO: Total inference run time: 21.55 s
[17:12:47]	[Step 1/1] 2019-12-05 17:12:18,821-INFO: --- QAT INT8 prediction start ---
[17:12:47]	[Step 1/1] 2019-12-05 17:12:43,925-INFO: batch 1, acc1: 0.8400, acc5: 1.0000, latency: 175.9410 ms, fps: 5.68
[17:12:47]	[Step 1/1] W1205 17:12:46.972718 137610 init.cc:209] Warning: PaddlePaddle catches a failure signal, it may not work properly
[17:12:47]	[Step 1/1] W1205 17:12:46.972774 137610 init.cc:211] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
[17:12:47]	[Step 1/1] W1205 17:12:46.972779 137610 init.cc:214] The detail failure signal is:
[17:12:47]	[Step 1/1] 
[17:12:47]	[Step 1/1] W1205 17:12:46.972784 137610 init.cc:217] *** Aborted at 1575565966 (unix time) try "date -d @1575565966" if you are using GNU date ***
[17:12:47]	[Step 1/1] W1205 17:12:46.974493 137610 init.cc:217] PC: @                0x0 (unknown)
[17:12:47]	[Step 1/1] W1205 17:12:46.974908 137610 init.cc:217] *** SIGSEGV (@0x7f4dd02c2378) received by PID 137610 (TID 0x7f4e2e5c0700) from PID 18446744072907137912; stack trace: ***
[17:12:47]	[Step 1/1] W1205 17:12:46.975927 137610 init.cc:217]     @     0x7f4e2dd7c390 (unknown)
[17:12:47]	[Step 1/1] W1205 17:12:46.975975 137610 init.cc:217]     @     0x7f4dd02c2378 (unknown)
[17:12:47]	[Step 1/1] Segmentation fault
[17:12:47]	[Step 1/1] 
[17:12:47]	[Step 1/1] 
[17:12:47]	[Step 1/1] 99% tests passed, 1 tests failed out of 92
[17:12:47]	[Step 1/1] 
[17:12:47]	[Step 1/1] Total Test time (real) = 153.34 sec
luotao1 (Contributor, Author) commented Dec 6, 2019

Child killed

This error first occurred when we added pyramid_hash_op in #20822. Many unit tests failed with Child killed (see http://ci.paddlepaddle.org/viewLog.html?buildId=203802&tab=buildLog&buildTypeId=Paddle_PrCiCoverage&logTab=tree&filter=all), but we have not found the reason.

We skip the error by building pyramid_hash_op only when WITH_COVERAGE=OFF:

if(WITH_COVERAGE OR NOT WITH_AVX OR WIN32)
    SET(OP_MKL_DEPS ${OP_MKL_DEPS} pyramid_hash_op)
endif()

luotao1 (Contributor, Author) commented Dec 9, 2019

PR_CI_Manylinux_Coverage

This job uses cmake .. -DWITH_GPU=ON.

[18:22:53]	[Step 1/1] 95/99 Test #828: test_qat_int8_vgg19_mkldnn ...................***Failed   24.51 sec
[18:22:53]	[Step 1/1] WARNING: OMP_NUM_THREADS set to 4, not 1. The computation speed will not be optimized if you use data parallel. It will fail if this PaddlePaddle binary is compiled with OpenBlas since OpenBlas does not support multi-threads.
[18:22:53]	[Step 1/1] PLEASE USE OMP_NUM_THREADS WISELY.
[18:22:53]	[Step 1/1] 2019-12-06 18:22:31,083-INFO: QAT FP32 & INT8 prediction run.
[18:22:53]	[Step 1/1] 2019-12-06 18:22:31,083-INFO: QAT model: /root/.cache/inference_demo/int8v2/VGG19_QAT/model
[18:22:53]	[Step 1/1] 2019-12-06 18:22:31,083-INFO: Dataset: /root/.cache/inference_demo/int8v2/data.bin
[18:22:53]	[Step 1/1] 2019-12-06 18:22:31,083-INFO: Batch size: 25
[18:22:53]	[Step 1/1] 2019-12-06 18:22:31,083-INFO: Batch number: 2
[18:22:53]	[Step 1/1] 2019-12-06 18:22:31,083-INFO: Accuracy drop threshold: 0.1.
[18:22:53]	[Step 1/1] 2019-12-06 18:22:31,083-INFO: --- QAT FP32 prediction start ---
[18:22:53]	[Step 1/1] Child killed
[18:18:25]	[Step 1/1] I1206 18:18:04.640463 115555 analysis_predictor.cc:475] ======= optimize end =======
[18:18:25]	[Step 1/1] I1206 18:18:04.643754 115555 tester_helper.h:376] Thread 0, number of threads 1, run 1 times...
[18:18:25]	[Step 1/1] E1206 18:18:04.713287 115555 analysis_predictor.cc:330] feed names from program do not have name: [image] from specified input
[18:18:25]	[Step 1/1] W1206 18:18:04.713332 115555 naive_executor.cc:45] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
[18:18:25]	[Step 1/1] W1206 18:18:04.713340 115555 naive_executor.cc:47] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
[18:18:25]	[Step 1/1] W1206 18:18:04.713343 115555 naive_executor.cc:50] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
[18:18:26]	[Step 1/1] 95/98 Test #823: test_qat_int8_resnet101_mkldnn ................***Failed   23.36 sec
[18:18:26]	[Step 1/1] WARNING: OMP_NUM_THREADS set to 4, not 1. The computation speed will not be optimized if you use data parallel. It will fail if this PaddlePaddle binary is compiled with OpenBlas since OpenBlas does not support multi-threads.
[18:18:26]	[Step 1/1] PLEASE USE OMP_NUM_THREADS WISELY.
[18:18:26]	[Step 1/1] 2019-12-07 18:18:04,725-INFO: QAT FP32 & INT8 prediction run.
[18:18:26]	[Step 1/1] 2019-12-07 18:18:04,726-INFO: QAT model: /root/.cache/inference_demo/int8v2/ResNet101_QAT/model
[18:18:26]	[Step 1/1] 2019-12-07 18:18:04,726-INFO: Dataset: /root/.cache/inference_demo/int8v2/data.bin
[18:18:26]	[Step 1/1] 2019-12-07 18:18:04,726-INFO: Batch size: 25
[18:18:26]	[Step 1/1] 2019-12-07 18:18:04,726-INFO: Batch number: 2
[18:18:26]	[Step 1/1] 2019-12-07 18:18:04,726-INFO: Accuracy drop threshold: 0.1.
[18:18:26]	[Step 1/1] 2019-12-07 18:18:04,726-INFO: --- QAT FP32 prediction start ---
[18:18:26]	[Step 1/1] Child killed
[18:18:17]	[Step 1/1]       Start 825: test_qat_int8_mobilenetv1_mkldnn
[18:18:21]	[Step 1/1] 95/99 Test #836: test_graph ...................................***Failed    4.43 sec
[18:18:21]	[Step 1/1] Traceback (most recent call last):
[18:18:21]	[Step 1/1]   File "test_graph.py", line 20, in <module>
[18:18:21]	[Step 1/1]     import paddle
[18:18:21]	[Step 1/1]   File "/paddle/build/python/paddle/__init__.py", line 30, in <module>
[18:18:21]	[Step 1/1]     import paddle.dataset
[18:18:21]	[Step 1/1]   File "/paddle/build/python/paddle/dataset/__init__.py", line 28, in <module>
[18:18:21]	[Step 1/1]     import paddle.dataset.mq2007
[18:18:21]	[Step 1/1]   File "/paddle/build/python/paddle/dataset/mq2007.py", line 30, in <module>
[18:18:21]	[Step 1/1]     import rarfile
[18:18:21]	[Step 1/1]   File "/usr/local/python2.7.15/lib/python2.7/site-packages/rarfile.py", line 2950, in <module>
[18:18:21]	[Step 1/1]     _check_unrar_tool()
[18:18:21]	[Step 1/1]   File "/usr/local/python2.7.15/lib/python2.7/site-packages/rarfile.py", line 2931, in _check_unrar_tool
[18:18:21]	[Step 1/1]     custom_check([ORIG_UNRAR_TOOL], True)
[18:18:21]	[Step 1/1]   File "/usr/local/python2.7.15/lib/python2.7/site-packages/rarfile.py", line 2823, in custom_check
[18:18:21]	[Step 1/1]     p = custom_popen(cmd)
[18:18:21]	[Step 1/1]   File "/usr/local/python2.7.15/lib/python2.7/site-packages/rarfile.py", line 2813, in custom_popen
[18:18:21]	[Step 1/1]     creationflags=creationflags)
[18:18:21]	[Step 1/1]   File "/usr/local/python2.7.15/lib/python2.7/subprocess.py", line 394, in __init__
[18:18:21]	[Step 1/1]     errread, errwrite)
[18:18:21]	[Step 1/1]   File "/usr/local/python2.7.15/lib/python2.7/subprocess.py", line 938, in _execute_child
[18:18:21]	[Step 1/1]     self.pid = os.fork()
[18:18:21]	[Step 1/1] OSError: [Errno 12] Cannot allocate memory

PR_CI_Manylinux_Coverage_CPU

This job uses cmake .. -DWITH_GPU=OFF.

[17:17:09]	[Step 1/1] 93/93 Test #781: test_qat_int8_resnet101_mkldnn ................***Failed   42.61 sec
[17:17:09]	[Step 1/1] WARNING: OMP_NUM_THREADS set to 4, not 1. The computation speed will not be optimized if you use data parallel. It will fail if this PaddlePaddle binary is compiled with OpenBlas since OpenBlas does not support multi-threads.
[17:17:09]	[Step 1/1] PLEASE USE OMP_NUM_THREADS WISELY.
[17:17:09]	[Step 1/1] 2019-12-06 17:16:28,023-INFO: QAT FP32 & INT8 prediction run.
[17:17:09]	[Step 1/1] 2019-12-06 17:16:28,023-INFO: QAT model: /root/.cache/inference_demo/int8v2/ResNet101_QAT/model
[17:17:09]	[Step 1/1] 2019-12-06 17:16:28,023-INFO: Dataset: /root/.cache/inference_demo/int8v2/data.bin
[17:17:09]	[Step 1/1] 2019-12-06 17:16:28,023-INFO: Batch size: 25
[17:17:09]	[Step 1/1] 2019-12-06 17:16:28,023-INFO: Batch number: 2
[17:17:09]	[Step 1/1] 2019-12-06 17:16:28,023-INFO: Accuracy drop threshold: 0.1.
[17:17:09]	[Step 1/1] 2019-12-06 17:16:28,023-INFO: --- QAT FP32 prediction start ---
[17:17:09]	[Step 1/1] 2019-12-06 17:16:39,954-INFO: batch 1, acc1: 0.9200, acc5: 0.9600, latency: 411.1581 ms, fps: 2.43
[17:17:09]	[Step 1/1] 2019-12-06 17:16:46,611-INFO: batch 2, acc1: 0.7200, acc5: 0.9200, latency: 254.0701 ms, fps: 3.94
[17:17:09]	[Step 1/1] 2019-12-06 17:16:46,837-INFO: Total inference run time: 17.21 s
[17:17:09]	[Step 1/1] 2019-12-06 17:16:46,928-INFO: --- QAT INT8 prediction start ---
[17:17:09]	[Step 1/1] 2019-12-06 17:17:06,482-INFO: batch 1, acc1: 0.8400, acc5: 1.0000, latency: 156.0834 ms, fps: 6.41
[17:17:09]	[Step 1/1] W1206 17:17:08.886281 157379 init.cc:209] Warning: PaddlePaddle catches a failure signal, it may not work properly
[17:17:09]	[Step 1/1] W1206 17:17:08.886318 157379 init.cc:211] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
[17:17:09]	[Step 1/1] W1206 17:17:08.886323 157379 init.cc:214] The detail failure signal is:
[17:17:09]	[Step 1/1] 
[17:17:09]	[Step 1/1] W1206 17:17:08.886330 157379 init.cc:217] *** Aborted at 1575652628 (unix time) try "date -d @1575652628" if you are using GNU date ***
[17:17:09]	[Step 1/1] W1206 17:17:08.888025 157379 init.cc:217] PC: @                0x0 (unknown)
[17:17:09]	[Step 1/1] W1206 17:17:08.888304 157379 init.cc:217] *** SIGSEGV (@0x7fc50733b000) received by PID 157379 (TID 0x7fc565437700) from PID 120827904; stack trace: ***
[17:17:09]	[Step 1/1] W1206 17:17:08.889425 157379 init.cc:217]     @     0x7fc564bf3390 (unknown)
[17:17:09]	[Step 1/1] W1206 17:17:08.889477 157379 init.cc:217]     @     0x7fc50733affd (unknown)
[17:17:09]	[Step 1/1] Segmentation fault
[17:17:09]	[Step 1/1] 

bingyanghuang (Contributor) commented Dec 9, 2019

@ddokupil Can you reproduce these random failures on a local machine?

ddokupil (Contributor) commented Dec 9, 2019

They are passing on my configuration (8180 + GPU).

lidanqing-intel (Contributor) commented Dec 11, 2019

I notice that the build configuration sets FLAGS_fraction_of_gpu_memory_to_use=0.15. GPU runs usually need a larger memory allocation, as in issue #6268.

What about setting it to 0.92, running it through CI, and seeing whether the failures change?
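
A minimal sketch of that experiment, assuming Paddle's usual mechanism of reading FLAGS_* options from the environment; the test-name pattern is only illustrative:

```bash
# Raise the GPU memory fraction, then rerun the failing int8/qat tests
# and compare the failure pattern against the 0.15 runs.
export FLAGS_fraction_of_gpu_memory_to_use=0.92
ctest -R "int8|qat" --output-on-failure
```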

ddokupil (Contributor) commented Dec 11, 2019

We tried several ways to build it with the following cmake command:
cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_DSO=ON -DWITH_GPU=ON -DWITH_AMD_GPU=OFF -DWITH_DISTRIBUTE=OFF -DWITH_MKL=ON -DWITH_NGRAPH=ON -DWITH_AVX=ON -DNOAVX_CORE_FILE= -DWITH_GOLANG=OFF -DCUDA_ARCH_NAME=All -DCUDA_ARCH_BIN= -DWITH_PYTHON=ON -DCUDNN_ROOT=/usr/ -DWITH_TESTING=ON -DWITH_COVERAGE=ON -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DWITH_CONTRIB=ON -DWITH_INFERENCE_API_TEST=ON -DWITH_GRPC=OFF

But with no luck. We keep getting:
[ 96%] Built target test_dot
/usr/bin/ar: /data/ddokupil/Paddle/build/paddle/fluid/inference/libpaddle_fluid_origin.a: File truncated
paddle/fluid/inference/CMakeFiles/paddle_fluid_origin.dir/build.make:6913: recipe for target 'paddle/fluid/inference/libpaddle_fluid_origin.a' failed
make[2]: *** [paddle/fluid/inference/libpaddle_fluid_origin.a] Error 1
make[2]: *** Deleting file 'paddle/fluid/inference/libpaddle_fluid_origin.a'
CMakeFiles/Makefile2:86765: recipe for target 'paddle/fluid/inference/CMakeFiles/paddle_fluid_origin.dir/all' failed
make[1]: *** [paddle/fluid/inference/CMakeFiles/paddle_fluid_origin.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....

If we disable the GPU, everything builds fine, but that doesn't reproduce the issue.

lidanqing-intel (Contributor) commented Dec 11, 2019

If we disable the GPU and build with -DWITH_GPU=OFF -DWITH_COVERAGE=ON -DWITH_NGRAPH=ON, all int8 unit tests pass on our local machine, so we cannot reproduce the test_qat_int8_resnet101_mkldnn failure with GPU off. Could we somehow run it on Baidu's machine?

luotao1 (Contributor, Author) commented Dec 12, 2019

bingyanghuang (Contributor) commented Dec 12, 2019

libpaddle_fluid_origin.a

@ddokupil you can refer to issue #14775 and issue #17832.
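
For the File truncated error itself, one common recovery is to remove the stale archive and rebuild just that target; a hedged sketch, with the path and target name taken from the make output above:

```bash
# Assumes the archive was left half-written by an interrupted or
# out-of-disk earlier build, so ar cannot append to it.
rm -f /data/ddokupil/Paddle/build/paddle/fluid/inference/libpaddle_fluid_origin.a
make -C /data/ddokupil/Paddle/build paddle_fluid_origin -j"$(nproc)"
```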

lidanqing-intel (Contributor) commented Dec 12, 2019

With GPU ON, we partly reproduced some failures (all qat_int8 tests passed; all test_analyzer_int8_<modelname> tests failed with the same error). We suspect the unit tests share some common library or file containing GPU settings. We are now trying to fix it.

    Start 178: test_analyzer_qat_performance_benchmark
178: Test command: /data/ddokupil/Paddle/build/paddle/fluid/inference/tests/api/test_analyzer_qat_image_classification "ARGS" "--fraction_of_gpu_memory_to_use=0.5" "--fp32_model=/data/ddokupil/Paddle/build/third_party/inference_demo/int8v2/ResNet50_qat_perf/ResNet50_qat_perf/float" "--int8_model=/data/ddokupil/Paddle/build/third_party/inference_demo/int8v2/ResNet50_qat_perf_int8/ResNet50_qat_perf_int8" "--infer_data=/data/ddokupil/Paddle/build/third_party/inference_demo/int8v2/data.bin" "--batch_size=50" "--paddle_num_threads=4" "--with_accuracy_layer=false" "--iterations=2"
178: Environment variables:
178:  FLAGS_cudnn_deterministic=true
178: Test timeout computed to be: 600
178: [==========] Running 1 test from 1 test case.
178: [----------] Global test environment set-up.
178: [----------] 1 test from Analyzer_qat_image_classification
178: [ RUN      ] Analyzer_qat_image_classification.quantization
178: WARNING: Logging before InitGoogleLogging() is written to STDERR
178: E1212 09:59:53.723207 124261 analysis_config.cc:307] EnableMKLDNN() only works when IR optimization is enabled.
178: E1212 09:59:53.723310 124261 analysis_config.cc:307] EnableMKLDNN() only works when IR optimization is enabled.
178: I1212 09:59:53.723361 124261 analyzer_qat_image_classification_tester.cc:79] Total images in file: 100
178: I1212 09:59:53.758296 124261 tester_helper.h:696] FP32 & INT8 prediction run: batch_size 50, warmup batch size 100.
178: I1212 09:59:53.758322 124261 tester_helper.h:699] --- FP32 prediction start ---
178: I1212 09:59:53.758329 124261 tester_helper.h:94] AnalysisConfig {
178: unknown file: Failure
178: C++ exception with description "
178:
178: --------------------------------------------
178: C++ Call Stacks (More useful to developers):
178: --------------------------------------------
178:
178: ----------------------
178: Error Message Summary:
178: ----------------------
178: Error: id must less than GPU count
178:   [Hint: Expected id < GetCUDADeviceCount(), but received id:0 >= GetCUDADeviceCount():0.] at (/data/ddokupil/Paddle/paddle/fluid/platform/gpu_info.cc:216)
178: " thrown in the test body.
178: [  FAILED  ] Analyzer_qat_image_classification.quantization (57 ms)
178: [----------] 1 test from Analyzer_qat_image_classification (57 ms total)
178:
178: [----------] Global test environment tear-down
178: [==========] 1 test from 1 test case ran. (57 ms total)
178: [  PASSED  ] 0 tests.
178: [  FAILED  ] 1 test, listed below:
178: [  FAILED  ] Analyzer_qat_image_classification.quantization
178:
178:  1 FAILED TEST
1/1 Test #178: test_analyzer_qat_performance_benchmark ...***Failed    0.10 sec
0% tests passed, 1 tests failed out of 1
Total Test time (real) =   0.13 sec
The following tests FAILED:
        178 - test_analyzer_qat_performance_benchmark (Failed)
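
The hint Expected id < GetCUDADeviceCount(), but received id:0 >= GetCUDADeviceCount():0 means the test process saw no CUDA devices at all. A hedged first check before rerunning, assuming a CUDA-capable host:

```bash
# If nvidia-smi lists no GPUs, or CUDA_VISIBLE_DEVICES is set to an empty
# value, a GPU-enabled test binary fails exactly this way.
nvidia-smi -L
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES-<unset>}"
```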

wojtuss commented Dec 16, 2019

@luotao1,
The logs contain entries indicating problems with memory allocation, e.g.

Out of memory error on GPU 0. Cannot allocate 24.000000B memory on GPU 0, available memory is only 13.832520GB.

or

OSError: [Errno 12] Cannot allocate memory

To investigate the issue further, please send us the output of the dmesg command after the build with failing tests.
Also, could you please run the reproduction using only a single thread (probably with CTEST_PARALLEL_LEVEL=1)? That would limit the possibility of other tests' memory problems interfering with these ones.
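
A minimal sketch of that serial run, assuming the /paddle/build directory from the tracebacks above (CTEST_PARALLEL_LEVEL is a standard CTest environment variable; the test-name pattern is only illustrative):

```bash
cd /paddle/build
# Run the suspect tests one at a time so parallel tests cannot exhaust memory.
CTEST_PARALLEL_LEVEL=1 ctest -R "test_qat_int8|test_analyzer_int8" --output-on-failure
dmesg | tail -n 100   # then check for OOM-killer or segfault records
```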

luotao1 (Contributor, Author) commented Dec 16, 2019

Sorry, we cannot use a single thread (CTEST_PARALLEL_LEVEL=1), since it would make the CI elapsed time too long.

send us the output of the dmesg command after the build with failing tests.

Sorry, we have disabled these unit tests in #21696, and our nightly tests run on the develop branch.

lidanqing-intel (Contributor) commented

Moved to other machines that have bigger GPU memory.
