
int8 unit-test fails on 6148 machine #21594

Closed

luotao1 opened this issue Dec 6, 2019 · 13 comments

luotao1 (Contributor) commented Dec 6, 2019

The PR_CI_Coverage job runs on a 5117 machine, and we added a 6148 machine for nightly jobs.

PR_CI_Manylinux_Coverage

This job uses cmake .. -DWITH_GPU=ON.

http://ci.paddlepaddle.org/viewLog.html?buildId=238851&buildTypeId=Paddle_PaddleManylinux_PrCiManylinuxCoverage

  • test_analyzer_int8_vgg16 (OTHER_FAULT)
[18:14:53]	[Step 1/1] I1205 18:14:37.423142 107913 analysis_predictor.cc:475] ======= optimize end =======
[18:14:53]	[Step 1/1] --- Running warmup iteration for quantization
[18:14:53]	[Step 1/1] W1205 18:14:37.528687 107913 naive_executor.cc:45] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
[18:14:53]	[Step 1/1] W1205 18:14:37.528708 107913 naive_executor.cc:47] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
[18:14:53]	[Step 1/1] W1205 18:14:37.528713 107913 naive_executor.cc:50] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
  • test_qat_int8_vgg19_mkldnn (Failed)
[18:19:10]	[Step 1/1] 94/98 Test #825: test_qat_int8_vgg19_mkldnn ................***Failed   23.54 sec
[18:19:10]	[Step 1/1] WARNING: OMP_NUM_THREADS set to 4, not 1. The computation speed will not be optimized if you use data parallel. It will fail if this PaddlePaddle binary is compiled with OpenBlas since OpenBlas does not support multi-threads.
[18:19:10]	[Step 1/1] PLEASE USE OMP_NUM_THREADS WISELY.
[18:19:10]	[Step 1/1] 2019-12-05 18:18:49,599-INFO: QAT FP32 & INT8 prediction run.
[18:19:10]	[Step 1/1] 2019-12-05 18:18:49,599-INFO: QAT model: /root/.cache/inference_demo/int8v2/VGG19_QAT/model
[18:19:10]	[Step 1/1] 2019-12-05 18:18:49,599-INFO: Dataset: /root/.cache/inference_demo/int8v2/data.bin
[18:19:10]	[Step 1/1] 2019-12-05 18:18:49,599-INFO: Batch size: 25
[18:19:10]	[Step 1/1] 2019-12-05 18:18:49,599-INFO: Batch number: 2
[18:19:10]	[Step 1/1] 2019-12-05 18:18:49,599-INFO: Accuracy drop threshold: 0.1.
[18:19:10]	[Step 1/1] 2019-12-05 18:18:49,599-INFO: --- QAT FP32 prediction start ---
[18:19:10]	[Step 1/1] Child killed
  • test_analyzer_qat_performance_benchmark (OTHER_FAULT)
[18:15:12]	[Step 1/1] W1205 18:14:41.332576 109699 naive_executor.cc:47] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
[18:15:12]	[Step 1/1] W1205 18:14:41.332579 109699 naive_executor.cc:50] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
[18:15:12]	[Step 1/1] W1205 18:15:05.995213 109699 naive_executor.cc:45] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
[18:15:12]	[Step 1/1] W1205 18:15:05.996029 109699 naive_executor.cc:47] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
[18:15:12]	[Step 1/1] W1205 18:15:05.996034 109699 naive_executor.cc:50] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
  • test_analyzer_int8_mobilenet_ssd (OTHER_FAULT)
[18:14:52]	[Step 1/1] --- Running analysis [ir_graph_to_program_pass]
[18:14:52]	[Step 1/1] I1205 18:14:36.681308 109630 analysis_predictor.cc:475] ======= optimize end =======
[18:14:52]	[Step 1/1] I1205 18:14:36.682399 109630 tester_helper.h:376] Thread 0, number of threads 1, run 1 times...
[18:14:52]	[Step 1/1] W1205 18:14:37.378950 109630 naive_executor.cc:45] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
[18:14:52]	[Step 1/1] W1205 18:14:37.378979 109630 naive_executor.cc:47] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
[18:14:52]	[Step 1/1] W1205 18:14:37.378983 109630 naive_executor.cc:50] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
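
The repeated NaiveExecutor warnings above suggest the test binaries were built without ON_INFER. A minimal rebuild sketch, not the CI's actual script, assuming the in-tree /paddle/build directory seen in the tracebacks later in this issue:

```bash
# Hedged sketch: re-run cmake with ON_INFER=ON, the flag the warning asks for,
# keeping the GPU setting this job already uses, then rebuild.
cd /paddle/build
cmake .. -DWITH_GPU=ON -DON_INFER=ON
make -j"$(nproc)"
```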

PR_CI_Manylinux_Coverage_CPU

This job uses cmake .. -DWITH_GPU=OFF.
http://ci.paddlepaddle.org/viewLog.html?buildId=238814&buildTypeId=Paddle_PaddleManylinux_PrCiManylinuxCoverageCpu

  • test_qat_int8_resnet101_mkldnn
[17:12:47]	[Step 1/1] 2019-12-05 17:11:55,736-INFO: --- QAT FP32 prediction start ---
[17:12:47]	[Step 1/1] 2019-12-05 17:12:09,631-INFO: batch 1, acc1: 0.9200, acc5: 0.9600, latency: 496.9521 ms, fps: 2.01
[17:12:47]	[Step 1/1] 2019-12-05 17:12:18,439-INFO: batch 2, acc1: 0.7200, acc5: 0.9200, latency: 338.0427 ms, fps: 2.96
[17:12:47]	[Step 1/1] 2019-12-05 17:12:18,725-INFO: Total inference run time: 21.55 s
[17:12:47]	[Step 1/1] 2019-12-05 17:12:18,821-INFO: --- QAT INT8 prediction start ---
[17:12:47]	[Step 1/1] 2019-12-05 17:12:43,925-INFO: batch 1, acc1: 0.8400, acc5: 1.0000, latency: 175.9410 ms, fps: 5.68
[17:12:47]	[Step 1/1] W1205 17:12:46.972718 137610 init.cc:209] Warning: PaddlePaddle catches a failure signal, it may not work properly
[17:12:47]	[Step 1/1] W1205 17:12:46.972774 137610 init.cc:211] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
[17:12:47]	[Step 1/1] W1205 17:12:46.972779 137610 init.cc:214] The detail failure signal is:
[17:12:47]	[Step 1/1] 
[17:12:47]	[Step 1/1] W1205 17:12:46.972784 137610 init.cc:217] *** Aborted at 1575565966 (unix time) try "date -d @1575565966" if you are using GNU date ***
[17:12:47]	[Step 1/1] W1205 17:12:46.974493 137610 init.cc:217] PC: @                0x0 (unknown)
[17:12:47]	[Step 1/1] W1205 17:12:46.974908 137610 init.cc:217] *** SIGSEGV (@0x7f4dd02c2378) received by PID 137610 (TID 0x7f4e2e5c0700) from PID 18446744072907137912; stack trace: ***
[17:12:47]	[Step 1/1] W1205 17:12:46.975927 137610 init.cc:217]     @     0x7f4e2dd7c390 (unknown)
[17:12:47]	[Step 1/1] W1205 17:12:46.975975 137610 init.cc:217]     @     0x7f4dd02c2378 (unknown)
[17:12:47]	[Step 1/1] Segmentation fault
[17:12:47]	[Step 1/1] 
[17:12:47]	[Step 1/1] 
[17:12:47]	[Step 1/1] 99% tests passed, 1 tests failed out of 92
[17:12:47]	[Step 1/1] 
[17:12:47]	[Step 1/1] Total Test time (real) = 153.34 sec
luotao1 (Contributor, Author) commented Dec 6, 2019

Child killed

This error first occurred when we added pyramid_hash_op in #20822. Many unit tests failed with Child killed (see http://ci.paddlepaddle.org/viewLog.html?buildId=203802&tab=buildLog&buildTypeId=Paddle_PrCiCoverage&logTab=tree&filter=all), but we have not found the reason.

We skip the error by building pyramid_hash_op only when WITH_COVERAGE=OFF:

if(WITH_COVERAGE OR NOT WITH_AVX OR WIN32)
    SET(OP_MKL_DEPS ${OP_MKL_DEPS} pyramid_hash_op)
endif()

luotao1 (Contributor, Author) commented Dec 9, 2019

PR_CI_Manylinux_Coverage

This job uses cmake .. -DWITH_GPU=ON.

[18:22:53]	[Step 1/1] 95/99 Test #828: test_qat_int8_vgg19_mkldnn ...................***Failed   24.51 sec
[18:22:53]	[Step 1/1] WARNING: OMP_NUM_THREADS set to 4, not 1. The computation speed will not be optimized if you use data parallel. It will fail if this PaddlePaddle binary is compiled with OpenBlas since OpenBlas does not support multi-threads.
[18:22:53]	[Step 1/1] PLEASE USE OMP_NUM_THREADS WISELY.
[18:22:53]	[Step 1/1] 2019-12-06 18:22:31,083-INFO: QAT FP32 & INT8 prediction run.
[18:22:53]	[Step 1/1] 2019-12-06 18:22:31,083-INFO: QAT model: /root/.cache/inference_demo/int8v2/VGG19_QAT/model
[18:22:53]	[Step 1/1] 2019-12-06 18:22:31,083-INFO: Dataset: /root/.cache/inference_demo/int8v2/data.bin
[18:22:53]	[Step 1/1] 2019-12-06 18:22:31,083-INFO: Batch size: 25
[18:22:53]	[Step 1/1] 2019-12-06 18:22:31,083-INFO: Batch number: 2
[18:22:53]	[Step 1/1] 2019-12-06 18:22:31,083-INFO: Accuracy drop threshold: 0.1.
[18:22:53]	[Step 1/1] 2019-12-06 18:22:31,083-INFO: --- QAT FP32 prediction start ---
[18:22:53]	[Step 1/1] Child killed
[18:18:25]	[Step 1/1] I1206 18:18:04.640463 115555 analysis_predictor.cc:475] ======= optimize end =======
[18:18:25]	[Step 1/1] I1206 18:18:04.643754 115555 tester_helper.h:376] Thread 0, number of threads 1, run 1 times...
[18:18:25]	[Step 1/1] E1206 18:18:04.713287 115555 analysis_predictor.cc:330] feed names from program do not have name: [image] from specified input
[18:18:25]	[Step 1/1] W1206 18:18:04.713332 115555 naive_executor.cc:45] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
[18:18:25]	[Step 1/1] W1206 18:18:04.713340 115555 naive_executor.cc:47] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
[18:18:25]	[Step 1/1] W1206 18:18:04.713343 115555 naive_executor.cc:50] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
[18:18:26]	[Step 1/1] 95/98 Test #823: test_qat_int8_resnet101_mkldnn ................***Failed   23.36 sec
[18:18:26]	[Step 1/1] WARNING: OMP_NUM_THREADS set to 4, not 1. The computation speed will not be optimized if you use data parallel. It will fail if this PaddlePaddle binary is compiled with OpenBlas since OpenBlas does not support multi-threads.
[18:18:26]	[Step 1/1] PLEASE USE OMP_NUM_THREADS WISELY.
[18:18:26]	[Step 1/1] 2019-12-07 18:18:04,725-INFO: QAT FP32 & INT8 prediction run.
[18:18:26]	[Step 1/1] 2019-12-07 18:18:04,726-INFO: QAT model: /root/.cache/inference_demo/int8v2/ResNet101_QAT/model
[18:18:26]	[Step 1/1] 2019-12-07 18:18:04,726-INFO: Dataset: /root/.cache/inference_demo/int8v2/data.bin
[18:18:26]	[Step 1/1] 2019-12-07 18:18:04,726-INFO: Batch size: 25
[18:18:26]	[Step 1/1] 2019-12-07 18:18:04,726-INFO: Batch number: 2
[18:18:26]	[Step 1/1] 2019-12-07 18:18:04,726-INFO: Accuracy drop threshold: 0.1.
[18:18:26]	[Step 1/1] 2019-12-07 18:18:04,726-INFO: --- QAT FP32 prediction start ---
[18:18:26]	[Step 1/1] Child killed
[18:18:17]	[Step 1/1]       Start 825: test_qat_int8_mobilenetv1_mkldnn
[18:18:21]	[Step 1/1] 95/99 Test #836: test_graph ...................................***Failed    4.43 sec
[18:18:21]	[Step 1/1] Traceback (most recent call last):
[18:18:21]	[Step 1/1]   File "test_graph.py", line 20, in <module>
[18:18:21]	[Step 1/1]     import paddle
[18:18:21]	[Step 1/1]   File "/paddle/build/python/paddle/__init__.py", line 30, in <module>
[18:18:21]	[Step 1/1]     import paddle.dataset
[18:18:21]	[Step 1/1]   File "/paddle/build/python/paddle/dataset/__init__.py", line 28, in <module>
[18:18:21]	[Step 1/1]     import paddle.dataset.mq2007
[18:18:21]	[Step 1/1]   File "/paddle/build/python/paddle/dataset/mq2007.py", line 30, in <module>
[18:18:21]	[Step 1/1]     import rarfile
[18:18:21]	[Step 1/1]   File "/usr/local/python2.7.15/lib/python2.7/site-packages/rarfile.py", line 2950, in <module>
[18:18:21]	[Step 1/1]     _check_unrar_tool()
[18:18:21]	[Step 1/1]   File "/usr/local/python2.7.15/lib/python2.7/site-packages/rarfile.py", line 2931, in _check_unrar_tool
[18:18:21]	[Step 1/1]     custom_check([ORIG_UNRAR_TOOL], True)
[18:18:21]	[Step 1/1]   File "/usr/local/python2.7.15/lib/python2.7/site-packages/rarfile.py", line 2823, in custom_check
[18:18:21]	[Step 1/1]     p = custom_popen(cmd)
[18:18:21]	[Step 1/1]   File "/usr/local/python2.7.15/lib/python2.7/site-packages/rarfile.py", line 2813, in custom_popen
[18:18:21]	[Step 1/1]     creationflags=creationflags)
[18:18:21]	[Step 1/1]   File "/usr/local/python2.7.15/lib/python2.7/subprocess.py", line 394, in __init__
[18:18:21]	[Step 1/1]     errread, errwrite)
[18:18:21]	[Step 1/1]   File "/usr/local/python2.7.15/lib/python2.7/subprocess.py", line 938, in _execute_child
[18:18:21]	[Step 1/1]     self.pid = os.fork()
[18:18:21]	[Step 1/1] OSError: [Errno 12] Cannot allocate memory

PR_CI_Manylinux_Coverage_CPU

This job uses cmake .. -DWITH_GPU=OFF.

[17:17:09]	[Step 1/1] 93/93 Test #781: test_qat_int8_resnet101_mkldnn ................***Failed   42.61 sec
[17:17:09]	[Step 1/1] WARNING: OMP_NUM_THREADS set to 4, not 1. The computation speed will not be optimized if you use data parallel. It will fail if this PaddlePaddle binary is compiled with OpenBlas since OpenBlas does not support multi-threads.
[17:17:09]	[Step 1/1] PLEASE USE OMP_NUM_THREADS WISELY.
[17:17:09]	[Step 1/1] 2019-12-06 17:16:28,023-INFO: QAT FP32 & INT8 prediction run.
[17:17:09]	[Step 1/1] 2019-12-06 17:16:28,023-INFO: QAT model: /root/.cache/inference_demo/int8v2/ResNet101_QAT/model
[17:17:09]	[Step 1/1] 2019-12-06 17:16:28,023-INFO: Dataset: /root/.cache/inference_demo/int8v2/data.bin
[17:17:09]	[Step 1/1] 2019-12-06 17:16:28,023-INFO: Batch size: 25
[17:17:09]	[Step 1/1] 2019-12-06 17:16:28,023-INFO: Batch number: 2
[17:17:09]	[Step 1/1] 2019-12-06 17:16:28,023-INFO: Accuracy drop threshold: 0.1.
[17:17:09]	[Step 1/1] 2019-12-06 17:16:28,023-INFO: --- QAT FP32 prediction start ---
[17:17:09]	[Step 1/1] 2019-12-06 17:16:39,954-INFO: batch 1, acc1: 0.9200, acc5: 0.9600, latency: 411.1581 ms, fps: 2.43
[17:17:09]	[Step 1/1] 2019-12-06 17:16:46,611-INFO: batch 2, acc1: 0.7200, acc5: 0.9200, latency: 254.0701 ms, fps: 3.94
[17:17:09]	[Step 1/1] 2019-12-06 17:16:46,837-INFO: Total inference run time: 17.21 s
[17:17:09]	[Step 1/1] 2019-12-06 17:16:46,928-INFO: --- QAT INT8 prediction start ---
[17:17:09]	[Step 1/1] 2019-12-06 17:17:06,482-INFO: batch 1, acc1: 0.8400, acc5: 1.0000, latency: 156.0834 ms, fps: 6.41
[17:17:09]	[Step 1/1] W1206 17:17:08.886281 157379 init.cc:209] Warning: PaddlePaddle catches a failure signal, it may not work properly
[17:17:09]	[Step 1/1] W1206 17:17:08.886318 157379 init.cc:211] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
[17:17:09]	[Step 1/1] W1206 17:17:08.886323 157379 init.cc:214] The detail failure signal is:
[17:17:09]	[Step 1/1] 
[17:17:09]	[Step 1/1] W1206 17:17:08.886330 157379 init.cc:217] *** Aborted at 1575652628 (unix time) try "date -d @1575652628" if you are using GNU date ***
[17:17:09]	[Step 1/1] W1206 17:17:08.888025 157379 init.cc:217] PC: @                0x0 (unknown)
[17:17:09]	[Step 1/1] W1206 17:17:08.888304 157379 init.cc:217] *** SIGSEGV (@0x7fc50733b000) received by PID 157379 (TID 0x7fc565437700) from PID 120827904; stack trace: ***
[17:17:09]	[Step 1/1] W1206 17:17:08.889425 157379 init.cc:217]     @     0x7fc564bf3390 (unknown)
[17:17:09]	[Step 1/1] W1206 17:17:08.889477 157379 init.cc:217]     @     0x7fc50733affd (unknown)
[17:17:09]	[Step 1/1] Segmentation fault
[17:17:09]	[Step 1/1] 

bingyanghuang (Contributor) commented Dec 9, 2019

@ddokupil Can you reproduce these random failures on a local machine?

ddokupil (Contributor) commented Dec 9, 2019

They are passing on my configuration (8180 + GPU).

lidanqing-intel (Contributor) commented Dec 11, 2019

I notice that the build configuration sets FLAGS_fraction_of_gpu_memory_to_use=0.15. GPU runs usually need a larger memory allocation, as in issue #6268.

What about setting it to 0.92, running it through CI, and seeing whether the failures change?
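
A minimal sketch of that experiment, assuming Paddle's usual mechanism of reading FLAGS_* options from the environment; the test-name pattern is only illustrative:

```bash
# Raise the GPU memory fraction, then rerun the failing int8/qat tests
# and compare the failure pattern against the 0.15 runs.
export FLAGS_fraction_of_gpu_memory_to_use=0.92
ctest -R "int8|qat" --output-on-failure
```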

ddokupil (Contributor) commented Dec 11, 2019

We tried several ways to build it with the following cmake command:
cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_DSO=ON -DWITH_GPU=ON -DWITH_AMD_GPU=OFF -DWITH_DISTRIBUTE=OFF -DWITH_MKL=ON -DWITH_NGRAPH=ON -DWITH_AVX=ON -DNOAVX_CORE_FILE= -DWITH_GOLANG=OFF -DCUDA_ARCH_NAME=All -DCUDA_ARCH_BIN= -DWITH_PYTHON=ON -DCUDNN_ROOT=/usr/ -DWITH_TESTING=ON -DWITH_COVERAGE=ON -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DWITH_CONTRIB=ON -DWITH_INFERENCE_API_TEST=ON -DWITH_GRPC=OFF

But with no luck. We keep getting:
[ 96%] Built target test_dot
/usr/bin/ar: /data/ddokupil/Paddle/build/paddle/fluid/inference/libpaddle_fluid_origin.a: File truncated
paddle/fluid/inference/CMakeFiles/paddle_fluid_origin.dir/build.make:6913: recipe for target 'paddle/fluid/inference/libpaddle_fluid_origin.a' failed
make[2]: *** [paddle/fluid/inference/libpaddle_fluid_origin.a] Error 1
make[2]: *** Deleting file 'paddle/fluid/inference/libpaddle_fluid_origin.a'
CMakeFiles/Makefile2:86765: recipe for target 'paddle/fluid/inference/CMakeFiles/paddle_fluid_origin.dir/all' failed
make[1]: *** [paddle/fluid/inference/CMakeFiles/paddle_fluid_origin.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....

If we disable the GPU, everything builds fine, but that doesn't reproduce the issue.

lidanqing-intel (Contributor) commented Dec 11, 2019

If we disable the GPU and build with -DWITH_GPU=OFF -DWITH_COVERAGE=ON -DWITH_NGRAPH=ON, all int8 unit tests pass on our local machine, so we cannot reproduce the test_qat_int8_resnet101_mkldnn failure with GPU off. Could we somehow run it on Baidu's machine?

luotao1 (Contributor, Author) commented Dec 12, 2019

bingyanghuang (Contributor) commented Dec 12, 2019

libpaddle_fluid_origin.a

@ddokupil you can refer to issue #14775 and issue #17832.
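
For the File truncated error itself, one common recovery is to remove the stale archive and rebuild just that target; a hedged sketch, with the path and target name taken from the make output above:

```bash
# Assumes the archive was left half-written by an interrupted or
# out-of-disk earlier build, so ar cannot append to it.
rm -f /data/ddokupil/Paddle/build/paddle/fluid/inference/libpaddle_fluid_origin.a
make -C /data/ddokupil/Paddle/build paddle_fluid_origin -j"$(nproc)"
```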

lidanqing-intel (Contributor) commented Dec 12, 2019

With GPU ON, we partly reproduced some failures (all qat_int8 tests passed; all test_analyzer_int8_<modelname> tests failed with the same error). We suspect the unit tests share some common library or file containing GPU settings. We are now trying to fix it.

    Start 178: test_analyzer_qat_performance_benchmark
178: Test command: /data/ddokupil/Paddle/build/paddle/fluid/inference/tests/api/test_analyzer_qat_image_classification "ARGS" "--fraction_of_gpu_memory_to_use=0.5" "--fp32_model=/data/ddokupil/Paddle/build/third_party/inference_demo/int8v2/ResNet50_qat_perf/ResNet50_qat_perf/float" "--int8_model=/data/ddokupil/Paddle/build/third_party/inference_demo/int8v2/ResNet50_qat_perf_int8/ResNet50_qat_perf_int8" "--infer_data=/data/ddokupil/Paddle/build/third_party/inference_demo/int8v2/data.bin" "--batch_size=50" "--paddle_num_threads=4" "--with_accuracy_layer=false" "--iterations=2"
178: Environment variables:
178:  FLAGS_cudnn_deterministic=true
178: Test timeout computed to be: 600
178: [==========] Running 1 test from 1 test case.
178: [----------] Global test environment set-up.
178: [----------] 1 test from Analyzer_qat_image_classification
178: [ RUN      ] Analyzer_qat_image_classification.quantization
178: WARNING: Logging before InitGoogleLogging() is written to STDERR
178: E1212 09:59:53.723207 124261 analysis_config.cc:307] EnableMKLDNN() only works when IR optimization is enabled.
178: E1212 09:59:53.723310 124261 analysis_config.cc:307] EnableMKLDNN() only works when IR optimization is enabled.
178: I1212 09:59:53.723361 124261 analyzer_qat_image_classification_tester.cc:79] Total images in file: 100
178: I1212 09:59:53.758296 124261 tester_helper.h:696] FP32 & INT8 prediction run: batch_size 50, warmup batch size 100.
178: I1212 09:59:53.758322 124261 tester_helper.h:699] --- FP32 prediction start ---
178: I1212 09:59:53.758329 124261 tester_helper.h:94] AnalysisConfig {
178: unknown file: Failure
178: C++ exception with description "
178:
178: --------------------------------------------
178: C++ Call Stacks (More useful to developers):
178: --------------------------------------------
178:
178: ----------------------
178: Error Message Summary:
178: ----------------------
178: Error: id must less than GPU count
178:   [Hint: Expected id < GetCUDADeviceCount(), but received id:0 >= GetCUDADeviceCount():0.] at (/data/ddokupil/Paddle/paddle/fluid/platform/gpu_info.cc:216)
178: " thrown in the test body.
178: [  FAILED  ] Analyzer_qat_image_classification.quantization (57 ms)
178: [----------] 1 test from Analyzer_qat_image_classification (57 ms total)
178:
178: [----------] Global test environment tear-down
178: [==========] 1 test from 1 test case ran. (57 ms total)
178: [  PASSED  ] 0 tests.
178: [  FAILED  ] 1 test, listed below:
178: [  FAILED  ] Analyzer_qat_image_classification.quantization
178:
178:  1 FAILED TEST
1/1 Test #178: test_analyzer_qat_performance_benchmark ...***Failed    0.10 sec
0% tests passed, 1 tests failed out of 1
Total Test time (real) =   0.13 sec
The following tests FAILED:
        178 - test_analyzer_qat_performance_benchmark (Failed)
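
The hint Expected id < GetCUDADeviceCount(), but received id:0 >= GetCUDADeviceCount():0 means the test process saw no CUDA devices at all. A hedged first check before rerunning, assuming a CUDA-capable host:

```bash
# If nvidia-smi lists no GPUs, or CUDA_VISIBLE_DEVICES is set to an empty
# value, a GPU-enabled test binary fails exactly this way.
nvidia-smi -L
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES-<unset>}"
```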

wojtuss commented Dec 16, 2019

@luotao1,
The logs contain entries indicating problems with memory allocation, e.g.

Out of memory error on GPU 0. Cannot allocate 24.000000B memory on GPU 0, available memory is only 13.832520GB.

or

OSError: [Errno 12] Cannot allocate memory

To investigate the issue further, please send us the output of the dmesg command after the build with failing tests.
Also, could you please run the reproduction using only a single thread (probably with CTEST_PARALLEL_LEVEL=1)? That would limit the possibility of other tests' memory problems interfering with these ones.
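
A minimal sketch of that serial run, assuming the /paddle/build directory from the tracebacks above (CTEST_PARALLEL_LEVEL is a standard CTest environment variable; the test-name pattern is only illustrative):

```bash
cd /paddle/build
# Run the suspect tests one at a time so parallel tests cannot exhaust memory.
CTEST_PARALLEL_LEVEL=1 ctest -R "test_qat_int8|test_analyzer_int8" --output-on-failure
dmesg | tail -n 100   # then check for OOM-killer or segfault records
```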

luotao1 (Contributor, Author) commented Dec 16, 2019

Sorry, we cannot use a single thread (CTEST_PARALLEL_LEVEL=1), since it would make the CI elapsed time too long.

send us the output of the dmesg command after the build with failing tests.

Sorry, we have disabled these unit tests in #21696, and our nightly tests run on the develop branch.

lidanqing-intel (Contributor) commented

Moved to other machines that have bigger GPU memory.
