PaddlePaddle 2.6.0 buglist, part 1 #60882
Comments
@onecatcn,
#!/bin/bash
set -x
cd paddle/build
ctest --output-on-failure -R test_cuda_graph_partial_graph_static_run
ctest --output-on-failure -R test_graph_reindex
ctest --output-on-failure -R test_cuda_graphed_layer
ctest --output-on-failure -R test_unique
ctest --output-on-failure -R test_weight_decay
ctest --output-on-failure -R test_unique_static_build
ctest --output-on-failure -R test_post_training_quantization_resnet50
ctest --output-on-failure -R test_communicator_half_async
ctest --output-on-failure -R test_trt_convert_scatter
ctest --output-on-failure -R test_trt_convert_assign
ctest --output-on-failure -R test_trt_convert_lookup_table
ctest --output-on-failure -R test_post_training_quantization_mobilenetv1
ctest --output-on-failure -R test_trt_convert_yolo_box  # timeout
log: The following were fixed on the develop branch but not on release/2.6.0, and need to be cherry-picked:
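One thing to keep in mind with the script above: `ctest -R` takes a regular expression and matches it anywhere in the test name, so `-R test_unique` also selects `test_unique_static_build` (and any other test whose name contains that string). A minimal sketch of building one anchored expression so that only the named tests run, in a single invocation, is shown below. The helper name `build_ctest_regex` is made up for illustration, and the three test names are just a subset of the list above. The sketch only prints the command instead of running it.

```shell
# Hypothetical helper: join all arguments with '|' and anchor the result,
# giving ^(name1|name2|...)$ so that ctest matches whole test names only.
# The function body runs in a subshell, so the IFS change does not leak out.
build_ctest_regex() (
  IFS='|'
  printf '^(%s)$\n' "$*"
)

# A subset of the tests from the script above, for illustration.
regex=$(build_ctest_regex test_graph_reindex test_unique test_unique_static_build)

# Print the command rather than executing it.
echo "ctest --output-on-failure -R '$regex'"
```

Running the printed command from `paddle/build` would execute exactly those three tests, with `test_unique` no longer pulling in other tests that merely share its prefix.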
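
We hope the following unit tests can be handled with high priority:

I have already asked the owner to investigate; this will be resolved as soon as possible.
PR 61284 has been submitted to fix the unit test below:
test_layer_norm_op_static_build (Failed)
Submitted #61591.
We are not able to reproduce the failures in the following 2 tests: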
@onecatcn |
test_communicator_half_async Fix. |

We attempted to reproduce it in an A100 environment but were not successful. Could you please confirm whether a fix has already been merged?
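test_semi_auto_parallel_hybrid_strategy: retested locally; the release/2.6 branch may hit a timeout. The Docker container needs a sufficiently large shared_memory setting, otherwise NCCL communication may report errors.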
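For the shared-memory point above, a rough sketch of checking how much shared memory the container actually has before running the test could look like this. The 8 GiB threshold is purely an assumption for illustration (the thread does not state a required size), and `shm_total_kb` is a made-up helper name.

```shell
# Hypothetical helper: read `df -Pk` output on stdin and print the total
# size in KiB (second column of the first data row).
shm_total_kb() {
  awk 'NR == 2 { print $2 }'
}

# Assumed threshold of 8 GiB, expressed in KiB; not a value from the thread.
MIN_SHM_KB=$((8 * 1024 * 1024))

if [ -d /dev/shm ]; then
  total_kb=$(df -Pk /dev/shm | shm_total_kb)
  if [ "${total_kb:-0}" -lt "$MIN_SHM_KB" ]; then
    echo "/dev/shm is ${total_kb} KiB; consider a larger --shm-size for the container"
  else
    echo "/dev/shm is ${total_kb} KiB"
  fi
fi
```

If the reported size is small, the container can be restarted with a larger allocation through Docker's `--shm-size` option, for example `docker run --shm-size=8g ...` (again, the value is only an example).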
After discussion, the |
Describe the Bug
Running the unit tests on Ampere or Hopper GPUs produces multiple errors.
So far, 24 errors have been compiled:
PaddlePaddle 2.6.0 buglist - part 1.xlsx
Additional Supplementary Information
Paddle version: 2.6.0
Paddle With CUDA: True
OS: ubuntu 22.04
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: N/A
CMake version: version 3.25.1
Libc version: glibc 2.35
Python version: 3.10.12
CUDA version: 12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
cuDNN version: 8.9.7
Nvidia driver version: 535.129.03
Nvidia driver List:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB
GPU 4: Tesla V100-SXM2-16GB
GPU 5: Tesla V100-SXM2-16GB
GPU 6: Tesla V100-SXM2-16GB
GPU 7: Tesla V100-SXM2-16GB