[Build] CUDA 12.5 Build ERROR in MOE/cutlass for sm=90 #20924

tianleiwu · 2024-06-04T22:57:14Z

Describe the issue

I tried to build on Windows with CUDA 12.5 for sm=90. The build fails with errors in the cutlass headers while compiling the MOE kernels.

Urgency

None

Target platform

Windows 11

Build script

build.bat --cmake_generator "Visual Studio 17 2022" --config Release --build_wheel --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=61;70;75;80;90" --parallel --build_shared_lib --use_cuda --cuda_version "12.5" --cuda_home "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5" --cudnn_home "C:\CuDNN\9.1.1.17_cuda12" --use_tensorrt --tensorrt_home "C:\TensorRT\10.0.1.6.cuda-12.4"

Error / output

E:\git\onnxruntime\build\Windows\Release>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\bin\nvcc.exe" --use-local-env -ccbin "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Tools\MSVC\14.39.33519\bin\HostX64\x64" -x cu -I"E:\git\onnxruntime\build\Windows\Release_deps\utf8_range-src" -IE:\git\onnxruntime\include\onnxruntime -IE:
git\onnxruntime\include\onnxruntime\core\session -I"E:\git\onnxruntime\build\Windows\Release_deps\pytorch_cpuinfo-src\include" -IE:\git\onnxruntime\build\Windows\Release -IE:\git\onnxruntime\onnxruntime -I"E:\git\onnxruntime\build\Windows\Release_deps\abseil_cpp-src" -I"E:\git\onnxruntime\build\Windows\Release_deps\safeint-src" -I"E:\git\onnxruntime\bui
ld\Windows\Release_deps\gsl-src\include" -I"E:\git\onnxruntime\build\Windows\Release_deps\date-src\include" -I"E:\git\onnxruntime\build\Windows\Release_deps\onnx-src" -I"E:\git\onnxruntime\build\Windows\Release_deps\onnx-build" -I"E:\git\onnxruntime\build\Windows\Release_deps\protobuf-src\src" -I"E:\git\onnxruntime\build\Windows\Release_deps\flatbuff
ers-src\include" -IC:\CuDNN\9.1.1.17_cuda12\include -I"E:\git\onnxruntime\build\Windows\Release_deps\cutlass-src\include" -I"E:\git\onnxruntime\build\Windows\Release_deps\cutlass-src\examples" -I"E:\git\onnxruntime\build\Windows\Release_deps\cutlass-src\tools\util\include" -I"E:\git\onnxruntime\build\Windows\Release_deps\eigen-src" -I"C:\TensorRT\10.0.
1.6.cuda-12.4\include" -I"E:\git\onnxruntime\build\Windows\Release_deps\mp11-src\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\include" --keep-dir onnxrunt.2968DD78\x64\Release -maxrregcount=0 --machine 64 --compile -cudart shared --expt-relaxed-constexpr --
Werror default-stream-launch -Xcudafe --diag_suppress=bad_friend_decl -Xcudafe --diag_suppress=unsigned_compare_with_zero -Xcudafe --diag_suppress=expr_has_no_effect -include algorithm -std=c++17 --generate-code=arch=compute_61,code=[compute_61,sm_61] --generate-code=arch=compute_70,code=[compute_70,sm_70] --generate-code=arch=compute_75,code=[compute_75,s
m_75] --generate-code=arch=compute_80,code=[compute_80,sm_80] --generate-code=arch=compute_90,code=[compute_90,sm_90] -Xcudafe --diag_suppress=conversion_function_not_usable --threads 1 -Werror all-warnings -Xcompiler="/EHsc -Ob2 -Zi /utf-8 /sdl /experimental:external /external:W0 /external:templates- /external:IE:/git/onnxruntime/cmake /external:IE:/git/o
nnxruntime/build/Windows/Release /wd4251 /wd4201 /wd4324 /wd5054 /w15038 /wd4251 /wd4201 /wd4324 /wd5054 /w15038 /wd4834 /wd4127" -D_WINDOWS -DNDEBUG -DCPUINFO_SUPPORTED_PLATFORM=1 -DEIGEN_USE_THREADS -DDISABLE_CUSPARSE_DEPRECATED -DPLATFORM_WINDOWS -DNOGDI -DNOMINMAX -D_USE_MATH_DEFINES -D_SILENCE_ALL_CXX17_DEPRECATION_WARNINGS -DUSE_CUDA=1 -DUSE_MEMORY
_EFFICIENT_ATTENTION=1 -DUSE_TENSORRT=1 -DONLY_C_LOCALE=0 -DONNX_NAMESPACE=onnx -DONNX_ML=1 -DONNX_USE_LITE_PROTO=1 -D__ONNX_NO_DOC_STRINGS -DWIN32_LEAN_AND_MEAN -DORT_ENABLE_STREAM -DEIGEN_MPL2_ONLY -DEIGEN_HAS_CONSTEXPR -DEIGEN_HAS_VARIADIC_TEMPLATES -DEIGEN_HAS_CXX11_MATH -DEIGEN_HAS_CXX11_ATOMIC -DEIGEN_STRONG_INLINE=inline -D_SILENCE_EXPERIMENTAL_FILE
SYSTEM_DEPRECATION_WARNING=1 -D"CMAKE_INTDIR="Release"" -Donnxruntime_providers_cuda_EXPORTS -D_WINDLL -D_MBCS -DEIGEN_HAS_C99_MATH -DCPUINFO_SUPPORTED -DNDEBUG -DCPUINFO_SUPPORTED_PLATFORM=1 -DEIGEN_USE_THREADS -DDISABLE_CUSPARSE_DEPRECATED -DPLATFORM_WINDOWS -DNOGDI -DNOMINMAX -D_USE_MATH_DEFINES -D_SILENCE_ALL_CXX17_DEPRECATION_WARNINGS -DUSE_CUDA=1 -
DUSE_MEMORY_EFFICIENT_ATTENTION=1 -DUSE_TENSORRT=1 -DONLY_C_LOCALE=0 -DONNX_NAMESPACE=onnx -DONNX_ML=1 -DONNX_USE_LITE_PROTO=1 -D__ONNX_NO_DOC_STRINGS -DWIN32_LEAN_AND_MEAN -DORT_ENABLE_STREAM -DEIGEN_MPL2_ONLY -DEIGEN_HAS_CONSTEXPR -DEIGEN_HAS_VARIADIC_TEMPLATES -DEIGEN_HAS_CXX11_MATH -DEIGEN_HAS_CXX11_ATOMIC -DEIGEN_STRONG_INLINE=inline -D_SILENCE_EXPERI
MENTAL_FILESYSTEM_DEPRECATION_WARNING=1 -D"CMAKE_INTDIR="Release"" -Donnxruntime_providers_cuda_EXPORTS -Xcompiler "/EHsc /W4 /nologo /O2 /FS /MD /GR" -Xcompiler "/Fdonnxruntime_providers_cuda.dir\Release\vc143.pdb" -o onnxruntime_providers_cuda.dir\Release\moe_gemm_kernels_fp32_fp32.obj "E:\git\onnxruntime\onnxruntime\contrib_ops\cuda\moe\ft_moe\moe_g
emm_kernels_fp32_fp32.cu"
E:\git\onnxruntime\build\Windows\Release_deps\cutlass-src\include\cute/arch/cluster_sm90.hpp(101): error #940-D: missing return statement at end of non-void function "cute::cluster_grid_dims" [E:\git\onnxruntime\build\Windows\Release\onnxruntime_providers_cuda.vcxproj]
}
^

Remark: The warnings can be suppressed with "-diag-suppress "

E:\git\onnxruntime\build\Windows\Release_deps\cutlass-src\include\cute/arch/cluster_sm90.hpp(120): error #940-D: missing return statement at end of non-void function "cute::cluster_id_in_grid" [E:\git\onnxruntime\build\Windows\Release\onnxruntime_providers_cuda.vcxproj]
}
^

E:\git\onnxruntime\build\Windows\Release_deps\cutlass-src\include\cute/arch/cluster_sm90.hpp(101): error #940-D: missing return statement at end of non-void function "cute::cluster_grid_dims" [E:\git\onnxruntime\build\Windows\Release\onnxruntime_providers_cuda.vcxproj]
}
^

Remark: The warnings can be suppressed with "-diag-suppress "

E:\git\onnxruntime\build\Windows\Release_deps\cutlass-src\include\cute/arch/cluster_sm90.hpp(120): error #940-D: missing return statement at end of non-void function "cute::cluster_id_in_grid" [E:\git\onnxruntime\build\Windows\Release\onnxruntime_providers_cuda.vcxproj]
}
^

E:\git\onnxruntime\build\Windows\Release_deps\cutlass-src\include\cute/arch/cluster_sm90.hpp(101): error #940-D: missing return statement at end of non-void function "cute::cluster_grid_dims" [E:\git\onnxruntime\build\Windows\Release\onnxruntime_providers_cuda.vcxproj]
}
^

Remark: The warnings can be suppressed with "-diag-suppress "

E:\git\onnxruntime\build\Windows\Release_deps\cutlass-src\include\cute/arch/cluster_sm90.hpp(120): error #940-D: missing return statement at end of non-void function "cute::cluster_id_in_grid" [E:\git\onnxruntime\build\Windows\Release\onnxruntime_providers_cuda.vcxproj]
}
^

2 errors detected in the compilation of "E:/git/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_fp16.cu".
moe_gemm_kernels_fp16_fp16.cu
2 errors detected in the compilation of "E:/git/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp16_uint4.cu".
C:\Program Files\Microsoft Visual Studio\2022\Enterprise\MSBuild\Microsoft\VC\v170\BuildCustomizations\CUDA 12.5.targets(799,9): error MSB3721: The command ""C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\bin\nvcc.exe" --use-local-env -ccbin "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Tools\MSVC\14.39.33519\bin\HostX64\x64" -x cu
-I"E:\git\onnxruntime\build\Windows\Release_deps\utf8_range-src" -IE:\git\onnxruntime\include\onnxruntime -IE:\git\onnxruntime\include\onnxruntime\core\session -I"E:\git\onnxruntime\build\Windows\Release_deps\pytorch_cpuinfo-src\include" -IE:\git\onnxruntime\build\Windows\Release -IE:\git\onnxruntime\onnxruntime -I"E:\git\onnxruntime\build\Windows\Releas
e_deps\abseil_cpp-src" -I"E:\git\onnxruntime\build\Windows\Release_deps\safeint-src" -I"E:\git\onnxruntime\build\Windows\Release_deps\gsl-src\include" -I"E:\git\onnxruntime\build\Windows\Release_deps\date-src\include" -I"E:\git\onnxruntime\build\Windows\Release_deps\onnx-src" -I"E:\git\onnxruntime\build\Windows\Release_deps\onnx-build" -I"E:\git\onnxru
ntime\build\Windows\Release_deps\protobuf-src\src" -I"E:\git\onnxruntime\build\Windows\Release_deps\flatbuffers-src\include" -IC:\CuDNN\9.1.1.17_cuda12\include -I"E:\git\onnxruntime\build\Windows\Release_deps\cutlass-src\include" -I"E:\git\onnxruntime\build\Windows\Release_deps\cutlass-src\examples" -I"E:\git\onnxruntime\build\Windows\Release_deps\cutla
ss-src\tools\util\include" -I"E:\git\onnxruntime\build\Windows\Release_deps\eigen-src" -I"C:\TensorRT\10.0.1.6.cuda-12.4\include" -I"E:\git\onnxruntime\build\Windows\Release_deps\mp11-src\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\include" --keep-dir onnxrunt
.2968DD78\x64\Release -maxrregcount=0 --machine 64 --compile -cudart shared --expt-relaxed-constexpr --Werror default-stream-launch -Xcudafe --diag_suppress=bad_friend_decl -Xcudafe --diag_suppress=unsigned_compare_with_zero -Xcudafe --diag_suppress=expr_has_no_effect -include algorithm -std=c++17 --generate-code=arch=compute_61,code=[compute_61,sm_61] --
generate-code=arch=compute_70,code=[compute_70,sm_70] --generate-code=arch=compute_75,code=[compute_75,sm_75] --generate-code=arch=compute_80,code=[compute_80,sm_80] --generate-code=arch=compute_90,code=[compute_90,sm_90] -Xcudafe --diag_suppress=conversion_function_not_usable --threads 1 -Werror all-warnings -Xcompiler="/EHsc -Ob2 -Zi /utf-8 /sdl /experimen
tal:external /external:W0 /external:templates- /external:IE:/git/onnxruntime/cmake /external:IE:/git/onnxruntime/build/Windows/Release /wd4251 /wd4201 /wd4324 /wd5054 /w15038 /wd4251 /wd4201 /wd4324 /wd5054 /w15038 /wd4834 /wd4127" -D_WINDOWS -DNDEBUG -DCPUINFO_SUPPORTED_PLATFORM=1 -DEIGEN_USE_THREADS -DDISABLE_CUSPARSE_DEPRECATED -DPLATFORM_WINDOWS -DNOGD
I -DNOMINMAX -D_USE_MATH_DEFINES -D_SILENCE_ALL_CXX17_DEPRECATION_WARNINGS -DUSE_CUDA=1 -DUSE_MEMORY_EFFICIENT_ATTENTION=1 -DUSE_TENSORRT=1 -DONLY_C_LOCALE=0 -DONNX_NAMESPACE=onnx -DONNX_ML=1 -DONNX_USE_LITE_PROTO=1 -D__ONNX_NO_DOC_STRINGS -DWIN32_LEAN_AND_MEAN -DORT_ENABLE_STREAM -DEIGEN_MPL2_ONLY -DEIGEN_HAS_CONSTEXPR -DEIGEN_HAS_VARIADIC_TEMPLATES -DEIGEN
HAS_CXX11_MATH -DEIGEN_HAS_CXX11_ATOMIC -DEIGEN_STRONG_INLINE=inline -D_SILENCE_EXPERIMENTAL_FILESYSTEM_DEPRECATION_WARNING=1 -D"CMAKE_INTDIR="Release"" -Donnxruntime_providers_cuda_EXPORTS -D_WINDLL -D_MBCS -DEIGEN_HAS_C99_MATH -DCPUINFO_SUPPORTED -DNDEBUG -DCPUINFO_SUPPORTED_PLATFORM=1 -DEIGEN_USE_THREADS -DDISABLE_CUSPARSE_DEPRECATED -DPLATFORM_WINDOWS
-DNOGDI -DNOMINMAX -D_USE_MATH_DEFINES -D_SILENCE_ALL_CXX17_DEPRECATION_WARNINGS -DUSE_CUDA=1 -DUSE_MEMORY_EFFICIENT_ATTENTION=1 -DUSE_TENSORRT=1 -DONLY_C_LOCALE=0 -DONNX_NAMESPACE=onnx -DONNX_ML=1 -DONNX_USE_LITE_PROTO=1 -D__ONNX_NO_DOC_STRINGS -DWIN32_LEAN_AND_MEAN -DORT_ENABLE_STREAM -DEIGEN_MPL2_ONLY -DEIGEN_HAS_CONSTEXPR -DEIGEN_HAS_VARIADIC_TEMPLATES
-DEIGEN_HAS_CXX11_MATH -DEIGEN_HAS_CXX11_ATOMIC -DEIGEN_STRONG_INLINE=inline -D_SILENCE_EXPERIMENTAL_FILESYSTEM_DEPRECATION_WARNING=1 -D"CMAKE_INTDIR="Release"" -Donnxruntime_providers_cuda_EXPORTS -Xcompiler "/EHsc /W4 /nologo /O2 /FS /MD /GR" -Xcompiler "/Fdonnxruntime_providers_cuda.dir\Release\vc143.pdb" -o onnxruntime_providers_cuda.dir\Release\moe
gemm_kernels_fp16_fp16.obj "E:\git\onnxruntime\onnxruntime\contrib_ops\cuda\moe\ft_moe\moe_gemm_kernels_fp16_fp16.cu"" exited with code 2. [E:\git\onnxruntime\build\Windows\Release\onnxruntime_providers_cuda.vcxproj]
moe_gemm_kernels_fp16_uint4.cu
C:\Program Files\Microsoft Visual Studio\2022\Enterprise\MSBuild\Microsoft\VC\v170\BuildCustomizations\CUDA 12.5.targets(799,9): error MSB3721: The command ""C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\bin\nvcc.exe" --use-local-env -ccbin "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Tools\MSVC\14.39.33519\bin\HostX64\x64" -x cu
-I"E:\git\onnxruntime\build\Windows\Release_deps\utf8_range-src" -IE:\git\onnxruntime\include\onnxruntime -IE:\git\onnxruntime\include\onnxruntime\core\session -I"E:\git\onnxruntime\build\Windows\Release_deps\pytorch_cpuinfo-src\include" -IE:\git\onnxruntime\build\Windows\Release -IE:\git\onnxruntime\onnxruntime -I"E:\git\onnxruntime\build\Windows\Releas
e_deps\abseil_cpp-src" -I"E:\git\onnxruntime\build\Windows\Release_deps\safeint-src" -I"E:\git\onnxruntime\build\Windows\Release_deps\gsl-src\include" -I"E:\git\onnxruntime\build\Windows\Release_deps\date-src\include" -I"E:\git\onnxruntime\build\Windows\Release_deps\onnx-src" -I"E:\git\onnxruntime\build\Windows\Release_deps\onnx-build" -I"E:\git\onnxru
ntime\build\Windows\Release_deps\protobuf-src\src" -I"E:\git\onnxruntime\build\Windows\Release_deps\flatbuffers-src\include" -IC:\CuDNN\9.1.1.17_cuda12\include -I"E:\git\onnxruntime\build\Windows\Release_deps\cutlass-src\include" -I"E:\git\onnxruntime\build\Windows\Release_deps\cutlass-src\examples" -I"E:\git\onnxruntime\build\Windows\Release_deps\cutla
ss-src\tools\util\include" -I"E:\git\onnxruntime\build\Windows\Release_deps\eigen-src" -I"C:\TensorRT\10.0.1.6.cuda-12.4\include" -I"E:\git\onnxruntime\build\Windows\Release_deps\mp11-src\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\include" --keep-dir onnxrunt
.2968DD78\x64\Release -maxrregcount=0 --machine 64 --compile -cudart shared --expt-relaxed-constexpr --Werror default-stream-launch -Xcudafe --diag_suppress=bad_friend_decl -Xcudafe --diag_suppress=unsigned_compare_with_zero -Xcudafe --diag_suppress=expr_has_no_effect -include algorithm -std=c++17 --generate-code=arch=compute_61,code=[compute_61,sm_61] --
generate-code=arch=compute_70,code=[compute_70,sm_70] --generate-code=arch=compute_75,code=[compute_75,sm_75] --generate-code=arch=compute_80,code=[compute_80,sm_80] --generate-code=arch=compute_90,code=[compute_90,sm_90] -Xcudafe --diag_suppress=conversion_function_not_usable --threads 1 -Werror all-warnings -Xcompiler="/EHsc -Ob2 -Zi /utf-8 /sdl /experimen
tal:external /external:W0 /external:templates- /external:IE:/git/onnxruntime/cmake /external:IE:/git/onnxruntime/build/Windows/Release /wd4251 /wd4201 /wd4324 /wd5054 /w15038 /wd4251 /wd4201 /wd4324 /wd5054 /w15038 /wd4834 /wd4127" -D_WINDOWS -DNDEBUG -DCPUINFO_SUPPORTED_PLATFORM=1 -DEIGEN_USE_THREADS -DDISABLE_CUSPARSE_DEPRECATED -DPLATFORM_WINDOWS -DNOGD
I -DNOMINMAX -D_USE_MATH_DEFINES -D_SILENCE_ALL_CXX17_DEPRECATION_WARNINGS -DUSE_CUDA=1 -DUSE_MEMORY_EFFICIENT_ATTENTION=1 -DUSE_TENSORRT=1 -DONLY_C_LOCALE=0 -DONNX_NAMESPACE=onnx -DONNX_ML=1 -DONNX_USE_LITE_PROTO=1 -D__ONNX_NO_DOC_STRINGS -DWIN32_LEAN_AND_MEAN -DORT_ENABLE_STREAM -DEIGEN_MPL2_ONLY -DEIGEN_HAS_CONSTEXPR -DEIGEN_HAS_VARIADIC_TEMPLATES -DEIGEN
HAS_CXX11_MATH -DEIGEN_HAS_CXX11_ATOMIC -DEIGEN_STRONG_INLINE=inline -D_SILENCE_EXPERIMENTAL_FILESYSTEM_DEPRECATION_WARNING=1 -D"CMAKE_INTDIR="Release"" -Donnxruntime_providers_cuda_EXPORTS -D_WINDLL -D_MBCS -DEIGEN_HAS_C99_MATH -DCPUINFO_SUPPORTED -DNDEBUG -DCPUINFO_SUPPORTED_PLATFORM=1 -DEIGEN_USE_THREADS -DDISABLE_CUSPARSE_DEPRECATED -DPLATFORM_WINDOWS
-DNOGDI -DNOMINMAX -D_USE_MATH_DEFINES -D_SILENCE_ALL_CXX17_DEPRECATION_WARNINGS -DUSE_CUDA=1 -DUSE_MEMORY_EFFICIENT_ATTENTION=1 -DUSE_TENSORRT=1 -DONLY_C_LOCALE=0 -DONNX_NAMESPACE=onnx -DONNX_ML=1 -DONNX_USE_LITE_PROTO=1 -D__ONNX_NO_DOC_STRINGS -DWIN32_LEAN_AND_MEAN -DORT_ENABLE_STREAM -DEIGEN_MPL2_ONLY -DEIGEN_HAS_CONSTEXPR -DEIGEN_HAS_VARIADIC_TEMPLATES
-DEIGEN_HAS_CXX11_MATH -DEIGEN_HAS_CXX11_ATOMIC -DEIGEN_STRONG_INLINE=inline -D_SILENCE_EXPERIMENTAL_FILESYSTEM_DEPRECATION_WARNING=1 -D"CMAKE_INTDIR="Release"" -Donnxruntime_providers_cuda_EXPORTS -Xcompiler "/EHsc /W4 /nologo /O2 /FS /MD /GR" -Xcompiler "/Fdonnxruntime_providers_cuda.dir\Release\vc143.pdb" -o onnxruntime_providers_cuda.dir\Release\moe
gemm_kernels_fp16_uint4.obj "E:\git\onnxruntime\onnxruntime\contrib_ops\cuda\moe\ft_moe\moe_gemm_kernels_fp16_uint4.cu"" exited with code 2. [E:\git\onnxruntime\build\Windows\Release\onnxruntime_providers_cuda.vcxproj]
2 errors detected in the compilation of "E:/git/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp32_fp32.cu".
moe_gemm_kernels_fp32_fp32.cu
C:\Program Files\Microsoft Visual Studio\2022\Enterprise\MSBuild\Microsoft\VC\v170\BuildCustomizations\CUDA 12.5.targets(799,9): error MSB3721: The command ""C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\bin\nvcc.exe" --use-local-env -ccbin "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Tools\MSVC\14.39.33519\bin\HostX64\x64" -x cu
-I"E:\git\onnxruntime\build\Windows\Release_deps\utf8_range-src" -IE:\git\onnxruntime\include\onnxruntime -IE:\git\onnxruntime\include\onnxruntime\core\session -I"E:\git\onnxruntime\build\Windows\Release_deps\pytorch_cpuinfo-src\include" -IE:\git\onnxruntime\build\Windows\Release -IE:\git\onnxruntime\onnxruntime -I"E:\git\onnxruntime\build\Windows\Releas
e_deps\abseil_cpp-src" -I"E:\git\onnxruntime\build\Windows\Release_deps\safeint-src" -I"E:\git\onnxruntime\build\Windows\Release_deps\gsl-src\include" -I"E:\git\onnxruntime\build\Windows\Release_deps\date-src\include" -I"E:\git\onnxruntime\build\Windows\Release_deps\onnx-src" -I"E:\git\onnxruntime\build\Windows\Release_deps\onnx-build" -I"E:\git\onnxru
ntime\build\Windows\Release_deps\protobuf-src\src" -I"E:\git\onnxruntime\build\Windows\Release_deps\flatbuffers-src\include" -IC:\CuDNN\9.1.1.17_cuda12\include -I"E:\git\onnxruntime\build\Windows\Release_deps\cutlass-src\include" -I"E:\git\onnxruntime\build\Windows\Release_deps\cutlass-src\examples" -I"E:\git\onnxruntime\build\Windows\Release_deps\cutla
ss-src\tools\util\include" -I"E:\git\onnxruntime\build\Windows\Release_deps\eigen-src" -I"C:\TensorRT\10.0.1.6.cuda-12.4\include" -I"E:\git\onnxruntime\build\Windows\Release_deps\mp11-src\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\include" --keep-dir onnxrunt
.2968DD78\x64\Release -maxrregcount=0 --machine 64 --compile -cudart shared --expt-relaxed-constexpr --Werror default-stream-launch -Xcudafe --diag_suppress=bad_friend_decl -Xcudafe --diag_suppress=unsigned_compare_with_zero -Xcudafe --diag_suppress=expr_has_no_effect -include algorithm -std=c++17 --generate-code=arch=compute_61,code=[compute_61,sm_61] --
generate-code=arch=compute_70,code=[compute_70,sm_70] --generate-code=arch=compute_75,code=[compute_75,sm_75] --generate-code=arch=compute_80,code=[compute_80,sm_80] --generate-code=arch=compute_90,code=[compute_90,sm_90] -Xcudafe --diag_suppress=conversion_function_not_usable --threads 1 -Werror all-warnings -Xcompiler="/EHsc -Ob2 -Zi /utf-8 /sdl /experimen
tal:external /external:W0 /external:templates- /external:IE:/git/onnxruntime/cmake /external:IE:/git/onnxruntime/build/Windows/Release /wd4251 /wd4201 /wd4324 /wd5054 /w15038 /wd4251 /wd4201 /wd4324 /wd5054 /w15038 /wd4834 /wd4127" -D_WINDOWS -DNDEBUG -DCPUINFO_SUPPORTED_PLATFORM=1 -DEIGEN_USE_THREADS -DDISABLE_CUSPARSE_DEPRECATED -DPLATFORM_WINDOWS -DNOGD
I -DNOMINMAX -D_USE_MATH_DEFINES -D_SILENCE_ALL_CXX17_DEPRECATION_WARNINGS -DUSE_CUDA=1 -DUSE_MEMORY_EFFICIENT_ATTENTION=1 -DUSE_TENSORRT=1 -DONLY_C_LOCALE=0 -DONNX_NAMESPACE=onnx -DONNX_ML=1 -DONNX_USE_LITE_PROTO=1 -D__ONNX_NO_DOC_STRINGS -DWIN32_LEAN_AND_MEAN -DORT_ENABLE_STREAM -DEIGEN_MPL2_ONLY -DEIGEN_HAS_CONSTEXPR -DEIGEN_HAS_VARIADIC_TEMPLATES -DEIGEN
HAS_CXX11_MATH -DEIGEN_HAS_CXX11_ATOMIC -DEIGEN_STRONG_INLINE=inline -D_SILENCE_EXPERIMENTAL_FILESYSTEM_DEPRECATION_WARNING=1 -D"CMAKE_INTDIR="Release"" -Donnxruntime_providers_cuda_EXPORTS -D_WINDLL -D_MBCS -DEIGEN_HAS_C99_MATH -DCPUINFO_SUPPORTED -DNDEBUG -DCPUINFO_SUPPORTED_PLATFORM=1 -DEIGEN_USE_THREADS -DDISABLE_CUSPARSE_DEPRECATED -DPLATFORM_WINDOWS
-DNOGDI -DNOMINMAX -D_USE_MATH_DEFINES -D_SILENCE_ALL_CXX17_DEPRECATION_WARNINGS -DUSE_CUDA=1 -DUSE_MEMORY_EFFICIENT_ATTENTION=1 -DUSE_TENSORRT=1 -DONLY_C_LOCALE=0 -DONNX_NAMESPACE=onnx -DONNX_ML=1 -DONNX_USE_LITE_PROTO=1 -D__ONNX_NO_DOC_STRINGS -DWIN32_LEAN_AND_MEAN -DORT_ENABLE_STREAM -DEIGEN_MPL2_ONLY -DEIGEN_HAS_CONSTEXPR -DEIGEN_HAS_VARIADIC_TEMPLATES
-DEIGEN_HAS_CXX11_MATH -DEIGEN_HAS_CXX11_ATOMIC -DEIGEN_STRONG_INLINE=inline -D_SILENCE_EXPERIMENTAL_FILESYSTEM_DEPRECATION_WARNING=1 -D"CMAKE_INTDIR="Release"" -Donnxruntime_providers_cuda_EXPORTS -Xcompiler "/EHsc /W4 /nologo /O2 /FS /MD /GR" -Xcompiler "/Fdonnxruntime_providers_cuda.dir\Release\vc143.pdb" -o onnxruntime_providers_cuda.dir\Release\moe
gemm_kernels_fp32_fp32.obj "E:\git\onnxruntime\onnxruntime\contrib_ops\cuda\moe\ft_moe\moe_gemm_kernels_fp32_fp32.cu"" exited with code 2. [E:\git\onnxruntime\build\Windows\Release\onnxruntime_providers_cuda.vcxproj]
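
Note on the diagnostic: `#940-D` is a warning by default, but the nvcc command line above passes `-Werror all-warnings`, so it is reported as an error. The log does not show the body of `cute::cluster_grid_dims`, so the following is only a minimal sketch of how a non-void function can end up without a return on one preprocessor path. The names `Dim3Sketch`, `SKETCH_DEVICE_PATH`, and `grid_dims_fixed` are hypothetical and are not taken from cutlass.

```cpp
// Hypothetical sketch; names below are illustrative, not taken from cutlass.
struct Dim3Sketch {
  unsigned x, y, z;
};

// Problematic shape (left as a comment so this file builds with warnings as errors):
//
//   Dim3Sketch grid_dims_bad() {
//   #if defined(SKETCH_DEVICE_PATH)
//     return {1u, 1u, 1u};
//   #endif
//   }  // no return statement when SKETCH_DEVICE_PATH is undefined
//
// Fixed shape: every preprocessor branch ends in a return statement.
inline Dim3Sketch grid_dims_fixed() {
#if defined(SKETCH_DEVICE_PATH)
  return {1u, 1u, 1u};
#else
  return {0u, 0u, 0u};  // placeholder value for the path that lacked a return
#endif
}
```

With `SKETCH_DEVICE_PATH` undefined, `grid_dims_fixed()` returns the zero placeholder instead of falling off the end of the function. This only illustrates the class of error; it is not the change that was made to cutlass.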

Visual Studio Version

Enterprise 2022 (64-bit) 17.9.7

GCC / Compiler Version

No response

@wangyems

### Description

Upgrade cutlass to 3.5 to fix build errors using CUDA 12.4 or 12.5 in Windows

- [x] Upgrade cutlass to 3.5.0.
- [x] Fix flash attention build error with latest cutlass header files and APIs. This fix is provided by @wangyems.
- [x] Update efficient attention to use new cutlass fmha interface.
- [x] Patch cutlass to fix `hrsqrt` not found error for sm < 53.
- [x] Disable TF32 Staged Accumulation to fix blkq4_fp16_gemm_sm80_test build error for cuda 11.8 to 12.3.
- [x] Disable TRT 10 deprecate warnings.

The following are not included in this PR:

* TRT provider replaces the deprecated APIs.
* Fix blkq4_fp16_gemm_sm80_test build error for cuda 12.4 or 12.5. This test is not built by default unless you add `--cmake_extra_defines onnxruntime_ENABLE_CUDA_EP_INTERNAL_TESTS=ON` in build command.

To integrate to rel-1.18.1: Either bring in other changes (like onnx 1.16.1), or generate manifest and upload a new ONNX Runtime Build Time Deps artifact based on rel-1.18.1.

### Motivation and Context

#19891 #20924 #20953

tianleiwu added the build (build issues; typically submitted using template) label Jun 4, 2024

github-actions bot added the ep:CUDA (issues related to the CUDA execution provider), ep:TensorRT (issues related to TensorRT execution provider), and platform:windows (issues related to the Windows platform) labels Jun 4, 2024

tianleiwu mentioned this issue Jun 5, 2024

[CUDA] upgrade cutlass to 3.5.0 #20940

Merged

6 tasks

tianleiwu closed this as completed Jun 11, 2024
