
[BUG] Unable to build against CUDA 12.4 without #1403

Closed
mc-nv opened this issue Mar 14, 2024 · 14 comments
Labels: bug (Something isn't working), inactive-30d

Comments


mc-nv commented Mar 14, 2024

Describe the bug
Unable to compile cutlass source code against CUDA 12.4

Steps/Code to reproduce bug

PS C:\workspace\cutlass> history

  Id CommandLine
  -- -----------
   1 md workspace
   2 cd workspace
   3 git clone v3.1.0 https://github.com/NVIDIA/cutlass.git
   4 cd cutlass
   5 cmake -B build
   6 cmake --build build

Getting error:

error #940-D: missing return statement at end of non-void function "cute::cluster_grid_dims" 
error #940-D: missing return statement at end of non-void function "cute::cluster_id_in_grid" 
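
(For context, diagnostic #940-D points at a non-void function whose return statements all sit inside preprocessor branches, so some build configurations can fall off the end of the body. A minimal sketch of that pattern, with made-up names rather than the actual cute source:)

    #include <cuda_runtime.h>   // for dim3

    // Hypothetical illustration only -- not the real cute::cluster_grid_dims.
    __host__ __device__ inline dim3 cluster_dims_sketch() {
    #if defined(SKETCH_SM90_CLUSTER_PATH)   // made-up feature macro
      return dim3(2, 1, 1);                 // "real" path on supported targets
    #else
      // If this fallback return is omitted, nvcc reports:
      //   #940-D: missing return statement at end of non-void function
      return dim3(0, 0, 0);
    #endif
    }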

Expected behavior
Should compile without issues.

Environment details (please complete the following information):
Docker, Bare metal
BUILDTOOLS_VERSION:17.9.34622.214
CMAKE_VERSION:3.27.1
CUDA_VERSION:12.4.0
CUDNN_VERSION:9.0.0.312
PYTHON_VERSION:3.8.10
TENSORRT_VERSION:8.6.1.6
VCPGK_VERSION:2023.11.20

Additional context
microsoft/onnxruntime#19891

mc-nv added the "? - Needs Triage" and "bug" labels on Mar 14, 2024

thakkarV (Collaborator) commented Mar 14, 2024

Hi! This is for CUTLASS version 3.1, which was released quite a few months ago (before the release of CUDA 12.4). Are you able to repro this with CUTLASS 3.4? It seems like you are also building on Windows, and our support for Windows builds has improved quite a bit since 3.1.
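
(A sketch of how one might retry the same reproduction against a newer tag; v3.4.0 is assumed to be the relevant release tag:)

    git -C C:\workspace\cutlass fetch --tags
    git -C C:\workspace\cutlass checkout v3.4.0
    cmake -S C:\workspace\cutlass -B C:\workspace\cutlass\build
    cmake --build C:\workspace\cutlass\build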

mc-nv (Author) commented Mar 14, 2024

My latest build on Windows within the same environment was stuck for several hours against main, and threw errors like:

C:\workspace\cutlass\include\cute/layout.hpp(746): catastrophic error : out of memory [C:\workspace\cutlass\b
uild\tools\library\cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.vcxproj]
        return bw_coalesce<I-1>(old_shape, old_stride, new_shape, new_stride);
        ^

  1 catastrophic error detected in the compilation of "C:/workspace/cutlass/build/tools/library/cutlass_libra
  ry_gemm_sm90_void_s64x128x16gemm_f16_objs.unity.daa0ddf64e7b.cu".
  Compilation terminated.
  cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.unity.daa0ddf64e7b.cu
C:\BuildTools\MSBuild\Microsoft\VC\v170\BuildCustomizations\CUDA 12.4.targets(799,9): error MSB3721: The comm
and ""C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\bin\nvcc.exe"  --use-local-env -ccbin "C:\Buil
dTools\VC\Tools\MSVC\14.39.33519\bin\HostX64\x64" -x cu   -IC:\workspace\cutlass\include -IC:\workspace\cutla
ss\build\include -I\include -I\examples -IC:\workspace\cutlass\tools\library\include -IC:\workspace\cutlass\t
ools\util\include -IC:\workspace\cutlass\tools\library\src -IC:\workspace\cutlass\build\tools\library\include
 -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\include" -I"C:\Program Files\NVIDIA GPU Computin
g Toolkit\CUDA\v12.4\include"     --keep-dir cutlass_.55939AC6\x64\Debug  -maxrregcount=0   --machine 64 --co
mpile -cudart static -std=c++17 --generate-code=arch=compute_70,code=[sm_70] --generate-code=arch=compute_70,
code=[compute_70] --generate-code=arch=compute_72,code=[sm_72] --generate-code=arch=compute_72,code=[compute_
72] --generate-code=arch=compute_75,code=[sm_75] --generate-code=arch=compute_75,code=[compute_75] --generate
-code=arch=compute_80,code=[sm_80] --generate-code=arch=compute_80,code=[compute_80] --generate-code=arch=com
pute_86,code=[sm_86] --generate-code=arch=compute_86,code=[compute_86] --generate-code=arch=compute_87,code=[
sm_87] --generate-code=arch=compute_87,code=[compute_87] --generate-code=arch=compute_89,code=[sm_89] --gener
ate-code=arch=compute_89,code=[compute_89] --generate-code=arch=compute_90,code=[sm_90] --generate-code=arch=
compute_90,code=[compute_90] --generate-code=arch=compute_90a,code=[sm_90a] --generate-code=arch=compute_90a,
code=[compute_90a] --expt-relaxed-constexpr -Xcompiler="/EHsc /Zc:__cplusplus /bigobj -Zi -Ob0 /wd4819 /fp:st
rict" -g  -D_WINDOWS -DCUTLASS_VERSIONS_GENERATED -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_TEST_LEVEL=0 -
DCUTLASS_TEST_ENABLE_CACHED_RESULTS=1 -DCUTLASS_CONV_UNIT_TEST_RIGOROUS_SIZE_ENABLED=1 -DCUTLASS_DEBUG_TRACE_
LEVEL=0 -D"CMAKE_INTDIR=\"Debug\"" -D_MBCS -D"CMAKE_INTDIR=\"Debug\"" -Xcompiler "/EHsc /W3 /nologo /Od /FS /
Zi /RTC1 /MDd /GR" -Xcompiler "/Fdcutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir\Debug\cutlass_li
brary_gemm_sm90_void_s64x128x16gemm_f16_objs.pdb" -o cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.d
ir\Debug\cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.unity.daa0ddf64e7b.obj "C:\workspace\cutlass\
build\tools\library\cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.unity.daa0ddf64e7b.cu"" exited wit
h code 1. [C:\workspace\cutlass\build\tools\library\cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.vc
xproj]

thakkarV (Collaborator) commented Mar 14, 2024

That's an out-of-memory error, so likely an issue with the compiler or the system used to build the kernels? CC @mhoemmen
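
(If the out-of-memory error is reproducible, one possible workaround sketch is to shrink what each nvcc invocation has to compile. CUTLASS_NVCC_ARCHS and CUTLASS_LIBRARY_KERNELS are documented CUTLASS CMake options, and CUTLASS_UNITY_BUILD_ENABLED controls the unity .cu aggregation visible in the failing target name above; the specific values below are illustrative, not a confirmed fix for this report:)

    cmake -B build -DCUTLASS_NVCC_ARCHS="90a" -DCUTLASS_LIBRARY_KERNELS="cutlass_tensorop_*gemm_f16*" -DCUTLASS_UNITY_BUILD_ENABLED=OFF
    cmake --build build --config Release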

mc-nv (Author) commented Mar 15, 2024

I got the following error output from an ONNX Runtime build trying to engage the "cutlass" as a CMake submodule against CUDA 12.4:

C:\workspace\Release\_deps\cutlass-src\include\cutlass/gemm/threadblock/default_mma_core_sm80.h(2495): error : expression mus
t have a constant value [C:\ort-118-cuda124-trt10-latestCutlass\Release\onnxruntime_providers_cuda.vcxproj]
      static const int LaneN = cutlass::const_min(numElementsB, ThreadTileN);
                               ^
  C:\workspace\Release\_deps\cutlass-src\include\cutlass/gemm/threadblock/default_mma_core_sm80.h(2495): note
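
(A minimal sketch of this failure mode, with made-up names rather than the actual CUTLASS code: an in-class initializer of a static const int must be a constant expression, so it can only call functions the compiler treats as constexpr.)

    // Hypothetical illustration only -- not the real cutlass::const_min.
    constexpr int const_min_ok(int a, int b)  { return a < b ? a : b; }
    inline    int const_min_bad(int a, int b) { return a < b ? a : b; }

    struct ThreadMapSketch {
      static const int LaneN_ok = const_min_ok(8, 4);   // OK: constant expression
      // static const int LaneN_bad = const_min_bad(8, 4);
      // ^ "expression must have a constant value" -- the same diagnostic as above,
      //   because const_min_bad is not constexpr in the compiler's view.
    };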

thakkarV (Collaborator) commented:

CC @hwu36 and @mhoemmen

mhoemmen (Contributor) commented:

@mc-nv Thanks for the error report! We'll take a look at this.


yf711 commented Apr 11, 2024

Hi @mhoemmen, any update on this?

mhoemmen (Contributor) commented:

@yf711 wrote:

I got the following error output from an ONNX Runtime build trying to engage the "cutlass" as a CMake submodule against CUDA 12.4:

I'm not sure what it means to "engage the 'cutlass' as a CMake submodule against CUDA 12.4."

Could you please post all the CMake options that were given to CUTLASS? Without the list of CMake options, it will be a lot harder for us to try to reproduce this.


This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

chilo-ms commented:

Could you please post all the CMake options that were given to CUTLASS? Without the list of CMake options, it will be a lot harder for us to try to reproduce this.

Hi @mhoemmen

ONNX Runtime's CUDA EP adds the cutlass header directories to its CMake target:

    ...
    include(cutlass)
    target_include_directories(${target} PRIVATE ${cutlass_SOURCE_DIR}/include ${cutlass_SOURCE_DIR}/examples ${cutlass_SOURCE_DIR}/tools/util/include)
    ...

https://github.com/microsoft/onnxruntime/blob/main/cmake/external/cutlass.cmake
https://github.com/microsoft/onnxruntime/blob/main/cmake/onnxruntime_providers_cuda.cmake#L214

So some of its header files (e.g., moe_gemm_kernels_template.h) can include cutlass header files.

The compiler errors appeared when building the CUDA EP target.
You can see the compiler flags either in onnxruntime_providers_cuda.cmake linked above or in the command-line log below.

With cutlass 3.1.0:

  • Warning about a missing return statement; it can be worked around with the --compile_no_warning_as_error flag
C:/Users/lochi/repos/onnxruntime/build/Windows/Release/_deps/cutlass-src/include\cute/arch/cluster_sm90.hpp(101): error #940-D: missing return statement at end of non-void function "cute::cluster_grid_dims" [C:\Users\lochi\repos\onnxruntime\build\Windows\Release\onnxruntime_prov
iders_cuda.vcxproj]
    }
    ^

  Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

C:/Users/lochi/repos/onnxruntime/build/Windows/Release/_deps/cutlass-src/include\cute/arch/cluster_sm90.hpp(120): error #940-D: missing return statement at end of non-void function "cute::cluster_id_in_grid" [C:\Users\lochi\repos\onnxruntime\build\Windows\Release\onnxruntime_pro
viders_cuda.vcxproj]
    }
    ^

  2 errors detected in the compilation of "C:/Users/lochi/repos/onnxruntime/onnxruntime/contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_fp32_fp32.cu".
  moe_gemm_kernels_fp32_fp32.cu
C:\Program Files\Microsoft Visual Studio\2022\Enterprise\MSBuild\Microsoft\VC\v170\BuildCustomizations\CUDA 12.4.targets(799,9): error MSB3721: The command ""C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\bin\nvcc.exe"  --use-local-env -ccbin "C:\Program Files\Microsof
t Visual Studio\2022\Enterprise\VC\Tools\MSVC\14.40.33807\bin\HostX64\x64" -x cu   -I"C:\Users\lochi\repos\onnxruntime\build\Windows\Release\_deps\utf8_range-src" -IC:\Users\lochi\repos\onnxruntime\include\onnxruntime -IC:\Users\lochi\repos\onnxruntime\include\onnxruntime\core\s
ession -I"C:\Users\lochi\repos\onnxruntime\build\Windows\Release\_deps\pytorch_cpuinfo-src\include" -IC:\Users\lochi\repos\onnxruntime\build\Windows\Release -IC:\Users\lochi\repos\onnxruntime\onnxruntime -I"C:\Users\lochi\repos\onnxruntime\build\Windows\Release\_deps\abseil_cpp-
src" -I"C:\Users\lochi\repos\onnxruntime\build\Windows\Release\_deps\safeint-src" -I"C:\Users\lochi\repos\onnxruntime\build\Windows\Release\_deps\gsl-src\include" -I"C:\Users\lochi\repos\onnxruntime\build\Windows\Release\_deps\date-src\include" -I"C:\Users\lochi\repos\onnxruntim
e\build\Windows\Release\_deps\onnx-src" -I"C:\Users\lochi\repos\onnxruntime\build\Windows\Release\_deps\onnx-build" -I"C:\Users\lochi\repos\onnxruntime\build\Windows\Release\_deps\protobuf-src\src" -I"C:\Users\lochi\repos\onnxruntime\build\Windows\Release\_deps\flatbuffers-src\i
nclude" -I"C:\Users\lochi\Downloads\cudnn-windows-x86_64-8.9.7.29_cuda12-archive\cudnn-windows-x86_64-8.9.7.29_cuda12-archive\include" -I"C:\Users\lochi\repos\onnxruntime\build\Windows\Release\_deps\cutlass-src\include" -I"C:\Users\lochi\repos\onnxruntime\build\Windows\Release\_
deps\cutlass-src\examples" -I"C:\Users\lochi\repos\onnxruntime\build\Windows\Release\_deps\cutlass-src\tools\util\include" -I"C:\Users\lochi\repos\onnxruntime\build\Windows\Release\_deps\eigen-src" -I"C:\Users\lochi\tensorrt\TensorRT-10.0.1.6.Windows10.win10.cuda-12.4\TensorRT-1
0.0.1.6\include" -I"C:\Users\lochi\repos\onnxruntime\build\Windows\Release\_deps\mp11-src\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\include"     --keep-dir onnxrunt.B5EE5B2F\x64\Relea
se  -maxrregcount=0   --machine 64 --compile -cudart shared --expt-relaxed-constexpr --Werror default-stream-launch -Xcudafe --diag_suppress=bad_friend_decl -Xcudafe --diag_suppress=unsigned_compare_with_zero -Xcudafe --diag_suppress=expr_has_no_effect -include algorithm -std=c+
+17 --generate-code=arch=compute_60,code=[compute_60,sm_60] --generate-code=arch=compute_61,code=[compute_61,sm_61] --generate-code=arch=compute_70,code=[compute_70,sm_70] --generate-code=arch=compute_75,code=[compute_75,sm_75] --generate-code=arch=compute_80,code=[compute_80,sm
_80] --generate-code=arch=compute_86,code=[compute_86,sm_86] --generate-code=arch=compute_90,code=[compute_90,sm_90] -Xcudafe --diag_suppress=conversion_function_not_usable --threads 1 -Werror all-warnings -Xcompiler="/EHsc -Ob2 /utf-8 /sdl /experimental:external /external:W0 /e
xternal:templates- /external:IC:/Users/lochi/repos/onnxruntime/cmake /external:IC:/Users/lochi/repos/onnxruntime/build/Windows/Release /wd4251 /wd4201 /wd4324 /wd5054 /w15038 /wd4251 /wd4201 /wd4324 /wd5054 /w15038 /wd4834 /wd4127"   -D_WINDOWS -DNDEBUG -DCPUINFO_SUPPORTED_PLATF
ORM=1 -DEIGEN_USE_THREADS -DDISABLE_CUSPARSE_DEPRECATED -DPLATFORM_WINDOWS -DNOGDI -DNOMINMAX -D_USE_MATH_DEFINES -D_SILENCE_ALL_CXX17_DEPRECATION_WARNINGS -DUSE_CUDA=1 -DUSE_MEMORY_EFFICIENT_ATTENTION=1 -DUSE_TENSORRT=1 -DONLY_C_LOCALE=0 -DONNX_NAMESPACE=onnx -DONNX_ML=1 -DONNX
_USE_LITE_PROTO=1 -D__ONNX_NO_DOC_STRINGS -DWIN32_LEAN_AND_MEAN -DORT_ENABLE_STREAM -DEIGEN_MPL2_ONLY -DEIGEN_HAS_CONSTEXPR -DEIGEN_HAS_VARIADIC_TEMPLATES -DEIGEN_HAS_CXX11_MATH -DEIGEN_HAS_CXX11_ATOMIC -DEIGEN_STRONG_INLINE=inline -D_SILENCE_EXPERIMENTAL_FILESYSTEM_DEPRECATION_
WARNING=1 -D"CMAKE_INTDIR=\"Release\"" -Donnxruntime_providers_cuda_EXPORTS -D_WINDLL -D_MBCS -DEIGEN_HAS_C99_MATH -DCPUINFO_SUPPORTED -DNDEBUG -DCPUINFO_SUPPORTED_PLATFORM=1 -DEIGEN_USE_THREADS -DDISABLE_CUSPARSE_DEPRECATED -DPLATFORM_WINDOWS -DNOGDI -DNOMINMAX -D_USE_MATH_DEFI
NES -D_SILENCE_ALL_CXX17_DEPRECATION_WARNINGS -DUSE_CUDA=1 -DUSE_MEMORY_EFFICIENT_ATTENTION=1 -DUSE_TENSORRT=1 -DONLY_C_LOCALE=0 -DONNX_NAMESPACE=onnx -DONNX_ML=1 -DONNX_USE_LITE_PROTO=1 -D__ONNX_NO_DOC_STRINGS -DWIN32_LEAN_AND_MEAN -DORT_ENABLE_STREAM -DEIGEN_MPL2_ONLY -DEIGEN_
HAS_CONSTEXPR -DEIGEN_HAS_VARIADIC_TEMPLATES -DEIGEN_HAS_CXX11_MATH -DEIGEN_HAS_CXX11_ATOMIC -DEIGEN_STRONG_INLINE=inline -D_SILENCE_EXPERIMENTAL_FILESYSTEM_DEPRECATION_WARNING=1 -D"CMAKE_INTDIR=\"Release\"" -Donnxruntime_providers_cuda_EXPORTS -Xcompiler "/EHsc /W4 /nologo /O2
/FS   /MD /GR" -Xcompiler "/Fdonnxruntime_providers_cuda.dir\Release\vc143.pdb" -o onnxruntime_providers_cuda.dir\Release\moe_gemm_kernels_fp32_fp32.obj "C:\Users\lochi\repos\onnxruntime\onnxruntime\contrib_ops\cuda\moe\ft_moe\moe_gemm_kernels_fp32_fp32.cu"" exited with code 2.
[C:\Users\lochi\repos\onnxruntime\build\Windows\Release\onnxruntime_providers_cuda.vcxproj]

With cutlass 3.5.0:

  • More build issues that are actual errors (vs. the warnings seen with 3.1.0); using --compile_no_warning_as_error has no effect
C:/Users/lochi/repos/onnxruntime/build/Windows/Release/_deps/cutlass-src/include\cutlass/epilogue/threadblock/output_tile_thread_map.h(242): error : expression must have a constant value [C:\Users\lochi\repos\onnxruntime\build\Windows\Release\onnxruntime_providers_cuda.vcxproj]
        (Detail::kTargetAccessRows > Detail::kShapeRow ?
        ^
  C:/Users/lochi/repos/onnxruntime/build/Windows/Release/_deps/cutlass-src/include\cutlass/epilogue/threadblock/output_tile_thread_map.h(246): note #2703-D: cannot call non-constexpr function "cutlass::const_min" (declared at line 739 of C:/Users/lochi/repos/onnxruntime/build/Wi
  ndows/Release/_deps/cutlass-src/include\cutlass/fast_math.h)
            const_min(kWarpSize, kMemoryAccessSize / (kElementsPerAccess * kElementSize / 8))
            ^
            detected during:
              instantiation of class "cutlass::epilogue::threadblock::detail::RowArrangement<Shape, WarpsRemaining, ElementsPerAccess, ElementSize, true> [with Shape=cutlass::epilogue::threadblock::OutputTileShape<128, 4, 4, 1, 1>, WarpsRemaining=1, ElementsPerAccess=8, ElementS
  ize=16]" at line 365
              instantiation of class "cutlass::epilogue::threadblock::OutputTileOptimalThreadMap<Shape_, Count_, Threads, ElementsPerAccess, ElementSize>::Detail [with Shape_=cutlass::epilogue::threadblock::OutputTileShape<128, 4, 4, 1, 1>, Count_=cutlass::epilogue::threadblock:
  :OutputTileShape<1, 2, 1, 1, 2>, Threads=128, ElementsPerAccess=8, ElementSize=16]" at line 380
              instantiation of class "cutlass::epilogue::threadblock::OutputTileOptimalThreadMap<Shape_, Count_, Threads, ElementsPerAccess, ElementSize> [with Shape_=cutlass::epilogue::threadblock::OutputTileShape<128, 4, 4, 1, 1>, Count_=cutlass::epilogue::threadblock::OutputT
  ileShape<1, 2, 1, 1, 2>, Threads=128, ElementsPerAccess=8, ElementSize=16]" at line 140 of C:/Users/lochi/repos/onnxruntime/build/Windows/Release/_deps/cutlass-src/include\cutlass/epilogue/threadblock/default_epilogue_volta_tensor_op.h
              instantiation of class "cutlass::epilogue::threadblock::DefaultEpilogueVoltaTensorOp<Shape_, WarpMmaTensorOp_, PartitionsK, OutputOp_, ElementsPerAccess, ScatterD, PermuteDLayout> [with Shape_=cutlass::gemm::GemmShape<32, 128, 64>, WarpMmaTensorOp_=cutlass::gemm::w
  arp::MmaVoltaTensorOp<cutlass::gemm::GemmShape<32, 32, 64>, cutlass::half_t, cutlass::layout::RowMajorVoltaTensorOpMultiplicandCrosswise<16, 64>, cutlass::half_t, cutlass::layout::ColumnMajorVoltaTensorOpMultiplicandCrosswise<16, 64>, float, cutlass::layout::RowMajor, cutlass:
  :gemm::warp::MmaTensorOpPolicy<cutlass::arch::Mma<cutlass::gemm::GemmShape<16, 16, 4>, 32, cutlass::half_t, cutlass::layout::RowMajor, cutlass::half_t, cutlass::layout::ColumnMajor, float, cutlass::layout::RowMajor, cutlass::arch::OpMultiplyAdd>, cutlass::MatrixShape<1, 1>>, _
  _nv_bool>, PartitionsK=1, OutputOp_=cutlass::epilogue::thread::LinearCombinationRelu<cutlass::half_t, 8, float, float, cutlass::epilogue::thread::ScaleType::Default, cutlass::FloatRoundStyle::round_to_nearest>, ElementsPerAccess=8, ScatterD=false, PermuteDLayout=cutlass::layou
  t::NoPermute]" at line 747 of C:/Users/lochi/repos/onnxruntime/build/Windows/Release/_deps/cutlass-src/include\cutlass/gemm/kernel/default_gemm.h
              instantiation of class "cutlass::gemm::kernel::DefaultGemm<ElementA, LayoutA, kAlignmentA, ElementB, LayoutB, kAlignmentB, ElementC, cutlass::layout::RowMajor, ElementAccumulator, cutlass::arch::OpClassTensorOp, cutlass::arch::Sm70, ThreadblockShape, WarpShape, cut
  lass::gemm::GemmShape<8, 8, 4>, EpilogueOutputOp, ThreadblockSwizzle, 2, SplitKSerial, Operator, SharedMemoryClear, GatherA, GatherB, ScatterD, PermuteDLayout, PermuteALayout, PermuteBLayout, void> [with ElementA=cutlass::half_t, LayoutA=cutlass::layout::RowMajor, kAlignmentA=
  8, ElementB=cutlass::uint4b_t, LayoutB=cutlass::layout::ColumnMajor, kAlignmentB=8, ElementC=cutlass::half_t, ElementAccumulator=float, ThreadblockShape=cutlass::gemm::GemmShape<32, 128, 64>, WarpShape=cutlass::gemm::GemmShape<32, 32, 64>, EpilogueOutputOp=cutlass::epilogue::t
  hread::LinearCombinationRelu<cutlass::half_t, 8, float, float, cutlass::epilogue::thread::ScaleType::Default, cutlass::FloatRoundStyle::round_to_nearest>, ThreadblockSwizzle=cutlass::gemm::threadblock::GemmBatchedIdentityThreadblockSwizzle, SplitKSerial=true, Operator=cutlass:
  :arch::OpMultiplyAdd, SharedMemoryClear=cutlass::gemm::SharedMemoryClearOption::kNone, GatherA=false, GatherB=false, ScatterD=false, PermuteDLayout=cutlass::layout::NoPermute, PermuteALayout=cutlass::layout::NoPermute, PermuteBLayout=cutlass::layout::NoPermute]" at line 239 of
   C:/Users/lochi/repos/onnxruntime/build/Windows/Release/_deps/cutlass-src/include\cutlass/gemm/kernel/default_gemm_grouped.h
              [ 4 instantiation contexts not shown ]
              instantiation of "void ort_fastertransformer::dispatch_moe_gemm_to_cutlass<T,WeightType,arch,EpilogueTag,<unnamed>>(const T *, const WeightType *, const T *, const T *, T *, int64_t *, int64_t, int64_t, int64_t, int, ort_fastertransformer::CutlassGemmConfig, int, i
  nt, cudaStream_t, int *) [with T=half, WeightType=cutlass::uint4b_t, arch=cutlass::arch::Sm70, EpilogueTag=ort_fastertransformer::EpilogueOpDefaultReLU, <unnamed>=(void *)nullptr]" at line 389 of C:/Users/lochi/repos/onnxruntime/onnxruntime\contrib_ops/cuda/moe/ft_moe/moe_gemm
  _kernels_template.h
              instantiation of "void ort_fastertransformer::MoeGemmRunner<T, WeightType>::dispatch_to_arch<EpilogueTag>(const T *, const WeightType *, const T *, const T *, T *, int64_t *, int64_t, int64_t, int64_t, int, ort_fastertransformer::CutlassGemmConfig, cudaStream_t, in
  t *) [with T=half, WeightType=cutlass::uint4b_t, EpilogueTag=ort_fastertransformer::EpilogueOpDefaultReLU]" at line 422 of C:/Users/lochi/repos/onnxruntime/onnxruntime\contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_template.h
              instantiation of "void ort_fastertransformer::MoeGemmRunner<T, WeightType>::profile_gemm<EpilogueTag>(const T *, const WeightType *, const T *, const T *, T *, int64_t *, int64_t, int64_t, int64_t, int, cudaStream_t, int64_t) [with T=half, WeightType=cutlass::uint4
  b_t, EpilogueTag=ort_fastertransformer::EpilogueOpDefaultReLU]" at line 469 of C:/Users/lochi/repos/onnxruntime/onnxruntime\contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_template.h
              instantiation of "void ort_fastertransformer::MoeGemmRunner<T, WeightType>::run_gemm<EpilogueTag>(const T *, const WeightType *, const T *, const T *, T *, int64_t *, int64_t, int64_t, int64_t, int, cudaStream_t) [with T=half, WeightType=cutlass::uint4b_t, Epilogue
  Tag=ort_fastertransformer::EpilogueOpDefaultReLU]" at line 484 of C:/Users/lochi/repos/onnxruntime/onnxruntime\contrib_ops/cuda/moe/ft_moe/moe_gemm_kernels_template.h
              instantiation of "void ort_fastertransformer::MoeGemmRunner<T, WeightType>::moe_gemm_bias_act(const T *, const WeightType *, const T *, const T *, T *, int64_t *, int64_t, int64_t, int64_t, int, ort_fastertransformer::ActivationType, cudaStream_t) [with T=half, Wei
  ghtType=cutlass::uint4b_t]" at line 29 of C:\Users\lochi\repos\onnxruntime\onnxruntime\contrib_ops\cuda\moe\ft_moe\moe_gemm_kernels_fp16_uint4.cu

mhoemmen (Contributor) commented:

A colleague has been communicating offline to see if setting the /Zc:__cplusplus flag helps. I'm awaiting news on that.
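
(Why the flag is plausibly relevant, as a sketch under the assumption that the constexpr-ness of the helper is gated on a __cplusplus version check: without /Zc:__cplusplus, MSVC reports __cplusplus as 199711L even in C++17 mode, so such checks take the pre-C++17 branch.)

    // Hypothetical version-gated helper; the macro name is made up for illustration.
    #if __cplusplus >= 201703L
    #  define SKETCH_CONSTEXPR constexpr
    #else
    #  define SKETCH_CONSTEXPR            // expands to nothing on an old __cplusplus value
    #endif

    SKETCH_CONSTEXPR int const_min_sketch(int a, int b) { return a < b ? a : b; }

    // Compiles only when SKETCH_CONSTEXPR expanded to constexpr, i.e. only when
    // the compiler reports a C++17 (or later) value for __cplusplus.
    static_assert(const_min_sketch(3, 5) == 3, "needs the constexpr branch");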


tianleiwu commented Jun 5, 2024

The /Zc:__cplusplus flag resolved the issue. Thanks.
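
(For anyone hitting the same thing: a sketch of how a consuming project might forward the flag to the MSVC host compiler behind nvcc. The generator expressions are standard CMake; ${target} is whatever target your project uses, as in the onnxruntime_providers_cuda.cmake snippet above.)

    if(MSVC)
      target_compile_options(${target} PRIVATE
        "$<$<COMPILE_LANGUAGE:CXX>:/Zc:__cplusplus>"
        "$<$<COMPILE_LANGUAGE:CUDA>:-Xcompiler=/Zc:__cplusplus>")
    endif()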

mhoemmen (Contributor) commented Jun 6, 2024

Thanks @tianleiwu! :-)

@hwu36 It looks like we can close this issue.


This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
