Ptx assembly aborted due to errors #58491

carlosgalvezp · 2022-10-20T06:53:58Z

Hi!

We are bumping Clang to commit 1ae33bf, and we find that it crashes building CUDA code with this error trace:

ptxas /tmp/patch-4eaef1/patch-sm_61.s, line 3885; fatal   : Parsing error near '.': syntax error
ptxas fatal   : Ptx assembly aborted due to errors
clang: error: ptxas command failed with exit code 255 (use -v to see invocation)
clang version 16.0.0 (https://github.com/llvm/llvm-project.git 1ae33bf42680b156fe0f5cd6163bf24ef45d8cd3)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: external/llvm/usr/bin

Is this a known problem?

carlosgalvezp · 2022-10-20T12:10:48Z

It has to do with the __func__ predefined identifier, and it only happens when compiling in debug mode (-g). Minimal repro with CUDA 11.7 and Clang trunk:

#include <cstdio>

__global__ void foo_kernel()
{
    printf("%s", __func__);
}

void foo()
{
    foo_kernel<<<10, 1>>>();
}

clang --cuda-path=/path/to/cuda-11.7 -c  -g --cuda-gpu-arch=sm_75 -o foo.cu.o foo.cu

carlosgalvezp · 2022-10-20T14:27:37Z

Bisecting brings me here: 7aa1fa0
FYI @hctim @dwblaikie @rnk @adrian-prantl

Artem-B · 2022-10-20T17:33:31Z

https://godbolt.org/z/8bMYcf1z7

The debug info directive that ptxas does not like is on line 655:

.b64 __func__._Z10foo_kernelv

It should've been __func___$__Z10foo_kernelv. Apparently NVPTX's name normalizer didn't get applied to the symbol name in debug info.

A work-around would be to disable GPU-side debug info with -Xarch_device -g0
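
To make the expected renaming concrete, here is a minimal standalone sketch of the normalization described above. It assumes the normalizer only replaces each `.` with `_$_`; the function name `normalizePtxName` is hypothetical and this is not the actual LLVM implementation.

```cpp
#include <string>

// Hypothetical helper, not the actual LLVM code: mimics the normalization
// described above by replacing every '.' in a symbol name with "_$_".
std::string normalizePtxName(const std::string& name) {
    std::string out;
    out.reserve(name.size());
    for (char c : name) {
        if (c == '.')
            out += "_$_";  // ptxas rejects '.' inside identifiers
        else
            out += c;
    }
    return out;
}
```

Under that assumption, `normalizePtxName("__func__._Z10foo_kernelv")` yields `__func___$__Z10foo_kernelv`, which is the name the debug info directive should have used.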

carlosgalvezp · 2022-10-21T06:23:24Z

Thanks for the quick help! Will try the workaround :)

vangohao · 2024-01-03T09:02:15Z

https://godbolt.org/z/8bMYcf1z7

The debug info directive that ptxas does not like is on line 655:
.b64 __func__._Z10foo_kernelv
It should've been __func___$__Z10foo_kernelv. Apparently NVPTX's name normalizer didn't get applied to the symbol name in debug info.

A work-around would be to disable GPU-side debug info with -Xarch_device -g0

Is this issue solved? I am encountering this issue with clang and llvm 17.0.6

tuero · 2024-03-06T03:26:17Z

I'm encountering a similar issue with clang version 18.0.0git (https://github.com/llvm/llvm-project.git b7376c319630a6b8395f3df5a46ba73e8fe29ea9), where debug builds fail when using __PRETTY_FUNCTION__

tambry · 2024-03-15T12:30:58Z

Minimal repro:

echo '__attribute__((device)) void foo(){__PRETTY_FUNCTION__;}' | clang -cc1 -triple nvptx64-nvidia-cuda -S -fcuda-is-device -debug-info-kind=constructor -fno-dwarf-directory-asm -Wno-everything -x cuda | ptxas -

Removing -debug-info-kind=constructor works around this.

Artem-B · 2024-03-15T17:31:33Z

It looks like another case of LLVM generating symbol names with a dot in them that sneak past our attempts to normalize such names:

 .global .align 1 .b8 __PRETTY_FUNCTION___$__Z3foov[11] = {118, 111, 105, 100, 32, 102, 111, 111, 40, 41};
...

.b64 __PRETTY_FUNCTION__._Z3foov

The variable itself does have . mangled, but the reference from debug info does not.

Switching to line-only debug info would work around the issue, too.
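
For anyone trying to locate the offending lines in a large generated .s file, a rough sketch of a checker follows. It assumes the problematic references always take the `.b64 <symbol>` form shown above; `referencesDottedSymbol` is a hypothetical helper, not something clang or ptxas provides.

```cpp
#include <cctype>
#include <string>

// Hypothetical diagnostic helper (not part of clang or ptxas): returns true
// when `line` is a `.b64 <operand>` directive whose operand looks like a
// symbol name that still contains a '.'.
bool referencesDottedSymbol(const std::string& line) {
    const std::string directive = ".b64";
    const char* blanks = " \t";

    // Skip leading indentation and require the directive itself.
    std::size_t pos = line.find_first_not_of(blanks);
    if (pos == std::string::npos ||
        line.compare(pos, directive.size(), directive) != 0)
        return false;

    // The directive must be followed by whitespace and then an operand.
    pos += directive.size();
    if (pos >= line.size() || (line[pos] != ' ' && line[pos] != '\t'))
        return false;
    pos = line.find_first_not_of(blanks, pos);
    if (pos == std::string::npos)
        return false;

    // Only operands that start like an identifier count; this skips numeric
    // constants and operands that begin with a dot.
    const unsigned char first = static_cast<unsigned char>(line[pos]);
    if (!std::isalpha(first) && first != '_')
        return false;

    // Look for a dot inside the operand token only, not in trailing text.
    const std::size_t end = line.find_first_of(blanks, pos);
    const std::string operand =
        line.substr(pos, end == std::string::npos ? end : end - pos);
    return operand.find('.') != std::string::npos;
}
```

With this sketch, `.b64 __PRETTY_FUNCTION__._Z3foov` is flagged, while the already-normalized `.global` definition line is not.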

tambry · 2024-03-19T15:32:25Z

Looked into this quite a bit. It seems the name gets embedded in a debug DIE during the annotation-remarks pass in getOrCreateGlobalVariableDIE()→addLocationAttribute()→addOpAddress(). Somehow there end up being 2 MCSymbols related to __PRETTY_FUNCTION__._Z3foov and nvptx-assign-valid-global-names renames the general one, but not the one that was embedded into the DIE...

After already spending too much time on this, and without understanding enough about the guts of the LLVM debug information infrastructure, I took the easy way out:

Generate pre-defined lvalue names without dots

`.` should be converted to `_$_` by the nvptx-assign-valid-global-names pass as `ptxas` doesn't support dots.
But during the ASMPrinter initialization the global variable name gets embedded in a debug DIE.
There somehow end up being two different `MCSymbol`s for the global variable with only the main one being renamed.

Bug: https://github.com/llvm/llvm-project/issues/58491
--- a/clang/lib/CodeGen/CGExpr.cpp
+++ b/clang/lib/CodeGen/CGExpr.cpp
@@ -3277,7 +3277,12 @@ LValue CodeGenFunction::EmitPredefinedLV
     FnName = FnName.substr(1);
   StringRef NameItems[] = {
       PredefinedExpr::getIdentKindName(E->getIdentKind()), FnName};
-  std::string GVName = llvm::join(NameItems, NameItems + 2, ".");
+  std::string GVName;
+  if (CGM.getLangOpts().CUDA && CGM.getLangOpts().CUDAIsDevice) {
+    GVName = llvm::join(NameItems, NameItems + 2, "_$_");
+  } else {
+    GVName = llvm::join(NameItems, NameItems + 2, ".");
+  }
   if (auto *BD = dyn_cast_or_null<BlockDecl>(CurCodeDecl)) {
     std::string Name = std::string(SL->getString());
     if (!Name.empty()) {

Artem-B · 2024-03-19T17:22:41Z

I think we've dealt with a similar issue in the dwarf debug info before. Let me see if I can find it.

Artem-B · 2024-03-20T20:43:54Z

I think I had 2e7e097 in mind, but it may not be helpful here as it was dealing with the concept of private prefixes. Here the symbol which causes the problem is a . used as a separator.

I believe we did discuss invalid symbol issues in the past, but I do not think it ever went anywhere.
E.g. the discussion on https://reviews.llvm.org/D40573 still seems to be somewhat relevant.
Especially this bit:

This is silly. This bug has been open for so long that nvidia could've just fixed their toolchain by now to accept dots in symbol names.

Back to figuring out how to fix this instance.

But during the ASMPrinter initialization the global variable name gets embedded in a debug DIE.
There somehow end up being two different MCSymbols for the global variable with only the main one being renamed.

Oh, well. Looks like we may need to do it the hard way and teach nvptx-assign-valid-global-names how to deal with the symbols in debug info. It would still be dealing with the consequences, but at least the mess would be contained in one place.

@alexey-bataev Would you happen to have any idea on what would be the best way to get DWARF's symbol references mangled the same way we mangle other symbols in NVPTX?

alexey-bataev · 2024-03-20T20:50:10Z

I always thought that we need to handle it in the frontend. But it is only my thought, feel free to discard it.

Artem-B · 2024-03-20T21:20:59Z

Avoiding such symbols in the front-end would avoid some of the issues (granted, including this one), but a symbol with a dot may still materialize within LLVM itself. Granted, it may not happen often in practice. It's also possible that such symbol cloning would not be affected by this issue (e.g. if, unlike this case, debug info would point to the same MCSymbol for the cleaned up name).

Here are the options I see:

1. Get NVIDIA to change ptxas and allow a more sensible set of characters in identifiers. The problem is that it's not going to help us for a long time, as we need to deal with ptxas versions that are out there already.
2. Change LLVM to use something other than . when it needs to create identifiers. This has consequences for ABI. E.g. host/device symbols will get mangled differently. That would be a problem.
   Because of the above, this name cleanup may need to be applied selectively on multiple targets (NVPTX + supported host architectures, currently x86 and ARM). E.g. we'll want to apply it to all symbols on the GPU, and to all symbols that need to have the same name across the host/GPU boundary, such as kernels and other GPU-side symbols we may need to refer to from the host.
3. Clean up the names in the front-end. This is a very narrow workaround for a subset of these 'illegal character' issues. It should be enough to deal with this case, but I do not like it because it's not the front-end's job to know about the quirks of something many abstraction levels below it. The front-end should be constrained by the contract between it and LLVM. If the symbol is valid for LLVM, how it gets lowered into target assembly is LLVM's responsibility.
4. Teach nvptx-assign-valid-global-names how to fix symbol names in associated debug info. I think that may be the least bad trade-off we have at the moment. The caveat is that I have no idea how much effort it would take.

@dwblaikie If we rename a global symbol, how hard is it to find and update references to the symbol from debug info? I suspect we already do that somewhere in LLVM. Can you point me in the right direction?

dwblaikie · 2024-03-21T01:16:20Z

Not sure if there are existing instances of this (as you say, ABI would mostly make it impossible to change symbol names effectively).

But if you want to try it: the DISubprogram attached to the function, if it has the mangled name, should be updated. (Maybe it doesn't have it; maybe it just depends on the actual symbol name of the llvm::Function, in which case you wouldn't have to do anything for debug info.)

tambry · 2024-03-21T12:23:30Z

The DISubprogram name referred to the correct MCSymbol* and was correct AFAIK. The problematic name instead seemed to be attached to the !17 debug annotation on the ret.

source_filename = "-"
target datalayout = "e-i64:64-i128:128-v16:16-v32:32-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

@"__PRETTY_FUNCTION___$__Z3foov" = private unnamed_addr constant [11 x i8] c"void foo()\00", align 1, !dbg !0

; Function Attrs: convergent mustprogress noinline nounwind optnone
define dso_local void @_Z3foov() #0 !dbg !14 {
entry:
  ret void, !dbg !17
}

attributes #0 = { convergent mustprogress noinline nounwind optnone "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-features"="+ptx32" }

!llvm.dbg.cu = !{!8}
!llvm.linker.options = !{}
!llvm.module.flags = !{!10, !11, !12}
!llvm.ident = !{!13}

!0 = !DIGlobalVariableExpression(var: !1, expr: !DIExpression())
!1 = distinct !DIGlobalVariable(scope: null, file: !2, line: 1, type: !3, isLocal: true, isDefinition: true)
!2 = !DIFile(filename: "<stdin>", directory: "/home/raul.tambre")
!3 = !DICompositeType(tag: DW_TAG_array_type, baseType: !4, size: 88, elements: !6)
!4 = !DIDerivedType(tag: DW_TAG_const_type, baseType: !5)
!5 = !DIBasicType(name: "char", size: 8, encoding: DW_ATE_signed_char)
!6 = !{!7}
!7 = !DISubrange(count: 11)
!8 = distinct !DICompileUnit(language: DW_LANG_C_plus_plus_14, file: !2, producer: "Clebian clang version 19.0.0", isOptimized: false, runtimeVersion: 0, emissionKind: FullDebug, globals: !9, splitDebugInlining: false, nameTableKind: None)
!9 = !{!0}
!10 = !{i32 2, !"Debug Info Version", i32 3}
!11 = !{i32 1, !"wchar_size", i32 4}
!12 = !{i32 4, !"nvvm-reflect-ftz", i32 0}
!13 = !{!"Clebian clang version 19.0.0"}
!14 = distinct !DISubprogram(name: "foo", linkageName: "_Z3foov", scope: !2, file: !2, line: 1, type: !15, scopeLine: 1, flags: DIFlagPrototyped, spFlags: DISPFlagDefinition, unit: !8)
!15 = !DISubroutineType(types: !16)
!16 = !{null}
!17 = !DILocation(line: 1, column: 56, scope: !14)

I managed to write something that reaches that instruction, but could not work out how to reach the DIE that already had the wrong name embedded. It seems to be an abstraction layer away and inaccessible from such a pass. Running nvptx-assign-valid-global-names as one of the first passes, before the DIE is created, might work.

dwblaikie · 2024-03-21T16:45:19Z

Sorry, I'm not following that last comment - the DISubprogram is the same one from the Function and from the DILocation.

I take it this renaming isn't done at the IR level, OK - so it's not about updating the DISubprogram itself to match a change to the Function, but later than that.

Sure enough then - DwarfUnit::applySubprogramDefinitionAttributes calls addLinkageName - I guess it'd need some awkward mapping in DwarfDebug of DISubprogram back to llvm::Function... I don't feel good about that, maybe there's some other way to handle it, but you could at least prototype that.

tambry · 2024-03-21T22:19:50Z

Sorry, I'm not following that last comment - the DISubprogram is the same one from the Function and from the DILocation.

I guess I was getting at the fact that you can't get the DILocation from the DISubprogram, but rather have to iterate over the instructions to find the return instruction with the appropriate debug annotation. At least it seemed so to me, but chewing through the API and abstractions was difficult when I tried. 🙂

dwblaikie · 2024-03-25T17:00:14Z

Sorry, I'm not following that last comment - the DISubprogram is the same one from the Function and from the DILocation.

I guess I was getting at the fact that you can't get the DILocation from the DISubprogram, but rather have to iterate over the instructions to find the return instruction with the appropriate debug annotation. At least it seemed so to me, but chewing through the API and abstractions was difficult when I tried. 🙂

Ah, yes, DILocations aren't accessible top-down from the DISubprogram, only bottom-up from the instructions of the DISubprogram's Function.

aminiussi · 2024-07-10T11:29:42Z

Hello, I have a similar issue with LLVM 18.1.8 and CUDA 12.5. Is that expected?

The failing line is

.b64 __PRETTY_FUNCTION__._ZN7fargOCA6nsnextILm1EEEmmm

and the message:

alainm@jarvis:/scrach/alainm/repos/fargOCA/gpu/seq/llvm/dbg$ptxas --version
ptxas: NVIDIA (R) Ptx optimizing assembler
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:14:54_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0
alainm@jarvis:/scrach/alainm/repos/fargOCA/gpu/seq/llvm/dbg$ptxas -v  ./disk-cuda-nvptx64-nvidia-cuda-sm_89.s
ptxas ./disk-cuda-nvptx64-nvidia-cuda-sm_89.s, line 148598; fatal   : Parsing error near '.': syntax error
ptxas fatal   : Ptx assembly aborted due to errors
alainm@jarvis:/scrach/alainm/repos/fargOCA/gpu/seq/llvm/dbg$

It is also failing with main branch:

alainm@jarvis:/scrach/alainm/repos/fargOCA/gpu/seq/llvm/dbg$/opt/llvm-main/bin/clang++ --version
clang version 19.0.0git (https://github.com/llvm/llvm-project.git 08ce14732d528ab70309f334446d39782f2f07c0)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/llvm-main/bin
alainm@jarvis:/scrach/alainm/repos/fargOCA/gpu/seq/llvm/dbg$$ptxas disk-cuda-nvptx64-nvidia-cuda-sm_89.s
ptxas disk-cuda-nvptx64-nvidia-cuda-sm_89.s, line 145450; fatal   : Parsing error near '.': syntax error
ptxas fatal   : Ptx assembly aborted due to errors
alainm@jarvis:/scrach/alainm/repos/fargOCA/gpu/seq/llvm/dbg$

Artem-B · 2024-07-10T21:14:17Z

Unfortunately, the issue is still there, and we still do not have a good fix. Disabling GPU-side dwarf debug info with -Xarch_device -g0 is the best workaround we have at the moment.

tambry · 2024-07-11T10:58:21Z

@Artem-B Would you object to applying this workaround to master until someone actually dives into the guts of the DI subsystem to find the bug?

Artem-B · 2024-07-11T18:47:56Z

Doing it on the clang side would depend on the name mangling implementation details in the NVPTX back-end. I think a better approach would be to try intercepting printouts of .bN constants and mangle the ones that look like a symbol. It's not the best time/place to do that mangling (we do need to figure out where the unmangled name lives and fix it there), but it would be doing the right thing in principle, using the same mangling we're applying to the symbols produced by NVPTX.

aminiussi · 2024-07-11T20:59:27Z

@Artem-B Would you object to applying this workaround to master until someone actually dives into the guts of the DI subsystem to find the bug?

Fine by me. Thanks

sidarth-narayanan-csi · 2024-08-06T20:08:21Z

Hello,

We are facing the same issue. Was the workaround already merged into master? If yes, what is the version number, for reference?

Thank you

Artem-B · 2024-08-06T21:07:29Z

This problem is not fixed yet. You may work around by disabling GPU-side debug info with -Xarch_device -g0

sidarth-narayanan-csi · 2024-08-06T21:58:43Z

@Artem-B Thank you for the response

Until now debug info was printing the symbol names as-is, and that resulted in invalid PTX when the symbols contained characters that are invalid for PTX, e.g. `__PRETTY_FUNCTION.something`.

Debug info is somewhat disconnected from the symbols themselves, so the regular "NVPTXAssignValidGlobalNames" pass can't easily fix them.

As the "plan B" this patch catches printout of debug symbols and fixes them, as needed.

One gotcha is that the same code path is used to print the names of debug info sections. Those section names do start with a '.debug'. The dot in those names is nominally illegal in PTX, but the debug section names with a dot are accepted as a special case.

The downside of this change is that if someone ever has a `.debug*` symbol that needs to be referred to from the debug info, that label will be passed through as-is, and will still produce broken PTX output. If/when we run into a case where we need it to work, we could consider only passing through specific debug section names, or add a mechanism allowing us to tell section names apart from regular symbols.

Fixes llvm#58491

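
The behavior described in that commit message can be sketched roughly as follows. This is only an illustration under stated assumptions: the name `mangleDebugSymbol` is hypothetical, the sketch replaces only `.` with `_$_`, and the real patch lives in the NVPTX back-end and may differ in detail.

```cpp
#include <string>

// Hypothetical illustration of the "plan B" described above, not the code
// from the actual patch: names printed from debug info are normalized,
// except those that start with ".debug", which are passed through as-is.
std::string mangleDebugSymbol(const std::string& name) {
    const std::string debugPrefix = ".debug";

    // Debug section names are accepted by ptxas as a special case.
    if (name.compare(0, debugPrefix.size(), debugPrefix) == 0)
        return name;

    std::string out;
    out.reserve(name.size());
    for (char c : name) {
        if (c == '.')
            out += "_$_";  // same replacement the regular symbols get
        else
            out += c;
    }
    return out;
}
```

The sketch also makes the stated downside visible: a regular symbol that happened to be called `.debug_foo.bar` would come back unchanged and would still break ptxas.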

carlosgalvezp added cuda crash-on-valid labels Oct 20, 2022

Artem-B self-assigned this Oct 21, 2022

This was referenced Jul 26, 2023

Add clang-16 test runner alpaka-group/alpaka#1971

Merged

Clang 15 as CUDA compiler does not work (at least for the test cases) alpaka-group/alpaka#1906

Closed

Artem-B mentioned this issue Jan 9, 2024

cuda compilation fails when compiling in debug mode #77009

Closed

Artem-B mentioned this issue Aug 14, 2024

[CUDA] device side asserts crash clang when using -g #101819

Closed

Artem-B mentioned this issue Oct 21, 2024

[NVPTX] mangle symbols in debug info to conform to PTX restrictions. #113216

Merged

Artem-B closed this as completed in #113216 Oct 22, 2024

EugeneZelenko added the backend:NVPTX label Oct 22, 2024
