-
Notifications
You must be signed in to change notification settings - Fork 12.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DebugInfo] Malformed DWARF produced during LTO build #48134
Comments
Yep, looks like a bug to me. Slightly smaller repro (& also reproduces on x86, and with full (not only -r) linking): 1.cpp: 2.cpp: struct HHH { int main() { As for fixing it, I think the most likely culprit/place to start would be the logic that places the definition of "eee" (or "main" in my example) in the wrong CU. I don't think there's any good reason for that to happen. (I was hoping to tickle a possibly more severe version of this bug by making eee file-scoped static (because then moving it between CUs could be really problematic in other ways - if there was another static function with the same name in that other CU, etc) but seems something different happens in that case that doesn't trigger the bug - not sure why, don't think we do anything significantly different in the DWARF for file-scoped static entities) |
The definition of HHH migrating from 2.cpp to 1.cpp seems odd to me. |
It's the presence of the global that's causing it to be emitted in 1.cpp. I imagine something similar to the subprogram happens to the type: it just gets created in the "current" CU.
Yup,
I don't believe ctor homing is turned on by default yet, I see -debug-info-kind=limited and -fuse-ctor-homing is absent in my clang -### commands. |
(CC'ing modocache as he reported reverting D70350 downstream after experiencing difficulties with it). |
Possible patch: https://reviews.llvm.org/D94976 |
Type definitions (& declarations) are deduplicated at the LLVM IR level based on their mangled name to reduce IR (& resulting DWARF) size. This is expected/desired - perhaps slightly not-what-the-source-says (but that's always going to be the case - if the type was defined in both files, we still wouldn't want to produce definitions in both due to size concerns). |
& yeah, nothing to do with ctor or other (vtable, etc) homing strategies - IR debug info metadata type deduplication would be done even with -fstandalone-debug (ie: without any type homing enabled). For what it's worth: Type homing is a purely frontend/IR generation feature and LLVM has no knowledge of or explicit support for type homing. IR Type deduplication does rely on the frontend providing a mangled name in the type description metadata - this is how the frontend signals to LLVM that the type is subject to the ODR and so LLVM can drop duplicates based on name alone. |
Hi Tom, could you pull-up / cherry-pick the fix for this ticket, ef0dcb5 [0], into the LLVM 12 release branch? In rare circumstances LTO will produce a file with illegal DWARF syntax. I don't think it's needed for rc1. [0] https://reviews.llvm.org/rGef0dcb506300dc9644e8000c6028d14214be9d97 |
Actually, hold on this, there's a report in D95622 that it's causing a crash. I'd still like this to be in llvm-12, but best to wait a little bit. |
Sitrep: D94976 continues to run into trouble. There's an increased risk that this won't make llvm-12. Technically this isn't a regression because the buggy behaviour is in llvm-11, but emitting broken DWARF is far from ideal, of course. |
I'm getting the impression that cross-CU DIE emission is being done in an |
Practically, we can't - CUs are built as functions are code generated - and not all the functions for a given CU are grouped together. I guess we could change that, but it'd be a significant redesign of DWARF building code & I'm not sure it'd significantly improve this issue.
Perhaps an aside, but FWIW, that part of the architecture already exists - we don't compute the DIE offsets immediately (because we go back and add more attributes to DIEs later on, for instance - so we can't know offsets) - so when we created a cross-DIE reference, one DIE's attribute refers/points to the other DIE then later on when they're emitted it's two-pass, one commits offsets (says "we're not changing anything, so figure out how big everything is/what the offsets are) and then emitting the bytes where the attribute emission can query the final offset of the DIE being referred to and use that.
There are some other benefits if we were to move to using relocations for a lot more DWARF output, including for DIE offsets. (would make for more editable/legible DWARF assembly) but probably not much difference to this situation.
Yep, I think it'd be best to not try to push this into LLVM 12 - as we've seen, there's a lot of nuance in this sort of code. |
Oh, I shuold also say it'd be complicated to emit a whole CU in a standalone way - in general CU IR Metadata doesn't contain a comprehensive list of its contents. In the interests of reducing/eliding dead DWARF, most of the entities refer to the CU as their context, but without the CU listing/knowing about that item. eg: a simple inlining. So the only way SP1 is reachable is through func2 |
What's the status of this bug? Have the problems with the fix been resolved? |
Hi Tom,
They have not, and it seems unlikely that it'll be sufficiently addressed and tested in time for 12.0. Technically this isn't a regression -- the behaviour behind this bug was present in llvm-11, and it appears to trigger rarely. We only ran into it on a very large C++ code base when using LTO. It can likely be left until 12.0.1, David agrees in comment 13 An alternative is to revert e08f205, which unmasked this problem, and is what we're doing downstream. I believe that would decrease the quality of DWARF5 call site information, @aprantl @vsk would you be able to give an opinion on how you'd weight reverting versus sticking with the current behaviour? |
This bug was not resolved in time for the 12.0.0 release, so it will have to wait for 12.0.1. If you feel this is a high-priority bug that should be fixed for 12.0.0, please re-add release-12.0.0 to the Blocks field and add a comment explaining why this is high-priority. |
This bug still repros in the latest upstream compiler |
To respond to Jeremy's question (from comment #16): reverting D70350 might regress the debugging experience for some of our low-level projects that build with LTO, but it's not something to rule out, esp. if D94976 won't land any time soon (and, of course, given that malformed DWARF is no good). |
I suspect D94976 isn't going anywhere soon -- there seems to be a significant amount of context encoded in "the current unit" that I don't fully understand yet. FTR, we've reverted D70350 downstream as it affects a codebase that's important to us. I'd suggest that we revert in llvm main, seeing how it's affecting more than one person/org, and broken DWARF is more of a blocker than reduced debug experience (alas). |
Uploaded a revert-and-regression-test patch at https://reviews.llvm.org/D107076 |
Ultimately this was reverted in d4ce9e4 and cherry-picked into llvm-13 via bug 51426. |
mentioned in issue #48661 |
Extended Description
The reproducer below causes LLVM master (93592b7) to produce malformed DWARF when built with "-O3 -g -flto". Running
llvm-dwarfdump -verify
complains "invalid DIE reference", and our debugger becomes displeased. Reverting e08f205 / D70350, "Allow cross-CU references of subprogram definitions", fixes it for us.Sources:
$ cat -n 1.cpp
1 struct HHH;
2
3 struct xxx {
4 HHH yyy();
5 } zzz;
6
$ cat -n 2.cpp
1 int ggg;
2
3 struct HHH {
4 template int ccc(bbb) {
5 while (--ggg)
6 ;
7 }
8 };
9
10 void eee() {
11 HHH fff;
12 fff.ccc([] {});
13 }
14
Built thus (the target is important it seems):
$ clang++ -O3 -g -flto -w -c 1.cpp -o 1.o -target x86_64-scei-ps4
$ clang++ -O3 -g -flto -w -c 2.cpp -o 2.o -target x86_64-scei-ps4
$ ld.lld 1.o 2.o -o 3.out -r
$ llvm-dwarfdump -verify 3.out
Produces:
--------8<--------
error: invalid DIE reference 0x00000046. Offset is in between DIEs:
0x000000a1: DW_TAG_inlined_subroutine
DW_AT_abstract_origin (0x00000046)
DW_AT_low_pc (0x0000000000000000)
DW_AT_high_pc (0x0000000000000002)
DW_AT_call_file ("1.cpp")
DW_AT_call_line (12)
-------->8--------
I'm attaching an IR reproducer, derived from "ld.lld -save-temps" and the combined-and-optimized IR it produces. Feeding that IR into llc also produces an invalid DIE reference.
I've done some prodding, and while to me the DWARF emission code is a twisty turny maze of passages all alike, I believe the problem is that the DIE for the "eee" function is placed in the wrong compile unit. Here's a chain of events:
Normally this wouldn't be a problem. I believe that at the end of DwarfDebug::endFunctionImpl, normally a duplicate DIE would be created for "eee", because a cross-CU subprogram wouldn't be permitted. However, with e08f205 a cross-CU subprogram is permitted.
All the child scopes of "eee" are created before "eee" is created, they expect to be in the CU that the DISubprogram for "eee" specifies, 2.cpp, and they pick references of form "DW_FORM_ref4". They're then attached to a subprogram DIE in a different CU, breaking those forms of references. If you use llvm-dwarfdump -v, you can see that the offset of the DW_AT_abstract_origin above would be correct, if it was in the correct CU (hat-tip James Henderson).
I don't really have a proposal for how to fix this right now, DWARF emission is way out of my comfort zone.
The text was updated successfully, but these errors were encountered: