Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DebugInfo] Malformed DWARF produced during LTO build #48134

Closed
jmorse opened this issue Jan 18, 2021 · 22 comments
Closed

[DebugInfo] Malformed DWARF produced during LTO build #48134

jmorse opened this issue Jan 18, 2021 · 22 comments
Labels
bugzilla Issues migrated from bugzilla llvm:codegen wrong-debug

Comments

@jmorse
Copy link
Member

jmorse commented Jan 18, 2021

Bugzilla Link 48790
Resolution FIXED
Resolved on Aug 23, 2021 14:00
Version trunk
OS Linux
Blocks #48661
Attachments LLVM-IR to reproduce when fed to llc
CC @adrian-prantl,@dwblaikie,@gregbedwell,@JDevlieghere,@kyulee-com,@modocache,@pogo59,@pogo59,@tstellar,@vedantk,@wjristow
Fixed by commit(s) ef0dcb5

Extended Description

The reproducer below causes LLVM master (93592b7) to produce malformed DWARF when built with "-O3 -g -flto". Running llvm-dwarfdump -verify complains "invalid DIE reference", and our debugger becomes displeased. Reverting e08f205 / D70350, "Allow cross-CU references of subprogram definitions", fixes it for us.

Sources:

$ cat -n 1.cpp
1 struct HHH;
2
3 struct xxx {
4 HHH yyy();
5 } zzz;
6

$ cat -n 2.cpp
1 int ggg;
2
3 struct HHH {
4 template int ccc(bbb) {
5 while (--ggg)
6 ;
7 }
8 };
9
10 void eee() {
11 HHH fff;
12 fff.ccc([] {});
13 }
14

Built thus (the target is important it seems):
$ clang++ -O3 -g -flto -w -c 1.cpp -o 1.o -target x86_64-scei-ps4
$ clang++ -O3 -g -flto -w -c 2.cpp -o 2.o -target x86_64-scei-ps4
$ ld.lld 1.o 2.o -o 3.out -r
$ llvm-dwarfdump -verify 3.out

Produces:
--------8<--------
error: invalid DIE reference 0x00000046. Offset is in between DIEs:

0x000000a1: DW_TAG_inlined_subroutine
DW_AT_abstract_origin (0x00000046)
DW_AT_low_pc (0x0000000000000000)
DW_AT_high_pc (0x0000000000000002)
DW_AT_call_file ("1.cpp")
DW_AT_call_line (12)
-------->8--------

I'm attaching an IR reproducer, derived from "ld.lld -save-temps" and the combined-and-optimized IR it produces. Feeding that IR into llc also produces an invalid DIE reference.

I've done some prodding, and while to me the DWARF emission code is a twisty turny maze of passages all alike, I believe the problem is that the DIE for the "eee" function is placed in the wrong compile unit. Here's a chain of events:

  • The DIE for "HHH" is placed in the CU for 1.cpp, because the "zzz" global var is emitted there.
  • The DIE for "ccc" is created in the context of "HHH", CU is 1.cpp
  • The template argument for "ccc", a lambda type, is created
  • The scope for that template argument is needed: so a DIE "eee" is created.
  • That DIE for "eee" is in the "current" unit, 1.cpp

Normally this wouldn't be a problem. I believe that at the end of DwarfDebug::endFunctionImpl, normally a duplicate DIE would be created for "eee", because a cross-CU subprogram wouldn't be permitted. However, with e08f205 a cross-CU subprogram is permitted.

All the child scopes of "eee" are created before "eee" is created, they expect to be in the CU that the DISubprogram for "eee" specifies, 2.cpp, and they pick references of form "DW_FORM_ref4". They're then attached to a subprogram DIE in a different CU, breaking those forms of references. If you use llvm-dwarfdump -v, you can see that the offset of the DW_AT_abstract_origin above would be correct, if it was in the correct CU (hat-tip James Henderson).

I don't really have a proposal for how to fix this right now, DWARF emission is way out of my comfort zone.

@dwblaikie
Copy link
Collaborator

Yep, looks like a bug to me.

Slightly smaller repro (& also reproduces on x86, and with full (not only -r) linking):

1.cpp:
struct HHH;
HHH *zzz;

2.cpp:
void attribute((optnone)) attribute((nodebug)) f1() { }

struct HHH {
template
static int attribute((always_inline)) ccc() {
f1();
}
};

int main() {
struct local { };
HHH::ccc();
}

As for fixing it, I think the most likely culprit/place to start would be the logic that places the definition of "eee" (or "main" in my example) in the wrong CU. I don't think there's any good reason for that to happen.

(I was hoping to tickle a possibly more severe version of this bug by making eee file-scoped static (because then moving it between CUs could be really problematic in other ways - if there was another static function with the same name in that other CU, etc) but seems something different happens in that case that doesn't trigger the bug - not sure why, don't think we do anything significantly different in the DWARF for file-scoped static entities)

@pogo59
Copy link
Collaborator

pogo59 commented Jan 19, 2021

The definition of HHH migrating from 2.cpp to 1.cpp seems odd to me.
Compiled separately, 1.cpp would obviously have only a forward declaration.
I take it with LTO, 2.cpp ends up with a cross-CU reference to the HHH in
1.cpp? Possibly a side-effect of ctor homing? (I haven't looked at the
insides of this at all, I'm just doing my usual black-box guessing based on
the symptoms.)

@jmorse
Copy link
Member Author

jmorse commented Jan 19, 2021

The definition of HHH migrating from 2.cpp to 1.cpp seems odd to me.
Compiled separately, 1.cpp would obviously have only a forward declaration.

It's the presence of the global that's causing it to be emitted in 1.cpp. I imagine something similar to the subprogram happens to the type: it just gets created in the "current" CU.

I take it with LTO, 2.cpp ends up with a cross-CU reference to the HHH in
1.cpp?

Yup,

Possibly a side-effect of ctor homing?

I don't believe ctor homing is turned on by default yet, I see -debug-info-kind=limited and -fuse-ctor-homing is absent in my clang -### commands.

@jmorse
Copy link
Member Author

jmorse commented Jan 19, 2021

(CC'ing modocache as he reported reverting D70350 downstream after experiencing difficulties with it).

@jmorse
Copy link
Member Author

jmorse commented Jan 19, 2021

Possible patch: https://reviews.llvm.org/D94976

@dwblaikie
Copy link
Collaborator

The definition of HHH migrating from 2.cpp to 1.cpp seems odd to me.
Compiled separately, 1.cpp would obviously have only a forward declaration.
I take it with LTO, 2.cpp ends up with a cross-CU reference to the HHH in
1.cpp? Possibly a side-effect of ctor homing? (I haven't looked at the
insides of this at all, I'm just doing my usual black-box guessing based on
the symptoms.)

Type definitions (& declarations) are deduplicated at the LLVM IR level based on their mangled name to reduce IR (& resulting DWARF) size. This is expected/desired - perhaps slightly not-what-the-source-says (but that's always going to be the case - if the type was defined in both files, we still wouldn't want to produce definitions in both due to size concerns).

@dwblaikie
Copy link
Collaborator

& yeah, nothing to do with ctor or other (vtable, etc) homing strategies - IR debug info metadata type deduplication would be done even with -fstandalone-debug (ie: without any type homing enabled).

For what it's worth: Type homing is a purely frontend/IR generation feature and LLVM has no knowledge of or explicit support for type homing.

IR Type deduplication does rely on the frontend providing a mangled name in the type description metadata - this is how the frontend signals to LLVM that the type is subject to the ODR and so LLVM can drop duplicates based on name alone.

@jmorse
Copy link
Member Author

jmorse commented Jan 28, 2021

Hi Tom, could you pull-up / cherry-pick the fix for this ticket, ef0dcb5 [0], into the LLVM 12 release branch? In rare circumstances LTO will produce a file with illegal DWARF syntax.

I don't think it's needed for rc1.

[0] https://reviews.llvm.org/rGef0dcb506300dc9644e8000c6028d14214be9d97

@jmorse
Copy link
Member Author

jmorse commented Jan 28, 2021

Hi Tom, could you pull-up / cherry-pick the fix for this ticket

Actually, hold on this, there's a report in D95622 that it's causing a crash. I'd still like this to be in llvm-12, but best to wait a little bit.

@jmorse
Copy link
Member Author

jmorse commented Feb 10, 2021

Sitrep: D94976 continues to run into trouble. There's an increased risk that this won't make llvm-12. Technically this isn't a regression because the buggy behaviour is in llvm-11, but emitting broken DWARF is far from ideal, of course.

@pogo59
Copy link
Collaborator

pogo59 commented Feb 10, 2021

I'm getting the impression that cross-CU DIE emission is being done in an
ad-hoc on-demand way, and keeping the contexts straight runs into all sorts
of corner cases.
Would we be better off reorganizing it so we emit one CU at a time, and
keep some sort of side/work list of 'stuff some other CU needs'?
This would require deferring filling in the actual value of the reference
because we don't know the section offset yet; either as something to be
backpatched when we get around to emitting the other DIE, or by using
relocations and letting the linker deal with it.
The cross-CU references would all be DW_FORM_ref_addr so the size is known.

@dwblaikie
Copy link
Collaborator

I'm getting the impression that cross-CU DIE emission is being done in an
ad-hoc on-demand way, and keeping the contexts straight runs into all sorts
of corner cases.
Would we be better off reorganizing it so we emit one CU at a time,

Practically, we can't - CUs are built as functions are code generated - and not all the functions for a given CU are grouped together. I guess we could change that, but it'd be a significant redesign of DWARF building code & I'm not sure it'd significantly improve this issue.

and
keep some sort of side/work list of 'stuff some other CU needs'?
This would require deferring filling in the actual value of the reference
because we don't know the section offset yet;

Perhaps an aside, but FWIW, that part of the architecture already exists - we don't compute the DIE offsets immediately (because we go back and add more attributes to DIEs later on, for instance - so we can't know offsets) - so when we created a cross-DIE reference, one DIE's attribute refers/points to the other DIE then later on when they're emitted it's two-pass, one commits offsets (says "we're not changing anything, so figure out how big everything is/what the offsets are) and then emitting the bytes where the attribute emission can query the final offset of the DIE being referred to and use that.

either as something to be
backpatched when we get around to emitting the other DIE, or by using
relocations and letting the linker deal with it.

There are some other benefits if we were to move to using relocations for a lot more DWARF output, including for DIE offsets. (would make for more editable/legible DWARF assembly) but probably not much difference to this situation.

Sitrep: D94976 continues to run into trouble. There's an increased risk that
this won't make llvm-12. Technically this isn't a regression because the
buggy behaviour is in llvm-11, but emitting broken DWARF is far from ideal,
of course.

Yep, I think it'd be best to not try to push this into LLVM 12 - as we've seen, there's a lot of nuance in this sort of code.

@dwblaikie
Copy link
Collaborator

Oh, I shuold also say it'd be complicated to emit a whole CU in a standalone way - in general CU IR Metadata doesn't contain a comprehensive list of its contents. In the interests of reducing/eliding dead DWARF, most of the entities refer to the CU as their context, but without the CU listing/knowing about that item.

eg: a simple inlining.
CU1 and CU2 metadata exist but refer to no entities.
SP2 refers to CU2 as its unit.
SP1 refers to CP1 as its unit.
There's an llvm::Function func2.
func2's subprogram field refers to SP2.
some instructions in func2 have a debug location that refers to SP1

So the only way SP1 is reachable is through func2

@tstellar
Copy link
Collaborator

What's the status of this bug? Have the problems with the fix been resolved?

@jmorse
Copy link
Member Author

jmorse commented Feb 16, 2021

Hi Tom,

What's the status of this bug? Have the problems with the fix been resolved?

They have not, and it seems unlikely that it'll be sufficiently addressed and tested in time for 12.0.

Technically this isn't a regression -- the behaviour behind this bug was present in llvm-11, and it appears to trigger rarely. We only ran into it on a very large C++ code base when using LTO. It can likely be left until 12.0.1, David agrees in comment 13

An alternative is to revert e08f205, which unmasked this problem, and is what we're doing downstream. I believe that would decrease the quality of DWARF5 call site information, @​aprantl @​vsk would you be able to give an opinion on how you'd weight reverting versus sticking with the current behaviour?

@tstellar
Copy link
Collaborator

This bug was not resolved in time for the 12.0.0 release, so it will have to wait for 12.0.1.

If you feel this is a high-priority bug that should be fixed for 12.0.0, please re-add release-12.0.0 to the Blocks field and add a comment explaining why this is high-priority.

@kyulee-com
Copy link
Contributor

This bug still repros in the latest upstream compiler
We've seen this case (the same verification error) frequently in LTO build for large applications with XCode.
I don't have a cutdown but backing out e08f205 / D70350 seems to fix the issue for us. I wonder whether there is a follow-up on this bug?

@vedantk
Copy link
Collaborator

vedantk commented Jul 9, 2021

To respond to Jeremy's question (from comment #​16): reverting D70350 might regress the debugging experience for some of our low-level projects that build with LTO, but it's not something to rule out, esp. if D94976 won't land any time soon (and, of course, given that malformed DWARF is no good).

@jmorse
Copy link
Member Author

jmorse commented Jul 15, 2021

I suspect D94976 isn't going anywhere soon -- there seems to be a significant amount of context encoded in "the current unit" that I don't fully understand yet.

FTR, we've reverted D70350 downstream as it affects a codebase that's important to us. I'd suggest that we revert in llvm main, seeing how it's affecting more than one person/org, and broken DWARF is more of a blocker than reduced debug experience (alas).

@jmorse
Copy link
Member Author

jmorse commented Jul 29, 2021

Uploaded a revert-and-regression-test patch at https://reviews.llvm.org/D107076

@jmorse
Copy link
Member Author

jmorse commented Aug 23, 2021

Ultimately this was reverted in d4ce9e4 and cherry-picked into llvm-13 via bug 51426.

@tstellar
Copy link
Collaborator

mentioned in issue #48661

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 11, 2021
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugzilla Issues migrated from bugzilla llvm:codegen wrong-debug
Projects
None yet
Development

No branches or pull requests

6 participants