Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LLVM upgrades #7115

Closed
wants to merge 6 commits into from
Closed

Conversation

alexcrichton
Copy link
Member

This is a reopening of #6713

This is still blocked on windows failures. I'll re-push try once the existing crisis has passed.

@alexcrichton
Copy link
Member Author

As before, it fails with a mysterious failure when running passes.

If I rebase brson/llvm/wip on top of chapuni/llvm/master (~a month of changes), there's a silent error 5 failure.

Not really sure what's going on here. If we don't upgrade LLVM, we don't get some rusti JIT fixes. If we do upgrade LLVM, windows breaks entirely. I don't have access to any windows machines, so I can't track it down to see what's going on :(

@alexcrichton
Copy link
Member Author

To keep this updated, I've rebased on top of llvm-3.3 and pointed it at my own fork (one extra patch of mine + rebased patches from brson's branch).

A recent pass through try-win and there's still a segfault immediately in stage2.

@alexcrichton
Copy link
Member Author

After doing some more debugging, the failure only happens when optimizations are enabled, and it's currently a segfault in llvm::MCAssembler::Finish where gdb didn't give more of a stack trace, so I suspected stack clobbering.

Sure enough, I disabled fast_ffi support in rustc, and it compiled just fine on windows (presumably tests passing as well). I'm not too familiar with how fast_ffi works, nor if this would be related to its integration with LLVM or not.

@pcwalton, I think you implemented fast_ffi, so perhaps this is just a rebase gone awry with recent LLVM versions? Also @brson would you might have info as well?

bors added a commit that referenced this pull request Jul 5, 2013
This is a reopening of #6713

This is still blocked on windows failures. I'll re-push try once the existing crisis has passed.
@alexcrichton
Copy link
Member Author

It appears my understanding of fast_ffi is wrong. This definitely appears to be stack clobbering, however: http://buildbot.rust-lang.org/builders/auto-linux-32-opt/builds/275/steps/compile/logs/stdio

@vadimcn
Copy link
Contributor

vadimcn commented Aug 1, 2013

@alexcrichton: I think I've tracked down this segfault. Here's the callstack I am getting: https://gist.github.com/vadimcn/9e9d5333a99f08cddfa5
If I am interpreting the stk_seg structure in question correctly, the size of the stack segment was only 38912 bytes (and stk->is_big is false as well). The [fast_ffi] annotation should have caused LLVMRunPassManager to be executed on a large stack, correct? If so, there's something wrong with stack switching.

Also, building with RUST_MIN_STACK=102400 works without any issues, which also hints towards stack being too small.

@alexcrichton
Copy link
Member Author

It turns out that the problem was that the fixedstacksegment attribute wasn't actually getting set on the functions, accounting for the stack clobbering hilarity that ensued.

The reason that it wasn't being set is caused by #8199. The FixedStackSegment enum attribute was set with a discriminant of 1 << 41 and it was then rewritten to be 0. When we told llvm to set the attribute of 0, it happily said "ok, it's set!" and nothing was actually set.

I've added a workaround in the meantime, but @thestinger mentioned that it would probably be a good idea to update to the latest llvm in the meantime, so I'm going to attempt to do that tonight. If that turns out to be more trouble than its worth, then at least the state right now should be a working state!

@alexcrichton
Copy link
Member Author

As of right now, the unoptimized version of rustc is failing on an unoptimized test, and can be reproduced with this test case:

fn main() {
    test_nd_dr((8,  3), 2);
}

fn test_nd_dr((n, d): (i8,i8), r: i8) {
    let separate_div_rem = (n / d, n % d);
    let combined_div_rem = n.div_rem(&d);

    let (_, b) = separate_div_rem;
    assert_eq!(b, r);
}

Still trying to figure out why this is failing. The odd part is that it doesn't fail if built with optimizations...

@luqmana
Copy link
Member

luqmana commented Aug 3, 2013

After a bunch of investigation, it seems like LLVM is wrong. I first looked at the IR (-S --emit-llvm) and while it was overly verbose it was not wrong. Similarly for the assembly (-S) (or so I thought at the time). Then after stepping through the code with gdb, we (@zecozephyr [jj2baile on irc] & I) noticed that the emitted machine code was different from the assembly output -S. Turns out the assembly had movb %ah, (%r8) which is not a valid x86 instruction, since it is impossible to use the 8-bit registers ah, bh, ch, and dh in an instruction along with any register that requires a REX prefix. They get remapped to different registers (spl, dil, bpl, and sil), and hence we get movb %spl, (%r8).

The relevant bit:

idivb   %sil                    ; n % d
movq    -72(%rbp), %r8                   
movb    %ah, (%r8)              ; store the remainder

So we manually replaced the offending machine code in the binary (after adding a few nops with asm!) with:

idivb   %sil                    ; n % d
movq    -72(%rbp), %r8          
shrw    $8, %ax                 ; move remainder to lower 8 bits
movb    %al, (%r8)              ; store the remainder

and that worked. So, there are at least 2 problems. One is that it generates invalid assembly and the other is that it generates valid machine code that's semantically incorrect. (i.e. replacing %ah with %spl thereby storing random garbage.)

So basically, it seems like an LLVM codegen bug.

@zecozephyr
Copy link
Contributor

See the second or so paragraph of: http://x86asm.net/articles/x86-64-tour-of-intel-manuals/#General-purpose-Registers

That is probably how this came about. Attempting to use the REX prefix to address a register in [r8, r15] whilst simultaneously specifying %ah. This changes the meaning of the instruction in this case, so instead we get %spl.

@luqmana
Copy link
Member

luqmana commented Aug 3, 2013

From #llvm:

< Luqman> We seem to be running into a weird codegen issue upon trying to upgrade the llvm build we use with rust. we noticed this because one of our tests were failing upon trying to upgrade
< Luqman> looking at the llvm IR & assembly output it seems ok (if a bit verbose) at first https://gist.github.com/luqmana/6146443
< Luqman> until you notice that there's one instruction in the assembly output, movb %ah, (%r8)
< Luqman> which is not a valid x86 instruction as i understand it, and disassembling the binary it becomes mov %spl, (%r8)
< Luqman> which, while valid, is semantically wrong
...
< roxfan> Luqman: yep, looks like a bug to me
< roxfan> assembler shouldn't accept it either

@alexcrichton
Copy link
Member Author

Wow, you guys are awesome! From what I had been looking at that completely makes sense. I've lost the binary that was failing for me so I can't check that it was that exact instruction, but it was definitely %al wasn't being saved after the idiv.

Regardless, this may get a bit difficult now. It looks like the release-3.3 of llvm isn't going to work for us, and the current head also doesn't work for us. We'll need to find a good LLVM commit to rebase our own patches on top of...

@luqmana
Copy link
Member

luqmana commented Aug 4, 2013

So after bisecting between alexcrichton/llvm@a121e24 and alexcrichton/llvm@b1adb4d the culprit seems to be llvm-mirror/llvm@5012548. Which makes sense, since that patch adds support for div/idiv to x86 FastISel but doesn't teach it to handle using %AH properly in a REX instruction.

@luqmana
Copy link
Member

luqmana commented Aug 4, 2013

So, it turns out we were hitting http://llvm.org/bugs/show_bug.cgi?id=16105 which has already been fixed upstream (llvm-mirror/llvm@cc64dc6).

* LLVM now has a C interface to LLVMBuildAtomicRMW
* The exception handling support for the JIT seems to have been dropped
* Various interfaces have been added or headers have changed
At the same time create a more robust wrapper to try to prevent this type of
issue from cropping up in the future.
@alexcrichton
Copy link
Member Author

To keep this updated, bors bounced LLVM head as of last night due to some assertion being tripped compiling libstd, so I'm now going to go back to trying to get it working with 3.3

bors added a commit that referenced this pull request Aug 4, 2013
This is a reopening of #6713

This is still blocked on windows failures. I'll re-push try once the existing crisis has passed.
@bors bors closed this Aug 4, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants