Add failing example with regex #56
329a7a5 to 3a54f05
One other data point I found, when running in Citra the log gets spammed with a ton of these:
The calls that are failing:
Which correspond to |
I haven't looked closely yet, but stack overflow maybe? |
Huh, good hunch! I rebuilt with So I guess the question is – what's the best way to document this or avoid it in the future? I could see this being an easy stumbling block for users to hit – should we just recommend |
I believe there are ways to actually lock the opt-level to a minimum (or at least a default value). My question is whether we actually want to do that? We absolutely will need to have a Wiki at some point, so this could just be something to add to it. The stack size isn’t variable, not in ctru’s standards at least. |
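One way the "minimum opt-level" idea could be approached is a build script that inspects the `OPT_LEVEL` environment variable Cargo passes to build scripts. This is only a sketch: the `check_opt_level` helper and the `MIN_OPT_LEVEL` constant are hypothetical names, and it warns rather than actually locking anything.

```rust
use std::env;

/// Minimum optimization level accepted by this (hypothetical) policy.
const MIN_OPT_LEVEL: u32 = 1;

/// Returns Ok if `level` (the value of Cargo's OPT_LEVEL) meets the minimum.
/// "s" and "z" are size-optimized levels and are treated as acceptable here.
fn check_opt_level(level: &str) -> Result<(), String> {
    match level {
        "s" | "z" => Ok(()),
        other => match other.parse::<u32>() {
            Ok(n) if n >= MIN_OPT_LEVEL => Ok(()),
            Ok(n) => Err(format!("opt-level {n} is below the minimum {MIN_OPT_LEVEL}")),
            Err(_) => Err(format!("unrecognized opt-level {other:?}")),
        },
    }
}

fn main() {
    // Cargo sets OPT_LEVEL for build scripts; it is absent when run by hand.
    match env::var("OPT_LEVEL") {
        Ok(level) => {
            if let Err(msg) = check_opt_level(&level) {
                // Lines starting with cargo:warning= are shown in Cargo's output.
                println!("cargo:warning={msg}; debug builds may overflow the stack");
            }
        }
        Err(_) => println!("OPT_LEVEL not set; skipping check"),
    }
}
```

Emitting a warning keeps `opt-level = 0` usable for anyone who needs it; replacing the `println!` with a `panic!` would turn the check into a hard failure instead.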
I'm still interested in debugging the crash to get a real root cause, since the stack overflow thing was just the first thing that comes to mind when I hear of memory issues :). |
Hmm, big problems. I tried running a very simple test with the
Edit: even weirder, the error isn’t raised in the main from an overflow, but in the
Edit 2: I don’t even know if what I’m doing makes sense, but the closing of the app is in
Edit 3: found absolutely nothing. I don't have the expertise to look into the built executable, but I can tell the problem only arises when a specific function (which has nothing to do with system calls) is linked. It may be an issue with something behind the scenes, but I don't know. I'll just put this off for now; tell me if you find anything relevant to your issue. |
Related to my issue: the
In this text, the only reason I can see for the mistake is Edit: @AzureMarker @ian-h-chamberlain I pushed my simple example in the |
I see all the same issues with
Of course, after borrowing, the borrow state gets changed, but the value itself stays zero (second green marker): I did check the Also, do we need this commit in
|
Actually, it looks like it might be trying to write above/before the stack limit? See how some of the registers start with 0x800. |
Yes we do, as it enables std’s thread locals implementation. I don’t have your compilation problems, are you sure your toolchain is fine? |
Yeah, I noticed this as well in my testing last night, and I think it's a good thread to pull on. I found these two tools that might help us track it down, but also might need a little extra work to build for our target: Re: AzureMarker/rust-horizon@59aacda – I was under the impression it was only needed in the branches where we have |
Ok, I was able to get Here's what it's emitting for debug mode as the largest stack sizes (I just took the top several):
No Release mode for comparison is different, but doesn't point to much for
Not much to go off, but an extra couple of data points at least... |
I’ve tried editing the memory to be that way before the read, but the result is the same. These two issues we are having both have in common only the difference between What are other differences, in a fully clean and working environment, between these two build modes? In my specific case, my issues seem to be generated once a specific function call is present (and specifically not called) in the main function. The stack isn’t yet affected by my work (being the case where my main doesn’t even run), and it looks to be barely a difference in the stack of the @ian-h-chamberlain could you tell me the size of the final executable sent to 3dslink? |
@Meziu what's the function that triggers the issue for you when it gets linked in? |
b417e32#diff-a02ab88c764a3d8abd0605777eb0ae6bde90d3c1c8d81ff972f8e4c9447fc6eaR19 This line breaks my program, yet it doesn't ever run, since the abort happens in the |
In my case, the "bad" debug build is also a bit larger, but that is to be expected for debug vs release builds, I would think?
|
Interesting... my example in https://github.com/ian-h-chamberlain/console_3ds that I mentioned in #58 seems to have this issue as well. I was running it fine (albeit slow) in debug mode before, but now I get a crash in debug mode. I haven't changed toolchains or anything. One thing I did try changing was the The backtrace, interestingly has a null pointer for
This seems a bit strange to me, does it match anything you've seen in your crashes? I have to wonder if we're introducing undefined behavior somewhere in In this case the executables are way more different in size:
I will still continue to investigate... |
I've been looking at the disassembly ( The issue is immediate to the eye: @AzureMarker @ian-h-chamberlain Any ideas at all as to why it could be happening? This may be the same issue as the Since this looks to be related to the compiler/LLVM, we could try opening a Zulip thread. |
My first thought is that it got replaced with an (was busy this past week - sorry for the silence) |
It would be interesting to look at the MIR for that code. I'll try to work on that soon. |
Great. Still, I don’t think it is an abort, as the code never aborts in my example, and it wouldn’t be able to know whether it needs to panic at compile-time. |
Yeah, I was thinking it might be useful to look at the generated IR as well ( Regarding the abort call, I would expect a I also wondered if this might simply be inline data rather than instructions? as it seems like |
I don’t know what it actually means by |
Info I forgot to mention: trying out different |
Here's the MIR for
And the LLVM-IR for good measure:
Edit: and the assembly from Rust:
|
Hmm, I’ve tried doing some research in the @AzureMarker Let’s open a Zulip thread about this. (I’m busy today, so you can do it if you want). |
I've been a bit busy as well, but there's also a few things I want to try before making a thread (we also need to make some sort of summary so they know what we're talking about). For example, maybe the 3DS has a really small thread local store? I also want to inspect the binaries to see if the initial state of the thread locals is the same. Edit: I see @ian-h-chamberlain has already started a thread for the regex issue (with no replies): |
Even if that was true, it’d have no importance in this issue. The problem is caused before any actual I understand wanting to search for other possibilities ourselves (and am pleased by any help you bring in), but for me this research hit a wall, so I would find external help very useful. |
Yeah, if you two want to bump or add to that discussion there's always a chance someone else will notice it and chime in (probably scrolled off the backlog for most people by now). Sorry for radio silence from my end recently, but meanwhile I got my example down to this (almost minimal, there might be a little more I could do) reproduction:
use regex::RegexBuilder;
const RE: &str = r"(?P<key>.+)=(?P<value>.+)";
fn main() {
pthread_3ds::init();
linker_fix_3ds::init();
let builder = RegexBuilder::new(RE);
let _regex = builder.build();
}
I've been trying to debug the LLVM IR following the rustc dev guide and have at least gotten to a point where some N number of LLVM optimization passes generates a segfault but MAX_N does not (for I'm not sure in this case it has anything to do with thread-locals, but I'll see where the investigation leads me. Unfortunately it seems unusual that a miscompilation might be fixed by an optimization pass rather than caused by it, but that's currently my best guess for what's happening. Hoping to get a chance to nail it down this weekend, will post back here once I've tried. |
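The search over "N passes" described above amounts to a binary search on the pass limit. Here is a rough sketch of that search in isolation; the `crashes` closure is a hypothetical stand-in for "build with that limit and see whether it segfaults", and the numbers are made up. It also assumes the crash disappears at a single threshold and never comes back, which may not hold in practice.

```rust
/// Finds the smallest limit in `lo..=hi` for which `crashes(limit)` is false,
/// assuming `crashes` is true below that point and false from it onwards.
fn first_passing_limit(mut lo: u32, mut hi: u32, crashes: impl Fn(u32) -> bool) -> u32 {
    // Invariant: the answer lies in lo..=hi (so `hi` itself must not crash).
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if crashes(mid) {
            // Still crashing with `mid` passes: the answer is above mid.
            lo = mid + 1;
        } else {
            // No crash with `mid` passes: the answer is mid or below.
            hi = mid;
        }
    }
    lo
}

fn main() {
    // Hypothetical predicate: pretend the crash goes away once 1000 passes run.
    let crashes = |limit: u32| limit < 1000;
    let n = first_passing_limit(0, 5000, crashes);
    println!("first non-crashing limit: {n}");
}
```

In a real run each call to the predicate would be a full rebuild, so the logarithmic number of steps matters.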
Yeah, it looks to be more or less the same issue, differing only in where the problem generates from. Quite interesting indeed. |
Ok, I've made some progress! I was able to find an optimization pass after which
FLAGS=(
-C opt-level=1
-C debuginfo=0
-Z verify-llvm-ir=yes
)
Final bisect was
Rustc invocation that failed with SIGSEGV:
rustc \
--crate-name regex \
--edition=2018 /Users/ianchamberlain/.cargo/registry/src/github.com-1ecc6299db9ec823/regex-1.5.5/src/lib.rs \
--error-format=json \
--json=diagnostic-rendered-ansi,future-incompat \
--crate-type lib \
--emit=dep-info,metadata,link \
-C embed-bitcode=no \
-C debuginfo=2 \
--cfg 'feature="aho-corasick"' \
--cfg 'feature="default"' \
--cfg 'feature="memchr"' \
--cfg 'feature="perf"' \
--cfg 'feature="perf-cache"' \
--cfg 'feature="perf-dfa"' \
--cfg 'feature="perf-inline"' \
--cfg 'feature="perf-literal"' \
--cfg 'feature="std"' \
--cfg 'feature="unicode"' \
--cfg 'feature="unicode-age"' \
--cfg 'feature="unicode-bool"' \
--cfg 'feature="unicode-case"' \
--cfg 'feature="unicode-gencat"' \
--cfg 'feature="unicode-perl"' \
--cfg 'feature="unicode-script"' \
--cfg 'feature="unicode-segment"' \
-C metadata=d43bcfb0bfe156b4 \
-C extra-filename=-d43bcfb0bfe156b4 \
--out-dir /Users/ianchamberlain/Documents/Development/3ds/crash-repro/target/armv6k-nintendo-3ds/debug/deps \
--target armv6k-nintendo-3ds \
-L dependency=/Users/ianchamberlain/Documents/Development/3ds/crash-repro/target/armv6k-nintendo-3ds/debug/deps \
-L dependency=/Users/ianchamberlain/Documents/Development/3ds/crash-repro/target/debug/deps \
--extern aho_corasick=/Users/ianchamberlain/Documents/Development/3ds/crash-repro/target/armv6k-nintendo-3ds/debug/deps/libaho_corasick-32b97d304112fb85.rmeta \
--extern memchr=/Users/ianchamberlain/Documents/Development/3ds/crash-repro/target/armv6k-nintendo-3ds/debug/deps/libmemchr-43ddaf7574a56279.rmeta \
--extern regex_syntax=/Users/ianchamberlain/Documents/Development/3ds/crash-repro/target/armv6k-nintendo-3ds/debug/deps/libregex_syntax-7ba7dfda0b6f2707.rmeta \
--cap-lints allow \
-C opt-level=1 \
-C debuginfo=0 \
-Z verify-llvm-ir=yes \
-C llvm-args=-opt-bisect-limit=294912
I've been trying to follow some of the steps in https://gist.github.com/luqmana/be1af5b64d2cda5a533e3e23a7830b44 to debug further, but haven't had too much luck with the emitted llvm-ir ( However, I did also find something else interesting: the issue does not reproduce if I set I found this resource which references cross-language LTO in Firefox (which I think is built with LLVM). In particular this paragraph seems relevant:
In the case of I suspect, and hope to prove, that building std using
I'll keep trying to prove my theory more precisely, but meanwhile if @Meziu @AzureMarker you can try building with either |
Thanks for the update. I just tried both of those RUSTFLAGS values in debug mode and it didn't fix the rapier-physics issue for me :/. |
Looks like it doesn’t. I haven’t looked into the inner workings for any differences, but just running the program yields the same behaviour. |
After a bit more testing, I think the codegen units / LTO may be a red herring after all. I tried rebuilding std with the same flags and it didn't seem to matter, so I might go back to the drawing board and try to bisect the hard way. It is interesting that I was able to bisect a segfault in rustc but seems like it's not directly related to the codegen issue we are seeing. |
Ok, got some more details! I think this might be the real cause of the difference between debug + release mode builds. After bisecting, I found that the segfault occurred only when a
Before stack-coloring
# Machine code for function _ZN12regex_syntax3ast5parse16ParserI$LT$P$GT$19parse_with_comments17hb84b2d14834d7bfbE: IsSSA, TracksLiveness
Frame Objects:
fi#0: size=1, align=4, at location [SP]
fi#1: size=60, align=8, at location [SP]
fi#2: size=60, align=8, at location [SP]
fi#3: size=60, align=8, at location [SP]
fi#4: size=60, align=8, at location [SP]
fi#5: size=8, align=4, at location [SP]
fi#6: size=8, align=4, at location [SP]
fi#7: size=132, align=8, at location [SP]
fi#8: size=144, align=8, at location [SP]
fi#9: size=64, align=4, at location [SP]
fi#10: size=60, align=8, at location [SP]
fi#11: size=36, align=8, at location [SP]
fi#12: size=136, align=4, at location [SP]
fi#13: size=132, align=8, at location [SP]
fi#14: size=132, align=8, at location [SP]
fi#15: size=68, align=8, at location [SP]
fi#16: size=56, align=8, at location [SP]
fi#17: size=132, align=8, at location [SP]
fi#18: size=36, align=8, at location [SP]
fi#19: size=68, align=4, at location [SP]
fi#20: size=64, align=8, at location [SP]
fi#21: size=12, align=8, at location [SP]
fi#22: size=36, align=8, at location [SP]
fi#23: size=68, align=4, at location [SP]
fi#24: size=64, align=8, at location [SP]
fi#25: size=12, align=8, at location [SP]
fi#26: size=36, align=8, at location [SP]
fi#27: size=68, align=4, at location [SP]
fi#28: size=64, align=8, at location [SP]
fi#29: size=12, align=8, at location [SP]
fi#30: size=36, align=8, at location [SP]
fi#31: size=68, align=4, at location [SP]
fi#32: size=64, align=8, at location [SP]
fi#33: size=132, align=8, at location [SP]
fi#34: size=132, align=4, at location [SP]
fi#35: size=128, align=8, at location [SP]
fi#36: size=128, align=8, at location [SP]
fi#37: size=36, align=8, at location [SP]
fi#38: size=68, align=4, at location [SP]
fi#39: size=64, align=8, at location [SP]
fi#40: size=36, align=8, at location [SP]
fi#41: size=68, align=4, at location [SP]
fi#42: size=64, align=8, at location [SP]
fi#43: size=36, align=8, at location [SP]
fi#44: size=68, align=4, at location [SP]
fi#45: size=64, align=8, at location [SP]
fi#46: size=24, align=8, at location [SP]
fi#47: size=36, align=8, at location [SP]
fi#48: size=24, align=4, at location [SP]
fi#49: size=4, align=4, at location [SP]
After stack-coloring
# Machine code for function _ZN12regex_syntax3ast5parse16ParserI$LT$P$GT$19parse_with_comments17hb84b2d14834d7bfbE: IsSSA, TracksLiveness
Frame Objects:
fi#0: dead
fi#1: dead
fi#2: dead
fi#3: dead
fi#4: dead
fi#5: dead
fi#6: dead
fi#7: size=132, align=8, at location [SP]
fi#8: size=144, align=8, at location [SP]
fi#9: dead
fi#10: dead
fi#11: dead
fi#12: dead
fi#13: dead
fi#14: size=132, align=8, at location [SP]
fi#15: dead
fi#16: size=56, align=8, at location [SP]
fi#17: dead
fi#18: dead
fi#19: dead
fi#20: dead
fi#21: dead
fi#22: dead
fi#23: dead
fi#24: dead
fi#25: dead
fi#26: dead
fi#27: dead
fi#28: dead
fi#29: dead
fi#30: dead
fi#31: dead
fi#32: dead
fi#33: dead
fi#34: dead
fi#35: dead
fi#36: dead
fi#37: dead
fi#38: dead
fi#39: dead
fi#40: dead
fi#41: dead
fi#42: dead
fi#43: dead
fi#44: dead
fi#45: dead
fi#46: dead
fi#47: size=36, align=8, at location [SP]
fi#48: dead
fi#49: dead
This optimizes down the stack frame (for this particular function) from 3169 (!) bytes to 500 bytes. Without that pass, the larger frame combined with the other stack frames seems like it must be enough to blow the stack. As far as I can tell, there's nothing actually being miscompiled here, it's just that the debug build options ( The good news is, we have a way to change the stack size at compile time!
#[no_mangle]
static __stacksize__: usize = 64 * 1024; // default is 32k
In my case, this was enough to prevent the crash! We may want a more conservative default like 2MB (the pthread default on Linux, I think) for more programs to work correctly, but I'm also not sure if it makes sense to set this in Let me know if someone is able to try this with |
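For anyone wanting to check the 3169 and 500 byte figures, they match the sum of the `size=` fields in the two frame listings above. Here is a small sketch that does the sum for a dump in that format; `total_frame_size` is a hypothetical helper name, and it ignores alignment padding, so it is only a rough estimate of the real frame.

```rust
/// Sums the `size=` fields of the live frame objects in a dump like the ones above.
fn total_frame_size(dump: &str) -> u32 {
    dump.lines()
        .filter_map(|line| {
            // Lines look like "fi#7: size=132, align=8, at location [SP]".
            // Lines such as "fi#0: dead" have no size field and are skipped.
            let rest = line.split("size=").nth(1)?;
            let digits = rest.split(',').next()?;
            digits.trim().parse::<u32>().ok()
        })
        .sum()
}

fn main() {
    // The live objects from the "After stack-coloring" listing, plus one dead entry.
    let after = "\
fi#0: dead
fi#7: size=132, align=8, at location [SP]
fi#8: size=144, align=8, at location [SP]
fi#14: size=132, align=8, at location [SP]
fi#16: size=56, align=8, at location [SP]
fi#47: size=36, align=8, at location [SP]";
    println!("{} bytes", total_frame_size(after)); // prints "500 bytes"
}
```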
Well, having it at 2MB makes sense since it is also the stack size a new thread would have. I’ll try running Edit: nope, nothing. The result is exactly the same... We should try debugging |
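As a point of comparison with the new-thread stack size mentioned here, stack-heavy work can also be moved onto a thread whose stack size is requested explicitly through `std::thread::Builder`. This is a sketch that runs on a desktop host; whether it behaves the same on the 3DS depends on the thread support in the toolchain branch, which is an assumption. `run_with_big_stack` and `stack_heavy` are hypothetical names.

```rust
use std::thread;

/// Stack size requested for the worker thread (2 MiB).
const STACK_SIZE: usize = 2 * 1024 * 1024;

/// Runs `f` on a freshly spawned thread with a STACK_SIZE stack and returns its result.
fn run_with_big_stack<T, F>(f: F) -> T
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::Builder::new()
        .stack_size(STACK_SIZE)
        .spawn(f)
        .expect("failed to spawn thread")
        .join()
        .expect("worker thread panicked")
}

/// Uses a 64 KiB local buffer so that the stack frame is deliberately large.
fn stack_heavy() -> usize {
    let buf = [1u8; 64 * 1024];
    buf.iter().map(|&b| b as usize).sum()
}

fn main() {
    let total = run_with_big_stack(stack_heavy);
    println!("{total}"); // prints 65536
}
```

This does not change the main thread's stack, so it would only help code that can be moved off the main thread.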
Hmm, I tried the Should we open a separate issue / discussion to address that issue, since the change in #59 doesn't seem to resolve it? This PR can probably be closed, at least. |
Hey @AzureMarker and @Meziu !
I found a strange example where I can reliably reproduce a segfault with seemingly safe code, and I'm wondering if you have any ideas to help debug it, as I've gotten a bit stuck.
Running gdb attached to a device I get this as the location of the crash: https://github.com/rust-lang/regex/blob/master/src/exec.rs#L299
GDB backtrace
The self pointer seems to be optimized out at the call site, so I'm wondering what kind of address violation might be happening here... perhaps UB introduced by some call before the regex builder?
Let me know if you have any ideas or suggestions for tracking down this kind of issue. I couldn't find any references to issues like this upstream, etc. so I'm a bit stumped.
Edit: Note, I'm using a rebased version of feature/horizon-threads onto latest upstream master, but it seemed to reproduce on a different toolchain version I had as well. Going to try and rebuild my toolchain again to see if it makes any difference, in case this is a miscompilation kind of thing...