
Fix crashing core.thread.fiber unittest for AArch64. #4648

Merged
merged 3 commits into ldc-developers:master from JohanEngelen:aarch64_fiber on May 10, 2024

Conversation

JohanEngelen
Member

Related to #4613

@kinke I think it is better to disable individual unittests like this than to disable the whole file in the CI script. I spent quite some time trying to debug this failing test, thinking it was due to musl libc, but instead I found out from #4613 that it is a general AArch64 issue (indeed, it also fails on my Mac). I would have liked to see that in the source file, rather than hidden in CI configuration, similar to how we tag known failures of lit tests inside the actual lit test. This simply ensures that users (and I ;-)) can expect to run the testsuite successfully as normal (without having to know an exclude list).

@kinke
Member

kinke commented May 6, 2024

Please note that some of these kludges date back to when druntime was a separate submodule, so that changes weren't this trivial. It's also not that long ago that we gained a way to know whether we are compiled with optimizations on (version D_Optimized), which can now be used to skip tests only when optimizations are enabled. This one here you'd currently skip for debug tests too (plus it doesn't fail consistently, only sporadically IIRC).
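
For illustration, a minimal sketch of such gating (version (D_Optimized) and version (AArch64) are real predefined identifiers; the helper version name and the elided test body are made up for this sketch):

```d
// Skip only optimized AArch64 runs; debug runs keep executing the test.
version (AArch64) version (D_Optimized)
    version = SkipFlakyFiberTest; // hypothetical name

version (SkipFlakyFiberTest) {} else
unittest
{
    // ... fiber test body elided ...
}
```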

Sometimes I have no idea which unittest fails for some module; that was the case here IIRC, with no usable info in the CI logs. And then it's often not clear whether it's due to some particular CI config, a specific LLVM version (the math gammafunction unittests on macOS arm64 now being green since LLVM 18), etc.

I'm not a huge fan of disabling genuinely failing tests though: they are real failures and might be encountered in production too, and if there's just a little output line in a wall of log lines, I think it's easy to overlook it and just assume that it's all working. Edit: instead of, in the best case, trying to help fix the last few failures. :)

@JohanEngelen
Member Author

It was not meant as a complaint, just trying to improve things / add knowledge of exactly which test is failing. We should upstream this change, such that other druntime authors also see that the test is failing (I think the problem is migrating threads on a platform for which migration is bad).

In any case, it would already have helped me a lot if I had known that core.thread.fiber is broken for AArch64. We should document that inside the file, rather than in a CI script. That can be done like in this PR, or with a comment plus disabling in the CI script.

@JohanEngelen
Member Author

One benefit of this selective test disabling (vs. disabling the whole file in the CI script) is that all the other tests in the file are still run, so further regressions are noticed.

@kinke
Member

kinke commented May 6, 2024

> We should upstream this change, such that other druntime authors also see that the test is failing (I think the problem is migrating threads on a platform for which migration is bad).

I'm seeing this from the perspective of a package manager / power user building and running the tests on a selected platform, who's just interested in seeing if all is green (edit: well, and what/how much is failing), not in looking for little 'hey, test is disabled' output lines to figure out that something like migrating fibers may not actually work on that platform.

The only reason for these CI exceptions is to be notified of further regressions. Being able to narrow it down to individual tests is / would be nice, but ideally really only for CI, not for package managers etc. looking for proper tests.

And without restricting this to D_Optimized as mentioned above (and ideally to LDC for visibility, unless really upstreaming this), this unittest wouldn't run for debug tests anymore, which it currently does (note the $ at the end of the regex, so that only the release tests are excluded).

@JohanEngelen
Member Author

> > We should upstream this change, such that other druntime authors also see that the test is failing (I think the problem is migrating threads on a platform for which migration is bad).

> I'm seeing this from the perspective of a package manager / power user building and running the tests on a selected platform, who's just interested in seeing if all is green (edit: well, and what/how much is failing), not in looking for little 'hey, test is disabled' output lines to figure out that something like migrating fibers may not actually work on that platform.

This is only hypothetical, no? I don't think there are many (if any) such package managers, and they just have to accept that LDC/DMD/… is half broken and disable the test without knowing what is actually broken. Case in point: at Weka we use LDC for AArch64; fibers are used, but not migrated, which for your argument would, I guess, mean that fibers should have been banned based on the data point that the whole of core.thread.fiber is failing. More so: currently AArch64 fibers are not tested in release mode (obviously needed for performance); I was actually a bit surprised about that (my fault). However, there is no choice here: LDC must be used on AArch64, otherwise there is no product.

Nevertheless, there could be better ways of displaying which tests are known to be failing. This is one of the many shortcomings of D's built-in unittests.

> The only reason for these CI exceptions is to be notified of further regressions. Being able to narrow it down to individual tests is / would be nice, but ideally really only for CI, not for package managers etc. looking for proper tests.

In this case they shouldn't be using D ;-) (there are many tests locally disabled for certain configs)

> And without restricting this to D_Optimized as mentioned above (and ideally to LDC for visibility, unless really upstreaming this), this unittest wouldn't run for debug tests anymore, which it currently does (note the $ at the end of the regex, so that only the release tests are excluded).

Easy to fix! Will fix it.

@kinke
Member

kinke commented May 7, 2024

It all boils down to how severe a failure is/appears. You seem to see this particular unittest failure/crash as a minor issue, something that can be disabled and hidden away in some runtime output (e.g., your line isn't visible in the CI logs; additionally, there's no context, no line number, etc.). Finding out about this problem is anything but trivial for someone who wants to check the AArch64 status on some specific OS but isn't working on druntime and grepping for FIXMEs.

My gut feeling however is that this can be a severe problem - after all, this test works just fine on x86, but somehow crashes on AArch64 (at least sometimes) when enabling optimizations. Maybe it's just that the test is bad and needs a fixup, maybe it's the fiber code that forgets to save and restore some registers on AArch64, which would be very bad.

@JohanEngelen
Member Author

> It all boils down to how severe a failure is/appears. You seem to see this particular unittest failure/crash as a minor issue, something that can be disabled and hidden away in some runtime output (e.g., your line isn't visible in the CI logs; additionally, there's no context, no line number, etc.). Finding out about this problem is anything but trivial for someone who wants to check the AArch64 status on some specific OS but isn't working on druntime and grepping for FIXMEs.

You are right that it is not very discoverable, but neither are any of our test results: there are very few people who run testsuites (I have never run the testsuite of a C/C++/Python/… compiler, but I use them all the time), and I think there is no one who checks testsuite results; I'm assuming everybody just downloads the LDC compiler and that's it.
It took me several hours to discover that this test failure is already known :( so that was a pretty bad experience (I thought it was related to musl, e.g. stack size, in some way). A simple comment at the start of the unittest would at least help a little, so I'll add that in any case.
Also: the test fails only sporadically, so even if someone runs the testsuite (I'm doing it now), it is quite likely the failure won't be detected either.

> My gut feeling however is that this can be a severe problem - after all, this test works just fine on x86, but somehow crashes on AArch64 (at least sometimes) when enabling optimizations. Maybe it's just that the test is bad and needs a fixup, maybe it's the fiber code that forgets to save and restore some registers on AArch64, which would be very bad.

The test exercises migration of fibers, which is explicitly marked as unsafe by druntime: you have to explicitly call allowMigration (whose description screams: don't do this). I think the test is bad, but can't tell exactly what part is bad (apart from migrating and accessing TLS data).
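
For context, a minimal sketch (not the actual unittest) of the pattern under discussion, assuming LDC's druntime where Fiber.allowMigration is available as described above:

```d
import core.thread : Fiber, Thread;

void migrationSketch()
{
    auto fib = new Fiber({ Fiber.yield(); });
    fib.call();           // first resume, on the creating thread
    fib.allowMigration(); // explicit opt-in; the docs warn against this
    auto t = new Thread({ fib.call(); }); // second resume, on another thread
    t.start();
    t.join();
}
```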

What I'm more worried about is that we are not testing anything related to fibers on AArch64 (unoptimized tests for this stuff are not very relevant, imho). Indeed, if fibers forget to save/restore registers, that is currently not tested at all (!), whereas it could be if only this one test were disabled.

@kinke
Member

kinke commented May 7, 2024

> You are right that it is not very discoverable, but neither are any of our test results: there are very few people who run testsuites (I have never run the testsuite of a C/C++/Python/… compiler, but I use them all the time), and I think there is no one who checks testsuite results; I'm assuming everybody just downloads the LDC compiler and that's it.

That's fortunately not entirely true: #4613 :)

> It took me several hours to discover that this test failure is already known :( so that was a pretty bad experience (I thought it was related to musl, e.g. stack size, in some way). A simple comment at the start of the unittest would at least help a little, so I'll add that in any case.

I'm sorry, but again, I didn't know that it was this unittest that crashes. And IIRC, it fails pretty often on Linux, but very seldom on macOS (I've just re-added that exception for GHA after seeing some failures lately). All I knew was that this module sporadically crashes on AArch64 with -O. I agree that a comment/FIXME in the source is nice. As is hopefully clear by now, I'm only worried about the discoverability of unclear failures; in my view, there's a very strong distinction between disabling a test for CI ('required' for making it easier to discover further regressions) and disabling a test altogether.

> What I'm more worried about is that we are not testing anything related to fibers on AArch64 (unoptimized tests for this stuff are not very relevant, imho). Indeed, if fibers forget to save/restore registers, that is currently not tested at all (!), whereas it could be if only this one test were disabled.

Again, I didn't want to set up an emulator or abuse my phone to try to hunt down the problem and see which unittest fails. And I definitely didn't want to hunt it down via CI, especially if it isn't consistent.

Now that you figured out which test fails, that's progress and very much appreciated. If only we could restrict this test-exclusion to CI... ;)

@dnadlinger
Member

I just saw this scroll by – what a blast from the past.

> I think the test is bad, but can't tell exactly what part is bad (apart from migrating and accessing TLS data).

You are probably aware of this, but the issue is described here: #666. It is target-dependent, as the way TLS variables are referenced (and the way those references are emitted/optimised by the LLVM backend) depends on the target ABI.
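
To make the hazard concrete, a sketch (the variable and function names are made up; whether the address is actually cached depends on target and optimizer, which is exactly the ABI dependence described above):

```d
import core.thread : Fiber;

int counter; // thread-local by default in D

void fiberBody()
{
    counter = 1;   // the optimizer may compute this thread's TLS address once...
    Fiber.yield(); // ...the fiber may resume on a different thread here...
    counter += 1;  // ...and a cached TLS address would point into the old thread
}
```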

@dnadlinger
Member

Corollary: If the TLS caching issue does not explain the test failure (which can be verified by inspection of TLS codegen on AArch64), then the test indeed detects a real bug.

@kinke
Member

kinke commented May 7, 2024

Oh hey David! 👋 Thx for the link; I misremembered this as being an unclear issue that Joakim tried to deal with. If it's really down to TLS and that 'accidentally' working on x86, then I'd have no problem with disabling the test.

@dnadlinger
Member

Yes, but at a glance, it does not seem like that test actually relies on TLS. Thus, allowMigration() should in fact be safe, and it is a genuine issue. (I was just going off Johan's comment earlier, but I am not sure where TLS would actually be used.)

@dnadlinger
Member

dnadlinger commented May 7, 2024

There would appear to be a bug in the test code, though: This store
https://github.com/ldc-developers/ldc/blob/40ad5f7583fe90a85fd675bfc0cfc287d565c94a/runtime/druntime/src/core/thread/fiber.d#L2323C21-L2323C29
needs to have release semantics. It does by default on x86_64 as far as the processor is concerned, i.e. unless the compiler were to move it around, but not on AArch64.

Edit: Ah, wait, didn't D specify sequentially consistent ordering for shared accesses by default? That's not it, then.

@JohanEngelen
Member Author

I can quite easily test this on my Mac; it indeed does not crash often, but it usually does the first time it is compiled with slightly different settings (-O2, -O3). I already tried __gshared on fibs, which indeed does not help.

@dnadlinger
Member

fibs is stack memory anyway, though, and not TLS?

@JohanEngelen
Member Author

> fibs is stack memory anyway, though, and not TLS?

I am dumb :( staring at this for too long. Thanks :)

@kinke
Member

kinke commented May 7, 2024

Making that an atomicStore explicitly might not hurt. What might be interesting is #666 (comment), preventing optimizations for key fiber functions:

> We now have core.volatile, so I wonder if the sm_this accesses could be done through volatileLoad. So far the only workaround that works 3/4s of the time is to disable inlining of switchIn, switchOut, and getThis. Setting optimization level of these (and maybe some others) functions to -O0 has a 100% success rate, but that is a non-standard extension.
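
To make the quoted idea concrete, a heavily hedged sketch (whether a volatile read actually defeats the TLS-address caching from #666 is left open there; the 64-bit ulong overload of volatileLoad is assumed, and the names are made up):

```d
import core.thread : Fiber;
import core.volatile : volatileLoad;

Fiber tlsThis; // hypothetical stand-in for Fiber.sm_this (thread-local by default)

Fiber getThisViaVolatile()
{
    // Force a fresh read of the TLS slot on every call instead of letting
    // the optimizer reuse a value loaded before a context switch.
    return cast(Fiber) cast(void*) volatileLoad(cast(ulong*) &tlsThis);
}
```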

@JohanEngelen
Member Author

> There would appear to be a bug in the test code, though: This store https://github.com/ldc-developers/ldc/blob/40ad5f7583fe90a85fd675bfc0cfc287d565c94a/runtime/druntime/src/core/thread/fiber.d#L2323C21-L2323C29 needs to have release semantics. It does by default on x86_64 as far as the processor is concerned, i.e. unless the compiler were to move it around, but not on AArch64.

> Edit: Ah, wait, didn't D specify sequentially consistent ordering for shared accesses by default? That's not it, then.

This appears not to be the case:
```d
locks[idx] = false;            // --> store i8 0, ptr %2, align 1  (just a normal store)
locks[idx].atomicStore(false); // --> store atomic i8 0, ptr %2 seq_cst, align 1
```

However, the latter does not fix the crashes.

@JohanEngelen
Member Author

> Making that an atomicStore explicitly might not hurt. What might be interesting is #666 (comment), preventing optimizations for key fiber functions:

> > We now have core.volatile, so I wonder if the sm_this accesses could be done through volatileLoad. So far the only workaround that works 3/4s of the time is to disable inlining of switchIn, switchOut, and getThis. Setting optimization level of these (and maybe some others) functions to -O0 has a 100% success rate, but that is a non-standard extension.

switchIn and switchOut already have a pragma(inline, false). I've now also applied it to getThis. This indeed seems to fix the problem entirely on AArch64 Alpine Linux: before it was almost a 100% crash rate, now 0%.
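
For reference, a minimal sketch of that fix (the real druntime code is abbreviated; only the pragma placement matters here):

```d
class Fiber
{
    // Keep getThis out of line so the optimizer cannot cache its
    // TLS-derived result across a fiber context switch.
    pragma(inline, false)
    static Fiber getThis() nothrow
    {
        return sm_this;
    }

    private static Fiber sm_this; // thread-local: one slot per OS thread
}
```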

@kinke
Member

kinke commented May 7, 2024

Oh wow, progress. I was expecting we'd need @optStrategy("none") according to Iain's comment. After a quick glance, it doesn't look like there's much to gain by optimizing these functions anyway, and there are comments saying this has to be executed in exactly this order etc.
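
A hedged sketch of that heavier alternative, using LDC's @optStrategy UDA from ldc.attributes (the function name is hypothetical):

```d
import ldc.attributes : optStrategy;

@optStrategy("none") // compile this one function without optimizations
void fiberSwitchHelper()
{
    // ... context-switch bookkeeping that must not be reordered ...
}
```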

@dnadlinger
Member

It seems like my original workaround, which included a comment (ldc-developers/druntime@db7f8ee), was dropped at some point.

@kinke
Member

kinke commented May 7, 2024

Oh, that might well have been my bad, probably from when core.thread was split up into a package, with a bigger refactoring.

JohanEngelen changed the title from "Disable crashing core.thread.fiber unittest for AArch64." to "Fix crashing core.thread.fiber unittest for AArch64." on May 7, 2024
@dnadlinger
Member

> This appears not to be the case: locks[idx] = false; --> store i8 0, ptr %2, align 1 Just a normal store. locks[idx].atomicStore(false); --> store atomic i8 0, ptr %2 seq_cst, align 1
>
> However, the latter does not fix the crashes.

This might be worth following up on: personally, I think default read/write operations on shared data like this are a design mistake, but IIRC the spec did say something about accesses being atomic.

@dnadlinger
Member

dnadlinger commented May 7, 2024

> Oh, that might well have been my bad, probably from when core.thread was split up into a package, with a bigger refactoring.

Sure, no worries – and in any case, seeing as Johan added it to Fiber.getThis(), not ThreadBase.getThis(), the surrounding code might have changed in a way requiring a careful review anyway.

In either case, an updated comment should probably mention the strategy here: We aim to do the minimum necessary such that druntime task switching itself doesn't break because of TLS caching (as verified by the unit tests), but users are on their own as far as their code is concerned (hence the scary comments).

```diff
@@ -2320,7 +2323,7 @@ unittest
         fibs[idx].call();
         cont |= fibs[idx].state != Fiber.State.TERM;
     }
-    locks[idx] = false;
+    locks[idx].atomicStore(false);
```
Member

Could make this release for clarity/to maybe trigger more threading bugs.
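
A minimal sketch of the suggestion (release ordering instead of atomicStore's seq_cst default; the names are made up):

```d
import core.atomic : atomicStore, MemoryOrder;

shared bool[4] locks; // hypothetical stand-in for the test's lock array

void unlock(size_t idx)
{
    // Release ordering: prior writes become visible before the lock is cleared.
    atomicStore!(MemoryOrder.rel)(locks[idx], false);
}
```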

kinke enabled auto-merge (squash) May 10, 2024 20:36
@kinke
Member

kinke commented May 10, 2024

I've restored David's comment and am merging this now, as the now-native macOS arm64 job from the GHA main workflow needs it. [I've left the atomic-store memory order alone, as the prior CAS uses the default too.]

kinke merged commit fa4f032 into ldc-developers:master on May 10, 2024
23 checks passed
JohanEngelen deleted the aarch64_fiber branch May 16, 2024 16:41