-
Notifications
You must be signed in to change notification settings - Fork 17.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
x/sys/cpu: add support for ARM #33508
Comments
x/sys/cpu has still not been totally updated to match internal/cpu, which has some of the feature flags you are looking for. I will defer to @martisch's judgement on that. |
Note that changing internal/cpu seems a separate topic (it also already supports arm) and there are no plans to export those variables outside the runtime/standard library. If x/sys/cpu support for linux/arm is enough (not *BSD or windows/arm) then it should be as easy as implementing the hwCap/hwCap2 bit checks in https://github.com/golang/sys/blob/master/cpu/cpu_arm.go since hwCap/hwCap2 should already be populated on Linux by: https://github.com/golang/sys/blob/master/cpu/cpu_linux.go. I think its fine to expose all linux supported but general arm applicable flags in x/sys/cpu. Note that the above would not support android as /proc/self/auxv might not be accessible and it wont change the runtime and standard libs detection of arm features (since that uses internal/cpu and not x/sys/cpu) and the effects of setting goarm are not effected as they override any cpu package .e.g.: go/src/runtime/os_freebsd_arm.go Line 14 in 3b6216e
Line 9 in 0df9fa2
|
Thanks for the detailed response, Martin.
That's right; I'm happy to open a new issue to discuss that separately if you wish. I would propose that the internal implementation be separate (a "copy") of the Since you mention it, I was reading the CL for #28148 (see this source file), and noticed how the use of detection of NEON could be improved by checking the hardware-caps instead of using the static (As said above, I am doing run-time detection at present by parsing
I am mostly interested in Android, so support for Linux would be a win. (Apple has moved exclusively to aarch64 for some time now.) It looks like FreeBSD has added support back in 2017 (see D12290 and D12291), so it might be possible to do something similar there also. As far as I can tell, Go bring-up for windows/arm has stalled (c.f. #26148), so that would need to wait regardless?
BoringSSL provides a good reference implementation: they weak-link |
The scope of the feature request is currently unclear to me when reading the title and first post content. From the comments I would infer it is to add arm support for android. It also does not seem restricted to x/sys/cpu but part of the proposal is to change the runtimes use of goarm with dynamic detection. As internal/cpu IRC already has support for arm the runtime could already use that in some places where goarm is used (no need to copy from x/sys/cpu). However this would be a behaviour change and some platforms would still need to use goarm (to populate the cpu variables). I think that is better discussed decoupled from any x/sys/cpu support. The Go compiler uses the goarm setting to make decisions about the emitted instructions and thereby does not produce "universal binaries". Changing this would be a larger change than x/sys/cpu support and likely have larger performance and binary size implications when switched to runtime detection: go/src/cmd/compile/internal/arm/ssa.go Line 644 in f125b32
go/src/cmd/compile/internal/ssa/regalloc.go Line 599 in 8a317eb
Seems like a good candidate for using x/sys/cpu but it seems it would need a fallback in x/sys/cpu to goarm for platforms not supporting CPU feature detection via AUXV or syscalls/proc.
The Go runtime sets hwcap on arm in internal/cpu on FreeBSD by reading the auxv thats provided after argv. I dont think x/sys/cpu has direct access to that. So I think we can not simply copy that approach to x/sys/cpu. Line 385 in a62b572
No having only partial support in x/sys/cpu is fine. However to not regress in performance for other platforms we would seem to need a fallback to setting the cpu features based on goarm on those not supporting runtime detection.
Thank you. Thats a nice reference. If we can make the same work under go then that looks like an option to gain feature detection support on android in x/sys/cpu. |
Let me be more clear... This issue is about:
This issue is not about,
however a separate issue can be created that proposes:
I understand this issue, but I would never propose it. My reference to a "universal binary" was with reference to one of my projects where I detect NEON, and take a fast-path accordingly. A true universal binary -- that is, a program supporting multiple architectures -- is often achieved though shipping separate texts for each architecture in a sandwiched "fat binary" as is possible on macOS, and only exists as a proof-of-concept on Linux. I am not proposing that here.
Yes -- and sure beats developers individually using
I have not tried it, but it appears a symbol exists on libc that gets you access to the information without having to go via auxv. |
Change https://golang.org/cl/190525 mentions this issue: |
@martisch, I've submitted a CL for this issue as described by my past post, and have tested it on linux/arm and freebsd/arm. Are you able to review it? |
Sure. Thanks for working on it. I added a first round of high level comments. There is already support for linux auxv reading in cpu_linux. We should first leverage that existing detection to add the hwcap bits to variable mapping and then I think basic linux/arm support should already work. Continuing from there we can add additional support for android to work around /proc/self/auxv not necessary being accessible to fill hwcap in x/sys/cpu. Afterwards we can extend the support for arm on other platforms and wheel in help from testers with hardware on those. What makes arm special in x/sys/cpu vs other archs is that absent hwcap detection we should fall back to the minimal set of hwcap bits set that is mandated by the goarm variable. |
As far as the API is concerned, which is what the proposal process would care about, it seems that the proposal is to add a cpu.ARM that is very similar to cpu.ARM64 with appropriate modifications for the actual CPU features that might be present. /cc @tklauser @martisch @ianlancetaylor for feedback |
For the API I think we should stay consistent with the other architectures and In general (for adding other architectures) I think if the existing naming schema is followed we do not need to have an separate proposal for each API addition for additional CPU feature structs/variables in |
@martisch, I have updated the original post to include a concrete list of fields.
I would argue that having "derived" fields, such as the proposed |
Based on the discussion, this is doing for ARM what has already been done for other architectures in the package, and there have been no objections. This sounds like a likely accept. Leaving open for a week for any objections to accepting. /cc @randall77 |
Sounds good. I agree with Martin, we should just provide the raw hardware specs and let users make their own compound flags if needed. The set needed for any particular assembly is highly dependent on what that assembly needs, so even if there are a few standard ones people will be rolling their own anyway. |
Some down sides that I think combined outweigh the usefulness of providing combined feature flags:
|
Let us try to avoid conflating logical grouping of homogeneous hardware capabilities with one that is purely arbitrary. Saying "I always support integer division" (the topic of discussion above) is very different to "I support vector processing and bit manipulation", or "I support encryption and subset X of SIMD". Let's consider for a moment how these hardware-capability flags are used in practice. An algorithm needs to run and one implementation is selected from an available set based on hardware features (e.g. The amount of "work done" is often non-trivial: otherwise the developer could just let the compiler do its best without hand-tuning the instruction text. This also means that by the time the algorithm completes, the cache has been polluted and the hardware-capability flags have most likely been evicted. Think of image processing, compression, hashing, or cipher operation: they typically operate on lots of input data. Let's say the algorithm instead relied on two hardware features. Regardless of whether the API under discussion reflected the situation in two fields or one "derived" field, the end result is the same: the cache will likely have evicted them all. Whether the fields were laid out in two or more cache lines to begin with are unlikely to impact the "setup overhead" of the algorithm as a whole. Trying to optimize the hardware-capabilities cache line in the previous discussion seems rather over the top in general. When selecting hardware-dependent code paths, it is also often used practice to memoize the result in the form of a function pointer, avoiding a conditional branch. (It often, perhaps more importantly, makes the code more readable.) The suggestion to include a derived field for ARM- and thumb-supported integer division ( Either way, I'm not wedded to including this one derived field in the proposed API change. In fact, I have no intention of even using this field at all -- my interest is in consuming |
Some of the examples given are not arbitrary but real use cases of the existing API mirrored in internal/cpu. e.g.: Line 290 in 3aa7bbd
https://github.com/golang/crypto/blob/614d502a4dac94afa3a6ce146bd1736da82514c6/chacha20poly1305/chacha20poly1305_amd64.go#L23
That is a good point that would make the problem of name collisions less likely and does set IDIV apart from other combinations. Note that as far as I perceived the discussion it was generally about "derived" fields not only about IDIV.
Then it would seem we can just leave it out for now and wait for a use case. |
This may very well be a extreme scenario and therefore should only be a secondary consideration but for something like a memcpy implementation that is frequently accessed in different parts of a programming pulling in one more cache line besides the copied data (which is often small) can be shown to matter measurably in production. Given that one can always add more but not easily remove feature variables I would still suggest to stay with a leaner set for now until its known what a good criteria for the potential inclusion of "derived" flags would be. |
Marked this last week as likely accept w/ call for last comments (#33508 (comment)). |
Now that the proposal has been accepted, I will update my PR (CL) shortly. |
The CL has been updated and split into a chain of commits. @tklauser I've invited you as a reviewer and hope that you've got a little time to look at this contribution. Thank you in advance. |
Change https://golang.org/cl/197541 mentions this issue: |
Change https://golang.org/cl/197540 mentions this issue: |
Change https://golang.org/cl/197542 mentions this issue: |
Updates golang/go#33508 Change-Id: I9ea01090f5b4ac95c1a14881c305461bd4a7b5dd Reviewed-on: https://go-review.googlesource.com/c/sys/+/190525 Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Tobias Klauser <tobias.klauser@gmail.com> Reviewed-by: Martin Möhrmann <moehrmann@google.com>
(I'm one of the BoringSSL maintainers and the author of our ARM CPU detect bits.) 32-bit ARM CPU feature detection on Linux is kind of a headache with older Androids, yeah. :-( I probably wouldn't recommend trying to detect that one broken CPU (https://golang.org/cl/197541). The Android cpu-features library doesn't do this and we only ever had issues with one function. At this point the affected CPU is rare enough that I'm hoping to just remove it soon. (Chrome for Android already requires NEON support. The workaround results in us carrying extra crypto implementations for just that CPU.) As for whether you want the the tower of /proc fallbacks, I guess it depends on what versions of Android you care about and how much you care about getting NEON on those older devices. Android L and up have I'll also note that BoringSSL only pays attention to NEON and ARMv8 crypto-related bits, so some of our fallbacks may not be a good template for the other features. In particular, I don't think the ARMv8 logic in https://golang.org/cl/197540's /proc/cpuinfo parser is quite right. |
@davidben, thanks for chiming in; some comments inline below.
That's fair enough. My interest is in NEON detection on Android. However it would be nice if we also had accurate crypto detection, so that we might later introduce accelerated TLS for arm32 in Go's stdlib. It looks like support is currently limited to arm64.
I would ideally like to target KitKat (API level 19) and up. How good is the
On the ARMv8 check, I lifted it directly from this BoringSSL implementation. If you can outline what's not quite right here, and what could be improved, I'd appreciate it. |
I don't remember off-hand when the /proc/self/auxv works vs. /proc/cpuinfo. I vaguely recall it was something weird though, where some Android version or device accidentally took away /proc/self/auxv without adding getauxval?
As I said, BoringSSL only cares about NEON and ARMv8 crypto-related bits. I expect there are other optional ARMv7 features that became mandatory in ARMv8 other than just NEON. Since BoringSSL doesn't care about any ARMv7 feature other than NEON, our code is not a good template for those features. It probably makes sense to check how the kernel actually fills in /proc/cpuinfo and review against that. |
In that case, would you recommend that if you had any fallback whatsoever, you include both
That makes sense. Looking at the Armv8 Architecture Reference Manual, in the AArch32 execution state:
As you've stated, there could be more features that are mandatory in ARMv8, but I would say that the above are the ones we would generally care about (NEON + crypto). I'll update the CL to reflect this. If I've missed a feature that you care about, please let me know. |
It would be nice to have feature detection for ARM (32-bit) in
x/sys/cpu
.Concretely, a new
cpu.ARM
struct that closely resembles the existingcpu.ARM64
struct, tailored to the ARM specific hardware capabilities. The following fields are proposed, which map directly to the {HWCAP, HWCAP2} auxiliary vector values on Linux and FreeBSD:As I look around, I see code detecting CPU features based on the
runtime.goarm
value (which is set by the GOARM environment variable at link time), rather than a runtime check. This means that:runtime.goarm
is notconst
, the fast-path (e.g. using NEON) and slow-path fallback are being compiled into the binary, but only one path can ever be used. It would be nice if both paths can be used via run-time detection instead.In one of my projects, I have resorted to parsing
/proc/cpuinfo
for run-time detection of NEON, which only works on Linux. I'd love to instead use the standard library.The text was updated successfully, but these errors were encountered: