Performance of misaligned loads #1611
You can do much better than that 22-instruction sequence, both in terms of instruction count and register pressure: https://godbolt.org/z/jdPssx8WM
The spec's vagueness is deliberate, as it offers implementation flexibility. As for big-endian loads, assume all general-purpose implementations will offer the Zbb extension. Only educational implementations, or highly specialized ones for which the
As you point out, there isn't really anything actionable here, so I'm closing the issue.
Strictly speaking, this snippet contains UB since you read bits outside of the allocated object. The only proper way to use this approach is through inline assembly. And it's still 11 instructions and one additional register.
Can I assume the same for Zbkb (IIUC it's a subset of Zbb)? Or alternatively, is it reasonable to assume that Zbkb is always available if Zk/Zkn is present?
My actionable request is to introduce an extension with explicit unaligned load/store operations with guaranteed "reasonable" performance. Zicclsm looks borderline useless with the current wording and, as we can see, it currently does not influence the compiler's codegen.
Zbkb has pack instructions that are not in Zbb. Which is unfortunate since they would improve some unaligned access sequences.
Compiler codegen can be controlled with -mno-scalar-strict-align on clang 18. Not sure if it made it to a gcc release yet. -mno-strict-align might work on older gcc.
Well, obviously... the C code was for illustrative purposes.
Not necessarily, as it isn't a proper subset, but either way, it's probably best to assume
The base ISA spec already says, in so many words, that there is no appetite for that approach. riscv-isa-manual/src/rv32.adoc Line 740 in 2eac83e
Your best bet is to follow @topperc's recommendation, and/or follow in the footsteps of the glibc folks, who employ a runtime check to determine whether misaligned loads and stores are fast: https://sourceware.org/pipermail/libc-alpha/2023-February/145343.html
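A minimal sketch of what such a runtime dispatch could look like in C, assuming the caller already has some way of learning whether misaligned accesses are fast. The names `load64_fn`, `load64_direct`, `load64_bytewise` and `select_load64` are hypothetical and are not taken from glibc; the probe itself is left out and modelled as a plain flag.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical loader signature: read 8 bytes from a possibly misaligned
   address and return them as a native-endian 64-bit value. */
typedef uint64_t (*load64_fn)(const unsigned char *p);

/* Path intended for hardware with fast misaligned accesses. With a flag
   such as -mno-strict-align a compiler may lower this memcpy to a single
   load; that lowering is an assumption about the toolchain, not a
   guarantee. */
static uint64_t load64_direct(const unsigned char *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Conservative path: copy one byte at a time, so no misaligned wide
   access is ever required. */
static uint64_t load64_bytewise(const unsigned char *p)
{
    uint64_t v;
    unsigned char *d = (unsigned char *)&v;
    for (size_t i = 0; i < sizeof v; i++)
        d[i] = p[i];
    return v;
}

/* Pick the loader once, outside any hot loop. `misaligned_is_fast` stands
   in for the result of a platform-specific probe (assumption). */
static load64_fn select_load64(int misaligned_is_fast)
{
    return misaligned_is_fast ? load64_direct : load64_bytewise;
}
```

Selecting the function pointer once and calling it inside the loop keeps the check itself out of the hot path, although the indirect call still has a cost, which is part of the objection raised in the next comment.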
Sigh... I guess I have no choice but to pile a bunch of hacks to work around this...
/start-of-rant From the software developer's perspective, the sheer number of "may"s in those many words is incredibly annoying (e.g. see this comment). The lack of exact guarantees and the vagueness may be really nice for hardware developers, but they make it really hard to write portable, performant code. The glibc stuff looks like yet another hack to work around the ISA vagueness, and it's not applicable to the hot loops common in cryptographic code. In my opinion, RISC-V got itself into a weird middle ground, which is the worst of both worlds. Software developers can neither rely on reasonable performance of misaligned loads/stores like on x86, nor use explicit misaligned loads/stores (again with reasonable performance) like on MIPS (or SSE/AVX x86) in situations where the need for them naturally arises. /end-of-rant
You can get down to a 9 instruction sequence using just RV64I:
andi a1, a0, -8 # round down
slli a0, a0, 3 # offset=addr*8.
ld a2, 0(a1) # load left
ld a1, 8(a1) # load right
srl a2, a2, a0 # left>>=(offset%64)
not a0, a0
slli a1, a1, 1 # right<<=1
sll a0, a1, a0 # right<<=(~offset%64)
or a0, a0, a2 # left|right
If you need to load adjacent misaligned data, as is commonly the case, then you can get down to just three additional instructions on average:
Additional instructions for unaligned load/stores would however be something to discuss in the scalar efficiency SIG, as it already includes suggestions for three source operand integer instructions: https://docs.google.com/spreadsheets/d/1dQYU7QQ-SnIoXp9vVvVjS6Jz9vGWhwmsdbEOF3JBwUg
I was thinking about something like
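For readers who prefer C, the nine-instruction RV64I sequence earlier in this comment can be modelled roughly as follows. This is only a sketch: memory is represented as an array of aligned 64-bit words (so the model itself has no out-of-bounds or aliasing problems), the word values are interpreted as little-endian, and the function name `load_u64_two_aligned` is made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of the two-aligned-loads-plus-shifts technique.
   `words` stands in for memory viewed as aligned 64-bit words, and
   `byte_addr` is the (possibly misaligned) byte address to load from.
   The caller must guarantee that words[byte_addr / 8 + 1] exists, which
   mirrors the fact that the assembly always performs the second load. */
static uint64_t load_u64_two_aligned(const uint64_t *words, size_t byte_addr)
{
    size_t   idx = byte_addr / 8;                    /* andi a1, a0, -8       */
    unsigned off = (unsigned)(byte_addr % 8) * 8;    /* slli a0, a0, 3 (mod 64) */
    uint64_t left  = words[idx];                     /* ld a2, 0(a1)          */
    uint64_t right = words[idx + 1];                 /* ld a1, 8(a1)          */

    left >>= off;                                    /* srl a2, a2, a0        */
    /* Shifting by 1 and then by (~off % 64) == 63 - off gives a total
       shift of 64 - off without ever shifting by 64, so off == 0
       correctly contributes nothing from `right`. */
    right = (right << 1) << (63u - off);             /* not, slli, sll        */
    return left | right;                             /* or a0, a0, a2         */
}
```

With words `{0x0807060504030201, 0x100F0E0D0C0B0A09}` and `byte_addr = 3`, the low five bytes come from the first word and the top three from the second.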
@camel-cdr Either way, all such hacks not only use more registers, but are also blatantly slower on properly aligned data, which is quite common in practice, because the code requires additional loads and ALU work. At this point, I think I will simply use
A bit of context: I was working on scalar crypto extension support for the RustCrypto project (see RustCrypto/hashes#614) and was quite disappointed with the generated code.
Here is a simple piece of code which performs an unaligned load of a 64-bit integer: https://rust.godbolt.org/z/bM5rG6zds It compiles down to 22 interdependent instructions (i.e. there is not much opportunity for the CPU to execute them in parallel) and creates a fair bit of register pressure! It becomes even worse when we try to load big-endian integers, an unfortunately common occurrence in cryptographic code, without the Zbkb extension: https://rust.godbolt.org/z/TndWTK3zh
The LD instruction theoretically allows unaligned loads, but the reference is disappointingly vague about it. Behavior can range from full hardware support, through extremely slow emulation (IIUC slower than executing the 22 instructions), to a fatal trap, so portable code simply cannot rely on it.
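The linked snippets are in Rust and are not reproduced here. As a rough C analogue of the kind of code under discussion, the sketch below assembles a 64-bit value from individual bytes, which is essentially what a compiler has to emit when it cannot assume that misaligned loads are supported. The function names `load_le64` and `load_be64` are hypothetical.

```c
#include <stdint.h>

/* Little-endian load from a possibly misaligned pointer: eight byte
   loads combined with shifts and ORs. */
static uint64_t load_le64(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v |= (uint64_t)p[i] << (8 * i);
    return v;
}

/* Big-endian load: same idea with the byte order reversed. Without a
   byte-reverse instruction, this byte-wise form is what remains. */
static uint64_t load_be64(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}
```

Both functions are fully defined C for any alignment of `p`, at the cost of the long dependent instruction chain described above.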
There is the Zicclsm extension, but the profiles spec is again quite vague:
It's probably why enabling Zicclsm has no influence on the snippet codegen.
Finally, my questions: is it indeed true that the 22-instruction sequence is "the way" to perform potentially misaligned 64-bit loads? Why did RISC-V not introduce explicit instructions for misaligned loads/stores in one of its extensions, similar to the MOVUPS instruction on x86?
I know that it's far too late to change things, but, personally, I would've preferred it if the spec were stricter in this regard. For example, it could've mandated an unconditional fatal trap for unaligned load/store instructions in the base set and introduced explicit unaligned load/store instructions in a separate extension.
UPD: In the comments of the reddit discussion it is mentioned that MIPS had a patent on unaligned load and store instructions, but it expired in 2019.