diff --git a/design-documents/bit-precise-types.rst b/design-documents/bit-precise-types.rst
new file mode 100644
index 00000000..4743daff
--- /dev/null
+++ b/design-documents/bit-precise-types.rst
@@ -0,0 +1,506 @@
+..
+   Copyright (c) 2023, Arm Limited and its affiliates. All rights reserved.
+   CC-BY-SA-4.0 AND Apache-Patent-License
+   See LICENSE file for details
+
+Rationale Document for ABI related to the C23 _BitInt type.
+************************************************************
+
+Preamble
+========
+
+Background
+----------
+
+This document describes the rationale behind the ABI choices made for using the
+bit-precise integral types defined in C23. These are ``_BitInt(N)`` and
+``unsigned _BitInt(N)``. These are defined for integral ``N`` and each ``N`` is
+a different type.
+
+The proposal for these types can be found at the following link:
+https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2763.pdf
+
+As the rationale in that proposal mentions, some applications have uses for a
+specific bit-width type. In the case of writing C code which can be used to
+describe FPGA hardware these specific bit-width types can lead to large
+performance and space savings.
+
+From the perspective of the Arm ABI we have some trade-offs and decisions to
+make:
+
+- We need to choose a representation for these objects in registers.
+- We need to choose a representation, size and alignment of these objects in memory.
+
+The main trade-offs we have identified in this case are:
+
+- Performance of different C-level operations.
+- Whether certain hardware-level atomic operations are possible.
+- Size cost of storing values in memory.
+- General familiarity of programmers with the representation.
+
+Since this is a new type there is large uncertainty about how it will be used by
+programmers in the future. Decisions we make here may also influence future
+usage. Nonetheless we must make trade-off decisions under this uncertainty. The
+below attempts to analyze possible use-cases to make our best guess as to how
+these types may be used when targeting Arm CPUs.
+
+
+Use-cases known of so far
+-------------------------
+
+There seem to be two different regimes for these types: the "small" regime,
+where bit-precise types could be stored in a single general-purpose register,
+and the "large" regime, where bit-precise types must span multiple
+general-purpose registers.
+
+Here we discuss the use-cases for bit-precise integer types that we have
+identified or been alerted to so far.
+
+
+C code to describe FPGA behavior
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A major motivating use-case for this new type is to aid writing C code which
+describes the desired behavior of an FPGA. Without the availability of the new
+``_BitInt`` type such C code would semantically have much wider types than
+necessary when performing operations, especially given that operations on small
+integral types promote their operands to ``int``.
+
+If these wider than necessary operations end up in the FPGA they would use many
+more logic gates than necessary. Using ``_BitInt`` allows programmers to write
+code which directly expresses what is needed. This can ensure the FPGA
+description generated saves space and has better performance.
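+
+As a minimal sketch of the promotion point above (the function names and the
+choice of a 4-bit width are illustrative, not part of any proposal):
+
+.. code:: c
+
+   // With a small standard type both operands are promoted to int, so the
+   // addition is semantically a 32-bit operation, truncated on conversion
+   // back to unsigned char.
+   unsigned char add_u8(unsigned char a, unsigned char b) {
+       return a + b;
+   }
+
+   // With a bit-precise type there is no promotion to int: the addition is
+   // semantically a 4-bit operation, which is what an FPGA flow wants to see.
+   unsigned _BitInt(4) add_u4(unsigned _BitInt(4) a, unsigned _BitInt(4) b) {
+       return a + b;  // wraps modulo 2**4; no wider intermediate is implied
+   }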
+
+The notable thing about this use-case is that though the C code may be run on an
+Arm architecture (e.g. for testing), the most critical use is when transferred
+to an FPGA (i.e. not an Arm architecture).
+
+That said, if the operation that this FPGA performs becomes popular there may be
+a need to run the code directly on CPUs in the future.
+
+The requirements on Arm ABIs from this use-case are relatively small since the
+main focus is around running on an FPGA. We believe it adds weight to both the
+need for performance and familiarity of programmers. This belief comes from the
+estimate that this may lead to bit-precise types being used in performance
+critical code in the future, and that it may mean that bit-precise types are
+used on Arm architectures when testing FPGA descriptions (where ease of
+debugging can be prioritized).
+
+
+24-bit Color
+~~~~~~~~~~~~~
+
+Some image file-types use 24-bit color. The new ``_BitInt`` type may be used to
+hold such information.
+
+As it stands we do not know of any specific reason to use a bit-precise integral
+type as opposed to a structure of three bytes for these data types.
+
+If used for 24-bit color we believe that the performance of standard arithmetic
+operations would not be critical. This is because each 24-bit pixel usually
+represents three 8-bit color channels, so operations are unlikely to be
+performed on the single value as a whole.
+
+We also believe that if used for 24-bit color it would be helpful to specify a
+size and alignment scheme such that an array of ``_BitInt(24)`` is well packed.
+
+
+Networking Protocols
+~~~~~~~~~~~~~~~~~~~~
+
+Many networking protocols have packed structures in order to minimize data sent
+over the wire. In order to be perfectly packed the code will need to use
+bit-fields rather than bit-precise types for storage, since bit-precise types
+must be addressable and hence at least byte-aligned.
+
+The incentive to use bit-precise integral types for networking code would be to
+maintain the best representation of the operation that is being performed.
+
+One negative of using bit-precise integral types for networking code would be
+that idioms like ``if (x + y > max_representable)``, where ``x`` and ``y`` have
+been loaded from small bit-fields, would no longer be viable. We have seen such
+idioms for small values in networking code in the Linux kernel. These are
+intuitive to write, but if ``x`` and ``y`` were bit-precise types they would not
+work as expected (see the sketch at the end of this section).
+
+If used in code handling networking protocols, our estimate is that the
+arithmetic manipulation performed on such values will not be the main
+performance bottleneck. This estimation comes from the belief that networking is
+often I/O bound, and that small packed values in networking protocols tend to
+have limited arithmetic performed on them.
+
+Hence we believe that ease of debugging of values in registers may be more
+critical than performance concerns in this use-case.
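+
+The following hedged sketch shows why the idiom breaks (the 4-bit width and the
+function names are hypothetical):
+
+.. code:: c
+
+   // With ordinary small integers the operands of + are promoted to int, so
+   // the sum can exceed the 4-bit maximum and the range check works.
+   int would_overflow_uchar(unsigned char x, unsigned char y) {
+       return x + y > 15;  /* computed as int: x + y can be up to 510 */
+   }
+
+   // With unsigned _BitInt(4) there is no promotion: the addition is reduced
+   // modulo 16 first, so this comparison can never be true.
+   int would_overflow_bitint(unsigned _BitInt(4) x, unsigned _BitInt(4) y) {
+       return x + y > 15;  /* always 0: x + y is already in [0, 15] */
+   }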
+
+
+To help the compiler optimize (e.g. for auto-vectorization)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The behavior that bit-precise types do not automatically promote to an ``int``
+during operations could remove some casts which are necessary for C semantics
+but can obscure the intention of a user's code. One place this may help is in
+auto-vectorization, where the compiler must be able to see through intermediate
+casts in order to identify the operations being performed.
+
+The incentive for this use-case is an increased likelihood of the compiler
+generating optimal auto-vectorized code.
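+
+A minimal sketch of the kind of loop this use-case has in mind (the function
+names are illustrative):
+
+.. code:: c
+
+   #include <stdint.h>
+
+   // The promotion to int and the truncating conversion back to uint8_t are
+   // both part of the C semantics here; the vectorizer must look through them
+   // to prove it can use 8-bit vector lanes.
+   void add_arrays_u8(uint8_t *restrict d, const uint8_t *restrict a,
+                      const uint8_t *restrict b, int n) {
+       for (int i = 0; i < n; i++)
+           d[i] = (uint8_t)(a[i] + b[i]);
+   }
+
+   // With a bit-precise type the addition is an 8-bit operation as written,
+   // so no promotion or truncation appears in the program's semantics.
+   void add_arrays_b8(unsigned _BitInt(8) *restrict d,
+                      const unsigned _BitInt(8) *restrict a,
+                      const unsigned _BitInt(8) *restrict b, int n) {
+       for (int i = 0; i < n; i++)
+           d[i] = a[i] + b[i];
+   }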
+
+Points which might imply less take-up of this use-case are that the option to
+use compiler intrinsics is there for programmers who want to put in extra
+effort to ensure good vectorization of a loop. This means that using
+bit-precise types would be a mid-range option providing less-guaranteed codegen
+improvement for less effort.
+
+The ABI should not have much of an effect on this use-case directly, since the
+optimization would be done in the target-independent part of compilers and the
+eventual operations in auto-vectorized code would be acting on vector machine
+types.
+
+That said, bit-precise types would also be used in the surrounding code. Given
+that in this use-case these types are added for performance reasons it seems
+reasonable to guess that this concern around performance would apply to the
+surrounding code as well. Hence it seems that this use-case would benefit from
+decisions which favor performance.
+
+In this use-case the programmer would be converting a codebase using either
+8-bit or 16-bit integers to a bit-precise type of the same size. Such a
+codebase may include calls to variadic functions (like ``printf``) in
+surrounding code. Variadic functions like this may be missed when changing
+types in a codebase, so it would be helpful if the machine representation of the
+bit-precise types passed matched that of the relevant standard integral types in
+order to avoid extra difficulties during the conversion. The C semantics require
+that variadic arguments undergo standard integral promotions. While ``int8_t``
+and the like undergo integral promotion, ``_BitInt`` does not. Hence this
+use-case would benefit from having the representation of ``_BitInt(8)`` in the
+PCS match that of ``int`` and similar for the 16-bit and unsigned variants
+(which implies having them sign- or zero-extended).
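+
+A hedged sketch of the hazard described above (the function name is
+hypothetical, and whether the first call misbehaves in practice depends on the
+representation the ABI chooses):
+
+.. code:: c
+
+   #include <stdio.h>
+
+   void report(signed _BitInt(8) t) {
+       // Strictly this call is undefined: %d expects an int, and a
+       // _BitInt(8) is not promoted. In practice it can only work if the
+       // PCS passes _BitInt(8) sign-extended, i.e. looking like an int.
+       printf("t = %d\n", t);
+
+       // The portable form, which a conversion pass may forget to add:
+       printf("t = %d\n", (int)t);
+   }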
+
+One further point around this use-case is that decisions which do not affect
+8-bit and 16-bit types would not affect this use-case.
+
+
+For representing cryptography algorithms
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Many cryptography algorithms perform operations on large objects. It seems
+that using a ``_BitInt(128)`` or ``_BitInt(256)`` could express cryptographic
+algorithms more concisely.
+
+For symmetric algorithms the existing block cipher and hash algorithms do not
+tend to operate on chunks of this size as single integers. This seems like it
+will remain the case due to CPU limitations and a desire to understand the
+performance characteristics of written algorithms.
+
+For asymmetric algorithms something like elliptic curve cryptography seems like
+it could gain readability from using the new bit-precise types. However there
+would likely be concern around whether code generated from using these types is
+guaranteed to use constant-time operations.
+
+This use-case would only be using "large" bit-precise types. Moreover all
+relevant sizes are powers of two.
+
+
+Translating some more esoteric languages to C
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+At the moment there exist some high-level languages which support arbitrary
+bit-width integers. Translating such languages to C would benefit from the new
+C type.
+
+We do not know of any specific use-case within these languages other than for
+cryptography algorithms as above. Hence the trade-offs in this space are
+assumed to be based on the trade-offs from the cryptography use-case above.
+
+We estimate the use of translating a more esoteric language to C to be less
+common than writing code directly in C. Hence the weighting of this use-case in
+our trade-offs is correspondingly lower than others.
+
+
+Possible transparent BigNum libraries
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We have heard of interest in using the new bit-precise integer types to
+implement transparent BigNum libraries in C.
+
+Such a use-case unfortunately does not directly indicate what kind of code
+would be using it (e.g. would this be algorithmic code or I/O bound code).
+Given the mention of 512x512 matrices in the comment where we heard of this, we
+assume that in general such a library would be CPU-bound code.
+
+Hence we assume that the main consideration here would be performance.
+
+
+Summary of use-case trade-offs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In our estimation, the C to FPGA use-case seems to be the most promising. We
+estimate that use in this space will make up the great majority of the use of
+this new type.
+
+Uses for cryptography, networking, and helping the compiler optimize certain
+code seem large enough to consider but not as widespread.
+
+For the C to FPGA use-case, the majority of the use is not expected to be seen
+on Arm architectures. For helping the compiler optimize code we expect to only
+see bit-precise types with sizes matching those of standard integral types.
+Cryptographic uses are only expected on "large" sizes which are powers of two.
+Networking uses are likely to be using bit-fields for in-memory representations.
+
+All use-cases would have concerns around performance and the familiarity of
+representations. There does not seem to be a clear choice to prefer one or the
+other.
+
+
+Alignment and sizes
+-------------------
+
+Options and their trade-offs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+These types must be at least byte-aligned so they are addressable, and their
+size must be rounded up to at least a byte boundary for ``sizeof``.
+
+"Small" regime
+//////////////
+For the "small" regime there are two obvious options:
+
+A. Byte alignment.
+B. Alignment and size "as if" stored in the next-largest Fundamental Data Type.
+   (Where the Fundamental Data Types are defined in the relevant PCS documents).
+
+Option ``A`` has the following benefit:
+
+- Better packing in an array of ``_BitInt(24)`` than an array of ``int32_t``.
+  This is more relevant for bit-precise types than others since these types have
+  an aesthetic similarity to bit-fields and hence programmers might expect good
+  packing.
+
+Option ``B`` has the following benefits (all following from the alignment being
+greater than or equal to the size of the object in memory):
+
+- Avoids a performance hit, since loads and stores of these "small" sized
+  ``_BitInt`` values will not cross cache-line boundaries.
+- Atomic loads and stores can be made on these objects.
+- Bit-precise types of the same size as standard integer types will have the
+  same alignment and size in memory.
+
+In the use-cases we have identified above we did not notice any special need for
+tight packing. All of the use-cases we identified would benefit from better
+performance characteristics, and the use-case of helping the compiler optimize
+some code would benefit greatly from ``_BitInt(8)`` having the same alignment
+and size as an ``int8_t``.
+
+Hence for "small" sizes we are choosing to define a ``_BitInt(N)`` size and
+alignment according to the smallest Fundamental Data Type which has a bit-size
+greater than or equal to ``N``. Similarly for the ``unsigned`` versions.
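+
+As an illustration of the chosen rule on AArch64, where the relevant
+Fundamental Data Types are 1, 2, 4 and 8 bytes, a sketch of the expected
+layout:
+
+.. code:: c
+
+   // Size and alignment follow the smallest Fundamental Data Type with at
+   // least N bits.
+   _Static_assert(sizeof(_BitInt(7))  == 1 && _Alignof(_BitInt(7))  == 1, "");
+   _Static_assert(sizeof(_BitInt(15)) == 2 && _Alignof(_BitInt(15)) == 2, "");
+   _Static_assert(sizeof(_BitInt(24)) == 4 && _Alignof(_BitInt(24)) == 4, "");
+   _Static_assert(sizeof(_BitInt(33)) == 8 && _Alignof(_BitInt(33)) == 8, "");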
+
+
+"Large" regime
+//////////////
+For "large" sizes the only approach considered has been to treat these
+bit-precise types as an array of ``M`` sized chunks, for some ``M``.
+
+There are two obvious choices for ``M``:
+
+A. Register sized.
+B. Double-register sized.
+
+Option ``A`` has the following benefits:
+
+- This would mean that the alignment of a ``_BitInt(128)`` on AArch64 matches
+  that of other architectures which have already defined their ABI. This could
+  reduce surprises when writing portable code.
+- Less space used for half of the values of ``N``.
+- Multiplications on large ``_BitInt(N)`` can be logically done on the limbs of
+  size ``M``, which should result in a neater compiler implementation. E.g.
+  for AArch64 there is an ``SMULH`` instruction which could be used as part of a
+  multiplication on an entire limb.
+
+Option ``B`` has the following benefits:
+
+- It would allow atomic operations on types in the range between register
+  and double-register sizes.
+  This is due to the associated extra alignment allowing operations like
+  ``CASP`` on AArch64 and ``LDRD`` on AArch32. Similarly this would allow
+  ``LDP`` and ``STP`` single-copy atomicity on architectures with the LSE2
+  extension.
+- On AArch32 a ``_BitInt(64)`` would have the same alignment and size as an
+  ``int64_t``, and on AArch64 a ``_BitInt(128)`` would have the same alignment
+  and size as a ``__int128``.
+- Double-register sized integers match the largest Fundamental Data Types
+  defined in the relevant PCS documents for both platforms. We believe that
+  developers familiar with the AArch64 ABI would find this mapping less
+  surprising and hence make fewer mistakes. This also includes those working at
+  FFI boundaries interfacing to the C ABI.
+
+The "large" size use-cases we have identified so far are of power-of-two sizes.
+These sizes would not benefit greatly from the positives of either of the
+options presented here, with the only difference being around the implementation
+of multiplication.
+
+Our estimate is that the benefits of option ``B`` are more useful for sizes
+between register and double-register than those from option ``A``. This is not
+considered a clear-cut choice, with the main point in favour of option ``A``
+being a smaller difference from other architectures' psABI choices.
+
+Other variants are available, such as choosing alignment and size based on
+register sized chunks except for the special case of the double-register sized
+``_BitInt``. Though such variants can provide a good combination of the
+properties above, we judge them to have an extra complexity of definition and an
+associated increased likelihood of mistakes when developers' code relies on ABI
+choices.
+
+Based on the above reasoning, we would choose to define the size and alignment
+of ``_BitInt(N > [register-size])`` types by treating them "as if" they are an
+array of double-register sized Fundamental Data Types.
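+
+A sketch of the consequences of this choice on AArch64, where the
+double-register Fundamental Data Type is the 16-byte quadword:
+
+.. code:: c
+
+   // A large _BitInt behaves like an array of 16-byte chunks: its size
+   // rounds up to a multiple of 16 bytes and its alignment is 16.
+   _Static_assert(sizeof(_BitInt(65))  == 16 && _Alignof(_BitInt(65))  == 16, "");
+   _Static_assert(sizeof(_BitInt(128)) == 16 && _Alignof(_BitInt(128)) == 16, "");
+   _Static_assert(sizeof(_BitInt(129)) == 32 && _Alignof(_BitInt(129)) == 16, "");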
+
+Representation in bits
+----------------------
+
+There are two decisions around the representation of a "small" ``_BitInt`` that
+we have identified: (1) whether required bits are stored in the least
+significant end or most significant end of a register or region in memory, and
+(2) whether the "remaining" bits after rounding up to the size specified in
+`Alignment and sizes`_ are specified or not. The choice of *how* "remaining"
+bits would be specified would tie in to the choice made for (1).
+
+
+Options and their trade-offs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We have identified three viable options:
+
+A. Required bits stored in most significant end.
+   Not-required bits are specified as zero at ABI boundaries.
+B. Required bits stored in least significant end.
+   Not-required bits are unspecified at ABI boundaries.
+C. Required bits stored in least significant end.
+   Not-required bits are specified as zero- or sign-extended.
+
+While it would be possible to make different requirements for bit-precise
+integer types in memory vs in registers, we believe that the combined negatives
+of that choice are reason enough to not look into the option. These negatives
+are that code would have to perform a transformation on loading and storing
+values, and that different representations in memory and registers are likely to
+cause programmer confusion.
+
+Similarly, it would be possible to define a representation in registers that
+does something like specifying bits ``[2-7]`` of a ``_BitInt(2)`` but leaves
+bits ``[8-63]`` unspecified. This would seem to choose the worst of both worlds
+in terms of performance, since one must both ensure "overflow" from an addition
+of ``_BitInt(2)`` types does not affect the specified bits **and** ensure that
+the unspecified bits above bit number 7 do not affect multiplication or division
+operations. Hence we do not look at variations of this kind.
+
+For option ``A`` there is an extra choice around how "large" values are stored.
+One could either have the "padding" bits in the least significant "chunk", or
+the most significant "chunk". Having these padding bits in the least
+significant chunk would mean that something like a widening cast would require
+updating every "chunk" in memory, hence we assume large values of option ``A``
+would be represented with the padding bits in the most significant chunk.
+
+Option ``A`` has the following benefits:
+
+- For small values in memory, on AArch64, operations like ``LDADD`` and
+  ``LD{S,U}MAX`` both work (assuming the relevant register operand is
+  appropriately shifted).
+
+- Operations ``+,-,%,==,<=,>=,<,>,<<`` all work without any extra instructions
+  (which covers more of the common operations than the other representations).
+
+It has the following negatives:
+
+- This would be a less familiar representation to programmers. Especially the
+  fact that a ``_BitInt(8)`` would not have the same representation in a
+  register as a ``char`` could cause confusion (e.g. when debugging, or writing
+  assembly code). This would likely be increased if other architectures that
+  programmers may use have a more familiar representation.
+
+- Operations ``*,/``, saving and loading values to memory, and casting to
+  another type would all incur extra cost.
+
+- Operations ``+,-`` on "large" values (greater than one register) would require
+  an extra instruction to "normalize" the carry-bit.
+
+- If used in calls to variadic functions which were written for standard
+  integral types this can give surprising results.
+
+Option ``B`` has the following benefits:
+
+- For small values in memory, the AArch64 ``LDADD`` operations work naturally.
+
+- Operations ``+,-,*,<<``, narrowing conversions, and loading/storing to memory
+  would all naturally work.
+
+- On AArch64 this would most likely match the expectation of developers, and
+  e.g. a ``_BitInt(8)`` would have the same representation as a ``char`` in
+  registers.
+
+It has the following negatives:
+
+- The AArch64 ``LD{S,U}MAX`` operations would not work naturally on small values
+  of this representation.
+
+- Operations ``/,%,==,<,>,<=,>=,>>`` and widening conversions on operands coming
+  from an ABI boundary would require masking the operands (see the sketch at the
+  end of this document).
+
+- On AArch32 this could cause surprises to developers, given that on this
+  architecture small Fundamental Data Types have zero- or sign-extended extra
+  bits. So a ``char`` would not have the same representation as a
+  ``_BitInt(8)`` on this architecture.
+
+- If used in calls to variadic functions which were written for standard
+  integral types this can give surprising results.
+
+Option ``C`` has the following benefits:
+
+- For small values in memory, the AArch64 ``LD{S,U}MAX`` operations work
+  naturally.
+
+- Operations ``==,<,<=,>=,>,>>``, widening conversions, and loading/storing to
+  memory would all naturally work.
+
+- On AArch32 this could match the expectation of developers, with a
+  ``_BitInt(8)`` in a register matching the representation of a ``char``.
+
+- If used in variadic function calls, mismatches between ``_BitInt`` types and
+  standard integral types would not cause as much of a problem.
+
+It has the following negatives:
+
+- The AArch64 ``LDADD`` operations would not work naturally.
+
+- Operations ``+,-,*,<<`` would all require masking at an ABI boundary.
+
+- On AArch64 this would not match the expectation of developers, with
+  ``_BitInt(8)`` not matching the representation of a ``char``.
+
+Summary, suggestion, and reasoning
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Overall it seems that option ``A`` is more performant for operations on small
+values. However, when acting on "large" values (i.e. greater than the size of
+one register) it loses some of that benefit. Storing to and from memory would
+also come at a cost for this representation. This is also likely to be the most
+surprising representation for developers on an Arm platform.
+
+Between option ``B`` and option ``C`` there is not a great difference in
+performance characteristics. However it should be noted that option ``C`` is
+the most natural extension of the AArch32 PCS rules for unspecified bits in a
+register containing a small Fundamental Data Type, while option ``B`` is the
+most natural extension of the similar rules in the AArch64 PCS. Furthermore,
+option ``C`` would mean that accidental misuse of a bit-precise type instead of
+a standard integral type should not cause problems, while ``B`` could give
+strange values. This would be most visible with variadic functions.
+
+As mentioned above, both performance concerns and a familiar representation are
+valuable in the use-cases that we have identified. This has made the decision
+non-obvious. We have chosen to favor representation familiarity.
+
+Choosing between ``C`` and ``B`` is also non-obvious. It seems relatively clear
+to choose option ``C`` for AArch32. We choose option ``B`` for AArch64 to
+prefer that across most ABI boundaries a ``char`` and a ``_BitInt(8)`` have the
+same representation, but acknowledge that this could cause surprise to
+programmers when using variadic functions.
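+
+As a closing illustration of what option ``B`` implies for AArch64, a hedged
+sketch of the extension a compiler must perform before comparing values that
+arrive across an ABI boundary (the function name is hypothetical, and the
+shifts assume the usual two's-complement arithmetic right shift):
+
+.. code:: c
+
+   #include <stdint.h>
+
+   // Under option B the bits above bit 6 of a register holding a signed
+   // _BitInt(7) are unspecified at an ABI boundary, so a comparison must
+   // first extend the required bits. Conceptually:
+   int less_than_bitint7(uint64_t raw_a, uint64_t raw_b) {
+       int64_t a = (int64_t)(raw_a << 57) >> 57;  /* sign-extend bit 6 */
+       int64_t b = (int64_t)(raw_b << 57) >> 57;  /* (an SBFX on AArch64) */
+       return a < b;
+   }
+
+   // Under option C the extra bits would already be sign-extended, so the
+   // comparison could act on the raw register values directly.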