-
Notifications
You must be signed in to change notification settings - Fork 194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding specification to handle the new C2x _BitInt type #191
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comments from Kristof Beyls gave me out-of-band.
Also added two comments that I noticed myself around improperly updating the ChangeLogs of the ABI documents.
aapcs32/aapcs32.rst
Outdated
| 2023Q1| 5\ :sup:`th` February 2022 | Document mapping from C2x _BitInt(N) and unsigned _BitInt(N) types| | ||
| | | to machine types. | | ||
+-------+-------------------------------------+-------------------------------------------------------------------+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On reviewing my own patch -- I guess it shouldn't include an update in this table since this is the table.
I'm now starting to think that maybe this table is just for changes in releases, and that should be made upon a release.
Would appreciate a confirmation/refutation on that from someone more in touch with the releases @stuij ?
Similar for the aapcs64.rst document.
aapcs64/aapcs64.rst
Outdated
| 2023Q1 | 5\ :sup:`th` | - Document mapping from C2x _BitInt(N) and unsigned _BitInt(N) | | ||
| | February 2022 | types to machine types. | | ||
+------------+--------------------+------------------------------------------------------------------+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above with the aapcs32.rst changelog.
As the rationale mentioned, some applications have uses for a specific bit-width | ||
type. In the case of writing C code which can be used to determine FPGA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: /As the rationale in that proposal mentions/
general familiarity of programmers with the representation. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: maybe it reads more easily if this were a 3-item bullet list?
For this particular type we are estimating that the use of ``_BitInt`` types | ||
will not be such that operations on these types are performance critical. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Less of a nitpick: I'm not sure we agreed that operations on these types
will never be performance critical, which the above sentence seems to
suggest?
I'd probably start by explaining the difficulty in defining ABI
rules/understanding the trade-offs as there is almost no code currently
using _BitInt, and it is next-to-impossible to predict exactly how it
will be used most often in the future.
Nonetheless, we need to make trade-off decisions.
Therefore, let's analyze some of the characteristics and features of
this type to try and make a best guess of how these types may get used
when target Arm CPUs.
(The specifics on not much code with _BitInt being performance critical
could be repeated after the below analysis of use cases?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The justifications in N2763 mention that _BitInt
to some degree exists to allow code that is used in high-level synthesis to have better performance by not having to participate in the usual arithmetic promotions, like promoting up to int
(which is about 32 bits), and only promoting up to the other _BitInt
in size (so that adding _BitInt(4)
and _BitInt(7)
gives you a _BitInt(7)
, not a 32-bit int
, which is easier to translate to an FPGA).
At larger sizes, it's more of a nicety (being able to "natively" specify "bigints" that the compiler handles for you), but an extremely appealing nicety.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just updated the rationale with extra details -- would appreciate your opinion.
Around the high-level synthesis point, my thoughts were that the resulting code would be run in the performance-critical scenario on an FPGA, and would only be run on an Arm architecture for testing or the like.
So I was thinking that the incentives on the Arm ABI coming from that use-case were not quite so focussed on performance.
The possibility that this code would end up being run "in anger" on an Arm architecture later down the line does then add some incentive towards good performance -- though not as much as if the first-order use were to be on Arm CPU's.
Does that seems reasonable from your perspective?
There seem to be two different regimes for these types. The "small" regime | ||
where bit-precise types could be stored in a single register, and the "large" | ||
regime where bit-precise types must span multiple registers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: maybe say general purpose register? Is the term "general
purpose register" used in other parts of the ABI docs?
These types must be at least byte-aligned so they are addressable, and at least | ||
rounded to a byte boundary in size for ``sizeof``. Since these types have an | ||
aesthetic similarity to bit-fields one might expect better packing in an array | ||
of ``_BitInt(24)`` than an array of ``int32_t`` types (i.e. packing as good as a | ||
byte-array). However, this would require a low alignment of such types and that | ||
would mean loading and storing of even "small" sized ``_BitInt``'s crossing | ||
cache boundaries -- leading to an unnecessary performance hit and hindering any | ||
atomic operations on these. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick on style: I'm wondering if it would be better to more explicitly
surface the pros and cons of a trade-off decision in the structure of
this document, rather than using full prose? For the example above, I'm
thinking of for example something like:
Trade-off decision 1: alignment of small-regime _BitInt
There are 2 obvious options:
option 1: 1 byte alignment
pros: list them
cons: list them
option 2: alignment rules same as 8/16/32/64-byte larger datatype (todo:
maybe improve the wording of this title/option)
pros: list them
cons: list them
Discussion of option 1 vs option 2.
Describe why we chose option 2 based on the information and estimates we
currently have available.
(I noticed you have something similar further down for "Representation
in bits". It might be good to have the same structure for analyzing
different trade-off choices to be made in this document?)
Representation in bits | ||
---------------------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the above title be "Representation in registers" instead of "...
in bits"?
As mentioned above, we do not expect operations on ``_BitInt`` types to be | ||
performance critical. Given that providing a productive environment for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, I'm not sure if we can expect _BitInt to never be used in
performance-critical code.
If people will write code targeting FPGAs with _BitInt, presumably that
code is performance-critical (why would anyone otherwise target an FPGA
for that code?). If such code becomes popular, at some point, there may
be good reasons to use the same source code to target CPUs too. In that
case, the _BitInt code would be performance critical...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps it's more coherent to say "As mentioned above, we do not expect operations on _BitInt
types at large sizes to be performance critical." since the benefits of _BitInt
allowing sidestepping the arithmetic promotions disappear for sizes larger than _BitInt(32)
.
Regarding It's been a while since I've read the AAPCS64 last so apologies if this has already been addressed and the spec as-is will guarantee that, it wasn't immediately obvious from my skimming of this PR whether this was the case. I agree that I don't think these will be performance-critical types at the larger sizes, however. |
@workingjubilee Thanks for the feedback! W.r.t. the The "performance critical" statement was a bit of a clumsy glossing-over of some analysis on my part. Cheers! |
Aha, excellent!
Ah, good, this was the part I missed. Thank you! I'll be happy to look at things when you follow up! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with choosing option B and C for AArch64/AArch32 but I'd reduce the major focus on atomics since I'd be surprised if they would ever be used - it's not even clear whether they are supported by the mid-end since bitints are like bitfields and would require extra work.
|
||
- On AArch32 this could cause surprises to developers, given that on this | ||
architecture small Fundamental Data Types are have zero- or sign-extended | ||
extra bits. So a ``char`` would not have the same representation as a | ||
``_BitInt(8)`` on this architecture. | ||
|
||
- If used in calls to variadic functions which were written for standard | ||
integral types this can give surprising results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be reasonable to normalize a bitint when passed as a variadic argument without an explicit cast to int so eg. printf works fine either way. This makes it equivalent to how a char is passed as a variadic argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, honestly I'd not seriously considered this since the way that the PCS is defined is through a two step process of
"source code type" -> "fundamental data type"
then
"fundamental data type" -> "rule for PCS passing".
Now you've pointed it out it does seem reasonable.
Not sure whether the exceptional case could lead to confusion -- will invite other feedback to see what people think.
@smithp35 @stuij @rsandifo-arm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test @rearnsha (wanted to tag you, the comment box doesn't want to auto-complete your name, don't know if typing it fully out will properly link -- this is a test).
``_BitInt``'s will not cross cache boundaries. | ||
- Atomic loads and stores can be made on these objects. | ||
- The representation of bit-precise types of the same size as standard integer | ||
types will have the same alignment and size in memory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That last point is really the most important. The first 2 points are not too convincing since unaligned accesses are practically for free on modern cores, and LSE2 allows atomics on unaligned data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooh -- thanks!
I did not realise the cost of unaligned accesses were essentially free (useful to know).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, optimization guides mention things like "The Neoverse N2 handles most unaligned accesses without performance penalties.", so you are unlikely to notice a penalty in typical compiled code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not just about the atomic RMW operations, you also (I think) want ordinary reads and writes of _BitInt variables to be single-copy-atomic and implemtable with ordinary loads and stores. This implies they should be aligned so as not to cross cache boundaries, and also not packed along with other objects (i.e. no packing a _BitInt(24) together with a _BitInt(8) in a 32-bit word).
@@ -135,16 +390,19 @@ It has the following negatives: | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, mentioning atomics as the first, most important thing does not make any sense. For large bitints you can't do atomics at all (no atomics on int arr[100] either!), small bitints up to int128 could support atomics similar to how bitfields could support atomics (typically using CAS due to needing extra normalization step). And it's not like LDSMAX/LDUMAX is a popular atomic given it's not even a supported builtin in GCC...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah -- I had not realised the order of bullet points would imply a priority order.
The atomics were not supposed to be presented as the most important reason -- will adjust.
To make sure I'm not missing something -- what extra normalization step are you referring to?
Is it the normalisation required for the set of operations which are not OK for a given representation (e.g. if choosing option B
there would be a normalisation required for /
, but not +
) or is it something else you've noticed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I mean the standard normalization. So while you can't use LDADD with option C, you can still use a CAS or load exclusive loop with an extra zero/sign-extend. Since you can do this for every atomic operation, being able to use LDADD or LDSMAX directly might be nice but shouldn't count as an advantage. Not being able to use atomics at all would be a disadvantage for a particular layout if atomic support is essential (and I'm not convinced it is for bitint).
values. However, when acting on "large" values (i.e. greater than the size of | ||
one register) it loses some of that benefit. Storing to and from memory would | ||
also come at a cost for this representation. This is also likely to be the most | ||
surprising representation for developers on an Arm platform. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note there is no reason why a compiler couldn't do register operations in format A, while passing arguments/storing as C and assuming B when loading bitints or using parameters/return values. A seems most useful as an optimized internal representation (which is entirely private) rather than as the storage format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I'll add a paragraph explicitly mentioning that the compiler could always choose to use a different representation at non ABI boundaries.
`_BitInt(N)` and `unsigned _BitInt(N)` are new integral types added for C23. These types have a bit width of `N` and define a different type for each `N`. Here we define the language mapping between these language types and the machine data types used throughout AAPCS64 through which our calling standard is defined. Along with the AAPCS32 commit, this closes ARM-software#175. The rationale for the choices in this patch are presented in a seperate commit with a rationale document. Some points of note around the language in this commit: -- This commit does not update any wording around bit-fields. The current wording under "Bit-fields subdivision" mentions that "A member of an aggregate that is a Fundamental Data Type may be subdivided into bit-fields". I do not believe this needs updating. While a _BitInt(N>128) can be subdivided into a bit-field at the *language level*, once it has been converted to a Machine Type such a subdivision would look like some number of quad-words which have not been subdivided (either fully used or fully unused) and either one or zero quad-words which have been subdivided. The alignment requirements of the _BitInt are also the same as the alignment requirements of this fundamental data type of a quad-word, which this paragraph uses to explain the resulting alignment requirements on the aggregate containing a bit-field. In the C/C++ language mapping description of "Bit-fields" we mention that a bit-field may have any integral type and since a bit-precise integer is an integral type this still holds. The explanation of where a field can be placed in this section relies on *alignment* requirements of the field type. For _BitInt this lines up with the discussion on *fundamental data types* at the machine-level, since the alignment requirements of a _BitInt are that of the "chunk" it is made up of, which is the fundamental data type of a quad-word. Hence I believe the current language does not need updating for bit-precise integers. -- _BitInt(N > 128) types and Homogeneous Aggregates. With this changeset, _BitInt(N>128) types are treated as arrays of __int128 values. Hence at the machine level they would be a Homogeneous Aggregate of quad-words. The wording in section 5.9.5 currently mentions "uniquely addressable Members". I am not sure what this is referring to, but would expect this is referring to addressable members of the base type. If that is the case then I don't believe anything needs to be updated. If it were referring to addressable members at the language level (which would be strange given the context) then this may need updating since one language-level _BitInt(256) type would not equat to one Fundamental Data Type. -- Combination of unspecified bits in _BitInt and C.16 in PCS rules. The mapping this commits defines from a _BitInt to a Machine Type specifies that the bits of the relevant Machine Type that are unused in a _BitInt(N) have unspecified value. The rules C.12 and C.16 of our parameter passing standard specify that when there are unused bits of a structure and/or Integral fundamental data type that are passed in registers, those unused bits are unspecified. The combination of these two rules means that e.g. when passing a _BitInt(2) across a PCS boundary in a register, bits [2-63] inclusive are unspecified. -- In-memory and in-register representations match. This commit only specifies the mapping from language level type to machine type. The machine type is then treated as it currently is in memory and in registers.
`_BitInt(N)` and `unsigned _BitInt(N)` are new integral types added for C23. These types have a bit width of `N` and define a different type for each `N`. Here we define the language mapping between these language types and the machine data types used throughout AAPCS32 through which our calling standard is defined. Along with the AAPCS64 commit, this closes ARM-software#175. The rationale for the choices in this patch are presented in a seperate commit with a rationale document. Some points of note around the language in this commit: -- This commit does not update any wording around bit-fields. The current wording under "Bit-fields" section 5.3.4 mentions that "A member of an aggregate that is a Fundamental Data Type may be subdivided into bit-fields". I do not believe this needs updating. While a _BitInt(N>64) can be subdivided into a bit-field at the *language level*, once it has been converted to a Machine Type, such a subdivision would look like some number of double-words which have not been subdivided (either fully used or fully unused) and either one or zero double-words which have been subdivided. The alignment requirements of the _BitInt are also the same as the alignment requirements of this fundamental data type of a double-word, which this paragraph uses to explain the resulting alignment requirements on the aggregate containing a bit-field. In the C/C++ language mapping description of "Bit-fields" we mention that a bit-field may have any integral type and since a bit-precise integer is an integral type this still holds. The explanation of where a field can be placed in this section relies on *alignment* requirements of the field type. For _BitInt this lines up with the discussion on *fundamental data types* at the machine-level, since the alignment requirements of a _BitInt are that of the "chunk" it is made up of, which is the fundamental data type of a double-word. Hence I believe the current language does not need updating for bit-precise integers. -- _BitInt(N > 64) types and Homogeneous Aggregates. With this changeset, _BitInt(N>64) types are treated as arrays of uint64_t values. Hence at the machine level they would be a Homogeneous Aggregate of double-words. -- Combination of unspecified bits in _BitInt and B.2 in PCS rules. The mapping this commits defines from a _BitInt to a Machine Type specifies that the bits of the relevant Machine Type that are unused in a _BitInt(N) have unspecified value. Rule B.2 of our parameter passing standard specifies that when there are unused bits in an integral Fundamental Data Type that is passed in registers, those unused bits are zero- or sign-extended to a full-word. The combination of this rule with the fact that _BitInt types are zero- or sign-extended to the Fundamental Data Type which they are passed in means that e.g. when passing a _BitInt(2) across a PCS boundary in a register, bits [2-63] inclusive are sign-extended. -- In-memory and in-register representations match. This commit only specifies the mapping from language level type to machine type. The machine type is then treated as it currently is in memory and in registers.
This type has been added into the C2x specification, alongside changes to describe how they are represented at a machine level we also add a design document describing the rationale behind the choices we made. One of the main contentious issues has been around the alignment of large _BitInt types. In this document we have chosen to specify that large _BitInt's are treated as if they were an array of double-register sized chunks. Other psABI's have chosen to represent large _BitInt's as an array of single-register sized chunks. This means that differences in alignment might be surprising to developers attempting to write portable code. However in contrast we believe that developers familiar with ABI's for Arm Architectures would be less surprised by this decision. N.b. I happened to also notice a stack overflow question which was suggesting the use of _BitInt(128) for u128. Given the discussion did not have any mention of the different ABI between this and __uint128 I would take it as evidence for a benefit to having the two integrals ABI match. https://stackoverflow.com/questions/16088282/is-there-a-128-bit-integer-in-gcc
Just pinging everyone that has been watching this. Otherwise could someone @rearnsha @smithp35 @stuij have a double-check that everything seems fine to them and approve this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
Use-cases known of so far | ||
------------------------- | ||
|
||
There seem to be two different regimes for these types. The "small" regime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seem to be two different regimes for these types. The "small" regime | |
There are two different regimes for these types. The "small" regime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I adjusted to "We believe there are two different regimes" (point being that we could be mistaken -- there could be more logical regimes to think about -- and want to convey that this is how we view things rather than something which we can be 100% confident on)
|
||
Since this is a new type there is large uncertainty on how it will be used by | ||
programmers in the future. Decisions we make here may also influence future | ||
usage. Nonetheless we must make trade-off decisions with this uncertainty. The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
usage. Nonetheless we must make trade-off decisions with this uncertainty. The | |
usage. Therefore, we must make trade-off decisions with this uncertainty. The |
critical than performance concerns in this use-case. | ||
|
||
|
||
To help the compiler optimize (e.g. for auto vectorization) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To help the compiler optimize (e.g. for auto vectorization) | |
To help the compiler optimize (for example, for auto vectorization) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adjusted this to (possibly for auto vectorization)
-- hope that isn't falling into another ease-of-parsing pitfall ;-)
|
||
- This would be a less familiar representation to programmers. Especially the | ||
fact that a ``_BitInt(8)`` would not have the same representation in a | ||
register as a ``char`` could cause confusion (e.g. when debugging, or writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
register as a ``char`` could cause confusion (e.g. when debugging, or writing | |
register as a ``char`` could cause confusion (For example, when debugging, or writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some reason the "for example" reads clunkier than "e.g." to me.
I went with cause confusion (we imagine when debugging or writing assembly code)
.
Hopefully that avoids the ease-of-parsing difficulty for non-native speakers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor changes - mostly phrasing. Please do avoid using latin contractions (e.g., i.e, and etc) they're not good for non-native speakers.
Noted -- thanks for the info! |
``unsigned _BitInt(N)``. These are defined for integral ``N`` and each ``N`` is | ||
a different type. | ||
|
||
The proposal for these types can be found in following link. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The proposal for these types can be found in following link. | |
The proposal for these types can be found in following link: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor changes - mostly phrasing. Please do avoid using latin contractions (e.g., i.e, and etc) they're not good for non-native speakers.
Mostly things that should help the document be easily understood for non-native speakers.
@sallyarmneale Thanks for the reviews -- I've updated all the things you pointed at and upon the re-read found some places that I felt could be better worded as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me
Here we add mappings from the new
_BitInt
type to machine types.All that I believe needs to be changed in the ABI documents themselves are entries in the Mapping tables for the PCS rules.
I explain in the commit messages below the reasoning that this is all that is needed.
This PR also contains a rationale document.
As discussed I'm putting the initial proposed document up and going to add the comments I've already heard onto this github request (to keep the discussion as in the open as possible).