[cdac] Data Descriptor Spec #100253
Conversation
As a first pass, though, this seems more verbose and complicated than previous prototypes.
struct BinaryBlobDataDescriptor
{
    struct Directory {
        uint32_t TypesStart;
When would these Start values not be 0?
So this comes from the requirement that we would like to be able to read binary descriptors directly out of object files without understanding the object file format.
That means that if the object file is produced by a C compiler, we want to be able to ignore any padding or alignment that the C compiler adds. Because we have fixed-size arrays of structs as part of BinaryBlobDataDescriptor, the C compiler may add padding between where one array ends and the next one starts. So to make the blob self-describing, we need to capture the offsets of the fields of BinaryBlobDataDescriptor as part of the blob itself.
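A minimal sketch of the idea (hypothetical field names and sizes, not the actual CoreCLR layout): the directory records, inside the blob itself, where the trailing arrays start, so a reader can locate them without knowing what padding the compiler inserted between members.

```cpp
// Minimal sketch; the field names and array sizes are assumptions.
#include <stddef.h>
#include <stdint.h>

struct TypeSpec  { uint32_t NameIdx; uint32_t FieldsStart; uint16_t Size; };
struct FieldSpec { uint32_t NameIdx; uint32_t TypeNameIdx; uint16_t Offset; };

struct BinaryBlobDataDescriptor {
    struct Directory {
        uint32_t TypesStart;   // offset of Types[] from the start of the blob
        uint32_t FieldsStart;  // offset of Fields[] from the start of the blob
        uint32_t NamesStart;   // offset of the name pool
    } Directory;
    TypeSpec  Types[2];
    FieldSpec Fields[4];
    uint8_t   NamePool[64];
};

// The directory is filled in with offsetof, so the blob describes its own
// layout regardless of any compiler-inserted padding between the arrays.
static const BinaryBlobDataDescriptor Blob = {
    { offsetof(BinaryBlobDataDescriptor, Types),
      offsetof(BinaryBlobDataDescriptor, Fields),
      offsetof(BinaryBlobDataDescriptor, NamePool) },
    // Types, Fields and NamePool contents omitted for brevity.
};
```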
Does that mean this field is the relative offset from the start of the BinaryBlobDescriptor to the start of the Types array and we'd never expect a reader to know that BinaryBlobDescriptor has a Types field? If I followed you correctly I'd be tempted to omit all of those array fields from the struct definition or add a comment saying those fields are implementation details that a reader shouldn't rely upon.
I'd be tempted to omit all of those array fields from the struct definition or add a comment saying those fields are implementation details that a reader shouldn't rely upon.
+1. We do not need to spec the intermediate artifacts used in the process of building the final descriptor. They can be just our internal implementation detail.
@jkotas ok, I'm gonna take one more pass at simplifying things:
- only document the JSON physical format (but with a second flavor for the in-proc version that has an aux array for the global pointers)
- keep the "object file blob" stuff as an implementation detail for tooling that will create the in-proc JSON data descriptor
- drop the requirement to be able to scrape the on-disk representation out of the spec

The build tooling (I'm thinking of it as `cdac-build-tool`) will work like this:
- the runtime build generates one or more object files with blobs containing offsets and struct sizes
- `cdac-build-tool` scrapes the object files and a baseline and writes out a C file that contains a JSON data descriptor and an aux array of globals (see the sketch below)
- we compile the C file and link it into the runtime

At diagnostic time:
- DAC tools find the data descriptor in target process memory (not on disk)
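A minimal sketch of the kind of C file `cdac-build-tool` might emit (my illustration; the symbol names, JSON shape, and example values are assumptions, not the actual contract):

```cpp
// Hedged illustration of a generated file that gets compiled and linked into
// the runtime. All names here are assumed for the example.
#include <stdint.h>

extern "C" {

// A pointer-valued global that is only known at run time (assumed symbol).
extern uint8_t* g_pThreadStore;

// Aux array: the JSON below refers to entries here by index via "indirect".
const void* const DotNetRuntimeAuxData[] = {
    &g_pThreadStore,   // index 0
};

// JSON data descriptor baked into the data segment as a constant string.
// Build-time constants are stored as literal values; only runtime addresses
// go through the "indirect" form.
const char DotNetRuntimeDataDescriptor[] =
    "{"
    " \"types\": ["
    "   { \"name\": \"Thread\","
    "     \"fields\": [ { \"name\": \"ThreadId\", \"type\": \"uint32\", \"offset\": 32 } ] }"
    " ],"
    " \"globals\": ["
    "   { \"name\": \"FEATURE_EH_FUNCLETS\", \"value\": 1 },"
    "   { \"name\": \"s_pThreadStore\", \"value\": { \"indirect\": 0 } }"
    " ]"
    "}";

} // extern "C"
```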
## Example

An example C header describing some data types is given in [sample.data.h](./sample.data.h).
Any thoughts on what this might look like for NativeAOT data structures defined in C#?
Or, before that, have we identified any NativeAOT data structures we'd want to expose from managed code? So far I believe every type described in NativeAOT's DebugHeader has always been defined in native code.
In the resulting .o file it will look the same. It can be produced by other means - it doesn't need to be produced by C macros. (In fact, even for CoreCLR it doesn't need to be produced by macros. It can be created with C++ `constexpr` and templates.) The preprocessor example is just meant to demonstrate that this can be constructed using C for the most restricted runtimes.
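For illustration only (this is not an existing prototype; the descriptor types and the `Thread` layout here are made up), a `constexpr`-based construction could look roughly like this:

```cpp
// Hedged sketch: building an equivalent descriptor with C++ constexpr
// instead of preprocessor macros. The descriptor types are hypothetical.
#include <cstddef>
#include <cstdint>

struct FieldSpec { const char* Name; const char* Type; uint16_t Offset; };

template <std::size_t N>
struct TypeSpec {
    const char* Name;
    uint16_t    Size;
    FieldSpec   Fields[N];
};

// An example runtime type we want to describe.
struct Thread { uint32_t ThreadId; uint32_t ThreadState; Thread* Next; };

// A compile-time constant that ends up in the read-only data segment.
constexpr TypeSpec<3> ThreadDescriptor = {
    "Thread",
    sizeof(Thread),
    {
        { "ThreadId",    "uint32",  offsetof(Thread, ThreadId) },
        { "ThreadState", "uint32",  offsetof(Thread, ThreadState) },
        { "Next",        "pointer", offsetof(Thread, Next) },
    },
};
```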
it doesn't need to be produced by C macros
I wasn't worried that it would be constrained to macros :) I just have no understanding of what alternative method we would use to generate it instead and I'm searching for info that helps me fill in that gap.
I just have no understanding of what alternative method we would use to generate it instead
I have no idea. Perhaps a new phase for ILCompiler, or possibly some source generator that runs over NativeAOT's runtime source code and produces a C# blittable struct definition that follows the format we expect.
Yes, I would expect a variant of `cdac-build-tool` to be part of the native AOT ILCompiler in the fullness of time.
## Global value descriptors

Each global value descriptor consists of:
* a name
My guess is that ordinals would be a bit simpler as well as being more compact, but if it fits in the size targets I'm fine with names.
I wouldn't be surprised if a dense ordinal-based encoding without any baseline values also fits within the perf size goals. If that meant we didn't need any custom build tooling, JSON format, or baseline offset tables, that seems like a nice simplification. The part that remains an unknown for me on that route is how we deal with C# globals/types in NativeAOT, assuming that we would have some. So far all the DebugHeader work for NativeAOT has never needed to encode info for a managed type or global, and I don't know what capabilities we have to work with in the NativeAOT toolchain.
It sounds like my suggestion is not the direction you are choosing to go. If you want to discuss it more I'm happy to, but I also don't want to be a distraction when you've decided on another approach already. I believe the current approach with JSON, baselines, and custom build tools can work even if it feels more complex than needed to me. Feel free to mark this resolved.
it's a build tooling implementation detail
Assuming we're converging toward consensus on this PR, the next thing I'd like to tee up is a spec for the overall contract descriptor #100365
* `"type": "type name"` the name of a primitive type or another type defined in the logical descriptor | ||
* optional `"offset": int | "unknown"` the offset of the field or "unknown". If omitted, same as "unknown". | ||
|
||
Note that the logical descriptor does not contain "unknown" offsets: it is expected that the binary |
Maybe call out specifically that this also means only the baseline data descriptor should have 'unknown' offsets and the in-memory one should not.
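For example (an illustrative pairing, not text from the spec): a baseline entry may leave an offset unknown, and the in-memory descriptor then supplies the concrete value:

    baseline (verbose form):  { "name": "ThreadState", "type": "uint32", "offset": "unknown" }
    in-memory (compact form): "ThreadState": 0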
"types": [ | ||
{ | ||
"name": "Thread", | ||
"fields": [ |
We may want to use a more compact style for key/value pairs, for the in-memory descriptor at least; the JSON parser can support both less verbose and more verbose variants:

    "types": {
        "Thread": { "ThreadId": 32, "ThreadState": 0, "Next": 128 },
        ...
    }

I know the right way to serialize key/value pairs in JSON is one of the FAQs...
Added a "compact" JSON variant to the spec
}
],
"globals": [
{ "name": "s_pThreadStore", "value": { "indirect": 0 } }
Should `indirect` be 1 in this example? The value at index 0 below is `FEATURE_EH_FUNCLETS`.
No. The value of FEATURE_EH_FUNCLETS in this example is stored in the baseline, and the in-memory version happened to match and doesn't override it. (Even if it did override it, we would probably store it as a literal value in the JSON since it is known at build time and won't vary at execution time. There is no need to store constant values in the aux array.) I'll clarify the example.
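To make the distinction concrete (my illustration, reusing the globals syntax quoted above; the values are examples only):

    a build-time constant, stored as a literal:            { "name": "FEATURE_EH_FUNCLETS", "value": 1 }
    a runtime address, stored via index 0 of the aux array: { "name": "s_pThreadStore", "value": { "indirect": 0 } }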
If the value is given as `{"indirect": int}` then the value is stored in an auxiliary array that is part of the data contract descriptor. Only in-memory data descriptors may have indirect values; baseline data descriptors may not have indirect values.

Rationale: This allows tooling to generate the in-memory data descriptor as a single constant
I was under the impression that the JSON format was for documentation purposes only and it would never appear in memory of the target process. What is the scenario where we expect the JSON format to be embedded in memory?
I was under the impression that the JSON format was for documentation purposes only
The well-known baseline JSON data will be embedded into the CDAC reader tool. It's not just for documentation.
What is the scenario where we expect the JSON format to be embedded in memory?
If the in-proc data descriptor is going to be a data-segment constant that is being generated by `cdac-build-tool`, there didn't seem to be any reason why it should be a binary blob. JSON is easy to read with a variety of off-the-shelf parsers.
Also:
- The size of the compact JSON format is not that much different from the size of the originally proposed binary format, and there are standardized ways to make JSON smaller if we need to (compression, binary JSON, ...)
- Designing our own binary format for structured data is a rathole
I just misunderstood the text. I thought it was saying the JSON would be inside the target process being debugged. If this JSON is included in the reader then I have no concern with that at all. I consider it an implementation choice of any diagnostic tool, including ours, how it wants to utilize the documentation. Sorry for confusion!
I thought it was saying the JSON would be inside the target process being debugged.
The text is saying that the (compact) JSON will be inside the runtime binary in the target process.
Sorry, it looks like I had it right the first time. I see there was an edit earlier today that eliminated the binary format entirely and just uses the JSON as the in-memory format.
This feels like it is getting progressively further away from the approach I would have favored, but I think you can make it work and it feels like my opinion on this is in the minority. My impression is that defining flat in-memory structs with no baseline at all would be easier to generate (no build tooling needed or baseline to manage), easier to consume (no JSON parser needed), and fits in our size goals. As an example, the GC already does this, as have past successful prototypes. I know I suggested having a baseline at an earlier point, but pretty quickly it felt like the amount of complexity it added to the design made it feel not worth the potential size savings.
In any case if your vision is different and you think JSON, baselines, and custom build tools are the best approach go for it. I see no reason it won't work even if it isn't the approach I would have picked.
it felt like the amount of complexity it added to the design made it feel not worth the potential size savings.
@lambdageek and I chatted about this too. I'm hoping we can focus more on actually understanding what the entire description will be, and then after we know for sure we can apply baseline logic or just accept that composition wasn't needed because the size isn't as bad as assumed.
no baseline at all would be easier to generate (no build tooling needed or baseline to manage)
I'm hoping we can focus more on actually understanding what the entire description will be
I think the design having the space for baselining makes sense, but I would like to start with embedding everything in the target process and see where that is with `PrintException` and how much we think it will grow. And then actually use/implement the baselining if deemed necessary.
Part of my thinking that we wouldn't need a baseline is that I'm expecting a simple ordinal-based binary encoding to be between 10-40% of the size of a JSON encoding for the same set of data. For example, you might see that a JSON encoding of the entire runtime descriptor with no baseline is 20KB in size, which we decide is too big. Then we either save size by sticking with JSON and using a baseline, or we save size by using some simple structs with binary data that is substantially more efficient.
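For reference, the kind of flat, ordinal-indexed encoding being suggested might look roughly like this (a hypothetical sketch, not part of this spec; the ordinal names, offsets, and sizes are made up):

```cpp
// Hypothetical sketch of a flat, ordinal-indexed binary encoding. Ordinals
// are well known to both the runtime and the reader, so no names or JSON
// live in the target process.
#include <stdint.h>

enum FieldOrdinal : uint16_t {
    Thread_ThreadId    = 0,
    Thread_ThreadState = 1,
    Thread_Next        = 2,
    // ... one ordinal per field the contracts care about
};

struct FieldOffsetEntry {
    uint16_t Ordinal; // a FieldOrdinal value
    uint16_t Offset;  // that field's offset in this runtime build
};

// Emitted into the runtime's data segment; the reader looks entries up by
// ordinal. Roughly 4 bytes per field versus a JSON key/value pair.
const FieldOffsetEntry g_fieldOffsets[] = {
    { Thread_ThreadId,    32 },
    { Thread_ThreadState,  0 },
    { Thread_Next,       128 },
};
```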
| s_pThreadStore | pointer | 0x0100ffe0 |

The `FEATURE_EH_FUNCLETS` global's value comes from the baseline - not the in-memory data
descriptor. By contrast, `FEATUER_COMINTEROP` comes from the in-memory data descriptor - with the
Suggested change:
- descriptor. By contrast, `FEATUER_COMINTEROP` comes from the in-memory data descriptor - with the
+ descriptor. By contrast, `FEATURE_COMINTEROP` comes from the in-memory data descriptor - with the
Unknown offsets are not supported in the compact format.

Rationale: the compact format is expected ot be used for the in-memory data descriptor. In the
Suggested change:
- Rationale: the compact format is expected ot be used for the in-memory data descriptor. In the
+ Rationale: the compact format is expected to be used for the in-memory data descriptor. In the
Rationale: the compact format is expected ot be used for the in-memory data descriptor. In the
common case the field type is known from the baseline descriptor. As a result, a field descriptor
like `"field_name": 36` is the minimum necessary information to be conveyed. If the field is not
present in the baseline, then `"field_name": [12, "uint16"]` may be used.
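As an illustration of the compact form described above (my example; `GCHandle` is a hypothetical field not present in the baseline):

    "Thread": {
        "ThreadId": 36,
        "GCHandle": [12, "uint16"]
    }

Here `ThreadId` only needs its offset because its type is known from the baseline, while `GCHandle` also spells out its type because the baseline does not describe it.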
"may be used" or "must be used"?
## Version <version_number>

#### Baseline data descriptor identifier
This should be moved to `data_descriptor.md` where it talks about the baseline descriptor. This higher-level doc should only have a brief mention.
Insert description (if possible) about what is interesting about this particular version of the contract

#### Global Values
Global values which can be of types (int8, uint8, int16, uint16, int32, uint32, int64, uint64, pointer, nint, nuint)
Suggested change:
- Global values which can be of types (int8, uint8, int16, uint16, int32, uint32, int64, uint64, pointer, nint, nuint)
+ Global values can be either primitive integer constants or pointers.

The detailed explanation is in the linked doc.
## Version <version_number>, DEFAULT

#### Data Structure Layout
Each data structure layout has a name for the type, followed by a list of fields. These fields can be of primitive types (int8, uint8, int16, uint16, int32, uint32, int64, uint64, nint, nuint, pointer) or of another named data structure type. Each field descriptor provides the offset of the field, the name of the field, and the type of the field.
Suggested change:
- Each data structure layout has a name for the type, followed by a list of fields. These fields can be of primitive types (int8, uint8, int16, uint16, int32, uint32, int64, uint64, nint, nuint, pointer) or of another named data structure type. Each field descriptor provides the offset of the field, the name of the field, and the type of the field.
+ Each data structure layout has a name for the type, followed by a list of fields. These fields can be of primitive integer types, pointers or of another named data structure type. Each field descriptor provides the offset of the field, the name of the field, and the type of the field.
LGTM modulo feedback. Thank you!
Rationale: the compact format is expected to be used for the in-memory data descriptor. In the
common case the field type is known from the baseline descriptor. As a result, a field descriptor
like `"field_name": 36` is the minimum necessary information to be conveyed. If the field is not
present in the baseline, then `"field_name": [12, "uint16"]` must be used.
Thinking about it some more:
I thought that the types are only meant for documentation purposes, but I do not see it actually mentioned anywhere. Do we expect the types to be used in the reader?
If the types were for documentation purposes only, it does not make sense to mention them in the compact form.
I think this is just a choice of what we want to happen if a field ever changes its type. One option is that the type is implicit in the name, so the name must change, either in the source or via some override the tooling supports. The other option is that we include the type explicitly so that the reader can match on the (type, name) tuple.
My vote would be for requiring a field rename if it changes type.
Building on #100253, describe an in-memory representation of the top-level contract descriptor, comprised of:
* some target architecture properties
* a data descriptor
* a collection of compatible contracts

Contributes to #99298. Fixes #99299.

* [cdac] Physical contract descriptor spec
* Add "contracts" to the data descriptor
* one runtime per module: if there are multiple hosted runtimes, diagnostic tooling should look in each loaded module to discover the contract descriptor
* Apply suggestions from code review
* Review feedback
  - put the aux data and descriptor sizes closer to the pointers
  - Don't include the trailing nul in `descriptor_size`. Clarify that it is counting bytes and that `descriptor` is in UTF-8
  - Simplify the `DotNetRuntimeContractDescriptor` naming discussion

Co-authored-by: Elinor Fung <elfung@microsoft.com>
Contributes to #100162, which is part of #99298.
Follow-up to #99936 that removes "type layout" and "global value" contracts and instead replaces them with a "data descriptor" blob.
Conceptually a particular target runtime provides a pair of a logical data descriptor together with a set of algorithmic contract versions. The logical data descriptor is just a single blob that defines all the globals and type layouts relevant to the set of algorithmic contract versions.
A logical data descriptor is realized by merging several physical data descriptors in a prescribed order.
The physical data descriptors provide some subset of the type layouts or global values.
The physical data descriptors come in two flavors: baseline descriptors that are well known to the diagnostic tooling, and in-memory descriptors that are embedded in a particular runtime build.
Each physical descriptor may refer to a baseline and represents a delta applied on top of the baseline.
The data contract model works on top of a flattened logical data descriptor.
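A rough sketch of the flattening step as I read it from this description (hypothetical types; the merge rule assumed here is that later physical descriptors override earlier ones for matching names, and the spec is authoritative on the exact order):

```cpp
// Hedged sketch of flattening physical descriptors into the logical one.
#include <cstdint>
#include <map>
#include <string>

struct TypeLayout {
    uint32_t Size = 0;
    std::map<std::string, uint32_t> FieldOffsets; // field name -> offset
};

struct PhysicalDescriptor {
    std::map<std::string, TypeLayout> Types;
    std::map<std::string, uint64_t>   Globals;
};

struct LogicalDescriptor {
    std::map<std::string, TypeLayout> Types;
    std::map<std::string, uint64_t>   Globals;
};

// Apply the baseline first, then the in-memory delta on top of it.
LogicalDescriptor Flatten(const PhysicalDescriptor& baseline,
                          const PhysicalDescriptor& inMemory)
{
    LogicalDescriptor result{ baseline.Types, baseline.Globals };
    for (const auto& [name, layout] : inMemory.Types) {
        TypeLayout& merged = result.Types[name];
        if (layout.Size != 0)
            merged.Size = layout.Size;
        for (const auto& [field, offset] : layout.FieldOffsets)
            merged.FieldOffsets[field] = offset; // delta wins over baseline
    }
    for (const auto& [name, value] : inMemory.Globals)
        result.Globals[name] = value;            // delta wins over baseline
    return result;
}
```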