Build ID Section for WASM #133
Comments
Sounds good to me. Would the build ID be modified by tools that process the wasm, say by Binaryen when it optimizes the binary? Or would it stay fixed after it's emitted from the original compiler? (The former seems to make sense, as optimizations change the binary, but then we'd need to describe those changes here I think.) |
Definitely should change as the file changes. Of note is that in the Microsoft ecosystem the age on the PDB signature (those extra 4 bytes) gets incremented with every transformation. In my experience this has made things more complicated in practice because the age was not consistently changed everywhere. For instance, the age is stored more than once in the PE format and actually comes out desynchronized from Microsoft's own tools. I think it would be wiser to explicitly tell tools to always completely override the embedded ID if the file goes through a transformation. This does mean you can't track back to the original ID of the originally created WASM file, but I'm not sure if that is necessary in general. Would be curious to hear though if there are some advantages of the pdb+age system on the Microsoft side. |
How important is it to have an explicit field for this, as opposed to just having tools compute a hash of a wasm binary to use as an effective build ID? |
@sunfishcode since the ID needs to survive a stripping of the file, it's very important. With DWARF in place you normally want to separate out the object file into two: one that contains CODE and other sections necessary to run the code, a second one with the DWARF sections ( |
Can tools just hash the contents of the main wasm sections then, and ignore debug info sections? I don't have a strong opinion either way yet; I just want to understand the space. |
To generate the build ID they could take the hash of the main wasm sections and store it in the file. They can alternatively just generate a random UUID and embed it. I do think though that a build ID should ideally always be embedded. (This here describes the workflow where this information is particularly useful) |
I'm curious about what situations storing a Build ID in the file is better than computing a hash on demand whenever it's needed. Naively, computing it on demand would seem to have several advantages:
|
FWIW Breakpad supports the embedded, format specific identifiers, that @mitsuhiko mentioned, but if they aren't available for any reason, it falls back to computing an md5 of the first 1024 bytes of the TEXT section (or equivalent). You bring up some good points about some advantages computing the build ID has, but to me the point of the Build ID is to precisely identify a particular build so that different tools can always pair the code with the debug information, so allowing tools to choose their own hash function or which sections to hash, brings up problems when tools need to communicate with each other, eg. between a debugger and a symbol store that use different hash functions. So storing the Build ID does have some disadvantages, particularly when a tool does a transformation that doesn't also change the Build ID, but I'm much more concerned with tools having a consistent source of truth. |
I've now found this stackoverflow post which I found helpful. The Build ID isn't just a hash of the contents; it's something like a hash of the contents and the debug info together, which is then recorded and preserved, even if debug info is stripped. As such, it can't always be recomputed. There are a lot of use cases other than debug info that would seem to want something like a Build ID, but what they need is something subtly different from what the Build ID actually is. So, brainstorming here, what if we do have a Build ID section, but call it the "Debug Info ID", and say:
Would that make sense? |
I think it's fair to specifically call this a The hashing fallback path of breakpad has caused more issues than it solved so I would prefer we don't spec out something like this. |
Would this build ID be generated at the point when the debug info is split out (either by the linker, or some kind of post link debug-splitting tool)? Or would it be present even in binaries that still have their debug info embedded? |
@sbc100 definitely already in binaries that have the debug info embedded. We for instance have lots of cases where we want to symbolicate stacktraces where the client just submitted instruction addresses and then people upload the entire binary with debug information included. This is especially important normally when doing stack unwinding out of memory dumps. This obviously is less useful for wasm right now, but in terms of existing work flows having the debug ID even in unstripped binaries has been very valuable. |
Just as a counter-point, one downside of an embedded id seems to be precisely that it would usually survive destructive operations on the code. That is, if code is post-processed by a tool similar to wasm-opt or wasm-bindgen, and if that tool can't correctly update DWARF information, then the build id would remain the same even though the code has changed and no longer matches the debug info. In this case you as a consumer (Sentry or otherwise) explicitly don't want such debug info to be matched and used. Arguably, every such tool should either support DWARF or be able to at least change build ID to some new unique value, but it seems that hashing of code section would alleviate this concern even more naturally. |
Since we're adding WASM DWARF support at Sentry at the moment we might be going ahead and require customers to embed a |
@mitsuhiko Does the "hash of the code section" idea not work for you? |
@RReverser Generally I did not define how the For what it's worth embedding a random |
Yeah, walrus is a high-level IR and, as such, rewrites even the code you didn't touch, which, in turn, affects debug offsets. You need a lower-level representation instead, e.g. [shameless plug] you can try my wasmbin library which was created with similar use-cases in mind. https://github.com/GoogleChromeLabs/wasmbin |
I've pushed an example for random You'll probably want to extend it to be more robust (e.g. add detection of existing |
Oh this is neat. Going to use this. |
Come to think of it, due to the nature of the Wasm binary format, if you didn't want to check for presence of existing
use std::fs::OpenOptions;
use std::io::Write;

fn main() -> std::io::Result<()> {
    let filename = std::env::args()
        .nth(1)
        .expect("Provide a filename as an argument");
    // Open in append mode: the new section goes at the very end of the module.
    let mut f = OpenOptions::new().append(true).open(filename)?;
    f.write_all(&[
        // Custom section (id=0)
        0x00,
        // Length of payload (length of length of name + length of name + length of UUID)
        1 + 8 + 16,
        // Length of name
        8,
    ])?;
    f.write_all("build_id".as_bytes())?;
    f.write_all(uuid::Uuid::new_v4().as_bytes())?;
    Ok(())
}
Won't save too much in terms of perf and the code won't be as clean, but hey, it's possible in case you want to avoid any dependencies altogether and make a tiny util :) |
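The hand-written header in the snippet above works because the 25-byte payload fits into a single length byte. As a hedged sketch (the helper names `leb128_u32` and `custom_section` are made up for illustration and are not part of any tool mentioned in this thread), the same bytes could be produced with proper unsigned LEB128 length encoding, which is what the Wasm binary format uses for section and name sizes:

```rust
/// Encode `value` as unsigned LEB128, the variable-length integer
/// encoding the Wasm binary format uses for sizes.
fn leb128_u32(mut value: u32, out: &mut Vec<u8>) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            // Last byte: continuation bit stays clear.
            out.push(byte);
            return;
        }
        // More bytes follow: set the continuation bit.
        out.push(byte | 0x80);
    }
}

/// Build the bytes of a custom section (id 0) with the given name and payload.
/// Hypothetical helper, shown only to illustrate the layout.
fn custom_section(name: &str, payload: &[u8]) -> Vec<u8> {
    // Section body: name length, name bytes, then the raw payload.
    let mut body = Vec::new();
    leb128_u32(name.len() as u32, &mut body);
    body.extend_from_slice(name.as_bytes());
    body.extend_from_slice(payload);

    // Section: id byte, body length, body.
    let mut section = vec![0x00];
    leb128_u32(body.len() as u32, &mut section);
    section.extend_from_slice(&body);
    section
}
```

For a 16-byte ID, `custom_section("build_id", &id)` yields the same 27 bytes that the snippet above writes by hand; the helper only starts to matter once the payload reaches 128 bytes or more.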
I extended your tool into one that does not override existing build IDs and also splits the file into two: getsentry/symbolicator#303 |
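Not overriding an existing build ID means scanning the module for a `build_id` custom section before appending one. The following is only a sketch of that check, not the code from the linked change; `read_leb128_u32` and `find_custom_section` are hypothetical names, and the only assumption is the standard Wasm layout of an 8-byte header followed by (id, size, body) sections:

```rust
/// Read an unsigned LEB128 integer at `*pos`, advancing `*pos` past it.
/// Returns `None` on truncated or over-long input.
fn read_leb128_u32(bytes: &[u8], pos: &mut usize) -> Option<u32> {
    let mut result = 0u32;
    let mut shift = 0u32;
    loop {
        let byte = *bytes.get(*pos)?;
        *pos += 1;
        if shift > 28 {
            return None; // more than 5 bytes cannot fit in a u32
        }
        result |= u32::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some(result);
        }
        shift += 7;
    }
}

/// Return the payload of the first custom section called `name`, if any.
fn find_custom_section<'a>(wasm: &'a [u8], name: &str) -> Option<&'a [u8]> {
    // A module starts with the magic "\0asm" followed by a 4-byte version.
    if wasm.len() < 8 || &wasm[..4] != b"\0asm" {
        return None;
    }
    let mut pos = 8;
    while pos < wasm.len() {
        let id = wasm[pos];
        pos += 1;
        let size = read_leb128_u32(wasm, &mut pos)? as usize;
        let end = pos.checked_add(size)?;
        let body = wasm.get(pos..end)?;
        if id == 0 {
            // Custom section: the body starts with a length-prefixed name.
            let mut p = 0;
            let name_len = read_leb128_u32(body, &mut p)? as usize;
            let name_end = p.checked_add(name_len)?;
            if body.get(p..name_end)? == name.as_bytes() {
                return Some(&body[name_end..]);
            }
        }
        pos = end;
    }
    None
}
```

A tool would then append a fresh ID only when `find_custom_section(&bytes, "build_id")` returns `None`.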
I think we should pick this up and add support to LLVM/emscripten to make this easier.
@sunfishcode also suggests above that tools not write a build ID if they don't generate debug info. I don't really see the harm either way; a wasm file that never had debug info will be indistinguishable from one that had debug info stripped out. If we specify that (or even just implement the linker such that) the build id is a hash of some file contents, that would slow down linking, so we'd want to get some benefit in return for it. |
I don't think we want a random UUID for |
Yeah, build determinism is a good point; LLVM and emscripten should definitely have that, even if other tools might not care. GNU ld and ELF lld actually have all three options (hashing sections, picking a random UUID, and using a value specified on the command line).
... Actually, looking at ELF lld's implementation, maybe we just want to hash the entire output file. |
(sorry we raced). Yes, a tool-conventions doc like that one would be perfect, to specify the section's format. |
That approach seems reasonable to me. I guess this is not unlike relocation entries, which get written with placeholders and then updated. The difference here is that we would obviously need to wait until all other sections have been written, since we would be hashing their final content. |
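The placeholder-then-patch idea can be sketched as follows. Everything here is an assumption made for illustration: FNV-1a is only a stand-in for whatever fast hash a linker would really use, the 8-byte ID size mirrors the "fast" hash size mentioned in this thread, the section layout is the raw name-plus-bytes layout from the snippet earlier in the thread, and `append_hashed_build_id` is a made-up name rather than anything in lld:

```rust
/// 64-bit FNV-1a. Used purely as a stand-in for a linker's fast hash.
fn fnv1a_64(data: &[u8]) -> u64 {
    let mut hash = 0xcbf29ce484222325u64; // FNV offset basis
    for &b in data {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Append a `build_id` section with a zeroed 8-byte placeholder, hash the
/// complete output, then overwrite the placeholder with the hash.
fn append_hashed_build_id(output: &mut Vec<u8>) -> [u8; 8] {
    // Section header: custom section id, payload length, name length.
    output.extend_from_slice(&[0x00, 1 + 8 + 8, 8]);
    output.extend_from_slice(b"build_id");
    let id_offset = output.len();
    output.extend_from_slice(&[0u8; 8]); // placeholder

    // Every other byte is final at this point, so the hash covers the
    // whole file with the ID field still zeroed.
    let id = fnv1a_64(&output[..]).to_le_bytes();
    output[id_offset..].copy_from_slice(&id);
    id
}
```

Hashing with the field zeroed keeps the result reproducible: the same input bytes always produce the same ID, and a verifier can recompute it by zeroing the field again before hashing.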
Thanks for the extra details. Some thoughts (with the caveat that I am not familiar with other conventions here from ELF or other formats):
|
|
Ahh yeah, I guess random is problematic when it comes to reproducible builds. Maybe the user-supplied string is an easy one to start with then? The idea of supporting the various different types sounds great, but I'm just seeing if we can scope this down a bit so it's easier to make progress on. |
As for hash format, there is probably quite some flexibility here, but traditionally the limitations often came from the intention to support some form of Breakpad compatibility. The default debug ID field has space for a UUID/GUID + 4 bytes as u32 (the age field). Since Mach-O selects a UUID for the hash and PDB uses this UUID + 32-bit age, it's probably not a bad idea to encourage tools to emit a reproducible UUID (v3 or v5) as build ID. That has the highest form of compatibility. Knowing which exact type of a build ID something is has not been useful in our experience. (For additional context, this is the abstraction we use for what we call Breakpad-compatible debug IDs: https://docs.rs/debugid/0.7.2/debugid/struct.DebugId.html, where any GNU build ID longer than 16 bytes is chopped off and an age of 0 is always used. We then use the original GNU build ID as secondary information for debug file lookup. Our symbol server lookup strategies are documented here: https://getsentry.github.io/symbolicator/advanced/symbol-lookup/) |
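The truncation described above (keep at most 16 bytes, always use age 0) can be illustrated with a small sketch. This is not the `debugid` crate's implementation: `to_debug_id` and `format_debug_id` are hypothetical names, zero-padding of short IDs is an assumption, and the printed form is illustrative rather than the crate's exact output:

```rust
/// Reduce a build ID of any length to a 16-byte UUID-sized value plus age.
/// Longer IDs are chopped off; shorter ones are zero-padded (an assumption).
fn to_debug_id(build_id: &[u8]) -> ([u8; 16], u32) {
    let mut uuid = [0u8; 16];
    let n = build_id.len().min(16);
    uuid[..n].copy_from_slice(&build_id[..n]);
    // The age is always 0 for IDs that do not come from a PDB.
    (uuid, 0)
}

/// Render the 16 bytes in the usual 8-4-4-4-12 hex grouping, followed by the
/// age in hex. Illustrative formatting only.
fn format_debug_id(uuid: &[u8; 16], age: u32) -> String {
    let hex: String = uuid.iter().map(|b| format!("{:02x}", b)).collect();
    format!(
        "{}-{}-{}-{}-{}-{:x}",
        &hex[0..8],
        &hex[8..12],
        &hex[12..16],
        &hex[16..20],
        &hex[20..32],
        age
    )
}
```

A 20-byte GNU build ID loses its last 4 bytes in this mapping, which is why the full original ID is still kept as secondary lookup information.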
Sorry I've sat on this so long. Let's finally get it done. I uploaded #183 which I think captures what we've discussed here. After hearing @mitsuhiko's experience that knowing the exact type of ID isn't useful (and not being able to think of any use myself) I decided to just leave it out of the encoding. |
Also I just realized that I didn't take @mitsuhiko's advice and encourage a reproducible UUID as the output (or implement one in lld in https://reviews.llvm.org/D107662); instead I went with the same default lld uses for ELF (which is actually just an 8-byte "fast" hash). Do you think that's compatible "enough" or should we invent something new in lld? |
@mitsuhiko I guess a followup question, if I were to make lld generate a v5 UUID (based on a hash of the contents), what would I use as the "namespace" UUID to go with it? |
Would it be reasonable to just generate a random UUID once and bake it into the llvm code, as an "llvm namespace"? |
@dschuff about the namespace it probably doesn't matter. You can probably hardcode a random ID and just use that consistently and document it. I don't have any expectations that there is a tool independent way of generating the same reproducible IDs. It's more important that the tool itself has some stability. |
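A v5 UUID is the SHA-1 of a namespace UUID followed by the name bytes, truncated to 16 bytes with the version and variant bits set. The sketch below shows that construction with a hardcoded namespace; `TOOL_NAMESPACE` is a made-up value for illustration, not the one used by LLVM, and the inline SHA-1 exists only to keep the example dependency-free:

```rust
/// Hypothetical fixed namespace baked into the tool (arbitrary bytes).
const TOOL_NAMESPACE: [u8; 16] = [
    0x3b, 0x1e, 0x52, 0xa7, 0x90, 0x4c, 0x4d, 0x18,
    0x8f, 0x26, 0x7a, 0xd1, 0x05, 0xc3, 0x6e, 0x92,
];

/// Plain SHA-1, included only so the sketch has no dependencies.
fn sha1(data: &[u8]) -> [u8; 20] {
    let mut h: [u32; 5] = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0];
    // Padding: 0x80, zeros up to 56 mod 64, then the bit length (big-endian).
    let mut msg = data.to_vec();
    msg.push(0x80);
    while msg.len() % 64 != 56 {
        msg.push(0);
    }
    msg.extend_from_slice(&(data.len() as u64).wrapping_mul(8).to_be_bytes());

    for chunk in msg.chunks_exact(64) {
        let mut w = [0u32; 80];
        for i in 0..16 {
            w[i] = u32::from_be_bytes([
                chunk[4 * i], chunk[4 * i + 1], chunk[4 * i + 2], chunk[4 * i + 3],
            ]);
        }
        for i in 16..80 {
            w[i] = (w[i - 3] ^ w[i - 8] ^ w[i - 14] ^ w[i - 16]).rotate_left(1);
        }
        let [mut a, mut b, mut c, mut d, mut e] = h;
        for i in 0..80 {
            let (f, k) = match i {
                0..=19 => ((b & c) | (!b & d), 0x5A827999u32),
                20..=39 => (b ^ c ^ d, 0x6ED9EBA1),
                40..=59 => ((b & c) | (b & d) | (c & d), 0x8F1BBCDC),
                _ => (b ^ c ^ d, 0xCA62C1D6),
            };
            let temp = a
                .rotate_left(5)
                .wrapping_add(f)
                .wrapping_add(e)
                .wrapping_add(k)
                .wrapping_add(w[i]);
            e = d;
            d = c;
            c = b.rotate_left(30);
            b = a;
            a = temp;
        }
        for (state, word) in h.iter_mut().zip([a, b, c, d, e]) {
            *state = state.wrapping_add(word);
        }
    }
    let mut out = [0u8; 20];
    for (i, word) in h.iter().enumerate() {
        out[4 * i..4 * i + 4].copy_from_slice(&word.to_be_bytes());
    }
    out
}

/// Name-based (version 5) UUID: SHA-1 over namespace || name.
fn uuid_v5(namespace: &[u8; 16], name: &[u8]) -> [u8; 16] {
    let mut input = namespace.to_vec();
    input.extend_from_slice(name);
    let digest = sha1(&input);
    let mut uuid = [0u8; 16];
    uuid.copy_from_slice(&digest[..16]);
    uuid[6] = (uuid[6] & 0x0f) | 0x50; // version 5
    uuid[8] = (uuid[8] & 0x3f) | 0x80; // RFC 4122 variant
    uuid
}
```

Calling `uuid_v5(&TOOL_NAMESPACE, contents)` on the same bytes always returns the same ID, so builds stay reproducible. Two tools with different namespaces produce different IDs for identical input, which matches the point above that cross-tool agreement is not expected.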
I updated the prototype in https://reviews.llvm.org/D107662 |
I think the implementation and document in #183 capture what we've worked out here. Feel free to reopen (or open a new issue, as appropriate) if there are objections or changes we should make. Since there's no ecosystem yet, I don't think it's too late to make breaking changes if we do it soon. |
This is awesome, I see that this just landed in llvm, can you update once we know the version of emscripten this is associated with? |
This change is now included in emscripten 3.1.33 |
The Zig wasm linker supports it as well. |
Zig has its own wasm linker? Is it based on wasm-ld or something different? |
Written from scratch, like other linkers. |
FWIW this tag is now also natively recognised by |
I originally brought this up in the design repo (WebAssembly/design#1306) but I believe this fits here better.
For deferred symbolication on services like Sentry it would be nice to be able to match up DWARF debug information to the main WASM file by build ID. In ELF this is typically accomplished with the GNU build ID note, on Windows with the PDB signature and age, and on Darwin the Mach-O UUID fulfills that purpose.
I would love to see a build_id custom section that contains a 16 or 20 byte ID which tools would ensure remains in both WASM files (CODE, debug companion containing DWARF info) if they get split. Capping it at 16 bytes makes it possible to roundtrip this through breakpad which uses a 16+4 byte char array for the debug id. 16 for the PDB UUID + 4 byte for the PDB age.
Motivation: Sentry and other systems like to be able to look up files by build ID because then they can access an external symbol server for that information. That way one just provides some sources where debug information can be found and then symbolicators just reach out to that service to find the debug information files.