lib: log vCPU diagnostics on triple fault and for some unhandled exit types #795
The clippy failure is a new suggestion from 1.82. I'll fix that separately.
OK, I think this is ready for another review lap; I'll tag folks back in presently. I haven't added any API support for toggling the logging flag in this PR, mostly because I don't know what the plumbing is going to look like in Nexus, and I'm reluctant to have an unused internal API sitting around for an unknown length of time. I think this would be very, very straightforward to add, though, either as an instance ensure parameter or as a separate API that controls the behavior of the server process writ large.
seems good to me, big fan of a bit we can toggle for this. i mentioned the `bhyve-api` part but same as #794 i'm not pressed about that happening exactly right now as long as one of us gets to it sooner than later (e.g. i'm happy to make a note to wire that up this week)
pub fn read<T: Copy + FromBytes>(
    &self,
    addr: GuestAddr,
) -> Option<GuestData<T>> {
fwiw this diff misses the same thing i did in #794, which is that `do_read` in `bhyve-api` reads from guest memory too. i was figuring that'd be a relatively small change to chase through this week but i wanted to go read the crate and bhyve side a bit first. if you don't get to it here, that future PR will be me doing "bring `MemCtx::read` improvements to `do_read`", if you do change that here too then that's also fine!
My understanding of `do_read` is that it's used to read state from the in-kernel VMM: things like the state of in-kernel emulated devices, bhyve-managed MSRs, and captured VMX/SVM state that might need to be migrated.
Some of that could technically have come from the guest (via WRMSR or an I/O instruction), but to my mind none of it is really guest application data of the sort we're looking to protect in this PR. (I'm going off the list of device state information classes in `bhyve-api/sys/src/vmm_data.rs` to draw this conclusion; there are `VDC_REGISTER` and `VDC_FPU` classes, but AFAICT these haven't actually been implemented for this bhyve interface. Everything else looks to me like internal kernel VMM state.)
Add a `propolis::vcpu::Diagnostics` type that captures and pretty-prints the register state of a vCPU. Log this state if a vCPU triple faults or (in propolis-server) if it raises a `Paging` or `InstEmul` unhandled exit, which are the failure modes seen in #300, #333, #340, and #755.
To avoid logging guest data outside of development environments, the `Diagnostics` type only captures data if the `dump-guest-state` feature is enabled. Enable this feature by default for propolis-server and -standalone, then modify the Omicron image build job to disable it in Omicron zone images for now (so that the feature will be disabled for customer VMs). In the fullness of time we may want a more flexible way to configure this option.
Finally, hand-implement `Debug` for the `InstEmul` exit context to redact this exit's instruction bytes if `dump-guest-state` is disabled.
This helps with #335. In a subsequent change I'd like to add the memory region near guest %rip to the diagnostics, but this requires some additional plumbing to translate from a GVA to a GPA before trying to read memory (there's an ioctl for this already; just needs some track-laying in Propolis).
Tests: added a diagnostics log line to the "reset" suspend exit so that I could force diagnostics to be captured on demand via bhyvectl, then checked both that they're logged with a binary built in the default configuration and elided in a binary built with the Omicron configuration.
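
For readers unfamiliar with the pattern, here is a rough sketch of what a feature-gated, hand-written `Debug` impl can look like. This is not the code in this PR: the `InstEmulExit` struct, its fields, the `render_inst_bytes` helper, and the `DUMP_GUEST_STATE` constant are all hypothetical names made up for illustration. The only detail taken from the description above is the `dump-guest-state` feature name.

```rust
use std::fmt;

// Hypothetical: true only when the crate is built with the
// `dump-guest-state` Cargo feature enabled.
#[allow(unexpected_cfgs)]
const DUMP_GUEST_STATE: bool = cfg!(feature = "dump-guest-state");

/// Illustrative stand-in for an instruction-emulation exit context.
/// The real Propolis type is not reproduced here.
pub struct InstEmulExit {
    pub inst_bytes: [u8; 15],
    pub len: u8,
}

/// Renders instruction bytes as hex, or a placeholder when dumping
/// guest state is not allowed.
pub fn render_inst_bytes(bytes: &[u8], dump: bool) -> String {
    if dump {
        format!("{:02x?}", bytes)
    } else {
        "<redacted>".to_string()
    }
}

impl fmt::Debug for InstEmulExit {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Clamp the length so a bogus value cannot cause a panic.
        let len = usize::from(self.len).min(self.inst_bytes.len());
        let rendered =
            render_inst_bytes(&self.inst_bytes[..len], DUMP_GUEST_STATE);
        f.debug_struct("InstEmulExit")
            .field("len", &self.len)
            // `format_args!` prints the string without surrounding quotes.
            .field("inst_bytes", &format_args!("{}", rendered))
            .finish()
    }
}
```

Taking the flag as a parameter in the helper keeps both branches testable from a single build, while the `Debug` impl itself follows whatever the build configuration selected.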