[FEATURE][ChromeOS] Expose DSP load tracker info for the visibility to ChromeOS #5521
@plbossart @ujfalusi @ranj063 any thoughts? |
When this is implemented, the feature needs to be documented at https://thesofproject.github.io/latest/developer_guides/debugability/perf-counters/index.html?highlight=performance |
I'd say an IPC to ask the FW for this information on demand would be the most suitable solution. So we could e.g. have a debugfs interface (exposed by the SOF Linux driver), and when CRAS queries it (at whatever rate it chooses), the SOF driver sends an IPC and the FW replies with the current DSP load tracker data. |
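A minimal sketch of how that poll flow could look from user-space, assuming a hypothetical debugfs node (the path, and the idea that each read triggers the IPC, are illustrative only; no such interface existed when this was written):

```python
#!/usr/bin/env python3
"""Poll-style sketch: each read() of the (hypothetical) debugfs node
would make the SOF driver send an IPC; the FW reply is returned as
the file contents. The consumer (e.g. CRAS) picks the polling rate."""
import time

DEBUGFS_NODE = "/sys/kernel/debug/sof/dsp_load"  # hypothetical path
POLL_INTERVAL_S = 1.0  # chosen by the consumer, not by the FW

while True:
    with open(DEBUGFS_NODE, "rb") as f:
        payload = f.read()  # driver sends IPC, FW replies
    print(f"got {len(payload)} bytes of DSP load tracker data")
    time.sleep(POLL_INTERVAL_S)
```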
@kv2019i we have mtrace and Zephyr oops now, anything else pending? |
There is a one-line log that may cover this topic: [ 65.782163] ll_schedule: ll timer avg 2138, max 4647, overruns 0. Simply checking overruns tells you whether the DSP is overloaded. |
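For illustration, a trivial check along those lines could scan the kernel log for that line and flag overruns. A sketch assuming the exact format quoted above lands in `dmesg`; this only works when the FW log output is enabled in the first place:

```python
#!/usr/bin/env python3
"""Sketch: scan dmesg output for the ll_schedule line quoted above
and treat a non-zero overruns counter as a sign of DSP overload."""
import re
import subprocess

LL_LINE = re.compile(r"ll_schedule: ll timer avg (\d+), max (\d+), overruns (\d+)")

dmesg = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
for m in LL_LINE.finditer(dmesg):
    avg, peak, overruns = map(int, m.groups())
    if overruns > 0:
        print(f"DSP overloaded: avg={avg} max={peak} overruns={overruns}")
```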
@lgirdwood @btian1 This is not done. The request is to have this data in user-space without enabling the FW logs (disabled on Chromebooks by default). So this requires new DSP+kernel infra. In IPC4, we have messages to report load measurements, but these are not implemented yet (neither on the FW nor on the kernel side). So this is still a valid ask and I'm not aware of any ongoing work to implement these (we discussed the IPC4 messages in question with the cycles-budget (CPC) topic around the manifest). |
Why is this not just a log level configuration problem?
The old sof-logger has similar capabilities, so the idea of starting another, entirely different logging framework sounds weird to me. @johnylin76, @andyross, @cujomalainey why are Chromebooks "throwing the baby out with the bathwater" and disabling logs entirely, as opposed to selectively keeping the logging they need/want? |
@marc-hb at least on pre-ACE systems I have been warned that enabling logging hijacks a DMA channel (which PNP is not happy about when it comes to power), and it can also block power states. Trust me, I would love to have even just ERR/WARN on. |
The old sof-logger uses either shared memory for high levels, or DMA for all levels, or both at the same time. It's configurable. I'm not familiar with all the Zephyr logging possibilities yet, but I know for sure that the way it's configured right now in our CI does NOT use any DMA; it relies on shared-memory "mtrace" only (which means it can overrun and lose statements; "you get what you paid for"). |
@cujomalainey @marc-hb That's a good question. The "mtrace" Zephyr backend is friendlier to system flows (it doesn't take a DMA channel), but it's still not zero cost. There are writes to the SRAM that take DSP cycles, there's a logging thread that takes some memory, and there's an impact on code size (although we already have dictionary formats in Zephyr up and running, limiting this). But in the end it will depend on the specific product. Our current assumption is still that at least not all release firmwares will have logging enabled (as has been the case with SOF previously). I.e. key debugging tools (like the ability to extract a FW coredump) will be available even if you don't have logging enabled. Having a way to access DSP load information without logging seems like a reasonable feature, similar to the coredump. I think we are running out of runway for v2.6 though, so this probably needs to be pushed out to v2.7. |
Logging never comes for free; OK. Something Heisenberg something? But how would a new technique focused on logging only the DSP load avoid all these costs? By being uber-specialized and optimized for reporting DSP load only? Then someone else realizes "oh wait, we also need the ability to log this other critical thing in production builds too... and that other one..." Maybe I'm missing something, but until fine-tuning the configuration of the existing logging framework has been attempted and measured, starting a brand new technique seems to fall into the "premature optimization" category.
If that's already available without logging then it's different; it really does come "for free". Coredumps feel quite different from regular logging too; for instance you don't need a long-running thread streaming a series of coredumps :-) |
Will happily pay that price if it means debugging time is cut because we are no longer blind to errors in the field.
I think the bigger issue was that we didn't enable it in the current kernel, which meant the resources were never allocated. So even though the firmware was ready, the kernel side wasn't enabled. I think the load data could even be host-triggered into the log for our purposes, but we need to be able to run the log in production without PNP getting annoyed. |
@kv2019i , agree with your comment on running out of runway for v2.6. Moving to v2.7. |
These seem to imply a "poll" style of user interface where a process must run and regularly poll the firmware. The description and some other comments seem to lean more towards a "push" style of interface where the firmware regularly logs the data somehow and somewhere (a different question). In the latter case, the interval is controlled by the firmware so it is more predictable. @cujomalainey (who |
SRAM is FW push. DebugFS can be a blocking read() for an update every 1 sec using the existing FW MCPS infra. |
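In other words, the blocking-read model could look like this from user-space (a sketch only; the node name is made up, and the roughly 1 s cadence comes from the existing FW MCPS reporting mentioned above):

```python
#!/usr/bin/env python3
"""Blocking-read sketch: the driver parks the reader until the next
FW MCPS update (~1 s), so the loop needs no explicit sleep."""

with open("/sys/kernel/debug/sof/mcps", "rb") as f:  # hypothetical node
    while True:
        update = f.read(4096)  # blocks until the FW pushes the next sample
        if not update:
            break
        print(f"update: {len(update)} bytes")
```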
No preference, @johnylin76 thoughts? |
@jsarha for v2.9 we need something initially that can provide data in an unused RO SRAM window (for Intel - other vendors will choose their own memory locations) that can be read via debugFS as binary data, with userspace parsing, i.e. a Python script would read and parse the debugFS and display the contents. Later on it could use other transports like probes.
This will enable data items with IDs 123 and 789, and disable ID 456 (negative).
```c
struct payload_hdr {
	uint32_t total_size; // total size of the whole payload in bytes
	uint32_t hdr_size;   // size of this header only in bytes
	uint32_t magic;      // used to identify valid data
	uint32_t abi;        // ABI version, can be used by tool to warn users if too old
	uint64_t tstamp;     // aligned with host epoch
};
```

Following this will be the data structures/containers. They should all lead with a standard header that the tool can use:

```c
struct elem_hdr {
	uint32_t size; // size of this elem in bytes
	uint32_t id;   // ID of this elem type
};
/* elem body follows here */
```

This way the Python tool can read and parse; it should ignore any IDs it does not know about. |
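For illustration, a minimal sketch of such a Python reader, assuming the header layout above. The magic value, the known IDs (reusing the 123/789 example above) and the debugfs path are placeholders, not final definitions:

```python
#!/usr/bin/env python3
"""Walk the payload described above: validate payload_hdr, then
iterate the elem_hdr TLV chain, skipping any unknown elem IDs."""
import struct

PAYLOAD_HDR = struct.Struct("<IIIIQ")  # total_size, hdr_size, magic, abi, tstamp
ELEM_HDR = struct.Struct("<II")        # size, id
EXPECTED_MAGIC = 0x00C0FFEE            # placeholder, not a real magic value
KNOWN_IDS = {123, 789}                 # elem types this tool version understands

def parse(blob: bytes):
    total_size, hdr_size, magic, abi, tstamp = PAYLOAD_HDR.unpack_from(blob, 0)
    if magic != EXPECTED_MAGIC:
        raise ValueError("invalid payload magic")
    offset = hdr_size
    while offset + ELEM_HDR.size <= total_size:
        size, elem_id = ELEM_HDR.unpack_from(blob, offset)
        if size < ELEM_HDR.size:  # malformed elem: stop rather than loop forever
            break
        body = blob[offset + ELEM_HDR.size:offset + size]
        if elem_id in KNOWN_IDS:
            print(f"elem id={elem_id}: {len(body)} bytes (tstamp={tstamp})")
        # unknown IDs are silently skipped, as the format requires
        offset += size

if __name__ == "__main__":
    with open("/sys/kernel/debug/sof/telemetry", "rb") as f:  # hypothetical path
        parse(f.read())
```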
Stable-v2.9 has branched and this didn't make the cut, so bumping to 2.10. There is progress with both the Zephyr thread analyzer and a more generic telemetry interface, but neither is ready to merge for SOF 2.9. |
@jsarha can you give an update on any PRs and output to date. Thanks! |
I have a couple of PRs to address this issue. First there are Intel ADSP fixes for the Zephyr Thread Analyzer. And then two other PRs, depending on the first one, that still need some polishing but are functional and working at the moment: the telemetry2 and thread-info FW code with the Python client to decode the associated debugfs, and simple debugfs code in the Linux SOF driver to map the telemetry2 window to a debugfs file. |
Here is example output of the current implementation: |
@lgirdwood What if you want this FW data collection to start immediately, without waiting for user-space to enable it with an IPC? Take the example of the trace: it's on immediately, and userspace can select whether to read it or not. |
Kconfig initially, but there could be later updates that allow developers to switch it on/off via other methods. |
@lgirdwood if it's a Kconfig, then why would you need an IPC to turn the data collection on? |
I don't know about this case specifically, but you can have a default value in Kconfig for boot time and the ability to change it again at runtime; it's not unheard of. (You can also have the Kconfig ability to strip the code and save space, but that's a different Kconfig.) |
Indeed, and my issue is with a default value of false requiring an IPC to start data collection. That's a sure way to introduce non-reproducible behavior between runs, since we can't control when the IPC would be sent. |
Again, not sure about this particular case, but a typical configuration for this type of feature in general is: a Kconfig option to strip or keep the code, a Kconfig default that controls whether collection starts at boot, and a runtime switch to change it afterwards.
Problem solved? |
How would we know at the kernel level if the feature is on or off by default? There's a need to report this capability... |
I second @marc-hb. I'm so glad to see this feature being fulfilled. Since it's a developer feature anyway, Kconfig makes more sense to me as well.
My two cents is that the kernel doesn't really need to know whether this feature is supported or not, as long as no false alarm is reported by the firmware when it isn't. |
In my experience, collecting data after a random time isn't great for early-boot issues. |
All, let's take this step by step - we can incrementally add to the config/control once the basics are upstream. |
I think everyone is on board with incremental, Continuous-Integration-style development, but to reduce churn, back-and-forth, and even going around in circles sometimes, it's good to have some high-level idea of the final design and UI, a.k.a. "requirements"? They don't have to be set in stone of course... "Fail to plan, plan to fail" - my 2 cents. |
See above re the UI in #5521 (comment); the requirement is to track DSP load, as per the subject. |
I was including the user interface in the "requirements" but I admit none of this is an "exact science". |
There are two solutions now to expose DSP load over shared memory: the Zephyr thread analyzer and the more generic telemetry2 interface. |
I think we are now ready for v2.11 work with #9398 and #9451. @jsarha can you add a summary of what is available now and how to use the feature? @johnylin76 if you can review and see whether this meets the original feature request's needs, or whether we have something left to do under this ticket (we can then either close, move to 2.12, and/or file follow-up new tickets). |
LGTM. Thanks for the great work. |
Is your feature request related to a problem? Please describe.
Currently we don't have much information about the DSP in the system logs attached to feedback reports of audio issues. To debug an issue we need to rebuild the kernel with the trace feature on and capture logs with sof-logger. It would be super helpful for debugging if the DSP load tracker info could be obtained without those debugging steps. Moreover, we could set up an autotest item to track DSP load performance for all supported products.
Describe the solution you'd like
I don't have a firm idea of the most appropriate/simple way to do this. It could be either actively reported by the DSP, or provided on demand at the host's request.
For the active approach, it should be enough to report critical info only (e.g. when the DSP load is over a threshold) in order not to add more burden.
Describe alternatives you've considered
None
Additional context
None