[FEATURE][ChromeOS] Expose DSP load tracker info for the visibility to ChromeOS #5521

johnylin76 · 2022-03-11T11:09:22Z

Is your feature request related to a problem? Please describe.
Currently we don't have much information for DSP in system logs attached in feedback report of audio issue. To debug the issue we need to re-build kernel with trace feature on and capture logs by sof-logger. It will be super helpful for debugging if DSP load tracker info could be obtained without those debugging steps. Moreover, we can setup an autotest item to track the DSP load performance for all supported products.

Describe the solution you'd like
I don't have ideas which should be the appropriate/simple way to do this. It can be either actively reported from DSP, or on demands of the host request.
For the active approach, it should be enough by reporting critical info only (e.g. DSP load is over the threshold) in order not to cause more burdens.

Describe alternatives you've considered
None

Additional context
None

lgirdwood · 2022-03-11T22:40:33Z

@plbossart @ujfalusi @ranj063 any thoughts ?

kv2019i · 2022-03-25T08:41:49Z

When this is implemented, feature needs to be documented to https://thesofproject.github.io/latest/developer_guides/debugability/perf-counters/index.html?highlight=performance

kv2019i · 2022-03-25T08:43:37Z

I'd say an IPC to on-demand ask this information from FW would be the most suitable solution. So we could e.g. have a debugfs interface (exposed by SOF linux driver) and when CRAS queries this (at the rate it chooses), SOF drivers sends an IPC and FW replies with current DSP load tracker data.

lgirdwood · 2023-05-11T20:01:11Z

@kv2019i we have mtrace and Zephyr oops now, anything else pending ?

btian1 · 2023-05-12T02:05:08Z

there is one line log may cover this topic:

[ 65.782163] ll_schedule: ll timer avg 2138, max 4647, overruns 0
[ 66.806163] ll_schedule: ll timer avg 2143, max 2924, overruns 0

simply check overruns can know whether dsp overloaded.

kv2019i · 2023-05-12T11:22:13Z

@lgirdwood @btian1 This is not done. The request is to have this data in user-space without enabling the FW logs (disabled in Chromebooks by default). So this requires new DSP+kernel infra. In IPC4, we have messages to report load measurements, but these are not implemented yet (on FW nor on kernel side). So this is still a valid ask and I'm not aware of any ongoing work to implement these (we discussed about the IPC4 messages in question with the cycles-budget (CPC) topic around manifest).

marc-hb · 2023-05-12T21:44:46Z

The request is to have this data in user-space without enabling the FW logs (disabled in Chromebooks by default). So this requires new DSP+kernel infra.

Why is this not just a log level configuration problem?
https://docs.zephyrproject.org/latest/services/logging/index.html

Logging API is highly configurable at compile time as well as at run time. [...] During compilation logs can be filtered out on module basis and severity level. [...] Logs can also be compiled in but filtered on run time using dedicate API. Run time filtering is independent for each backend and each source of log messages. Source of log messages can be a module or specific instance of the module.

The old sof-logger has similar capabilities.

So the idea to start another, new and entirely different logging framework sounds weird to me. @johnylin76, @andyross, @cujomalainey why are Chromebooks "throwing the baby with the bathwater" and disabling logs entirely as opposed to selectively keep the logging they need/want?

cujomalainey · 2023-05-15T21:39:13Z

@marc-hb at least on pre-ace systems I have been warned that enabling logging hijacks a DMA (which PNP is not happy about when it comes to power) and also it can block power states. Trust me I would love to have even just ERR/WARN on.

marc-hb · 2023-05-15T22:45:55Z

The old sof-logger uses either shared memory for high levels or DMA for all levels or both at the same time. It's configurable.

I'm not familiar with all Zephyr logging possibilities yet but I know for sure that the way it's configured does NOT use any DMA right now in our CI, it relies on shared memory "mtrace" only (which means it can overrun and lose statements; "you get what you paid for")

kv2019i · 2023-05-17T16:48:31Z

@cujomalainey @marc-hb That's a good question. "mtrace" Zephyr backend is friendlier to system flows (it doesn't take a DMA channel). but it's still not zero cost. There are writes to the SRAM that take DSP cycles, there's a logging thread that takes some memory and there's impact to code size (although we already have dictionary formats in Zephyr up and running limiting this). But in the end it will depend on the specific product. Our current assumption is still that at least not all release firmwares will have logging enabled (like has been the case with SOF previously). I.e. key debugging tools (like ability to extract a FW coredump), will be available even if you don't have logging enabled. Having a way to access DSP load information without logging, seems like a reasonable feature, similar to the coredump. I think we are running out of runway for v2.6 though, so this probably needs to be pushed out to v2.7.

marc-hb · 2023-05-17T18:22:11Z

There are writes to the SRAM that take DSP cycles, there's a logging thread that takes some memory and there's impact to code size (although we already have dictionary formats in Zephyr up and running limiting this).

Logging never comes for free; OK. Something Heisenberg something?

But how would a new technique focused on logging the DSP load only avoid all these costs? By being uber-specialized and optimized for reporting DSP load only? Then someone else realizes that "oh, wait, we also need the ability to log this other critical thing too in production builds... and that other one..."

Maybe I'm missing something but until fine-tuning the configuration of the existing logging framework has not been attempted and measured, starting a new brand new technique seems to fall in the "premature optimization" category.

I.e. key debugging tools (like ability to extract a FW coredump), will be available even if you don't have logging enabled

If that's already available without logging then it's different; it does really come "for free". Coredumps feel quite different from regular logging too; for instance you don't need a long-running thread streaming a series of coredumps :-)

cujomalainey · 2023-05-17T19:48:57Z

There are writes to the SRAM that take DSP cycles, there's a logging thread that takes some memory and there's impact to code size

Will happily pay that price if it means debugging time is cut because we are not having blind errors in the field

Our current assumption is still that at least not all release firmwares will have logging enabled (like has been the case with SOF previously)

i think the bigger issue was that we didn't enable it in the current which meant the resources were never allocated. So even though the firmware was ready the kernel was enabled.

I think the load data could even be host triggered for our purposes into the log, but we need to be able to run the log in production without PNP getting annoyed.

alex-cri · 2023-05-23T22:20:35Z

@kv2019i , agree with your comment on running out of runway for v2.6. Moving to v2.7.

lgirdwood · 2023-07-04T15:51:44Z

@kv2019i can we get the overall LL load per core in SRAM window ?
@ujfalusi will probably need a kernel debugFS to read it.

marc-hb · 2023-07-05T01:32:44Z

SOF drivers sends an IPC and FW replies with current DSP load tracker data.

@ujfalusi will probably need a kernel debugFS to read it.

These seem to imply a "poll" style of user interface where a process must run and regularly poll the firmware.

The description and some other comments seem to lean more towards a "push" style of interface where the firmware regularly logs the data somehow and somewhere (a different question). In the latter case, the interval is controlled by the firmware so it is more predictable.

@cujomalainey (who ~~filed this issue~~ has been requesting this feature) do you have a preference?

lgirdwood · 2023-07-05T10:19:43Z

SOF drivers sends an IPC and FW replies with current DSP load tracker data.

@ujfalusi will probably need a kernel debugFS to read it.

These seem to imply a "poll" style of user interface where a process must run and regularly poll the firmware.

The description and some other comments seem to lean more towards a "push" style of interface where the firmware regularly logs the data somehow and somewhere (a different question). In the latter case, the interval is controlled by the firmware so it is more predictable.

SRAM is FW push. DebugFS can be blocking read() for an update every 1sec using existing FW MCPS infra.

cujomalainey · 2023-07-05T20:58:57Z

No preference, @johnylin76 thoughts?

lgirdwood · 2024-01-08T12:48:42Z

@jsarha for v2.9 we need something initially that can provide data in an unused RO SRAM Window (for Intel - other Vendors will choose their memory locations) that can be read by debugFS as binary data with userspace parsing i.e. a Python script would read and parse the debugFS and display the contents. Later on it could use other transports like probes.

The SRAM Windows and FW data collection will be disabled by default - it will be enabled at runtime via a new IPC message. i.e. the IPC to enable can be sent by kernel when debugFS is open() and IPC to disable when debugFS is close().
There will be a second debugFS that allows for the FW data collection to be tuned to the needs of the developer. Each data payload will have an ID (iniut32_t) and if host echos the ID to debug FS then that sends an IPC (set large config) to enable that feature i.e.

echo 123, 789, -456 > /sys/debug/blah/sof_blah

Will enable data items with ID 123, 789 and disable ID 456 (negative).

The FW data wont be parsed by the kernel (except for the total size & magic). It will just be passed through. It should have a generic header that is easy to parse e.g.

struct payload_hdr {
    uint32_t total_size; // total size of the whole payload in bytes
    uint32_t hdr_size; // size of this header only in bytes
    uint32_t magic; // used to identify valid data
    uint32_t abi;  // ABI version, can be used by tool to warn users if too old.
    uint64_t tstamp; // aligned with host epoch.
};

Following this will be the data structures/containers. They should all lead with a standard header that the tool can use
e.g.

struct elem_hdr {
    uint32_t size; // size of this elem in bytes;
    uint32_t id; // ID of this elem type
};
/* elem body follows here */

This way the Python tool can read and parse, it should ignore any IDs it does not know about.

kv2019i · 2024-03-04T13:11:14Z

Stable-v2.9 branched, this didn't make the cut, bumping to 2.10. There is progress with both Zephyr thread analyzer and with a more generic telemetry interface, but neither is ready for merge for SOF2.9.

lgirdwood · 2024-04-25T09:39:12Z

@jsarha can you give an update of any PRs and output to date. Thanks !

jsarha · 2024-04-26T11:17:56Z

I have some couple of PRs to address this issue. First there is Intel ADSP fixes for Zephyr Thread Analyzer:

Thread analyzer adsp fix zephyrproject-rtos/zephyr#71408

And then couple two other PRs, depending on the first one, that still need some polishing, but which are functional and working at the moment:

The telemetry2 and thread info FW code, and the python client to decode the associated debugfs.

Telemetry2 thread info #9084

Simple debugfs code to Linux sof driver to map the telemetry2 window to a debugfs file.

Telemetry2 debugfs linux#4956

jsarha · 2024-04-26T14:40:48Z

Here is example output of the current implementation:

python3 thread-analyzer.py 
Open /sys/kernel/debug/sof/telemetry2 failed '[Errno 14] Bad address'
core_count: 3
Core 0 load 0%
        0x401184a8           cpu 0.0% stack 2.0%
        0x401183e0           cpu 0.0% stack 2.0%
        1 thread info        cpu 0.0% stack 8.2%
        0 thread info        cpu 0.0% stack 47.1%
        edf_workq            cpu 0.0% stack 8.2%
        sysworkq             cpu 0.0% stack 31.8%
        logging              cpu 0.0% stack 19.6%
        idle 00              cpu 99.6% stack 33.7%
Core 2 load 0%
        idle 02              cpu 99.6% stack 27.1%
        0x40118570           cpu 0.0% stack 47.1%
        2 thread info        cpu 0.0% stack 47.1%
Core 0 load 0%
        0x401184a8           cpu 0.0% stack 2.0%
        0x401183e0           cpu 0.0% stack 2.0%
        1 thread info        cpu 0.0% stack 8.2%
        0 thread info        cpu 0.0% stack 47.1%
        edf_workq            cpu 0.0% stack 8.2%
        sysworkq             cpu 0.0% stack 31.8%
        logging              cpu 0.0% stack 19.6%
        idle 00              cpu 99.6% stack 33.7%
Core 2 load 6%
        ll_thread2           cpu 0.0% stack 18.8%
        idle 02              cpu 93.3% stack 28.6%
        0x40118570           cpu 0.4% stack 47.1%
        2 thread info        cpu 0.0% stack 47.1%
Core 0 load 0%
        0x401184a8           cpu 0.0% stack 2.0%
        0x401183e0           cpu 0.0% stack 2.0%
        1 thread info        cpu 0.0% stack 8.2%
        0 thread info        cpu 0.0% stack 47.1%
        edf_workq            cpu 0.0% stack 8.2%
        sysworkq             cpu 0.0% stack 31.8%
        logging              cpu 0.0% stack 19.6%
        idle 00              cpu 99.6% stack 33.7%
Core 2 load 5%
        ll_thread2           cpu 5.5% stack 18.8%
        idle 02              cpu 94.1% stack 28.6%
        0x40118570           cpu 0.0% stack 47.1%
        2 thread info        cpu 0.0% stack 47.1%
Core 0 load 0%
        0x401184a8           cpu 0.0% stack 2.0%
        0x401183e0           cpu 0.0% stack 2.0%
        1 thread info        cpu 0.0% stack 8.2%
        0 thread info        cpu 0.0% stack 47.1%
        edf_workq            cpu 0.0% stack 8.2%
        sysworkq             cpu 0.0% stack 31.8%
        logging              cpu 0.0% stack 19.6%
        idle 00              cpu 99.6% stack 33.7%
Core 2 load 5%
        ll_thread2           cpu 5.5% stack 18.8%
        idle 02              cpu 94.1% stack 28.6%
        0x40118570           cpu 0.0% stack 47.1%
        2 thread info        cpu 0.0% stack 47.1%
Core 0 load 0%
        0x401184a8           cpu 0.0% stack 2.0%
        0x401183e0           cpu 0.0% stack 2.0%
        1 thread info        cpu 0.0% stack 8.2%
        0 thread info        cpu 0.0% stack 47.1%
        edf_workq            cpu 0.0% stack 8.2%
        sysworkq             cpu 0.0% stack 31.8%
        logging              cpu 0.0% stack 19.6%
        idle 00              cpu 99.6% stack 33.7%
Core 2 load 5%
        ll_thread2           cpu 5.5% stack 18.8%
        idle 02              cpu 94.1% stack 28.6%
        0x40118570           cpu 0.0% stack 47.1%
        2 thread info        cpu 0.0% stack 47.1%
Core 0 load 1%
        0x401184a8           cpu 0.0% stack 2.0%
        0x401183e0           cpu 0.0% stack 2.0%
        1 thread info        cpu 0.0% stack 8.2%
        0 thread info        cpu 0.0% stack 47.1%
        edf_workq            cpu 0.0% stack 8.2%
        sysworkq             cpu 0.0% stack 31.8%
        logging              cpu 0.8% stack 19.6%
        idle 00              cpu 98.4% stack 33.7%

plbossart · 2024-04-30T14:15:17Z

The SRAM Windows and FW data collection will be disabled by default - it will be enabled at runtime via a new IPC message. i.e. the IPC to enable can be sent by kernel when debugFS is open() and IPC to disable when debugFS is close().

@lgirdwood What if you want this FW data collection to start immediately without waiting for user-space to enable this with an IPC?

Take the example of the trace, it's on immediately. userspace can select to read it or not.

lgirdwood · 2024-04-30T15:42:27Z

The SRAM Windows and FW data collection will be disabled by default - it will be enabled at runtime via a new IPC message. i.e. the IPC to enable can be sent by kernel when debugFS is open() and IPC to disable when debugFS is close().

@lgirdwood What if you want this FW data collection to start immediately without waiting for user-space to enable this with an IPC?

Take the example of the trace, it's on immediately. userspace can select to read it or not.

Kconfig initially, but there could be later updates that allow developers to switch on/of via other methods.

plbossart · 2024-04-30T15:52:27Z

@lgirdwood if it's a Kconfig, then why would you need an IPC to turn the data collection on?

marc-hb · 2024-04-30T16:13:42Z

I don't know about this case specifically but you can have a default value in Kconfig for boot time and the ability to change it again at runtime, it's not unheard of.

(you can also have the Kconfig ability to strip the code and save space but that's a different Kconfig)

plbossart · 2024-04-30T16:22:25Z

indeed, and my issue is with the default value as false requiring an IPC to start data collection. That's a sure way to introduce non-reproducible behavior between runs since we can't control when the IPC would be sent.

marc-hb · 2024-04-30T16:56:54Z

Again, not sure about this particular case but a typical configuration for this type of features in general is:

OFF by default in production builds
ON by default in debug builds thanks to some -DCONFIG_xxx override in some sof/app/debug_overlay.conf
Run-time knob available in all builds for more flexibility. Not used by default in automated testing.

Problem solved?

plbossart · 2024-04-30T17:38:04Z

how would we know at the kernel level if the feature is on or off by default? There's a need to report this capability...

johnylin76 · 2024-04-30T18:15:20Z

I second @marc-hb

I'm so glad to see this feature being fulfilled. Since it's a developer feature anyway, Kconfig makes more senses to me as well.

how would we know at the kernel level if the feature is on or off by default? There's a need to report this capability...

My two cents is that kernel doesn't really to know whether this feature is supported or not, as long as no false alarm is reported by firmware while it does not support.
In another aspect, kernel can never consider the feature is "on" by default because there is always the mismatch between two ends.

plbossart · 2024-04-30T18:36:37Z

in my experience, collecting data after a random time isn't great for early boot issues.
If you don't start from T0, then your results will depend on when the host driver sends that IPC, and that depends on a user-space program which is even worse.
Also there's a conceptual difference between COLLECTING data and REPORTING data. The first one should not be dependent on anything.

lgirdwood · 2024-05-01T11:54:49Z

All, lets take this step by step - we can incrementally add to the config/control when the basics are upstream.
Today, we just need the Kconfig that advertises (enabled) in the extended manifest.
Next steps can be IPCs for runtime config/control, different backends (like probes) and so on, the important part is we need the basics 1st.

marc-hb · 2024-05-01T14:13:31Z

I think everyone is on board with incremental, Continuous Integration but to reduce churn, back-and-forth and even going around in circles sometimes, it's good to have some high level idea of the final design and UI a.k.a. "requirements"? They don't have to be set in stone of course...

"Fail to plan, plan to fail" - my 2 cents.

lgirdwood · 2024-05-01T16:08:15Z

I think everyone is on board with incremental, Continuous Integration but to reduce churn, back-and-forth and even going around in circles sometimes, it's good to have some high level idea of the final design and UI a.k.a. "requirements"?

See above re UI #5521 (comment) requirement is to track DSP load as per subject.

marc-hb · 2024-05-01T16:11:42Z

I was including the user interface in the "requirements" but I admit none of this is an "exact science".

kv2019i · 2024-06-05T09:12:23Z

There are two solutions now to expose DSP load over share memory:

IPC4 telemetry blob, provides SOF/app level DSP load -> merged and available but not supported by SOF Linux driver
telemetry2 developed by @jsarha , provides Zephyr level CPU/core load -- this is currently waiting for Zephyr dependencies Thread analyzer adsp fix zephyrproject-rtos/zephyr#71408 so trending not make the v2.10 cut

kv2019i · 2024-09-12T08:43:59Z

I think we are ready now v2.11 work with #9398 and #9451 . @jsarha can you add a summary of what is available now and how to use the feature?

@johnylin76 if you can review and see if this meets the original feature request needs, or do we something left to do under this ticket (we can then either close, move to 2.12 and/or file follow-up new tickets).

johnylin76 · 2024-09-12T09:36:51Z

I think we are ready now v2.11 work with #9398 and #9451 . @jsarha can you add a summary of what is available now and how to use the feature?

@johnylin76 if you can review and see if this meets the original feature request needs, or do we something left to do under this ticket (we can then either close, move to 2.12 and/or file follow-up new tickets).

LGTM. Thanks for the great work.

johnylin76 added the enhancement New feature or request label Mar 11, 2022

lgirdwood added this to the v2.2 milestone Apr 6, 2022

kv2019i added the performance Issues related to code performance and speed optimizations. label Jun 16, 2022

marc-hb mentioned this issue Sep 26, 2022

check-playback: add check for FW performance thesofproject/sof-test#959

Closed

kv2019i modified the milestones: v2.2, v2.6 Feb 17, 2023

kv2019i mentioned this issue May 12, 2023

[FEATURE] Checkers with limits of DSP load tracker #5518

Closed

alex-cri modified the milestones: v2.6, v2.7 May 23, 2023

lgirdwood assigned kv2019i Jul 4, 2023

marc-hb mentioned this issue Jun 30, 2023

[FEATURE] support dictionary formats with "mtrace" thesofproject/sof-test#1046

Open

aiChaoSONG mentioned this issue Oct 5, 2023

[BUG] delayed mtrace collection thesofproject/linux#4618

Open

marc-hb added the area:SOF logging label Jul 5, 2023

kv2019i assigned jsarha Dec 18, 2023

kv2019i modified the milestones: v2.9, v2.10 Mar 4, 2024

jsarha mentioned this issue Apr 30, 2024

Telemetry2 debugfs thesofproject/linux#4956

Closed

cujomalainey mentioned this issue May 14, 2024

[FEATURE] Ability to use Zephyr shell with SOF #9081

Closed

kv2019i removed their assignment Jun 5, 2024

lgirdwood modified the milestones: v2.10, v2.11 Jun 10, 2024

kv2019i closed this as completed Nov 18, 2024

[FEATURE][ChromeOS] Expose DSP load tracker info for the visibility to ChromeOS #5521

[FEATURE][ChromeOS] Expose DSP load tracker info for the visibility to ChromeOS #5521

Comments

johnylin76 commented Mar 11, 2022

lgirdwood commented Mar 11, 2022

kv2019i commented Mar 25, 2022

kv2019i commented Mar 25, 2022

lgirdwood commented May 11, 2023

btian1 commented May 12, 2023

kv2019i commented May 12, 2023

marc-hb commented May 12, 2023 • edited Loading

cujomalainey commented May 15, 2023

marc-hb commented May 15, 2023 • edited Loading

kv2019i commented May 17, 2023

marc-hb commented May 17, 2023 • edited Loading

cujomalainey commented May 17, 2023 • edited Loading

alex-cri commented May 23, 2023

lgirdwood commented Jul 4, 2023

marc-hb commented Jul 5, 2023 • edited Loading

lgirdwood commented Jul 5, 2023

cujomalainey commented Jul 5, 2023

lgirdwood commented Jan 8, 2024

kv2019i commented Mar 4, 2024

lgirdwood commented Apr 25, 2024

jsarha commented Apr 26, 2024 • edited by marc-hb Loading

jsarha commented Apr 26, 2024 • edited Loading

plbossart commented Apr 30, 2024

lgirdwood commented Apr 30, 2024

plbossart commented Apr 30, 2024

marc-hb commented Apr 30, 2024 • edited Loading

plbossart commented Apr 30, 2024 • edited Loading

marc-hb commented Apr 30, 2024 • edited Loading

plbossart commented Apr 30, 2024

johnylin76 commented Apr 30, 2024

plbossart commented Apr 30, 2024

lgirdwood commented May 1, 2024

marc-hb commented May 1, 2024 • edited Loading

lgirdwood commented May 1, 2024

marc-hb commented May 1, 2024

kv2019i commented Jun 5, 2024 • edited Loading

kv2019i commented Sep 12, 2024

johnylin76 commented Sep 12, 2024

marc-hb commented May 12, 2023 •

edited

Loading

marc-hb commented May 15, 2023 •

edited

Loading

marc-hb commented May 17, 2023 •

edited

Loading

cujomalainey commented May 17, 2023 •

edited

Loading

marc-hb commented Jul 5, 2023 •

edited

Loading

jsarha commented Apr 26, 2024 •

edited by marc-hb

Loading

jsarha commented Apr 26, 2024 •

edited

Loading

marc-hb commented Apr 30, 2024 •

edited

Loading

plbossart commented Apr 30, 2024 •

edited

Loading

marc-hb commented Apr 30, 2024 •

edited

Loading

marc-hb commented May 1, 2024 •

edited

Loading

kv2019i commented Jun 5, 2024 •

edited

Loading