Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

soc/intel_adsp: ipc: initialize semaphore in driver init #72158

Merged
merged 1 commit into from
May 6, 2024

Conversation

kv2019i
Copy link
Collaborator

@kv2019i kv2019i commented Apr 30, 2024

The ipc driver device data (struct intel_adsp_ipc_data) contains a semaphore. Upon device init, the device data is zeroed out. This is safe for other fields, but the semaphore should be properly initialized before use.

This lack of initialization leads to a system crash when CONFIG_POLL is enabled (e.g. to enable CONFIG_SHELL), IPC driver handles an interrupt and executes k_sem_give() on a uninitialized semaphore object. This will eventually lead to null deferences in z_handle_obj_poll_events().

The ipc driver device data (struct intel_adsp_ipc_data) contains a
semaphore. Upon device init, the device data is zeroed out. This is safe
for other fields, but the semaphore should be properly initialized
before use.

This lack of initialization leads to a system crash when CONFIG_POLL is
enabled (e.g. to enable CONFIG_SHELL), IPC driver handles an interrupt
and executes k_sem_give() on a uninitialized semaphore object. This will
eventually lead to null dereference in z_handle_obj_poll_events().

Signed-off-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
@kv2019i kv2019i force-pushed the 202404-adsp-ipc-fix branch from e64bb96 to f8b1cdd Compare April 30, 2024 12:30
Copy link
Collaborator

@tmleman tmleman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume the change was tested in SOF CI.

@kv2019i
Copy link
Collaborator Author

kv2019i commented Apr 30, 2024

@tmleman wrote:

I assume the change was tested in SOF CI.

Thanks, that's a good reminder! So not yet. I've tested this locally on cavs25 and ace15 platforms, but let me do a full test round as well just to be sure:
thesofproject/sof#9089
(UPDATE: one unexpected fail seen on ace15 -> https://sof-ci.01.org/sofpr/PR9089/build4389/devicetest/index.html?model=MTLP_RVP_NOCODEC&testcase=multiple-pause-resume-50 -- marking with DNM until investigated)

@kv2019i kv2019i added the DNM This PR should not be merged (Do Not Merge) label Apr 30, 2024
Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about this one. I had to go back and check (it's been a while) but the architecture of this code is that the semaphore should be initialized on message send, is taken in a later message_sync() call, and is given only in the resulting ISR.

If someone is touching the semaphore in other contexts that almost certainly means that there's a race somewhere. Initializing the semaphore might prevent a crash but it's not going to save you from the resulting deadlock. Are there more details about what is going on?

@marc-hb
Copy link
Collaborator

marc-hb commented May 1, 2024

This lack of initialization leads to a system crash when CONFIG_POLL is enabled (e.g. to enable CONFIG_SHELL),

Based on other communications from @kv2019i , I suspect CONFIG_SHELL is the key element to reproduce a crash.

@andyross
Copy link
Contributor

andyross commented May 1, 2024

Ah, indeed, I'd seen that discussion go by SOF-side. So I'm guessing what's happening is that the IPC driver's semaphore is being added to a k_poll() list?

Yeah, that won't work with the driver as it stands. k_poll() doesn't understand the idea of a semaphore being reset while it's in progress.

If we want to do that we should update the driver to use the semaphore more conventionally, merely doing a k_sem_give() at send time and not an init, and increasing the max value to handle the resulting pipelining if someone races on the next send (that worry was why it was originally done like this, so that you recover after one cycle if the code blasts a bunch of racing sends).

Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-1 to capture my assumption that this driver is k_poll-incompatible and needs more work.

@kv2019i
Copy link
Collaborator Author

kv2019i commented May 2, 2024

@andyross wrote:

Ah, indeed, I'd seen that discussion go by SOF-side. So I'm guessing what's happening is that the IPC driver's semaphore is being added to a k_poll() list?

There's no linkage between the IPC driver and Zephyr poll implementation directly.

The problem happens with following sequence:

  • SOF sends initial FW_READY directly from its boot code (this probably should be changed to use the driver)
  • IPC driver's first interrupt is caused by host replying to the FW_READY IPC message
  • z_intel_adsp_ipc_isr() handles the interrupt, driver calls k_sem_give() on a semaphore object that is zeroed out with memset() in init
  • if CONFIG_POLL is enabled in the Zephyr build, z_impl_k_sem_give() will call z_handle_obj_poll_events()
    - in normal SOF builds CONFIG_POLL is not enabled, but CONFIG_SHELL enables it indirectly as a dependency
  • z_handle_obj_poll_events() will do a sys_dlist_get() on an uninitialized list object
  • with uninitialized sys_dlist object, sys_dlist_is_empty() will return false (list->head is NULL)
  • core crashes doing sys_dlist_remove() (trying to remove an item from uninitialized list object)

The SOF FW_READY should probably be changed so it sends the FW_READY via the driver, but I see no reason to leave a possibility to execute above code on uninitialized semaphore object. If the driver is initialized, it should be ready to handle interrupts. If the host is faulty (or modified with bad intentions) and sends unexpected IPC on boot, you can potentially hit this very hard to debug case (Zephyr fails silently here, no exception raised, in ISR context with spinlock held). And given I triggered this by only enabling CONFIG_SHELL=y in the build, not even running any shell code, it took a while to figure out what is happening and how linking (unused!) code can break DSP boot so fundamentally. But alas, enabling shell also impacted the Zephyr semaphore implementation, but it took a while for me to figure this out. I'd hope to spare other poor souls from this experience.

CONFIG_POLL is only used for shell's internal implementation (for shell signals).

Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, OK. To rephrase my understanding: the first IPC ISR is the result of a boot-time command sent outside this driver. So there's no active command and the semaphore is garbage. In that case I completely agree, just initialize it ahead of time and let the driver wake up a command that isn't waiting. Then the rest on the first "real" command will recover state. And I guess k_sem_give() on a zero-filled semaphore technically doesn't crash, only k_poll's augmented code (to catch the wakeup) does.

OK, I'm sold.

@andyross
Copy link
Contributor

andyross commented May 2, 2024

Though actually if you do respin for any reason, maybe augmenting the commit message to be more explicit about the case would be nice: "The driver can receive ISRs from IPC activity performed before and during Zephyr boot, so the semaphore must always be initialized", something like that.

@kv2019i
Copy link
Collaborator Author

kv2019i commented May 3, 2024

Extra manual test run with thesofproject/sof#9089 and now got a clean run with PR testing. Removing the DNM.

@kv2019i kv2019i removed the DNM This PR should not be merged (Do Not Merge) label May 3, 2024
@fabiobaltieri fabiobaltieri merged commit 2dd6486 into zephyrproject-rtos:main May 6, 2024
23 of 24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
platform: Intel ADSP Intel Audio platforms
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants