Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] ZEPHYR FATAL ERROR 4: ASSERTION FAIL [aligned_addr == addr] intel_adsp_hda.h:163 #7191

Closed
keqiaozhang opened this issue Mar 1, 2023 · 52 comments
Assignees
Labels
bug Something isn't working as expected DSP panic DSP panic observed IPC timeout IPC timeout observed IPC4 Issues observed with IPC4 (same IPC as Windows) MTL Applies to Meteor Lake platform P1 Blocker bugs or important features TGL Applies to Tiger Lake Zephyr Issues only observed with Zephyr integrated
Milestone

Comments

@keqiaozhang
Copy link
Collaborator

keqiaozhang commented Mar 1, 2023

Describe the bug
Observed this issue in CI daily test. From the kernel message, there's a IPC timed out 0x13020004, but mtrace recorded a fatal exception and caused zephyr kernel panic on CPU0.

DMESG

[ 3556.753931] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc tx      : 0x13020004|0x0: GLB_SET_PIPELINE_STATE
[ 3556.788076] kernel: snd_sof:sof_pcm_open: sof-audio-pci-intel-tgl 0000:00:1f.3: pcm: open stream 0 dir 1
[ 3556.788082] kernel: snd_sof:sof_pcm_open: sof-audio-pci-intel-tgl 0000:00:1f.3: period min 192 max 16384 bytes
[ 3556.788084] kernel: snd_sof:sof_pcm_open: sof-audio-pci-intel-tgl 0000:00:1f.3: period count 2 max 16
[ 3556.788086] kernel: snd_sof:sof_pcm_open: sof-audio-pci-intel-tgl 0000:00:1f.3: buffer max 65536 bytes
[ 3556.788185] kernel: snd_sof_intel_hda_common:hda_link_dma_hw_params: sof-audio-pci-intel-tgl 0000:00:1f.3: format_val=0x11, rate=48000, ch=2, format=2
[ 3556.788191] kernel: snd_sof:sof_pcm_hw_params: sof-audio-pci-intel-tgl 0000:00:1f.3: pcm: hw params stream 0 dir 1
[ 3556.788198] kernel: snd_sof_intel_hda_common:hda_dsp_stream_hw_params: sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0x80]=0x40000 successful
[ 3556.788212] kernel: snd_sof_intel_hda_common:hda_dsp_stream_hw_params: sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0x80]=0x40000 successful
[ 3556.788216] kernel: snd_sof_intel_hda_common:hda_dsp_stream_setup_bdl: sof-audio-pci-intel-tgl 0000:00:1f.3: period_bytes:0x4000
[ 3556.788217] kernel: snd_sof_intel_hda_common:hda_dsp_stream_setup_bdl: sof-audio-pci-intel-tgl 0000:00:1f.3: periods:4
[ 3556.788228] kernel: snd_sof:sof_ipc4_prepare_copier_module: sof-audio-pci-intel-tgl 0000:00:1f.3: copier copier.HDA.4.1, type 28
[ 3556.788230] kernel: snd_sof:sof_ipc4_init_audio_fmt: sof-audio-pci-intel-tgl 0000:00:1f.3: matched audio format index for 48000Hz, 32bit, 2 channels: 0
[ 3556.788232] kernel: snd_sof:sof_ipc4_init_audio_fmt: sof-audio-pci-intel-tgl 0000:00:1f.3: Init input audio formats for copier.HDA.4.1
[ 3556.788234] kernel: snd_sof:sof_ipc4_dbg_audio_format: sof-audio-pci-intel-tgl 0000:00:1f.3: Pin index #0: 48000Hz, 32bit (ch_map 0xffffff10 ch_cfg 1 interleaving_style 0 fmt_cfg 0x2002) buffer size 384
[ 3556.788237] kernel: snd_sof:sof_ipc4_prepare_copier_module: sof-audio-pci-intel-tgl 0000:00:1f.3: Output audio format for copier.HDA.4.1
[ 3556.788238] kernel: snd_sof:sof_ipc4_dbg_audio_format: sof-audio-pci-intel-tgl 0000:00:1f.3: Pin index #0: 48000Hz, 32bit (ch_map 0xffffff10 ch_cfg 1 interleaving_style 0 fmt_cfg 0x2002) buffer size 384
[ 3556.788240] kernel: snd_sof:sof_ipc4_prepare_copier_module: sof-audio-pci-intel-tgl 0000:00:1f.3: copier copier.HDA.4.1, IPC size is 84
[ 3556.788242] kernel: snd_sof:sof_ipc4_prepare_copier_module: sof-audio-pci-intel-tgl 0000:00:1f.3: copier copier.host.3.1, type 24
[ 3556.788245] kernel: snd_sof:sof_ipc4_init_audio_fmt: sof-audio-pci-intel-tgl 0000:00:1f.3: matched audio format index for 48000Hz, 16bit, 2 channels: 0
[ 3556.788247] kernel: snd_sof:sof_ipc4_init_audio_fmt: sof-audio-pci-intel-tgl 0000:00:1f.3: Init input audio formats for copier.host.3.1
[ 3556.788248] kernel: snd_sof:sof_ipc4_dbg_audio_format: sof-audio-pci-intel-tgl 0000:00:1f.3: Pin index #0: 48000Hz, 32bit (ch_map 0xffffff10 ch_cfg 1 interleaving_style 0 fmt_cfg 0x2002) buffer size 384
[ 3556.788250] kernel: snd_sof:sof_ipc4_prepare_copier_module: sof-audio-pci-intel-tgl 0000:00:1f.3: Output audio format for copier.host.3.1
[ 3556.788251] kernel: snd_sof:sof_ipc4_dbg_audio_format: sof-audio-pci-intel-tgl 0000:00:1f.3: Pin index #0: 48000Hz, 16bit (ch_map 0xffffff10 ch_cfg 1 interleaving_style 0 fmt_cfg 0x1002) buffer size 192
[ 3556.788253] kernel: snd_sof:sof_ipc4_prepare_copier_module: sof-audio-pci-intel-tgl 0000:00:1f.3: copier copier.host.3.1, IPC size is 84
[ 3556.788256] kernel: snd_sof:sof_ipc4_widget_setup: sof-audio-pci-intel-tgl 0000:00:1f.3: pipeline: 4 memory pages: 2
[ 3556.788258] kernel: snd_sof:sof_ipc4_widget_setup: sof-audio-pci-intel-tgl 0000:00:1f.3: Create widget pipeline.4 instance 3 - pipe 4 - core 0
[ 3557.257000] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ipc timed out for 0x13020004|0x0
[ 3557.257071] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: preventing DSP entering D3 state to preserve context
[ 3557.257076] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ------------[ IPC dump start ]------------
[ 3557.257135] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: hda irq intsts 0x00000000 intlctl 0xc0000f80 rirb 00
[ 3557.257208] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: dsp irq ppsts 0x00000000 adspis 0x00000000
[ 3557.257275] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: Host IPC initiator: 0x93020004|0x0|0x0, target: 0x0|0x0|0x0, ctl: 0x3
[ 3557.257350] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ------------[ IPC dump end ]------------
[ 3557.257402] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ------------[ DSP dump start ]------------
[ 3557.257453] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: IPC timeout
[ 3557.257489] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: fw_state: SOF_FW_BOOT_COMPLETE (7)
[ 3557.257544] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: 0x00000005: module: ROM, state: FW_ENTERED, running
[ 3557.257621] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: extended rom status:  0x5 0x0 0x0 0x0 0x0 0x0 0x0 0x1
[ 3557.257680] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ------------[ DSP dump end ]------------
[ 3557.257755] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: failed to set final state 4 for all pipelines

MTRACE

[   17.176621] <inf> ipc: rx	: 0x45010002|0x1000003
[   17.176640] <inf> ipc: buffer new size 0x300 id 2.1 flags 0x0
[   17.176683] <inf> pipe: comp:2 0x20001 connect buffer 1 as sink
[   17.176695] <inf> pipe: comp:1 0x30000 connect buffer 1 as source
[   17.177976] <inf> ipc: rx	: 0x13020003|0x0
[   17.178016] <wrn> ipc: ipc_pipeline_complete(): no scheduling component specified, use comp 131073
[   17.178288] <inf> ipc: rx	: 0x13020004|0x0
ASSERTION FAIL [aligned_addr == addr] @ /srv/home/jenkins/workspace/sof_generic_build/zephyr/soc/xtensa/intel_adsp/common/include/intel_adsp_hda.h:163
	Buffer must be 128 byte aligned
[   17.178553] <err> os:  ** FATAL EXCEPTION
[   17.178578] <err> os:  ** CPU 0 EXCCAUSE 63 (zephyr exception)
[   17.178586] <err> os:  **  PC 0xbe05227a VADDR (nil)
[   17.178595] <err> os:  **  PS 0x60f20
[   17.178608] <err> os:  **    (INTLEVEL:0 EXCM: 0 UM:1 RING:0 WOE:1 OWB:15 CALLINC:2)
[   17.178618] <err> os:  **  A0 0xbe014f58  SP 0xbe0a2f40  A2 0x4  A3 0xbe0a2f50
[   17.178628] <err> os:  **  A4 0xbe0a2f30  A5 0x4  A6 0x30  A7 (nil)
[   17.178638] <err> os:  **  A8 0xbe013461  A9 0xbe0a2eb0 A10 0xbe074028 A11 0xbe0a2f50
[   17.178648] <err> os:  ** A12 0xbe0a2f30 A13 0x4 A14 0x1 A15 0x1
[   17.178658] <err> os:  ** LBEG 0xbe016339 LEND 0xbe016348 LCOUNT 0xbe016bc5
[   17.178665] <err> os:  ** SAR 0x1c

Backtrace:0xbe052277:0xbe0a2f40 0xbe014f55:0xbe0a2f50 0xbe01de0c:0xbe0a2f70 0xbe01dfd8:0xbe0a2f90 0xbe02b077:0xbe0a2fb0 0xbe043bf5:0xbe0a3030 0xbe02b660:0xbe0a30c0 0xbe02b715:0xbe0a3100 0xbe02b72f:0xbe0a31d0 0xbe02b9a5:0xbe0a3200 0xbe02bbd9:0xbe0a3240 0xbe02bdf2:0xbe0a3290 0xbe02c65f:0xbe0a32c0 0xbe01fb86:0xbe0a32f0 0xbe021edc:0xbe0a3310 0xbe028ab9:0xbe0a3330 0xbe04c38c:0xbe0a3360 0xbe013a82:0xbe0a33b0 

[   17.178855] <err> os: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
[   17.178863] <err> os: Current thread: 0x9e0a8d80 (unknown)
[   17.181323] <err> os: Halting system
Terminated

To Reproduce
~/sof-test/test-case/multiple-pipeline.sh -f p -c 20 -l 25

Reproduction Rate
This bug has occurred three times in CI.

Environment

  1. Branch name and commit hash of the 2 repositories: sof (firmware/topology) and linux (kernel driver).
  2. Name of the topology file
    • Topology: avs-tplg/sof-hda-generic.tplg
  3. Name of the platform(s) on which the bug is observed.
    • Platform: TGLU_UP_HDA_IPC4ZPH

dmesg.txt
mtrace.txt

@keqiaozhang keqiaozhang added bug Something isn't working as expected IPC timeout IPC timeout observed Zephyr Issues only observed with Zephyr integrated TGL Applies to Tiger Lake IPC4 Issues observed with IPC4 (same IPC as Windows) labels Mar 1, 2023
@lgirdwood
Copy link
Member

@abonislawski @mwasko looks like an assert in the hda logic

ASSERTION FAIL [aligned_addr == addr] @ /srv/home/jenkins/workspace/sof_generic_build/zephyr/soc/xtensa/intel_adsp/common/include/intel_adsp_hda.h:163

@keqiaozhang
Copy link
Collaborator Author

This issue happened again in CI. Test ID:https://sof-ci.ostc.intel.com/#/result/planresultdetail/21700

@abonislawski
Copy link
Member

looks like an assert in the hda logic

interesting because we have correct alignment in dtsi: dma-buf-addr-alignment = <128>
which is used correctly in SOF in host/dai_zephyr.c and chain_dma.c:

ret = dma_get_attribute(cd->dma_host->z_dev, DMA_ATTR_BUFFER_ADDRESS_ALIGNMENT,
				&addr_align);

I found we are using wrong alignment for buffer size in chain_dma.c (the same alignment used for addr and size instead of separate alignments) but it shouldnt be a problem (at least in this case)

fifo_size = ALIGN_UP_INTERNAL(fifo_size, addr_align);
cd->dma_buffer = buffer_alloc(fifo_size, SOF_MEM_CAPS_DMA, addr_align);

@kfrydryx kfrydryx added the P2 Critical bugs or normal features label Mar 7, 2023
@lgirdwood
Copy link
Member

@abonislawski will you send a FW fix or will this be a topology fix (or both) ?

@marc-hb marc-hb changed the title [BUG] ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0 [BUG] ZEPHYR FATAL ERROR 4: zephyr kernel panic on CPU 0 Mar 9, 2023
@abonislawski
Copy link
Member

@lgirdwood Im not working on this issue, just checked if we are using alignment correctly

@kv2019i kv2019i added this to the v2.5 milestone Mar 13, 2023
@kv2019i kv2019i added P1 Blocker bugs or important features and removed P2 Critical bugs or normal features labels Mar 14, 2023
@kv2019i kv2019i self-assigned this Mar 14, 2023
@kv2019i
Copy link
Collaborator

kv2019i commented Mar 14, 2023

Assigning to myself as this is affecting SOF2.5 release.

@kv2019i
Copy link
Collaborator

kv2019i commented Mar 15, 2023

Somewhat hard to hit this issue today, but it seems we have a problem somewhere in host_zephyr_params(). I can see from the oops that the offending address is not valid at all (e.g. buffer address passed to DMA driver is 0x12c0). This is not aligned with 128 bytes, but alignment is not the worst problem here. FYI @abonislawski @juimonen Continuing the debug...

@kv2019i
Copy link
Collaborator

kv2019i commented Mar 15, 2023

Very low reproduction rate slowing down debug.

@kv2019i
Copy link
Collaborator

kv2019i commented Mar 17, 2023

So far one patch helps to avoid hitting the problem, but I can't explain why this helps -> https://github.com/kv2019i/sof/commits/202303-buffer-alloc-rework FYI @lyakh @ranj063

So far I've tried various Zephyr debugging tools for heap and stack debug, tried aligning "struct coherent" to a full cacheline, cherry-picked recent DW-DMA fixes from upstream Zephyr, but the panic is still hit (albeit reproduction rate is low so long test runs are needed).

@kv2019i
Copy link
Collaborator

kv2019i commented Mar 22, 2023

Tested with #7323 -> kernel panic still occurs
(just cross-checking here as we otherwise have no good hypothesis why this is happening)

@kv2019i
Copy link
Collaborator

kv2019i commented Mar 22, 2023

Tested with "CONFIG_INCOHERENT=n" ->panic still occurs with the test case

@kv2019i
Copy link
Collaborator

kv2019i commented Apr 3, 2023

With today's mainline ("362a0781f20b6709cf8c609219798f6079a57511 (origin/main) topology2: Remove dma_buffer_size attribute"), the bug no longer shows up as Zephyr panic, but instead there's a plain IPC timeout and nothing in the FW log. But the original test sequence with "~/sof-test/test-case/multiple-pipeline.sh -f p -c 20 -l 25" is still failing.

@kv2019i
Copy link
Collaborator

kv2019i commented Apr 11, 2023

Downgrading to P2 as this no longer shows up in daily tests and I can no longer reproduce this.

It is possible #7361 is caused by same rootcause, but until that is proven, let's track #7361 as the P1 item.

@kv2019i
Copy link
Collaborator

kv2019i commented Jun 2, 2023

@lgirdwood We do have cases that affect TGL HDA tplg on multiple-pipeline-playback-50 tests. We did have this bug open on SOF2.5 already, so this is not a regression, but let's keep the door open for a last minute fix.

I now have suspect under investigation.

@kv2019i
Copy link
Collaborator

kv2019i commented Jun 2, 2023

It seems chain-dma is somehow related to this. I've noted in past week this has only occured on systems where cha-dmain is enabled in topology.

I did a custom test run in Intel CI and without chain-dma, issue is not seen (26927). Same test with current main and chain-dma enabled, issue happens (26930).

UPDATE: I've run many thousands of cycles with identical hw than is used in CI, but I've not able to reproduce the issue. Code analysis of chain-dma code did not reveal anything yet either. Debug continues...

@fredoh9
Copy link
Contributor

fredoh9 commented Jun 2, 2023

Today we have non alignment exception in TGL UP Xtreme, we have seen this before though.

[   46.704056] <inf> dai_intel_dmic: dmic_start(), cic 0x0000c001
[   46.704068] <err> os:  ** FATAL EXCEPTION
[   46.704078] <err> os:  ** CPU 0 EXCCAUSE 13 (load/store PIF data error)
[   46.704086] <err> os:  **  PC 0xbe01cddc VADDR 0x4
[   46.704095] <err> os:  **  PS 0x60620
[   46.704108] <err> os:  **    (INTLEVEL:0 EXCM: 0 UM:1 RING:0 WOE:1 OWB:6 CALLINC:2)
[   46.704118] <err> os:  **  A0 0xbe019793  SP 0xbe0a7110  A2 0xbe084154  A3 (nil)
[   46.704128] <err> os:  **  A4 0xf  A5 0x9e095138  A6 0x9e0aab1c  A7 (nil)
[   46.704138] <err> os:  **  A8 0x1  A9 0x1 A10 0x9e0aab00 A11 0x18c0
[   46.704148] <err> os:  ** A12 (nil) A13 (nil) A14 0xbe099310 A15 0xbe099320
[   46.704158] <err> os:  ** LBEG 0xbe01d2b1 LEND 0xbe01d2c5 LCOUNT 0xbe01d61c
[   46.704166] <err> os:  ** SAR 0x18

Intel internal daily test: planresultdetail/26938?model=TGLU_UP_HDA_IPC4ZPH&testcase=multiple-pipeline-capture-50

@kv2019i
Copy link
Collaborator

kv2019i commented Jun 2, 2023

@fredoh9 wrote:

[ 46.704078] os: ** CPU 0 EXCCAUSE 13 (load/store PIF data error)
[ 46.704086] os: ** PC 0xbe01cddc VADDR 0x4

Let's track this as a separate error (do we have a bug for this?). The PC points to Zephyr dmic interrupt handler and the value of
dmic->reg_base is invalid, leading to errror when running

»       val0 = dai_dmic_read(dmic, OUTSTAT0);
»       val1 = dai_dmic_read(dmic, OUTSTAT1);

The PIF exception occurs on the OUTSTAT0. I'd need to the matching ELF image to be sure, but I think A8=0x1 is value from dmic->reg_base (which explains the unaligned access exception).

@singalsu @juimonen @abonislawski @softwarecki any ideas? It seems we get an interrupt with foobar dmic context.

Anyways, until we have concrete data this is related to this bug, I'd track this in a separate bug and focus this #7191 on the aligned_addr assertion.

@juimonen
Copy link

juimonen commented Jun 5, 2023

@kv2019i at this point I can only thing that something is writing over the dmic regbase (aka mem corruption), it is read from DT in compilation phase into dmic driver struct and I think nothing in dmic code should write there.

@keqiaozhang
Copy link
Collaborator Author

In today's CI test(ID:27130), this issue happened when testing check-playback-100times on MTL-NOCODEC, the reproduce rate is almost 100%. It should be a regression.

I did a quick bisect, it points to 07ed14b

mtrace

[00:00:12.705,005] <inf> dai_intel_ssp: dai_ssp_get_properties: dai_ssp_get_properties init delay 0
[00:00:12.705,016] <inf> dai_intel_ssp: dai_ssp_pre_start: dai_ssp_pre_start
[00:00:12.705,051] <inf> dma_intel_adsp_gpdma: intel_adsp_gpdma_power_on: intel_adsp_gpdma_power_on: dma dma@7c000 initialized
[00:00:12.705,083] <inf> dai_intel_ssp: dai_ssp_start: dai_ssp_start
[00:00:12.705,095] <inf> dai_intel_ssp: dai_ssp_update_bits: dai_ssp_update_bits base 28000, reg 30, mask 100, value 100
ASSERTION FAIL [aligned_addr == addr] @ ZEPHYR_BASE/soc/xtensa/intel_adsp/common/include/intel_adsp_hda.h:163
	Buffer musTerminated

@kv2019i , please help to verify it, I can open a new bug if needed.

@kv2019i
Copy link
Collaborator

kv2019i commented Jun 7, 2023

@keqiaozhang Trying to reproduce on one of the CI machines, but so far no luck, so I'm not seeing that 100% repro rate. But let's keep this in this bug as the assert is the same. 07ed14b has a Zephyr baseline update (to 3.4.0-rc2) so it could bring in some change.

@marc-hb
Copy link
Collaborator

marc-hb commented Jun 7, 2023

Reproduction will be affected by this?

@keqiaozhang
Copy link
Collaborator Author

keqiaozhang commented Jun 8, 2023

@kv2019i , this issue happened again in today's daily test, I also can easily reproduce it.

device: jf-mtlp-rvp-nocodec-6 / jf-mtlp-rvp-nocodec-4(Test ID:27145)

@kv2019i
Copy link
Collaborator

kv2019i commented Jun 8, 2023

@keqiaozhang Ack, I can easily hit this now. There's still unexpected connection to the binary build I can't explain. If I modify the code a bit (like remove on of the commits of #7726 -> these only add code that is run after a panic has happened), issue no longer happens. Only explanation I can think of is that the .text section size changes and this will impact the FW memory map. Debug ongoing.

@kv2019i
Copy link
Collaborator

kv2019i commented Jun 8, 2023

This didn't make the v2.6 cut, moving to v2.7.

@kv2019i kv2019i modified the milestones: v2.6, v2.7 Jun 8, 2023
@marc-hb
Copy link
Collaborator

marc-hb commented Jun 10, 2023

After a lot of MTL test failures across all PRs for a couple days(with empty mtraces since #7726), today's https://sof-ci.01.org/softestpr/PR1054/build419/devicetest/index.html is mostly green. Go figure.

@kv2019i
Copy link
Collaborator

kv2019i commented Jun 12, 2023

@marc-hb This is starting to look like a cache coherency issue and reproduction rate is very sensitive to any change to code. E.g. the rimage change merged on Friday seems to make this bug disappear again #7756 . The rimage change was not intended, so I sent a fixup PR and ta-daa, we have 7191 triggering agani with high rate -> #7782 .

@marc-hb
Copy link
Collaborator

marc-hb commented Jun 12, 2023

Indeed MTLP_RVP_NOCODEC is mostly red again in https://sof-ci.01.org/sofpr/PR7782/build9265/devicetest/index.html (#7782)

EDIT: lucky us, we even have an error message in mtrace

[00:00:03.269,431] <inf> dai_intel_ssp: dai_ssp_update_bits: dai_ssp_update_bits base 28000, reg 30, mask 100, value 100
ASSERTION FAIL [aligned_addr == addr] @ ZEPHYR_BASE/soc/xtensa/intel_adsp/common/include/intel_adsp_hda.h:163
	Buffer must be 128 byte aligned
[00:00:03.269,611] <err> os:  ** FATTerminated

Other MTL configurations are green. It seems to be MTLP_RVP_NOCODEC affected every time. Older report: https://sof-ci.01.org/sofpr/PR7773/build9201/devicetest/index.html

@kv2019i
Copy link
Collaborator

kv2019i commented Jun 14, 2023

Status update:

  • extensive testing to check for use-after-free and out-of-bounds access to heap objects, but no indication of any unexpected access to the buffer object that gets corrupted
  • debug consistently points to sequence where we can observe correct data via uncached read, then we have some cached accesses, and then we see unexpected data via uncached read
  • above two points both hint the issue is not stray DMA writes overwriting parts of heap (DMA does not change DSP cache state)
  • primary suspect is buffer_alloc() and its use of incoherent.h -- the failure is always seen in this function

@kv2019i kv2019i added the MTL Applies to Meteor Lake platform label Jun 14, 2023
@kv2019i
Copy link
Collaborator

kv2019i commented Jun 14, 2023

Debugging with @lyakh , we finally have a hypothesis that explains what happens and why e.g. #7786 helps:

  • at time t1: write to memory X via uncached alias
  • at time t2: cache invalidate of buffer X
  • at time t3: prefetch of data cache kicked off
  • at time t4: write to memory X via uncached alias
  • at time t5: reads and writes to memory X via cached access
  • at time t6: memory X written back to memory, cached use ends (e.g. at end of buffer_alloc())

At step t5, the writes to t4 are not seen in the prefetched data and end up being discarded.

In case of this bug, buffer->stream.addr is written two values, NULL at t1 and allocated heap address at t4. The t4 write gets lost due to prefetch and we end up asserting due to invalid buffer address.

@kv2019i
Copy link
Collaborator

kv2019i commented Jun 15, 2023

Further experiments with prefetcher configuration today, confirming the analysis and hypothesis of #7191 (comment) .

Let's wait for some time to see how the results look, but expectation is now that this bug no longer occurs after #7786 . Based on these findings, we've surveyed the rest of the codebase and e.g. #7808 was submitted.

@keqiaozhang
Copy link
Collaborator Author

Confirmed that this issue no longer exists in CI. Closing this bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working as expected DSP panic DSP panic observed IPC timeout IPC timeout observed IPC4 Issues observed with IPC4 (same IPC as Windows) MTL Applies to Meteor Lake platform P1 Blocker bugs or important features TGL Applies to Tiger Lake Zephyr Issues only observed with Zephyr integrated
Projects
None yet
Development

No branches or pull requests