Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] rimage + openssl3 is broken in the stable-v2.2 and cavs25 branches, always have been #9340

Closed
marc-hb opened this issue Jul 31, 2024 · 8 comments
Assignees
Labels
bug Something isn't working as expected P1 Blocker bugs or important features

Comments

@marc-hb
Copy link
Collaborator

marc-hb commented Jul 31, 2024

Describe the bug

rimage + openssl3 is broken in the stable-v2.2 and cavs-drop-stable branch, always has been.

This was fixed 2 years ago by rimage commit thesofproject/rimage@73a9d7c

However that critical rimage fix never made it to https://github.com/thesofproject/sof/commits/stable-v2.2/rimage. stable-v2.2 fell 6 commits short.

That's because we tested https://github.com/thesofproject/sof/commits/stable-v2.3/rimage for a long time (which does have the openssl3 fix) and then decided to go backwards to stable-v2.2. I don't remember why.

In thesofproject/rimage#97 (review) @lgirdwood wrote "We need to take this for v2.2, ..." but that never happened.

The original investigation happened in 2022 in https://github.com/thesofproject/sof/issues/5917 . I didn't re-open that bug and opened a new one instead because that investigation took an incredibly long time with many red herrings so the original bug is very long and very hard to follow. One of the reasons it took so long: some people were using openssl1 (which worked) and other people were using openssl3 (which failed) but it took a long time to notice that difference.

To Reproduce

This affects only some SOF commits and not others and I have no idea why. Any OpenSSL expert around?

This was found in totally unrelated PR #9336. When I fetch 4ed65ed which is the latest pull/9336/merge commit for that PR and run git submodule update --recursive, then the firmware never boots on my TGL Xtreme. After I cherry-pick openssl3 fix thesofproject/rimage@73a9d7c, the firmware boots again every time.

(the git bisect wasn't straight-forward for unrelated reasons but it worked; you have been warned)

Reproduction Rate

100% for some SOF commits but 0% for other SOF commits like the current tag v2.2.10. No idea why some commits boot while others not.

The toolchain can also affect reproduction, see comments in unrelated #9336. I guess anything that changes the binary.

Expected behavior

The firmware boots.

Impact

Show stopper.

Screenshots or console output

juil. 31 02:39:51 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: Attempting iteration 0 of Core En/ROM load...
juil. 31 02:39:51 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0x4]=0xf010e0e successful
juil. 31 02:39:51 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0x4]=0xf010e0e successful
juil. 31 02:39:51 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: unstall/run core: core_mask = 1
juil. 31 02:39:51 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: DSP core(s) enabled? 1 : core_mask 1
juil. 31 02:39:51 tglup-mh kernel: usbcore: registered new interface driver snd-usb-audio
juil. 31 02:39:52 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0xd4]=0x0 timedout
juil. 31 02:39:52 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0x4]=0xf010f0f successful
juil. 31 02:39:52 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0x4]=0xf0f successful
juil. 31 02:39:52 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: DSP core(s) enabled? 0 : core_mask 1
juil. 31 02:39:52 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: Attempting iteration 1 of Core En/ROM load...
juil. 31 02:39:52 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0x4]=0xf010f0f successful
juil. 31 02:39:52 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0x4]=0xf010f0e successful
juil. 31 02:39:52 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: unstall/run core: core_mask = 1
juil. 31 02:39:52 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: DSP core(s) enabled? 1 : core_mask 1
juil. 31 02:39:52 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0xd4]=0x80000000 successful
juil. 31 02:39:52 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0x80000]=0x5000001 successful
juil. 31 02:39:52 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: Code loader DMA starting
juil. 31 02:39:52 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: Code loader DMA done, waiting for FW_ENTERED status
juil. 31 02:39:55 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0x80000]=0x80000012 timedout
juil. 31 02:39:55 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: hda_cl_copy_fw: timeout with rom_status_reg (0x80000) read
juil. 31 02:39:55 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0x160]=0x140000 successful
juil. 31 02:39:55 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: Code loader DMA stopped
juil. 31 02:39:55 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ------------[ DSP dump start ]------------
juil. 31 02:39:55 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: Firmware download failed
juil. 31 02:39:55 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: fw_state: SOF_FW_BOOT_IN_PROGRESS (3)
juil. 31 02:39:55 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: 0x80000012: module: ROM, state: CSE_VALIDATE_IMAGE_REQUEST, not running
juil. 31 02:39:55 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: error code: 0x2c (error: signature verification failed)
juil. 31 02:39:55 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: extended rom status:  0x80000012 0x2c 0x0 0x0 0x0 0x0 0x2510113 0x0
juil. 31 02:39:55 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ------------[ DSP dump end ]------------
juil. 31 02:39:55 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: Failed to start DSP
juil. 31 02:39:55 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: error: failed to boot DSP firmware -110
juil. 31 02:39:55 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: fw_state change: 3 -> 4
juil. 31 02:39:55 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: FW Poll Status: reg[0x4]=0x1d003c timedout
juil. 31 02:39:55 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: error: hda_dsp_core_reset_enter: timeout on HDA_DSP_REG_ADSPCS read
juil. 31 02:39:55 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: error: dsp core reset failed: core_mask 1
juil. 31 02:39:55 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: fw_state change: 4 -> 0
juil. 31 02:39:55 tglup-mh kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: error: sof_probe_work failed err: -110

@marc-hb marc-hb added the bug Something isn't working as expected label Jul 31, 2024
@marc-hb
Copy link
Collaborator Author

marc-hb commented Jul 31, 2024

So now we have (at least) two obvious options:

  • cherry-pick ONLY that rimage fix for stable-v2.2. Minimal risk of breaking something else BUT risk of missing other fixes.

  • Fast forward sof/stable-v2.2 to the last "standalone" rimage commit before the rimage repo was merged into the sof repo. I smoke-tested that and it boots too.

  • Anything in between

I could be wrong but I don't think there would be much value cherry-picking rimage commits AFTER the transfer to sof.git back into the stable-v2.2 branch.

@marc-hb marc-hb added the P1 Blocker bugs or important features label Jul 31, 2024
@marc-hb marc-hb changed the title [BUG] rimage + openssl3 is broken in the stable-v2.2 branch, has always been [BUG] rimage + openssl3 is broken in the stable-v2.2 branch, always has been Jul 31, 2024
marc-hb added a commit to marc-hb/sof that referenced this issue Jul 31, 2024
Fixes thesofproject#9340

This adds the following commits from the main rimage branch:

```
commit 02abc5d ("rimage: ace signing functions need openssl 3.0 guards")
commit 73a9d7c ("rimage: fix openssl 3.0 support in ver25 signing")
commit 8ba3d17 ("adsp_config: fix name parsing error in parse_signed_pkg_ace_v1_5")
commit fe4dcaa ("mtl: Add ASRC module to the manifest")
commit af947cb ("config: Add toml config for mtl")
commit c484d99 ("rimage:  add ACE V1.5 handling")
commit 1b233f6 ("config: add rmb toml file to support rembrandt build")
```

This is the smallest main branch fast-forward that includes the critical
openssl3 fix commit 73a9d7c ("rimage: fix openssl 3.0 support in
ver25 signing") _and_ compiles.

This happens to align the rimage version in stable-v2.2 to the version
in stable-v2.3. stable-v2.3 is not in use anymore but it was routinely
tested in CI for a long time. In fact this stable-v2.2 commit is the
same as stable-v2.3 commit 4e1d3ba ("rimage: update to version 02abc5d")

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
@cujomalainey
Copy link
Contributor

rimage is pretty good at being backwards compatible right? How much risk is there in the fast forward in breaking something?

@marc-hb
Copy link
Collaborator Author

marc-hb commented Aug 1, 2024

Probably good (compatibility) but we have no evidence of that. There have been some big code changes later and they were just too big for me to audit them. On the other hand, I spotted no fix later as important as this one. If anyone volunteers then please free to submit a bigger update after this one. The lack of branching and cherry-picks now will make a future upgrade easy from a git perspective.

@marc-hb
Copy link
Collaborator Author

marc-hb commented Aug 1, 2024

rimage is pretty good at being backwards compatible right? How much risk is there in the fast forward in breaking something?

Sorry I assumed you meant "fast forward all the way" and answered accordingly. Now I see this may or may not be what you meant.

#9342 is the smallest (and small) fast forward with the fix.

marc-hb added a commit that referenced this issue Aug 2, 2024
Fixes #9340

This adds the following commits from the main rimage branch:

```
commit 02abc5d ("rimage: ace signing functions need openssl 3.0 guards")
commit 73a9d7c ("rimage: fix openssl 3.0 support in ver25 signing")
commit 8ba3d17 ("adsp_config: fix name parsing error in parse_signed_pkg_ace_v1_5")
commit fe4dcaa ("mtl: Add ASRC module to the manifest")
commit af947cb ("config: Add toml config for mtl")
commit c484d99 ("rimage:  add ACE V1.5 handling")
commit 1b233f6 ("config: add rmb toml file to support rembrandt build")
```

This is the smallest main branch fast-forward that includes the critical
openssl3 fix commit 73a9d7c ("rimage: fix openssl 3.0 support in
ver25 signing") _and_ compiles.

This happens to align the rimage version in stable-v2.2 to the version
in stable-v2.3. stable-v2.3 is not in use anymore but it was routinely
tested in CI for a long time. In fact this stable-v2.2 commit is the
same as stable-v2.3 commit 4e1d3ba ("rimage: update to version 02abc5d")

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
@cujomalainey
Copy link
Contributor

@marc-hb we can close this now right?

@marc-hb marc-hb closed this as completed Aug 2, 2024
@marc-hb
Copy link
Collaborator Author

marc-hb commented Sep 3, 2024

Re-opening for cavs2.5-001-drop-stable, see random firmware boot failures in https://sof-ci.01.org/sofpr/PR9427/build7557/devicetest/index.html

Still P1 because firmware in cavs2.5 does not... boot.

Exactly like what happened in #9336, a random, unrelated topology change is causing a "butterfly effect" in the firmware image and it stops booting.

I thought I could very quickly cherry-pick the same rimage upgrade than I did for stable-v2.2 (in #9342) and that has been working well. But that rimage submodule cherry-pick has a conflict because @abonislawski pushed a cavs25-only, "file error handling" rimage fix in thesofproject/rimage@d287016. I don't know why this file error handling fix was only on cavs25 and without any PR or code review. It looks like yet another static analysis "quick fix".

At this stage the best option is probably to create a new branch https://github.com/thesofproject/rimage/tree/cavs2.5-001-drop-stable-2 that points at the rimage commits used in stable-v2.2 + maybe a cherry-pick -x of ""file error handling" d287016caac ?

EDIT: or much simpler: forget about "rogue", unreviewed and pointless commit d287016caac and just align cavs2.5 with stable-v2.2

cc: @macchian , @mwasko

@marc-hb marc-hb reopened this Sep 3, 2024
@marc-hb marc-hb changed the title [BUG] rimage + openssl3 is broken in the stable-v2.2 branch, always has been [BUG] rimage + openssl3 is broken in the stable-v2.2 and cavs25 branches, always have been Sep 3, 2024
@abonislawski
Copy link
Member

abonislawski commented Oct 7, 2024

Unfortunately this "rogue", unreviewed and pointless commit d287016caac is required for SOF releases from cavs2.5 branch.

I didn't saw any conflicts and just added openssl3 fix to cavs2.5 branch, no issues in static analysis scans.

https://github.com/thesofproject/sof/tree/cavs2.5-001-drop-stable/

@kv2019i not sure who is able now to verify it?

@kv2019i
Copy link
Collaborator

kv2019i commented Oct 7, 2024

I tested stable-v2.2 and cavs2.5-001-drop-stable both today and works after the backport. Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working as expected P1 Blocker bugs or important features
Projects
None yet
Development

No branches or pull requests

9 participants