Fail to boot after Upgrade to 10.0 CM4 NVMe #2479

Closed
KartoffelToby opened this issue Apr 18, 2023 · 33 comments
Labels
board/raspberrypi Raspberry Pi Boards bug

@KartoffelToby

Describe the issue you are experiencing

I didn't see any note saying that it would not be bootable with an NVMe.

I use a CM4 that boots from NVMe. I upgraded from 9.5 to 10 in the UI and now it doesn't boot anymore.

Are there any options to downgrade, e.g. via a file swap?

It is really dangerous to provide updates in the UI that completely break some setups.

What operating system image do you use?

rpi4-64 (Raspberry Pi 4/400 64-bit OS)

What version of Home Assistant Operating System is installed?

10

Did you upgrade the Operating System?

Yes

Steps to reproduce the issue

  1. Go to the UI and hit the OS 10 update button
  2. Wait for the install
  3. Reboot; the system no longer boots

Anything in the Supervisor logs that might be useful for us?

Nothing

Anything in the Host logs that might be useful for us?

Nothing

System information

CM4 with nvme

Additional information

No response

@littlecake

This comment was marked as off-topic.

@agners agners added the board/raspberrypi Raspberry Pi Boards label Apr 18, 2023
@agners
Member

agners commented Apr 18, 2023

I use a CM4 that boots from NVMe

What carrier board are you running on?

Are there any options to downgrade, e.g. via a file swap?

If you power it off and back on three times, it should revert to the previous OS version.

It is really dangerous to provide updates in the UI that completely break some setups.

True, and I guarantee you that we do not do this intentionally! We test all common configurations: Raspberry Pi 3 Model B/B+, Raspberry Pi 4, Raspberry Pi 400, Yellow etc., and all passed our testing. Furthermore, we have more than 1000 beta testers who tested our release candidates. Something must be special in your setup which causes it to break.

@agners
Member

agners commented Apr 18, 2023

On proxmox same problem

@littlecake this issue is about Raspberry Pi. Please open a new issue along with all information and, if possible, a screenshot of the system console (screen output).

@ChristianWaechter

ChristianWaechter commented Apr 18, 2023

I have the same problem!

Hardware used:

  • Raspberry Pi Compute Module 4 8GB RAM, 32GB Flash, WLAN + BT
  • Waveshare CM4-IO-BASE-B
  • Western Digital PC SN520 NVMe SSD (256 GB)
  • Power via custom USB-C plug on which only the power pins are connected with a Meanwell HDR-15-5 power supply (no data pins connected)

Further information about my setup:

  • nothing else was connected to the board during the update (except LAN and power)
  • no SD card inserted
  • Raspberry Pi OS is installed on the eMMC, which boots normally when the M.2 NVMe is removed
  • not the slightest image output via HDMI when trying to boot from the M.2

Are there any options to downgrade, e.g. via a file swap?

If you power it off and back on three times, it should revert to the previous OS version.

Can you provide more information about this, e.g. how long to leave it on before turning it off? How long should I wait after the fourth power-on, which should then revert to the old version? I tried this (turning on, waiting for ~5 seconds, turning off, waiting for ~5 seconds), but nothing changed...

@KartoffelToby
Author

KartoffelToby commented Apr 18, 2023

@agners
I have the same board as @ChristianWaechter.
The three-power-cycle reset doesn't seem to work.

I'll try a recovery to 9.5 tomorrow.

@bschatzow

@agners Did the BIOS change from the RCs to 10.0? I had no issue with any of the RCs. 10.0 did not start for me either. Similar to the 8.x issues I had. I was able to start by disconnecting my SSD, turning off my Pi, reconnecting the SSD and restarting.

@agners
Member

agners commented Apr 19, 2023

Can you provide more information about this, e.g. how long to leave it on before turning it off? How long should I wait after the fourth power-on, which should then revert to the old version? I tried this (turning on, waiting for ~5 seconds, turning off, waiting for ~5 seconds), but nothing changed...

The U-Boot phase must complete for this to work, which depends a bit on how fast the Raspberry Pi firmware finds and starts U-Boot. 5 seconds sounds a bit short; I'd give it more like 10-20 s.

However, if the system is in such a state that it doesn't even boot U-Boot, then this won't work.
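
The fallback described here can be pictured as a per-slot attempt counter that the boot loader decrements on every boot attempt. The following is only an illustrative Python model, not the actual HAOS boot script: the slot names, the starting count of 3 and the function names are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class BootState:
    """Toy model of an A/B boot environment (all names are hypothetical)."""
    # The freshly installed slot is tried first, the previous one second.
    order: list = field(default_factory=lambda: ["B", "A"])
    tries_left: dict = field(default_factory=lambda: {"A": 3, "B": 3})


def pick_slot(state: BootState):
    """Emulate the boot loader stage: choose a slot and use up one attempt.

    Returns None when every slot has run out of attempts.
    """
    for slot in state.order:
        if state.tries_left[slot] > 0:
            state.tries_left[slot] -= 1
            return slot
    return None


def mark_good(state: BootState, slot: str, tries: int = 3) -> None:
    """Emulate the OS confirming a successful boot by resetting the counter."""
    state.tries_left[slot] = tries


state = BootState()
# Three interrupted boots of the new slot B use up its attempts ...
attempts = [pick_slot(state) for _ in range(3)]
# ... so the fourth power-on falls back to the previous slot A.
fallback = pick_slot(state)
print(attempts, fallback)
```

In this model the counter only moves if the boot loader stage actually runs, which matches the caveat above: a system that never reaches U-Boot cannot fall back this way.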

Do you happen to have a serial console to check?

@agners
Member

agners commented Apr 19, 2023

@agners Did the BIOS change from the RCs to 10.0? I had no issue with any of the RCs. 10.0 did not start for me either.

@bschatzow yes, in 10.0.rc4 the firmware changed to Raspberry Pi's latest release `1.20230405`.

Do you have a CM4 as well?

@Kiremas

Kiremas commented Apr 19, 2023

Same problem here:
Raspberry Pi CM4 on a Waveshare CM4-IO-BASE-B, console screen stays completely black when NVME is connected.

The workaround for now is to connect the NVMe via a USB adapter; this boots without a problem.

@JoeyGnarf
Copy link

Confirming:
Rpi4 CM4 on a Waveshare CM4-IO-BASE-B with WD Black NVMe. Console shows no output at all.

System boots (with NVMe installed as is) from SD card, so writing the HAOS 9.5 image and restoring a backup was relatively easy.

@KartoffelToby
Author

Yes, the only option that works is to recover all system partitions with 9.5.

@bschatzow

@agners Did the BIOS change from the RCs to 10.0? I had no issue with any of the RCs. 10.0 did not start for me either.

@bschatzow yes, in 10.0.rc4 the firmware changed to Raspberry Pi's latest release `1.20230405`.

Do you have a CM4 as well?

I do not. I have just an RPi 4. I commented because this was the identical issue I and many others had for over a year. No updates worked correctly until a fix in the firmware was made. I asked if the firmware changed (which it did) and wanted to remind you of this issue; maybe it is related to this?

@rainerbeck

Confirming: Rpi4 CM4 on a Waveshare CM4-IO-BASE-B with WD Black NVMe. Console shows no output at all.

Confirming with similar hardware: Kernel not loading, no problem with 9.5

@swingstate

swingstate commented Apr 19, 2023

My config: CM4 4GB w/ 0GB MMC with 512GB NVMe on an Oratek Tofuboard - native NVMe boot. Plus one cod.m Zigbee HAT.

I faced the same issue after upgrading to HA OS 10 and more or less fixed the situation as follows:

  1. Pulled the HA 10 NVMe and connected it with an NVMe-to-USB adapter to my Mac/Linux VM.
  2. On the Linux VM, mounted the /hassos-data partition and copied the latest full backup to the Desktop in the VM (Note: It seems that the system does not do a full backup automatically when doing an OS upgrade, so my full backup is 2 weeks old).
  3. Flashed HA OS 9.5 to another 256GB NVMe (luckily I still had a spare from early tests). Also tried with HA OS 10, but the CM4 definitely does not boot HA 10 from NVMe in my case.
  4. Mounted the 256GB NVMe into the HA case, booted, waited a few minutes until the login appeared and then did a restore with the 2-week-old backup.
  5. Once done, shut the system down and moved the SSD to the USB adapter again, mounted the /boot partition (usually sdb1) and made the necessary changes to config.txt - in my case I have to disable Bluetooth and enable the serial connection for the Zigbee HAT.
  6. Put the 256GB SSD back into the HA case and booted; so far the system incl. Zigbee is working.
  7. Restored the partial backup I did yesterday, which recovered all the dashboard improvements I made during the last 2 weeks.
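
The config.txt change in step 5 can be scripted once the boot partition is mounted. This is a rough Python sketch: the two settings shown (`dtoverlay=disable-bt` and `enable_uart=1`) are the usual Raspberry Pi options for freeing the UART and are an assumption here, since the exact lines are not listed in this comment; the function name and the sample file content are made up as well.

```python
def ensure_settings(config_text: str, settings: list[str]) -> str:
    """Return config_text with each setting appended if it is not already active.

    Commented-out lines (starting with '#') do not count as active.
    """
    lines = config_text.splitlines()
    active = {line.strip() for line in lines if not line.lstrip().startswith("#")}
    for setting in settings:
        if setting not in active:
            lines.append(setting)
    return "\n".join(lines) + "\n"


# Assumed settings for a UART-attached Zigbee HAT; adjust to your hardware.
WANTED = ["dtoverlay=disable-bt", "enable_uart=1"]

# Invented sample content standing in for the real config.txt.
sample = "# HAOS config.txt\ndisable_splash=1\n#enable_uart=1\n"
updated = ensure_settings(sample, WANTED)
print(updated)
```

Running the function a second time on its own output leaves the text unchanged, so it is safe to re-apply after every reflash.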

I basically lost 2 weeks of historical data from my energy dashboard / solar system. Everything else that I did in the last 2 weeks, e.g. dashboard changes, should be manually recoverable from the 512GB SSD.

Lessons learned:

  • always keep a spare NVME
  • do a manual backup
  • keep a Linux VM at hand (since macOS seems to have issues mounting the /boot partitions)
  • Come here and read for any issues before doing major upgrades
  • Wait 1 week to do major upgrades.

I think this major bug should have been caught during beta testing.

@lumilooms

I have the same board as ChristianWaechter (Waveshare CM4-IO-BASE-B with NVME), the upgrade to HAOS 10 bricked it.

@agners
Member

agners commented Apr 20, 2023

@lumilooms what NVMe are you using?

@lumilooms

@lumilooms what NVMe are you using?
Union Memory 128GB M.2 PCI-e NVMe. Been using it for HAOS for almost 1 year without issues.

@ChristianWaechter

Yesterday I played around a bit to get my setup running again. The conclusion was that the whole Home Assistant 10.0 image does not work, even when you flash it directly rather than as an update from 9.5. This is what I did:

  1. Update HA OS from 9.5 to 10.0. No other updates were pending at that time > System behaves as described in this topic: no startup, no image output, no network access possible
  2. Remove the NVMe > The old Raspberry Pi OS on the eMMC starts normally
  3. Update the rpi-eeprom to the newest version (2023-01-11 from here) while not changing the boot order (BOOT_ORDER=0xf416 in boot.conf, NVMe first) > Startup into Raspberry Pi OS on the eMMC
  4. Mount the NVMe again > Still no startup into HAOS
  5. Change to BOOT_ORDER=0xf641 in boot.conf (SD/eMMC first) and keep the NVMe mounted > Startup into Raspberry Pi OS on the eMMC, and the data on the NVMe would be accessible
  6. Download the official haos_rpi4-64-10.0.img.xz and write it to the NVMe with dd if=haos_rpi4-64-10.0.img.xz of=/dev/nvmexxx
  7. Change boot order in boot.conf back to BOOT_ORDER=0xf416 (NVMe first) > Still no change, even on the first boot. Not the slightest image output
  8. Repeat steps 5 - 6 but with haos_rpi4-64-9.5.img.xz > Normal startup with immediate image output (still using pieeprom-2023-01-11.bin)
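
The BOOT_ORDER values in these steps are easier to compare once decoded. The Raspberry Pi bootloader reads the value one hex digit at a time, from right to left; the Python sketch below decodes it that way. The digit table is written from my reading of the Raspberry Pi bootloader documentation and should be checked against it, and the function name is hypothetical.

```python
# Boot mode per hex digit (an assumption based on the Raspberry Pi bootloader docs).
BOOT_MODES = {
    0x1: "SD card / eMMC",
    0x2: "network",
    0x3: "RPIBOOT",
    0x4: "USB mass storage",
    0x5: "BCM USB mass storage",
    0x6: "NVMe",
    0x7: "HTTP",
    0xE: "stop",
    0xF: "restart",
}


def decode_boot_order(value: int) -> list[str]:
    """Return the boot modes in the order they are tried (rightmost digit first)."""
    order = []
    while value:
        digit = value & 0xF  # lowest hex digit is the next mode to try
        order.append(BOOT_MODES.get(digit, f"unknown (0x{digit:x})"))
        value >>= 4
    return order


for raw in (0xF416, 0xF641, 0xF25416):
    print(f"0x{raw:x}: " + " -> ".join(decode_boot_order(raw)))
```

Decoded this way, `0xf416` tries NVMe first and `0xf641` tries SD/eMMC first, which agrees with the notes in steps 3 and 5.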

I think this major bug should have been caught during beta testing.

Full ack, especially as the Home Assistant Yellow features an M.2 slot as well...

@swingstate

Honestly, I wonder why this release is still available and not pulled back, unless there is an immediate fix available. This will brick more and more HA devices (especially when people have time on the upcoming weekend to click the "update" button), causing those who boot from an NVME a lot of trouble, potential data loss and maybe some frustration.

@agners
Member

agners commented Apr 20, 2023

The problem is that there is always a setup which breaks in some weird way.

Of the more than 150k installations that opt in to stats, 34k have already upgraded successfully according to analytics.home-assistant.io (otherwise the stats would not get updated). If I pulled the update for every failing installation I see, we would never be able to publish a new OS release.

In general: USB SSD boot on Raspberry Pi has always been fragile and has been discouraged for a long time. That CM4 NVMe boot on certain base boards is unstable has also been known for a while, and it often required EEPROM updates etc.

Ideally, you boot HAOS from an SD card or from the eMMC; this is known to be much more reliable. The data disk feature then still allows you to use NVMe or USB-attached SSDs.

@agners
Member

agners commented Apr 20, 2023

Are there any options to downgrade, e.g. via a file swap?

Most likely replacing U-Boot (u-boot.bin) from HAOS 9.5 on the first partition fixes the boot issue.
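
As a rough sketch of that file swap, assuming the boot partition of the broken 10.0 disk and the boot partition of a 9.5 image are both already mounted on another machine: the Python below backs up the existing `u-boot.bin` and copies the older one over it. The directory names and the function name are placeholders, not paths taken from this thread, and the demo uses temporary directories in place of real mount points.

```python
import shutil
import tempfile
from pathlib import Path


def swap_u_boot(old_boot: Path, broken_boot: Path, name: str = "u-boot.bin") -> Path:
    """Copy `name` from old_boot into broken_boot, keeping a backup of the original.

    Returns the path of the backup file.
    """
    source = old_boot / name
    target = broken_boot / name
    for path in (source, target):
        if not path.is_file():
            raise FileNotFoundError(f"{path} not found; is the partition mounted?")
    backup = target.with_name(name + ".bak")
    shutil.copyfile(target, backup)  # keep the 10.0 file so the change can be undone
    shutil.copyfile(source, target)
    return backup


# Demo with temporary directories standing in for the two mounted boot partitions.
with tempfile.TemporaryDirectory() as tmp:
    old_boot, broken_boot = Path(tmp, "haos95-boot"), Path(tmp, "haos10-boot")
    old_boot.mkdir()
    broken_boot.mkdir()
    (old_boot / "u-boot.bin").write_bytes(b"u-boot from 9.5")
    (broken_boot / "u-boot.bin").write_bytes(b"u-boot from 10.0")

    backup = swap_u_boot(old_boot, broken_boot)
    restored = (broken_boot / "u-boot.bin").read_bytes()
    saved = backup.read_bytes()

print(restored, saved)
```

Keeping the `.bak` copy next to the replaced file makes it easy to undo the change if the older U-Boot does not help.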

@swingstate

swingstate commented Apr 20, 2023

The problem is that there is always a setup which breaks in some weird way.

Well, I think in this case there is no weird way. HA 10 has a 99% chance of not booting when a CM4 is used together with an NVMe as boot drive. So this should be called out as a "breaking change" while the release is being fixed. Booting from NVMe isn't something weird nowadays; it has become the norm.

Leaving things like this and watching the number of complaints grow is just killing brand reputation. Why would one risk that?

Just my two cents :) I don't understand this way of product management and prefer quality over quantity.

@agners
Member

agners commented Apr 20, 2023

Well, I think in this case there is no weird way. HA 10 has a 99% chance of not booting when a CM4 is used together with an NVMe as boot drive.

Not true: One of my Yellow test devices boots from NVMe. Also, there is not a single report where Home Assistant Yellow wouldn't boot from NVMe where it did before.

It seems only other boards are affected. I currently don't know why that is.

I've been able to reproduce it, and to me it seems to be an EEPROM issue. However, that doesn't really make sense, as we don't update the EEPROM. I wonder how this ever worked.

I'll continue investigation.

Just my two cents :) I don't understand this way of product management and prefer quality over quantity.

Oh yeah, I am all in for quality over quantity! The problem is, HAOS these days has to support so many weird configurations (RPi 4 + USB S-ATA adapter from brand A/B/C, RPi 4 + USB M.2 adapter from brand A/B/C, CM4 + native NVMe on I/O board X, on I/O board Y, etc.). I can't possibly test all these configurations! If we really wanted to increase quality, we should just prevent boot on any board we don't test... But that wouldn't really be in the open source spirit.

@agners
Member

agners commented Apr 20, 2023

FWIW, HAOS 10 is off the stable channel for RPi 4 devices now: home-assistant/version#288.

@swingstate

Bold move, appreciate that!

I am happy to add to testing with my config if that helps and if desired, since I think the TOFU board I am using is a great piece of hardware but less common.

To your point:

If we'd really want to increase quality, we should just prevent boot on any board we don't test.

I think you don't have to go that far, however there should be a hardware compatibility list to which the core team and users can contribute. Is this being considered?

@agners
Member

agners commented Apr 20, 2023

I think you don't have to go that far, however there should be a hardware compatibility list to which the core team and users can contribute. Is this being considered?

For most boards things are quite straightforward, since folks usually just boot from SD card. I do have every board we support. But for more advanced/special setups, such a list would indeed be nice to have! Along with a list of users who are willing to test pre-releases on the beta channel. Maybe a GitHub wiki could do the job? 🤔

In any case, I've found the problem, PR #2493 fixes it. The problem will be fixed in HAOS 10.1.

@agners agners changed the title Fail to boot after Upgrade to 10.0 cm4 nvme Fail to boot after Upgrade to 10.0 CM4 NVMe Apr 20, 2023
@maxromanovsky

@agners

  1. Add it to the generic Raspberry Pi U-Boot configuration so Yellow as well as other CM4 based systems can boot from NVMe SSD again.

Does it mean that booting from NVMe is also broken for Yellow?

  1. The problem will be fixed in HAOS 10.1.

In the meantime, will replacing u-boot.bin from HAOS 9.5 work?

  1. How long is the wait for the first 10.1 beta?
  2. I've been able to reproduce it, and to me it seems to be an EEPROM issue

I've updated to the latest EEPROM on CM4 - no changes.

  1. Ideally, you boot HAOS from an SD card or from the eMMC; this is known to be much more reliable. The data disk feature then still allows you to use NVMe or USB-attached SSDs.

That's the thing, it doesn't work on Waveshare boards #1887

@agners
Member

agners commented Apr 20, 2023

Does it mean that booting from NVMe is also broken for Yellow?

No, the configuration was present for Yellow. It just didn't apply for the rpi4/rpi4-64 images.

In the meantime, will replacing u-boot.bin from HAOS 9.5 work?

Yes that should work.

How long is the wait for the first 10.1 beta?

I'll trigger a dev build tonight, it will be available from https://os-builds.home-assistant.io/ tomorrow.

@maxromanovsky

Okay, some interesting updates :)
I know it's not the most scientific way, but I had a few updates coming and decided to combine them all, so don't blame me for not doing it step by step and not identifying what exactly made it successful.

  • Used stable HAOS 10.0 with u-boot.bin extracted from dev build
  • Replaced CM4 base board with CM4-POE-UPS-BASE (PoE, LiPo backup power, Fan control & RTC - albeit unused by HASS AFAIK)
  • Replaced 256GB M.2 NVMe with 64GB M.2 eMMC from Steam Deck: Foresee e2m2 064g - looking at the specs on the Internet, pretty slow, but I haven't found any issues so far (MariaDB as a DB backend)
    • When I flashed it with HAOS, it couldn't boot (i.e. latest stable CM4 EEPROM with BOOT_ORDER=0xf25416 couldn't detect it)
    • When I flashed internal eMMC with HAOS AND didn't erase M.2 eMMC, it seems that U-Boot actually picked up HAOS on M.2 eMMC (it showed 64GB of storage instead of the on-board 8GB and didn't show a data disk)
    • When I erased M.2 eMMC, HAOS loaded from internal eMMC, and I was able to move the data disk to external storage.

Win-win, as long as it won't break later on. In any case, I'll keep current u-boot.bin from dev build ;)

@agners
Member

agners commented Apr 24, 2023

looking at the specs on the Internet, pretty slow, but I haven't found any issues so far (MariaDB as a DB backend)

Probably still fast enough for the BCM2711 chip (Raspberry Pi SoC).

When I flashed internal eMMC with HAOS AND didn't erase M.2 eMMC, it seems that U-Boot actually picked up HAOS on M.2 eMMC (it showed 64GB of storage instead of the on-board 8GB and didn't show a data disk)

The system detects the data disk by name. If you have two installations available, the outcome is random.
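
One way to spot this situation before booting is to look for more than one partition carrying the same label. The Python sketch below parses the JSON that `lsblk --json -o NAME,LABEL` prints; the sample data is invented, the function name is hypothetical, and `hassos-data` is assumed to be the data partition label based on the partition name mentioned earlier in this thread.

```python
import json
from collections import defaultdict


def find_duplicate_labels(lsblk_json: str) -> dict[str, list[str]]:
    """Map each label that occurs more than once to the devices carrying it."""
    by_label = defaultdict(list)

    def walk(nodes):
        # lsblk nests partitions under their disk in a "children" list.
        for node in nodes:
            if node.get("label"):
                by_label[node["label"]].append(node["name"])
            walk(node.get("children", []))

    walk(json.loads(lsblk_json).get("blockdevices", []))
    return {label: devs for label, devs in by_label.items() if len(devs) > 1}


# Invented sample: one installation on the eMMC and a second one on the M.2 disk.
SAMPLE = json.dumps({
    "blockdevices": [
        {"name": "mmcblk0", "label": None, "children": [
            {"name": "mmcblk0p1", "label": "hassos-boot"},
            {"name": "mmcblk0p8", "label": "hassos-data"},
        ]},
        {"name": "nvme0n1", "label": None, "children": [
            {"name": "nvme0n1p1", "label": "hassos-boot"},
            {"name": "nvme0n1p8", "label": "hassos-data"},
        ]},
    ]
})

duplicates = find_duplicate_labels(SAMPLE)
print(duplicates)
```

If the result is not empty, wiping or relabelling the unwanted copy removes the ambiguity.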

When I erased M.2 eMMC, then HAOS loaded from internal eMMC, and I was able to move data disk to external storage.

Hm, so at this point you have a HAOS installation on the internal eMMC, that means you also boot from the eMMC. This use case should already work with HAOS 10.0. U-Boot is only used at boot time, and the bug in OS 10.0 is that U-Boot can't continue booting when booting from the external NVMe SSD.

Nit: Technically, there is no such thing as an "M.2 eMMC". That is just an M.2 SSD (solid-state disk). eMMC stands for embedded multimedia card, which is the name of the protocol used by the CM4 on-board flash storage. The M.2 SSDs on Waveshare/Yellow use NVMe as the protocol (over PCIe), so "M.2 NVMe SSD" would be more precise.

@maxromanovsky

@agners

Hm, so at this point you have a HAOS installation on the internal eMMC, that means you also boot from the eMMC. This use case should already work with HAOS 10.0. U-Boot is only used at boot time, and the bug in OS 10.0 is that U-Boot can't continue booting when booting from the external NVMe SSD.

Correct (in theory), but previously not possible due to #1887

Nit: Technically, there is no such thing as an "M.2 eMMC". That is just an M.2 SSD (solid-state disk). eMMC stands for embedded multimedia card, which is the name of the protocol used by the CM4 on-board flash storage. The M.2 SSDs on Waveshare/Yellow use NVMe as the protocol (over PCIe), so "M.2 NVMe SSD" would be more precise.

I'm using Steam Deck marketing lingo here (and they refer to it as eMMC) - first line is the module I have:

64 GB eMMC (PCIe Gen 2 x1)
256 GB NVMe SSD (PCIe Gen 3 x4 or PCIe Gen 3 x2*)
512 GB high-speed NVMe SSD (PCIe Gen 3 x4 or PCIe Gen 3 x2*)
All models use socketed 2230 m.2 modules (not intended for end-user replacement)

@agners
Member

agners commented Apr 26, 2023

This issue is resolved by #2493 and part of HAOS 10.1.

@xyklex

xyklex commented Sep 11, 2023

@agners is PR #2493 the only change needed for U-Boot to boot from the NVMe drive? I'm still trying to find the reason why U-Boot can't recognize the NVMe while Linux does. I've used the same config variables you mentioned in the PR. The output I'm getting from U-Boot is:

U-Boot> pci
BusDevFun  VendorId   DeviceId   Device Class       Sub-Class
_____________________________________________________________
00.00.00   0x14e4     0x2711     Bridge device           0x04
01.00.00   0x1c5c     0x174a     Mass storage controller 0x08
U-Boot> nvme info
U-Boot> nvme scan
Cannot set queue count (err=-110)
Unable to setup I/O queues(err=-110)
Failed to probe 'nvme#0': err=-110
U-Boot>

As you can see, the PCI controller is recognized, but the NVMe driver fails to probe the drive. I created a thread on the U-Boot mailing list.
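
For reading the log above: U-Boot reports failures as negative Linux-style errno values, and 110 is ETIMEDOUT, which suggests the NVMe controller did not answer the queue setup commands in time. Below is a small Python sketch that pulls the codes out of such a log; the lookup table is deliberately tiny and hand-written, and the function name is hypothetical.

```python
import re

# A few Linux errno values; extend as needed (hand-written, not exhaustive).
ERRNO_NAMES = {5: "EIO", 12: "ENOMEM", 19: "ENODEV", 110: "ETIMEDOUT"}


def explain_errors(log: str) -> list[tuple[int, str]]:
    """Return (code, name) for every 'err=-N' occurrence in the log, in order."""
    found = []
    for match in re.finditer(r"err=-(\d+)", log):
        code = int(match.group(1))
        found.append((code, ERRNO_NAMES.get(code, "unknown")))
    return found


LOG = """Cannot set queue count (err=-110)
Unable to setup I/O queues(err=-110)
Failed to probe 'nvme#0': err=-110"""

codes = explain_errors(LOG)
print(codes)
```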
