RP2040 (Raspberry Pi Pico) fails to start up - code clobbered or XIP-flash inaccessible? #82632
Comments
Can you pin down the exact version? At first glance, the XIP-related code has not changed in the past year, so would it be possible to test the behavior before and after this? |
I will certainly try! This has the flavor of a Heisenbug. When the problem occurs with any specific build, it's 100% reproducible (rebuilding with a clean build directory and the same options yields the same result). However, touching the source code in almost any way, and then rebuilding, often makes the problem disappear for no obvious reason... and making other changes makes it come back. My test builds this morning have all (annoyingly!) worked perfectly, even with the larger amount of code moved into RAM. The next time I succeed in doing something which breaks the image, I'll save it, then rebuild with a few routines moved back from RAM to flash, get an image which starts up OK, and save that as well. It's feeling as if it's probably either timing-sensitive, or very sensitive to the addresses of the code being moved (maybe, somehow, a clash with the XIP cache lines in which the copy-it-to-RAM code happens to lie?). Time will tell... I'll try to put enough salt on its tail to pin it down :-) |
OK, I was able to reproduce the problem for long enough to make some images. The attached ZIP archive contains both ELF and UF2 versions of the same firmware, differing only in whether I did or did not flag a bunch of routines as "time_sensitive" to force them into RAM. With this option in use:
and without it
The image named "works" starts up OK... the usual Zephyr kernel log info comes out the primary UART, and it will enumerate as an ACM "UART" on USB and you can talk to it with The image named "hangs" does just that... never logs anything on the UART. According to Next step will be to make sure I can still reproduce it, then start bisecting Zephyr
|
Hmmm. Interesting. I just did a gdb attach to a Pico which was successfully running my latest (not-forced-to-RAM) image, and did a disassembly of memmove(). Although this image started up just fine (and is still running), gdb seems to be showing that chunks of the memmove and later routines are full of zeros. I'm also seeing zeros in the middle of routines which I know are being executed constantly Beats me what's going on here! Maybe, once XIP is active, one cannot count on the So, the "memory is being stomped during startup" appearance may be an artifact of
|
More interesting, makes-me-wonder stuff going on. I took a look at the early-memcopy stuff, and at the registers at the time of the hang, and at the map output from my .elf image build. It appears that the code as-written is trying to copy a portion of the flash image onto itself. Both __flash_rodata_rom_start and __flash_rodata_reloc_start are the same value (0x00100001a8). The amount of data being copied is tiny (0xa0 bytes) so it should have taken next to no time. However, at the time GDB connected to the board, r3 has 0x23 in it (the index
If I can get builds to start failing again reliably, I'll try disabling the startup of various parts of my app and see if one of them seems to be triggering an early crash. mumblemutter... hate Heisenbugs... they never stay put... |
This is very mysterious. The linker configuration for the Raspberry Pi Pico is almost the same as Zephyr's normal one... |
OK, this is even weirder than I thought. Here's the memory code for memcpy() in my ELF file, as disassembled by objdump:
That all looks reasonable to me. Now, here's what I see in memory on the board, after I load the image, it hangs during startup, and I attach gdb:
I cannot currently account for the difference... but since R3 isn't being decremented, it's The first of the two corrupted instructions went from "strb r4, [r0, r3]" to "strb r6, [r3, r3]". I looked through the image, and fgrep doesn't see any "strb r3, [r7, #15]" anywhere at all. I do recall that there are all sorts of dire warnings in the RP2040 data sheet about Perhaps this "dummy" copy of a range of flash, over itself, is triggering some |
Aha! The plot thickens. I think I've identified the culprit. I did a dump of the address range of the __flash_rodata_rom_start stuff which is being
and I recognize the symbol name. "credits" is a constant text string that I compiled into I'd figured out a way to compel the linker to always include it: in my CMakeLists, I have
and that, almost certainly, is what's causing the "relocate some const-in-flash right over So - short term fix for me is going to be a big Don't Do That. I'll figure out a way to Unfortunately, saying LOCATION ROM instead of LOCATION FLASH doesn't work. With the relocation directive removed from the Makefile, the __flash_rodata_rom_start and |
I looked through the Zephyr documentation last night, and tried an experiment: putting
I thought this might prevent the __flash_rodata_rom_start and related symbols from I was half-correct. The __flash_rodata_rom_start and __flash_rodata_reloc_start symbols At this point, I think the cause of the boot-time crash is confirmed:
So, I think this bug could be resolved in either of two ways:
I'll leave it to you to decide how to resolve this. Thanks for your assistance! It's been a fun bug-hunt. |
Can this be closed down? |
I'll close this one. I'll educate myself on how the platform-specific linker scripts and runtime code for Zephyr work. If it turns out to be easy to add some sort of safety check against this particular condition, I'll see if I can do so and offer up a patch. |
Closing as "user error". There's room for improvement in how Zephyr handles this improper condition, and I'll see if I can offer up a patch at some point. |
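For anyone picking this up later: one shape such a safety check could take is sketched below. This is only an illustration under assumptions, not Zephyr's actual relocation code. `guarded_reloc_copy` and the `reloc_result` codes are hypothetical names, and the sketch assumes the startup copy can be given the load address, the run address and the length. It declines to write when the two addresses are identical (the case seen here, where `__flash_rodata_rom_start` and `__flash_rodata_reloc_start` had the same value) and refuses partially overlapping ranges.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical result codes; not part of Zephyr. */
enum reloc_result {
    RELOC_COPIED,   /* regions were distinct, bytes were copied */
    RELOC_SKIPPED,  /* nothing to do: zero length or same address */
    RELOC_REJECTED  /* regions partially overlap; nothing was written */
};

/*
 * Hypothetical guard around a startup-time copy. It never writes when the
 * load address and the run address coincide, which is the situation in
 * this issue: a region "relocated" to the place where it already lives.
 */
static enum reloc_result guarded_reloc_copy(void *dst, const void *src,
                                            size_t len)
{
    assert(dst != NULL && src != NULL);

    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;

    if (len == 0 || d == s) {
        return RELOC_SKIPPED;
    }

    /* Partial overlap: the start of one range lies inside the other. */
    if ((d > s && d - s < len) || (s > d && s - d < len)) {
        return RELOC_REJECTED;
    }

    memcpy(dst, src, len);
    return RELOC_COPIED;
}
```

A caller could treat `RELOC_SKIPPED` as harmless and log or halt on `RELOC_REJECTED`. Whether a build-time check in the linker script generator would be a better place for this is an open question.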
Describe the bug
My application is running on a Raspberry Pi Pico. It's been operating successfully for about a year now.
Some of the time-critical portions of the code (signal demodulation) are flagged in the CMakeLists
file as needing relocation to RAM, in order to avoid XIP overhead and performance loss:
I've added some new code to these routines recently. Suddenly, after adding some quite
vanilla code, the app wouldn't start up at all... even the main() routine wasn't being
called. I played around with alternatives, and found that the issue didn't seem to be
the content of the code, only the amount of it.
When I attached a debugger to the Pi (a BlackMagic Debug probe) and fired up GDB, I
found that the application was stuck in a memmove() in the early stages of kernel startup.
Upon examination, GDB showed me what appears to be a memory clobber - a bunch of
instructions in the memmove() code and subsequent routines are showing up as zeros
("movs r0,r0"). With the code having been "gutted" it hangs.
A GDB memory dump seems to show that almost everything from 0x10000000 onwards is
reading as zero... as if the flash has stopped responding.
If I reset the board into the bootloader, and use "picotool" to read back the entire contents
of the flash, it's a byte-for-byte exact match with the "zephyr.bin" created by the build...
so, as far as I can tell, memmove() and the other code is OK in flash. It's just not being
read properly during the early memmove() for some reason... apparently both the XIP
of the instructions, and a GDB memory-dump access are getting zeros for all of these
locations.
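The byte-for-byte comparison of the flash readback against "zephyr.bin" can be sketched on a host machine roughly as follows. This is a minimal sketch, not part of picotool or Zephyr: `first_mismatch` is a hypothetical helper, and it assumes both images have already been read into memory buffers of at least `len` bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: return the offset of the first byte at which the
 * observed image differs from the expected one, or len if the first len
 * bytes are identical.
 */
static size_t first_mismatch(const uint8_t *expected, const uint8_t *observed,
                             size_t len)
{
    assert(expected != NULL && observed != NULL);

    for (size_t i = 0; i < len; i++) {
        if (expected[i] != observed[i]) {
            return i;
        }
    }
    return len;
}
```

A return value equal to `len` corresponds to the "exact match" result reported here. The same helper could be pointed at bytes saved from a GDB memory dump to find where the zeros begin.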
I've seen this behavior on two different boards (both with vanilla Raspberry Pi Pico modules in
good working condition). I've seen it after two different "download to flash" techniques
(copying the .UF2 file to the "flash disk", or using picotool to talk directly to the bootloader).
Oddly, if I use GDB to run the image (downloading it directly) it reliably starts up OK, and
the code is readable out of flash without error.
Expected behavior
Code is successfully relocated from flash to RAM, and executes properly.
Impact
Showstopper for continuing to add code which must run from RAM for performance reasons.
Possible workaround is to restructure my code to minimize the number of routines
marked as "must be relocated to RAM".
Logs and console output
Environment (please complete the following information):