RP2040 (Raspberry Pi Pico) fails to start up - code clobbered or XIP-flash inaccessible? #82632

Closed
DavidCPlatt opened this issue Dec 6, 2024 · 14 comments
Labels: bug · platform: Raspberry Pi Pico (RPi Pico) · priority: low

Comments

@DavidCPlatt

Describe the bug

My application is running on a Raspberry Pi Pico. It's been operating successfully for about a year now.

Some of the time-critical portions of the code (signal demodulation) are flagged in the CMakeLists
file as needing relocation to RAM, in order to avoid XIP overhead and performance loss:

  zephyr_code_relocate(FILES src/demod.c LOCATION RAM NOKEEP)
  zephyr_code_relocate(FILES src/biquad.c LOCATION RAM NOKEEP)
  zephyr_code_relocate(FILES src/hdlc.c LOCATION RAM NOKEEP)
  zephyr_code_relocate(FILES ../../drivers/dma/dma_rpi_pico.c LOCATION RAM NOKEEP)

I've added some new code to these routines recently. Suddenly, after adding some quite
vanilla code, the app wouldn't start up at all... even the main() routine wasn't being
called. I played around with alternatives, and found that the issue didn't seem to be
the content of the code, only the amount of it.

When I attached a debugger to the Pico (a Black Magic Debug probe) and fired up GDB, I
found that the application was stuck in a memmove() in the early stages of kernel startup.
Upon examination, GDB showed me what appears to be a memory clobber - a bunch of
instructions in the memmove() code and subsequent routines are reading back as zeros
(0x0000, which disassembles as "movs r0, r0"). With the code "gutted" like that, it hangs.

A GDB memory dump seems to show that almost everything from 0x10000000 onwards is
reading as zero... as if the flash has stopped responding.

If I reset the board into the bootloader, and use "picotool" to read back the entire contents
of the flash, it's a byte-for-byte exact match with the "zephyr.bin" created by the build...
so, as far as I can tell, memmove() and the other code are fine in flash. They're just not being
read properly during the early memmove() for some reason... apparently both the XIP
instruction fetches and GDB memory-dump accesses are returning zeros for all of these
locations.

I've seen this behavior on two different boards (both with vanilla Raspberry Pi Pico modules in
good working condition). I've seen it after two different "download to flash" techniques
(copying the .UF2 file to the "flash disk", or using picotool to talk directly to the bootloader).

Oddly, if I use GDB to run the image (downloading it directly) it reliably starts up OK, and
the code is readable out of flash without error.

Expected behavior

Code is successfully relocated from flash to RAM, and executes properly.

Impact

Showstopper for continuing to add code which must run from RAM for performance reasons.
Possible workaround is to restructure my code to minimize the number of routines
marked as "must be relocated to RAM".

Logs and console output

$ arm-none-eabi-gdb build/zephyr/zephyr.elf
GNU gdb (Debian 13.1-3) 13.1
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from build/zephyr/zephyr.elf...
Available Targets:
No. Att Driver
 1      Raspberry RP2040 M0+
 2      Raspberry RP2040 M0+
 3      Raspberry RP2040 Rescue (Attach to reset!) 
0x10016216 in memmove ()
(gdb) bt
#0  0x10016216 in memmove ()
#1  0x10015a0c in z_early_memcpy (dst=<optimized out>, src=<optimized out>, 
    n=<optimized out>) at /home/dplatt/zephyrproject/zephyr/kernel/init.c:210
#2  0x10012124 in data_copy_xip_relocation ()
    at /home/dplatt/zephyrproject/zephyr/apps/tnc/build/zephyr/code_relocation.c:23
#3  0x10011c02 in z_data_copy ()
    at /home/dplatt/zephyrproject/zephyr/kernel/xip.c:49
#4  0x1000e8fc in z_prep_c ()
    at /home/dplatt/zephyrproject/zephyr/arch/arm/core/cortex_m/prep_c.c:195
#5  0x1000e750 in z_arm_reset ()
    at /home/dplatt/zephyrproject/zephyr/arch/arm/core/cortex_m/reset.S:169
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) disassemble
Dump of assembler code for function memmove:
   0x1001620e <+0>:	push	{r4, lr}
   0x10016210 <+2>:	cmp	r0, r1
   0x10016212 <+4>:	bhi.n	0x10016222 <memmove+20>
   0x10016214 <+6>:	movs	r3, #0
=> 0x10016216 <+8>:	cmp	r2, r3
   0x10016218 <+10>:	beq.n	0x1001622c <memmove+30>
   0x1001621a <+12>:	ldrb	r4, [r1, r3]
   0x1001621c <+14>:	strb	r4, [r0, r3]
   0x1001621e <+16>:	adds	r3, #1
   0x10016220 <+18>:	b.n	0x10016216 <memmove+8>
   0x10016222 <+20>:	adds	r3, r1, r2
   0x10016224 <+22>:	movs	r0, r0
   0x10016226 <+24>:	movs	r0, r0
   0x10016228 <+26>:	subs	r2, #1
   0x1001622a <+28>:	bcs.n	0x1001622e <memmove+32>
   0x1001622c <+30>:	movs	r0, r0
   0x1001622e <+32>:	movs	r0, r0
   0x10016230 <+34>:	movs	r0, r0
   0x10016232 <+36>:	movs	r0, r0
End of assembler dump.
(gdb) info registers
r0             0x100001a8          268435880
r1             0x100001a8          268435880
r2             0xa0                160
r3             0x23                35
r4             0x3a                58
r5             0x20041f01          537140993
r6             0x18000000          402653184
r7             0x0                 0
r8             0xffffffff          -1
r9             0xffffffff          -1
r10            0xffffffff          -1
r11            0xffffffff          -1
r12            0x4001801c          1073840156
sp             0x20012de8          0x20012de8 <z_interrupt_stacks+2008>
lr             0x10015a0d          0x10015a0d <z_early_memcpy+6>
pc             0x10016216          0x10016216 <memmove+8>
xpsr           0x1000000           16777216
msp            0x20013f10          0x20013f10 <sys_work_q_stack>
psp            0x20012de8          0x20012de8 <z_interrupt_stacks+2008>
primask        0x1                 1 '\001'
basepri        0x0                 0 '\000'
faultmask      0x0                 0 '\000'
control        0x2                 2 '\002'

(gdb) x/64xh memmove
0x1001620e <memmove>:	0xb510	0x4288	0xd806	0x2300	0x429a	0xd008	0x5ccc	0x54c4
0x1001621e <memmove+16>:	0x3301	0xe7f9	0x188b	0x0000	0x0000	0x3a01	0xd200	0x0000
0x1001622e <memmove+32>:	0x0000	0x0000	0x0000	0x0000	0x0000	0x0000	0x0000	0x0000
0x1001623e <memset+10>:	0x0000	0x0000	0x0000	0x0000	0x0000	0x0000	0x0000	0x0000
0x1001624e <__file_str_put+10>:	0x0000	0x0000	0x0000	0x0000	0x0000	0x0000	0x0000	0x0000
0x1001625e <putc+10>:	0x0000	0x0000	0x0000	0x0000	0x0000	0x0000	0x0000	0x0000
0x1001626e <putc+26>:	0x0000	0x0000	0x0000	0x0000	0x0000	0x0000	0x0000	0x0000
0x1001627e <__math_uflowf+2>:	0x0000	0x0000	0x0000	0x0000	0x0000	0x0000	0x0000	0x0000

Environment (please complete the following information):

  • OS: Debian Linux
  • Toolchain: zephyr-sdk-0.16.8
  • Commit SHA or Version used: commit 8469084 (HEAD -> v4.0.0, tag: v4.0.0, origin/v4.0-branch)
@DavidCPlatt DavidCPlatt added the bug label Dec 6, 2024

github-actions bot commented Dec 6, 2024

Hi @DavidCPlatt! We appreciate you submitting your first issue for our open-source project. 🌟

Even though I'm a bot, I can assure you that the whole community is genuinely grateful for your time and effort. 🤖💙

@rruuaanng rruuaanng added the platform: Raspberry Pi Pico (RPi Pico) label Dec 6, 2024
@soburi
Member

soburi commented Dec 6, 2024

@DavidCPlatt

My application is running on a Raspberry Pi Pico. It's been operating successfully for about a year now.

Can you identify the exact version?

At first look, the XIP-related code has not changed in the past year.
Another change that seems related is the introduction of SDK 2.0:

dccf4bf

So would it be possible to test the behavior before and after this commit?
Of course, it would be best if you could identify it with a bisect.
Or, if you can provide code that reproduces the problem, I'll try to reproduce it.

@DavidCPlatt
Author

I will certainly try!

This has the flavor of a Heisenbug. When the problem occurs with any specific build, it's 100% reproducible (rebuilding with a clean build directory and the same options yields the same result). However, touching the source code in almost any way, and then rebuilding, often makes the problem disappear for no obvious reason... and making other changes makes it come back.

My test builds this morning have all (annoyingly!) worked perfectly, even with the larger amount of code moved into RAM.

The next time I succeed in doing something which breaks the image, I'll save it, then rebuild with a few routines moved back from RAM to flash, get an image which starts up OK, and save that as well.

It's feeling as if it's probably either timing-sensitive, or very sensitive to the addresses of the code being moved (maybe, somehow, a clash with the XIP cache lines in which the copy-it-to-RAM code happens to lie?).

Time will tell... I'll try to put enough salt on its tail to pin it down :-)

@DavidCPlatt
Author

images.zip

OK, I was able to reproduce the problem for long enough to make some images.

The attached ZIP archive contains both ELF and UF2 versions of the same firmware, differing only in whether I did or did not flag a bunch of routines as "time_sensitive" to force them into RAM. With this option in use:

Memory region         Used Size  Region Size  %age Used
      BOOT_FLASH:         256 B        256 B    100.00%
           FLASH:      124728 B    2096896 B      5.95%
             RAM:       97976 B       264 KB     36.24%
        IDT_LIST:          0 GB        32 KB      0.00%

and without it

Memory region         Used Size  Region Size  %age Used
      BOOT_FLASH:         256 B        256 B    100.00%
           FLASH:      123928 B    2096896 B      5.91%
             RAM:       93880 B       264 KB     34.73%
        IDT_LIST:          0 GB        32 KB      0.00%

The image named "works" starts up OK... the usual Zephyr kernel log info comes out the primary UART, and it will enumerate as an ACM "UART" on USB and you can talk to it with
minicom or etc.

The image named "hangs" does just that... never logs anything on the UART. According to
gdb, the kernel memcpy() routine has been largely zeroed... a bit differently than
yesterday but it looks like similar shrapnel.

Next step will be to make sure I can still reproduce it, then start bisecting Zephyr
(maybe with that SDK-change commit) or roll back to 3.7 and see if I can make it
happen there.

   0x10014dc4 <memcpy+0>:	movs	r3, #0
   0x10014dc6 <memcpy+2>:	push	{r4, lr}
   0x10014dc8 <memcpy+4>:	cmp	r2, r3
   0x10014dca <memcpy+6>:	bne.n	0x10014dce <memcpy+10>
   0x10014dcc <memcpy+8>:	pop	{r4, pc}
=> 0x10014dce <memcpy+10>:	ldrb	r4, [r1, r3]
   0x10014dd0 <memcpy+12>:	strb	r6, [r3, r3]
   0x10014dd2 <memcpy+14>:	strb	r3, [r7, #15]
   0x10014dd4 <memcpy+16>:	b.n	0x10014dc8 <memcpy+4>
   0x10014dd6 <memmove+0>:	push	{r4, lr}
   0x10014dd8 <memmove+2>:	movs	r0, r0
   0x10014dda <memmove+4>:	movs	r0, r0
   0x10014ddc <memmove+6>:	movs	r0, r0
   0x10014dde <memmove+8>:	movs	r0, r0
   0x10014de0 <memmove+10>:	movs	r0, r0
   0x10014de2 <memmove+12>:	movs	r0, r0
   0x10014de4 <memmove+14>:	movs	r0, r0
   0x10014de6 <memmove+16>:	movs	r0, r0
   0x10014de8 <memmove+18>:	movs	r0, r0
   0x10014dea <memmove+20>:	movs	r0, r0
   0x10014dec <memmove+22>:	movs	r0, r0
   0x10014dee <memmove+24>:	movs	r0, r0
   0x10014df0 <memmove+26>:	movs	r0, r0
   0x10014df2 <memmove+28>:	movs	r0, r0
   0x10014df4 <memmove+30>:	movs	r0, r0
   0x10014df6 <memmove+32>:	movs	r0, r0
   0x10014df8 <memmove+34>:	movs	r0, r0
   0x10014dfa <memmove+36>:	movs	r0, r0
   0x10014dfc <memset+0>:	movs	r0, r0
   0x10014dfe <memset+2>:	movs	r0, r0
   0x10014e00 <memset+4>:	movs	r0, r0
   0x10014e02 <memset+6>:	movs	r0, r0
   0x10014e04 <memset+8>:	movs	r0, r0
   0x10014e06 <memset+10>:	movs	r0, r0
   0x10014e08 <memset+12>:	adds	r3, #1
   0x10014e0a <memset+14>:	b.n	0x10014e00 <memset+4>
   0x10014e0c <__file_str_put+0>:	movs	r0, r0
   0x10014e0e <__file_str_put+2>:	movs	r0, r0
   0x10014e10 <__file_str_put+4>:	movs	r0, r0
   0x10014e12 <__file_str_put+6>:	movs	r0, r0
   0x10014e14 <__file_str_put+8>:	movs	r0, r0
   0x10014e16 <__file_str_put+10>:	movs	r0, r0
   0x10014e18 <__file_str_put+12>:	movs	r0, r0
   0x10014e1a <__file_str_put+14>:	movs	r0, r0

@DavidCPlatt
Author

Hmmm. Interesting. I just did a gdb attach to a Pico which was successfully running my latest (not-forced-to-RAM) image, and did a disassembly of memmove(). Although this image started up just fine (and is still running), gdb seems to be showing that chunks of the memmove and later routines are full of zeros.

I'm also seeing zeros in the middle of routines which I know are being executed constantly
and successfully.

Beats me what's going on here! Maybe, once XIP is active, one can't count on GDB's
memory accesses returning valid data from flash - perhaps only those instructions
which are in the XIP cache are being seen properly by GDB?

So, the "memory is being stomped during startup" appearance may be an artifact of
The Way Things Work, and not the direct cause of the hang.

   0x10015f3a <memcpy+0>:	movs	r3, #0
   0x10015f3c <memcpy+2>:	push	{r4, lr}
   0x10015f3e <memcpy+4>:	cmp	r2, r3
   0x10015f40 <memcpy+6>:	bne.n	0x10015f44 <memcpy+10>
   0x10015f42 <memcpy+8>:	pop	{r4, pc}
   0x10015f44 <memcpy+10>:	ldrb	r4, [r1, r3]
   0x10015f46 <memcpy+12>:	strb	r4, [r0, r3]
   0x10015f48 <memcpy+14>:	adds	r3, #1
   0x10015f4a <memcpy+16>:	b.n	0x10015f3e <memcpy+4>
   0x10015f4c <memmove+0>:	movs	r0, r0
   0x10015f4e <memmove+2>:	movs	r0, r0
   0x10015f50 <memmove+4>:	movs	r0, r0
   0x10015f52 <memmove+6>:	movs	r0, r0
   0x10015f54 <memmove+8>:	movs	r0, r0
   0x10015f56 <memmove+10>:	movs	r0, r0
   0x10015f58 <memmove+12>:	movs	r0, r0
   0x10015f5a <memmove+14>:	movs	r0, r0
   0x10015f5c <memmove+16>:	movs	r0, r0
   0x10015f5e <memmove+18>:	movs	r0, r0
   0x10015f60 <memmove+20>:	movs	r0, r0
   0x10015f62 <memmove+22>:	movs	r0, r0
   0x10015f64 <memmove+24>:	movs	r0, r0
   0x10015f66 <memmove+26>:	movs	r0, r0
   0x10015f68 <memmove+28>:	movs	r0, r0
   0x10015f6a <memmove+30>:	movs	r0, r0
   0x10015f6c <memmove+32>:	movs	r0, r0
   0x10015f6e <memmove+34>:	movs	r0, r0
   0x10015f70 <memmove+36>:	b.n	0x10015f66 <memmove+26>
   0x10015f72 <memset+0>:	movs	r3, r0
   0x10015f74 <memset+2>:	adds	r2, r0, r2
   0x10015f76 <memset+4>:	cmp	r3, r2
   0x10015f78 <memset+6>:	bne.n	0x10015f7c <memset+10>
   0x10015f7a <memset+8>:	bx	lr
   0x10015f7c <memset+10>:	strb	r1, [r3, #0]
   0x10015f7e <memset+12>:	adds	r3, #1
   0x10015f80 <memset+14>:	b.n	0x10015f76 <memset+4>
   0x10015f82 <__file_str_put+0>:	ldr	r3, [r1, #16]
   0x10015f84 <__file_str_put+2>:	ldr	r2, [r1, #20]
   0x10015f86 <__file_str_put+4>:	cmp	r3, r2
   0x10015f88 <__file_str_put+6>:	beq.n	0x10015f90 <__file_str_put+14>
   0x10015f8a <__file_str_put+8>:	adds	r2, r3, #1
   0x10015f8c <__file_str_put+10>:	str	r2, [r1, #16]
   0x10015f8e <__file_str_put+12>:	strb	r0, [r3, #0]
   0x10015f90 <__file_str_put+14>:	bx	lr
   0x10015f92 <putc+0>:	ldrb	r3, [r1, #2]
   0x10015f94 <putc+2>:	movs	r0, r0
   0x10015f96 <putc+4>:	movs	r0, r0
   0x10015f98 <putc+6>:	movs	r0, r0
   0x10015f9a <putc+8>:	movs	r0, r0
   0x10015f9c <putc+10>:	movs	r0, r0
   0x10015f9e <putc+12>:	movs	r0, r0
   0x10015fa0 <putc+14>:	movs	r0, r0
   0x10015fa2 <putc+16>:	movs	r0, r0
   0x10015fa4 <putc+18>:	movs	r0, r0
   0x10015fa6 <putc+20>:	movs	r0, r0
   0x10015fa8 <putc+22>:	movs	r0, r0
   0x10015faa <putc+24>:	movs	r0, r0
   0x10015fac <putc+26>:	movs	r0, r0
   0x10015fae <putc+28>:	movs	r0, r0
   0x10015fb0 <putc+30>:	movs	r0, r0
   0x10015fb2 <putc+32>:	movs	r0, r0
   0x10015fb4 <putc+34>:	movs	r0, r0
   0x10015fb6 <putc+36>:	movs	r0, r0
   0x10015fb8 <putc+38>:	movs	r0, r0
   0x10015fba <__math_uflowf+0>:	movs	r0, r0
   0x10015fbc <__math_uflowf+2>:	movs	r0, r0
   0x10015fbe <__math_uflowf+4>:	movs	r0, r0
   0x10015fc0 <__math_uflowf+6>:	movs	r0, r0
   0x10015fc2 <__math_uflowf+8>:	movs	r0, r0
   0x10015fc4 <__math_uflowf+10>:	movs	r0, r0
   0x10015fc6 <__math_uflowf+12>:	movs	r0, r0
   0x10015fc8 <__aeabi_read_tp+0>:	movs	r0, r0
   0x10015fca <__aeabi_read_tp+2>:	movs	r0, r0
   0x10015fcc <__aeabi_read_tp+4>:	movs	r0, r0
   0x10015fce <__aeabi_read_tp+6>:	movs	r0, r0
   0x10015fd0 <__aeabi_read_tp+8>:	movs	r0, r0
   0x10015fd2 <__aeabi_read_tp+10>:	movs	r0, r0
   0x10015fd4 <__aeabi_read_tp+12>:	movs	r0, r0
   0x10015fd6 <__aeabi_read_tp+14>:	movs	r0, r0
   0x10015fd8 <__handle_samples_veneer+0>:	push	{r0}
   0x10015fda <__handle_samples_veneer+2>:	ldr	r0, [pc, #8]	@ (0x10015fe4 <__handle_samples_veneer+12>)
   0x10015fdc <__handle_samples_veneer+4>:	mov	r12, r0
   0x10015fde <__handle_samples_veneer+6>:	pop	{r0}
   0x10015fe0 <__handle_samples_veneer+8>:	bx	r12
   0x10015fe2 <__handle_samples_veneer+10>:	nop
   0x10015fe4 <__handle_samples_veneer+12>:	movs	r1, #5
   0x10015fe6 <__handle_samples_veneer+14>:	movs	r0, #0
   0x10015fe8 <__flash_do_cmd_veneer+0>:	movs	r0, r0
   0x10015fea <__flash_do_cmd_veneer+2>:	movs	r0, r0
   0x10015fec <__flash_do_cmd_veneer+4>:	movs	r0, r0
   0x10015fee <__flash_do_cmd_veneer+6>:	movs	r0, r0
   0x10015ff0 <__flash_do_cmd_veneer+8>:	movs	r0, r0
   0x10015ff2 <__flash_do_cmd_veneer+10>:	movs	r0, r0
   0x10015ff4 <__flash_do_cmd_veneer+12>:	movs	r0, r0
   0x10015ff6 <__flash_do_cmd_veneer+14>:	movs	r0, r0
   0x10015ff8 <__flash_write_partial_veneer+0>:	movs	r0, r0
   0x10015ffa <__flash_write_partial_veneer+2>:	movs	r0, r0
   0x10015ffc <__flash_write_partial_veneer+4>:	movs	r0, r0
   0x10015ffe <__flash_write_partial_veneer+6>:	movs	r0, r0
   0x10016000 <__flash_write_partial_veneer+8>:	movs	r0, r0
   0x10016002 <__flash_write_partial_veneer+10>:	movs	r0, r0
   0x10016004 <__flash_write_partial_veneer+12>:	movs	r0, r0
   0x10016006 <__flash_write_partial_veneer+14>:	movs	r0, r0
   0x10016008 <__flash_range_erase_veneer+0>:	movs	r0, r0
   0x1001600a <__flash_range_erase_veneer+2>:	movs	r0, r0
   0x1001600c <__flash_range_erase_veneer+4>:	movs	r0, r0
   0x1001600e <__flash_range_erase_veneer+6>:	movs	r0, r0
   0x10016010 <__flash_range_erase_veneer+8>:	movs	r0, r0
   0x10016012 <__flash_range_erase_veneer+10>:	movs	r0, r0
   0x10016014 <__flash_range_erase_veneer+12>:	movs	r0, r0
   0x10016016 <__flash_range_erase_veneer+14>:	movs	r0, r0
   0x10016018 <__flash_range_program_veneer+0>:	movs	r0, r0
   0x1001601a <__flash_range_program_veneer+2>:	movs	r0, r0
   0x1001601c <__flash_range_program_veneer+4>:	movs	r0, r0
   0x1001601e <__flash_range_program_veneer+6>:	movs	r0, r0
   0x10016020 <__flash_range_program_veneer+8>:	movs	r0, r0
   0x10016022 <__flash_range_program_veneer+10>:	movs	r0, r0
   0x10016024 <__flash_range_program_veneer+12>:	movs	r0, r0
   0x10016026 <__flash_range_program_veneer+14>:	movs	r0, r0
   0x10016028:	movs	r0, r0
   0x1001602a:	movs	r0, r0
   0x1001602c:	movs	r0, r0
   0x1001602e:	movs	r0, r0
   0x10016030 <__init___device_dts_ord_26+0>:	movs	r0, r0
   0x10016032 <__init___device_dts_ord_26+2>:	movs	r0, r0
   0x10016034 <__init___device_dts_ord_26+4>:	movs	r0, r0
   0x10016036 <__init___device_dts_ord_26+6>:	movs	r0, r0
   0x10016038 <__init_statics_init_pre+0>:	movs	r0, r0

@DavidCPlatt
Author

More interesting, makes-me-wonder stuff going on.

I took a look at the early-memcopy stuff, and at the registers at the time of the hang, and at the map output from my .elf image build.

It appears that the code as written is trying to copy a portion of the flash image onto itself. Both __flash_rodata_rom_start and __flash_rodata_reloc_start are the same value (0x100001a8). The amount of data being copied is tiny (0xa0 bytes), so it should have taken next to no time. However, at the time GDB connected to the board, r3 held 0x23 (the index
into the data being copied)... so, it's in the middle of copying.

  1. Why (try to) copy the flash RODATA data over itself when there's no relocation being done? Could there be odd side effects from doing so?
  2. Is the code actually hanging in the early-memcopy logic, or is it succeeding, and then crashing-and-restarting over and over at a later time during app startup?

If I can get builds to start failing again reliably, I'll try disabling the startup of various parts of my app and see if one of them seems to be triggering an early crash. mumblemutter... hate Heisenbugs... they never stay put...
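
For reference, pulling the backtrace, the register values, and the map symbols together, the boot-time step boils down to something like the sketch below. This is my reconstruction, not the literal contents of the generated code_relocation.c; the "reloc_end" symbol name and the z_early_memcpy prototype are guesses on my part (the backtrace only shows the call and its arguments):

    /* Reconstruction of the copy seen in frame #2 of the backtrace.
     * In my image both "start" symbols resolve to 0x100001a8 and the
     * region is 0xa0 bytes, so this "copies" a chunk of XIP flash onto
     * itself. */
    #include <stddef.h>

    extern char __flash_rodata_reloc_start[];  /* destination (flash!) */
    extern char __flash_rodata_rom_start[];    /* source (same address) */
    extern char __flash_rodata_reloc_end[];    /* end symbol name assumed */

    /* Prototype assumed to mirror memcpy(); the real one lives in kernel/init.c. */
    extern void z_early_memcpy(void *dst, const void *src, size_t n);

    void data_copy_xip_relocation(void)
    {
        z_early_memcpy(__flash_rodata_reloc_start,
                       __flash_rodata_rom_start,
                       (size_t)(__flash_rodata_reloc_end -
                                __flash_rodata_reloc_start));
    }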

@soburi
Member

soburi commented Dec 7, 2024

This is very mysterious. The linker configuration for the Raspberry Pi Pico is almost the same as Zephyr's default one...
I'll start looking into it, but this might be a challenge...

@DavidCPlatt
Author

OK, this is even weirder than I thought.

Here's the code for memcpy() in my ELF file, as disassembled by objdump:

10014dc4 <memcpy>:
10014dc4:       2300            movs    r3, #0
10014dc6:       b510            push    {r4, lr}
10014dc8:       429a            cmp     r2, r3
10014dca:       d100            bne.n   10014dce <memcpy+0xa>
10014dcc:       bd10            pop     {r4, pc}
10014dce:       5ccc            ldrb    r4, [r1, r3]
10014dd0:       54c4            strb    r4, [r0, r3]
10014dd2:       3301            adds    r3, #1
10014dd4:       e7f8            b.n     10014dc8 <memcpy+0x4>

That all looks reasonable to me.

Now, here's what I see in memory on the board, after I load the image, it hangs during startup, and I attach gdb:

Dump of assembler code for function memcpy:
   0x10014dc4 <+0>:	movs	r3, #0
   0x10014dc6 <+2>:	push	{r4, lr}
=> 0x10014dc8 <+4>:	cmp	r2, r3
   0x10014dca <+6>:	bne.n	0x10014dce <memcpy+10>
   0x10014dcc <+8>:	pop	{r4, pc}
   0x10014dce <+10>:	ldrb	r4, [r1, r3]
   0x10014dd0 <+12>:	strb	r6, [r3, r3]
   0x10014dd2 <+14>:	strb	r3, [r7, #15]
   0x10014dd4 <+16>:	b.n	0x10014dc8 <memcpy+4>

I cannot currently account for the difference... but since R3 is no longer being incremented, it's
no surprise that the startup code hangs itself. R3 was initialized to zero at the start of the
routine, and it's now stuck at 2... so, it seems to have been properly incremented the first
couple of times around the loop, and then the increment instruction got lost somewhere
in the ozone and there have been no further increments. I've single-stepped around the memcpy loop a few times and R3 simply isn't changing.

The first of the two corrupted instructions went from "strb r4, [r0, r3]" to "strb r6, [r3, r3]".
The second went from "adds r3, #1" to "strb r3, [r7, #15]". It kinda feels as if bits are being
flipped from 0 to 1, somehow?
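
Checking that hunch against the Thumb encodings (worked out by hand here, so treat the exact hex values with appropriate suspicion): every changed bit goes from 0 to 1.

    /* Hand-encoded Thumb opcodes for the original vs. corrupted instructions:
     *   strb r4, [r0, r3]  = 0x54c4   ->  strb r6, [r3, r3]  = 0x54de
     *   adds r3, #1        = 0x3301   ->  strb r3, [r7, #15] = 0x73fb
     * In both cases (original & (original ^ corrupted)) == 0, i.e. the
     * corruption only ever sets bits that were previously clear. */
    #include <assert.h>

    int main(void)
    {
        unsigned orig1 = 0x54c4, bad1 = 0x54de;
        unsigned orig2 = 0x3301, bad2 = 0x73fb;

        assert((orig1 & (orig1 ^ bad1)) == 0);  /* only 0 -> 1 flips */
        assert((orig2 & (orig2 ^ bad2)) == 0);  /* only 0 -> 1 flips */
        return 0;
    }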

I looked through the image, and fgrep doesn't find "strb r3, [r7, #15]" anywhere at all.
So, this argues against "valid instruction read from the wrong place" and points more towards
"corrupted instruction".

I do recall that there are all sorts of dire warnings in the RP2040 data sheet about
writing to the flash... one must not do it while running code out of XIP, or
Very Bad Things Will Happen. The flash drivers are required to have some code loaded
in RAM that they can jump to, to perform the erasure and copying.
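
(Roughly, the pattern the data sheet requires looks like the sketch below. This is a generic illustration of "run the flash-programming helper from RAM", not the actual Zephyr flash_rpi_pico driver, and the ".ramfunc" section name is a placeholder for whatever the linker script calls its RAM-resident code section.)

    /* Generic illustration only: anything that erases/programs the SPI
     * flash must itself execute from SRAM, because XIP fetches from that
     * same flash are not reliable while a program/erase command is in
     * flight. */
    #include <stddef.h>
    #include <stdint.h>

    __attribute__((noinline, used, section(".ramfunc")))  /* section name assumed */
    static void flash_program_from_ram(uint32_t offset,
                                       const uint8_t *data, size_t len)
    {
        /* ...issue the erase/program commands to the flash here...
         * Nothing this function touches (including anything it calls)
         * may be fetched from XIP flash until the operation completes. */
        (void)offset;
        (void)data;
        (void)len;
    }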

Perhaps this "dummy" copy of a range of flash, over itself, is triggering some
lockout activity in the SPI-flash chip which is causing a few XIP fetches to fall on
their tail and return gibberish?

@DavidCPlatt
Author

Aha! The plot thickens. I think I've identified the culprit.

I did a dump of the address range of the __flash_rodata_rom_start stuff which is being
copied over itself... and lo and behold:

0x100001a8 <credits>:	0xb4	0x01	0x00	0x10	0x00	0x00	0x00	0x00
0x100001b0 <credits+8>:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x100001b8:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00
0x100001c0:	0x00	0x00	0x00	0x00	0x00	0x00	0x00	0x00

and I recognize the symbol name. "credits" is a constant text string that I compiled into
my application, which I want in the image even if nothing refers to it (it's a GPL/Berkeley
license conformance thing).

I'd figured out a way to compel the linker to always include it: in my CMakeLists, I have

# Force the credits data to be loaded into the image,
# even though there's no reference to it.
zephyr_code_relocate(FILES src/credits.c LOCATION FLASH)

and that, almost certainly, is what's causing the "relocate some const-in-flash right over
itself" behavior, and (I strongly suspect) driving the startup code into undefined
behavior by "writing to flash".

So - short term fix for me is going to be a big Don't Do That. I'll figure out a way to
force a reference to the credits array from some live code somewhere, so that the
linker doesn't drop the symbol out as unused, and I'll get rid of that code relocation
directive.

Unfortunately, saying LOCATION ROM instead of LOCATION FLASH doesn't work.

With the relocation directive removed from the CMakeLists file, the __flash_rodata_rom_start and
related symbols are gone from the image... so, no more code to do a bad copy. The app
starts up OK even with all of my time-sensitive code forced into RAM.
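
For what it's worth, one way to keep the credits array without either the relocation directive or an artificial reference from live code would be compiler/linker "keep" attributes. A sketch (the "retain" attribute needs GCC 11+ and reasonably recent binutils, and the string contents here are just a stand-in):

    /* credits.c - keep the license/credits text in the image even though
     * nothing references it.  "used" stops the compiler discarding it;
     * "retain" marks its section SHF_GNU_RETAIN so the linker's
     * --gc-sections keeps it as well. */
    __attribute__((used, retain))
    const char credits[] =
        "This firmware incorporates code under GPL and BSD licenses...";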

@DavidCPlatt
Author

I looked through the Zephyr documentation last night, and tried an experiment: putting
back in the relocation clause, but specifying NOCOPY:

# Force the credits data to be loaded into the image,
# even though there's no reference to it.
zephyr_code_relocate(FILES src/credits.c LOCATION FLASH NOCOPY)

I thought this might prevent the __flash_rodata_rom_start and related symbols from
being created, and the early-memcopy from being included in code_relocation.c

I was half-correct. The __flash_rodata_rom_start and __flash_rodata_reloc_start symbols
are still generated for the credits strings... but the memcpy call isn't present in
code_relocation.c and so there's no problem at run-time.

At this point, I think the cause of the boot-time crash is confirmed:

  • On the RP2040, trying to copy to flash at boot time is unpredictably fatal, depending on the state of the XIP cache. It "ought to be" a no-op, but the side effects are Bad.
  • Specifying a "LOCATION FLASH" relocation target for a file, in CMakeLists.txt, will result in the generation of code which does this unpredictably-fatal operation.
  • Specifying "LOCATION FLASH NOCOPY" still leaves some symbol artifacts in the image (ROM source, flash target, and length) but the improper memcpy() call is not generated.

So, I think this bug could be resolved in either of two ways:

  1. Close it as "user error, inadequate understanding of the tools".
  2. Treat it as a request for improvement. Since "LOCATION FLASH" (without "NOCOPY") results in a useless-and-potentially-fatal memcpy() at boot time on the RP2040, flag this as an error on that platform and fail the build. "LOCATION FLASH NOCOPY" is still useful (it's an easy way to get a KEEP for the contents of the file, which is why I wanted it) so I wouldn't suggest disallowing it.

I'll leave it to you to decide how to resolve this.

Thanks for your assistance! It's been a fun bug-hunt.

@fabiobaltieri fabiobaltieri added the priority: low label Dec 10, 2024
@fabiobaltieri
Member

Can this be closed down?

@DavidCPlatt
Author

I'll close this one.

I'll educate myself on how the platform-specific linker scripts and runtime code for Zephyr work. If it turns out to be easy to add some sort of safety check against this particular condition, I'll see if I can do so and offer up a patch.

@DavidCPlatt
Author

Closing as "user error". There's room for improvement in how Zephyr handles this improper condition, and I'll see if I can offer up a patch at some point.

@DavidCPlatt DavidCPlatt closed this as not planned on Dec 11, 2024