Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

esp8266: Hangs when erasing spi sector on mtd0 if using esp_wifi #16281

Closed
iosabi opened this issue Apr 5, 2021 · 16 comments · Fixed by #17080
Closed

esp8266: Hangs when erasing spi sector on mtd0 if using esp_wifi #16281

iosabi opened this issue Apr 5, 2021 · 16 comments · Fixed by #17080
Assignees
Labels
Area: cpu Area: CPU/MCU ports Platform: ESP Platform: This PR/issue effects ESP-based platforms Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)

Comments

@iosabi
Copy link
Contributor

iosabi commented Apr 5, 2021

Description

esp8266 hangs when the esp_wifi module is built in and one attempts to format a spiffs on mtd0.

I'm using the ESP12x board (in my case the Adafruit Huzzah breakout board without anything connected to it other than the UART).

Steps to reproduce the issue

  1. Download xtensa-esp8266-elf-gcc and make it available in your PATH
  2. Download ESP RTOS SDK to /opt/esp/ESP8266-RTOS-SDK
  3. Compile the following example:
USEMODULE="esp_spiffs esp_wifi lwip_netdev lwip_arp" make BOARD=esp8266-esp-12x -C examples/filesystem all flash
  1. Connect with your favorite terminal. Check that the board is working (type help for example).
  2. Run format (if it runs ok then rung format again).

Expected results

/sda successfully formatted

Actual results

The format command will most likely hang the board. Sometimes it doesn't hang the board the first time, just run it again in that case, but ~90% it does. The board doesn't respond anymore and after a few seconds it restarts with the following message:

 ets Jan  8 2013,rst cause:4, boot mode:(3,6)

wdt reset

Note that the WiFi connection is not established. There's no AP around with the default name (RIOT_AP), but you would see sporadically messages like this in the console:

WiFi disconnected from ssid RIOT_AP, reason 201 (NO_AP_FOUND)

Versions

Operating System: "Ubuntu" "20.04.2 LTS (Focal Fossa)"

xtensa-esp8266-elf-gcc: xtensa-esp8266-elf-gcc (crosstool-NG crosstool-ng-1.22.0-80-g6c4433a5) 5.2.0

Extra info

The following examples work fine:

USEMODULE="esp_spiffs" make BOARD=esp8266-esp-12x -C examples/filesystem flash

The spiffs test behaves the same way. It works ok with just "esp_spiffs", but hangs when using esp_wifi as well.

CFLAGS=-DCONFIG_USE_HARDWARE_MTD USEMODULE="esp_spiffs" make BOARD=esp8266-esp-12x -C tests/pkg_spiffs flash

Note: tests/pkg_spiffs is a bit slow because it erases the whole device.

@iosabi
Copy link
Contributor Author

iosabi commented Apr 5, 2021

Maybe @gschorcht has some info.

Debugging

I enabled ENABLE_DEBUG in cpu/esp_common/periph/flash.c and I can see that the device hangs in the erase function. Adding a few printf I see that it always hangs inside spi_flash_erase_sector.

My guess is that this is related to a WiFi-related ISR triggering while we are waiting for the flash to erase a block. Erasing a block happens in the ROM function SPI_sector_erase, which is inside a block of Cache_Read_Disable_2 and Cache_Read_Enable_2 when called from spi_flash_erase_sector.

I listed the ISR handlers in the _xt_interrupt_table and handlers 8 (ETS_WDT_INUM, registered to esp_task_wdt_isr) and 10 (registered to ets_timer_handler_dsr) are the only ones in 0x402xxxxx address range (IROM - flash), while others are in IRAM. I don't know if this makes a difference.

I replaced ISR 10 with a wrapper in IRAM that raises a GPIO, then calls the original handler and then lowers the GPIO after returning from the handler. I can see ISR 10 firing every 100ms in the esp_wifi case. This doesn't occur at all when esp_wifi module is not compiled (the ISR 10 is also not registered to anything).

My guess would be that the CPU is trying to execute something from flash (IROM) but the flash is disabled / not mapped or something like that which causes the CPU to stall, and my guess would be that this code is the ISR 10; however I don't see the GPIO high when the CPU hangs which would indicate that's not inside the ISR 10.

Any other idea of where could it be hanging? I couldn't get the JTAG debugger to work so I don't know what's the CPU doing before WDT triggers. I'll try to debug a bit more but I'm out of ideas.

@jeandudey jeandudey added Area: cpu Area: CPU/MCU ports Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) labels Apr 13, 2021
@aabadie aabadie added the Platform: ESP Platform: This PR/issue effects ESP-based platforms label May 20, 2021
@MrKevinWeiss MrKevinWeiss added this to the Release 2021.07 milestone Jun 21, 2021
@MrKevinWeiss MrKevinWeiss removed this from the Release 2021.07 milestone Jul 15, 2021
@gschorcht
Copy link
Contributor

My guess is that this is related to a WiFi-related ISR triggering while we are waiting for the flash to erase a block. Erasing a block happens in the ROM function SPI_sector_erase, which is inside a block of Cache_Read_Disable_2 and Cache_Read_Enable_2 when called from spi_flash_erase_sector.

That's right, the IROM cache and thus the access to instructions from SPI flash is disabled for any SPI flash write access to avoid conflicts with read accesses from the IROM cache controller. Since erasing a sector takes some time, it is quite possible that the WiFi interface generates some interrupt while the IROM cache is disabled. However, since ISRs should usually be realized in IRAM (to be fast enough and always accessible) that shouldn't be a problem. Furthermore, Interrupts from the WiFi interface are not handled directly, rather the ISR simply generates an event that is queued and handled asynchronously as quickly as possible by one of the WiFi tasks, ppT, pmT or rtT.

I listed the ISR handlers in the _xt_interrupt_table and handlers 8 (ETS_WDT_INUM, registered to esp_task_wdt_isr) and 10 (registered to ets_timer_handler_dsr) are the only ones in 0x402xxxxx address range (IROM - flash), while others are in IRAM. I don't know if this makes a difference.

IRAM access is much faster than IROM access using a cache. Furthermore, the code from IRAM is always available while the code from IROM is not, e.g., because the cache is disabled. Therefore, ISRs should be implemented in IRAM whenever possible, unless their execution is not time-critical. Since the IRAM is small, as much code as possible is placed in the IROM, probably the two ISRs you mentioned are not time critical.

If I remember correctly, I have observed that IROM isn't available sometimes when the WiFi interface is used. Maybe the WiFi hardware is connected internally using SPI or disables the IROM cache for any reason.

@gschorcht
Copy link
Contributor

gschorcht commented Oct 25, 2021

Note that the WiFi connection is not established. There's no AP around with the default name (RIOT_AP), but you would see sporadically messages like this in the console:

WiFi disconnected from ssid RIOT_AP, reason 201 (NO_AP_FOUND)

This message is OK and only says that the association could not be established because the configured AP with the BSSID RIOT_AP could not be found.

I replaced ISR 10 with a wrapper in IRAM that raises a GPIO, then calls the original handler and then lowers the GPIO after returning from the handler. I can see ISR 10 firing every 100ms in the esp_wifi case. This doesn't occur at all when esp_wifi module is not compiled (the ISR 10 is also not registered to anything).

FRC2 is used by the WiFi module to assert interrupt 10 on a regular base. However, interrupt 10 shouldn't be asserted at all when function spi_flash_erase_sector is running since all interrupts are then disabled with exception of the NMI (interrupt 14).

Just for clearificaition, could you observe that interrupt 10 is triggered regularly until the format command is executed, and that it is not triggered again once the format command has hung up?

@iosabi
Copy link
Contributor Author

iosabi commented Oct 25, 2021

Correct, int 10 is not triggered anymore after running format. I think that in general the thing just freezes.

I added a bit more of debug (using GPIOs with GPIO.OUT_SET = BIT(n) to avoid issues while flash is disabled) and I can tell that _panic_handler() is being called. There's no message on the terminal (for some reason ets_printf() doesn't work in that context, probably eventually uses a function from flash which is disabled) but after some more work I could print the address from where _xt_panic was called: 0x40100013 which is the call from _DebugExceptionVector. My guess here was that we are hitting a break assembly instruction somewhere else.

I attached the JTAG debugger and everything runs fine until I execute format, there we break to the debugger on the break 1, 4 at 0x40100070 (_DoubleExceptionVector). If I comment out that break instruction and retry again, then I see that _xt_panic is called from 0x40100070 (the call after the instruction we just removed).

So what I can conclude from this is that _DoubleExceptionVector is being called. If I understand correctly on a double exception the PC of the instruction that caused it should be in the DEPC register. Here I see this coming from this s32i store to the stack in _xt_user_exc, the stack when entering the DoubleException is 0x3ff7ffb0 which doesn't map to any reasonable address.

I dump the RAM region to a file (0x3ffe8000, len 0x18000) and I see a repeating pattern of 0xa0 bytes (not exactly the same, but clearly a pattern) from the beginning of the dump up to about the address of .NMIHandlerStackTop, so my guess here is that the double exception is caused by a stack overflow in the NMI stack due to an infinite recursion. Writing to 0x3ff80000-0x3ffe8000 probably doesn't cause a double exception, so it fails on the first write before that. The NMI stack is only 0x200 bytes long.

The repeating pattern is probably the interrupt stack frame created by _xt_user_exc, which somewhere causes an interrupt to be generated on the same stack, adding another stack frame. I see on the ram dump a pointer to the address where the instruction call0 _xt_context_restore about 100 lines below is, this is the return address from the callx0 a4 that calls the user handler. As I mentioned earlier the handler for int 10 is in flash, so that call will probably generate some exception, but this is a bit more speculation. I tried placing ets_timer.o in IRAM and I still get this problem. I don't know enough of the way stacks are handled in nested interrupts in xtensa, sorry.

I think all interrupts and all code the interrupts call should be placed in IRAM (or boot rom) since interrupts can be called at any point, even with the flash cache disabled, regardless of whether they are time critical.

@gschorcht
Copy link
Contributor

@iosabi Thank you very much for this detailed debugging. I'm really impressed, I've never been that lucky to get openocd running with JTAG to be able to investigate issues in such detail.

There's no message on the terminal (for some reason ets_printf() doesn't work in that context,

That's true because ets_printf uses putchar which is placed in IROM. Maybe it makes sense also to place putchar in IRAM section *libg.a:*putchar.o(.literal .text .literal.* .text.*).

As I mentioned earlier the handler for int 10 is in flash, so that call will probably generate some exception, but this is a bit more speculation.

I wonder why INT10 should be handled at all, since all interrupts are disabled during SPI flash operations. Are you sure that is still handled by ets_timer_handler_isr and crashes while being in spi_flash_erase_sector function?

@iosabi
Copy link
Contributor Author

iosabi commented Oct 28, 2021

No, I'm not sure what's causing the interrupt handler (_xt_user_exc) to run at all, but I also didn't see where the int10 (or all) is being disabled during the erase. Moving ets_timer.o to IRAM didn't help, so maybe there is some other failure. The failure I see is an overflow in the NMI stack, so maybe the NMI is being executed during the erase. As I said, this was more on the speculation side. I'll try to add some debug print outs to the assembly in startup.S to try trace the code.

@gschorcht
Copy link
Contributor

gschorcht commented Oct 29, 2021

Finally I could find out what the problem is. The function vTaskEnterCritical, which is also used by the ESP8266 spi_flash implementation to disable interrupts, is placed in the IROM.

In

*freertos/*(.literal .text .literal.* .text.*)
the functions implemented in cpu/esp8266/freetos/*.c are placed in IRAM, but functions implemented in cpu/esp_common/freetos/*.c are not placed in IRAM. The reason is that the object files for these files are created in the directory esp_freertos_common instead of something like *freertos, because the library is called esp_freertos_common.

Adding *esp_freertos_common/*.o(.literal .text .literal.* .text.*) to the IRAM section solves the problem. This might also be the cause for other instabilities. I will provide a PR.

@gschorcht
Copy link
Contributor

@iosabi This issue was closed automatically by merging #17080. Please test it again and reopen this issue if you think that the issue isn't solved.

@iosabi
Copy link
Contributor Author

iosabi commented Oct 30, 2021

Can't repro anymore after that patch. Sounds like it fixed it. I see 1332 bytes more in IRAM with that patch in the examples/filesystem which is pretty reasonable. Thanks for getting to the bottom of this!

I think we need to have a way to produce a more obvious crash when trying to run code from flash if the cache is disabled. This will likely happen again. Maybe a break instruction and a comment next to it with information about how to find out what flash address caused the fault to quickly know which function should be moved to IRAM.

@gschorcht
Copy link
Contributor

gschorcht commented Oct 30, 2021

Thanks for getting to the bottom of this!

I must thank you for finding this problem and investigating it. Perhaps it was also the cause for other instabilities, especially when using esp_wifi.

Maybe a break instruction and a comment next to it with information about how to find out what flash address caused the fault to quickly know which function should be moved to IRAM.

What break instruction should be used and especially where should it be placed. If understand you correctly, the break instruction should be executed each time a function in IROM is called while the cache is disabled.

@iosabi
Copy link
Contributor Author

iosabi commented Oct 30, 2021

Any break instruction works. xtensa ISA doc says that break 8, x to break 15, x are all user-defined.

I believe that if you try to execute from IROM when the cache is disabled you get an exception, probably NMI but haven't check, so we should add code that checks this condition and panics with a reasonable message or breaks in the debugger.

@iosabi
Copy link
Contributor Author

iosabi commented Oct 31, 2021

@gschorcht so, there are quite a few problems going on here.

  1. Nothing happens when accessing flash memory (read from 0x402xxxxx) when the flash cache is disabled. Nothing. You just read 0.
  2. 00 00 00 happens to be the opcode for ill, the illegal instruction so trying to execute from flash with the cache disable at least produces an IllegalInstructionCause exception. The PC will point to the ill instruction and there's no problem fetching it.
  3. exception_handler() is in IRAM, but calls functions in flash, in particular ps() and heap_stats(). In theory it should also call putchar() which is also in flash, but ets_printf() doesn't ever call it (see 5).
  4. ets_printf(), the .text function code is in IRAM (See
    *esp_idf_esp8266/ets_printf.o(.literal .text .literal.* .text.*)
    ) but, the xtensa-esp8266-elf toolchain places the literal strings in .rodata, not in .literal, so ets_printf doesn't actually work with the cache disabled because all its const data is not placed in IRAM. It is probably useful to have that function in RAM anyway.
  5. Going one step back, exception_handler() is tagged with IRAM in the source code (not mentioned in the linker script), but the string literals it uses don't end up in IRAM, those are in .rodata which according to the .map file end up in flash, and that includes the format strings we pass to ets_printf(). Reading those strings from flash is not a problem (not an exception), it just reads 0s which happens to be a valid empty C string and ets_printf() happily prints it without ever calling putchar, which is why we get no output on the serial but also no crash in putchar().
  6. Even after the specific fix for vTaskEnterCritical, causing any exception while the flash cache is disabled would trigger the infinite exception.

iosabi added a commit to iosabi/RIOT that referenced this issue Oct 31, 2021
When an exception occurs while handling an exception with
exception_handler() this function will be called again, increasing the
exception stack until the whole stack and the memory before it is
exhausted and a DoubleException occurs when trying to use an address in
the `0x3ff7ffxx` range, far away from any clue to the initial exception.

A common cause for an exception in exception_handler() is to try to
execute a function from IROM while the cache is disabled. This normally
results in an illegal instruction exception. A few of the functions
called from the exception_handler() are in IROM, triggering this
situation if an exception occurs while the flash cache is disabled.

This patch breaks to the debugger immediately if exception_handler is
called twice, so the location of the function causing the double
exception can be obtained in frame->pc, and the frame in that context
can be inspected too.

Second, to address the issue of printing the error message in the
terminal this patch enables the flash cache before calling ets_printf().
While exception_handler() is in IRAM, its read-only literal strings are
not, so calling ets_printf() will not work.

With this patch and without the fix in RIOT-OS#17080, reproducing the
conditions in issue RIOT-OS#16281 allows us to quickly identify the PC of the
function that was causing the issue (`vTaskEnterCritical`) on the first
place.
@gschorcht
Copy link
Contributor

2. IllegalInstructionCause exception

Yes, that is exactly what I saw when the cache was disabled when calling a function in IROM, but after not working on the ESP8266 port for at least 2 years, I forgot.

3. exception_handler() is in IRAM, but calls functions in flash, in particular ps() and heap_stats(). In theory it should also call putchar() which is also in flash, but ets_printf() doesn't ever call it (see 5).

That is, they would have to be placed in IRAM too.

4. xtensa-esp8266-elf toolchain places the literal strings in .rodata, not in .literal, so ets_printf doesn't actually work with the cache

Ah, thats why the original linker script includes .rodata for ets_printf.o:

#ifdef CONFIG_LINK_ETS_PRINTF_TO_IRAM
    *libesp8266.a:ets_printf.o(.literal .text .literal.* .text.* .rodata.* .rodata)
#endif

Finally, we would have to extend esp8266.riot-os.ld in IRAM section by:

diff --git a/cpu/esp8266/ld/esp8266.riot-os.ld b/cpu/esp8266/ld/esp8266.riot-os.ld
index e99aa855fe..06df3af98d 100644
--- a/cpu/esp8266/ld/esp8266.riot-os.ld
+++ b/cpu/esp8266/ld/esp8266.riot-os.ld
@@ -233,7 +233,7 @@ SECTIONS
     /* SDK libraries that expect their .text or .data sections to link to iram */
     /* TODO *libcore.a:(.bss .data .bss.* .data.* COMMON) */
     *esp_idf_spi_flash/spi_flash_raw.o(.literal .text .literal.* .text.*)
-    *esp_idf_esp8266/ets_printf.o(.literal .text .literal.* .text.*)
+    *esp_idf_esp8266/ets_printf.o(.literal .text .literal.* .text.* .rodata.* .rodata)
     /*
     *cpu.a:*.o(.literal .text .literal.* .text.*)
     */
@@ -241,6 +241,8 @@ SECTIONS
     *esp_wifi/*(.literal .text .literal.* .text.*)
     *freertos/*(.literal .text .literal.* .text.*)
     *periph/*(.literal .text .literal.* .text.*)
+    *ps/*(.literal .text .literal.* .text.* .rodata.* .rodata)
+    *newlib_syscalls_default/*(.literal .text .literal.* .text.* .rodata.* .rodata)
     *xtimer/*(.literal .text .literal.* .text.*)
 
     *libhal.a:clock.o(.literal .text .literal.* .text.*)
@@ -287,6 +289,7 @@ SECTIONS
     *libc.a:*findfp.o(.literal .text .literal.* .text.*)
     *libc.a:*fputwc.o(.literal .text .literal.* .text.*)
     */
+    *libg.a:*putchar.o(.literal .text .literal.* .text.*)
 
     *enc28j60/*(.literal .text .literal.* .text.*)

@gschorcht
Copy link
Contributor

I would prefer to place everything in IRAM. But with 48 kByte it is so small that we have to limit what goes into the IRAM. It always requires a bit of trial and error what absolutely has to go into the IRAM.

@iosabi
Copy link
Contributor Author

iosabi commented Oct 31, 2021

Why do we have both printf() in newlib and ets_printf()?

@gschorcht
Copy link
Contributor

If I remember correctly, there were the following reasons:

  1. The SDK redefines ets_printf which is usually a ROM function for libraries that require ets_printf. ets_printf source code in SDK gives some explanation. We just use a slightly adapted version.
  2. ets_printf is the only printf function that should work even if the IROM cache is disabled, which is not the case at the moment, as we learned.
  3. ets_printf is the only printf function that works before the IROM cache is enabled during startup.
  4. ets_printf was intended to work also inside interrupt context. The newlib in xtensa-gcc uses the reentrant version of printf which uses lock functions which in turn use mutexes. RIOT's mutex, however, doesn't work in interrupt context.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: cpu Area: CPU/MCU ports Platform: ESP Platform: This PR/issue effects ESP-based platforms Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants