esp8266: Hangs when erasing spi sector on mtd0 if using esp_wifi #16281

iosabi · 2021-04-05T20:24:15Z

Description

esp8266 hangs when the esp_wifi module is built in and one attempts to format a spiffs on mtd0.

I'm using the ESP12x board (in my case the Adafruit Huzzah breakout board without anything connected to it other than the UART).

Steps to reproduce the issue

Download xtensa-esp8266-elf-gcc and make it available in your PATH
Download ESP RTOS SDK to /opt/esp/ESP8266-RTOS-SDK
Compile the following example:

USEMODULE="esp_spiffs esp_wifi lwip_netdev lwip_arp" make BOARD=esp8266-esp-12x -C examples/filesystem all flash

Connect with your favorite terminal. Check that the board is working (type help for example).
Run format (if it runs ok then rung format again).

Expected results

/sda successfully formatted

Actual results

The format command will most likely hang the board. Sometimes it doesn't hang the board the first time, just run it again in that case, but ~90% it does. The board doesn't respond anymore and after a few seconds it restarts with the following message:

 ets Jan  8 2013,rst cause:4, boot mode:(3,6)

wdt reset

Note that the WiFi connection is not established. There's no AP around with the default name (RIOT_AP), but you would see sporadically messages like this in the console:

WiFi disconnected from ssid RIOT_AP, reason 201 (NO_AP_FOUND)

Versions

Operating System: "Ubuntu" "20.04.2 LTS (Focal Fossa)"

xtensa-esp8266-elf-gcc: xtensa-esp8266-elf-gcc (crosstool-NG crosstool-ng-1.22.0-80-g6c4433a5) 5.2.0

Extra info

The following examples work fine:

USEMODULE="esp_spiffs" make BOARD=esp8266-esp-12x -C examples/filesystem flash

The spiffs test behaves the same way. It works ok with just "esp_spiffs", but hangs when using esp_wifi as well.

CFLAGS=-DCONFIG_USE_HARDWARE_MTD USEMODULE="esp_spiffs" make BOARD=esp8266-esp-12x -C tests/pkg_spiffs flash

Note: tests/pkg_spiffs is a bit slow because it erases the whole device.

The text was updated successfully, but these errors were encountered:

iosabi · 2021-04-05T22:20:10Z

Maybe @gschorcht has some info.

Debugging

I enabled ENABLE_DEBUG in cpu/esp_common/periph/flash.c and I can see that the device hangs in the erase function. Adding a few printf I see that it always hangs inside spi_flash_erase_sector.

My guess is that this is related to a WiFi-related ISR triggering while we are waiting for the flash to erase a block. Erasing a block happens in the ROM function SPI_sector_erase, which is inside a block of Cache_Read_Disable_2 and Cache_Read_Enable_2 when called from spi_flash_erase_sector.

I listed the ISR handlers in the _xt_interrupt_table and handlers 8 (ETS_WDT_INUM, registered to esp_task_wdt_isr) and 10 (registered to ets_timer_handler_dsr) are the only ones in 0x402xxxxx address range (IROM - flash), while others are in IRAM. I don't know if this makes a difference.

I replaced ISR 10 with a wrapper in IRAM that raises a GPIO, then calls the original handler and then lowers the GPIO after returning from the handler. I can see ISR 10 firing every 100ms in the esp_wifi case. This doesn't occur at all when esp_wifi module is not compiled (the ISR 10 is also not registered to anything).

My guess would be that the CPU is trying to execute something from flash (IROM) but the flash is disabled / not mapped or something like that which causes the CPU to stall, and my guess would be that this code is the ISR 10; however I don't see the GPIO high when the CPU hangs which would indicate that's not inside the ISR 10.

Any other idea of where could it be hanging? I couldn't get the JTAG debugger to work so I don't know what's the CPU doing before WDT triggers. I'll try to debug a bit more but I'm out of ideas.

gschorcht · 2021-10-21T13:06:04Z

My guess is that this is related to a WiFi-related ISR triggering while we are waiting for the flash to erase a block. Erasing a block happens in the ROM function SPI_sector_erase, which is inside a block of Cache_Read_Disable_2 and Cache_Read_Enable_2 when called from spi_flash_erase_sector.

That's right, the IROM cache and thus the access to instructions from SPI flash is disabled for any SPI flash write access to avoid conflicts with read accesses from the IROM cache controller. Since erasing a sector takes some time, it is quite possible that the WiFi interface generates some interrupt while the IROM cache is disabled. However, since ISRs should usually be realized in IRAM (to be fast enough and always accessible) that shouldn't be a problem. Furthermore, Interrupts from the WiFi interface are not handled directly, rather the ISR simply generates an event that is queued and handled asynchronously as quickly as possible by one of the WiFi tasks, ppT, pmT or rtT.

I listed the ISR handlers in the _xt_interrupt_table and handlers 8 (ETS_WDT_INUM, registered to esp_task_wdt_isr) and 10 (registered to ets_timer_handler_dsr) are the only ones in 0x402xxxxx address range (IROM - flash), while others are in IRAM. I don't know if this makes a difference.

IRAM access is much faster than IROM access using a cache. Furthermore, the code from IRAM is always available while the code from IROM is not, e.g., because the cache is disabled. Therefore, ISRs should be implemented in IRAM whenever possible, unless their execution is not time-critical. Since the IRAM is small, as much code as possible is placed in the IROM, probably the two ISRs you mentioned are not time critical.

If I remember correctly, I have observed that IROM isn't available sometimes when the WiFi interface is used. Maybe the WiFi hardware is connected internally using SPI or disables the IROM cache for any reason.

gschorcht · 2021-10-25T13:31:59Z

Note that the WiFi connection is not established. There's no AP around with the default name (RIOT_AP), but you would see sporadically messages like this in the console:
WiFi disconnected from ssid RIOT_AP, reason 201 (NO_AP_FOUND)

This message is OK and only says that the association could not be established because the configured AP with the BSSID RIOT_AP could not be found.

I replaced ISR 10 with a wrapper in IRAM that raises a GPIO, then calls the original handler and then lowers the GPIO after returning from the handler. I can see ISR 10 firing every 100ms in the esp_wifi case. This doesn't occur at all when esp_wifi module is not compiled (the ISR 10 is also not registered to anything).

FRC2 is used by the WiFi module to assert interrupt 10 on a regular base. However, interrupt 10 shouldn't be asserted at all when function spi_flash_erase_sector is running since all interrupts are then disabled with exception of the NMI (interrupt 14).

Just for clearificaition, could you observe that interrupt 10 is triggered regularly until the format command is executed, and that it is not triggered again once the format command has hung up?

iosabi · 2021-10-25T23:45:23Z

Correct, int 10 is not triggered anymore after running format. I think that in general the thing just freezes.

I added a bit more of debug (using GPIOs with GPIO.OUT_SET = BIT(n) to avoid issues while flash is disabled) and I can tell that _panic_handler() is being called. There's no message on the terminal (for some reason ets_printf() doesn't work in that context, probably eventually uses a function from flash which is disabled) but after some more work I could print the address from where _xt_panic was called: 0x40100013 which is the call from _DebugExceptionVector. My guess here was that we are hitting a break assembly instruction somewhere else.

I attached the JTAG debugger and everything runs fine until I execute format, there we break to the debugger on the break 1, 4 at 0x40100070 (_DoubleExceptionVector). If I comment out that break instruction and retry again, then I see that _xt_panic is called from 0x40100070 (the call after the instruction we just removed).

So what I can conclude from this is that _DoubleExceptionVector is being called. If I understand correctly on a double exception the PC of the instruction that caused it should be in the DEPC register. Here I see this coming from this s32i store to the stack in _xt_user_exc, the stack when entering the DoubleException is 0x3ff7ffb0 which doesn't map to any reasonable address.

I dump the RAM region to a file (0x3ffe8000, len 0x18000) and I see a repeating pattern of 0xa0 bytes (not exactly the same, but clearly a pattern) from the beginning of the dump up to about the address of .NMIHandlerStackTop, so my guess here is that the double exception is caused by a stack overflow in the NMI stack due to an infinite recursion. Writing to 0x3ff80000-0x3ffe8000 probably doesn't cause a double exception, so it fails on the first write before that. The NMI stack is only 0x200 bytes long.

The repeating pattern is probably the interrupt stack frame created by _xt_user_exc, which somewhere causes an interrupt to be generated on the same stack, adding another stack frame. I see on the ram dump a pointer to the address where the instruction call0 _xt_context_restore about 100 lines below is, this is the return address from the callx0 a4 that calls the user handler. As I mentioned earlier the handler for int 10 is in flash, so that call will probably generate some exception, but this is a bit more speculation. I tried placing ets_timer.o in IRAM and I still get this problem. I don't know enough of the way stacks are handled in nested interrupts in xtensa, sorry.

I think all interrupts and all code the interrupts call should be placed in IRAM (or boot rom) since interrupts can be called at any point, even with the flash cache disabled, regardless of whether they are time critical.

gschorcht · 2021-10-26T07:56:02Z

@iosabi Thank you very much for this detailed debugging. I'm really impressed, I've never been that lucky to get openocd running with JTAG to be able to investigate issues in such detail.

There's no message on the terminal (for some reason ets_printf() doesn't work in that context,

That's true because ets_printf uses putchar which is placed in IROM. Maybe it makes sense also to place putchar in IRAM section *libg.a:*putchar.o(.literal .text .literal.* .text.*).

As I mentioned earlier the handler for int 10 is in flash, so that call will probably generate some exception, but this is a bit more speculation.

I wonder why INT10 should be handled at all, since all interrupts are disabled during SPI flash operations. Are you sure that is still handled by ets_timer_handler_isr and crashes while being in spi_flash_erase_sector function?

iosabi · 2021-10-28T22:16:31Z

No, I'm not sure what's causing the interrupt handler (_xt_user_exc) to run at all, but I also didn't see where the int10 (or all) is being disabled during the erase. Moving ets_timer.o to IRAM didn't help, so maybe there is some other failure. The failure I see is an overflow in the NMI stack, so maybe the NMI is being executed during the erase. As I said, this was more on the speculation side. I'll try to add some debug print outs to the assembly in startup.S to try trace the code.

gschorcht · 2021-10-29T09:05:04Z

Finally I could find out what the problem is. The function vTaskEnterCritical, which is also used by the ESP8266 spi_flash implementation to disable interrupts, is placed in the IROM.

In

RIOT/cpu/esp8266/ld/esp8266.riot-os.ld

Line 242 in 1dfcc6f

*freertos/*(.literal .text .literal.* .text.*)

the functions implemented in cpu/esp8266/freetos/*.c are placed in IRAM, but functions implemented in cpu/esp_common/freetos/*.c are not placed in IRAM. The reason is that the object files for these files are created in the directory esp_freertos_common instead of something like *freertos, because the library is called esp_freertos_common.

Adding *esp_freertos_common/*.o(.literal .text .literal.* .text.*) to the IRAM section solves the problem. This might also be the cause for other instabilities. I will provide a PR.

gschorcht · 2021-10-29T11:04:12Z

@iosabi This issue was closed automatically by merging #17080. Please test it again and reopen this issue if you think that the issue isn't solved.

iosabi · 2021-10-30T10:54:05Z

Can't repro anymore after that patch. Sounds like it fixed it. I see 1332 bytes more in IRAM with that patch in the examples/filesystem which is pretty reasonable. Thanks for getting to the bottom of this!

I think we need to have a way to produce a more obvious crash when trying to run code from flash if the cache is disabled. This will likely happen again. Maybe a break instruction and a comment next to it with information about how to find out what flash address caused the fault to quickly know which function should be moved to IRAM.

gschorcht · 2021-10-30T13:20:44Z

Thanks for getting to the bottom of this!

I must thank you for finding this problem and investigating it. Perhaps it was also the cause for other instabilities, especially when using esp_wifi.

Maybe a break instruction and a comment next to it with information about how to find out what flash address caused the fault to quickly know which function should be moved to IRAM.

What break instruction should be used and especially where should it be placed. If understand you correctly, the break instruction should be executed each time a function in IROM is called while the cache is disabled.

iosabi · 2021-10-30T18:41:26Z

Any break instruction works. xtensa ISA doc says that break 8, x to break 15, x are all user-defined.

I believe that if you try to execute from IROM when the cache is disabled you get an exception, probably NMI but haven't check, so we should add code that checks this condition and panics with a reasonable message or breaks in the debugger.

iosabi · 2021-10-31T08:22:54Z

@gschorcht so, there are quite a few problems going on here.

Nothing happens when accessing flash memory (read from 0x402xxxxx) when the flash cache is disabled. Nothing. You just read 0.
00 00 00 happens to be the opcode for ill, the illegal instruction so trying to execute from flash with the cache disable at least produces an IllegalInstructionCause exception. The PC will point to the ill instruction and there's no problem fetching it.
exception_handler() is in IRAM, but calls functions in flash, in particular ps() and heap_stats(). In theory it should also call putchar() which is also in flash, but ets_printf() doesn't ever call it (see 5).
ets_printf(), the .text function code is in IRAM (See

RIOT/cpu/esp8266/ld/esp8266.riot-os.ld

Line 236 in 9256970

*esp_idf_esp8266/ets_printf.o(.literal .text .literal.* .text.*)

) but, the xtensa-esp8266-elf toolchain places the literal strings in .rodata, not in .literal, so ets_printf doesn't actually work with the cache disabled because all its const data is not placed in IRAM. It is probably useful to have that function in RAM anyway.
Going one step back, exception_handler() is tagged with IRAM in the source code (not mentioned in the linker script), but the string literals it uses don't end up in IRAM, those are in .rodata which according to the .map file end up in flash, and that includes the format strings we pass to ets_printf(). Reading those strings from flash is not a problem (not an exception), it just reads 0s which happens to be a valid empty C string and ets_printf() happily prints it without ever calling putchar, which is why we get no output on the serial but also no crash in putchar().
Even after the specific fix for vTaskEnterCritical, causing any exception while the flash cache is disabled would trigger the infinite exception.

When an exception occurs while handling an exception with exception_handler() this function will be called again, increasing the exception stack until the whole stack and the memory before it is exhausted and a DoubleException occurs when trying to use an address in the `0x3ff7ffxx` range, far away from any clue to the initial exception. A common cause for an exception in exception_handler() is to try to execute a function from IROM while the cache is disabled. This normally results in an illegal instruction exception. A few of the functions called from the exception_handler() are in IROM, triggering this situation if an exception occurs while the flash cache is disabled. This patch breaks to the debugger immediately if exception_handler is called twice, so the location of the function causing the double exception can be obtained in frame->pc, and the frame in that context can be inspected too. Second, to address the issue of printing the error message in the terminal this patch enables the flash cache before calling ets_printf(). While exception_handler() is in IRAM, its read-only literal strings are not, so calling ets_printf() will not work. With this patch and without the fix in RIOT-OS#17080, reproducing the conditions in issue RIOT-OS#16281 allows us to quickly identify the PC of the function that was causing the issue (`vTaskEnterCritical`) on the first place.

gschorcht · 2021-10-31T09:47:19Z

2. IllegalInstructionCause exception

Yes, that is exactly what I saw when the cache was disabled when calling a function in IROM, but after not working on the ESP8266 port for at least 2 years, I forgot.

3. exception_handler() is in IRAM, but calls functions in flash, in particular ps() and heap_stats(). In theory it should also call putchar() which is also in flash, but ets_printf() doesn't ever call it (see 5).

That is, they would have to be placed in IRAM too.

4. xtensa-esp8266-elf toolchain places the literal strings in .rodata, not in .literal, so ets_printf doesn't actually work with the cache

Ah, thats why the original linker script includes .rodata for ets_printf.o:

#ifdef CONFIG_LINK_ETS_PRINTF_TO_IRAM
    *libesp8266.a:ets_printf.o(.literal .text .literal.* .text.* .rodata.* .rodata)
#endif

Finally, we would have to extend esp8266.riot-os.ld in IRAM section by:

diff --git a/cpu/esp8266/ld/esp8266.riot-os.ld b/cpu/esp8266/ld/esp8266.riot-os.ld
index e99aa855fe..06df3af98d 100644
--- a/cpu/esp8266/ld/esp8266.riot-os.ld
+++ b/cpu/esp8266/ld/esp8266.riot-os.ld
@@ -233,7 +233,7 @@ SECTIONS
     /* SDK libraries that expect their .text or .data sections to link to iram */
     /* TODO *libcore.a:(.bss .data .bss.* .data.* COMMON) */
     *esp_idf_spi_flash/spi_flash_raw.o(.literal .text .literal.* .text.*)
-    *esp_idf_esp8266/ets_printf.o(.literal .text .literal.* .text.*)
+    *esp_idf_esp8266/ets_printf.o(.literal .text .literal.* .text.* .rodata.* .rodata)
     /*
     *cpu.a:*.o(.literal .text .literal.* .text.*)
     */
@@ -241,6 +241,8 @@ SECTIONS
     *esp_wifi/*(.literal .text .literal.* .text.*)
     *freertos/*(.literal .text .literal.* .text.*)
     *periph/*(.literal .text .literal.* .text.*)
+    *ps/*(.literal .text .literal.* .text.* .rodata.* .rodata)
+    *newlib_syscalls_default/*(.literal .text .literal.* .text.* .rodata.* .rodata)
     *xtimer/*(.literal .text .literal.* .text.*)
 
     *libhal.a:clock.o(.literal .text .literal.* .text.*)
@@ -287,6 +289,7 @@ SECTIONS
     *libc.a:*findfp.o(.literal .text .literal.* .text.*)
     *libc.a:*fputwc.o(.literal .text .literal.* .text.*)
     */
+    *libg.a:*putchar.o(.literal .text .literal.* .text.*)
 
     *enc28j60/*(.literal .text .literal.* .text.*)

gschorcht · 2021-10-31T09:50:39Z

I would prefer to place everything in IRAM. But with 48 kByte it is so small that we have to limit what goes into the IRAM. It always requires a bit of trial and error what absolutely has to go into the IRAM.

iosabi · 2021-10-31T15:24:16Z

Why do we have both printf() in newlib and ets_printf()?

gschorcht · 2021-10-31T17:12:02Z

If I remember correctly, there were the following reasons:

The SDK redefines ets_printf which is usually a ROM function for libraries that require ets_printf. ets_printf source code in SDK gives some explanation. We just use a slightly adapted version.
ets_printf is the only printf function that should work even if the IROM cache is disabled, which is not the case at the moment, as we learned.
ets_printf is the only printf function that works before the IROM cache is enabled during startup.
ets_printf was intended to work also inside interrupt context. The newlib in xtensa-gcc uses the reentrant version of printf which uses lock functions which in turn use mutexes. RIOT's mutex, however, doesn't work in interrupt context.

benpicco assigned gschorcht Apr 9, 2021

jeandudey added Area: cpu Area: CPU/MCU ports Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) labels Apr 13, 2021

aabadie added the Platform: ESP Platform: This PR/issue effects ESP-based platforms label May 20, 2021

MrKevinWeiss added this to the Release 2021.07 milestone Jun 21, 2021

MrKevinWeiss removed this from the Release 2021.07 milestone Jul 15, 2021

gschorcht mentioned this issue Oct 25, 2021

[TRACKING] cpu/esp8266: Open problems of the ESP8266 port based on the ESP8266 RTOS SDK #12707

Open

9 tasks

gschorcht mentioned this issue Oct 29, 2021

cpu/esp8266: place freertos functions in IRAM #17080

Merged

gschorcht closed this as completed in #17080 Oct 29, 2021

gschorcht mentioned this issue Oct 30, 2021

cpu/esp32: place freertos and periph in IRAM #17087

Merged

iosabi mentioned this issue Oct 31, 2021

esp8266: Handle exceptions in the exception_handler(). #17089

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

esp8266: Hangs when erasing spi sector on mtd0 if using esp_wifi #16281

esp8266: Hangs when erasing spi sector on mtd0 if using esp_wifi #16281

iosabi commented Apr 5, 2021 •

edited

Loading

iosabi commented Apr 5, 2021

gschorcht commented Oct 21, 2021

gschorcht commented Oct 25, 2021 •

edited

Loading

iosabi commented Oct 25, 2021

gschorcht commented Oct 26, 2021

iosabi commented Oct 28, 2021

gschorcht commented Oct 29, 2021 •

edited

Loading

gschorcht commented Oct 29, 2021

iosabi commented Oct 30, 2021

gschorcht commented Oct 30, 2021 •

edited

Loading

iosabi commented Oct 30, 2021

iosabi commented Oct 31, 2021

gschorcht commented Oct 31, 2021

gschorcht commented Oct 31, 2021

iosabi commented Oct 31, 2021

gschorcht commented Oct 31, 2021

esp8266: Hangs when erasing spi sector on mtd0 if using esp_wifi #16281

esp8266: Hangs when erasing spi sector on mtd0 if using esp_wifi #16281

Comments

iosabi commented Apr 5, 2021 • edited Loading

Description

Steps to reproduce the issue

Expected results

Actual results

Versions

Extra info

iosabi commented Apr 5, 2021

Debugging

gschorcht commented Oct 21, 2021

gschorcht commented Oct 25, 2021 • edited Loading

iosabi commented Oct 25, 2021

gschorcht commented Oct 26, 2021

iosabi commented Oct 28, 2021

gschorcht commented Oct 29, 2021 • edited Loading

gschorcht commented Oct 29, 2021

iosabi commented Oct 30, 2021

gschorcht commented Oct 30, 2021 • edited Loading

iosabi commented Oct 30, 2021

iosabi commented Oct 31, 2021

gschorcht commented Oct 31, 2021

gschorcht commented Oct 31, 2021

iosabi commented Oct 31, 2021

gschorcht commented Oct 31, 2021

iosabi commented Apr 5, 2021 •

edited

Loading

gschorcht commented Oct 25, 2021 •

edited

Loading

gschorcht commented Oct 29, 2021 •

edited

Loading

gschorcht commented Oct 30, 2021 •

edited

Loading