esp8266: Hangs when erasing spi sector on mtd0 if using esp_wifi #16281
Maybe @gschorcht has some info.
Debugging
I enabled […]
My guess is that this is related to a WiFi-related ISR firing while we are waiting for the flash to erase a block. Erasing a block happens in the ROM function […]
I listed the ISR handlers in the […]
I replaced ISR 10 with a wrapper in IRAM that raises a GPIO, calls the original handler, and lowers the GPIO after the handler returns. I can see ISR 10 firing every 100 ms in the esp_wifi case. This doesn't occur at all when the esp_wifi module is not compiled in (ISR 10 is then not registered to anything). My guess would be that the CPU is trying to execute something from flash (IROM) while the flash is disabled or not mapped, which causes the CPU to stall, and that this code is ISR 10. However, I don't see the GPIO high when the CPU hangs, which indicates that it's not inside ISR 10. Any other idea where it could be hanging? I couldn't get the JTAG debugger to work, so I don't know what the CPU is doing before the WDT triggers. I'll try to debug a bit more, but I'm out of ideas.
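The GPIO instrumentation described above can be sketched as a host-runnable illustration. This is not the code used in the thread: `isr_handler_t`, `debug_pin_set()`, `instrument_isr()` and `example_handler()` are hypothetical names, the debug GPIO is simulated by a variable, the `void (*)(void *arg)` handler type is an assumption, and `IRAM_ATTR` falls back to an empty macro when the target does not define it.

```c
#include <assert.h>
#include <stddef.h>

/* On the ESP8266 this attribute would place the function in IRAM; on a
 * host build it expands to nothing (assumption: the target defines it). */
#ifndef IRAM_ATTR
#define IRAM_ATTR
#endif

/* Assumed handler signature: one opaque argument. */
typedef void (*isr_handler_t)(void *arg);

/* Simulated debug GPIO; on hardware this would write a GPIO output register. */
static volatile int debug_pin;

static void debug_pin_set(int level)
{
    debug_pin = level;
}

/* The handler that was originally registered for the interrupt. */
static isr_handler_t original_handler;
static void *original_arg;

/* Wrapper that must itself live in IRAM: raise the pin, run the original
 * handler, lower the pin. If the CPU hangs with the pin high, it is stuck
 * inside the handler; if the pin is low, it is stuck somewhere else. */
static void IRAM_ATTR isr_wrapper(void *arg)
{
    (void)arg;
    debug_pin_set(1);
    if (original_handler != NULL) {
        original_handler(original_arg);
    }
    debug_pin_set(0);
}

/* Remember the original handler and return the wrapper, which would then
 * be registered in its place. */
static isr_handler_t instrument_isr(isr_handler_t handler, void *arg)
{
    assert(handler != NULL);
    original_handler = handler;
    original_arg = arg;
    return isr_wrapper;
}

/* Example handler standing in for the real ISR: it records the debug pin
 * level it observes, so the wrapper can be checked on a host. */
static void example_handler(void *arg)
{
    int *seen_level = arg;
    *seen_level = debug_pin;
}
```

Calling the returned wrapper runs `example_handler` with the pin high and leaves the pin low afterwards, which is the signal the scope trace relies on.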
That's right, the IROM cache, and thus access to instructions in SPI flash, is disabled during any SPI flash write access to avoid conflicts with read accesses from the IROM cache controller. Since erasing a sector takes some time, it is quite possible that the WiFi interface generates an interrupt while the IROM cache is disabled. However, since ISRs should usually be implemented in IRAM (to be fast enough and always accessible), that shouldn't be a problem. Furthermore, interrupts from the WiFi interface are not handled directly; instead, the ISR simply generates an event that is queued and handled asynchronously, as quickly as possible, by one of the WiFi tasks, […]
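The "ISR only queues an event, a task does the work" pattern can be sketched as a single-producer/single-consumer ring buffer. This is a hypothetical illustration, not the WiFi stack's code: `event_queue_t`, `event_post_from_isr()`, `event_take()` and the event IDs are invented names, and a single-core CPU is assumed (no memory barriers). A FreeRTOS-based stack would typically use a queue API such as `xQueueSendFromISR()` instead. Note that the ISR and everything it calls, including the queue code, must still be reachable with the flash cache off, which is exactly what went wrong later in this issue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EVENT_QUEUE_LEN 8u
static_assert((EVENT_QUEUE_LEN & (EVENT_QUEUE_LEN - 1u)) == 0u,
              "EVENT_QUEUE_LEN must be a power of two");

/* Hypothetical event IDs, for illustration only. */
typedef enum { EVT_NONE = 0, EVT_RX_DONE, EVT_TX_DONE } wifi_event_t;

typedef struct {
    volatile uint32_t head;          /* advanced only by the ISR  */
    volatile uint32_t tail;          /* advanced only by the task */
    wifi_event_t buf[EVENT_QUEUE_LEN];
} event_queue_t;

/* ISR side: constant time and non-blocking. Returns false (and drops the
 * event) if the queue is full. Unsigned wrap-around keeps head - tail
 * correct even after the counters overflow. */
static bool event_post_from_isr(event_queue_t *q, wifi_event_t ev)
{
    if (q->head - q->tail == EVENT_QUEUE_LEN) {
        return false;
    }
    q->buf[q->head & (EVENT_QUEUE_LEN - 1u)] = ev;
    q->head++;
    return true;
}

/* Task side: returns EVT_NONE when there is nothing to process. */
static wifi_event_t event_take(event_queue_t *q)
{
    if (q->head == q->tail) {
        return EVT_NONE;
    }
    wifi_event_t ev = q->buf[q->tail & (EVENT_QUEUE_LEN - 1u)];
    q->tail++;
    return ev;
}
```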
IRAM access is much faster than IROM access through the cache. Furthermore, code in IRAM is always available, while code in IROM is not, e.g., when the cache is disabled. Therefore, ISRs should be implemented in IRAM whenever possible, unless their execution is not time-critical. Since IRAM is small, as much code as possible is placed in IROM; the two ISRs you mentioned are probably not time-critical. If I remember correctly, I have observed that IROM sometimes isn't available when the WiFi interface is used. Maybe the WiFi hardware is connected internally via SPI, or it disables the IROM cache for some reason.
This message is OK and only says that the association could not be established because the configured AP with the BSSID […]
FRC2 is used by the WiFi module to assert interrupt 10 on a regular basis. However, interrupt 10 shouldn't be asserted at all when function […]
Just for clarification, could you observe that interrupt 10 is triggered regularly until the […]
Correct, int 10 is not triggered anymore after running […]
I added a bit more debugging (using GPIOs with […]
I attached the JTAG debugger and everything runs fine until I execute […]
So what I can conclude from this is that […]
I dumped the RAM region to a file ([…]
The repeating pattern is probably the interrupt stack frame created by […]
I think all interrupts, and all code the interrupts call, should be placed in IRAM (or boot ROM), since interrupts can be called at any point, even with the flash cache disabled, regardless of whether they are time-critical.
@iosabi Thank you very much for this detailed debugging. I'm really impressed; I've never been that lucky to get […]
That's true because […]
I wonder why INT10 should be handled at all, since all interrupts are disabled during SPI flash operations. Are you sure that it is still handled by […]
No, I'm not sure what's causing the interrupt handler […]
Finally, I found out what the problem is. The function […]
In RIOT/cpu/esp8266/ld/esp8266.riot-os.ld, line 242 in 1dfcc6f, […] functions implemented in cpu/esp8266/freertos/*.c are placed in IRAM, but functions implemented in cpu/esp_common/freertos/*.c are not. The reason is that the object files for these files are created in the directory esp_freertos_common instead of something matching *freertos, because the library is called esp_freertos_common.
Adding […]
I can't reproduce it anymore after that patch, so it looks like it fixed it. With that patch, examples/filesystem uses 1332 bytes more IRAM, which is pretty reasonable. Thanks for getting to the bottom of this! I think we need a way to produce a more obvious crash when trying to run code from flash while the cache is disabled; this will likely happen again. Maybe a […]
I must thank you for finding this problem and investigating it. Perhaps it was also the cause of other instabilities, especially when using […]
What […]
Any […]
I believe that if you try to execute from IROM while the cache is disabled, you get an exception (probably an NMI, but I haven't checked), so we should add code that checks this condition and panics with a reasonable message or breaks into the debugger.
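One way to express the proposed check is to classify the faulting PC by memory region and flag the case "PC in IROM while the flash cache is off". This is a sketch under stated assumptions: the region boundaries (boot ROM at 0x40000000, IRAM at 0x40100000 with 48 KiB used for code, flash cache window at 0x40200000, 1 MiB wide) are taken from the usual ESP8266 memory map and should be verified against the datasheet and RIOT's linker script, and `check_exec_addr()` is a hypothetical helper. A real port would panic or break into the debugger instead of returning a status.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed ESP8266 memory map; verify before relying on it. */
#define ROM_START   0x40000000u
#define ROM_END     0x40010000u   /* 64 KiB boot ROM */
#define IRAM_START  0x40100000u
#define IRAM_END    0x4010C000u   /* 48 KiB of IRAM used for code */
#define IROM_START  0x40200000u
#define IROM_END    0x40300000u   /* 1 MiB flash cache window */

static_assert(IRAM_START < IRAM_END && IROM_START < IROM_END,
              "region bounds must be ordered");

typedef enum {
    EXEC_OK,                /* executable right now */
    EXEC_IROM_CACHE_OFF,    /* flash code while the cache is disabled */
    EXEC_UNKNOWN_REGION     /* not a code region at all */
} exec_check_t;

static bool in_range(uint32_t addr, uint32_t start, uint32_t end)
{
    return addr >= start && addr < end;
}

/* Classify an instruction address, e.g. the PC saved in an exception frame. */
static exec_check_t check_exec_addr(uint32_t pc, bool flash_cache_enabled)
{
    if (in_range(pc, ROM_START, ROM_END) || in_range(pc, IRAM_START, IRAM_END)) {
        return EXEC_OK;     /* always reachable, even with the cache off */
    }
    if (in_range(pc, IROM_START, IROM_END)) {
        return flash_cache_enabled ? EXEC_OK : EXEC_IROM_CACHE_OFF;
    }
    return EXEC_UNKNOWN_REGION;
}
```

Reporting `EXEC_IROM_CACHE_OFF` together with the PC would point directly at the culprit function, instead of leaving only a watchdog reset.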
@gschorcht So, there are quite a few problems going on here.
When an exception occurs while handling an exception in exception_handler(), this function is called again, growing the exception stack until the whole stack and the memory before it are exhausted and a DoubleException occurs when trying to use an address in the `0x3ff7ffxx` range, far away from any clue to the initial exception.
A common cause of an exception in exception_handler() is trying to execute a function from IROM while the cache is disabled, which normally results in an illegal instruction exception. A few of the functions called from exception_handler() are in IROM, triggering this situation if an exception occurs while the flash cache is disabled.
First, this patch breaks into the debugger immediately if exception_handler() is called a second time, so the location of the function causing the double exception can be obtained from frame->pc, and the frame in that context can be inspected too.
Second, to address the issue of printing the error message in the terminal, this patch enables the flash cache before calling ets_printf(). While exception_handler() is in IRAM, its read-only literal strings are not, so calling ets_printf() with the cache disabled would fail.
With this patch and without the fix in RIOT-OS#17080, reproducing the conditions in issue RIOT-OS#16281 allows us to quickly identify the PC of the function that was causing the issue in the first place (`vTaskEnterCritical`).
Yes, that is exactly what I saw when calling a function in IROM while the cache was disabled, but after not working on the ESP8266 port for at least two years, I had forgotten.
That is, they would have to be placed in IRAM too.
Ah, that's why the original linker script includes […]
Finally, we would have to extend […]
```diff
diff --git a/cpu/esp8266/ld/esp8266.riot-os.ld b/cpu/esp8266/ld/esp8266.riot-os.ld
index e99aa855fe..06df3af98d 100644
--- a/cpu/esp8266/ld/esp8266.riot-os.ld
+++ b/cpu/esp8266/ld/esp8266.riot-os.ld
@@ -233,7 +233,7 @@ SECTIONS
  /* SDK libraries that expect their .text or .data sections to link to iram */
  /* TODO *libcore.a:(.bss .data .bss.* .data.* COMMON) */
  *esp_idf_spi_flash/spi_flash_raw.o(.literal .text .literal.* .text.*)
- *esp_idf_esp8266/ets_printf.o(.literal .text .literal.* .text.*)
+ *esp_idf_esp8266/ets_printf.o(.literal .text .literal.* .text.* .rodata.* .rodata)
  /*
  *cpu.a:*.o(.literal .text .literal.* .text.*)
  */
@@ -241,6 +241,8 @@ SECTIONS
  *esp_wifi/*(.literal .text .literal.* .text.*)
  *freertos/*(.literal .text .literal.* .text.*)
  *periph/*(.literal .text .literal.* .text.*)
+ *ps/*(.literal .text .literal.* .text.* .rodata.* .rodata)
+ *newlib_syscalls_default/*(.literal .text .literal.* .text.* .rodata.* .rodata)
  *xtimer/*(.literal .text .literal.* .text.*)
  *libhal.a:clock.o(.literal .text .literal.* .text.*)
@@ -287,6 +289,7 @@ SECTIONS
  *libc.a:*findfp.o(.literal .text .literal.* .text.*)
  *libc.a:*fputwc.o(.literal .text .literal.* .text.*)
  */
+ *libg.a:*putchar.o(.literal .text .literal.* .text.*)
  *enc28j60/*(.literal .text .literal.* .text.*)
```
I would prefer to place everything in IRAM, but at 48 kB it is so small that we have to limit what goes into it. Figuring out what absolutely has to go into IRAM always takes a bit of trial and error.
Why do we have both printf() in newlib and ets_printf()?
If I remember correctly, there were the following reasons: […]
Description
The ESP8266 hangs when the esp_wifi module is built in and one attempts to format a SPIFFS file system on mtd0.
I'm using the ESP12x board (in my case the Adafruit Huzzah breakout board without anything connected to it other than the UART).
Steps to reproduce the issue
1. […] xtensa-esp8266-elf-gcc and make it available in your PATH.
2. […] /opt/esp/ESP8266-RTOS-SDK
3. USEMODULE="esp_spiffs esp_wifi lwip_netdev lwip_arp" make BOARD=esp8266-esp-12x -C examples/filesystem all flash
4. […] (`help`, for example).
5. […] `format` (if it runs OK, then run `format` again).
Expected results
Actual results
The `format` command will most likely hang the board. Sometimes it doesn't hang the board the first time (just run it again in that case), but about 90% of the time it does. The board stops responding, and after a few seconds it restarts with the following message: […]
Note that the WiFi connection is not established. There's no AP around with the default name (RIOT_AP), but you will sporadically see messages like this in the console: […]
Versions
Operating System: "Ubuntu" "20.04.2 LTS (Focal Fossa)"
xtensa-esp8266-elf-gcc: xtensa-esp8266-elf-gcc (crosstool-NG crosstool-ng-1.22.0-80-g6c4433a5) 5.2.0
Extra info
The following examples work fine:
USEMODULE="esp_spiffs" make BOARD=esp8266-esp-12x -C examples/filesystem flash
The spiffs test behaves the same way: it works OK with just "esp_spiffs", but hangs when esp_wifi is used as well.
Note: tests/pkg_spiffs is a bit slow because it erases the whole device.