Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

drivers: serial: uart_ns16550: add missing isr locking #23693

Merged

Conversation

jenmwms
Copy link
Collaborator

@jenmwms jenmwms commented Mar 23, 2020

ISR locking was missing in the UART driver for x86_64. This PR adds a spinlock mechanism to interrupt related services in the driver.

Fixes #23026

@jenmwms jenmwms added the In progress For PRs: is work in progress and should not be merged yet. For issues: Is being worked on label Mar 23, 2020
@zephyrbot zephyrbot added area: X86 x86 Architecture (32-bit) area: API Changes to public APIs labels Mar 23, 2020
@zephyrbot
Copy link
Collaborator

zephyrbot commented Mar 23, 2020

All checks are passing now.

Tip: The bot edits this comment instead of posting a new one, so you can check the comment's history to see earlier messages.

@jenmwms jenmwms force-pushed the drivers_uart_fix_isrlock_23026 branch 3 times, most recently from dbd47d9 to 4436150 Compare March 23, 2020 17:33
@jenmwms jenmwms force-pushed the drivers_uart_fix_isrlock_23026 branch 2 times, most recently from 06a850b to 2c0074c Compare March 23, 2020 23:13
@jenmwms
Copy link
Collaborator Author

jenmwms commented Mar 23, 2020

The automated tests are passing, however the issue/bug #23026 is still observed with:
west build -b qemu_x86_64 samples/subsys/console/echo/ && west build -t run

I'll keep it as WIP until it fixes the issue.

@jenmwms jenmwms force-pushed the drivers_uart_fix_isrlock_23026 branch from 2c0074c to 3a9ed99 Compare March 24, 2020 16:21
@jenmwms
Copy link
Collaborator Author

jenmwms commented Mar 24, 2020

To reproduce the failed Shippable test -
./scripts/sanitycheck --platform=qemu_x86_64 -T tests/drivers/ipm/

The test is successful before this patch, but fails with it. Trying to figure out why then I'll submit an update.

@jenmwms jenmwms force-pushed the drivers_uart_fix_isrlock_23026 branch from 3a9ed99 to 4abc53b Compare March 25, 2020 20:56
@jenmwms
Copy link
Collaborator Author

jenmwms commented Mar 25, 2020

To reproduce the failed Shippable test -
./scripts/sanitycheck --platform=qemu_x86_64 -T tests/drivers/ipm/

The test is successful before this patch, but fails with it. Trying to figure out why then I'll submit an update.

This 1 test is still failing, but moving out of Draft will get more eyes on it as a PR so we can get code review.

@jenmwms jenmwms marked this pull request as ready for review March 25, 2020 21:01
@jenmwms jenmwms requested a review from maksimmasalski April 1, 2020 20:13
@nashif
Copy link
Member

nashif commented Apr 1, 2020

please do not put bug numbers in commit titles:

drivers: serial: uart: ns16550 add missing isr locking #23026

@jenmwms jenmwms changed the title drivers: uart: add isr locking 23026 drivers: uart: add isr locking Apr 2, 2020
@jenmwms jenmwms removed the In progress For PRs: is work in progress and should not be merged yet. For issues: Is being worked on label Apr 2, 2020
@jenmwms jenmwms force-pushed the drivers_uart_fix_isrlock_23026 branch from 3e55462 to 7b3bf49 Compare May 11, 2020 22:17
@jenmwms
Copy link
Collaborator Author

jenmwms commented May 11, 2020

I tested today on master 7201c1d - No modification or cherry-pick.
Omitting hello_world tests this time:

Test (master) (master + PR) : Qemu_x86 Qemu_x86_64 Up_squared
tests/drivers/uart/uart_basic_api PASS FAIL PASS FAIL PASS FAIL
samples/subsys/console/echo PASS PASS PASS FAIL PASS PASS
tests/drivers/ipm PASS PASS PASS PASS tbd PASS
  1. tests/drivers/uart/uart_basic_api
    Qemu_x86
    Master + PR
Booting from ROM..Optimal CONFIG_X86_MMU_PAGE_POOL_PAGES 7
*** Booting Zephyr OS build zephyr-v2.2.0-2727-g7b3bf498010d  ***
Running test suite uart_basic_test
===================================================================
starting test - test_uart_configure
PASS - test_uart_configure
===================================================================
starting test - test_uart_config_get
This is a configure_get test.
PASS - test_uart_config_get
===================================================================
starting test - test_uart_fifo_fill

Qemu_x86_64
Master + PR

Booting from ROM..Optimal CONFIG_X86_MMU_PAGE_POOL_PAGES 8
*** Booting Zephyr OS build zephyr-v2.2.0-2727-g7b3bf498010d  ***
Running test suite uart_basic_test
===================================================================
starting test - test_uart_configure
PASS - test_uart_configure
===================================================================
starting test - test_uart_config_get
This is a configure_get test.
PASS - test_uart_config_get
===================================================================
starting test - test_uart_fifo_fill

Up_squared
Master + PR

*** Booting Zephyr OS build zephyr-v2.2.0-2727-g7b3bf498010d  ***
Running test suite uart_basic_test
===================================================================
PASS - test_uart_configureconfigure
===================================================================
starting test - test_uart_config_get
PASS - ▒est_uart_config_gett.
===================================================================
starting test - test_uart_fifo_fill
ASSERTION FAIL [z_spin_lock_valid(l)] @ ../include/spinlock.h:92

  1. samples/subsys/console/echo
    Qemu_x86
    Master + PR
Booting from ROM..Optimal CONFIG_X86_MMU_PAGE_POOL_PAGES 7
*** Booting Zephyr OS build zephyr-v2.2.0-2727-g7b3bf498010d  ***
You should see another line with instructions below. If not,
the (interrupt-driven) console device doesn't work as expected:
Start typing characters to see them echoed back
fdafda

Qemu_x86_64
Master + PR

Booting from ROM..Optimal CONFIG_X86_MMU_PAGE_POOL_PAGES 8
*** Booting Zephyr OS build zephyr-v2.2.0-2727-g7b3bf498010d  ***
You should see another line with instructions below. If not,
the (interrupt-driven) console device doesn't work as expected:

Up_squared
Master + PR

*** Booting Zephyr OS build zephyr-v2.2.0-2726-g7201c1d6e983  ***
You should see another line with in{tructions below. If not,
the (interrupt-driven) console device doesn't work as expected:
Start typing characters to see them echoed back
Fdafda

  1. tests/drivers/ipm/
    Qemu_x86
    Master + PR
Booting from ROM..Optimal CONFIG_X86_MMU_PAGE_POOL_PAGES 7
*** Booting Zephyr OS build zephyr-v2.2.0-2727-g7b3bf498010d  ***
starting test - Test IPM
ipm_recv0: 'everything is awesome'
ipm_recv0: 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor '
ipm_recv0: 'incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nost'
ipm_recv0: 'rud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis '
ipm_recv0: 'aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugi'
ipm_recv0: 'at nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culp'
ipm_recv0: 'a qui officia deserunt mollit anim id est laborum.'
PASS - main
===================================================================
===================================================================
PROJECT EXECUTION SUCCESSFUL

Qemu_x86_64
Master + PR

Booting from ROM..Optimal CONFIG_X86_MMU_PAGE_POOL_PAGES 8
*** Booting Zephyr OS build zephyr-v2.2.0-2727-gd337e80fec97  ***
starting test - Test IPM
ipm_recv0: 'everything is awesome'
ipm_recv0: 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor '
ipm_recv0: 'incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nost'
ipm_recv0: 'rud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis '
ipm_recv0: 'aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugi'
ipm_recv0: 'at nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culp'
ipm_recv0: 'a qui officia deserunt mollit anim id est laborum.'
PASS - main
===================================================================
===================================================================
PROJECT EXECUTION SUCCESSFUL

Up_squared
Master + PR

*** Booting Zephyr OS build zephyr-v2.2.0-2727-g7b3bf498010d  ***
starting test - Test IPM
ipm_recv0: 'everything is awesome'
ipm_recv0: 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor '
ipm_recv0: 'incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nost'
ipm_recv0: 'rud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis '
ipm_recv0: 'aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugi'
ipm_recv0: 'at nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culp'
ipm_recv0: 'a qui officia deserunt mollit anim id est laborum.'
PASS - main
===================================================================
===================================================================
PROJECT EXECUTION SUCCESSFUL

@andrewboie
Copy link
Contributor

Looks like shippable is still getting a stack overflow failure in the mailbox test:

[0mSeaBIOS (version rel-1.12.1-0-ga5cab58-dirty-20200214_052440-f7294c49af13-zephyr
)
Booting from ROM..Optimal CONFIG_X86_MMU_PAGE_POOL_PAGES 8
*** Booting Zephyr OS build v2.3.0-rc1-17-gac7fec1a8218  ***
starting test - Test IPM
E: Page fault at address 0x11dff8 (error code 0x3)
E: Access violation: supervisor thread not allowed to write
E: PML4E: 0x000000000011a827 Writable, User, Execute Enabled
E: PDPTE: 0x0000000000119827 Writable, User, Execute Enabled
E:   PDE: 0x0000000000118827 Writable, User, Execute Enabled
E:   PTE: 0x800000000011d001 Read-only, Supervisor, Execute Disable
E: RAX: 0x00000000001080c0 RBX: 0x000000000010019a RCX: 0x00000000001086d1 RDX: 0x0000000000000010
E: RSI: 0x0000000000000069 RDI: 0x0000000000129150 RBP: 0x000000000011e09c RSP: 0x000000000011e000
E:  R8: 0x00000000001086d0  R9: 0x0000000000000015 R10: 0x0000000000000000 R11: 0x0000000000000000
E: R12: 0x0000000000000069 R13: 0x000000000011e0b8 R14: 0x0000000000000000 R15: 0x0000000000000001
E: RSP: 0x000000000011e000 RFLAGS: 0x0000000000000216 CS: 0x0018 CR3: 0x000000000010a000
E: RIP: 0x00000000001035b7
E: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
E: Current thread: 0x000000000011b050 (unknown)
E: Halting system

This looks like a stack overflow, it's trying to write to a guard page at 0x11dff8 , notice the RSP is 0x000000000011e000

It's probably one of the test threads, can you find out which thread is 0x000000000011b050 and kick up its stack space a bit? Are the stack sizes in the test adding CONFIG_TEST_EXTRA_STACKSIZE (or whatever it's called) to their stack sizes?

@jenmwms
Copy link
Collaborator Author

jenmwms commented May 12, 2020

can you find out which thread is 0x000000000011b050 and kick up its stack space a bit? Are the stack sizes in the test adding CONFIG_TEST_EXTRA_STACKSIZE (or whatever it's called) to their stack sizes?

Thanks for looking into it! I tried this in the past,

--- a/tests/drivers/ipm/prj.conf
+++ b/tests/drivers/ipm/prj.conf
@@ -6,3 +6,4 @@ CONFIG_IPM_CONSOLE_RECEIVER=y
 CONFIG_IPM_CONSOLE_SENDER=y
 CONFIG_IRQ_OFFLOAD=y
 CONFIG_MP_NUM_CPUS=1
+CONFIG_MAIN_STACK_SIZE=2048

but no effect. I'm not sure if it uses CONFIG_TEST_EXTRA_STACKSIZE, I'll take a look. Thanks for pointing that out.

@jenmwms jenmwms force-pushed the drivers_uart_fix_isrlock_23026 branch from 7b3bf49 to d337e80 Compare May 12, 2020 00:36
@jenmwms
Copy link
Collaborator Author

jenmwms commented May 12, 2020

I added CONFIG_IPM_CONSOLE_STACK_SIZE=2048 in tests/drivers/ipm/prj.conf and it passes now.

@zephyrbot zephyrbot added the area: Tests Issues related to a particular existing or missing test label May 12, 2020
@jenmwms
Copy link
Collaborator Author

jenmwms commented May 12, 2020

Still investigating why test_uart_fifo_fill is getting stuck on qemu_x86 , qemu_x86_64, and up_squared with this PR. Also, samples/subsys/console/echo for qemu_x86_64.

@carlescufi
Copy link
Member

@jenmwms please let me know when this is ready to merge

@@ -505,11 +521,15 @@ static int uart_ns16550_poll_in(struct device *dev, unsigned char *c)
static void uart_ns16550_poll_out(struct device *dev,
unsigned char c)
{
k_spinlock_key_t key = k_spin_lock(&DEV_DATA(dev)->lock);

/* wait for transmitter to ready to accept a character */
while ((INBYTE(LSR(dev)) & LSR_THRE) == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relevant to the discussion earlier: this is a potential deadlock right here, though unlikely to be the source of the hang mentioned above. This is going to spin with the lock held waiting for the hardware FIFO to drain. But some uarts (at least in theory) have hardware implementations of flow control, which means that the device on the other side is capable of stalling the FIFO at will and causing a hang locally.

This while need to go outside a single block that does (atomically) "can I send a byte? if so, send it and break out of the loop". And it's probably good practice to put a busy wait of ~1 byte transfer time (87us at 115200bps) to avoid spamming the lock and/or banging the hardware too fast.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, wonderful. makes sense

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds similar to what's going on in https://github.com/zephyrproject-rtos/zephyr/pull/25064/files? I tried that approach but did not resolve the hang. Still investigating

Copy link
Collaborator Author

@jenmwms jenmwms May 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andyross Interesting. There are a few functions that seem to fit this scenario...perhaps I need to make the same adjustment with a while loop outside a single block that does the checking/doing, yeah? uart_ns16550_poll_out, uart_ns16550_poll_in, uart_ns16550_fifo_fill, and uart_ns16550_fifo_read
FWIW: I tried fixing all those too but still hanging as before.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jenmwms: Mind that there's a huge difference between poll_out/poll_in vs fifo_fill/fifo_read. Please refer to the docstrings: https://github.com/zephyrproject-rtos/zephyr/blob/master/include/drivers/uart.h#L703 . To quote import points:

This function is expected to be called from UART
interrupt handler (ISR)

Result of calling this function not from an ISR is undefined

Likewise, not calling this function
from an ISR if uart_irq_tx_ready() returns true may lead to
undefined behavior

As pointed by @andyross, normally, one doesn't use locking in ISRs. Thus, a solution suitable for poll_out/poll_in, wouldn't be suitable for fifo_fill/fifo_read, and vice versa, solution for fifo_fill/fifo_read might be not the best for poll_out/poll_in.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pfalcon Thanks for the info and link, it is helpful. I'll keep that in mind as I keep working on this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jenmwms Almost always yes. The point to a spinlock[1] is that it needs to enclose a critical section with firmly bounded runtime. It's not like you can never use a "while" in there syntactically, but never in a circumstance where you aren't 100% sure exactly how long it will be until you unlock.

[1] In production code. In the test suite we do all kinds of abusive things to locking paradigms to isolate stuff we want to test.

@pfalcon: You still need locking in ISRs, though the rules get more complicated. Almost all Zephyr platforms support nested interrupts, so it's rarely actually true that you know you won't be interrupted. And the driver in question is used on SMP systems, where you can have an interrupt racing against code on another CPU.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andyross, Right, someone may need locking in ISRs, and the rules for deal with those cases are complicated, so I didn't try to mention them in my comment.

@andyross
Copy link
Contributor

Sorry, forgot this was in two places. See comment at #25306 (comment)

That's almost certainly where the deadlock is.

@carlescufi carlescufi added the DNM This PR should not be merged (Do Not Merge) label May 18, 2020
@carlescufi
Copy link
Member

Tagged as DNM until you've had a chance to fix any unresolved issues.

Copy link
Member

@nashif nashif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deadlocks on a few tests

@nashif nashif removed the DNM This PR should not be merged (Do Not Merge) label May 18, 2020
drivers/serial/uart_ns16550.c Outdated Show resolved Hide resolved
drivers/serial/uart_ns16550.c Outdated Show resolved Hide resolved
@jenmwms
Copy link
Collaborator Author

jenmwms commented May 18, 2020

As @andyross said, this lock can be removed.
Thanks for your help! I thought that might be the case, but wasn't sure. I'll update it.

The existing uart driver ns16550 did not have ISR locking that
effected IO APIC working in fixed delivery mode in SMP system
x86_64. This commit adds ISR locking mechanism using spinlock
for the interrupt related services.

The CONFIG_IPM_CONSOLE_STACK_SIZE is increased to lift
limitation of stack size experienced in IPM driver test with
this spinlock impelentation.

Fixes zephyrproject-rtos#23026

Signed-off-by: Jennifer Williams <jennifer.m.williams@intel.com>
@jenmwms jenmwms force-pushed the drivers_uart_fix_isrlock_23026 branch from d337e80 to e948cee Compare May 18, 2020 18:24
@jenmwms
Copy link
Collaborator Author

jenmwms commented May 18, 2020

I incorporated the suggested changes, namely I removed the locking on the _isr() and _init(). I rebased to have the latest from master. And I tested with qemu_x86, qemu_x86_64, and up_squared using tests/drivers/uart/uart_basic_api. All were successful.

Here is the result for up_squared:
image

@jenmwms
Copy link
Collaborator Author

jenmwms commented May 18, 2020

This suggested edit still remains (but could be an enhancement in the interest of time?) from @andyross to implement:

And it's probably good practice to put a busy wait of ~1 byte transfer time (87us at 115200bps) to avoid spamming the lock and/or banging the hardware too fast.

@nashif nashif requested a review from dcpleung May 18, 2020 20:15
@nashif nashif added this to the v2.3.0 milestone May 18, 2020
@nashif nashif merged commit d2c74eb into zephyrproject-rtos:master May 18, 2020
@schouleu
Copy link

schouleu commented Sep 3, 2024

Hi,

Sorry to revive this discussion.

As explained by @andyross blocking all interrupts while waiting for the UART fifo to be empty is a very bad idea:

  • Flow controlled UART can be waiting forever if the remote isn't ready
  • UART is usually slow, waiting for 1 character to be flushed will spend 87us with interrupts masked, this is not acceptable in a RTOS.

I don't understand how this patch could be merged as-is. Plus I don't see the need to lock interrupts while waiting for the UART fifo to be empty.

        k_spinlock_key_t key = k_spin_lock(&DEV_DATA(dev)->lock);
	while (1) {
		/* wait for transmitter to ready to accept a character */
		if ((INBYTE(LSR(dev)) & LSR_THRE) == 0) {
			continue;
		}
		OUTBYTE(THR(dev), c);
		break;
	}
	k_spin_unlock(&DEV_DATA(dev)->lock, key);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: API Changes to public APIs area: Tests Issues related to a particular existing or missing test area: UART Universal Asynchronous Receiver-Transmitter area: X86 x86 Architecture (32-bit) block: HW Test Testing on hardware required before merging
Projects
None yet
Development

Successfully merging this pull request may close these issues.

missing ISR locking in UART driver?