Small optimizations #781

jeanlemotan · 2015-02-01T12:02:38Z

Optimized redundant dereferencing of the state
Cached the register for FIQ_WRITE and FIQ_READ
collapsed come redundant branches and added early outs where possible

- Optimized redundant dereferencing of the state - Cached the register for FIQ_WRITE and FIQ_READ - collapsed come redundant branches and added early outs where possible

popcornmix · 2015-02-01T12:15:24Z

@P33M for review.

P33M · 2015-02-01T12:25:29Z

Have you looked at the assembler output for the functions before and after your "optimisations"?

GCC will perform the computation of an offset once if the same offset is referenced multiple times in a function call. This applies to both register offsets and struct offsets. I suspect that GCC is already doing automatically what you are doing manually.

Likewise in fiq_fsm_tt_in_use() - is there any evidence in the emitted code that the fast return is any quicker in terms of number of instructions than the subsequent value test and break?

Your choice of hw register name causes the code to be less readable - "reg + offset"? Shouldn't that be a "channel_base + offset"?

jeanlemotan · 2015-02-01T19:19:37Z

You're right, reg_base would be clearer than "reg" as some offset is always added.
I didn't look at the generated assembler but I did notice the object file got significantly smaller. No numbers though as I did this change some weeks ago. And of course that doesn't tell much so I'll try to check the asm tonight.

The early return in fiq_fsm_tt_in_use is mostly for clarity IMO. The code was essentially breaking the loop to return if the channel was used and I think the intention is clearer now. Collapsing the branch (st->channel[i].fsm == FIQ_PASSTHROUGH) into the switch statement is probably irrelevant in terms of speed but again - I think it streamlines the function a bit. All the decisions based on the fsm state are taken in the switch.

jeanlemotan · 2015-02-01T21:19:59Z

Here's the asm for fiq_fsm_tt_in_use:

Before the commit:
0xc0255ad8 <+0>: add r3, r2, r2, lsl #4
0xc0255adc <+4>: cmp r1, #0
0xc0255ae0 <+8>: add r3, r0, r3, lsl #2
0xc0255ae4 <+12>: push {r4, r5, r6, lr}
0xc0255ae8 <+16>: ldr r5, [r3, #68] ; 0x44
0xc0255aec <+20>: ldr r6, [r3, #72] ; 0x48
0xc0255af0 <+24>: ble 0xc0255b44 <fiq_fsm_tt_in_use+108>
0xc0255af4 <+28>: mov r3, #0
0xc0255af8 <+32>: mov r4, #1
0xc0255afc <+36>: cmp r3, r2
0xc0255b00 <+40>: add r3, r3, #1
0xc0255b04 <+44>: beq 0xc0255b38 <fiq_fsm_tt_in_use+96>
0xc0255b08 <+48>: ldr r12, [r0, #60] ; 0x3c
0xc0255b0c <+52>: cmp r12, #0
0xc0255b10 <+56>: sub lr, r12, #9
0xc0255b14 <+60>: beq 0xc0255b38 <fiq_fsm_tt_in_use+96>
0xc0255b18 <+64>: cmp lr, #6
0xc0255b1c <+68>: lsl r12, r4, lr
0xc0255b20 <+72>: bhi 0xc0255b38 <fiq_fsm_tt_in_use+96>
0xc0255b24 <+76>: tst r12, #121 ; 0x79
0xc0255b28 <+80>: beq 0xc0255b38 <fiq_fsm_tt_in_use+96>
0xc0255b2c <+84>: ldr r12, [r0, #68] ; 0x44
0xc0255b30 <+88>: cmp r5, r12
0xc0255b34 <+92>: beq 0xc0255b4c <fiq_fsm_tt_in_use+116>
0xc0255b38 <+96>: cmp r3, r1
0xc0255b3c <+100>: add r0, r0, #68 ; 0x44
0xc0255b40 <+104>: bne 0xc0255afc <fiq_fsm_tt_in_use+36>
0xc0255b44 <+108>: mov r0, #0
0xc0255b48 <+112>: pop {r4, r5, r6, pc}
0xc0255b4c <+116>: ldr r12, [r0, #72] ; 0x48
0xc0255b50 <+120>: cmp r6, r12
0xc0255b54 <+124>: bne 0xc0255b38 <fiq_fsm_tt_in_use+96>
0xc0255b58 <+128>: mov r0, #1
0xc0255b5c <+132>: pop {r4, r5, r6, pc}
End of assembler dump.

After:
0xc0255ad8 <+0>: add r3, r2, r2, lsl #4
0xc0255adc <+4>: cmp r1, #0
0xc0255ae0 <+8>: add r3, r0, r3, lsl #2
0xc0255ae4 <+12>: push {r4, r5, lr}
0xc0255ae8 <+16>: ldr r4, [r3, #68] ; 0x44
0xc0255aec <+20>: ldr r5, [r3, #72] ; 0x48
0xc0255af0 <+24>: ble 0xc0255b50 <fiq_fsm_tt_in_use+120>
0xc0255af4 <+28>: mov r3, #0
0xc0255af8 <+32>: b 0xc0255b14 <fiq_fsm_tt_in_use+60>
0xc0255afc <+36>: bcc 0xc0255b08 <fiq_fsm_tt_in_use+48>
0xc0255b00 <+40>: cmp lr, #3
0xc0255b04 <+44>: bls 0xc0255b30 <fiq_fsm_tt_in_use+88>
0xc0255b08 <+48>: cmp r3, r1
0xc0255b0c <+52>: add r0, r0, #68 ; 0x44
0xc0255b10 <+56>: beq 0xc0255b50 <fiq_fsm_tt_in_use+120>
0xc0255b14 <+60>: cmp r3, r2
0xc0255b18 <+64>: add r3, r3, #1
0xc0255b1c <+68>: beq 0xc0255b08 <fiq_fsm_tt_in_use+48>
0xc0255b20 <+72>: ldr r12, [r0, #60] ; 0x3c
0xc0255b24 <+76>: cmp r12, #9
0xc0255b28 <+80>: sub lr, r12, #12
0xc0255b2c <+84>: bne 0xc0255afc <fiq_fsm_tt_in_use+36>
0xc0255b30 <+88>: ldr r12, [r0, #68] ; 0x44
0xc0255b34 <+92>: cmp r4, r12
0xc0255b38 <+96>: bne 0xc0255b08 <fiq_fsm_tt_in_use+48>
0xc0255b3c <+100>: ldr r12, [r0, #72] ; 0x48
0xc0255b40 <+104>: cmp r5, r12
0xc0255b44 <+108>: bne 0xc0255b08 <fiq_fsm_tt_in_use+48>
0xc0255b48 <+112>: mov r0, #1
0xc0255b4c <+116>: pop {r4, r5, pc}
0xc0255b50 <+120>: mov r0, #0
0xc0255b54 <+124>: pop {r4, r5, pc}
End of assembler dump.

The difference is not significant - only 2 instruction. It looks that GCC does better register allocation. I expected a more dramatic difference tbh.

So I would discard the register caching and the st->channel[n] change as it's very big and with insignificant impact.
I can prepare a new commit with just the early outs if needed.

jeanlemotan · 2015-02-03T21:50:47Z

I'm closing this. I will make a new pull request for the cosmetic changes later on.

The renesas-intc-irqpin interrupt controller is cascaded to the GIC. Hence when propagating wake-up settings to its parent interrupt controller, the following lockdep warning is printed: ============================================= [ INFO: possible recursive locking detected ] 4.2.0-armadillo-10725-g50fcd7643c034198 raspberrypi#781 Not tainted --------------------------------------------- s2ram/1179 is trying to acquire lock: (&irq_desc_lock_class){-.-...}, at: [<c005bb54>] __irq_get_desc_lock+0x78/0x94 but task is already holding lock: (&irq_desc_lock_class){-.-...}, at: [<c005bb54>] __irq_get_desc_lock+0x78/0x94 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&irq_desc_lock_class); lock(&irq_desc_lock_class); *** DEADLOCK *** May be due to missing lock nesting notation 7 locks held by s2ram/1179: #0: (sb_writers#7){.+.+.+}, at: [<c00c9708>] __sb_start_write+0x64/0xb8 #1: (&of->mutex){+.+.+.}, at: [<c0125a00>] kernfs_fop_write+0x78/0x1a0 #2: (s_active#23){.+.+.+}, at: [<c0125a08>] kernfs_fop_write+0x80/0x1a0 #3: (autosleep_lock){+.+.+.}, at: [<c0058244>] pm_autosleep_lock+0x18/0x20 #4: (pm_mutex){+.+.+.}, at: [<c0057e50>] pm_suspend+0x54/0x248 #5: (&dev->mutex){......}, at: [<c0243a20>] __device_suspend+0xdc/0x240 #6: (&irq_desc_lock_class){-.-...}, at: [<c005bb54>] __irq_get_desc_lock+0x78/0x94 stack backtrace: CPU: 0 PID: 1179 Comm: s2ram Not tainted 4.2.0-armadillo-10725-g50fcd7643c034198 Hardware name: Generic R8A7740 (Flattened Device Tree) [<c00129f4>] (dump_backtrace) from [<c0012bec>] (show_stack+0x18/0x1c) [<c0012bd4>] (show_stack) from [<c03f5d94>] (dump_stack+0x20/0x28) [<c03f5d74>] (dump_stack) from [<c00514d4>] (__lock_acquire+0x67c/0x1b88) [<c0050e58>] (__lock_acquire) from [<c0052df8>] (lock_acquire+0x9c/0xbc) [<c0052d5c>] (lock_acquire) from [<c03fb068>] (_raw_spin_lock_irqsave+0x44/0x58) [<c03fb024>] (_raw_spin_lock_irqsave) from [<c005bb54>] (__irq_get_desc_lock+0x78/0x94 [<c005badc>] (__irq_get_desc_lock) from [<c005c3d8>] (irq_set_irq_wake+0x28/0x100) [<c005c3b0>] (irq_set_irq_wake) from [<c01e50d0>] (intc_irqpin_irq_set_wake+0x24/0x4c) [<c01e50ac>] (intc_irqpin_irq_set_wake) from [<c005c17c>] (set_irq_wake_real+0x3c/0x50 [<c005c140>] (set_irq_wake_real) from [<c005c414>] (irq_set_irq_wake+0x64/0x100) [<c005c3b0>] (irq_set_irq_wake) from [<c02a19b4>] (gpio_keys_suspend+0x60/0xa0) [<c02a1954>] (gpio_keys_suspend) from [<c023b750>] (platform_pm_suspend+0x3c/0x5c) Avoid this false positive by using a separate lockdep class for INTC External IRQ Pin interrupts. Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Grygorii Strashko <grygorii.strashko@ti.com> Cc: Magnus Damm <magnus.damm@gmail.com> Cc: Jason Cooper <jason@lakedaemon.net> Link: http://lkml.kernel.org/r/1441798974-25716-3-git-send-email-geert%2Brenesas@glider.be Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

[ Upstream commit f0212a5 ] Running with KASAN on a VIM3L systems leads to the following splat when probing the Ethernet device: ================================================================== BUG: KASAN: global-out-of-bounds in _get_maxdiv+0x74/0xd8 Read of size 4 at addr ffffa000090615f4 by task systemd-udevd/139 CPU: 1 PID: 139 Comm: systemd-udevd Tainted: G E 5.7.0-rc1-00101-g8624b7577b9c raspberrypi#781 Hardware name: amlogic w400/w400, BIOS 2020.01-rc5 03/12/2020 Call trace: dump_backtrace+0x0/0x2a0 show_stack+0x20/0x30 dump_stack+0xec/0x148 print_address_description.isra.12+0x70/0x35c __kasan_report+0xfc/0x1d4 kasan_report+0x4c/0x68 __asan_load4+0x9c/0xd8 _get_maxdiv+0x74/0xd8 clk_divider_bestdiv+0x74/0x5e0 clk_divider_round_rate+0x80/0x1a8 clk_core_determine_round_nolock.part.9+0x9c/0xd0 clk_core_round_rate_nolock+0xf0/0x108 clk_hw_round_rate+0xac/0xf0 clk_factor_round_rate+0xb8/0xd0 clk_core_determine_round_nolock.part.9+0x9c/0xd0 clk_core_round_rate_nolock+0xf0/0x108 clk_core_round_rate_nolock+0xbc/0x108 clk_core_set_rate_nolock+0xc4/0x2e8 clk_set_rate+0x58/0xe0 meson8b_dwmac_probe+0x588/0x72c [dwmac_meson8b] platform_drv_probe+0x78/0xd8 really_probe+0x158/0x610 driver_probe_device+0x140/0x1b0 device_driver_attach+0xa4/0xb0 __driver_attach+0xcc/0x1c8 bus_for_each_dev+0xf4/0x168 driver_attach+0x3c/0x50 bus_add_driver+0x238/0x2e8 driver_register+0xc8/0x1e8 __platform_driver_register+0x88/0x98 meson8b_dwmac_driver_init+0x28/0x1000 [dwmac_meson8b] do_one_initcall+0xa8/0x328 do_init_module+0xe8/0x368 load_module+0x3300/0x36b0 __do_sys_finit_module+0x120/0x1a8 __arm64_sys_finit_module+0x4c/0x60 el0_svc_common.constprop.2+0xe4/0x268 do_el0_svc+0x98/0xa8 el0_svc+0x24/0x68 el0_sync_handler+0x12c/0x318 el0_sync+0x158/0x180 The buggy address belongs to the variable: div_table.63646+0x34/0xfffffffffffffa40 [dwmac_meson8b] Memory state around the buggy address: ffffa00009061480: fa fa fa fa 00 00 00 01 fa fa fa fa 00 00 00 00 ffffa00009061500: 05 fa fa fa fa fa fa fa 00 04 fa fa fa fa fa fa >ffffa00009061580: 00 03 fa fa fa fa fa fa 00 00 00 00 00 00 fa fa ^ ffffa00009061600: fa fa fa fa 00 01 fa fa fa fa fa fa 01 fa fa fa ffffa00009061680: fa fa fa fa 00 01 fa fa fa fa fa fa 04 fa fa fa ================================================================== Digging into this indeed shows that the clock divider array is lacking a final fence, and that the clock subsystems goes in the weeds. Oh well. Let's add the empty structure that indicates the end of the array. Fixes: bd6f485 ("net: stmmac: dwmac-meson8b: Fix the RGMII TX delay on Meson8b/8m2 SoCs") Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Reviewed-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

[ Upstream commit f0212a5 ] Running with KASAN on a VIM3L systems leads to the following splat when probing the Ethernet device: ================================================================== BUG: KASAN: global-out-of-bounds in _get_maxdiv+0x74/0xd8 Read of size 4 at addr ffffa000090615f4 by task systemd-udevd/139 CPU: 1 PID: 139 Comm: systemd-udevd Tainted: G E 5.7.0-rc1-00101-g8624b7577b9c #781 Hardware name: amlogic w400/w400, BIOS 2020.01-rc5 03/12/2020 Call trace: dump_backtrace+0x0/0x2a0 show_stack+0x20/0x30 dump_stack+0xec/0x148 print_address_description.isra.12+0x70/0x35c __kasan_report+0xfc/0x1d4 kasan_report+0x4c/0x68 __asan_load4+0x9c/0xd8 _get_maxdiv+0x74/0xd8 clk_divider_bestdiv+0x74/0x5e0 clk_divider_round_rate+0x80/0x1a8 clk_core_determine_round_nolock.part.9+0x9c/0xd0 clk_core_round_rate_nolock+0xf0/0x108 clk_hw_round_rate+0xac/0xf0 clk_factor_round_rate+0xb8/0xd0 clk_core_determine_round_nolock.part.9+0x9c/0xd0 clk_core_round_rate_nolock+0xf0/0x108 clk_core_round_rate_nolock+0xbc/0x108 clk_core_set_rate_nolock+0xc4/0x2e8 clk_set_rate+0x58/0xe0 meson8b_dwmac_probe+0x588/0x72c [dwmac_meson8b] platform_drv_probe+0x78/0xd8 really_probe+0x158/0x610 driver_probe_device+0x140/0x1b0 device_driver_attach+0xa4/0xb0 __driver_attach+0xcc/0x1c8 bus_for_each_dev+0xf4/0x168 driver_attach+0x3c/0x50 bus_add_driver+0x238/0x2e8 driver_register+0xc8/0x1e8 __platform_driver_register+0x88/0x98 meson8b_dwmac_driver_init+0x28/0x1000 [dwmac_meson8b] do_one_initcall+0xa8/0x328 do_init_module+0xe8/0x368 load_module+0x3300/0x36b0 __do_sys_finit_module+0x120/0x1a8 __arm64_sys_finit_module+0x4c/0x60 el0_svc_common.constprop.2+0xe4/0x268 do_el0_svc+0x98/0xa8 el0_svc+0x24/0x68 el0_sync_handler+0x12c/0x318 el0_sync+0x158/0x180 The buggy address belongs to the variable: div_table.63646+0x34/0xfffffffffffffa40 [dwmac_meson8b] Memory state around the buggy address: ffffa00009061480: fa fa fa fa 00 00 00 01 fa fa fa fa 00 00 00 00 ffffa00009061500: 05 fa fa fa fa fa fa fa 00 04 fa fa fa fa fa fa >ffffa00009061580: 00 03 fa fa fa fa fa fa 00 00 00 00 00 00 fa fa ^ ffffa00009061600: fa fa fa fa 00 01 fa fa fa fa fa fa 01 fa fa fa ffffa00009061680: fa fa fa fa 00 01 fa fa fa fa fa fa 04 fa fa fa ================================================================== Digging into this indeed shows that the clock divider array is lacking a final fence, and that the clock subsystems goes in the weeds. Oh well. Let's add the empty structure that indicates the end of the array. Fixes: bd6f485 ("net: stmmac: dwmac-meson8b: Fix the RGMII TX delay on Meson8b/8m2 SoCs") Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Reviewed-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Small optimizations

8866799

- Optimized redundant dereferencing of the state - Cached the register for FIQ_WRITE and FIQ_READ - collapsed come redundant branches and added early outs where possible

jeanlemotan closed this Feb 3, 2015

jeanlemotan deleted the patch-3 branch February 12, 2015 19:45

gregt590 mentioned this pull request May 31, 2015

MMC falling to PIO #985

Closed

herbertkoenig mentioned this pull request Aug 24, 2015

File Permissions for exported GPIO pins changed in Kernel 4.1.6 #1117

Closed

J-Kerman mentioned this pull request Aug 24, 2015

"tc filter show dev eth0" - does not show match conditions #1116

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Small optimizations #781

Small optimizations #781

jeanlemotan commented Feb 1, 2015

popcornmix commented Feb 1, 2015

P33M commented Feb 1, 2015

jeanlemotan commented Feb 1, 2015

jeanlemotan commented Feb 1, 2015

jeanlemotan commented Feb 3, 2015

Small optimizations #781

Small optimizations #781

Conversation

jeanlemotan commented Feb 1, 2015

popcornmix commented Feb 1, 2015

P33M commented Feb 1, 2015

jeanlemotan commented Feb 1, 2015

jeanlemotan commented Feb 1, 2015

jeanlemotan commented Feb 3, 2015