Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Small optimizations #781

Closed
wants to merge 1 commit into from
Closed

Small optimizations #781

wants to merge 1 commit into from

Conversation

jeanlemotan
Copy link

  • Optimized redundant dereferencing of the state
  • Cached the register for FIQ_WRITE and FIQ_READ
  • collapsed come redundant branches and added early outs where possible

- Optimized redundant dereferencing of the state
- Cached the register for FIQ_WRITE and FIQ_READ
- collapsed come redundant branches and added early outs where possible
@popcornmix
Copy link
Collaborator

@P33M for review.

@P33M
Copy link
Contributor

P33M commented Feb 1, 2015

Have you looked at the assembler output for the functions before and after your "optimisations"?

GCC will perform the computation of an offset once if the same offset is referenced multiple times in a function call. This applies to both register offsets and struct offsets. I suspect that GCC is already doing automatically what you are doing manually.

Likewise in fiq_fsm_tt_in_use() - is there any evidence in the emitted code that the fast return is any quicker in terms of number of instructions than the subsequent value test and break?

Your choice of hw register name causes the code to be less readable - "reg + offset"? Shouldn't that be a "channel_base + offset"?

@jeanlemotan
Copy link
Author

You're right, reg_base would be clearer than "reg" as some offset is always added.
I didn't look at the generated assembler but I did notice the object file got significantly smaller. No numbers though as I did this change some weeks ago. And of course that doesn't tell much so I'll try to check the asm tonight.

The early return in fiq_fsm_tt_in_use is mostly for clarity IMO. The code was essentially breaking the loop to return if the channel was used and I think the intention is clearer now. Collapsing the branch (st->channel[i].fsm == FIQ_PASSTHROUGH) into the switch statement is probably irrelevant in terms of speed but again - I think it streamlines the function a bit. All the decisions based on the fsm state are taken in the switch.

@jeanlemotan
Copy link
Author

Here's the asm for fiq_fsm_tt_in_use:

Before the commit:
0xc0255ad8 <+0>: add r3, r2, r2, lsl #4
0xc0255adc <+4>: cmp r1, #0
0xc0255ae0 <+8>: add r3, r0, r3, lsl #2
0xc0255ae4 <+12>: push {r4, r5, r6, lr}
0xc0255ae8 <+16>: ldr r5, [r3, #68] ; 0x44
0xc0255aec <+20>: ldr r6, [r3, #72] ; 0x48
0xc0255af0 <+24>: ble 0xc0255b44 <fiq_fsm_tt_in_use+108>
0xc0255af4 <+28>: mov r3, #0
0xc0255af8 <+32>: mov r4, #1
0xc0255afc <+36>: cmp r3, r2
0xc0255b00 <+40>: add r3, r3, #1
0xc0255b04 <+44>: beq 0xc0255b38 <fiq_fsm_tt_in_use+96>
0xc0255b08 <+48>: ldr r12, [r0, #60] ; 0x3c
0xc0255b0c <+52>: cmp r12, #0
0xc0255b10 <+56>: sub lr, r12, #9
0xc0255b14 <+60>: beq 0xc0255b38 <fiq_fsm_tt_in_use+96>
0xc0255b18 <+64>: cmp lr, #6
0xc0255b1c <+68>: lsl r12, r4, lr
0xc0255b20 <+72>: bhi 0xc0255b38 <fiq_fsm_tt_in_use+96>
0xc0255b24 <+76>: tst r12, #121 ; 0x79
0xc0255b28 <+80>: beq 0xc0255b38 <fiq_fsm_tt_in_use+96>
0xc0255b2c <+84>: ldr r12, [r0, #68] ; 0x44
0xc0255b30 <+88>: cmp r5, r12
0xc0255b34 <+92>: beq 0xc0255b4c <fiq_fsm_tt_in_use+116>
0xc0255b38 <+96>: cmp r3, r1
0xc0255b3c <+100>: add r0, r0, #68 ; 0x44
0xc0255b40 <+104>: bne 0xc0255afc <fiq_fsm_tt_in_use+36>
0xc0255b44 <+108>: mov r0, #0
0xc0255b48 <+112>: pop {r4, r5, r6, pc}
0xc0255b4c <+116>: ldr r12, [r0, #72] ; 0x48
0xc0255b50 <+120>: cmp r6, r12
0xc0255b54 <+124>: bne 0xc0255b38 <fiq_fsm_tt_in_use+96>
0xc0255b58 <+128>: mov r0, #1
0xc0255b5c <+132>: pop {r4, r5, r6, pc}
End of assembler dump.

After:
0xc0255ad8 <+0>: add r3, r2, r2, lsl #4
0xc0255adc <+4>: cmp r1, #0
0xc0255ae0 <+8>: add r3, r0, r3, lsl #2
0xc0255ae4 <+12>: push {r4, r5, lr}
0xc0255ae8 <+16>: ldr r4, [r3, #68] ; 0x44
0xc0255aec <+20>: ldr r5, [r3, #72] ; 0x48
0xc0255af0 <+24>: ble 0xc0255b50 <fiq_fsm_tt_in_use+120>
0xc0255af4 <+28>: mov r3, #0
0xc0255af8 <+32>: b 0xc0255b14 <fiq_fsm_tt_in_use+60>
0xc0255afc <+36>: bcc 0xc0255b08 <fiq_fsm_tt_in_use+48>
0xc0255b00 <+40>: cmp lr, #3
0xc0255b04 <+44>: bls 0xc0255b30 <fiq_fsm_tt_in_use+88>
0xc0255b08 <+48>: cmp r3, r1
0xc0255b0c <+52>: add r0, r0, #68 ; 0x44
0xc0255b10 <+56>: beq 0xc0255b50 <fiq_fsm_tt_in_use+120>
0xc0255b14 <+60>: cmp r3, r2
0xc0255b18 <+64>: add r3, r3, #1
0xc0255b1c <+68>: beq 0xc0255b08 <fiq_fsm_tt_in_use+48>
0xc0255b20 <+72>: ldr r12, [r0, #60] ; 0x3c
0xc0255b24 <+76>: cmp r12, #9
0xc0255b28 <+80>: sub lr, r12, #12
0xc0255b2c <+84>: bne 0xc0255afc <fiq_fsm_tt_in_use+36>
0xc0255b30 <+88>: ldr r12, [r0, #68] ; 0x44
0xc0255b34 <+92>: cmp r4, r12
0xc0255b38 <+96>: bne 0xc0255b08 <fiq_fsm_tt_in_use+48>
0xc0255b3c <+100>: ldr r12, [r0, #72] ; 0x48
0xc0255b40 <+104>: cmp r5, r12
0xc0255b44 <+108>: bne 0xc0255b08 <fiq_fsm_tt_in_use+48>
0xc0255b48 <+112>: mov r0, #1
0xc0255b4c <+116>: pop {r4, r5, pc}
0xc0255b50 <+120>: mov r0, #0
0xc0255b54 <+124>: pop {r4, r5, pc}
End of assembler dump.

The difference is not significant - only 2 instruction. It looks that GCC does better register allocation. I expected a more dramatic difference tbh.

So I would discard the register caching and the st->channel[n] change as it's very big and with insignificant impact.
I can prepare a new commit with just the early outs if needed.

@jeanlemotan
Copy link
Author

I'm closing this. I will make a new pull request for the cosmetic changes later on.

@jeanlemotan jeanlemotan closed this Feb 3, 2015
@jeanlemotan jeanlemotan deleted the patch-3 branch February 12, 2015 19:45
@gregt590 gregt590 mentioned this pull request May 31, 2015
anholt pushed a commit to anholt/linux that referenced this pull request Oct 12, 2015
The renesas-intc-irqpin interrupt controller is cascaded to the GIC.
Hence when propagating wake-up settings to its parent interrupt
controller, the following lockdep warning is printed:

    =============================================
    [ INFO: possible recursive locking detected ]
    4.2.0-armadillo-10725-g50fcd7643c034198 raspberrypi#781 Not tainted
    ---------------------------------------------
    s2ram/1179 is trying to acquire lock:
    (&irq_desc_lock_class){-.-...}, at: [<c005bb54>] __irq_get_desc_lock+0x78/0x94

    but task is already holding lock:
    (&irq_desc_lock_class){-.-...}, at: [<c005bb54>] __irq_get_desc_lock+0x78/0x94

    other info that might help us debug this:
    Possible unsafe locking scenario:

	  CPU0
	  ----
     lock(&irq_desc_lock_class);
     lock(&irq_desc_lock_class);

    *** DEADLOCK ***

    May be due to missing lock nesting notation

    7 locks held by s2ram/1179:
    #0:  (sb_writers#7){.+.+.+}, at: [<c00c9708>] __sb_start_write+0x64/0xb8
    #1:  (&of->mutex){+.+.+.}, at: [<c0125a00>] kernfs_fop_write+0x78/0x1a0
    #2:  (s_active#23){.+.+.+}, at: [<c0125a08>] kernfs_fop_write+0x80/0x1a0
    #3:  (autosleep_lock){+.+.+.}, at: [<c0058244>] pm_autosleep_lock+0x18/0x20
    #4:  (pm_mutex){+.+.+.}, at: [<c0057e50>] pm_suspend+0x54/0x248
    #5:  (&dev->mutex){......}, at: [<c0243a20>] __device_suspend+0xdc/0x240
    #6:  (&irq_desc_lock_class){-.-...}, at: [<c005bb54>] __irq_get_desc_lock+0x78/0x94

    stack backtrace:
    CPU: 0 PID: 1179 Comm: s2ram Not tainted 4.2.0-armadillo-10725-g50fcd7643c034198

    Hardware name: Generic R8A7740 (Flattened Device Tree)
    [<c00129f4>] (dump_backtrace) from [<c0012bec>] (show_stack+0x18/0x1c)
    [<c0012bd4>] (show_stack) from [<c03f5d94>] (dump_stack+0x20/0x28)
    [<c03f5d74>] (dump_stack) from [<c00514d4>] (__lock_acquire+0x67c/0x1b88)
    [<c0050e58>] (__lock_acquire) from [<c0052df8>] (lock_acquire+0x9c/0xbc)
    [<c0052d5c>] (lock_acquire) from [<c03fb068>] (_raw_spin_lock_irqsave+0x44/0x58)
    [<c03fb024>] (_raw_spin_lock_irqsave) from [<c005bb54>] (__irq_get_desc_lock+0x78/0x94
    [<c005badc>] (__irq_get_desc_lock) from [<c005c3d8>] (irq_set_irq_wake+0x28/0x100)
    [<c005c3b0>] (irq_set_irq_wake) from [<c01e50d0>] (intc_irqpin_irq_set_wake+0x24/0x4c)
    [<c01e50ac>] (intc_irqpin_irq_set_wake) from [<c005c17c>] (set_irq_wake_real+0x3c/0x50
    [<c005c140>] (set_irq_wake_real) from [<c005c414>] (irq_set_irq_wake+0x64/0x100)
    [<c005c3b0>] (irq_set_irq_wake) from [<c02a19b4>] (gpio_keys_suspend+0x60/0xa0)
    [<c02a1954>] (gpio_keys_suspend) from [<c023b750>] (platform_pm_suspend+0x3c/0x5c)

Avoid this false positive by using a separate lockdep class for INTC
External IRQ Pin interrupts.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Magnus Damm <magnus.damm@gmail.com>
Cc: Jason Cooper <jason@lakedaemon.net>
Link: http://lkml.kernel.org/r/1441798974-25716-3-git-send-email-geert%2Brenesas@glider.be
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
ED6E0F17 pushed a commit to ED6E0F17/linux that referenced this pull request May 1, 2020
[ Upstream commit f0212a5 ]

Running with KASAN on a VIM3L systems leads to the following splat
when probing the Ethernet device:

==================================================================
BUG: KASAN: global-out-of-bounds in _get_maxdiv+0x74/0xd8
Read of size 4 at addr ffffa000090615f4 by task systemd-udevd/139
CPU: 1 PID: 139 Comm: systemd-udevd Tainted: G            E     5.7.0-rc1-00101-g8624b7577b9c raspberrypi#781
Hardware name: amlogic w400/w400, BIOS 2020.01-rc5 03/12/2020
Call trace:
 dump_backtrace+0x0/0x2a0
 show_stack+0x20/0x30
 dump_stack+0xec/0x148
 print_address_description.isra.12+0x70/0x35c
 __kasan_report+0xfc/0x1d4
 kasan_report+0x4c/0x68
 __asan_load4+0x9c/0xd8
 _get_maxdiv+0x74/0xd8
 clk_divider_bestdiv+0x74/0x5e0
 clk_divider_round_rate+0x80/0x1a8
 clk_core_determine_round_nolock.part.9+0x9c/0xd0
 clk_core_round_rate_nolock+0xf0/0x108
 clk_hw_round_rate+0xac/0xf0
 clk_factor_round_rate+0xb8/0xd0
 clk_core_determine_round_nolock.part.9+0x9c/0xd0
 clk_core_round_rate_nolock+0xf0/0x108
 clk_core_round_rate_nolock+0xbc/0x108
 clk_core_set_rate_nolock+0xc4/0x2e8
 clk_set_rate+0x58/0xe0
 meson8b_dwmac_probe+0x588/0x72c [dwmac_meson8b]
 platform_drv_probe+0x78/0xd8
 really_probe+0x158/0x610
 driver_probe_device+0x140/0x1b0
 device_driver_attach+0xa4/0xb0
 __driver_attach+0xcc/0x1c8
 bus_for_each_dev+0xf4/0x168
 driver_attach+0x3c/0x50
 bus_add_driver+0x238/0x2e8
 driver_register+0xc8/0x1e8
 __platform_driver_register+0x88/0x98
 meson8b_dwmac_driver_init+0x28/0x1000 [dwmac_meson8b]
 do_one_initcall+0xa8/0x328
 do_init_module+0xe8/0x368
 load_module+0x3300/0x36b0
 __do_sys_finit_module+0x120/0x1a8
 __arm64_sys_finit_module+0x4c/0x60
 el0_svc_common.constprop.2+0xe4/0x268
 do_el0_svc+0x98/0xa8
 el0_svc+0x24/0x68
 el0_sync_handler+0x12c/0x318
 el0_sync+0x158/0x180

The buggy address belongs to the variable:
 div_table.63646+0x34/0xfffffffffffffa40 [dwmac_meson8b]

Memory state around the buggy address:
 ffffa00009061480: fa fa fa fa 00 00 00 01 fa fa fa fa 00 00 00 00
 ffffa00009061500: 05 fa fa fa fa fa fa fa 00 04 fa fa fa fa fa fa
>ffffa00009061580: 00 03 fa fa fa fa fa fa 00 00 00 00 00 00 fa fa
                                                             ^
 ffffa00009061600: fa fa fa fa 00 01 fa fa fa fa fa fa 01 fa fa fa
 ffffa00009061680: fa fa fa fa 00 01 fa fa fa fa fa fa 04 fa fa fa
==================================================================

Digging into this indeed shows that the clock divider array is
lacking a final fence, and that the clock subsystems goes in the
weeds. Oh well.

Let's add the empty structure that indicates the end of the array.

Fixes: bd6f485 ("net: stmmac: dwmac-meson8b: Fix the RGMII TX delay on Meson8b/8m2 SoCs")
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Reviewed-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
popcornmix pushed a commit that referenced this pull request May 4, 2020
[ Upstream commit f0212a5 ]

Running with KASAN on a VIM3L systems leads to the following splat
when probing the Ethernet device:

==================================================================
BUG: KASAN: global-out-of-bounds in _get_maxdiv+0x74/0xd8
Read of size 4 at addr ffffa000090615f4 by task systemd-udevd/139
CPU: 1 PID: 139 Comm: systemd-udevd Tainted: G            E     5.7.0-rc1-00101-g8624b7577b9c #781
Hardware name: amlogic w400/w400, BIOS 2020.01-rc5 03/12/2020
Call trace:
 dump_backtrace+0x0/0x2a0
 show_stack+0x20/0x30
 dump_stack+0xec/0x148
 print_address_description.isra.12+0x70/0x35c
 __kasan_report+0xfc/0x1d4
 kasan_report+0x4c/0x68
 __asan_load4+0x9c/0xd8
 _get_maxdiv+0x74/0xd8
 clk_divider_bestdiv+0x74/0x5e0
 clk_divider_round_rate+0x80/0x1a8
 clk_core_determine_round_nolock.part.9+0x9c/0xd0
 clk_core_round_rate_nolock+0xf0/0x108
 clk_hw_round_rate+0xac/0xf0
 clk_factor_round_rate+0xb8/0xd0
 clk_core_determine_round_nolock.part.9+0x9c/0xd0
 clk_core_round_rate_nolock+0xf0/0x108
 clk_core_round_rate_nolock+0xbc/0x108
 clk_core_set_rate_nolock+0xc4/0x2e8
 clk_set_rate+0x58/0xe0
 meson8b_dwmac_probe+0x588/0x72c [dwmac_meson8b]
 platform_drv_probe+0x78/0xd8
 really_probe+0x158/0x610
 driver_probe_device+0x140/0x1b0
 device_driver_attach+0xa4/0xb0
 __driver_attach+0xcc/0x1c8
 bus_for_each_dev+0xf4/0x168
 driver_attach+0x3c/0x50
 bus_add_driver+0x238/0x2e8
 driver_register+0xc8/0x1e8
 __platform_driver_register+0x88/0x98
 meson8b_dwmac_driver_init+0x28/0x1000 [dwmac_meson8b]
 do_one_initcall+0xa8/0x328
 do_init_module+0xe8/0x368
 load_module+0x3300/0x36b0
 __do_sys_finit_module+0x120/0x1a8
 __arm64_sys_finit_module+0x4c/0x60
 el0_svc_common.constprop.2+0xe4/0x268
 do_el0_svc+0x98/0xa8
 el0_svc+0x24/0x68
 el0_sync_handler+0x12c/0x318
 el0_sync+0x158/0x180

The buggy address belongs to the variable:
 div_table.63646+0x34/0xfffffffffffffa40 [dwmac_meson8b]

Memory state around the buggy address:
 ffffa00009061480: fa fa fa fa 00 00 00 01 fa fa fa fa 00 00 00 00
 ffffa00009061500: 05 fa fa fa fa fa fa fa 00 04 fa fa fa fa fa fa
>ffffa00009061580: 00 03 fa fa fa fa fa fa 00 00 00 00 00 00 fa fa
                                                             ^
 ffffa00009061600: fa fa fa fa 00 01 fa fa fa fa fa fa 01 fa fa fa
 ffffa00009061680: fa fa fa fa 00 01 fa fa fa fa fa fa 04 fa fa fa
==================================================================

Digging into this indeed shows that the clock divider array is
lacking a final fence, and that the clock subsystems goes in the
weeds. Oh well.

Let's add the empty structure that indicates the end of the array.

Fixes: bd6f485 ("net: stmmac: dwmac-meson8b: Fix the RGMII TX delay on Meson8b/8m2 SoCs")
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Reviewed-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
popcornmix pushed a commit that referenced this pull request May 4, 2020
[ Upstream commit f0212a5 ]

Running with KASAN on a VIM3L systems leads to the following splat
when probing the Ethernet device:

==================================================================
BUG: KASAN: global-out-of-bounds in _get_maxdiv+0x74/0xd8
Read of size 4 at addr ffffa000090615f4 by task systemd-udevd/139
CPU: 1 PID: 139 Comm: systemd-udevd Tainted: G            E     5.7.0-rc1-00101-g8624b7577b9c #781
Hardware name: amlogic w400/w400, BIOS 2020.01-rc5 03/12/2020
Call trace:
 dump_backtrace+0x0/0x2a0
 show_stack+0x20/0x30
 dump_stack+0xec/0x148
 print_address_description.isra.12+0x70/0x35c
 __kasan_report+0xfc/0x1d4
 kasan_report+0x4c/0x68
 __asan_load4+0x9c/0xd8
 _get_maxdiv+0x74/0xd8
 clk_divider_bestdiv+0x74/0x5e0
 clk_divider_round_rate+0x80/0x1a8
 clk_core_determine_round_nolock.part.9+0x9c/0xd0
 clk_core_round_rate_nolock+0xf0/0x108
 clk_hw_round_rate+0xac/0xf0
 clk_factor_round_rate+0xb8/0xd0
 clk_core_determine_round_nolock.part.9+0x9c/0xd0
 clk_core_round_rate_nolock+0xf0/0x108
 clk_core_round_rate_nolock+0xbc/0x108
 clk_core_set_rate_nolock+0xc4/0x2e8
 clk_set_rate+0x58/0xe0
 meson8b_dwmac_probe+0x588/0x72c [dwmac_meson8b]
 platform_drv_probe+0x78/0xd8
 really_probe+0x158/0x610
 driver_probe_device+0x140/0x1b0
 device_driver_attach+0xa4/0xb0
 __driver_attach+0xcc/0x1c8
 bus_for_each_dev+0xf4/0x168
 driver_attach+0x3c/0x50
 bus_add_driver+0x238/0x2e8
 driver_register+0xc8/0x1e8
 __platform_driver_register+0x88/0x98
 meson8b_dwmac_driver_init+0x28/0x1000 [dwmac_meson8b]
 do_one_initcall+0xa8/0x328
 do_init_module+0xe8/0x368
 load_module+0x3300/0x36b0
 __do_sys_finit_module+0x120/0x1a8
 __arm64_sys_finit_module+0x4c/0x60
 el0_svc_common.constprop.2+0xe4/0x268
 do_el0_svc+0x98/0xa8
 el0_svc+0x24/0x68
 el0_sync_handler+0x12c/0x318
 el0_sync+0x158/0x180

The buggy address belongs to the variable:
 div_table.63646+0x34/0xfffffffffffffa40 [dwmac_meson8b]

Memory state around the buggy address:
 ffffa00009061480: fa fa fa fa 00 00 00 01 fa fa fa fa 00 00 00 00
 ffffa00009061500: 05 fa fa fa fa fa fa fa 00 04 fa fa fa fa fa fa
>ffffa00009061580: 00 03 fa fa fa fa fa fa 00 00 00 00 00 00 fa fa
                                                             ^
 ffffa00009061600: fa fa fa fa 00 01 fa fa fa fa fa fa 01 fa fa fa
 ffffa00009061680: fa fa fa fa 00 01 fa fa fa fa fa fa 04 fa fa fa
==================================================================

Digging into this indeed shows that the clock divider array is
lacking a final fence, and that the clock subsystems goes in the
weeds. Oh well.

Let's add the empty structure that indicates the end of the array.

Fixes: bd6f485 ("net: stmmac: dwmac-meson8b: Fix the RGMII TX delay on Meson8b/8m2 SoCs")
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Reviewed-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants