net/tcp: Add sockopt to allow TIME_WAIT reuse #3

dhobsd · 2014-12-27T00:30:18Z

When both the requesting socket and the socket found in TIME_WAIT state
are requesting to reuse TIME_WAIT-state sockets, allow us to find the TW
socket as unique.

Previously, this was gated on a global sysctl, which is not ideal for our
use case.

Add a userspace visible knob to tell the VM to keep an extra amount of memory free, by increasing the gap between each zone's min and low watermarks. This is useful for realtime applications that call system calls and have a bound on the number of allocations that happen in any short time period. In this application, extra_free_kbytes would be left at an amount equal to or larger than the maximum number of allocations that happen in any burst. It may also be useful to reduce the memory use of virtual machines (temporarily?), in a way that does not cause memory fragmentation like ballooning does.

dormando · 2014-12-27T00:33:39Z

How does this not break NAT in the same way tw_reuse always does?

dhobsd · 2014-12-27T00:47:34Z

It doesn't avoid that, unless you only use it as a client, in which case it's probably always fine. The intent of the sockopt is for us to only be using it as a client, when we connect out to origins; @crucially seemed to think this would be OK behavior.

dhobsd · 2014-12-27T01:35:55Z

Oh, it will also only reuse the socket if both the discovered TW sock and the connecting sock agree to reuse. So we won't reuse a socket in TW state that happens to belong to some connection where this would be problematic, or when our connecting socket belongs to an application that doesn't want to do this. Since both sockets had to request this behavior when they were created, this differs from tw_reuse in that it is not global and will not have any impact on applications that do not want to risk it.

I do think that our very low tw timer value (5 seconds) is low enough that we would probably already break old NAT gateways where there's enough loss that they'd be retransmitting; 5 seconds is way under what such a gateway would consider 2*MSL (which is probably always going to be 120s).

So those are the reasons I think it's a safe change.
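For readers following the thread, the opt-in rule described in the comment above can be sketched in a few lines of C. This is only an illustrative model, not the patch itself: `struct sock_model`, `wants_tw_reuse`, and both function names are hypothetical, and the real kernel check involves further conditions that are not modeled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a socket: only the field relevant to this rule. */
struct sock_model {
    bool wants_tw_reuse; /* assumed to be set through the per-socket option */
};

/* Old gating as described in the PR: one global switch covers every socket. */
static bool may_reuse_global(bool sysctl_tw_reuse)
{
    return sysctl_tw_reuse;
}

/*
 * New gating as described in the PR: the connecting socket and the socket
 * found in TIME_WAIT must both have requested reuse.
 */
static bool may_reuse_per_socket(const struct sock_model *connecting,
                                 const struct sock_model *time_wait)
{
    assert(connecting != NULL && time_wait != NULL);
    return connecting->wants_tw_reuse && time_wait->wants_tw_reuse;
}
```

With this model, a TIME_WAIT socket left behind by an application that never set the option is never reused, whatever the connecting socket asks for.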

commit db93fac upstream.

This patch is to fix two deadlock cases.

Deadlock 1:
CPU #1
 pinctrl_register-> pinctrl_get ->
 create_pinctrl
 (Holding lock pinctrl_maps_mutex)
 -> get_pinctrl_dev_from_devname
 (Trying to acquire lock pinctrldev_list_mutex)
CPU #0
 pinctrl_unregister
 (Holding lock pinctrldev_list_mutex)
 -> pinctrl_put ->> pinctrl_free ->
 pinctrl_dt_free_maps -> pinctrl_unregister_map
 (Trying to acquire lock pinctrl_maps_mutex)

Simply to say
CPU#1 is holding lock A and trying to acquire lock B,
CPU#0 is holding lock B and trying to acquire lock A.

Deadlock 2:
CPU #3
 pinctrl_register-> pinctrl_get ->
 create_pinctrl
 (Holding lock pinctrl_maps_mutex)
 -> get_pinctrl_dev_from_devname
 (Trying to acquire lock pinctrldev_list_mutex)
CPU #2
 pinctrl_unregister
 (Holding lock pctldev->mutex)
 -> pinctrl_put ->> pinctrl_free ->
 pinctrl_dt_free_maps -> pinctrl_unregister_map
 (Trying to acquire lock pinctrl_maps_mutex)
CPU #0
 tegra_gpio_request
 (Holding lock pinctrldev_list_mutex)
 -> pinctrl_get_device_gpio_range
 (Trying to acquire lock pctldev->mutex)

Simply to say
CPU#3 is holding lock A and trying to acquire lock D,
CPU#2 is holding lock B and trying to acquire lock A,
CPU#0 is holding lock D and trying to acquire lock B.

Signed-off-by: Jim Lin <jilin@nvidia.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
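Both cases in the commit message above are cycles in a "holds X, waits for Y" graph. A minimal sketch of checking such a graph for a cycle follows, assuming each lock is held by at most one CPU. The names `wait_edge`, `has_deadlock`, and the `LOCK_*` labels are made up for illustration and do not come from the pinctrl code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_LOCKS 8

/* Labels matching the letters used in the commit message. */
enum { LOCK_A, LOCK_B, LOCK_D };

/* One edge per CPU: it holds `held` and is blocked waiting for `wanted`. */
struct wait_edge {
    int held;
    int wanted;
};

/*
 * Returns true if the edges form a cycle, i.e. a deadlock.
 * Assumes each lock has at most one holder, so every lock has at most
 * one outgoing edge and a chain can simply be followed from each lock.
 */
static bool has_deadlock(const struct wait_edge *edges, size_t n)
{
    int next[MAX_LOCKS];

    for (int i = 0; i < MAX_LOCKS; i++)
        next[i] = -1; /* -1: the holder of this lock is not waiting */

    for (size_t i = 0; i < n; i++) {
        assert(edges[i].held >= 0 && edges[i].held < MAX_LOCKS);
        assert(edges[i].wanted >= 0 && edges[i].wanted < MAX_LOCKS);
        next[edges[i].held] = edges[i].wanted;
    }

    for (int start = 0; start < MAX_LOCKS; start++) {
        int cur = start;
        /* A cycle through `start` must close within MAX_LOCKS steps. */
        for (int step = 0; step < MAX_LOCKS && cur != -1; step++) {
            cur = next[cur];
            if (cur == start)
                return true;
        }
    }
    return false;
}
```

Feeding it the edges A->B and B->A (Deadlock 1) or A->D, B->A and D->B (Deadlock 2) reports a cycle, while a single edge A->B does not.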

commit ce75145 upstream.

It is possible for ata_sff_flush_pio_task() to set ap->hsm_task_state to HSM_ST_IDLE in between the time __ata_sff_port_intr() checks for HSM_ST_IDLE and before it calls ata_sff_hsm_move() causing ata_sff_hsm_move() to BUG().

This problem is hard to reproduce making this patch hard to verify, but this fix will prevent the race.

I have not been able to reproduce the problem, but here is a crash dump from a 2.6.32 kernel.

On examining the ata port's state, its hsm_task_state field has a value of HSM_ST_IDLE:

crash> struct ata_port.hsm_task_state ffff881c1121c000
  hsm_task_state = 0

Normally, this should not be possible as ata_sff_hsm_move() was called from ata_sff_host_intr(), which checks hsm_task_state and won't call ata_sff_hsm_move() if it has a HSM_ST_IDLE value.

PID: 11053 TASK: ffff8816e846cae0 CPU: 0 COMMAND: "sshd"
 #0 [ffff88008ba03960] machine_kexec at ffffffff81038f3b
 #1 [ffff88008ba039c0] crash_kexec at ffffffff810c5d92
 #2 [ffff88008ba03a90] oops_end at ffffffff8152b510
 #3 [ffff88008ba03ac0] die at ffffffff81010e0b
 #4 [ffff88008ba03af0] do_trap at ffffffff8152ad74
 #5 [ffff88008ba03b50] do_invalid_op at ffffffff8100cf95
 #6 [ffff88008ba03bf0] invalid_op at ffffffff8100bf9b
    [exception RIP: ata_sff_hsm_move+317]
    RIP: ffffffff813a77ad RSP: ffff88008ba03ca0 RFLAGS: 00010097
    RAX: 0000000000000000 RBX: ffff881c1121dc60 RCX: 0000000000000000
    RDX: ffff881c1121dd10 RSI: ffff881c1121dc60 RDI: ffff881c1121c000
    RBP: ffff88008ba03d00 R8: 0000000000000000 R9: 000000000000002e
    R10: 000000000001003f R11: 000000000000009b R12: ffff881c1121c000
    R13: 0000000000000000 R14: 0000000000000050 R15: ffff881c1121dd78
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
 #7 [ffff88008ba03d08] ata_sff_host_intr at ffffffff813a7fbd
 #8 [ffff88008ba03d38] ata_sff_interrupt at ffffffff813a821e
 #9 [ffff88008ba03d78] handle_IRQ_event at ffffffff810e6ec0

dormando and others added 10 commits December 17, 2014 13:23

initcwnd from userspace tunable

e644a6a

Don't change initcwnd's magic number

b53ab0a

fucks us right up, it does.

match the reuseport sk based on the smp_processor_id of the kernel th…

2ebc8b7

…read dealing with the irq save the cpu id so we dont have to go over it again

pass syncookie down and dont do multipath for them

2085c3a

skip the first queue entry from RSS

76b4fe2

accept a second number for TCP_CWND

ef585d0

pivoting to 99 to make future maintenance simpler.

Add TCP_FASTLY_INFO, export nexthop used.

e54595d

Conflicts: net/ipv4/tcp.c

Revert "ipv4: fix race in concurrent ip_route_input_slow()"

e68dc40

This reverts commit dcdfdf5.

dormando force-pushed the fastly314-stable branch from e68dc40 to e60d1e8 Compare April 20, 2015 23:07
