scheduler: "penalise" some subflows by sending less than their cwnd #345

matttbe · 2023-02-01T17:15:12Z

Resources might be limited at the MPTCP level (sending/receiving window). Also, some terrible subflows can badly impact the performance of the other MPTCP subflows.

MPTCP has a view of all the different subflows and it can tell which subflow is "bad" according to different criteria: high latency, losses, with bufferbloat, unstable, stale, etc.
The packet scheduler should then use the limited resources the best way and not just fill the cwnd of all subflows!

Such an optimisation is in place in mptcp.org, see mptcp_rcv_buf_optimization(). In mptcp.org, the cwnd of the slow flows is halved max once per RTT but it looks like a hack: it is strange to modify this without telling the CC and the CC will potentially reset the modification quickly after.

Instead, the idea here would be to keep some state per subflow (similar to #342) in order to send only a fraction of the cwnd. It might be necessary to do more than just halving the cwnd. Keeping a shift and a multiplier is probably enough (for the moment): (cwnd >> x) * y, x and y being u8 values. (Maybe new hooks for new schedulers will be needed, but that will be evaluated later, in a different ticket.)
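
To make the (cwnd >> x) * y idea concrete, here is a minimal userspace sketch. It is not kernel code: the struct and function names are hypothetical, and capping the result at cwnd and treating a shift of 32 or more as "send nothing" are assumptions added here, not something decided in this ticket.

```c
#include <stdint.h>

/* Hypothetical per-subflow state; field names are illustrative only. */
struct penalty {
	uint8_t shift; /* x in (cwnd >> x) * y */
	uint8_t mul;   /* y in (cwnd >> x) * y */
};

/*
 * Number of packets the scheduler may keep in flight on a subflow.
 * "No penalisation" is shift = 0, mul = 1. The result is capped at
 * cwnd (an assumption of this sketch) so the scheduler never exceeds
 * what the CC allows; cwnd itself is left untouched.
 */
static uint32_t penalised_cwnd(uint32_t cwnd, struct penalty p)
{
	uint64_t limit;

	/* Shifting a 32-bit value by 32 or more is undefined in C. */
	if (p.shift >= 32)
		return 0;

	/* 64-bit product so the multiplication cannot overflow. */
	limit = (uint64_t)(cwnd >> p.shift) * p.mul;
	return limit < cwnd ? (uint32_t)limit : cwnd;
}
```

With cwnd = 40, x = 2 and y = 3 this gives (40 >> 2) * 3 = 30, i.e. three quarters of the cwnd. Note that the shift happens first, so small windows round down: with cwnd = 3 and the same x and y the result is 0.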

It is also important to reset the penalisation at some point. Would it be handled by the core (after each burst?) or by the scheduler (e.g. checking after each received ACK and once per RTT whether it is still needed)?
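
As an illustration of the scheduler-side option only (checking on each received ACK, at most once per RTT), a rough userspace sketch could look like the following. All names are hypothetical, and `still_bad` is a stand-in for whatever criteria the scheduler would use to judge a subflow.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-subflow state; all names are illustrative only. */
struct penalty_state {
	uint8_t  shift;         /* x in (cwnd >> x) * y */
	uint8_t  mul;           /* y in (cwnd >> x) * y */
	uint64_t last_check_us; /* time of the last re-evaluation */
};

/*
 * Called by the scheduler for each received ACK. At most once per RTT,
 * re-evaluate the subflow and clear the penalisation if it is no longer
 * needed. Returns true if the state was set back to "no penalisation".
 */
static bool penalty_on_ack(struct penalty_state *st, uint64_t now_us,
			   uint32_t srtt_us, bool still_bad)
{
	if (now_us - st->last_check_us < srtt_us)
		return false; /* less than one RTT since the last check */

	st->last_check_us = now_us;
	if (still_bad)
		return false; /* keep the current penalisation */

	st->shift = 0;
	st->mul = 1;
	return true;
}
```

The alternative raised above, a reset done by the core after each burst, would not need the timestamp, but the scheduler would then have to re-apply the penalisation each time.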

The default scheduler should do that and new ones should be able to change the default behaviour.


sferlin · 2023-05-03T08:03:01Z

To the comment "The packet scheduler should then use the limited resources the best way and not just fill the cwnd of all subflows!"
This is the design of some schedulers, e.g., BLEST.

"the cwnd of the slow flows is halved max once per RTT but it looks like a hack: it is strange to modify this without telling the CC and the CC will potentially reset the modification quickly after."
This is the behaviour of the penalisation&retransmission algorithm, correct? IMHO, CWND change is the job of the CC and not from an outer loop to interfere in the operation. That said, the scheduler has to make better predictions about the situation of each subflow and schedule data in the CWND space offered by the CC.

I opt to completely remove the penalisation&retransmission loop from the scheduler and CC operation altogether in MPTCP. This has been done while working on scheduler algorithms, e.g., BLEST. If this is not possible, I would suggest limiting its operation to when minRTT is selected, as this was the default scheduler when P&R was designed. Other schedulers that came afterwards (whether default or not) often did not consider P&R operation in their loops.

matttbe · 2023-06-02T15:54:30Z

"the cwnd of the slow flows is halved max once per RTT but it looks like a hack: it is strange to modify this without telling the CC and the CC will potentially reset the modification quickly after."

"This is the behaviour of the penalisation&retransmission algorithm, correct?"

@sferlin that's the behaviour of the out-of-tree kernel. We don't do that in the upstream kernel.

"IMHO, CWND change is the job of the CC and not from an outer loop to interfere in the operation. That said, the scheduler has to make better predictions about the situation of each subflow and schedule data in the CWND space offered by the CC."

Yes, we agree on that! That's why we would prefer not to modify the CWND from the packet scheduler, but only use a part of it (so at least saving which part of the CWND we are using, plus maybe a timestamp).
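
A rough userspace sketch of "use a part of the CWND without modifying it" is shown below. It is not the upstream implementation: every name is hypothetical, the fraction is stored as the shift and multiplier proposed earlier in this ticket, and the timestamp is only stored, not acted upon.

```c
#include <stdint.h>

/* Hypothetical view of a subflow; all names are illustrative only. */
struct subflow_view {
	uint32_t cwnd;            /* owned by the CC, read-only for the scheduler */
	uint32_t in_flight;       /* packets sent and not yet acknowledged */
	uint8_t  shift;           /* scheduler-owned: x in (cwnd >> x) * y */
	uint8_t  mul;             /* scheduler-owned: y in (cwnd >> x) * y */
	uint64_t penalised_at_us; /* when the fraction was last changed */
};

/*
 * How many more packets the scheduler may send on this subflow right
 * now. Only the scheduler-owned fields decide the limit; the cwnd is
 * read but never written.
 */
static uint32_t sched_budget(const struct subflow_view *sf)
{
	uint64_t limit = 0;

	if (sf->shift < 32) /* larger shifts are undefined in C */
		limit = (uint64_t)(sf->cwnd >> sf->shift) * sf->mul;
	if (limit > sf->cwnd)
		limit = sf->cwnd; /* never exceed what the CC allows */

	return limit > sf->in_flight ? (uint32_t)limit - sf->in_flight : 0;
}
```

Because the CC never sees a modified cwnd, it cannot "reset" the penalisation the way it could with the halving done in the out-of-tree kernel.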

"I opt to completely remove the penalisation&retransmission loop from the scheduler and CC operation altogether in MPTCP. This has been done while working on scheduler algorithms, e.g., BLEST."

In the upstream kernel, we currently don't do that. This differs from the out-of-tree kernel, where the BLEST scheduler also does it, because it uses mptcp_next_segment() like the default scheduler and mptcp_rcv_buf_optimization() is called from there. I think we need a way to limit the utilisation of one subflow. Currently, in the upstream kernel with a BLEST-like implementation (best to ask Paolo for more details :-) ), we are impacted by subflows taking all the resources, e.g. subflows with very high latency or losses.






matttbe added enhancement sched packets scheduler labels Feb 1, 2023

This was referenced Feb 1, 2023

scheduler: API changes (tasks) #350 (Open)

BPF: packet scheduler #75 (Open)

matttbe moved this to Needs triage in MPTCP Upstream: Future Feb 22, 2023

matttbe added this to MPTCP Upstream: Future Feb 22, 2023

This was referenced Mar 29, 2023

Why does total throughput increase when the delay on one of the paths increases? #381 (Closed)

TCP performance better than MPTCP #307 (Closed)

daire-byrne mentioned this issue Sep 5, 2023

mptcp vs tcp performance over long fat networks #437 (Closed)
