
fix: Make neqo pass amplificationlimit QNS test #1875

Merged · 27 commits · Aug 1, 2024

Conversation

larseggert (Collaborator)

Fixes #1183

Resolved review threads:

* neqo-transport/src/connection/mod.rs
* neqo-transport/src/connection/tests/handshake.rs
* neqo-transport/src/recovery.rs
* qns/interop.sh
larseggert marked this pull request as ready for review May 6, 2024 09:24
larseggert (Collaborator, Author)

One test not passing, see #1490 (comment)

larseggert linked an issue May 6, 2024 that may be closed by this pull request

github-actions bot commented May 7, 2024

Failed, Succeeded, and Unsupported Interop Tests: QUIC Interop Runner, client vs. server, for neqo-latest as client and as server.

larseggert (Collaborator, Author)

@martinthomson @KershawChang if either of you understands the timer code, I'd appreciate some help here. The idle_timeout_crazy_rtt test persistently crashes with `assertion failed: earliest > now`, which seems to be a long-standing issue.

It's straightforward to reproduce:

# RUST_LOG=trace cargo nextest run --no-capture idle_timeout_crazy_rtt
...
0s 35ms DEBUG [idle_timeout_crazy_rtt] processing Server c11688b68f20d6a8b4f512e565e90a
0s 35ms TRACE [Server c11688b68f20d6a8b4f512e565e90a] process_output Handshaking Instant { tv_sec: 5648807, tv_nsec: 122490916 }
0s 35ms TRACE IdleTimeout::expiry@Instant { tv_sec: 5648807, tv_nsec: 122490916 } pto=325ms, ka=false => Instant { tv_sec: 610448805, tv_nsec: 622490916 }
0s 35ms TRACE [LossRecovery] timeout Instant { tv_sec: 5648807, tv_nsec: 122490916 }
0s 35ms TRACE detect lost ap: now=Instant { tv_sec: 5648807, tv_nsec: 122490916 } delay=112.5ms
0s 35ms TRACE detect lost hs: now=Instant { tv_sec: 5648807, tv_nsec: 122490916 } delay=112.5ms
0s 35ms TRACE lost=11, time sent Instant { tv_sec: 5648806, tv_nsec: 822490916 } is before lost_delay 112.5ms
0s 35ms DEBUG packet_lost this=0x600002f741e0, pn=11, ps=675
0s 35ms DEBUG [NewReno CongCtrl 662/2674 ssthresh 2674] Cong event -> recovery; cwnd 2674, ssthresh 2674
0s 35ms DEBUG [NewReno CongCtrl 662/2674 ssthresh 2674] state -> RecoveryStart
0s 35ms DEBUG on_packets_lost this=0x600002f741e0, bytes_in_flight=662, cwnd=2674, state=RecoveryStart
0s 35ms TRACE detect lost in: now=Instant { tv_sec: 5648807, tv_nsec: 122490916 } delay=112.5ms
0s 35ms TRACE lost=11, time sent Instant { tv_sec: 5648806, tv_nsec: 822490916 } is before lost_delay 112.5ms
0s 35ms DEBUG packet_lost this=0x600002f741e0, pn=11, ps=662
0s 35ms DEBUG on_packets_lost this=0x600002f741e0, bytes_in_flight=0, cwnd=2674, state=RecoveryStart
0s 35ms DEBUG [Server c11688b68f20d6a8b4f512e565e90a] Lost: Crypto(CryptoRecoveryToken { space: Handshake, offset: 0, length: 628 })
0s 35ms INFO Lost crypto frame space=hs offset=0 length=628
0s 35ms DEBUG [Server c11688b68f20d6a8b4f512e565e90a] Lost: Crypto(CryptoRecoveryToken { space: Initial, offset: 0, length: 90 })
0s 35ms INFO Lost crypto frame space=in offset=0 length=90
0s 35ms TRACE [Server c11688b68f20d6a8b4f512e565e90a] output Instant { tv_sec: 5648807, tv_nsec: 122490916 }
0s 35ms DEBUG [LossRecovery] get send profile Instant { tv_sec: 5648807, tv_nsec: 122490916 }
0s 35ms DEBUG [Server c11688b68f20d6a8b4f512e565e90a] output_path send_profile SendProfile { limit: 255, pto: None, probe: (), paced: false }
0s 35ms DEBUG Building Initial dcid 0900000000c1f2c64a scid 0800000000c32312
0s 35ms DEBUG Building Handshake dcid 0900000000c1f2c64a scid 0800000000c32312
0s 35ms DEBUG Building Short dcid 0900000000c1f2c64a
0s 35ms DEBUG TX blocked, profile=SendProfile { limit: 255, pto: None, probe: (), paced: false }
0s 35ms TRACE [Server c11688b68f20d6a8b4f512e565e90a] Get callback delay Instant { tv_sec: 5648807, tv_nsec: 122490916 }
0s 35ms TRACE ack_time for ap = None
0s 35ms TRACE ack_time for hs = None
0s 35ms TRACE ack_time for in = None
0s 35ms TRACE IdleTimeout::expiry@Instant { tv_sec: 5648807, tv_nsec: 122490916 } pto=325ms, ka=false => Instant { tv_sec: 610448805, tv_nsec: 622490916 }
0s 35ms TRACE [Server c11688b68f20d6a8b4f512e565e90a] Idle/keepalive timer Instant { tv_sec: 610448805, tv_nsec: 622490916 }
0s 35ms TRACE [LossRecovery] next_timeout loss=None pto=Some(Instant { tv_sec: 5648807, tv_nsec: 122490916 })
0s 35ms TRACE [Server c11688b68f20d6a8b4f512e565e90a] Loss recovery timer Instant { tv_sec: 5648807, tv_nsec: 122490916 }
thread 'idle_timeout_crazy_rtt' panicked at neqo-transport/src/connection/mod.rs:1068:9:
assertion failed: earliest > now
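
For context, a minimal sketch of the pattern that fires, with an illustrative timer API (this is not neqo's actual code): the connection gathers its candidate timers, picks the earliest, and asserts it is still in the future before converting it into a callback delay. With the extreme simulated RTT, the loss-recovery timer can already be at or before `now`.

```rust
use std::time::{Duration, Instant};

// Illustrative only: `timers` stands in for the idle, ACK, and
// loss-recovery deadlines the connection tracks.
fn next_delay(timers: &[Instant], now: Instant) -> Duration {
    let earliest = *timers.iter().min().expect("at least one timer armed");
    // This is the kind of check that panics above: it assumes the
    // earliest deadline has not already passed.
    debug_assert!(earliest > now);
    earliest.saturating_duration_since(now)
}
```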

KershawChang (Collaborator) commented May 7, 2024

> @martinthomson @KershawChang if either of you understands the timer code, I'd appreciate some help here. The idle_timeout_crazy_rtt test persistently crashes with `assertion failed: earliest > now`, which seems to be a long-standing issue.

I think we need to implement what Martin suggested in this comment.
I'll try to give it a shot.
In order to unblock this, maybe we can remove that debug assertion first?

larseggert (Collaborator, Author) commented May 7, 2024

> In order to unblock this, maybe we can remove that debug assertion first?

Then it dies here:

thread 'idle_timeout_crazy_rtt' panicked at test-fixture/src/sim/mod.rs:202:25:
assertion `left != right` failed
  left: 0ns
 right: 0ns

And when I take that out, too, it spins forever.
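
A plausible reading of the spin, sketched with a simplified event loop (assumed names, not the actual test-fixture code): if the connection requests a zero callback delay, the simulator's virtual clock never advances, so removing the non-zero assertion trades a panic for an infinite loop.

```rust
use std::time::{Duration, Instant};

// Simplified simulator loop: advance the virtual clock by whatever
// delay the connection asks for next.
fn run(mut now: Instant, mut next_callback: impl FnMut(Instant) -> Duration) {
    loop {
        let delay = next_callback(now);
        // Analogous to the failed assertion at test-fixture/src/sim/mod.rs:202:
        // a zero delay means `now` never moves and this loop spins forever.
        assert_ne!(delay, Duration::ZERO);
        now += delay;
    }
}
```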


github-actions bot commented May 7, 2024

Firefox builds for this PR

The following builds are available for testing. Crossed-out builds did not succeed.

larseggert (Collaborator, Author)

Making this a draft until the test failure is fixed.

larseggert marked this pull request as draft May 31, 2024 12:50

codecov bot commented Jun 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.00%. Comparing base (b22512c) to head (1336572).
Report is 4 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1875   +/-   ##
=======================================
  Coverage   95.00%   95.00%           
=======================================
  Files         112      112           
  Lines       36364    36394   +30     
=======================================
+ Hits        34546    34577   +31     
+ Misses       1818     1817    -1     


larseggert marked this pull request as ready for review June 18, 2024 11:00

github-actions bot commented Jun 18, 2024

Benchmark results

Performance differences relative to 67bd43b.

coalesce_acked_from_zero 1+1 entries: No change in performance detected.
       time:   [190.84 ns 191.36 ns 191.94 ns]
       change: [-0.6924% -0.2954% +0.1953%] (p = 0.21 > 0.05)

Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) high mild
2 (2.00%) high severe

coalesce_acked_from_zero 3+1 entries: Change within noise threshold.
       time:   [233.03 ns 233.52 ns 234.06 ns]
       change: [+0.1246% +0.4227% +0.7165%] (p = 0.01 < 0.05)

Found 10 outliers among 100 measurements (10.00%)
10 (10.00%) high severe

coalesce_acked_from_zero 10+1 entries: No change in performance detected.
       time:   [232.09 ns 232.85 ns 233.76 ns]
       change: [-0.7160% -0.1483% +0.3799%] (p = 0.60 > 0.05)

Found 8 outliers among 100 measurements (8.00%)
2 (2.00%) high mild
6 (6.00%) high severe

coalesce_acked_from_zero 1000+1 entries: No change in performance detected.
       time:   [217.17 ns 226.00 ns 245.59 ns]
       change: [+0.4699% +2.6386% +6.2690%] (p = 0.08 > 0.05)

Found 12 outliers among 100 measurements (12.00%)
2 (2.00%) high mild
10 (10.00%) high severe

RxStreamOrderer::inbound_frame(): 💔 Performance has regressed.
       time:   [120.39 ms 120.53 ms 120.78 ms]
       change: [+1.5859% +1.7501% +1.9817%] (p = 0.00 < 0.05)

Found 14 outliers among 100 measurements (14.00%)
5 (5.00%) low mild
7 (7.00%) high mild
2 (2.00%) high severe

transfer/pacing-false/varying-seeds: Change within noise threshold.
       time:   [37.250 ms 39.312 ms 41.415 ms]
       change: [-14.128% -7.7901% -1.0772%] (p = 0.02 < 0.05)

Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild

transfer/pacing-true/varying-seeds: No change in performance detected.
       time:   [54.310 ms 57.384 ms 60.386 ms]
       change: [-8.9755% -2.4794% +4.4966%] (p = 0.50 > 0.05)

Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) low mild
1 (1.00%) high mild

transfer/pacing-false/same-seed: No change in performance detected.
       time:   [47.344 ms 49.264 ms 51.102 ms]
       change: [-4.8322% +0.0803% +5.5773%] (p = 0.97 > 0.05)

Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) low mild

transfer/pacing-true/same-seed: No change in performance detected.
       time:   [66.303 ms 73.430 ms 80.605 ms]
       change: [-0.3494% +15.143% +32.082%] (p = 0.05 > 0.05)

1-conn/1-100mb-resp (aka. Download)/client: No change in performance detected.
       time:   [258.05 ms 268.21 ms 277.82 ms]
       thrpt:  [359.94 MiB/s 372.85 MiB/s 387.53 MiB/s]
change:
       time:   [-0.5393% +5.7654% +12.104%] (p = 0.07 > 0.05)
       thrpt:  [-10.797% -5.4511% +0.5422%]

Found 22 outliers among 100 measurements (22.00%)
20 (20.00%) low severe
1 (1.00%) low mild
1 (1.00%) high mild

1-conn/10_000-parallel-1b-resp (aka. RPS)/client: No change in performance detected.
       time:   [413.12 ms 416.64 ms 420.16 ms]
       thrpt:  [23.801 Kelem/s 24.002 Kelem/s 24.206 Kelem/s]
change:
       time:   [-1.2584% -0.0342% +1.2301%] (p = 0.95 > 0.05)
       thrpt:  [-1.2151% +0.0342% +1.2745%]

Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) low mild
1 (1.00%) high mild

1-conn/1-1b-resp (aka. HPS)/client: No change in performance detected.
       time:   [66.876 ms 67.159 ms 67.479 ms]
       thrpt:  [14.819  elem/s 14.890  elem/s 14.953  elem/s]
change:
       time:   [-0.4509% +0.2527% +0.8850%] (p = 0.48 > 0.05)
       thrpt:  [-0.8772% -0.2521% +0.4530%]

Found 5 outliers among 100 measurements (5.00%)
1 (1.00%) low mild
2 (2.00%) high mild
2 (2.00%) high severe

Client/server transfer results

Transfer of 33554432 bytes over loopback.

| Client | Server | CC    | Pacing | Mean [ms]     | Min [ms] | Max [ms] | Relative |
|--------|--------|-------|--------|---------------|----------|----------|----------|
| msquic | msquic |       |        | 177.5 ± 119.8 | 101.8    | 603.9    | 1.00     |
| neqo   | msquic | reno  | on     | 299.9 ± 59.4  | 255.6    | 457.2    | 1.00     |
| neqo   | msquic | reno  |        | 289.1 ± 67.6  | 252.8    | 479.7    | 1.00     |
| neqo   | msquic | cubic | on     | 285.2 ± 29.3  | 260.7    | 354.5    | 1.00     |
| neqo   | msquic | cubic |        | 262.5 ± 8.1   | 252.9    | 279.0    | 1.00     |
| msquic | neqo   | reno  | on     | 225.3 ± 85.2  | 157.9    | 421.0    | 1.00     |
| msquic | neqo   | reno  |        | 171.9 ± 70.7  | 118.0    | 395.9    | 1.00     |
| msquic | neqo   | cubic | on     | 171.1 ± 56.0  | 141.8    | 351.3    | 1.00     |
| msquic | neqo   | cubic |        | 229.1 ± 81.7  | 141.6    | 399.7    | 1.00     |
| neqo   | neqo   | reno  | on     | 275.7 ± 120.3 | 174.5    | 533.0    | 1.00     |
| neqo   | neqo   | reno  |        | 212.4 ± 75.4  | 150.7    | 392.3    | 1.00     |
| neqo   | neqo   | cubic | on     | 261.9 ± 114.6 | 156.4    | 569.6    | 1.00     |
| neqo   | neqo   | cubic |        | 183.3 ± 11.6  | 166.3    | 202.2    | 1.00     |

mozilla deleted a comment from github-actions bot Jun 18, 2024
larseggert (Collaborator, Author)

Removing the early return makes it fail the amplificationlimit test again, though. Sigh.

larseggert (Collaborator, Author)

This again is waiting for a fix to the idle_timeout_crazy_rtt test failure.

KershawChang (Collaborator)

I've given this another look, and I found that the problem might be about when the function `discard_packet` is called:

pub fn discard_packet(&mut self, sent: &SentPacket, now: Instant, stats: &mut Stats) {

Before this change, the server was able to update the RTT to the value set here:

0s326ms INFO [pri-unv-path:090000000037faf420 [fe80::1]:443->[fe80::1]:443] discarding a packet without an RTT estimate; guessing RTT=2999654.700000001s

With this change, the server can only update the RTT to a small value:

0s202ms INFO [pri-unv-path:0800000000b3b9a6 [fe80::1]:443->[fe80::1]:443] discarding a packet without an RTT estimate; guessing RTT=300.000001ms

Unfortunately, I still can't figure out the correct way to fix this.
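
A sketch of the behavior those two log lines suggest (simplified signature, not neqo's actual implementation): when a packet is discarded before any RTT sample exists, the elapsed time since it was sent serves as a coarse RTT guess, so moving the call site changes the guess from roughly 3,000,000 s to roughly 300 ms.

```rust
use std::time::Instant;

struct SentPacket {
    time_sent: Instant,
}

// Hypothetical shape: the real function also takes `&mut Stats` and
// lives on the path/RTT state.
fn discard_packet(sent: &SentPacket, now: Instant, have_rtt_estimate: bool) {
    if !have_rtt_estimate {
        // The guess is simply how long ago the discarded packet was sent,
        // which is why *when* this is called matters so much.
        let guess = now - sent.time_sent;
        println!("discarding a packet without an RTT estimate; guessing RTT={guess:?}");
    }
}
```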

larseggert (Collaborator, Author)

> With this change, the server can only update the RTT to a small value:
>
> 0s202ms INFO [pri-unv-path:0800000000b3b9a6 [fe80::1]:443->[fe80::1]:443] discarding a packet without an RTT estimate; guessing RTT=300.000001ms

I don't see this log line when I run the test for this PR (I do see the one for the main branch).

> Unfortunately, I still can't figure out the correct way to fix this.

What seems to happen is that `set_valid()` is never called on the server side for the initial path, because this patch moves that call to after the path is actually validated. So the server stays amplification-limited. I'm not sure how that would affect the timers, though.
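
For reference, the rule the amplificationlimit test exercises (RFC 9000, Section 8), as a minimal sketch with illustrative names rather than neqo's API: until the path is validated, a server may send at most three times the bytes it has received on that path.

```rust
// Illustrative only; neqo tracks this kind of state on its path type.
struct ServerPath {
    validated: bool,
    bytes_received: usize,
    bytes_sent: usize,
}

impl ServerPath {
    /// How many more bytes the server may send on this path right now.
    fn send_allowance(&self) -> usize {
        if self.validated {
            usize::MAX // no anti-amplification limit once validated
        } else {
            // At most 3x the bytes received before validation.
            (3 * self.bytes_received).saturating_sub(self.bytes_sent)
        }
    }
}
```

Keeping the server under this limit is also why arming a PTO on an unvalidated path is problematic (see the linked issue "Disable PTO at server if path is not validated"): the PTO may fire at a moment when the allowance is zero.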

larseggert force-pushed the fix-1183 branch 2 times, most recently from 2fb43fc to f574643 July 17, 2024 05:42
larseggert (Collaborator, Author)

Some QNS L1/C1 tests are failing because #1981 is needed.

larseggert added the needs-review (Needs a review by a CODEOWNER) label Jul 17, 2024
mxinden (Collaborator) left a comment

Change makes sense to me. One minor comment.

Resolved review thread: neqo-transport/src/recovery/mod.rs
larseggert added a commit to larseggert/neqo that referenced this pull request Jul 19, 2024
* Cancel runs after 20 min, and report that that happened
* Build NSS from source, to make SSLKEYLOGFILE work
* Shorten retention for some artifacts
* Run interop server at `debug` log level

Factored out of mozilla#1875, which was a convenient test case.
larseggert (Collaborator, Author)

So this seems to work, in the sense that there are no more amplificationlimit (A) failures when we are the server.

But maybe there are also more L1 and C1 failures now, and possibly more frequent deadlocks (= runs that get cancelled by the timeout)?

It's difficult to say something definitive here, since we have had these occur randomly all along, and they may also not be due to bugs on our end.

Review threads on neqo-transport/src/recovery/mod.rs (resolved)
larseggert added this pull request to the merge queue Aug 1, 2024
Merged via the queue into mozilla:main with commit c80630b Aug 1, 2024
54 of 56 checks passed
larseggert deleted the fix-1183 branch August 1, 2024 09:22
larseggert removed the needs-review (Needs a review by a CODEOWNER) label Aug 20, 2024
Successfully merging this pull request may close these issues:

* Mark amplificationlimit QNS test as unsupported
* Disable PTO at server if path is not validated