[feature] Improve afring throughput using code optimization #74

Merged



@fako1024 fako1024 commented Sep 18, 2023

@els0r Things done here:

  • Instead of performing an individual unsafe call / cast for every data field, the relevant portion of the TPacketHeaderV3 struct is now cast once (memory-aligned) and then accessed locally
  • Compiler inlining is optimized: functions are either guaranteed to be inlined by the compiler or are inlined manually, at the expense of a little code duplication in the innermost layers
  • The core PPOLL loop and its handling are completely restructured to focus on the fast path (extraction from the currently active block without polling) while improving code readability
  • Some micro-optimizations to assembly code
  • Significant improvements to reproducibility of benchmarks (by means of minimizing scheduler overhead when handling goroutines)
  • Fix NextPayloadInPlace so that it actually does what it's supposed to. It previously performed a zero-copy operation, which is why its performance now looks "worse" in the benchmarks
  • Some minor code cleanup

Benchmarks (on three different systems) follow below. The improvements for the non-zero-copy cases are nice to have, but they mostly serve as a consistency check that the changes are beneficial across the board and introduce no regressions. Most important are the improvements for the ZeroCopy methods, which are in the range of 12-17% depending on the system (10485kiBx4 configuration). Given the level of optimization we're looking at, that is quite a lot:

Columns: benchmark, time/op before, time/op after, change (p-value, sample count)
cpu: Intel(R) Core(TM) i3-2100T CPU @ 2.50GHz
CaptureMethods/NextPacket_10485kiBx4-4             103.4n ± 1%   103.3n ± 1%        ~ (p=0.735 n=50)
CaptureMethods/NextPacketInPlace_10485kiBx4-4      40.20n ± 1%   39.94n ± 1%   -0.67% (p=0.043 n=50)
CaptureMethods/NextPayload_10485kiBx4-4            91.19n ± 1%   88.76n ± 1%   -2.66% (p=0.000 n=50)
CaptureMethods/NextPayloadInPlace_10485kiBx4-4     21.56n ± 1%   32.90n ± 3%  +52.57% (p=0.000 n=50)
CaptureMethods/NextPayloadZeroCopy_10485kiBx4-4    19.73n ± 1%   16.38n ± 0%  -16.98% (n=50)
CaptureMethods/NextIPPacket_10485kiBx4-4           91.75n ± 1%   89.80n ± 1%   -2.14% (p=0.000 n=50)
CaptureMethods/NextIPPacketInPlace_10485kiBx4-4    33.33n ± 2%   33.51n ± 2%        ~ (p=0.945 n=50)
CaptureMethods/NextIPPacketZeroCopy_10485kiBx4-4   19.75n ± 1%   16.68n ± 0%  -15.59% (n=50)
CaptureMethods/NextPacketFn_10485kiBx4-4           20.70n ± 1%   19.52n ± 1%   -5.72% (n=50)
CaptureMethods/NextPacket_10kiBx512-4              211.2n ± 1%   214.0n ± 1%   +1.35% (p=0.046 n=50)
CaptureMethods/NextPacketInPlace_10kiBx512-4       210.8n ± 4%   197.1n ± 2%   -6.48% (p=0.000 n=50)
CaptureMethods/NextPayload_10kiBx512-4             190.8n ± 2%   182.6n ± 1%   -4.32% (p=0.000 n=50)
CaptureMethods/NextPayloadInPlace_10kiBx512-4      200.4n ± 2%   195.5n ± 1%        ~ (p=0.058 n=50)
CaptureMethods/NextPayloadZeroCopy_10kiBx512-4     200.8n ± 2%   189.8n ± 2%   -5.48% (p=0.000 n=50)
CaptureMethods/NextIPPacket_10kiBx512-4            189.4n ± 1%   180.1n ± 1%   -4.91% (p=0.000 n=50)
CaptureMethods/NextIPPacketInPlace_10kiBx512-4     203.0n ± 2%   192.9n ± 2%   -4.98% (p=0.000 n=50)
CaptureMethods/NextIPPacketZeroCopy_10kiBx512-4    199.6n ± 2%   177.3n ± 2%  -11.20% (n=50)
CaptureMethods/NextPacketFn_10kiBx512-4            199.1n ± 2%   198.8n ± 4%        ~ (p=0.168 n=50)
cpu: Intel(R) Celeron(R) N5105 @ 2.00GHz
CaptureMethods/NextPacket_10485kiBx4-4             80.36n ± 0%   79.39n ± 0%   -1.20% (p=0.000 n=50)
CaptureMethods/NextPacketInPlace_10485kiBx4-4      37.06n ± 1%   34.04n ± 1%   -8.16% (p=0.000 n=50)
CaptureMethods/NextPayload_10485kiBx4-4            74.45n ± 0%   68.41n ± 1%   -8.12% (n=50)
CaptureMethods/NextPayloadInPlace_10485kiBx4-4     20.16n ± 0%   29.77n ± 0%  +47.69% (p=0.000 n=50)
CaptureMethods/NextPayloadZeroCopy_10485kiBx4-4    18.32n ± 1%   15.91n ± 0%  -13.15% (n=50)
CaptureMethods/NextIPPacket_10485kiBx4-4           71.47n ± 0%   70.93n ± 0%   -0.75% (p=0.000 n=50)
CaptureMethods/NextIPPacketInPlace_10485kiBx4-4    31.10n ± 1%   31.86n ± 1%   +2.46% (p=0.002 n=50)
CaptureMethods/NextIPPacketZeroCopy_10485kiBx4-4   18.25n ± 0%   16.15n ± 0%  -11.51% (n=50)
CaptureMethods/NextPacketFn_10485kiBx4-4           19.24n ± 0%   21.73n ± 9%        ~ (p=0.051 n=50)
CaptureMethods/NextPacket_10kiBx512-4              128.6n ± 1%   129.5n ± 2%        ~ (p=0.934 n=50)
CaptureMethods/NextPacketInPlace_10kiBx512-4       124.3n ± 0%   122.0n ± 1%   -1.85% (p=0.000 n=50)
CaptureMethods/NextPayload_10kiBx512-4             126.7n ± 1%   122.6n ± 1%   -3.27% (p=0.000 n=50)
CaptureMethods/NextPayloadInPlace_10kiBx512-4      128.9n ± 1%   122.0n ± 1%   -5.39% (p=0.000 n=50)
CaptureMethods/NextPayloadZeroCopy_10kiBx512-4     130.0n ± 1%   126.8n ± 1%   -2.42% (p=0.000 n=50)
CaptureMethods/NextIPPacket_10kiBx512-4            125.0n ± 1%   122.9n ± 1%   -1.64% (p=0.000 n=50)
CaptureMethods/NextIPPacketInPlace_10kiBx512-4     125.0n ± 1%   122.0n ± 1%   -2.44% (p=0.000 n=50)
CaptureMethods/NextIPPacketZeroCopy_10kiBx512-4    129.3n ± 1%   127.3n ± 1%   -1.55% (p=0.000 n=50)
CaptureMethods/NextPacketFn_10kiBx512-4            129.7n ± 1%   124.9n ± 1%   -3.74% (p=0.000 n=50)
cpu: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
CaptureMethods/NextPacket_10485kiBx4-4             54.54n ± 0%   53.58n ± 0%   -1.75% (p=0.000 n=50)
CaptureMethods/NextPacketInPlace_10485kiBx4-4      22.50n ± 1%   22.33n ± 0%   -0.78% (p=0.002 n=50)
CaptureMethods/NextPayload_10485kiBx4-4            48.56n ± 0%   47.34n ± 0%   -2.50% (p=0.000 n=50)
CaptureMethods/NextPayloadInPlace_10485kiBx4-4     13.76n ± 0%   18.67n ± 6%  +35.68% (p=0.000 n=50)
CaptureMethods/NextPayloadZeroCopy_10485kiBx4-4    12.89n ± 0%   11.16n ± 0%  -13.42% (n=50)
CaptureMethods/NextIPPacket_10485kiBx4-4           48.20n ± 0%   47.38n ± 0%   -1.70% (p=0.000 n=50)
CaptureMethods/NextIPPacketInPlace_10485kiBx4-4    19.24n ± 3%   19.57n ± 0%        ~ (p=0.422 n=50)
CaptureMethods/NextIPPacketZeroCopy_10485kiBx4-4   12.84n ± 0%   11.19n ± 0%  -12.85% (n=50)
CaptureMethods/NextPacketFn_10485kiBx4-4           13.40n ± 0%   12.86n ± 0%   -4.03% (n=50)
CaptureMethods/NextPacket_10kiBx512-4              138.4n ± 3%   132.4n ± 2%        ~ (p=0.361 n=50)
CaptureMethods/NextPacketInPlace_10kiBx512-4       155.6n ± 6%   165.1n ± 2%   +6.14% (p=0.010 n=50)
CaptureMethods/NextPayload_10kiBx512-4             149.4n ± 7%   150.7n ± 2%   +0.87% (p=0.016 n=50)
CaptureMethods/NextPayloadInPlace_10kiBx512-4      173.9n ± 5%   165.2n ± 3%        ~ (p=0.622 n=50)
CaptureMethods/NextPayloadZeroCopy_10kiBx512-4     167.4n ± 6%   167.4n ± 4%        ~ (p=0.051 n=50)
CaptureMethods/NextIPPacket_10kiBx512-4            151.2n ± 3%   151.4n ± 4%        ~ (p=0.136 n=50)
CaptureMethods/NextIPPacketInPlace_10kiBx512-4     157.9n ± 5%   162.4n ± 2%   +2.82% (p=0.034 n=50)
CaptureMethods/NextIPPacketZeroCopy_10kiBx512-4    163.1n ± 6%   171.5n ± 3%   +5.12% (p=0.003 n=50)
CaptureMethods/NextPacketFn_10kiBx512-4            168.6n ± 4%   167.1n ± 4%        ~ (p=0.480 n=50)
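
The slowdown in the NextPayloadInPlace_10485kiBx4 rows (+35% to +53%) follows from the fix listed above: the method no longer takes the zero-copy path. The sketch below shows the general difference between the two behaviors, assuming the in-place variant is meant to copy into a caller-supplied buffer. The `ring` type and both method names are hypothetical stand-ins, not the API of this repository.

```go
package main

import "fmt"

// ring is a hypothetical stand-in for a memory-mapped ring buffer block.
type ring struct {
	data []byte
}

// payloadZeroCopy returns a slice that aliases the ring's memory. Nothing is
// copied, so the result is only valid until the underlying block is reused.
func (r *ring) payloadZeroCopy(off, n int) []byte {
	return r.data[off : off+n : off+n]
}

// payloadInPlace copies the payload into the caller-supplied buffer and
// allocates only if that buffer is too small. The result stays valid after
// the block is reused, at the cost of a copy.
func (r *ring) payloadInPlace(buf []byte, off, n int) []byte {
	if cap(buf) < n {
		buf = make([]byte, n)
	}
	buf = buf[:n]
	copy(buf, r.data[off:off+n])
	return buf
}

func main() {
	r := &ring{data: []byte("hello, ring")}

	zc := r.payloadZeroCopy(0, 5)
	ip := r.payloadInPlace(make([]byte, 0, 16), 0, 5)

	// Simulate the block being overwritten with new packet data.
	copy(r.data, "XXXXX")

	fmt.Printf("%s %s\n", zc, ip) // prints "XXXXX hello"
}
```

The extra copy is the price of a buffer that stays valid independently of the ring, which is consistent with the in-place variant landing between the copying and zero-copy methods in the numbers above.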

Closes #71

@fako1024 fako1024 added the bug, enhancement, and performance labels Sep 18, 2023
@fako1024 fako1024 requested a review from els0r September 18, 2023 07:52
@fako1024 fako1024 self-assigned this Sep 18, 2023
@fako1024 fako1024 linked an issue Sep 18, 2023 that may be closed by this pull request
4 tasks
@fako1024 fako1024 merged commit 92124d4 into main Sep 26, 2023
4 checks passed
@fako1024 fako1024 deleted the 71-improve-afring-throughput-by-minimizing-individual-unsafe-casts branch September 26, 2023 06:35
Development

Successfully merging this pull request may close these issues.

Improve afring throughput by minimizing individual unsafe casts
2 participants