Align XDP Queues with RSS Better #4484
base: main
Conversation
Codecov Report: all modified and coverable lines are covered by tests ✅

Coverage Diff (main vs. #4484):
- Coverage: 87.16% → 86.83% (-0.33%)
- Files: 56 (no change)
- Lines: 17354 (no change)
- Hits: 15127 → 15070 (-57)
- Misses: 2227 → 2284 (+57)
src/platform/datapath_raw_xdp_win.c (outdated diff)
    }
    if (QueueIndex == UINT32_MAX) {
        Status = QUIC_STATUS_INVALID_STATE;
        QuicTraceEvent(
@mtfriesen I'm hitting this error in test (definitely over duonic; netperf is failing too, but I don't have logs for that). I'm assuming that means we're trying to bind a queue to a processor that doesn't have an RSS queue? I'm trying to decide how to handle this. Thoughts?
More queues than processors in the indirection table is completely legal. I'd just leave any leftover queues unaffinitized and hope for the best.
I'm actually thinking it might be the opposite. We have more processors than queues. I actually don't know what duonic does here. Do you?
Hmm, ok, I'm not that familiar with this QUIC code. NICs just indicate packets on the processors in the indirection table - the rest are idle.
src/platform/datapath_raw_xdp_win.c (outdated diff)
        }
    }
    if (!Found) {
        CXPLAT_FRE_ASSERT(FALSE); // TODO - What do we do if there is no partition for this processor?
@mtfriesen thoughts here? We allow an app to specify the CPUs they want us to run on. What do we do if there is an RSS queue that doesn't have a CPU being run on?
I think the question you are asking here is the question I answered in the incorrect location above - a queue without an assigned processor is effectively not a queue.
    if (0 == uReturn) {
        uint32_t Flags = XSK_BIND_FLAG_TX;
        Status = Xdp->XdpApi->XskBind(TxXsk, InterfaceIndex, i, Flags);
        if (QUIC_FAILED(Status)) { // No more queues. Break out.
there could be other errors than just queue count exceeded. probably worth logging the error code and current index explicitly.
You want me to just log this one?
this and the "activate" are the only two that might fail in nontrivial and unpredictable ways. it's probably worth logging them all if you have time.
        free(rssTable);
    }
    XSK_NOTIFY_RESULT_FLAGS OutFlags;
    Status = Xdp->XdpApi->XskNotifySocket(TxXsk, XSK_NOTIFY_FLAG_POKE_TX|XSK_NOTIFY_FLAG_WAIT_TX, 1000, &OutFlags);
1000 is a magic number - should be a #define?
Will do.
Cleanup:
    const CXPLAT_PROCESSOR_GROUP_INFO* Group = &CxPlatProcessorGroupInfo[ProcNumber.Group];
    Queues[i] = Group->Offset + (ProcNumber.Number % Group->Count);
why is the modulo necessary? i probably just need to go read more of the QUIC code, but i would have expected all ProcNumber.Numbers to be less than Group->Count.
I think it shouldn't be necessary.
is it for hot add? where QUIC has a fixed set of CPUs per group that might see more CPUs added to the group later?
No, we fix this list at process start.
right, so if the group is fixed at size 1 during process start, and the OS later adds proc 2 to the group, then that's why you have the modulo?
@@ -480,6 +372,9 @@ CxPlatDpRawInterfaceInitialize(
    Interface->OffloadStatus.Transmit.NetworkLayerXsum = Xdp->SkipXsum;
    Interface->Xdp = Xdp;

    uint32_t Processors[256]; // TODO - Use max processor count
unfinished TODO
@@ -521,6 +427,7 @@ CxPlatDpRawInterfaceInitialize(
    for (uint8_t i = 0; i < Interface->QueueCount; i++) {
        XDP_QUEUE* Queue = &Interface->Queues[i];

        Queue->RssProcessor = (uint16_t)Processors[i]; // TODO - Should memory be aligned with this?
as in, should all allocations for this queue be aligned to that processor? ideally yes, for sure.
Exactly, and this is a larger issue too. For all memory allocation that is meant to be used on a particular processor, should we temporarily affinitize the thread, allocate the memory, and then revert the affinitization?
in user mode, there are fancy APIs to allocate from a specific node, even if you're not currently running on that node. in kernel mode, it's more difficult to allocate pool from an arbitrary node, though i think newer OSes have better APIs. one sec.
that said, XDP itself treats the calling thread as the "ideal" thread when it performs its own allocations for queues, so temporarily setting the affinity would probably solve the problem the most universally and with the least fuss.
yeah, PoolExtendedParameterNumaNode is supported for ExAllocatePool3. i'd still just set the thread affinity and be done.
So, if we're all on a single NUMA, how much would this actually matter?
if you have UMA (idk if that's a proper term) then yeah, just do nothing.
For VMs that are currently affinitized to just one of the NUMA nodes, are we sure there's no per-processor memory performance impact? Also, can we be guaranteed that the VM actually knows the NUMA topology of the hardware?
AFAIK the VM can't do anything to optimize that scenario. It's all hidden and controlled by the host/hypervisor.
Description
Updates the XDP datapath layer (on Windows) to try to query the RSS processor before fully creating the queues, so that the real queue may be correctly aligned with RSS.
Testing
CI/CD
Documentation
N/A