-
Notifications
You must be signed in to change notification settings - Fork 310
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Gateway manager retry kernel updates #1256
Gateway manager retry kernel updates #1256
Conversation
The GatewayMappingKernelManager keeps an internal record of all the remote kernels, and periodically syncs it with the Gateway server. When it finds that a kernel it previously knew about is no longer in the Gateway server's list of kernels, it has to decide how to reconcile that. Previously, it was assuming that the kernel was probably culled by the Gateway server, and thus removed it from its internal records. However, it is conceivable that the list from the upstream Gateway server might have been incomplete due to any combination of bugs, race conditions, or transient network connectivity issues, etc. If one of those such scenarios occurred, then the previous logic would have declared the kernel as lost. This change makes the GatewayMappingKernelManager more resilient to such issues by double checking whether or not the kernel was actually removed in the upstream Gateway server. It does this by attempting to update the GatewayKernelManager instance's model before deciding that the kernel has been culled.
This change extends the test_gateway.py suite to simulate kernels being transiently missing from kernel list responses. The new test fails without the update to the GatewayMappingKernelManager to double check if kernels have been culled, and passes with it.
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## main #1256 +/- ##
==========================================
+ Coverage 80.39% 80.41% +0.01%
==========================================
Files 68 68
Lines 8263 8270 +7
Branches 1600 1601 +1
==========================================
+ Hits 6643 6650 +7
+ Misses 1198 1197 -1
- Partials 422 423 +1
... and 3 files with indirect coverage changes Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report in Codecov by Sentry. |
I don't think the @codecov report is correctly accounting for the included test change. Edit: It changed after I posted this comment. I'm still not 100% sure that it is right, but it does seem to be more believable now. |
@kevin-bates FYI, this issue was something I hit while (using my hacky prototype version of) multiplexing local and remote kernels like described in #1187 This might seem like a questionable benefit when using a single gateway server, but if/when we get support for multiplexing between multiple ones I think this will be an important reliability detail. |
Hi @ojarjur,
I think this change makes perfect sense - thank you! I really apologize we haven't made progress on Kernel Servers! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @ojarjur!
There's nothing to apologize about. Thanks for all of your guidance and feedback! |
This change updates the GatewayMappingKernelManager class to double check when it believes kernels may have been culled in the Gateway server.
Previously, if a known kernel was missing from the list of kernels returned by the Gateway server, then the GatewayMappingKernelManager class assumed it was culled. With this change, it will first attempt to fetch
the kernel from the Gateway server and only treat it as culled if this second check also fails.
This increases reliability in the scenario of a Gateway server erroneously returning a partial list of kernels due to things like bugs around race conditions for newly created kernels, or transient network errors when combining results from multiple backends.