-
Notifications
You must be signed in to change notification settings - Fork 653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RDMA: Support user keepalive command #916
Conversation
9853289
to
36c9418
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## unstable #916 +/- ##
============================================
+ Coverage 70.16% 70.37% +0.20%
============================================
Files 112 112
Lines 61489 61506 +17
============================================
+ Hits 43145 43282 +137
+ Misses 18344 18224 -120 |
Hi @zuiderkwast |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine to me.
We already released Valkey 8.0.0 rc1 and don't want to include any new features. Is it OK to release this in the next minor version or is this important to have in the initial release of RDMA so we should try to include it in 8.0.0?
It's not an urgent feature, please release it in the next version. Thanks! |
If the client side crashes by any issue or exits normally, the kernel will try to disconnect RDMA QPs. Then the kernel of server side receives CM packets, valkey-server handles CM disconnected event and close connection successfully. However, there is a lack of keepalive mechanism from RDMA transport layer. Once the kernel of client side crashes, the server side will not be notified. To avoid this issue, valkey server sents Keepaliv command periodically to detect any dead QPs. An example of mlx-cx5, make client side kernel crash, then server side handles keepalive command error like: # RDMA: CQ handle error status: transport retry counter exceeded[0xc], opcode : 0x0 # RDMA: CQ handle error status: transport retry counter exceeded[0xc], opcode : 0x0 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0 Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
36c9418
to
cd0032e
Compare
If the client side crashes by any issue or exits normally, the kernel will try to disconnect RDMA QPs. Then the kernel of server side receives CM packets, valkey-server handles CM disconnected event and close connection. However, there is a lack of keepalive mechanism from RDMA transport layer. Once the kernel of client side crashes, the server side will not be notified. To avoid this issue, valkey server sents Keepaliv command periodically to detect any dead QPs. An example of mlx-cx5: ``` # RDMA: CQ handle error status: transport retry counter exceeded[0xc], opcode : 0x0 # RDMA: CQ handle error status: transport retry counter exceeded[0xc], opcode : 0x0 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0 ``` Signed-off-by: zhenwei pi <pizhenwei@bytedance.com> Signed-off-by: mwish <maplewish117@gmail.com>
If the client side crashes by any issue or exits normally, the kernel will try to disconnect RDMA QPs. Then the kernel of server side receives CM packets, valkey-server handles CM disconnected event and close connection. However, there is a lack of keepalive mechanism from RDMA transport layer. Once the kernel of client side crashes, the server side will not be notified. To avoid this issue, valkey server sents Keepaliv command periodically to detect any dead QPs. An example of mlx-cx5: ``` # RDMA: CQ handle error status: transport retry counter exceeded[0xc], opcode : 0x0 # RDMA: CQ handle error status: transport retry counter exceeded[0xc], opcode : 0x0 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0 ``` Signed-off-by: zhenwei pi <pizhenwei@bytedance.com> Signed-off-by: mwish <maplewish117@gmail.com>
If the client side crashes by any issue or exits normally, the kernel will try to disconnect RDMA QPs. Then the kernel of server side receives CM packets, valkey-server handles CM disconnected event and close connection. However, there is a lack of keepalive mechanism from RDMA transport layer. Once the kernel of client side crashes, the server side will not be notified. To avoid this issue, valkey server sents Keepaliv command periodically to detect any dead QPs. An example of mlx-cx5: ``` # RDMA: CQ handle error status: transport retry counter exceeded[0xc], opcode : 0x0 # RDMA: CQ handle error status: transport retry counter exceeded[0xc], opcode : 0x0 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0 ``` Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
If the client side crashes by any issue or exits normally, the kernel will try to disconnect RDMA QPs. Then the kernel of server side receives CM packets, valkey-server handles CM disconnected event and close connection. However, there is a lack of keepalive mechanism from RDMA transport layer. Once the kernel of client side crashes, the server side will not be notified. To avoid this issue, valkey server sents Keepaliv command periodically to detect any dead QPs. An example of mlx-cx5: ``` # RDMA: CQ handle error status: transport retry counter exceeded[0xc], opcode : 0x0 # RDMA: CQ handle error status: transport retry counter exceeded[0xc], opcode : 0x0 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0 ``` Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
If the client side crashes by any issue or exits normally, the kernel will try to disconnect RDMA QPs. Then the kernel of server side receives CM packets, valkey-server handles CM disconnected event and close connection.
However, there is a lack of keepalive mechanism from RDMA transport layer. Once the kernel of client side crashes, the server side will not be notified. To avoid this issue, valkey server sents Keepaliv command periodically to detect any dead QPs.
An example of mlx-cx5: