RDMA: Support user keepalive command #916

pizhenwei · 2024-08-15T06:24:36Z

If the client crashes for any reason or exits normally, the client-side kernel will try to disconnect its RDMA QPs. The server-side kernel then receives the CM packets, and valkey-server handles the CM disconnected event and closes the connection.

However, the RDMA transport layer lacks a keepalive mechanism. If the client-side kernel itself crashes, the server side is not notified. To avoid this issue, the valkey server periodically sends a Keepalive command to detect dead QPs.
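As an illustration of the periodic probing described above, here is a minimal, self-contained sketch. It is not the code from src/rdma.c: the `demo_conn` struct, the `demo_send_fn` hook, the two stub hooks, and the interval parameter are all hypothetical names introduced for this example. In a real RDMA transport the probe would be posted as a work request, and a dead peer would typically surface later as an error completion on the CQ rather than as an immediate return value.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-connection state; not the struct used in src/rdma.c. */
typedef struct {
    uint64_t last_keepalive_ms; /* time the last probe was issued */
    bool dead;                  /* set once a probe is known to have failed */
} demo_conn;

/* Hypothetical send hook: returns false if the probe failed. */
typedef bool (*demo_send_fn)(demo_conn *c);

/* Stub hooks for trying the sweep without any RDMA hardware. */
bool demo_send_ok(demo_conn *c) { (void)c; return true; }
bool demo_send_fail(demo_conn *c) { (void)c; return false; }

/* Probe every live connection whose last probe is at least interval_ms old.
 * Returns the number of probes issued in this sweep. */
size_t demo_keepalive_sweep(demo_conn *conns, size_t n, uint64_t now_ms,
                            uint64_t interval_ms, demo_send_fn send) {
    size_t probed = 0;
    for (size_t i = 0; i < n; i++) {
        demo_conn *c = &conns[i];
        if (c->dead) continue;                                   /* already detected */
        if (now_ms - c->last_keepalive_ms < interval_ms) continue; /* not due yet */
        c->last_keepalive_ms = now_ms;
        probed++;
        if (!send(c)) c->dead = true; /* mark for cleanup by the caller */
    }
    return probed;
}
```

A caller would run the sweep from a periodic timer and close any connection marked `dead`.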

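An example on mlx-cx5: after the client-side kernel crashes, the server side handles the keepalive command error like this: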

 # RDMA: CQ handle error status: transport retry counter exceeded[0xc], opcode : 0x0
 # RDMA: CQ handle error status: transport retry counter exceeded[0xc], opcode : 0x0
 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0
 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0
 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0
 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0
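To help read the log above, the following sketch maps the two numeric status codes to their meaning. The values 0x5 and 0xc correspond to `IBV_WC_WR_FLUSH_ERR` and `IBV_WC_RETRY_EXC_ERR` in libibverbs' `enum ibv_wc_status`; the `demo_` names are hypothetical and are redefined locally so the example builds without libibverbs. One plausible reading of the log is that the retry-exceeded completions are the actual detection of the dead peer, and the flush errors are the remaining work requests being flushed once the QP has entered the error state.

```c
#include <stdbool.h>

/* Local copies of two libibverbs completion status values (assumed names). */
enum demo_wc_status {
    DEMO_WC_WR_FLUSH_ERR = 0x5,  /* IBV_WC_WR_FLUSH_ERR */
    DEMO_WC_RETRY_EXC_ERR = 0xc, /* IBV_WC_RETRY_EXC_ERR */
};

/* Return the human-readable text seen in the log for a status code. */
const char *demo_wc_status_str(int status) {
    switch (status) {
    case DEMO_WC_WR_FLUSH_ERR: return "Work Request Flushed Error";
    case DEMO_WC_RETRY_EXC_ERR: return "transport retry counter exceeded";
    default: return "unknown";
    }
}

/* A flush error is a consequence of the QP already being in the error
 * state, so it is not the first sign of a dead peer. */
bool demo_is_followup_flush(int status) {
    return status == DEMO_WC_WR_FLUSH_ERR;
}
```

With real hardware, `ibv_wc_status_str()` from libibverbs provides these strings directly.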

codecov · 2024-08-15T06:50:40Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.37%. Comparing base (4d284da) to head (cd0032e).
Report is 52 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable     #916      +/-   ##
============================================
+ Coverage     70.16%   70.37%   +0.20%     
============================================
  Files           112      112              
  Lines         61489    61506      +17     
============================================
+ Hits          43145    43282     +137     
+ Misses        18344    18224     -120

see 17 files with indirect coverage changes

pizhenwei · 2024-08-20T10:37:54Z

Hi @zuiderkwast
Could you please take a look at this PR?

zuiderkwast

Looks fine to me.

We already released Valkey 8.0.0 rc1 and don't want to include any new features. Is it OK to release this in the next minor version or is this important to have in the initial release of RDMA so we should try to include it in 8.0.0?

src/rdma.c

pizhenwei · 2024-08-21T01:08:59Z

> Looks fine to me.
>
> We already released Valkey 8.0.0 rc1 and don't want to include any new features. Is it OK to release this in the next minor version or is this important to have in the initial release of RDMA so we should try to include it in 8.0.0?

It's not an urgent feature, please release it in the next version. Thanks!

If the client crashes for any reason or exits normally, the client-side kernel will try to disconnect its RDMA QPs. The server-side kernel then receives the CM packets, and valkey-server handles the CM disconnected event and closes the connection successfully.

However, the RDMA transport layer lacks a keepalive mechanism. If the client-side kernel itself crashes, the server side is not notified. To avoid this issue, the valkey server periodically sends a Keepalive command to detect dead QPs.

An example on mlx-cx5: make the client-side kernel crash, then the server side handles the keepalive command error like this:

 # RDMA: CQ handle error status: transport retry counter exceeded[0xc], opcode : 0x0
 # RDMA: CQ handle error status: transport retry counter exceeded[0xc], opcode : 0x0
 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0
 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0
 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0
 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>

If the client crashes for any reason or exits normally, the client-side kernel will try to disconnect its RDMA QPs. The server-side kernel then receives the CM packets, and valkey-server handles the CM disconnected event and closes the connection.

However, the RDMA transport layer lacks a keepalive mechanism. If the client-side kernel itself crashes, the server side is not notified. To avoid this issue, the valkey server periodically sends a Keepalive command to detect dead QPs.

An example on mlx-cx5:

```
# RDMA: CQ handle error status: transport retry counter exceeded[0xc], opcode : 0x0
# RDMA: CQ handle error status: transport retry counter exceeded[0xc], opcode : 0x0
# RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0
# RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0
# RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0
# RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0
```

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Signed-off-by: mwish <maplewish117@gmail.com>

pizhenwei force-pushed the rdma-keepalive branch 2 times, most recently from 9853289 to 36c9418 · August 15, 2024 06:36

zuiderkwast reviewed Aug 20, 2024

pizhenwei force-pushed the rdma-keepalive branch from 36c9418 to cd0032e · August 21, 2024 01:33

zuiderkwast approved these changes Aug 21, 2024

zuiderkwast merged commit 2673320 into valkey-io:unstable Aug 21, 2024
47 checks passed

madolson added the release-notes label ("This issue should get a line item in the release notes") Sep 2, 2024

pizhenwei deleted the rdma-keepalive branch September 26, 2024 01:33
