RDMA: Support user keepalive command #916

Merged
merged 1 commit into from
Aug 21, 2024


pizhenwei
Contributor

@pizhenwei pizhenwei commented Aug 15, 2024

If the client process crashes for any reason or exits normally, the client-side kernel tries to disconnect the RDMA QPs. The server-side kernel then receives the CM packets, and valkey-server handles the CM disconnected event and closes the connection.

However, the RDMA transport layer has no keepalive mechanism. If the client-side kernel itself crashes, the server side is not notified. To avoid this, valkey-server periodically sends a keepalive command to detect dead QPs.
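
The periodic check described above can be sketched roughly as follows. This is only an illustration of the idea, not the code in src/rdma.c: the `conn_state` struct, the function names, and the millisecond interval parameter are all hypothetical, and the real implementation may track state differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-connection state; not the struct used in src/rdma.c. */
typedef struct {
    uint64_t last_keepalive_ms; /* time the last keepalive was posted */
    bool dead;                  /* set once a keepalive completes with an error */
} conn_state;

/* Called from a periodic timer. Returns true, and records the time,
 * when a new keepalive work request should be posted on this connection. */
bool keepalive_tick(conn_state *c, uint64_t now_ms, uint64_t interval_ms) {
    assert(interval_ms > 0);
    if (c->dead) return false; /* nothing to probe on a dead connection */
    if (now_ms - c->last_keepalive_ms < interval_ms) return false;
    c->last_keepalive_ms = now_ms;
    return true;
}

/* Called when the keepalive work request completes.
 * A status of 0 means success; anything else marks the QP as dead. */
void keepalive_complete(conn_state *c, int wc_status) {
    if (wc_status != 0) c->dead = true;
}
```

In this sketch the caller would post the keepalive whenever `keepalive_tick()` returns true and close the connection once `dead` is set. The point is that a peer whose kernel has crashed never acknowledges the request, so the completion eventually arrives with an error status instead of never arriving at all.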

An example on mlx-cx5:

 # RDMA: CQ handle error status: transport retry counter exceeded[0xc], opcode : 0x0
 # RDMA: CQ handle error status: transport retry counter exceeded[0xc], opcode : 0x0
 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0
 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0
 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0
 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0
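
For readers decoding the log above: the numbers in brackets are work completion status codes. As far as I know, `0xc` and `0x5` correspond to `IBV_WC_RETRY_EXC_ERR` and `IBV_WC_WR_FLUSH_ERR` in libibverbs, and the strings match what `ibv_wc_status_str()` returns. The sketch below mirrors just those two values locally so it compiles without libibverbs; the names `wc_status_name` and `wc_status_is_root_cause` are made up for illustration.

```c
#include <stddef.h>

/* Local mirror of two enum ibv_wc_status values (assumed to match libibverbs). */
enum { WC_WR_FLUSH_ERR = 0x5, WC_RETRY_EXC_ERR = 0xc };

/* Map a status code from the log to its description. */
const char *wc_status_name(int status) {
    switch (status) {
    case WC_WR_FLUSH_ERR: return "Work Request Flushed Error";
    case WC_RETRY_EXC_ERR: return "transport retry counter exceeded";
    default: return "other";
    }
}

/* Flush errors are follow-ups reported for requests that were still
 * outstanding when the QP entered the error state, so only a non-zero,
 * non-flush status is treated as the root cause here. */
int wc_status_is_root_cause(int status) {
    return status != 0 && status != WC_WR_FLUSH_ERR;
}
```

Read this way, the "transport retry counter exceeded" lines are the actual detection of the dead peer, and the "Work Request Flushed Error" lines are the remaining requests being flushed afterwards.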

@pizhenwei pizhenwei force-pushed the rdma-keepalive branch 2 times, most recently from 9853289 to 36c9418 on August 15, 2024 06:36

codecov bot commented Aug 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.37%. Comparing base (4d284da) to head (cd0032e).
Report is 52 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #916      +/-   ##
============================================
+ Coverage     70.16%   70.37%   +0.20%     
============================================
  Files           112      112              
  Lines         61489    61506      +17     
============================================
+ Hits          43145    43282     +137     
+ Misses        18344    18224     -120     

see 17 files with indirect coverage changes

@pizhenwei
Contributor Author

Hi @zuiderkwast
Could you please take a look at this PR?

Contributor

@zuiderkwast zuiderkwast left a comment

Looks fine to me.

We already released Valkey 8.0.0 rc1 and don't want to include any new features. Is it OK to release this in the next minor version or is this important to have in the initial release of RDMA so we should try to include it in 8.0.0?

@pizhenwei
Contributor Author

> Looks fine to me.
>
> We already released Valkey 8.0.0 rc1 and don't want to include any new features. Is it OK to release this in the next minor version or is this important to have in the initial release of RDMA so we should try to include it in 8.0.0?

It's not an urgent feature, please release it in the next version. Thanks!

If the client process crashes for any reason or exits normally, the
client-side kernel tries to disconnect the RDMA QPs. The server-side
kernel then receives the CM packets, and valkey-server handles the CM
disconnected event and closes the connection successfully.

However, the RDMA transport layer has no keepalive mechanism. If the
client-side kernel itself crashes, the server side is not notified. To
avoid this, valkey-server periodically sends a keepalive command to
detect dead QPs.

An example on mlx-cx5: crash the client-side kernel, and the server
side then handles keepalive command errors like:
 # RDMA: CQ handle error status: transport retry counter exceeded[0xc], opcode : 0x0
 # RDMA: CQ handle error status: transport retry counter exceeded[0xc], opcode : 0x0
 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0
 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0
 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0
 # RDMA: CQ handle error status: Work Request Flushed Error[0x5], opcode : 0x0

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
@zuiderkwast zuiderkwast merged commit 2673320 into valkey-io:unstable Aug 21, 2024
47 checks passed
mapleFU pushed a commit to mapleFU/valkey that referenced this pull request Aug 21, 2024
Signed-off-by: mwish <maplewish117@gmail.com>
mapleFU pushed a commit to mapleFU/valkey that referenced this pull request Aug 22, 2024
Signed-off-by: mwish <maplewish117@gmail.com>
madolson pushed a commit that referenced this pull request Sep 2, 2024
@madolson madolson added the release-notes This issue should get a line item in the release notes label Sep 2, 2024
madolson pushed a commit that referenced this pull request Sep 3, 2024
@pizhenwei pizhenwei deleted the rdma-keepalive branch September 26, 2024 01:33