fix: kill connections correctly in Net WorkThread #2862
Conversation
Walkthrough

The recent changes significantly enhance how client connections are killed in the network worker thread, introducing a phased connection-closure process.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant WorkerThread
    participant NetConn
    WorkerThread->>NetConn: Begin connection close process
    WorkerThread->>NetConn: Check reference count
    alt Reference count is one
        WorkerThread->>WorkerThread: Move to ready_to_close_conns_
        WorkerThread->>NetConn: Close connection
    else Reference count is greater
        WorkerThread->>WorkerThread: Keep in wait_to_close_conns_
    end
```
Actionable comments posted: 0
Outside diff range, codebase verification and nitpick comments (3)
src/net/src/worker_thread.h (2)
57-58: Consider adding synchronization for `wait_to_close_conns_` and `ready_to_close_conns_`.

The current code does not show any locks directly associated with accesses to these vectors. Ensure that these vectors are accessed within a lock or other synchronization mechanism to prevent potential thread safety issues.
- Verify that accesses to `wait_to_close_conns_` and `ready_to_close_conns_` are properly synchronized.

Analysis chain
Consider thread safety for `wait_to_close_conns_` and `ready_to_close_conns_`.

Ensure that access to these vectors is properly synchronized, as they may be accessed from multiple threads.
Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Verify access to `wait_to_close_conns_` and `ready_to_close_conns_` for thread safety.
# Test: Search for the vectors' usage. Expect: Access within a lock or other synchronization mechanism.
rg --type cpp $'wait_to_close_conns_|ready_to_close_conns_'
```

Length of output: 1045
Script:
```bash
#!/bin/bash
# Search for mutex or lock usage in worker_thread.cc to verify synchronization around vector access.
rg --type cpp 'std::mutex|std::lock_guard|std::unique_lock' -A 5 src/net/src/worker_thread.cc
```

Length of output: 1437
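To make the suggestion concrete, here is a minimal sketch of guarding the two queues with a mutex; only the vector names come from this PR, while the lock name `killer_mutex_` and the method shown are hypothetical:

```cpp
#include <memory>
#include <mutex>
#include <vector>

class NetConn;

class WorkerThread {
 public:
  // Park a connection for phased closure; all access to the two queues
  // below is serialized through killer_mutex_.
  void MoveConnToWaitToClose(const std::shared_ptr<NetConn>& conn) {
    std::lock_guard<std::mutex> guard(killer_mutex_);
    wait_to_close_conns_.push_back(conn);
  }

 private:
  std::mutex killer_mutex_;  // hypothetical name for the guarding lock
  std::vector<std::shared_ptr<NetConn>> wait_to_close_conns_;
  std::vector<std::shared_ptr<NetConn>> ready_to_close_conns_;
};
```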
51-51: Potential Thread Safety Issue with `ClearConnsRefAndOtherInfo`

The function `ClearConnsRefAndOtherInfo` in `worker_thread.cc` does not appear to be consistently protected by locks or synchronization mechanisms. This could lead to race conditions if accessed by multiple threads simultaneously. Consider reviewing the function's usage and ensuring it is called within a synchronized context to maintain thread safety.
- File: `src/net/src/worker_thread.cc`
- Function: `ClearConnsRefAndOtherInfo`
Analysis chain
Ensure thread safety for `ClearConnsRefAndOtherInfo`.

The new method `ClearConnsRefAndOtherInfo` should ensure thread safety, especially if it modifies shared resources.

Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Verify usage of `ClearConnsRefAndOtherInfo` for thread safety.
# Test: Search for the function usage. Expect: Usage within a lock or other synchronization mechanism.
rg --type cpp $'ClearConnsRefAndOtherInfo'
```

Length of output: 451
Script:
```bash
#!/bin/bash
# Description: Check the context of `ClearConnsRefAndOtherInfo` calls for synchronization mechanisms.
rg --type cpp -A 5 'ClearConnsRefAndOtherInfo' src/net/src/worker_thread.cc
```

Length of output: 964
Script:
```bash
#!/bin/bash
# Description: Search for mutexes or locks in `worker_thread.cc` to check for thread safety.
rg --type cpp 'std::mutex|std::lock_guard|std::unique_lock' src/net/src/worker_thread.cc
```

Length of output: 394
Script:
```bash
#!/bin/bash
# Description: Search for functions or blocks using `std::lock_guard` in `worker_thread.cc`.
rg --type cpp -B 5 'std::lock_guard' src/net/src/worker_thread.cc
```

Length of output: 1335
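Extending the sketch above, the synchronized usage the reviewer wants confirmed would look roughly like this; the caller, lock name, and shape of the `conns_` map are assumptions, and only `ClearConnsRefAndOtherInfo` itself comes from the PR:

```cpp
// Hypothetical call site: clear a conn's external references while holding
// the same lock that guards conns_, so no other thread can observe a
// half-cleared connection.
void WorkerThread::KillAllConns() {
  std::lock_guard<std::mutex> guard(killer_mutex_);
  for (auto& [fd, conn] : conns_) {
    ClearConnsRefAndOtherInfo(conn);       // drop blpop/watchkeys shared_ptrs
    wait_to_close_conns_.push_back(conn);  // park for phased closure
  }
  conns_.clear();
}
```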
src/net/src/worker_thread.cc (1)
273-284: Clarify and refine comments for the phased connection closure process.

The comments describing the phased approach for closing connections are detailed but could be refined for clarity and conciseness.
```diff
- /*
-  * How Do we kill a conn correct:
-  * stage 1: stop accept new request(also give up the write back of shooting request's response)
-  *   1.1 remove the fd from epoll and erase it from conns_ to ensure no more request will submit to threadpool
-  *   1.2 add to-close-conn to wait_to_close_conns_
-  * stage 2: ensure there's no other shared_ptr of this conn in pika
-  *   2.1 in async task that exec by TheadPool, a shared_ptr of conn will hold and my case a pipe event to tell the epoll
-  *       to back the response, we must ensure this notification is done before we really close fd(linux will reuse the fd to accept new conn)
-  *   2.2 we must clear all other shared_ptr of this to-close-conn, like the map of blpop/brpop and the map of watchkeys
-  *   2.3 for those to-close-conns that ref count drop to 1, we add them to ready-to-close-conns_
-  * stage 3: after an epoll cycle(let it handle the already-invalid-writeback-notification ), we can safely close the fds of ready_to_close_conns_
-  */
+ /*
+  * Phased Connection Closure:
+  * Stage 1: Stop accepting new requests and remove the fd from epoll.
+  *   1.1 Remove from conns_ and add to wait_to_close_conns_.
+  * Stage 2: Ensure no other shared_ptr exists.
+  *   2.1 Wait for async tasks to complete and clear shared_ptrs.
+  *   2.2 Move conns with ref count 1 to ready_to_close_conns_.
+  * Stage 3: Safely close fds after an epoll cycle.
+  */
```
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- src/net/src/worker_thread.cc (2 hunks)
- src/net/src/worker_thread.h (1 hunks)
Additional comments not posted (3)
src/net/src/worker_thread.cc (3)
286-290: Ensure proper handling of file descriptor closure.

When closing file descriptors in `ready_to_close_conns_`, ensure that no other operations are pending on these descriptors.

Verification successful
Proper Handling of File Descriptor Closure Verified

The code ensures that file descriptors in `ready_to_close_conns_` are closed only after confirming no pending operations, as outlined in the multi-stage process described in the comments. The reference count check and epoll cycle handling further support this. No issues were found.

Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Verify proper handling of file descriptor closure.
# Test: Search for operations on `ready_to_close_conns_`. Expect: No operations on closed fds.
rg --type cpp $'ready_to_close_conns_'
```

Length of output: 516
Script:
```bash
#!/bin/bash
# Let's gather more context around the usage of `ready_to_close_conns_` to ensure no operations are performed on closed fds.
rg --type cpp -A 10 -B 10 'ready_to_close_conns_'
```

Length of output: 3726
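The closure pattern being verified can be sketched as the following fragment (assumed shape; `fd()` is a hypothetical accessor on `NetConn`):

```cpp
// Stage 3 of the phased closure: these conns have sat through a full
// epoll_wait cycle, so any stale write-back notifications for their fds have
// already been drained and closing is now safe.
for (auto& conn : ready_to_close_conns_) {
  close(conn->fd());  // requires <unistd.h>; fd() is a hypothetical accessor
}
ready_to_close_conns_.clear();
```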
291-298: Verify reference count logic for `wait_to_close_conns_`.

Ensure that the logic checking `use_count()` accurately reflects the intended behavior and that connections are safely moved to `ready_to_close_conns_`.

Verification successful
Reference count logic for `wait_to_close_conns_` is correctly implemented.

The use of `use_count()` to check whether the reference count is 1 ensures that connections are safely moved to `ready_to_close_conns_` when they are no longer in use elsewhere. The logic aligns with standard practices for managing shared pointers. `wait_to_close_conns_` is a `std::vector<std::shared_ptr<NetConn>>`, confirming the appropriateness of `use_count()`.

Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Verify reference count logic for `wait_to_close_conns_`.
# Test: Search for `use_count()` logic. Expect: Correct handling of shared_ptr reference counts.
rg --type cpp $'wait_to_close_conns_.*use_count'
```

Length of output: 49
Script:
```bash
#!/bin/bash
# Search for the declaration of `wait_to_close_conns_` to understand its type and usage.
rg --type cpp 'wait_to_close_conns_'

# Search for all occurrences of `use_count()` to see how it's used in the codebase.
rg --type cpp 'use_count()'
```

Length of output: 635
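The verified check can be pictured as this minimal fragment (assumed shape, not the verbatim Pika code):

```cpp
// Move conns whose only remaining shared_ptr lives in wait_to_close_conns_;
// anything still referenced elsewhere (e.g. by an in-flight async task)
// stays for the next cron cycle.
for (auto it = wait_to_close_conns_.begin(); it != wait_to_close_conns_.end();) {
  if (it->use_count() == 1) {
    ready_to_close_conns_.push_back(std::move(*it));
    it = wait_to_close_conns_.erase(it);
  } else {
    ++it;
  }
}
```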
313-319: Ensure `ClearConnsRefAndOtherInfo` handles all necessary cleanup.

Verify that `ClearConnsRefAndOtherInfo` effectively clears all relevant references and information associated with a connection.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- tests/integration/replication_test.go (1 hunks)
- tests/integration/start_master_and_slave.sh (7 hunks)
Additional comments not posted (3)
tests/integration/start_master_and_slave.sh (1)
17-18: Increase in timeout value.

The timeout value has been increased from 60 to 500 in multiple configuration files. This change is consistent and likely aims to accommodate longer operations or improve system stability.
Also applies to: 28-29, 39-40, 52-53, 66-67, 79-80, 92-93
tests/integration/replication_test.go (2)
637-639: Improved client isolation in goroutines.

Creating a new Redis client instance within each goroutine enhances isolation and reduces the risk of resource contention. This change improves the robustness of concurrent operations.
Also applies to: 643-645
652-654: Reconnect client to handle potential timeout issues.

The addition of a sleep and reconnection to the `clientSlave` helps mitigate timeout-related issues, ensuring the test's reliability.
Regarding the description of moving a conn whose reference count has dropped to 1 from wait_to_close to ready_to_close: my understanding is that the corresponding fd is already removed from epoll before the conn is appended to `wait_to_close_conns_`. So even if the asynchronous thread has already written back the response and notified the WorkThread through the signalfd, the WorkThread will no longer process this conn, right?
The asynchronous notification carries the fd with it. After receiving the notification, the WorkThread operates on epoll and enables writable-event monitoring for that fd. If the fd has not been reused, no harm is done: the `epoll_ctl` call simply has no effect, because the fd is no longer registered there. But if a new connection happens to come in, reuses this fd, and is registered back on this same epoll, the stale notification will enable writable-event monitoring for the new client and trigger a write-back.
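A minimal sketch of the hazard described above (illustrative names, not Pika's actual API): the notification carries only a raw fd, so by the time the worker re-arms the write event, that fd may already belong to a brand-new connection.

```cpp
#include <sys/epoll.h>

// Called when the worker thread drains a write-back notification. 'fd' is
// whatever the async task recorded before the old conn was killed.
void OnWritebackNotify(int epfd, int fd) {
  struct epoll_event ev {};
  ev.events = EPOLLIN | EPOLLOUT;
  ev.data.fd = fd;
  // If the old conn's fd was closed and then reused by a new client that was
  // registered on this same epoll, this call silently arms EPOLLOUT on the
  // *new* connection and triggers a bogus write-back of its response_ buffer.
  epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}
```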
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- src/net/src/worker_thread.cc (2 hunks)
Files skipped from review as they are similar to previous changes (1)
- src/net/src/worker_thread.cc
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- src/net/src/worker_thread.cc (2 hunks)
Files skipped from review as they are similar to previous changes (1)
- src/net/src/worker_thread.cc
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- tests/integration/rsync_dynamic_reconfig.go (2 hunks)
Additional comments not posted (3)
tests/integration/rsync_dynamic_reconfig.go (3)
38-41: LGTM!

The changes in the `RefillMaster` function's logic are consistent with the reduced data volume for testing.
139-139: LGTM!

The test case changes to use a reduced data volume of 2 MB are consistent with the `RefillMaster` function modifications.
38-41: Verify the impact of reduced key/value sizes and thread count.

The reduction in key and value sizes from 1024 to 64 bytes, and in the thread count from 10 to 5, may affect performance and concurrency. Ensure these changes align with the intended testing goals and do not adversely impact the test's effectiveness.
Verification successful

Verify the impact of reduced key/value sizes and thread count.

The changes to the `RefillMaster` function, reducing key/value sizes and thread count, are localized to the `rsync_dynamic_reconfig.go` file. Ensure these modifications align with the test's objectives and do not compromise the intended performance or concurrency simulation. If they do, the changes are acceptable.

Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Verify the impact of reduced key/value sizes and thread count in `RefillMaster`.
# Test: Check for performance or concurrency issues in the test suite.
# Expected: No significant performance degradation or concurrency issues.
# Note: This script assumes a performance testing framework is available.
# Here, we just list the occurrences for manual review.
rg --type go 'RefillMaster'
```

Length of output: 335
* kill conns correct in WorkThread
This PR fixes
An incorrect process when executing `client kill ip:port`, `client kill all`, `client kill normal`, and similar connection-killing commands.

Background

The previous code had significant issues in this logic, which could lead to fd leaks. In PR #2858, some fixes were made to address this issue, ensuring that fds would no longer leak. However, that fix was not comprehensive enough, as it didn't account for Linux reusing fds to accept new connections.
Code details

Every time Pika creates a new connection with a client, a new `NetConn` object is created, containing the fd and other information; the fd is also registered on the `WorkerThread`'s epoll, and the `WorkerThread` itself is an epoll thread that loops on `epoll_wait`. When Pika's network thread (`WorkerThread`) receives a packet and parses out a task, it submits the task to a thread pool for asynchronous execution, passing along a `shared_ptr` to the current `NetConn`, and disables the connection's (fd's) read/write event listening on epoll. Once the thread pool finishes executing the command asynchronously, it writes the response into the `NetConn`'s `response_` and notifies epoll via a pipe to enable write-event listening for that connection (fd). Since the send buffer is empty, the next `epoll_wait` immediately triggers a write event on that fd, causing the `NetConn`'s `response_` content to be written to the socket. In other words, Pika triggers the response write-back by notifying epoll through the pipe to enable write-event listening on the corresponding fd; this mechanism signals the completion of the asynchronous task.

Issue

When Pika needs to close certain connections, directly closing the fd can lead to a situation where the fd has been removed from epoll and closed, but the asynchronous thread pool is still executing the last task submitted for that connection (the thread pool still holds a `shared_ptr<NetConn>` associated with the already-closed fd). After completing the task, the pool will still notify epoll via the pipe to enable the write event for that fd (which has already been closed and is no longer on epoll), so the notification is merely ineffective.

However, if Linux happens to immediately reuse that fd for a new connection, and the new connection is registered on the epoll of this same `WorkerThread`, the previously mentioned invalid notification becomes an illegal one. The fd now corresponds to a new client, and the stale notification from the old connection will cause epoll to write the new client's `NetConn`'s `response_` content back to the fd. The problem is that the thread pool may at that moment have finished executing a request for this `NetConn` and be writing to `response_`, which is a data race and can cause all kinds of unexpected outcomes.

This PR provides a more complete, phased process for closing connections
New additions: `WorkThread::wait_to_close_conns_` and `WorkThread::ready_to_close_conns_`.

1. The conn to be closed is removed from `conns_` and placed in `wait_to_close_conns_`, and any `shared_ptr` references to it on non-standard paths are cleared (blocking commands such as `blpop`, as well as the `watchKeys` feature, hold a `shared_ptr<NetConn>`).
2. In the `cronTask` (i.e., after one `epoll_wait` loop), `wait_to_close_conns_` is checked; any conn whose reference count has dropped to 1 is moved to `ready_to_close_conns_`.
3. In the `cronTask` (i.e., after another `epoll_wait` loop), all connections in `ready_to_close_conns_` are closed.

Why wait for another epoll loop after the reference count drops to 1 before closing the fd? The reason is the asynchronous write-back notification mentioned earlier. Even when the conn's reference count has dropped to 1, the thread pool may have already issued a write-back notification (through the pipe and queue, telling epoll to enable write-event listening on that fd for the subsequent response write-back). `epoll_wait` must run one more cycle to consume these now-invalid notifications before the fd can be closed, preventing the problems that fd reuse combined with stale write-back notifications could otherwise cause. Stage 1 of this process is sketched below.
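A condensed sketch of stage 1 (assumed shape based on the description above; the epoll-removal call and exact member names are illustrative, while `conns_`, `wait_to_close_conns_`, and `ClearConnsRefAndOtherInfo` come from the PR):

```cpp
// Stage 1: stop accepting new requests on this conn and park it for the
// phased closure. After this, no new task for the conn can reach the
// thread pool; stages 2 and 3 run in later cron cycles.
void WorkerThread::MoveConnToWaitToClose(int fd) {
  net_multiplexer_->NetDelEvent(fd, 0);   // illustrative epoll-removal call
  auto it = conns_.find(fd);
  if (it == conns_.end()) {
    return;
  }
  ClearConnsRefAndOtherInfo(it->second);  // drop blpop/watchKeys shared_ptrs
  wait_to_close_conns_.push_back(it->second);
  conns_.erase(it);
}
```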
Summary by CodeRabbit
New Features
- Updated the `RefillMaster` function and tests, reducing key/value sizes and thread count.

Bug Fixes

Refactor

Tests