ZOOKEEPER-4508: Fix endless-loop in ClientCnxn.SendThread.run when all zk servers down #1847


kezhuw
Member

@kezhuw kezhuw commented Apr 1, 2022

The observable behavior is that the client's watcher never receives the Expired event.
The cause is twofold:

  1. updateLastSendAndHeard is called on every reconnection attempt, so the
    session never times out.
  2. There is no break after the session timeout in ClientCnxn.SendThread.run.
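
To make the two causes concrete, here is a toy model of the reconnect loop. It is only a sketch, not the real ClientCnxn code: the class `ReconnectLoopSketch`, its parameters, and the fake clock are all made up for illustration, and it assumes every connection attempt fails after a fixed delay.

```java
// Toy model of a reconnect loop when every server is down.
// Hypothetical names throughout; this is NOT the real ClientCnxn.SendThread.
final class ReconnectLoopSketch {

    /** Outcome of one simulated run. */
    static final class Result {
        final boolean expired; // did the loop ever detect a session timeout?
        final int attempts;    // how many connection attempts were made
        Result(boolean expired, int attempts) {
            this.expired = expired;
            this.attempts = attempts;
        }
    }

    /**
     * @param resetHeardOnReconnect mimics updating lastHeard on each attempt (cause 1)
     * @param breakOnExpiry         leave the loop once the timeout is detected (cause 2)
     */
    static Result run(boolean resetHeardOnReconnect, boolean breakOnExpiry,
                      long sessionTimeoutMs, long attemptCostMs, int maxAttempts) {
        long now = 0;        // fake monotonic clock
        long lastHeard = 0;  // last time we "heard" from a server
        boolean expired = false;
        int attempts = 0;
        while (attempts < maxAttempts) {
            attempts++;
            if (resetHeardOnReconnect) {
                lastHeard = now;      // idle time restarts from zero
            }
            now += attemptCostMs;     // the attempt fails after some delay
            long idleRecv = now - lastHeard;
            if (idleRecv >= sessionTimeoutMs) {
                expired = true;
                if (breakOnExpiry) {
                    break;            // stop retrying once expired
                }
            }
        }
        return new Result(expired, attempts);
    }
}
```

With a 30000 ms timeout and 1000 ms per failed attempt, resetting `lastHeard` on every attempt keeps the idle time at 1000 ms, so the timeout is never detected. Without the reset the timeout is detected, but the loop only stops if it also breaks.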

@kezhuw kezhuw force-pushed the ZOOKEEPER-4508-endless-loop-when-all-server-down branch from 8f8910e to 905d56a Compare April 1, 2022 14:21
@kezhuw kezhuw force-pushed the ZOOKEEPER-4508-endless-loop-when-all-server-down branch from 905d56a to 6bd083f Compare May 4, 2023 04:55
@kezhuw
Member Author

kezhuw commented May 8, 2023

Another user reported this in ZOOKEEPER-4692.

Ping @eolivelli @tisonkun @symat @maoling @cnauroth for review.

@kezhuw kezhuw force-pushed the ZOOKEEPER-4508-endless-loop-when-all-server-down branch from 6bd083f to d741b0a Compare May 9, 2023 09:19
@RabbitDong-on

Hi, kezhu. Can you explain this bug in detail or give reproduction steps? Thanks. I am confused.

@@ -1192,7 +1192,6 @@ public void run() {
 startConnect(serverAddress);
 // Update now to start the connection timer right after we make a connection attempt
 clientCnxnSocket.updateNow();
-clientCnxnSocket.updateLastSendAndHeard();
Contributor


Why are you removing updateLastSendAndHeard (here and in the other place)?

Member Author


Semantically, it is because we have not heard anything from the server (lastHeard) at either of these places. If we update lastHeard there, getIdleRecv is reset to 0 on every reconnect, so no SessionTimeoutException is ever thrown.

As for lastSend, I think it does not matter, as it is only used for pings in the CONNECTED state, which is entered after a successful ConnectRequest that itself calls updateLastSend. I don't see a reason for updateLastSend in these two places.
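
The distinction between the two timestamps can be sketched as follows. This is a simplified stand-in for the socket's bookkeeping, assuming idle times are plain differences against a cached `now`; the class name `IdleTimersSketch` and the `updateNow(long)` signature are hypothetical.

```java
// Simplified stand-in for the send/receive idle bookkeeping.
// Hypothetical class; the real ClientCnxnSocket differs in detail.
final class IdleTimersSketch {
    private long now;       // cached "current" time in ms
    private long lastSend;  // last time a packet was sent
    private long lastHeard; // last time anything was received

    void updateNow(long elapsedMs) { now = elapsedMs; }

    // Call only when a packet was actually sent.
    void updateLastSend() { lastSend = now; }

    // Call only when data was actually received from a server.
    void updateLastHeard() { lastHeard = now; }

    long getIdleSend() { return now - lastSend; }

    long getIdleRecv() { return now - lastHeard; }
}
```

Keeping the two updates separate means a connection attempt can no longer reset the receive idle time by accident.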

@kezhuw
Member Author

kezhuw commented May 17, 2023

Can you explain this bug in detail or give reproduction steps? Thanks. I am confused.

@RabbitDong-on Please see testWatcherExpiredAfterAllServerDown and the other changed tests. Both the Jira description and the commit message describe the issue:

The observable behavior is that the client's watcher never receives the Expired event.

Member Author

@kezhuw kezhuw left a comment


After digging into the code history, I doubt whether ZooKeeper ever tried to support session expiration based solely on client-side timing. But it is indeed a problem if the client cannot decide to expire a session on its own when it is unable to contact a server.

@@ -1233,13 +1241,20 @@ public void run() {
to = connectTimeout - clientCnxnSocket.getIdleRecv();
}

if (to <= 0) {
if (expirationTimeout - clientCnxnSocket.getIdleRecv() <= 0) {


THREAD_SAFETY_VIOLATION: Read/Write race. Non-private method ClientCnxn$SendThread.run() reads without synchronization from this.this$0.expirationTimeout. Potentially races with write in method ClientCnxn$SendThread.onConnected(...).
Reporting because this access may occur on a background thread.
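
If the field really is written and read on different threads (that is the bot's claim and is not verified here), one common remedy is to declare it `volatile`. A minimal sketch, with a hypothetical class `ExpirationTimeoutSketch` standing in for the relevant part of the client:

```java
// Hypothetical holder for the timeout; not the actual ClientCnxn field layout.
final class ExpirationTimeoutSketch {
    // volatile guarantees that a write by one thread is visible to later
    // reads by other threads, without requiring a lock.
    private volatile int expirationTimeout;

    ExpirationTimeoutSketch(int initialTimeoutMs) {
        this.expirationTimeout = initialTimeoutMs;
    }

    // Stand-in for the write flagged in onConnected(...).
    void onConnected(int negotiatedTimeoutMs) {
        expirationTimeout = negotiatedTimeoutMs;
    }

    // Stand-in for the read flagged in run(): same comparison as the diff hunk.
    boolean isExpired(long idleRecvMs) {
        return expirationTimeout - idleRecvMs <= 0;
    }
}
```

A single `volatile` int is enough here because each access is one independent read or write. Compound updates would need synchronization or an atomic type instead.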



@kezhuw
Member Author

kezhuw commented Sep 1, 2023

Superseded by #2058, which formally proposes a client-side session expiration timeout.

@kezhuw kezhuw closed this Sep 1, 2023
@kezhuw kezhuw deleted the ZOOKEEPER-4508-endless-loop-when-all-server-down branch October 14, 2024 02:38
@kezhuw kezhuw restored the ZOOKEEPER-4508-endless-loop-when-all-server-down branch October 14, 2024 02:39