Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix false-alarm timeout exception when waiting for key exchange to complete #912

Merged
merged 2 commits into from
Apr 18, 2024

Conversation

rasantel
Copy link
Contributor

@rasantel rasantel commented Nov 10, 2023

Problem

As described in #909, the SSHJ client may hit a timeout exception almost as soon as the main thread starts waiting for the key exchange to complete, even though the timeout has not been reached.

Why isn't this issue more widespread? I don't know :) I started hitting it when I enabled the heartbeater by setting the keepalive interval just before starting a connection.

Diagnosis

This issue is an existing race condition that doesn't actually require the heartbeater to be running, but apparently the presence of the heartbeater thread (at least on my production Ubuntu 20.04 linux systems) creates the conditions for this particular interleaving to be more likely to occur.

Based on the SSHJ logs and code, the issue can happen during the following sequence of events between the main thread and the reader thread:

[2023-11-06T21:59:52,561][DEBUG][net.schmizz.sshj.transport.KeyExchanger][sshj-Reader-localhost/127.0.0.1:22-1699307992560][KeyExchanger.handle] Received SSH_MSG_KEXINIT
[2023-11-06T21:59:52,561][DEBUG][net.schmizz.sshj.transport.KeyExchanger][sshj-Reader-localhost/127.0.0.1:22-1699307992560][KeyExchanger.startKex] Initiating key exchange
[2023-11-06T21:59:52,562][DEBUG][net.schmizz.concurrent.Promise][Test worker][Promise.tryRetrieve] Awaiting <<kex done>>
[2023-11-06T21:59:52,562][DEBUG][net.schmizz.concurrent.Promise][sshj-Reader-localhost/127.0.0.1:22-1699307992560][Promise.deliver] Setting <<kex done>> to `null`

What happens under the covers is:

  1. The key exchange is initiated by the reader thread when it receives SSH_MSG_KEXINIT from the server before the main thread has a chance to initiate that exchange.
  2. Before the reader thread calls done.clear() in KeyExchanger.startKex, the main thread also calls KeyExchanger.startKex, skips the key exchange initiation (because it's already ongoing) and calls done.await with a timeout to wait for the exchange to complete.
  3. done.await calls Promise.retrieve which calls Promise.tryRetrieve, which waits for the associated condition via cond.await.
  4. The reader thread calls done.clear(), which delivers a value of null to the done promise.
  5. This causes cond.await to wake up the main thread, and Promise.tryRetrieve returns this null value.
  6. Promise.retrieve throws a timeout exception because the retrieved value is null, even though there wasn't a timeout.

I can consistently reproduce this issue by attaching a debugger in order to force this particular thread interleaving. I set a breakpoint at the entrance of KeyExchanger.startKex, let both the main thread and the reader hit it, resume the reader thread (so it will initiate the exchange) but stop it right before done.clear, resume the main thread which will pause at Promise.retrieve, and then resume the reader so it calls done.clear, which delivers a null to Promise.retrieve on the main thread and thus triggers the exception.

Solution

One option would be to add extra synchronization to KeyExchanger.startKex to ensure that waitForDone, which calls done.await, is not called before done.clear.

The other option I see, which avoids extra synchronization and its drawbacks, is to think of Promise.clear as just the act of clearing the value to null, without actually "delivering" that null value to anyone waiting on that promise. This is what I am proposing in the current diff.

The only concern with doing the latter could be any side effects it may have on other uses of Promise.clear. I inspected the code and found one other use of clear, in UserAuthImpl.authenticate. In that method, clear is called before retrieve, so the call to retrieve will not be affected by this change in clear. Presumably, all calls to authenticate should happen in the same thread, but if somehow an application calls it from different threads, then this change fixes the same issue that we target in this PR for the key exchange; parallel calls to authenticate will no longer throw a spurious timeout exception in such an unlikely use case. The only other place where Promise.clear is called is Event.clear, and an inspection of the uses of Event.clear results in similar conclusions: calls to clear cannot occur while the event is waiting on its promise.

Testing

I repeated the same procedure that reproduces the bug. With this fix applied, the main thread at Promise.retrieve no longer receives a null from done.clear from the reader thread, and instead is awakened later when the reader actually delivers a non-null value via done.set, thus avoiding the exception.

@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.46%. Comparing base (03f8b22) to head (b9581e1).
Report is 6 commits behind head on master.

❗ Current head b9581e1 differs from pull request most recent head 631d882. Consider uploading reports for the commit 631d882 to get more accurate results

Additional details and impacted files
@@             Coverage Diff              @@
##             master     #912      +/-   ##
============================================
- Coverage     68.96%   68.46%   -0.51%     
+ Complexity     1448     1407      -41     
============================================
  Files           208      207       -1     
  Lines          7602     7505      -97     
  Branches        658      642      -16     
============================================
- Hits           5243     5138     -105     
- Misses         2009     2021      +12     
+ Partials        350      346       -4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@hierynomus hierynomus merged commit 079cb08 into hierynomus:master Apr 18, 2024
4 checks passed
@hierynomus
Copy link
Owner

Thanks! Merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants