-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
make sure failed SSL does not impact other sessions #64256
Conversation
Tagging subscribers to this area: @dotnet/ncl, @vcsjones Issue DetailsI was finally able to reproduce it while looking at something else. I have debug build of OpenSSL + custom tracing and that allowed me to have somewhat reliable repro (to fail on 10-15 minutes) Here is what I saw:
then handshake function reruns failure and we will try to get more info about it. However, because the handle is closed and released that fails
so we throw and that bails out of the handshake leaving the error behind. The OpenSSL error queue is per-thread so when unrelated operation comes in and if there is debug build we hit the assert - that guards agains exactly this behavior. On release build we can mix the errors and return back something lame. It is somewhat unfortunate that with current logic, basically any call to OpenSSL with handle can throw. Eventually I decided it is OK to throw and I added finally block to clean up the error. We may explore how to use the handles and avoid cases like this but for now I decided to go with simplest fix as we see hits in CI and it is unpleasant as it kills whole set run. fixes #57722
|
src/libraries/Common/src/Interop/Unix/System.Security.Cryptography.Native/Interop.OpenSsl.cs
Outdated
Show resolved
Hide resolved
} | ||
} | ||
finally | ||
{ | ||
Crypto.ErrClearError(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A comment would be helpful.
And there's no finer-grained place to do this? Just seems a little bandaid-y to do this here, as it seems like we don't know where the extraneous error is coming from.
But if this is the best place, so be it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, we know exactly what is failing - at least in this particular case.
139940077336320:error:1409442E:SSL routines:ssl3_read_bytes:tlsv1 alert protocol version:ssl/record/rec_layer_s3.c:1543:SSL alert number 70
The test is constructed to fail and sends TLS alert. But the client side is disposed before it is able to fully process it.
It should be sufficient to cleanup only on error or exception.
The other aspect is that GetSslError() gets last error from OpenSSL. I don't tank if it is ever possible to get multiple errors. From that prospective Crypto.ErrClearError
is generic bandaid for something that may (or may not) happen.
We could avoid this by grabbing reference on the SafeHandle. If we do, we would do handshake AND have ability to query error and follow the error path.
One more thought, Since the Ssl.SslDoHandshake
is only called from here, we could change it to pass back also what ever GetSslError() would do. e.g. avoiding the second call that may throw.
Int Ssl.SslDoHandshake(SafeHandle context, ref Ssl.SslErrorCode error)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing the PAL method to combine two P/Invokes into one by making the error code an out/ref/pointer parameter, if they're always called in pairs, seems like some goodness (it'll reduce the interlocked counting on the safehandle, and all the other work that goes on in a P/Invoke).
As for sprinkling in a call to ERR_clear_error(): I'm actively working on a change to reverse all of these flows. Every shim function that has a possibility of changing the error queue starts off with ERR_clear_error(). If there are multiple places that it might've sprinkled in noise, it'll call it multiple times.
Once that's done, it should be the case that we can remove the shim function for it altogether, because every flow looks like
- Clean queue
- Do work.
- Execption? Throw first error from the queue.
- No exception? Queue floats.
- Next op starts
- Clear error queue.
- Do work.
- ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll make the minimal change and then I'll leave the general cleanup to you @bartonjs. It seems like failure can put multiple error on queue, right? e.g. just calling GetError() to get last one may not be sufficient...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e.g. just calling GetError() to get last one may not be sufficient...
The new model is to throw the first one (clearing out the rest). The reason the current/"old" model is the last one is because we ignored the queue and let it float around, so the last one is the only one we knew was relevant. But, we've since learned of operations that emit multiple errors, the latter ones being bad error state recovery from the initial.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I modified SslDoHandshake
to give error back to avoid extra call and removed the try/finally.
That should be sufficient to deal with the released handle during handshake.
I was thinking about this for a while. I agree with @stephentoub that the fix looks bandaid-y and I could scope it more to just cover the case I was able to reproduce. cc: @bartonjs and @am11 since similar topic bubbled up on separate thread. (not sure if related) |
I was finally able to reproduce it while looking at something else. I have debug build of OpenSSL + custom tracing and that allowed me to have somewhat reliable repro (to fail on 10-15 minutes)
Here is what I saw:
Some test can Dispose SslStream while still in handshake. When this happens, The handle is released by the interior code as soon as we exit from the pinvoke.
then handshake function reruns failure and we will try to get more info about it. However, because the handle is closed and released that fails
so we throw and that bails out of the handshake leaving the error behind. The OpenSSL error queue is per-thread so when unrelated operation comes in and if there is debug build we hit the assert - that guards agains exactly this behavior. On release build we can mix the errors and return back something lame.
It is somewhat unfortunate that with current logic, basically any call to OpenSSL with handle can throw.
My First fix was to add
DangerousAddRef/Release
to keep the handle always alive through the handshake.I did ~ 1000 runs over night and never hit a problem.
Eventually I decided it is OK to throw and I added finally block to clean up the error.
We may not need the cleanup if handshake finished successfully and we were able to get the error.
But it its probably cheap enough to just keep it there as common case.
We may explore how to use the handles and avoid cases like this but for now I decided to go with simplest fix as we see hits in CI and it is unpleasant as it kills whole set run.
fixes #57722