
Check for close_notify message in Base.isopen(ctx::SSLContext) #145

Merged: 3 commits into JuliaLang:master on Apr 18, 2018

Conversation

@samoconnor (Contributor)

This change enables Base.isopen(ctx::SSLContext) to detect that the peer has closed an idle connection.

Before this change an idle TLS connection could receive a close_notify message from the TLS peer, but we would never notice it because the LibuvStream was no longer in StatusActive so the message never found its way into the read buffer.

This appears to be the root cause of various EOFError() issues related to re-use of idle connections:
JuliaWeb/HTTP.jl#214
JuliaWeb/HTTP.jl#199
JuliaWeb/HTTP.jl#220
JuliaWeb/GitHub.jl#106

This change modifies Base.isopen(ctx::SSLContext) to:

  • call Base.start_reading(ctx.bio); yield() to ensure that the LibuvStream is active,
  • do a zero-byte mbedtls_ssl_read to ensure that any received close_notify message is processed by the TLS library, and
  • check the return code of the zero-byte read for MBEDTLS_ERR_SSL_PEER_CLOSE_NOTIFY.
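A minimal sketch of that logic (illustrative only, not the exact diff; ssl_read stands in for the package's low-level wrapper around mbedtls_ssl_read, and is an assumed name):

function Base.isopen(ctx::SSLContext)
    # Ensure libuv keeps pulling bytes off the socket so that a pending
    # close_notify record can reach the TLS layer.
    Base.start_reading(ctx.bio)
    yield()

    # A zero-byte read makes mbedtls process any buffered records
    # (including close_notify) without consuming application data.
    n = ssl_read(ctx, C_NULL, 0)               # hypothetical low-level wrapper
    if n == MBEDTLS_ERR_SSL_PEER_CLOSE_NOTIFY
        return false                           # peer has closed the session
    end
    return ctx.isopen
end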

Note that the idle_timeout= change to GitHub.jl is still desirable because there is a race-condition when a request is sent on an idle connection at the same moment that the server decides to send close_notify and drop the connection.

@samoconnor (Contributor, Author)

Review request: @vtjnash, @quinnj, @malmaud, @KristofferC

@quinnj (Member) left a comment

The logic seems sound to me. It's a little difficult to assess the possible downsides of a change like this: will any of these new calls unexpectedly throw? Is it bad to call Base.start_reading if ctx.bio happens to be in a bad state or something?

I think it all seems good though.

codecov-io commented Apr 16, 2018

Codecov Report

Merging #145 into master will increase coverage by 0.75%.
The diff coverage is 72.22%.


@@            Coverage Diff             @@
##           master     #145      +/-   ##
==========================================
+ Coverage   68.82%   69.57%   +0.75%     
==========================================
  Files          10       10              
  Lines         433      447      +14     
==========================================
+ Hits          298      311      +13     
- Misses        135      136       +1
Impacted Files Coverage Δ
src/ssl.jl 73.75% <72.22%> (+2.1%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 05eae2e...697c5d5.

@vtjnash (Member) left a comment

It's fascinating to realize there's actually a spec for this race condition in the HTTP/1.1 keep-alive documentation (https://www.w3.org/Protocols/rfc2616/rfc2616-sec8.html). Apparently, nobody handles this particularly well.

src/ssl.jl Outdated
# has sent a close_notify message on an otherwise idle connection.
# https://tools.ietf.org/html/rfc5246#section-7.2.1
Base.start_reading(ctx.bio)
yield()
@vtjnash (Member):

isopen isn't typically a yield-point. It should use Base.process_events(false) instead.
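A minimal sketch of the suggested substitution (same two calls as the quoted hunk, with process_events in place of yield):

Base.start_reading(ctx.bio)
Base.process_events(false)  # pump the libuv event loop once, non-blocking,
                            # without turning isopen into a task switch point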

vtjnash (Member) commented Apr 16, 2018

Maybe we should add something like: (EDIT: add to users, like HTTP.jl)

kill_on_data(s::IO) = eof(s) || s.in_use || close(s)

s.in_use = true
send(s, "TYP PATH HTTP/1.1\nHEADERS\n\nBODY")
reply = recv(s)
s.in_use = false
@async kill_on_data(s)
return reply

 - @schedule task to keep socket in active state
 - and monitor for MBEDTLS_ERR_SSL_PEER_CLOSE_NOTIFY

Whenever MBEDTLS_ERR_SSL_PEER_CLOSE_NOTIFY is received, call close(::SSLContext).

move zero-byte read into function: decrypt_available_bytes
samoconnor (Contributor, Author) commented Apr 17, 2018

@vtjnash, your kill_on_data idea, and your comment that isopen isn't typically a yield-point, and a night of sleep, led me to a slightly different implementation.

I've moved the start_reading call and the check for MBEDTLS_ERR_SSL_PEER_CLOSE_NOTIFY to an async loop started by handshake(ctx::SSLContext). This way the close will be noticed as soon as it arrives.

I've changed the handling of MBEDTLS_ERR_SSL_PEER_CLOSE_NOTIFY so that it always calls close(ctx::SSLContext), ensuring that the socket is closed and the ctx.isopen flag is cleared.

I've also moved the zero-byte read into a function decrypt_available_bytes.
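Putting those pieces together, the shape of the new code is roughly as follows (a sketch of the approach described above, not the merged diff):

function handshake(ctx::SSLContext)
    # ... existing handshake logic ...

    # Watch for close_notify while the connection is otherwise idle.
    @schedule while isopen(ctx)
        Base.start_reading(ctx.bio)    # keep the LibuvStream active
        wait(ctx.bio.readnotify)       # wake when bytes (or EOF) arrive
        decrypt_available_bytes(ctx)   # zero-byte read; on close_notify
                                       # this path ends in close(ctx)
    end
    nothing
end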

…calling isopen(::SSLContext) does that for us
@samoconnor (Contributor, Author)
There is now a pleasing symmetry between the status query functions isopen and _bytesavailable (and indirectly eof) in that they all call decrypt_available_bytes before inspecting the status.
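In sketch form (illustrative; get_bytes_avail is an assumed wrapper name, mirroring the C API's mbedtls_ssl_get_bytes_avail):

# Each status query first lets mbedtls decrypt whatever has arrived, so a
# pending close_notify is seen before the status is reported.
Base.isopen(ctx::SSLContext) = (decrypt_available_bytes(ctx); ctx.isopen)
_bytesavailable(ctx::SSLContext) = (decrypt_available_bytes(ctx); get_bytes_avail(ctx))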

# Ensure that libuv is reading data from the socket in case the peer
# has sent a close_notify message on an otherwise idle connection.
# https://tools.ietf.org/html/rfc5246#section-7.2.1
Base.start_reading(ctx.bio)
@samoconnor (Contributor, Author):
Base.start_reading should always be very cheap:

    elseif stream.status == StatusPaused
        stream.status = StatusActive
        return Int32(0)
    elseif stream.status == StatusActive
        return Int32(0)

https://github.com/JuliaLang/julia/blob/master/base/stream.jl#L651-L655

vtjnash (Member) commented Apr 17, 2018

Yes, this seems good. I like that yours is handling it at the TLS level, rather than entirely deferring to the higher level.

For non-SSL streams, should we add a similar watchdog to HTTP.jl?

I think eof should be a roughly equivalent blocking test to your while loop; it just lacks the ability to contain a while loop (e.g. it would typically need to be re-installed by read after consuming the buffer, perhaps via an extra Condition variable). But, on the other hand, it would be using a more generic part of the IO API. The downside to your code in this PR is that it disables flow-control. The other, related downside (and why we can't do this in Base) is that we can't hand off the stream to a subprocess at any later point, because our watchdog would remain active and continuously try to consume bytes into our local buffer. For this PR, neither of those seems too important to me right now.

samoconnor (Contributor, Author) commented Apr 17, 2018

For non-SSL streams, should we add a similar watchdog to HTTP.jl?

I hadn't given it much thought; most HTTP APIs mandate TLS these days, so that's the most common use-case. Is it the case that Base.isopen(::TCPSocket) will return true in StatusOpen state even after a FIN or RST has been received? i.e. unless start_reading is called, libuv will never tell Base that the connection has been closed? In that case, perhaps Base.isopen(::LibuvStream) should call start_reading and process_events?

I've just tried the HTTP.jl patch below for plain HTTP/TCP connections. It keeps the socket in active state and does an explicit close after isopen returns false.
Without the patch, I see (via Wireshark) that the server sent a FIN, but we don't notice until we try to send another request and get a RST:

28.060559 HTTP/1.1 200 OK 
28.060637 52251 → 80 [ACK] Seq=1129 Ack=7141 Win=262112 Len=0
51.099488 80 → 52251 [FIN, ACK] Seq=7141 Ack=1129 Win=16896 Len=0
51.099568 52251 → 80 [ACK] Seq=1129 Ack=7142 Win=262144 Len=0
61.643910 52251 → 80 [PSH, ACK] Seq=1129 Ack=7142 Win=262144 Len=16 [TCP segment of a reassembled PDU]
61.644131 GET / HTTP/1.1 
61.672759 80 → 52251 [RST] Seq=7142 Win=0 Len=0

With the patch, we see the FIN and respond with a FIN and the connection is cleanly closed:

28.406297 HTTP/1.1 200 OK 
28.406331 52239 → 80 [ACK] Seq=1129 Ack=7141 Win=261696 Len=0
51.038689 80 → 52239 [FIN, ACK] Seq=7141 Ack=1129 Win=16896 Len=0
51.038768 52239 → 80 [ACK] Seq=1129 Ack=7142 Win=262144 Len=0
51.038869 52239 → 80 [FIN, ACK] Seq=1129 Ack=7142 Win=262144 Len=0
51.064841 80 → 52239 [ACK] Seq=7142 Ack=1130 Win=16896 Len=0
--- a/src/ConnectionPool.jl
+++ b/src/ConnectionPool.jl
@@ -488,6 +488,15 @@ struct ConnectTimeout <: Exception
     port
 end

+function tcp_monitor(tcp)
+    while isopen(tcp)
+        Base.start_reading(tcp)
+        wait(tcp.readnotify)
+        yield()
+    end
+    close(tcp)
+end
+
 function getconnection(::Type{TCPSocket},
                        host::AbstractString,
                        port::AbstractString;
@@ -502,6 +511,7 @@ function getconnection(::Type{TCPSocket},
     if connect_timeout == 0
         tcp = Sockets.connect(Sockets.getaddrinfo(host), p)
         keepalive && keepalive!(tcp)
+        @schedule tcp_monitor(tcp)
         return tcp
     end

Part of the problem here is that the Base API does not distinguish between isopenread / isopenwrite or closeread / closewrite. I think if the Base API for bi-directional streams were more explicit about this, it would be easier to reason about cases like this.
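For illustration only, the kind of split API being wished for here, using the names from the comment (none of these existed in Base at the time; the bodies are placeholders):

closeread(s::IO)   = error("placeholder: stop reading, keep writing")
closewrite(s::IO)  = error("placeholder: send FIN/close_notify, keep reading")
isopenread(s::IO)  = isopen(s)   # placeholder: has the peer sent FIN/close_notify?
isopenwrite(s::IO) = isopen(s)   # placeholder: have we sent FIN/close_notify?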

@samoconnor (Contributor, Author)
The following is a revised version of the HTTP.jl patch that works for both TLS and TCP.
It calls isopen on the ::IO object, thereby invoking the MBEDTLS_ERR_SSL_PEER_CLOSE_NOTIFY check in Base.isopen(::SSLContext) if the ::IO is a TLS connection.
With this patch, the @schedule while isopen(ctx) loop in handshake(::SSLContext) is redundant. But the question remains: is it generally useful to leave it in the TLS library, or should there be an option in Base to do Sockets.setopt(tcp, close_on_remote_fin=true)?

--- a/src/ConnectionPool.jl
+++ b/src/ConnectionPool.jl
@@ -453,6 +453,7 @@ function getconnection(::Type{Transaction{T}},
             busy = findall(T, host, port, pipeline_limit)
             if length(busy) < connection_limit
                 io = getconnection(T, host, port; kw...)
+                @schedule tcp_monitor(io)
                 c = Connection(host, port,
                                pipeline_limit, idle_timeout,
                                io)
@@ -488,6 +489,16 @@ struct ConnectTimeout <: Exception
     port
 end

+function tcp_monitor(io)
+    tcp = tcpsocket(io)
+    while isopen(io)
+        Base.start_reading(tcp)
+        wait(tcp.readnotify)
+        yield()
+    end
+    close(tcp)
+end
+

@samoconnor (Contributor, Author)
This HTTP.jl PR JuliaWeb/HTTP.jl#235 implements the patch above.
If both PRs are used, TLS connections end up with two monitoring tasks and a race to see which one calls isopen first. This seems to be harmless.

vtjnash pushed a commit to JuliaWeb/HTTP.jl that referenced this pull request Apr 17, 2018
When a connection is returned to the (read) pool,
add a monitor to it for receiving unexpected data (or EOF),
and kill / close the Connection object if any activity occurs
before the next write (when it should have simply been waiting idle in the pool)

per JuliaLang/MbedTLS.jl#145 (comment)

closes #214
closes #199
closes #220
closes JuliaWeb/GitHub.jl#106
vtjnash (Member) commented Apr 17, 2018

Sockets.setopt(tcp, close_on_remote_fin=true)

Definitely, no. That would introduce all sorts of other subtle bugs and races into the IO system.

I've made a counter-proposal PR with my concept of what I think this should look like.

is redundant

Not a major issue, since we'll just call all of them. Unfortunately, all versions of this also trigger a very serious bug with IdDict, and so this PR is currently unusable on all versions of Julia.

@samoconnor (Contributor, Author)
@vtjnash, I assume this is the IdDict issue: JuliaLang/julia#26839 ?
Can you explain how/when this bug is triggered by the connection monitoring we're doing here, and how it manifests?
(I haven't seen any problems on my mac or on travis.)

@samoconnor (Contributor, Author)
I'll address the HTTP.jl PR in its own comments section.

With regard to the @schedule while isopen(ctx) loop in handshake(::SSLContext):
Given that the TLS protocol includes OOB alert messages that can arrive when the application layer is not trying to read, it seems that the Base.start_reading call to keep the socket in active state is a generally useful change.

So I'm in favour of merging this PR as is (but I don't have commit access).
@quinnj? @vtjnash?

quinnj merged commit 6d1b87a into JuliaLang:master on Apr 18, 2018
quinnj (Member) commented Apr 18, 2018

I plan on putting some load on a local application w/ this change in to see how it fares; I'll post back here if I see any issues.

@samoconnor (Contributor, Author)
thx @quinnj

quinnj pushed a commit to JuliaWeb/HTTP.jl that referenced this pull request May 31, 2018
* ConnectionPool: monitor idle connections

When a connection is returned to the (read) pool,
add a monitor to it for receiving unexpected data (or EOF),
and kill / close the Connection object if any activity occurs
before the next write (when it should have simply been waiting idle in the pool)

per JuliaLang/MbedTLS.jl#145 (comment)

closes #214
closes #199
closes #220
closes JuliaWeb/GitHub.jl#106

*  - Encapsulate read|writebusy/sequence/count logic in new isbusy function.
 - Move close() on eof() || !isbusy() to new monitor_idle_connection function.
 - Make monitor_idle_connection() a noop for ::Connection{SSLContext}

* require Julia 0.6.3 #236 (comment)