
ConnectionPool: monitor idle connections #236

Merged: 3 commits merged into master on May 31, 2018
Conversation

@vtjnash (Member) commented Apr 17, 2018

When a connection is returned to the (read) pool,
add a monitor to it for receiving unexpected data (or EOF),
and kill/close the Connection object if any activity occurs
before the next write (when it should simply have been waiting idle in the pool).

per JuliaLang/MbedTLS.jl#145 (comment)

closes #214
closes #199
closes #220
closes JuliaWeb/GitHub.jl#106
replaces #235
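
In outline, the mechanism looks something like the sketch below (simplified; the PR's real version also consults the connection's busy/sequence bookkeeping so that expected data is not mistaken for idle activity):

```julia
# A task parked in eof() on the idle, pooled connection. eof() blocks
# while the connection is quiet, returns true if the peer closes, and
# returns false as soon as any data arrives.
function monitor_idle_connection(c)
    if eof(c.io)       # peer hung up while the connection sat idle
        close(c.io)    # kill it so it is never handed out for reuse
    end
end
```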

@samoconnor (Contributor)

closes JuliaWeb/GitHub.jl#106

From JuliaWeb/GitHub.jl#106 (comment):

To be reliable, GitHub.jl will still need to implement state-aware retry of non-idempotent operations, but the changes above should reduce failure rates in the short term.

The change in this PR, or #235, and idle_timeout= and JuliaLang/MbedTLS.jl#145, should all help to reduce the probability of the HTTP layer making a non-idempotent request that cannot be automatically retried. However, the HTTP layer will never handle every corner case. The server can commit a transaction and send "HTTP/1.1 200 OK", but the network can fail before we get the "OK"; the request might time out, or we might get a [RST] or a [FIN], depending on the type of failure. Applications that make POST requests will always need to implement application-state-aware retry (or set retry_non_idempotent=true if they are sure that is safe).
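
For example, a caller that knows a particular POST is safe to replay can opt in explicitly via HTTP.jl's retry keywords (the URL and body here are placeholders):

```julia
using HTTP

# Only safe if the application knows a replayed POST is harmless,
# e.g. the server deduplicates requests at the application level:
HTTP.request("POST", "https://example.com/api/transfer";
             body = "amount=10",
             retry = true,
             retry_non_idempotent = true)
```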

@samoconnor (Contributor)

I haven't yet reviewed whether the busy/count/sequence logic is correct, but I get the point. We want to say "if we are not expecting data".

Compared to #235:

  • I like that this is only active when there is no active transaction.
  • I like that it avoids unexported Base internals.
  • I'm not certain that it plays nicely with TLS. I don't know enough about TLS to know how often OOB alerts will be received at random times. Even if close_notify is the only thing we're going to receive, it seems that it would be best to allow the TLS layer to deal with it cleanly, i.e. whatever HTTP.jl does should avoid getting in the way of a clean TLS shutdown.
  • Perhaps we should do the close after a short delay to give the TLS layer time to clean up nicely. Or perhaps we should only do this for raw TCP sockets, given that MbedTLS.jl should handle this for us via close_notify.

For the purposes of arbitrary local pedantry, I would:

  • Put the monitor in a separate function so that only one line is added to closeread.
  • Keep the postcondition at the end of the function.
  • Re-wrap at 80 cols.
  • Add a @debug log message.

@samoconnor (Contributor)

This might be better, i.e. if the connection is at EOF (nothing more to read), then close it (to signal that we won't write any more).

--- a/src/ConnectionPool.jl
+++ b/src/ConnectionPool.jl
@@ -270,7 +270,7 @@ function IOExtras.closeread(t::Transaction)
     # short-circuit if it is already dead or has already started to be reused
     if isopen(t.c) && !t.c.writebusy && (t.c.writecount <= t.sequence + 1) # && !isreadable(t)
         @schedule begin
-            if !eof(t.c.io) && # wait to see if receive any more data
+            if eof(t.c.io) || # wait to see if receive any more data
                     !t.c.readbusy && # and aren't actively expecting to receive data
                     !t.c.writebusy && (t.c.writecount <= t.sequence + 1) # nor have returned from startwrite and/or called closewrite again
                 # other end must be in an invalid state, or terminated the connection:

…tion.

 - Move close() on eof() || !isbusy() to new monitor_idle_connection function.
 - Make monitor_idle_connection() a noop for ::Connection{SSLContext}
@samoconnor (Contributor) commented Apr 19, 2018

Hi @vtjnash, @quinnj,
I've pushed a commit to this PR that I think captures the intention of Jameson's proposal, but cleans the code up a bit:

  • Encapsulate read|writebusy/sequence/count logic in new isbusy function.
  • Move close() on eof() || !isbusy() to new monitor_idle_connection function.
  • Make monitor_idle_connection() a noop for ::Connection{SSLContext}

Please review, and please excuse me if I've blatted an important detail in my reinterpretation.
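
In rough outline, the commit has this shape (a simplified sketch, not the verbatim implementation; the isbusy test here compresses the busy/sequence/count checks from the diff quoted earlier):

```julia
# One predicate for the read/write busy and sequence/count bookkeeping:
isbusy(c::Connection) = c.writebusy || c.readbusy

# The monitor: close the connection if EOF arrives while it sits idle.
function monitor_idle_connection(c::Connection)
    if eof(c.io)
        close(c.io)
    end
end

# No-op under TLS: MbedTLS.jl itself watches idle connections for
# close_notify (JuliaLang/MbedTLS.jl#145).
function monitor_idle_connection(c::Connection{SSLContext})
end
```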

@samoconnor mentioned this pull request Apr 19, 2018
@vtjnash (Member, Author) commented Apr 19, 2018

LGTM. I was assuming that seeing eof implied close had already been called, but there's no harm in being explicit. I have no opinion currently on making it a no-op for SSL. We know that it should now be handled there, but it also seems suspect to make these code paths diverge (mostly just in terms of test coverage, though).

@@ -264,10 +270,32 @@ function IOExtras.closeread(t::Transaction)
notify(t.c.readdone) ;@debug 2 "✉️ Read done: $t"
notify(poolcondition)

if !isbusy(t.c)
@schedule monitor_idle_connection(t.c)
@vtjnash (Member, Author)

maybe worth doing let c = t.c; @schedule monitor_idle_connection(c); end, so we don't capture all of t. But AFAIK there are no finalizers on t, or any other way to observe the difference.

@samoconnor (Contributor)

Yeah, t is an immutable struct containing just c and an Int, so there's probably not much difference.
But it's a good reminder to consider what is being captured in closures; I have a feeling I should be more cognisant of that.

@samoconnor (Contributor)

I was assuming that seeing eof implied close had already been called, but no harm in being explicit.

AFAICT, eof() == true just means we received a [FIN] or a [RST]; calling close() means we send a [FIN].
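
A quick loopback sketch of that distinction (self-contained, using the Sockets stdlib of current Julia; the port number is arbitrary):

```julia
using Sockets

server = listen(ip"127.0.0.1", 8765)
client = connect(ip"127.0.0.1", 8765)
peer   = accept(server)

close(peer)           # the peer sends its [FIN]
@assert eof(client)   # we have *received* a [FIN]; ours is not yet sent
close(client)         # now we send our [FIN]
```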

@vtjnash (Member, Author) commented Apr 19, 2018

Doesn't usually matter, unless you think it's going to be long-lived relative to the object that it wraps and holds lots of other references, too.

calling close() means we send a [FIN]

Makes sense. Seems like you're really great at finding bugs, though :)
If we're already in StateEOF, we might (unintentionally) ignore the close call: https://github.com/JuliaLang/julia/blob/50413a46a05b4f522e09fac2fb9b6a705cac31d7/base/stream.jl#L322
But we only enter that state for TTY objects. Everything else we close immediately upon receipt of an EOF notification: https://github.com/JuliaLang/julia/blob/50413a46a05b4f522e09fac2fb9b6a705cac31d7/base/stream.jl#L480-L488
Should we not do that, and let the user handle this?

@samoconnor (Contributor)

We need to make a fundamental decision:

  • A: Does our IO model distinguish between open-for-read and open-for-write?
    ...and provide an API for SHUT_RD SHUT_WR and SHUT_RDWR?, or...
  • B: Does our IO model have only two states?
    open == readable and writeable and
    closed == no reading or writing allowed.

If we go for model A, we can still have convenience methods to hide read vs. write (e.g. close can do SHUT_RDWR), but the spec for the convenience methods will be clearly and generically defined in terms of the read/write-aware API.

Going for model B might not be unreasonable. We could argue that protocols relying on SHUT_WR are archaic, and that modern protocols all have higher-level mechanisms for connection-lifetime control built in these days. This would, however, preclude working with protocols of the form TX([Request], [FIN]), RX([Response], [FIN]).

In model B, calling close() when we get a [FIN] is probably correct. In model A, the fact that the peer has nothing else to send doesn't mean we shouldn't be able to send more to the peer.
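
As a concrete illustration of model A, a TX([Request], [FIN]) / RX([Response], [FIN]) exchange would look like the sketch below. The closewrite half-close is hypothetical at the time of this discussion (Julia later gained closewrite(::TCPSocket)), and the address is made up:

```julia
using Sockets

sock = connect("127.0.0.1", 12345)
write(sock, "request")
closewrite(sock)               # SHUT_WR: send our [FIN]; reads still allowed
response = read(sock, String)  # read until the peer's [FIN]
close(sock)                    # SHUT_RDWR and release the descriptor
```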

The current situation feels like it is somewhere in between: we mostly try to hide the distinction between the send and receive sides of the connection, but then we have eof(), which queries only the receive state. The current situation is also not consistent across the various IO subtypes (JuliaLang/julia#24526).

Either way, I think we need a section under Interfaces that lays down the law, and a state diagram that shows which sequences of state transitions are allowed and what causes them.

@samoconnor (Contributor)

... all versions of this also triggers a very serious bug with IdDict

@vtjnash, is this that?

https://travis-ci.org/JuliaWeb/HTTP.jl/jobs/368429787#L4837-L4842

  KeyError: key TCPSocket(RawFD(-1) closed, 0 bytes waiting) not found
  getindex at ./associative.jl:363 [inlined]
  unpreserve_handle at ./libuv.jl:52
  wait_readnb at ./stream.jl:305
  eof at ./stream.jl:56
  eof at /home/travis/.julia/v0.6/HTTP/src/ConnectionPool.jl:152
  readuntil at /home/travis/.julia/v0.6/HTTP/src/IOExtras.jl:149
  readheaders at /home/travis/.julia/v0.6/HTTP/src/Messages.jl:468

@vtjnash (Member, Author) commented Apr 19, 2018

SHUT_RDWR is a bit different, since it affects the file description, not the file descriptor.

I think model A would be more consistent, especially with things like AbstractPipe, where it's meaningful and useful to have a half-duplex pipe. We used to think libuv required a call to uv_close in the callback when it reported EOF (which is definitely not true now), which is how Julia ended up here.

Yep, that's the manifestation of that bug.

@quinnj (Member) commented Apr 19, 2018

FWIW, I saw the KeyError: key TCPSocket(RawFD(-1) closed, 0 bytes waiting) not found error quite a bit when running this PR under some dev load (w/ MbedTLS master) yesterday. Is there a current workaround? Or do we just need to wait for the PR to Base to get merged?

@vtjnash (Member, Author) commented Apr 19, 2018

This PR I think is a no-op for HTTPS workloads

@samoconnor (Contributor)

WIP notes...

| State | Description |
|-------|-------------|
| NEW | Valid fd but not connected. Error if reading or writing attempted. |
| SYN | Connected for writes and reads. |
| FIN_WR | Error if writing attempted. |
| FIN_RD | Error if reading attempted. |
| FIN | fd still valid. Error if reading or writing attempted. No error on status query functions. |
| CLOSED | fd no longer valid. Error on any operation other than closed query. |

| POSIX | State | Go | Rust | Julia |
|-------|-------|----|------|-------|
| | !CLOSED, !FIN, !FIN_WR | SetNoDelay | set_nodelay | |
| | !CLOSED, !FIN, !FIN_WR | SetWriteDeadline | set_write_timeout | |
| send | SYN, FIN_RD | Write | Write | unsafe_write |
| shutdown(SHUT_WR) | SYN -> FIN_WR, FIN_RD -> FIN | CloseWrite | shutdown | closewrite? |
| | !CLOSED, !FIN, !FIN_RD | SetReadDeadline | set_read_timeout | |
| recv | SYN, FIN_WR | Read | Read | unsafe_read |
| shutdown(SHUT_RD) | SYN -> FIN_RD, FIN_WR -> FIN | CloseRead (CloseWrite + CloseRead != Close) | shutdown | closeread? |
| | !CLOSED | SetLinger | | |
| close | * -> CLOSED | Close | "when the value is dropped" | close |

@quinnj (Member) commented Apr 29, 2018

@samoconnor, @vtjnash, should we merge this now that the IdDict bug is fixed in julia master?

@samoconnor (Contributor)

I would prefer to wait until there is a Julia 0.6.3 with the IdDict bugfix, and/or put an if VERSION... guard around @schedule monitor_idle_connection so that this feature is never enabled on a Julia version where it is known to cause the KeyError problem.
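
A sketch of such a guard (illustrative; it assumes the fix lands in 0.6.3 and reuses the names from the diff above):

```julia
# Only enable the idle monitor on Julia versions that have the IdDict fix:
@static if VERSION >= v"0.6.3"
    if !isbusy(t.c)
        @schedule monitor_idle_connection(t.c)
    end
end
```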

@quinnj (Member) commented May 30, 2018

@vtjnash, what do you think about the status of this PR? Can we put versioning in so it only applies to 0.7 with the IdDict bug fixed?

@codecov-io

Codecov Report

Merging #236 into master will increase coverage by 0.06%.
The diff coverage is 87.5%.


@@            Coverage Diff             @@
##           master     #236      +/-   ##
==========================================
+ Coverage   71.52%   71.59%   +0.06%     
==========================================
  Files          35       35              
  Lines        1886     1894       +8     
==========================================
+ Hits         1349     1356       +7     
- Misses        537      538       +1
| Impacted Files | Coverage Δ |
|----------------|------------|
| src/ConnectionPool.jl | 75.92% <87.5%> (+0.44%) ⬆️ |

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 597c0c7...071b40e. Read the comment docs.

@samoconnor (Contributor)

@quinnj, I've just pushed a commit to this PR to require 0.6.3.
I assume that this should take care of the IdDict problem?

@quinnj merged commit 24aa08e into master May 31, 2018
@quinnj deleted the jn/peer_close_notify_branch branch May 31, 2018 03:47
@quinnj (Member) commented May 31, 2018

I'm seeing tests hang on 0.6 on CI: https://travis-ci.org/JuliaWeb/HTTP.jl/jobs/386029926, https://ci.appveyor.com/project/quinnj/http-jl/build/1.0.611/job/u7y9uj2m3ol6neau. @samoconnor, @vtjnash, could there be something in the changes here that would cause 0.6 to hang?

@quinnj (Member) commented May 31, 2018

@samoconnor, I can reproduce the hang locally on 0.6; it seems to hang on this block: https://github.com/JuliaWeb/HTTP.jl/blob/master/test/loopback.jl#L203. Could you take a look?

@quinnj (Member) commented Jun 4, 2018

OK, this was actually a conflict of sorts with my 0.7 compat PR, which changed @schedule blocks to @async. Fix is here: #248
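
For reference, the two spellings (both launch the monitor task without waiting for it; c stands for the pooled connection):

```julia
@schedule monitor_idle_connection(c)   # Julia 0.6 spelling
@async    monitor_idle_connection(c)   # Julia 0.7+ spelling, after the compat PR
```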
