
ConnectionPool: monitor idle connections #236

Merged: 3 commits merged into master on May 31, 2018
Conversation

@vtjnash (Member) commented Apr 17, 2018

When a connection is returned to the (read) pool,
add a monitor to it for receiving unexpected data (or EOF),
and kill/close the Connection object if any activity occurs
before the next write (when it should simply have been waiting idle in the pool).

per JuliaLang/MbedTLS.jl#145 (comment)

closes #214
closes #199
closes #220
closes JuliaWeb/GitHub.jl#106
replaces #235
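
In outline, the mechanism looks something like the sketch below (simplified; the PR's real version also consults the connection's busy/sequence bookkeeping so that expected data is not mistaken for idle activity):

```julia
# A task parked in eof() on the idle, pooled connection. eof() blocks
# while the connection is quiet, returns true if the peer closes, and
# returns false as soon as any data arrives.
function monitor_idle_connection(c)
    if eof(c.io)       # peer hung up while the connection sat idle
        close(c.io)    # kill it so it is never handed out for reuse
    end
end
```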

@samoconnor (Contributor)

closes JuliaWeb/GitHub.jl#106

From JuliaWeb/GitHub.jl#106 (comment):

To be reliable, GitHub.jl will still need to implement state-aware retry of non-idempotent operations, but the changes above should reduce failure rates in the short term.

The change in this PR, or #235, and idle_timeout= and JuliaLang/MbedTLS.jl#145, should all help to reduce the probability of the HTTP layer making a non-idempotent request that cannot be automatically retried. However, the HTTP layer will never handle every corner case. The server can commit a transaction and send "HTTP/1.1 200 OK", but the network can fail before we get the "OK"; the request might time out, or we might get a [RST] or a [FIN], depending on the type of failure. Applications that make POST requests will always need to implement application-state-aware retry (or set retry_non_idempotent=true if they are sure that is safe).
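
For example, a caller that knows a particular POST is safe to replay can opt in explicitly via HTTP.jl's retry keywords (the URL and body here are placeholders):

```julia
using HTTP

# Only safe if the application knows a replayed POST is harmless,
# e.g. the server deduplicates requests at the application level:
HTTP.request("POST", "https://example.com/api/transfer";
             body = "amount=10",
             retry = true,
             retry_non_idempotent = true)
```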

@samoconnor (Contributor)

I haven't yet reviewed whether the busy/count/sequence logic is correct, but I get the point. We want to say "if we are not expecting data".

Compared to #235:

  • I like that this is only active when there is no active transaction.
  • I like that it avoids unexported Base internals.
  • I'm not certain that it plays nicely with TLS. I don't know enough about TLS to know how often OOB alerts will be received at random times. Even if close_notify is the only thing we're going to receive, it seems that it would be best to allow the TLS layer to deal with it cleanly, i.e. whatever HTTP.jl does should avoid getting in the way of a clean TLS shutdown.
  • Perhaps we should do the close after a short delay to give the TLS layer time to clean up nicely. Or perhaps we should only do this for raw TCP sockets, given that MbedTLS.jl should handle this for us via close_notify.

For the purposes of arbitrary local pedantry, I would:

  • Put the monitor in a separate function so that only one line is added to closeread.
  • Keep the postcondition at the end of the function.
  • Re-wrap at 80 cols.
  • Add a @debug log message.

@samoconnor (Contributor)

This might be better, i.e. if the connection is at EOF (nothing more to read), then close it (to signal that we won't write any more).

--- a/src/ConnectionPool.jl
+++ b/src/ConnectionPool.jl
@@ -270,7 +270,7 @@ function IOExtras.closeread(t::Transaction)
     # short-circuit if it is already dead or has already started to be reused
     if isopen(t.c) && !t.c.writebusy && (t.c.writecount <= t.sequence + 1) # && !isreadable(t)
         @schedule begin
-            if !eof(t.c.io) && # wait to see if receive any more data
+            if eof(t.c.io) || # wait to see if receive any more data
                     !t.c.readbusy && # and aren't actively expecting to receive data
                     !t.c.writebusy && (t.c.writecount <= t.sequence + 1) # nor have returned from startwrite and/or called closewrite again
                 # other end must be in an invalid state, or terminated the connection:

…tion.

 - Move close() on eof() || !isbusy() to new monitor_idle_connection function.
 - Make monitor_idle_connection() a noop for ::Connection{SSLContext}
@samoconnor (Contributor) commented Apr 19, 2018

Hi @vtjnash, @quinnj,
I've pushed a commit to this PR that I think captures the intention of Jameson's proposal, but cleans the code up a bit:

  • Encapsulate read|writebusy/sequence/count logic in new isbusy function.
  • Move close() on eof() || !isbusy() to new monitor_idle_connection function.
  • Make monitor_idle_connection() a noop for ::Connection{SSLContext}

Please review, and please excuse me if I've blatted an important detail in my reinterpretation.
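
In rough outline, the commit has this shape (a simplified sketch, not the verbatim implementation; the isbusy test here compresses the busy/sequence/count checks from the diff quoted earlier):

```julia
# One predicate for the read/write busy and sequence/count bookkeeping:
isbusy(c::Connection) = c.writebusy || c.readbusy

# The monitor: close the connection if EOF arrives while it sits idle.
function monitor_idle_connection(c::Connection)
    if eof(c.io)
        close(c.io)
    end
end

# No-op under TLS: MbedTLS.jl itself watches idle connections for
# close_notify (JuliaLang/MbedTLS.jl#145).
function monitor_idle_connection(c::Connection{SSLContext})
end
```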

@samoconnor mentioned this pull request Apr 19, 2018
@vtjnash (Member, Author) commented Apr 19, 2018

LGTM. I was assuming that seeing eof implied close had already been called, but there's no harm in being explicit. I have no opinion currently on making it a no-op for SSL. We know that it should now be handled there, but it also seems suspect to make these code paths diverge (mostly just in terms of test coverage, though).

@@ -264,10 +270,32 @@ function IOExtras.closeread(t::Transaction)
notify(t.c.readdone) ;@debug 2 "✉️ Read done: $t"
notify(poolcondition)

if !isbusy(t.c)
@schedule monitor_idle_connection(t.c)
@vtjnash (Member, Author)

maybe worth doing let c = t.c; @schedule monitor_idle_connection(c); end, so we don't capture all of t. But AFAIK there are no finalizers on t, or any other way to observe the difference.

@samoconnor (Contributor)

Yeah, t is an immutable struct containing just c and an Int, so there's probably not much difference.
But it's a good reminder to consider what is being captured in closures; I have a feeling I should be more cognisant of that.

@samoconnor (Contributor)

I was assuming that seeing eof implied close had already been called, but no harm in being explicit.

AFAICT, eof() == true just means we received a [FIN] or a [RST]; calling close() means we send a [FIN].
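
A quick loopback sketch of that distinction (self-contained, using the Sockets stdlib of current Julia; the port number is arbitrary):

```julia
using Sockets

server = listen(ip"127.0.0.1", 8765)
client = connect(ip"127.0.0.1", 8765)
peer   = accept(server)

close(peer)           # the peer sends its [FIN]
@assert eof(client)   # we have *received* a [FIN]; ours is not yet sent
close(client)         # now we send our [FIN]
```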

@vtjnash (Member, Author) commented Apr 19, 2018

Doesn't usually matter, unless you think it's going to be long-lived relative to the object that it wraps and holds lots of other references, too.

calling close() means we send a [FIN]

Makes sense. Seems like you're really great at finding bugs, though :)
If we're already in StateEOF, we might (unintentionally) ignore the close call: https://github.com/JuliaLang/julia/blob/50413a46a05b4f522e09fac2fb9b6a705cac31d7/base/stream.jl#L322
But we only enter that state for TTY objects. Everything else we close immediately upon receipt of an EOF notification: https://github.com/JuliaLang/julia/blob/50413a46a05b4f522e09fac2fb9b6a705cac31d7/base/stream.jl#L480-L488
Should we not do that, and let the user handle this?

@samoconnor (Contributor)

We need to make a fundamental decision:

  • A: Does our IO model distinguish between open-for-read and open-for-write?
    ...and provide an API for SHUT_RD SHUT_WR and SHUT_RDWR?, or...
  • B: Does our IO model have only two states?
    open == readable and writeable and
    closed == no reading or writing allowed.

If we go for model A, we can still have convenience methods to hide read vs. write (e.g. close can do SHUT_RDWR), but the spec for the convenience methods will be clearly and generically defined in terms of the read/write-aware API.

Going for model B might not be unreasonable. We could argue that protocols relying on SHUT_WR are archaic, and that modern protocols all have higher-level mechanisms for connection-lifetime control built in these days. This would, however, preclude working with protocols of the form TX([Request], [FIN]), RX([Response], [FIN]).

In model B, calling close() when we get a [FIN] is probably correct. In model A, the fact that the peer has nothing else to send doesn't mean we shouldn't be able to send more to the peer.
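
As a concrete illustration of model A, a TX([Request], [FIN]) / RX([Response], [FIN]) exchange would look like the sketch below. The closewrite half-close is hypothetical at the time of this discussion (Julia later gained closewrite(::TCPSocket)), and the address is made up:

```julia
using Sockets

sock = connect("127.0.0.1", 12345)
write(sock, "request")
closewrite(sock)               # SHUT_WR: send our [FIN]; reads still allowed
response = read(sock, String)  # read until the peer's [FIN]
close(sock)                    # SHUT_RDWR and release the descriptor
```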

The current situation feels like it is somewhere in between: we mostly try to hide the distinction between the send and receive sides of the connection, but then we have eof(), which queries only the receive state. The current situation is also not consistent across the various IO subtypes (JuliaLang/julia#24526).

Either way, I think we need a section under Interfaces that lays down the law, and a state diagram that shows which sequences of state transitions are allowed and what causes them.

@samoconnor (Contributor)

... all versions of this also triggers a very serious bug with IdDict

@vtjnash, is this that?

https://travis-ci.org/JuliaWeb/HTTP.jl/jobs/368429787#L4837-L4842

  KeyError: key TCPSocket(RawFD(-1) closed, 0 bytes waiting) not found
  getindex at ./associative.jl:363 [inlined]
  unpreserve_handle at ./libuv.jl:52
  wait_readnb at ./stream.jl:305
  eof at ./stream.jl:56
  eof at /home/travis/.julia/v0.6/HTTP/src/ConnectionPool.jl:152
  readuntil at /home/travis/.julia/v0.6/HTTP/src/IOExtras.jl:149
  readheaders at /home/travis/.julia/v0.6/HTTP/src/Messages.jl:468

@vtjnash (Member, Author) commented Apr 19, 2018

SHUT_RDWR is a bit different, since it affects the file description, not the file descriptor.

I think model A would be more consistent, especially with things like AbstractPipe, where it's meaningful and useful to have a half-duplex pipe. We used to think libuv required a call to uv_close in the callback when it reported EOF (which is definitely not true now), which is how Julia ended up here.

Yep, that's the manifestation of that bug.

@quinnj (Member) commented Apr 19, 2018

FWIW, I saw the KeyError: key TCPSocket(RawFD(-1) closed, 0 bytes waiting) not found error quite a bit when running this PR under some dev load (w/ MbedTLS master) yesterday. Is there a current workaround? Or do we just need to wait for the PR to Base to get merged?

@vtjnash (Member, Author) commented Apr 19, 2018

This PR I think is a no-op for HTTPS workloads

@samoconnor (Contributor)

WIP notes...

| State | Description |
|-------|-------------|
| NEW | Valid fd but not connected. Error if reading or writing attempted. |
| SYN | Connected for writes and reads. |
| FIN_WR | Error if writing attempted. |
| FIN_RD | Error if reading attempted. |
| FIN | fd still valid. Error if reading or writing attempted. No error on status query functions. |
| CLOSED | fd no longer valid. Error on any operation other than closed query. |

| POSIX | State | Go | Rust | Julia |
|-------|-------|----|------|-------|
| | !CLOSED, !FIN, !FIN_WR | SetNoDelay | set_nodelay | |
| | !CLOSED, !FIN, !FIN_WR | SetWriteDeadline | set_write_timeout | |
| send | SYN, FIN_RD | Write | Write | unsafe_write |
| shutdown(SHUT_WR) | SYN -> FIN_WR, FIN_RD -> FIN | CloseWrite | shutdown | closewrite? |
| | !CLOSED, !FIN, !FIN_RD | SetReadDeadline | set_read_timeout | |
| recv | SYN, FIN_WR | Read | Read | unsafe_read |
| shutdown(SHUT_RD) | SYN -> FIN_RD, FIN_WR -> FIN | CloseRead (CloseWrite + CloseRead != Close) | shutdown | closeread? |
| | !CLOSED | SetLinger | | |
| close | * -> CLOSED | Close | "when the value is dropped" | close |

@quinnj (Member) commented Apr 29, 2018

@samoconnor, @vtjnash, should we merge this now that the IdDict bug is fixed in julia master?

@samoconnor (Contributor)

I would prefer to wait until there is a Julia 0.6.3 with the IdDict bugfix, and/or put an if VERSION... guard around @schedule monitor_idle_connection so that this feature is never enabled on a Julia version where it is known to cause the KeyError problem.
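
A sketch of such a guard (illustrative; it assumes the fix lands in 0.6.3 and reuses the names from the diff above):

```julia
# Only enable the idle monitor on Julia versions that have the IdDict fix:
@static if VERSION >= v"0.6.3"
    if !isbusy(t.c)
        @schedule monitor_idle_connection(t.c)
    end
end
```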

@quinnj (Member) commented May 30, 2018

@vtjnash, what do you think about the status of this PR? Can we put versioning in so it only applies to 0.7 with the IdDict bug fixed?

@codecov-io

Codecov Report

Merging #236 into master will increase coverage by 0.06%.
The diff coverage is 87.5%.


@@            Coverage Diff             @@
##           master     #236      +/-   ##
==========================================
+ Coverage   71.52%   71.59%   +0.06%     
==========================================
  Files          35       35              
  Lines        1886     1894       +8     
==========================================
+ Hits         1349     1356       +7     
- Misses        537      538       +1
| Impacted Files | Coverage Δ |
|----------------|------------|
| src/ConnectionPool.jl | 75.92% <87.5%> (+0.44%) ⬆️ |

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 597c0c7...071b40e. Read the comment docs.

@samoconnor (Contributor)

@quinnj, I've just pushed a commit to this PR to require 0.6.3.
I assume that this should take care of the IdDict problem?

@quinnj merged commit 24aa08e into master May 31, 2018
@quinnj deleted the jn/peer_close_notify_branch branch May 31, 2018 03:47
@quinnj (Member) commented May 31, 2018

I'm seeing tests hang on 0.6 on CI: https://travis-ci.org/JuliaWeb/HTTP.jl/jobs/386029926, https://ci.appveyor.com/project/quinnj/http-jl/build/1.0.611/job/u7y9uj2m3ol6neau. @samoconnor, @vtjnash, could there be something in the changes here that would cause 0.6 to hang?

@quinnj (Member) commented May 31, 2018

@samoconnor, I can reproduce the hang locally on 0.6; it seems to hang on this block: https://github.com/JuliaWeb/HTTP.jl/blob/master/test/loopback.jl#L203. Could you take a look?

@quinnj (Member) commented Jun 4, 2018

OK, this was actually a conflict of sorts with my 0.7 compat PR, which changed @schedule blocks to @async. Fix is here: #248
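
For reference, the two spellings (both launch the monitor task without waiting for it; c stands for the pooled connection):

```julia
@schedule monitor_idle_connection(c)   # Julia 0.6 spelling
@async    monitor_idle_connection(c)   # Julia 0.7+ spelling, after the compat PR
```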
