Guarantee exclusive use of HttpClientConnection. #273

Merged

merged 1 commit into master from dev/etan/acquirefix on May 12, 2022

Conversation

etan-status (Contributor)

When calling the HTTP `send` / `open` functions, `acquireConnection` is
called to obtain a connection in state `Ready`. In the next code block,
the connection state is advanced to `RequestHeadersSending`. However,
returning from chronos `async` procs yields control, similar to `await`.
This means that a connection may be added to the pool in `Ready` state,
and then a different `acquireConnection` call may obtain a second ref.
Introducing a new `Acquired` state ensures consistency in this case.
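
A minimal sketch of the pattern, using hypothetical stand-in types (`ConnState`, `Conn`) rather than the actual chronos internals:

```nim
import chronos

type
  ConnState = enum            # stand-in for HttpClientConnectionState
    Ready, Acquired, RequestHeadersSending
  Conn = ref object           # stand-in for HttpClientConnectionRef
    state: ConnState

proc acquireConnection(pool: seq[Conn]): Future[Conn] {.async.} =
  for conn in pool:
    if conn.state == Ready:
      # The fix: mark the connection *before* returning. Returning from
      # an `async` proc yields control (like `await`), so without this a
      # concurrently running `acquireConnection` could also see `Ready`
      # and hand out a second ref to the same connection.
      conn.state = Acquired
      return conn
```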

No tests were added because the behavior is scheduler dependent; manual
tests were run by adding a `doAssert item.state != HttpClientConnectionState.Acquired`
between `if not(isNil(conn)):` and `return conn`. After several hours of
repeated tests the assert was eventually hit, confirming the edge case
that this fix resolves.

Not sure if it is by design that returning from an `async` proc yields.
Even if it's not, this should solve current HTTP issues in nimbus-eth2.
@cheatfate cheatfate merged commit dedd7cb into master May 12, 2022
@cheatfate cheatfate deleted the dev/etan/acquirefix branch May 12, 2022 23:54
etan-status added a commit to vacp2p/nim-libp2p that referenced this pull request Mar 2, 2024
There are two edge cases that currently trigger `KeyError`:

1. `await newStream.open()` succeeds, but Chronos injects an implicit
   suspend point when an `{.async.}` function that suspended returns.
   During this implicit suspend point, another task could get scheduled
   and close the stream, which removes it from `m.channels` again.
   `let channel = m.channels[header.streamId]` then raises `KeyError`.
   More explanation of the implicit suspend point on return after await:
   status-im/nim-chronos#273

2. During the `Flush the data` block, `await m.connection.readExactly`
   is called which may suspend. If it suspends, same as above, another
   task may get scheduled that closes the stream, also resulting in the
   `KeyError` below.

The `YamuxError` is subsequently properly handled below.
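
A hedged, self-contained sketch of the guarded lookup this describes, with stand-in types (the real yamux code, names, and error handling differ; the guard shown is one possible shape of the fix):

```nim
import std/tables
import chronos

type YamuxError = object of CatchableError     # stand-in for the real type

var channels = initTable[uint32, string]()     # stand-in for m.channels

proc handleFrame(streamId: uint32) {.async.} =
  await sleepAsync(1.milliseconds)
  # Any suspend above (including the implicit one when an awaited async
  # proc returns) lets another task run and possibly remove the stream
  # from `channels`. Guard the lookup instead of indexing directly,
  # which would raise an unhandled `KeyError`:
  if streamId notin channels:
    raise newException(YamuxError, "stream closed during suspend")
  let channel = channels[streamId]
  discard channel
```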