Skip to content
This repository has been archived by the owner on Nov 6, 2020. It is now read-only.

[client]: Fix for incorrectly dropped consensus messages (#11082) #11086

Merged
merged 1 commit into from
Sep 25, 2019
Merged

[client]: Fix for incorrectly dropped consensus messages (#11082) #11086

merged 1 commit into from
Sep 25, 2019

Conversation

dforsten
Copy link
Contributor

Fixes a race condition causing the currently_queued counter to underflow and consensus messages getting dropped incorrectly as a consequence.

The initial solution suggested in #11082 was dropped in favor of converting the atomic counter to a signed integer for robustness. There is no definite guarantee that if the send() message returns an error the message will not get executed at a later point, which would cause another potential underflow.

The fix requires a slight limitation of the maximum size of the queue, a 63bit positive integer range should however be well large enough for its purposes.

@parity-cla-bot
Copy link

It looks like @dforsten signed our Contributor License Agreement. 👍

Many thanks,

Parity Technologies CLA Bot

ethcore/src/client/client.rs Show resolved Hide resolved
}

impl IoChannelQueue {
pub fn new(limit: usize) -> Self {
let limit = i64::try_from(limit).unwrap_or(i64::max_value());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you consider changing the input arg to isize rather than casting it? Seems like there are 3 call sites in total (in Client::new()) so perhaps changing the type is not too bad?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing the input arg type does spread into a couple of corners in the code, especially into client settings.
It is an option of course - on the other hand it should only be configurable as a positive number, making the setting itself a signed integer might be confusing to the user.
Using a signed integer in IoChannelQueue is a local implementation strategy to ensure the queue size counter checks work correctly in a multi-threaded environment.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on the other hand it should only be configurable as a positive number, making the setting itself a signed integer might be confusing to the user.

That could be handled when reading the user provided config, i.e. continue using an unsigned integer in configs and convert to signed there rather than for each message.

currently_queued: Arc<AtomicUsize>,
limit: usize,
currently_queued: Arc<AtomicI64>,
limit: i64,
}

impl IoChannelQueue {
pub fn new(limit: usize) -> Self {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not your code, but maybe this can be a private method?

@ordian
Copy link
Collaborator

ordian commented Sep 24, 2019

There is no definite guarantee that if the send() message returns an error the message will not get executed at a later point, which would cause another potential underflow.

Could you elaborate how's that possible? I thought if send(msg) returns an error, it means the msg won't be executed.

Using a signed integer is a clever trick, but seems more like a hack than a proper solution to me.

@ordian ordian added A0-pleasereview 🤓 Pull request needs code review. M4-core ⛓ Core client code / Rust. labels Sep 24, 2019
@ordian ordian added this to the 2.7 milestone Sep 24, 2019
@ordian
Copy link
Collaborator

ordian commented Sep 24, 2019

The initial solution suggested in #11082 was dropped in favor of converting the atomic counter to a signed integer for robustness.

Have you been able to reproduce the issue with the initial solution?

@dforsten
Copy link
Contributor Author

There is no definite guarantee that if the send() message returns an error the message will not get executed at a later point, which would cause another potential underflow.

Could you elaborate how's that possible? I thought if send(msg) returns an error, it means the msg won't be executed.

Reading the implementation of send(msg) I am not at all certain that if it returns an error the message will not be executed. If there is an absolute guarantee that this is the case then we can implement the solution I originally suggested.

@dvdplm
Copy link
Collaborator

dvdplm commented Sep 24, 2019

Reading the implementation of send(msg) I am not at all certain that if it returns an error the message will not be executed. If there is an absolute guarantee that this is the case then we can implement the solution I originally suggested.

Looking at the send() impl it seems like one of the two code paths – send_sync() – never returns an error but even in that case, e.g. a failed call to Weak::upgrade(), would not send the message.
Do you know which of the two code paths you end up in when you've encountered the problem?

In other words: I think your original suggestion was the right one.

@dforsten
Copy link
Contributor Author

The code section I am most unsure if it will return an error even if the message will eventually be sent is this one in channel.rs of mio:

    pub fn try_send(&self, t: T) -> Result<(), TrySendError<T>> {
        self.tx.try_send(t)
            .map_err(From::from)
            .and_then(|_| {
                self.ctl.inc()?;
                Ok(())
            })
    }

try_send() is ok, but I am unclear if when self.ctl.inc()?; fails the message actually gets de-queued or sent at a later point.

@dvdplm
Copy link
Collaborator

dvdplm commented Sep 24, 2019

I am unclear if when self.ctl.inc()?; fails the message actually gets de-queued or sent at a later point.

Help me check if I understand this right:

Afaict if try_send() succeeds then the message is sent; if self.ctl.inc()?; fails after that we'd see an error bubble back up to the caller and we'd have a situation where the closure returns an Err but the message was actually sent (so we'd see a decrement). In that scenario we'd bail before doing the increment.
Did I get that right? There is no de-queueing and/or re-trying happening though, or am I missing something?

When self.ctl.inc()?; fails it returns a std::io::Error so it's not something we can match on I think. If it returned a more specific error variant we could perhaps look for that specifically.

I think I understand what you mean by the original solution not being the right one: we'd increment before sending, send the message and still get an error back. Now we'd do two decrements (one from the message closure and the other in the Err => branch in the caller) and could underflow.

Sounds like your second solution with the signed int is the least worst. :/

@dvdplm
Copy link
Collaborator

dvdplm commented Sep 24, 2019

Sounds like your second solution with the signed int is the least worst. :/

With this PR we could end up with negative currently_queued: maybe queue() should check for that before sending and reset to 0 if that is the case?

Copy link
Collaborator

@ordian ordian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Long-term we want to rewrite the whole mio-thing, and since we're not confident try_send won't execute the message on error, it seems like the simplest workaround atm.

@@ -2739,12 +2740,13 @@ fn transaction_receipt(

/// Queue some items to be processed by IO client.
struct IoChannelQueue {
currently_queued: Arc<AtomicUsize>,
limit: usize,
currently_queued: Arc<AtomicI64>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add docs why we're using signed integers here, so it won't get accidentally reverted in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added documentation to the currently_queued field to clarify why signed integers are being used here.

@dvdplm
Copy link
Collaborator

dvdplm commented Sep 24, 2019

Long-term we want to rewrite the whole mio-thing,

Indeed. And on that note it's a bit weird to send in an "item count" from the outside to the queue. The queue itself should keep count of how many messages it has pending/sent/failed etc. The only callsite where count is not 1 is when we enqueue transactions.

Fixes a race condition causing the currently_queued counter to underflow and consensus messages getting dropped incorrectly as a consequence.
@dforsten
Copy link
Contributor Author

Sounds like your second solution with the signed int is the least worst. :/

With this PR we could end up with negative currently_queued: maybe queue() should check for that before sending and reset to 0 if that is the case?

The only effect a negative number would have is that the queue size limit would effectively become larger. Before this fix it would straight away fail in that case, and not forward consensus messages at all.

@dvdplm
Copy link
Collaborator

dvdplm commented Sep 25, 2019

The only effect a negative number would have is that the queue size limit would effectively become larger. Before this fix it would straight away fail in that case, and not forward consensus messages at all.

Yes, I am clear on that. It'd still be good to avoid negative numbers I think. If you disagree, do you mind explaining why?

@ordian
Copy link
Collaborator

ordian commented Sep 25, 2019

Yes, I am clear on that. It'd still be good to avoid negative numbers I think. If you disagree, do you mind explaining why?

Negative numbers could appear only temporarily due to race condition, and if we reset them to zero, currently_queued will be incorrectly indefinitely growing and reach the limit at some point.

@ordian
Copy link
Collaborator

ordian commented Sep 25, 2019

There is no definite guarantee that if the send() message returns an error the message will not get executed at a later point, which would cause another potential underflow.

But with this in mind, both the old and the signed implementation are still incorrect and could underflow.

@dforsten
Copy link
Contributor Author

dforsten commented Sep 25, 2019

There is no definite guarantee that if the send() message returns an error the message will not get executed at a later point, which would cause another potential underflow.

But with this in mind, both the old and the signed implementation are still incorrect and could underflow.

That is correct, if send() returns an Error without dropping the message the counter will decrement by one each time. With unsigned numbers the effect is an immediate underflow, with 64bit signed numbers it will take a very long time until it underflows.

If that case happens a lot or constantly the validator node should be considered faulty though since consensus messages will likely take a long/indefinite time to be processed.

@dforsten
Copy link
Contributor Author

Is there still anything I should change on my end to get this PR merged?

Copy link
Collaborator

@dvdplm dvdplm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's kind of a hack, but it is (much) better than what we have.

Thank you for taking the time, much appreciated.

@dvdplm dvdplm added B0-patch-stable 🕷 Pull request should also be back-ported to the stable branch. B1-patch-beta 🕷🕷 labels Sep 25, 2019
@dvdplm dvdplm merged commit 7c5fd04 into openethereum:master Sep 25, 2019
s3krit pushed a commit that referenced this pull request Oct 21, 2019
…1086)

Fixes a race condition causing the currently_queued counter to underflow and consensus messages getting dropped incorrectly as a consequence.
This was referenced Nov 5, 2019
dvdplm pushed a commit that referenced this pull request Nov 6, 2019
…1086)

Fixes a race condition causing the currently_queued counter to underflow and consensus messages getting dropped incorrectly as a consequence.
s3krit pushed a commit that referenced this pull request Nov 11, 2019
* ropsten #6631425 foundation #8798209 (#11201)
* [stable] builtin, istanbul and mordor testnet backports (#11234)
  * ethcore-builtin (#10850)
  * [builtin]: support `multiple prices and activations` in chain spec (#11039)
  * [chain specs]: activate `Istanbul` on mainnet (#11228)
  * ethcore/res: add mordor testnet configuration (#11200)
* Update list of bootnodes for xDai chain (#11236)
* ethcore: remove `test-helper feat` from build (#11047)
* Secret store: fix Instant::now() related race in net_keep_alive (#11155) (#11159)
* [stable]: backport #10691 and #10683 (#11143)
  * Fix compiler warning (that will become an error) (#10683)
  * Refactor Clique stepping (#10691)
* Add Constantinople eips to the dev (instant_seal) config (#10809)
* Add cargo-remote dir to .gitignore (?)
* Insert explicit warning into the panic hook (#11225)
* Fix docker centos build (#11226)
* Update MIX bootnodes. (#11203)
* Use provided usd-per-eth value if an endpoint is specified (#11209)
* Add new line after writing block to hex file. (#10984)
* Type annotation for next_key() matching of json filter options (#11192) (but no `FilterOption` in 2.5 so…)
* Upgrade jsonrpc to latest (#11206)
* [CI] check evmbin build (#11096)
* Correct EIP-712 encoding (#11092)
* [client]: Fix for incorrectly dropped consensus messages (#11086)
* Fix block detail updating (#11015)
* Switching sccache from local to Redis (#10971)
* Made ecrecover implementation trait public (#11188)
* [dependencies]: jsonrpc `14.0.1` (#11183)
* [receipt]: add `sender` & `receiver` to `RichReceipts` (#11179)
* [ethcore/builtin]: do not panic in blake2pricer on short input (#11180)
* util Host: fix a double Read Lock bug in fn Host::session_readable() (#11175)
* ethcore client: fix a double Read Lock bug in fn Client::logs() (#11172)
* Change how RPCs eth_call and eth_estimateGas handle "Pending" (#11127)
* Cleanup stratum a bit (#11161)
* Upgrade to jsonrpc v14 (#11151)
* SecretStore: expose restore_key_public in HTTP API (#10241)
s3krit pushed a commit that referenced this pull request Nov 11, 2019
* [CI] check evmbin build (#11096)
* Correct EIP-712 encoding (#11092)
* [client]: Fix for incorrectly dropped consensus messages (#11082) (#11086)
* Update hardcoded headers (foundation, classic, kovan, xdai, ewc, ...) (#11053)
* Add cargo-remote dir to .gitignore (?)
* Update light client headers: ropsten 6631425 foundation 8798209 (#11201)
* Update list of bootnodes for xDai chain (#11236)
* ethcore/res: add mordor testnet configuration (#11200)
* [chain specs]: activate Istanbul on mainnet (#11228)
* [builtin]: support multiple prices and activations in chain spec (#11039)
* [receipt]: add sender & receiver to RichReceipts (#11179)
* [ethcore/builtin]: do not panic in blake2pricer on short input (#11180)
* Made ecrecover implementation trait public (#11188)
* Fix docker centos build (#11226)
* Update MIX bootnodes. (#11203)
* Insert explicit warning into the panic hook (#11225)
* Use provided usd-per-eth value if an endpoint is specified (#11209)
* Cleanup stratum a bit (#11161)
* Add Constantinople EIPs to the dev (instant_seal) config (#10809) (already backported)
* util Host: fix a double Read Lock bug in fn Host::session_readable() (#11175)
* ethcore client: fix a double Read Lock bug in fn Client::logs() (#11172)
* Type annotation for next_key() matching of json filter options (#11192)
* Upgrade jsonrpc to latest (#11206)
* [dependencies]: jsonrpc 14.0.1 (#11183)
* Upgrade to jsonrpc v14 (#11151)
* Switching sccache from local to Redis (#10971)
* Snapshot restoration overhaul (#11219)
* Add new line after writing block to hex file. (#10984)
* Pause pruning while snapshotting (#11178)
* Change how RPCs eth_call and eth_estimateGas handle "Pending" (#11127)
* Fix block detail updating (#11015)
* Make InstantSeal Instant again #11186
* Filter out some bad ropsten warp snapshots (#11247)
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-pleasereview 🤓 Pull request needs code review. B0-patch-stable 🕷 Pull request should also be back-ported to the stable branch. M4-core ⛓ Core client code / Rust.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants