[client]: Fix for incorrectly dropped consensus messages (#11082) #11086

dforsten · 2019-09-23T20:39:07Z

Fixes a race condition causing the currently_queued counter to underflow and consensus messages getting dropped incorrectly as a consequence.

The initial solution suggested in #11082 was dropped in favor of converting the atomic counter to a signed integer for robustness. There is no definite guarantee that if the send() message returns an error the message will not get executed at a later point, which would cause another potential underflow.

The fix requires a slight limitation of the maximum size of the queue, a 63bit positive integer range should however be well large enough for its purposes.

parity-cla-bot · 2019-09-23T20:39:09Z

It looks like @dforsten signed our Contributor License Agreement. 👍

Many thanks,

Parity Technologies CLA Bot

ethcore/src/client/client.rs

dvdplm · 2019-09-24T07:17:42Z

ethcore/src/client/client.rs

 }

 impl IoChannelQueue {
 	pub fn new(limit: usize) -> Self {
+		let limit = i64::try_from(limit).unwrap_or(i64::max_value());


Did you consider changing the input arg to isize rather than casting it? Seems like there are 3 call sites in total (in Client::new()) so perhaps changing the type is not too bad?

Changing the input arg type does spread into a couple of corners in the code, especially into client settings.
It is an option of course - on the other hand it should only be configurable as a positive number, making the setting itself a signed integer might be confusing to the user.
Using a signed integer in IoChannelQueue is a local implementation strategy to ensure the queue size counter checks work correctly in a multi-threaded environment.

on the other hand it should only be configurable as a positive number, making the setting itself a signed integer might be confusing to the user.

That could be handled when reading the user provided config, i.e. continue using an unsigned integer in configs and convert to signed there rather than for each message.

dvdplm · 2019-09-24T07:18:08Z

ethcore/src/client/client.rs

-	currently_queued: Arc<AtomicUsize>,
-	limit: usize,
+	currently_queued: Arc<AtomicI64>,
+	limit: i64,
 }

 impl IoChannelQueue {
 	pub fn new(limit: usize) -> Self {


Not your code, but maybe this can be a private method?

ordian · 2019-09-24T10:43:13Z

There is no definite guarantee that if the send() message returns an error the message will not get executed at a later point, which would cause another potential underflow.

Could you elaborate how's that possible? I thought if send(msg) returns an error, it means the msg won't be executed.

Using a signed integer is a clever trick, but seems more like a hack than a proper solution to me.

ordian · 2019-09-24T10:46:15Z

The initial solution suggested in #11082 was dropped in favor of converting the atomic counter to a signed integer for robustness.

Have you been able to reproduce the issue with the initial solution?

dforsten · 2019-09-24T11:26:02Z

There is no definite guarantee that if the send() message returns an error the message will not get executed at a later point, which would cause another potential underflow.

Could you elaborate how's that possible? I thought if send(msg) returns an error, it means the msg won't be executed.

Reading the implementation of send(msg) I am not at all certain that if it returns an error the message will not be executed. If there is an absolute guarantee that this is the case then we can implement the solution I originally suggested.

dvdplm · 2019-09-24T12:52:03Z

Reading the implementation of send(msg) I am not at all certain that if it returns an error the message will not be executed. If there is an absolute guarantee that this is the case then we can implement the solution I originally suggested.

Looking at the send() impl it seems like one of the two code paths – send_sync() – never returns an error but even in that case, e.g. a failed call to Weak::upgrade(), would not send the message.
Do you know which of the two code paths you end up in when you've encountered the problem?

In other words: I think your original suggestion was the right one.

dforsten · 2019-09-24T14:02:53Z

The code section I am most unsure if it will return an error even if the message will eventually be sent is this one in channel.rs of mio:

    pub fn try_send(&self, t: T) -> Result<(), TrySendError<T>> {
        self.tx.try_send(t)
            .map_err(From::from)
            .and_then(|_| {
                self.ctl.inc()?;
                Ok(())
            })
    }

try_send() is ok, but I am unclear if when self.ctl.inc()?; fails the message actually gets de-queued or sent at a later point.

dvdplm · 2019-09-24T14:48:55Z

I am unclear if when self.ctl.inc()?; fails the message actually gets de-queued or sent at a later point.

Help me check if I understand this right:

Afaict if try_send() succeeds then the message is sent; if self.ctl.inc()?; fails after that we'd see an error bubble back up to the caller and we'd have a situation where the closure returns an Err but the message was actually sent (so we'd see a decrement). In that scenario we'd bail before doing the increment.
Did I get that right? There is no de-queueing and/or re-trying happening though, or am I missing something?

When self.ctl.inc()?; fails it returns a std::io::Error so it's not something we can match on I think. If it returned a more specific error variant we could perhaps look for that specifically.

I think I understand what you mean by the original solution not being the right one: we'd increment before sending, send the message and still get an error back. Now we'd do two decrements (one from the message closure and the other in the Err => branch in the caller) and could underflow.

Sounds like your second solution with the signed int is the least worst. :/

dvdplm · 2019-09-24T15:00:08Z

Sounds like your second solution with the signed int is the least worst. :/

With this PR we could end up with negative currently_queued: maybe queue() should check for that before sending and reset to 0 if that is the case?

ordian

Long-term we want to rewrite the whole mio-thing, and since we're not confident try_send won't execute the message on error, it seems like the simplest workaround atm.

ordian · 2019-09-24T16:02:39Z

ethcore/src/client/client.rs

@@ -2739,12 +2740,13 @@ fn transaction_receipt(

 /// Queue some items to be processed by IO client.
 struct IoChannelQueue {
-	currently_queued: Arc<AtomicUsize>,
-	limit: usize,
+	currently_queued: Arc<AtomicI64>,


Please add docs why we're using signed integers here, so it won't get accidentally reverted in the future.

I added documentation to the currently_queued field to clarify why signed integers are being used here.

dvdplm · 2019-09-24T17:18:42Z

Long-term we want to rewrite the whole mio-thing,

Indeed. And on that note it's a bit weird to send in an "item count" from the outside to the queue. The queue itself should keep count of how many messages it has pending/sent/failed etc. The only callsite where count is not 1 is when we enqueue transactions.

Fixes a race condition causing the currently_queued counter to underflow and consensus messages getting dropped incorrectly as a consequence.

dforsten · 2019-09-25T10:19:23Z

Sounds like your second solution with the signed int is the least worst. :/

With this PR we could end up with negative currently_queued: maybe queue() should check for that before sending and reset to 0 if that is the case?

The only effect a negative number would have is that the queue size limit would effectively become larger. Before this fix it would straight away fail in that case, and not forward consensus messages at all.

dvdplm · 2019-09-25T10:29:09Z

The only effect a negative number would have is that the queue size limit would effectively become larger. Before this fix it would straight away fail in that case, and not forward consensus messages at all.

Yes, I am clear on that. It'd still be good to avoid negative numbers I think. If you disagree, do you mind explaining why?

ordian · 2019-09-25T10:34:24Z

Yes, I am clear on that. It'd still be good to avoid negative numbers I think. If you disagree, do you mind explaining why?

Negative numbers could appear only temporarily due to race condition, and if we reset them to zero, currently_queued will be incorrectly indefinitely growing and reach the limit at some point.

ordian · 2019-09-25T10:38:33Z

There is no definite guarantee that if the send() message returns an error the message will not get executed at a later point, which would cause another potential underflow.

But with this in mind, both the old and the signed implementation are still incorrect and could underflow.

dforsten · 2019-09-25T11:36:14Z

There is no definite guarantee that if the send() message returns an error the message will not get executed at a later point, which would cause another potential underflow.

But with this in mind, both the old and the signed implementation are still incorrect and could underflow.

That is correct, if send() returns an Error without dropping the message the counter will decrement by one each time. With unsigned numbers the effect is an immediate underflow, with 64bit signed numbers it will take a very long time until it underflows.

If that case happens a lot or constantly the validator node should be considered faulty though since consensus messages will likely take a long/indefinite time to be processed.

dforsten · 2019-09-25T17:01:19Z

Is there still anything I should change on my end to get this PR merged?

dvdplm

It's kind of a hack, but it is (much) better than what we have.

Thank you for taking the time, much appreciated.

…1086) Fixes a race condition causing the currently_queued counter to underflow and consensus messages getting dropped incorrectly as a consequence.

* ropsten #6631425 foundation #8798209 (#11201) * [stable] builtin, istanbul and mordor testnet backports (#11234) * ethcore-builtin (#10850) * [builtin]: support `multiple prices and activations` in chain spec (#11039) * [chain specs]: activate `Istanbul` on mainnet (#11228) * ethcore/res: add mordor testnet configuration (#11200) * Update list of bootnodes for xDai chain (#11236) * ethcore: remove `test-helper feat` from build (#11047) * Secret store: fix Instant::now() related race in net_keep_alive (#11155) (#11159) * [stable]: backport #10691 and #10683 (#11143) * Fix compiler warning (that will become an error) (#10683) * Refactor Clique stepping (#10691) * Add Constantinople eips to the dev (instant_seal) config (#10809) * Add cargo-remote dir to .gitignore (?) * Insert explicit warning into the panic hook (#11225) * Fix docker centos build (#11226) * Update MIX bootnodes. (#11203) * Use provided usd-per-eth value if an endpoint is specified (#11209) * Add new line after writing block to hex file. (#10984) * Type annotation for next_key() matching of json filter options (#11192) (but no `FilterOption` in 2.5 so…) * Upgrade jsonrpc to latest (#11206) * [CI] check evmbin build (#11096) * Correct EIP-712 encoding (#11092) * [client]: Fix for incorrectly dropped consensus messages (#11086) * Fix block detail updating (#11015) * Switching sccache from local to Redis (#10971) * Made ecrecover implementation trait public (#11188) * [dependencies]: jsonrpc `14.0.1` (#11183) * [receipt]: add `sender` & `receiver` to `RichReceipts` (#11179) * [ethcore/builtin]: do not panic in blake2pricer on short input (#11180) * util Host: fix a double Read Lock bug in fn Host::session_readable() (#11175) * ethcore client: fix a double Read Lock bug in fn Client::logs() (#11172) * Change how RPCs eth_call and eth_estimateGas handle "Pending" (#11127) * Cleanup stratum a bit (#11161) * Upgrade to jsonrpc v14 (#11151) * SecretStore: expose restore_key_public in HTTP API (#10241)

* [CI] check evmbin build (#11096) * Correct EIP-712 encoding (#11092) * [client]: Fix for incorrectly dropped consensus messages (#11082) (#11086) * Update hardcoded headers (foundation, classic, kovan, xdai, ewc, ...) (#11053) * Add cargo-remote dir to .gitignore (?) * Update light client headers: ropsten 6631425 foundation 8798209 (#11201) * Update list of bootnodes for xDai chain (#11236) * ethcore/res: add mordor testnet configuration (#11200) * [chain specs]: activate Istanbul on mainnet (#11228) * [builtin]: support multiple prices and activations in chain spec (#11039) * [receipt]: add sender & receiver to RichReceipts (#11179) * [ethcore/builtin]: do not panic in blake2pricer on short input (#11180) * Made ecrecover implementation trait public (#11188) * Fix docker centos build (#11226) * Update MIX bootnodes. (#11203) * Insert explicit warning into the panic hook (#11225) * Use provided usd-per-eth value if an endpoint is specified (#11209) * Cleanup stratum a bit (#11161) * Add Constantinople EIPs to the dev (instant_seal) config (#10809) (already backported) * util Host: fix a double Read Lock bug in fn Host::session_readable() (#11175) * ethcore client: fix a double Read Lock bug in fn Client::logs() (#11172) * Type annotation for next_key() matching of json filter options (#11192) * Upgrade jsonrpc to latest (#11206) * [dependencies]: jsonrpc 14.0.1 (#11183) * Upgrade to jsonrpc v14 (#11151) * Switching sccache from local to Redis (#10971) * Snapshot restoration overhaul (#11219) * Add new line after writing block to hex file. (#10984) * Pause pruning while snapshotting (#11178) * Change how RPCs eth_call and eth_estimateGas handle "Pending" (#11127) * Fix block detail updating (#11015) * Make InstantSeal Instant again #11186 * Filter out some bad ropsten warp snapshots (#11247)

dvdplm reviewed Sep 24, 2019

View reviewed changes

ordian added A0-pleasereview 🤓 Pull request needs code review. M4-core ⛓ Core client code / Rust. labels Sep 24, 2019

ordian added this to the 2.7 milestone Sep 24, 2019

ordian approved these changes Sep 24, 2019

View reviewed changes

[client]: Fix for incorrectly dropped consensus messages (#11082)

954d386

Fixes a race condition causing the currently_queued counter to underflow and consensus messages getting dropped incorrectly as a consequence.

dvdplm approved these changes Sep 25, 2019

View reviewed changes

dvdplm added B0-patch-stable 🕷 Pull request should also be back-ported to the stable branch. B1-patch-beta 🕷🕷 labels Sep 25, 2019

dvdplm merged commit 7c5fd04 into openethereum:master Sep 25, 2019

dforsten mentioned this pull request Sep 26, 2019

Race Condition and incorrectly dropped consensus messages #11082

Closed

dvdplm mentioned this pull request Sep 27, 2019

Cannot increase the gasPrice of a pending transaction: "There are too many transactions in the queue..." #10636

Open

This was referenced Nov 5, 2019

v2.6.5 beta #11240

Merged

v2.5.10 stable #11239

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[client]: Fix for incorrectly dropped consensus messages (#11082) #11086

[client]: Fix for incorrectly dropped consensus messages (#11082) #11086

dforsten commented Sep 23, 2019

parity-cla-bot commented Sep 23, 2019

dvdplm Sep 24, 2019

dforsten Sep 24, 2019

dvdplm Sep 24, 2019

dvdplm Sep 24, 2019

ordian commented Sep 24, 2019

ordian commented Sep 24, 2019

dforsten commented Sep 24, 2019

dvdplm commented Sep 24, 2019 •

edited

Loading

dforsten commented Sep 24, 2019

dvdplm commented Sep 24, 2019

dvdplm commented Sep 24, 2019

ordian left a comment

ordian Sep 24, 2019

dforsten Sep 25, 2019

dvdplm commented Sep 24, 2019

dforsten commented Sep 25, 2019

dvdplm commented Sep 25, 2019

ordian commented Sep 25, 2019 •

edited

Loading

ordian commented Sep 25, 2019

dforsten commented Sep 25, 2019 •

edited

Loading

dforsten commented Sep 25, 2019

dvdplm left a comment

[client]: Fix for incorrectly dropped consensus messages (#11082) #11086

[client]: Fix for incorrectly dropped consensus messages (#11082) #11086

Conversation

dforsten commented Sep 23, 2019

parity-cla-bot commented Sep 23, 2019

dvdplm Sep 24, 2019

Choose a reason for hiding this comment

dforsten Sep 24, 2019

Choose a reason for hiding this comment

dvdplm Sep 24, 2019

Choose a reason for hiding this comment

dvdplm Sep 24, 2019

Choose a reason for hiding this comment

ordian commented Sep 24, 2019

ordian commented Sep 24, 2019

dforsten commented Sep 24, 2019

dvdplm commented Sep 24, 2019 • edited Loading

dforsten commented Sep 24, 2019

dvdplm commented Sep 24, 2019

dvdplm commented Sep 24, 2019

ordian left a comment

Choose a reason for hiding this comment

ordian Sep 24, 2019

Choose a reason for hiding this comment

dforsten Sep 25, 2019

Choose a reason for hiding this comment

dvdplm commented Sep 24, 2019

dforsten commented Sep 25, 2019

dvdplm commented Sep 25, 2019

ordian commented Sep 25, 2019 • edited Loading

ordian commented Sep 25, 2019

dforsten commented Sep 25, 2019 • edited Loading

dforsten commented Sep 25, 2019

dvdplm left a comment

Choose a reason for hiding this comment

dvdplm commented Sep 24, 2019 •

edited

Loading

ordian commented Sep 25, 2019 •

edited

Loading

dforsten commented Sep 25, 2019 •

edited

Loading