
Queued child chain blocks that got stuck might cause operator to invalidate the child chain #702

Closed
pdobacz opened this issue May 20, 2019 · 8 comments
Labels: bug (Something isn't working), martini

@pdobacz (Contributor) commented May 20, 2019

tl;dr Whenever a formed and queued child chain block (as produced by OMG.ChildChain.BlockQueue.Server) gets stuck at any stage of the submission flow (see the comments for possible scenarios), it can become stale.

Such a stuck block, while valid at the moment it was formed, can become invalid over time: for example, if an exit gets started, matures and is processed, a UTXO spent in the stuck block goes missing.


Original synopsis (and original title) below:

Resyncing staging-v0.1 suddenly fails with :tx_execution

iex(1)> 2019-05-20 14:42:15.818 [info] module=OMG.Watcher.BlockGetter.Core function=log_downloading_blocks/2 ⋅Child chain seen at block #523000. Downloading blocks [522000]⋅
2019-05-20 14:42:16.158 [info] module=OMG.Watcher.BlockGetter.Core function=validate_download_response/5 ⋅Validating block #522000 5D8B3DA5AC292461... with 3 txs⋅
2019-05-20 14:42:16.319 [error] module=OMG.Watcher.BlockGetter function=handle_cast/2 ⋅Invalid block 522000, because of {:error, {:tx_execution, :utxo_not_found}}⋅

when using 112fb91

@pdobacz (Contributor, Author) commented May 20, 2019

Investigation led to the discovery that this is not an issue with the Watcher or the Child chain implementation per se.

To begin with, the Watcher validly signals the invalid block. The block spends from {521110, 0, 0}, which has been exited and the exit finalized. This is how it happened:

  1. Someone decided to spend {521110, 0, 0} and submitted a transaction. This happened sometime before 2019-05-13 09:39:30.991 UTC, but after the last valid block submission on staging-v0.1, which in turn was around 2019-05-07 15:53:42.278 ⋅Submitted %OMG.API.BlockQueue.Core.BlockSubmission{...., num: 521000}⋅
  2. At that point, the child chain suffered from the authority account being locked (investigated in: Authority account unlocked on geth. Childchain reports locked #662). The transaction ended up in block 522000, which did not make it into Ethereum until it was too late.
  3. Next, the said {521110, 0, 0} was exited successfully, without block 522000 having been seen. The child chain server correctly recognized the exit (2019-05-13 09:39:30.991 [info] module=OMG.API.State.Core function=exit_utxos/2 ⋅Recognized exits [{:utxo_position, 521110, 0, 0}]⋅), but it was too late, as the invalidating block 522000 was already queued up.
  4. After the authority account got unlocked, block 522000 made its way to Ethereum, rendering the chain invalid:
elixir-omg | 2019-05-14 12:07:02.690 [info] module=OMG.Watcher.BlockGetter.Core function=log_downloading_blocks/2 ⋅Child chain seen at block #523000. Downloading blocks [522000]⋅
elixir-omg | 2019-05-14 12:07:03.217 [info] module=OMG.Watcher.BlockGetter.Core function=validate_download_response/5 ⋅Validating block #522000 5D8B3DA5AC292461... with 3 txs⋅
elixir-omg | 2019-05-14 12:07:03.705 [error] module=OMG.Watcher.BlockGetter function=handle_cast/2 ⋅Invalid block 522000, because of {:error, {:tx_execution, :utxo_not_found}}⋅
elixir-omg | 2019-05-14 12:07:04.210 [warn] module=OMG.Watcher.BlockGetter function=handle_info/2 ⋅Chain invalid when trying to download blocks, because of {:error, [%OMG.Watcher.Event.InvalidBlock{blknum: 522000, error_type: :tx_execution, hash: <<93, 139, 61, 165, 172, 41, 36, 97, 126, 210, 95, 49, 75, 109, 111, 70, 147, 80, 27, 98, 134, 139, 12, 22, 67, 48, 62, 16, 50, 3, 67, 112>>, name: :invalid_block}]}, won't try again⋅

@pdobacz pdobacz changed the title Resyncing staging-v0.1 suddenly fails with :tx_execution :account_locked or other reasons for queued blocks to get stuck might cause operator to invalidate the child chain May 20, 2019
@pdobacz (Contributor, Author) commented May 21, 2019

Dropping take-aways from a Slack conversation with @InoMurko @JBunCE:

how do we fix this?

Propositions:

  • if prolonged_outage, do: cleanse_block_queue_&_drop_mempool() (but there's a risk that it backfires via false positives; a rough sketch of the idea follows this list)
  • we need to be alarmed when accounts are locked so that we can immediately unlock them (logging of BlockQueue problems was improved recently, so it should be easier to diagnose now). If we're approaching a managed child chain service, then a simple set of alarms and exposed APIs for administration purposes should be in place, as a sort of back office.
  • we could also put a fuse in the contract:
function submitBlock(bytes32 _blockHash, uint256 _onlyTillEthHeight) public {
    require(block.number < _onlyTillEthHeight);
    // ... the rest of the existing submitBlock logic ...
}
  • a variation of the above is also that we do cleanse_block_queue_&_drop_mempool somewhat manually - when we know things might go bad.
    • Yeah but you would need to get admin access 😁
    • We need a better mechanism. This won’t scale
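
A rough Elixir sketch of the first proposition could look like the following. This is illustration only: the module, function names and the threshold are assumptions, not code that exists in elixir-omg.

defmodule ProlongedOutageGuard do
  # Hypothetical threshold: how many Ethereum blocks without a mined submission
  # we would treat as a prolonged outage (~1 hour at 15 s blocks, assumed figure).
  @prolonged_outage_eth_blocks 240

  def maybe_cleanse(current_eth_height, last_mined_submission_height) do
    if current_eth_height - last_mined_submission_height > @prolonged_outage_eth_blocks do
      cleanse_block_queue()
      drop_mempool()
    end

    :ok
  end

  # Placeholders standing in for the actual queue/mempool cleanup.
  defp cleanse_block_queue, do: :ok
  defp drop_mempool, do: :ok
end

As noted in the first bullet, the hard part is making such a guard robust against false positives.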

But seriously, this is ugly, because whatever state we hold, if it's not out there on ethereum it's getting stale every minute 😞

How old can it get before it breaks the chain?

we have this thing called sla_margin, which is how late a block may be compared to an exit that invalidates a tx within it. This should be the minimum finalization period minus some time, e.g. 7 days - 1 day, but right now it's calculated a bit differently (there's a ticket on the Pivotal tracker to fix that).

So the same rule applies here I guess.

The gist of the rule is that such a block must be published early enough that exit challenges can still go through comfortably before the exits mature.
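
To make the rule concrete, purely as an illustration (the day figures and the 15-second block time below are assumptions, not the actual elixir-omg configuration or the current formula), the margin expressed in Ethereum blocks works out roughly like this:

# Illustration only - the figures are assumed, not the real configuration.
min_exit_period = 7 * 24 * 60 * 60      # minimum finalization period, e.g. 7 days, in seconds
challenge_headroom = 1 * 24 * 60 * 60   # time left for exit challenges to go through, e.g. 1 day
eth_block_time = 15                     # assumed average Ethereum block time, in seconds

sla_margin_eth_blocks = div(min_exit_period - challenge_headroom, eth_block_time)
# => 34_560 Ethereum blocks - the most a queued child chain block may be late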

You know if we run with Parity this problem goes away immediately

Only its manifestation via account_locked does. Any failure of a block to submit can cause this; it's just that account_locked is common for us, at least with geth.

Block submission can fail to go through for many reasons. We fail loudly (and omg_child_chain used to come to a halt on such events) in all the obvious and easy-to-detect cases - like account_locked, not enough ether, or geth connectivity issues.

There are probably cases that aren't loud enough, but they are also much less common, like bad network congestion or the child chain being censored; those are also much harder to detect 😞

@InoMurko (Contributor):

Infura!

@pdobacz (Contributor, Author) commented Jan 17, 2020

Hm, I missed this getting closed, but I think Infura doesn't solve this (see ":account_locked or other reasons for queued blocks to get stuck").

It could actually make it worse, since any Infura-related delay in pushing the submission through can put us in the same bad position.

I'll take the liberty of reopening this, rephrasing the title to demonstrate the issue better, and adding some clarifications to the description.

@pdobacz pdobacz reopened this Jan 17, 2020
@pdobacz pdobacz changed the title :account_locked or other reasons for queued blocks to get stuck might cause operator to invalidate the child chain Queued child chain blocks that got stuck might cause operator to invalidate the child chain Jan 17, 2020
@pdobacz pdobacz removed their assignment Jan 17, 2020
@InoMurko (Contributor):

OK, to clarify: :account_locked is not what could happen on Infura. What could happen is, for example, network saturation preventing us from submitting a block "in time".

The other part, which makes things a bit more complicated, is the introduction of the secure submission layer.

@InoMurko InoMurko self-assigned this Mar 16, 2020
@InoMurko (Contributor):

As a start, I'll increase observability of the queued-up blocks.

@InoMurko (Contributor):

Once we have that, we will be able to observe the queue size by:

{:ok, mined_blknum} = Eth.RootChain.get_mined_child_block()
{:ok, stored_child_top_num} = OMG.DB.get_single_value(:child_top_block_number)
# number of formed child chain blocks still waiting to be mined on Ethereum
queue_size = div(stored_child_top_num - mined_blknum, Application.fetch_env!(:omg_eth, :child_block_interval))

In Datadog, we can plot this and create alarms/monitors that trigger if queue_size starts rising and isn't being drained fast enough.
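
As a minimal sketch of the reporting side (the event name and the use of :telemetry are assumptions about the wiring, not necessarily how the Datadog integration is actually hooked up), the computed queue_size could be emitted as a measurement that a Datadog monitor then alerts on:

# Sketch only: event name is assumed; a telemetry/StatsD reporter would forward it to Datadog.
:telemetry.execute([:omg, :block_queue, :size], %{queue_size: queue_size}, %{})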

@unnawut (Contributor) commented Jul 10, 2020

Closing in favour of #1629, which I'm currently working on.

@unnawut unnawut closed this as completed Jul 10, 2020
@unnawut unnawut added the bug Something isn't working label Aug 26, 2020