
Queued child chain blocks that got stuck might cause operator to invalidate the child chain #702

Closed
pdobacz opened this issue May 20, 2019 · 8 comments
Labels: bug (Something isn't working), martini

@pdobacz (Contributor) commented May 20, 2019

tl;dr Whenever a formed and queued child chain block (as produced by OMG.ChildChain.BlockQueue.Server) gets stuck at any stage of the submission flow (see the comments for possible scenarios), it can become stale.

Such a stuck block, while valid at the moment it was formed, can become invalid over time: for example, if an exit gets started, matures and is processed, a UTXO spent in the stuck block goes missing.


Original synopsis (and original title) below:

Resyncing staging-v0.1 suddenly fails with :tx_execution

iex(1)> 2019-05-20 14:42:15.818 [info] module=OMG.Watcher.BlockGetter.Core function=log_downloading_blocks/2 ⋅Child chain seen at block #523000. Downloading blocks [522000]⋅
2019-05-20 14:42:16.158 [info] module=OMG.Watcher.BlockGetter.Core function=validate_download_response/5 ⋅Validating block #522000 5D8B3DA5AC292461... with 3 txs⋅
2019-05-20 14:42:16.319 [error] module=OMG.Watcher.BlockGetter function=handle_cast/2 ⋅Invalid block 522000, because of {:error, {:tx_execution, :utxo_not_found}}⋅

when using 112fb91

@pdobacz (Contributor, Author) commented May 20, 2019

Investigation led to the discovery that this is not an issue with the Watcher or the Child chain implementation per se.

To begin with, the Watcher validly signals the invalid block. The block spends from {521110, 0, 0}, which has been exited and the exit finalized. This is how it happened:

  1. Someone decided to spend {521110, 0, 0} and submitted a transaction. This happened sometime before 2019-05-13 09:39:30.991 UTC, but after the last valid block submission on staging-v0.1, which in turn was around 2019-05-07 15:53:42.278 ⋅Submitted %OMG.API.BlockQueue.Core.BlockSubmission{...., num: 521000}⋅
  2. At that point, the child chain suffered from the authority account being locked (investigated in: Authority account unlocked on geth. Childchain reports locked #662). The transaction ended up in block 522000, which did not make it into Ethereum until it was too late.
  3. Next, the said {521110, 0, 0} was exited successfully, without block 522000 having been seen. The child chain server correctly recognized the exit (2019-05-13 09:39:30.991 [info] module=OMG.API.State.Core function=exit_utxos/2 ⋅Recognized exits [{:utxo_position, 521110, 0, 0}]⋅), but it was too late, as the invalidating block 522000 was already queued up.
  4. After the authority account got unlocked, block 522000 made its way to Ethereum, rendering the chain invalid:
elixir-omg | 2019-05-14 12:07:02.690 [info] module=OMG.Watcher.BlockGetter.Core function=log_downloading_blocks/2 ⋅Child chain seen at block #523000. Downloading blocks [522000]⋅
elixir-omg | 2019-05-14 12:07:03.217 [info] module=OMG.Watcher.BlockGetter.Core function=validate_download_response/5 ⋅Validating block #522000 5D8B3DA5AC292461... with 3 txs⋅
elixir-omg | 2019-05-14 12:07:03.705 [error] module=OMG.Watcher.BlockGetter function=handle_cast/2 ⋅Invalid block 522000, because of {:error, {:tx_execution, :utxo_not_found}}⋅
elixir-omg | 2019-05-14 12:07:04.210 [warn] module=OMG.Watcher.BlockGetter function=handle_info/2 ⋅Chain invalid when trying to download blocks, because of {:error, [%OMG.Watcher.Event.InvalidBlock{blknum: 522000, error_type: :tx_execution, hash: <<93, 139, 61, 165, 172, 41, 36, 97, 126, 210, 95, 49, 75, 109, 111, 70, 147, 80, 27, 98, 134, 139, 12, 22, 67, 48, 62, 16, 50, 3, 67, 112>>, name: :invalid_block}]}, won't try again⋅

@pdobacz pdobacz changed the title Resyncing staging-v0.1 suddenly fails with :tx_execution :account_locked or other reasons for queued blocks to get stuck might cause operator to invalidate the child chain May 20, 2019
@pdobacz (Contributor, Author) commented May 21, 2019

Dropping take-aways from a Slack conversation with @InoMurko @JBunCE:

how do we fix this?

Propositions:

  • if prolonged_outage, do: cleanse_block_queue_&_drop_mempool() (but there's a risk that it backfires via false positives; a rough sketch of the idea follows this list)
  • we need to be alarmed when accounts are locked so that we can immediately unlock them (logging of BlockQueue problems was improved recently, so it should be easier to diagnose now). If we're approaching a managed child chain service, then a simple set of alarms and exposed APIs for administration purposes should be in place, as a sort of back office.
  • we could also put a fuse in the contract:
function submitBlock(bytes32 _blockHash, uint256 _onlyTillEthHeight) public {
    require(block.number < _onlyTillEthHeight);
    // ... the rest of the existing submitBlock logic ...
}
  • a variation of the above is also that we do cleanse_block_queue_&_drop_mempool somewhat manually - when we know things might go bad.
    • Yeah but you would need to get admin access 😁
    • We need a better mechanism. This won’t scale
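
A rough Elixir sketch of the first proposition could look like the following. This is illustration only: the module, function names and the threshold are assumptions, not code that exists in elixir-omg.

defmodule ProlongedOutageGuard do
  # Hypothetical threshold: how many Ethereum blocks without a mined submission
  # we would treat as a prolonged outage (~1 hour at 15 s blocks, assumed figure).
  @prolonged_outage_eth_blocks 240

  def maybe_cleanse(current_eth_height, last_mined_submission_height) do
    if current_eth_height - last_mined_submission_height > @prolonged_outage_eth_blocks do
      cleanse_block_queue()
      drop_mempool()
    end

    :ok
  end

  # Placeholders standing in for the actual queue/mempool cleanup.
  defp cleanse_block_queue, do: :ok
  defp drop_mempool, do: :ok
end

As noted in the first bullet, the hard part is making such a guard robust against false positives.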

But seriously, this is ugly, because whatever state we hold, if it's not out there on ethereum it's getting stale every minute 😞

How old can it get before it breaks the chain?

we have this thing called sla_margin, which is how late a block may be compared to an exit that invalidates a tx within it. This should be the minimum finalization period minus some time, e.g. 7 days - 1 day, but right now it's calculated a bit differently (there's a ticket on the Pivotal tracker to fix that).

So the same rule applies here I guess.

The gist of the rule is that such a block must be published early enough that exit challenges can still go through comfortably before the exits mature.
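
To make the rule concrete, purely as an illustration (the day figures and the 15-second block time below are assumptions, not the actual elixir-omg configuration or the current formula), the margin expressed in Ethereum blocks works out roughly like this:

# Illustration only - the figures are assumed, not the real configuration.
min_exit_period = 7 * 24 * 60 * 60      # minimum finalization period, e.g. 7 days, in seconds
challenge_headroom = 1 * 24 * 60 * 60   # time left for exit challenges to go through, e.g. 1 day
eth_block_time = 15                     # assumed average Ethereum block time, in seconds

sla_margin_eth_blocks = div(min_exit_period - challenge_headroom, eth_block_time)
# => 34_560 Ethereum blocks - the most a queued child chain block may be late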

You know if we run with Parity this problem goes away immediately

Only its manifestation via account_locked does. Any failure of a block to submit can cause this; it's just that account_locked is common for us, at least with geth.

Block submission can fail to go through for many reasons. We fail loudly (and omg_child_chain used to come to a halt on such events) in all the obvious and easy-to-detect cases - like account_locked, not enough ether, or geth connectivity issues.

There are probably cases that aren't loud enough, but they are also much less common, like bad network congestion or the child chain being censored; those are also much harder to detect 😞

@InoMurko (Contributor):

Infura!

@pdobacz (Contributor, Author) commented Jan 17, 2020

Hm, I missed this getting closed, but I think Infura doesn't solve this (see ":account_locked or other reasons for queued blocks to get stuck").

It could actually make it worse, since any Infura-related delay in pushing the submission through can put us in the same bad position.

I'll take the liberty of reopening this, rephrasing the title to demonstrate the issue better, and adding some clarifications to the description.

@pdobacz pdobacz reopened this Jan 17, 2020
@pdobacz pdobacz changed the title :account_locked or other reasons for queued blocks to get stuck might cause operator to invalidate the child chain Queued child chain blocks that got stuck might cause operator to invalidate the child chain Jan 17, 2020
@pdobacz pdobacz removed their assignment Jan 17, 2020
@InoMurko (Contributor):

OK, to clarify: :account_locked is not what could happen on Infura. What could happen is, for example, network saturation preventing us from submitting a block "in time".

The other part, which makes things a bit more complicated, is the introduction of the secure submission layer.

@InoMurko InoMurko self-assigned this Mar 16, 2020
@InoMurko (Contributor):

As a start, I'll increase observability of the queued-up blocks.

@InoMurko (Contributor):

Once we have that, we will be able to observe the queue size by:

{:ok, mined_blknum} = Eth.RootChain.get_mined_child_block()
{:ok, stored_child_top_num} = OMG.DB.get_single_value(:child_top_block_number)
# number of formed child chain blocks still waiting to be mined on Ethereum
queue_size = div(stored_child_top_num - mined_blknum, Application.fetch_env!(:omg_eth, :child_block_interval))

In Datadog, we can plot this and create alarms/monitors that trigger if queue_size starts rising and isn't being drained fast enough.
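
As a minimal sketch of the reporting side (the event name and the use of :telemetry are assumptions about the wiring, not necessarily how the Datadog integration is actually hooked up), the computed queue_size could be emitted as a measurement that a Datadog monitor then alerts on:

# Sketch only: event name is assumed; a telemetry/StatsD reporter would forward it to Datadog.
:telemetry.execute([:omg, :block_queue, :size], %{queue_size: queue_size}, %{})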

@unnawut (Contributor) commented Jul 10, 2020

Closing in favour of #1629, which I'm currently working on.

@unnawut unnawut closed this as completed Jul 10, 2020
@unnawut unnawut added the bug Something isn't working label Aug 26, 2020