Add reorged docker compose #1579

Merged: 88 commits merged into master, Jul 9, 2020
Conversation

ayrat555 (Contributor) commented Jun 13, 2020

Overview

Adds a step to CircleCI that runs the Cabbage tests against a two-node Ethash geth setup, which can be used to trigger reorgs.

@ayrat555 force-pushed the add-reorged-docker-compose branch 4 times, most recently from 9330f31 to 4e3b5c8 (June 14, 2020 20:53)
@ayrat555 force-pushed the add-reorged-docker-compose branch 4 times, most recently from fd0abb8 to e5a54b8 (June 18, 2020 12:52)
ayrat555 (Contributor, author):

It seems not all tests can survive reorgs.

unnawut (Contributor) left a comment:

VERY COOL!

@@ -62,6 +63,7 @@ defmodule InFlightExitsTests do
@gas_process_exit_price 1_000_000_000

setup do
Reorg.finish_reorg()
# as we're testing IFEs, queue needs to be empty
0 = get_next_exit_from_queue()
Contributor:

I think this is still a problem with Ethash nodes returning revert reasons rather than 0, right?

Contributor (author):

I added a hack for it in 1ef4b25.

ayrat555 (Contributor, author), Jun 25, 2020:

@InoMurko It seems even with this hack, the in-flight tests still fail:

     ** (MatchError) no match of right hand side value: {:error, %{"code" => 3, "data" => "0x08c379a00000000000000000000000000000000000000000000000000000000000000020000000000000000000000000000000000000000000000000000000000000004050696767796261636b20697320706f737369626c65206f6e6c7920696e20746865206669727374207068617365206f6620746865206578697420706572696f64", "message" => "execution reverted: Piggyback is possible only in the first phase of the exit period"}}

I think reorgs leave these tests in an inconsistent state.

ayrat555 (Contributor, author) commented Jul 4, 2020

@InoMurko I finally made the test failures consistent by waiting for the nodes to reconnect to each other by calling net_peerCount (32e2394).
Is this the last step? Is the PR good to merge?
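
(For reference, a minimal sketch of what waiting on net_peerCount could look like, assuming plain JSON-RPC over HTTPoison/Jason; the module name, endpoint URL and retry numbers are illustrative, not the actual 32e2394 change to Itest.Poller.)

# Illustrative only: poll net_peerCount until the node reports at least
# `expected` peers, i.e. the two geth nodes have found each other again.
defmodule PeerPoller do
  @rpc_url "http://localhost:8545"
  @headers [{"Content-Type", "application/json"}]

  def wait_until_peer_count(expected, retries \\ 60)
  def wait_until_peer_count(_expected, 0), do: {:error, :peers_never_connected}

  def wait_until_peer_count(expected, retries) do
    case peer_count() do
      {:ok, count} when count >= expected ->
        :ok

      _ ->
        Process.sleep(1_000)
        wait_until_peer_count(expected, retries - 1)
    end
  end

  defp peer_count do
    body = Jason.encode!(%{jsonrpc: "2.0", method: "net_peerCount", params: [], id: 1})

    with {:ok, %{body: response}} <- HTTPoison.post(@rpc_url, body, @headers),
         {:ok, %{"result" => "0x" <> hex}} <- Jason.decode(response) do
      {:ok, String.to_integer(hex, 16)}
    end
  end
end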

@ayrat555 ayrat555 requested a review from InoMurko July 6, 2020 06:58
Comment on lines 611 to 618
- run:
name: (Perf) Format generated code and check for warnings
command: |
cd priv/perf
# run format ONLY on formatted code so that it cleans up quoted atoms because
# we cannot exclude folders to --warnings-as-errors
mix format apps/*_api/lib/*_api/model/*.ex
make format-code-check-warnings
Contributor:

This too.

Comment on lines 74 to 94
case Jason.decode!(response.body)["data"] do
%{
"result" => "complete",
"transactions" => [
%{
"sign_hash" => sign_hash,
"typed_data" => typed_data,
"txbytes" => txbytes
}
]
} ->
{:ok, [sign_hash, typed_data, txbytes]}

%{"code" => "create:client_error", "messages" => %{"code" => "operation:service_unavailable"}} = result ->
if tries == 0 do
result
else
Process.sleep(1_000)
create_transaction(amount_in_wei, input_address, output_address, currency, tries - 1)
end
end
Contributor:

Perhaps we should break this into functions:

result = Jason.decode!(response.body)["data"]

case process_transaction_result(result) do
  {:ok, [sign_hash, typed_data, txbytes]} ->
    {:ok, [sign_hash, typed_data, txbytes]}

  :unavailable ->
    if tries == 0 do
      result
    else
      Process.sleep(1_000)
      create_transaction(amount_in_wei, input_address, output_address, currency, tries - 1)
    end
end
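
(A hypothetical shape for that helper, just illustrating the split; the name process_transaction_result/1 is not taken from the PR.)

# Hypothetical helper backing the sketch above: it only classifies the
# decoded response; the retry loop stays in create_transaction/5.
defp process_transaction_result(%{
       "result" => "complete",
       "transactions" => [%{"sign_hash" => sign_hash, "typed_data" => typed_data, "txbytes" => txbytes}]
     }),
     do: {:ok, [sign_hash, typed_data, txbytes]}

defp process_transaction_result(%{
       "code" => "create:client_error",
       "messages" => %{"code" => "operation:service_unavailable"}
     }),
     do: :unavailable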

Contributor:

Why does operation:service_unavailable happen?

Contributor (author):

It happens during reorgs, when one node is paused. I think Nginx continues to redirect requests to the paused node.

InoMurko (Contributor), Jul 6, 2020:

Aha, so perhaps an alarm is raised (an Ethereum connection error?):

get-alarms:
	echo "Child Chain alarms" ; \
	curl -s -X GET http://localhost:9656/alarm.get ; \
	echo "\nWatcher alarms" ; \
	curl -s -X GET http://localhost:${WATCHER_PORT}/alarm.get ; \
	echo "\nWatcherInfo alarms" ; \
	curl -s -X GET http://localhost:${WATCHER_INFO_PORT}/alarm.get

Contributor (author):

I can't see any alarms, but in the Nginx logs I see:

nginx           | 2020/07/06 11:36:53 [error] 29#29: *37837 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 172.25.0.105, server: , request: "POST / HTTP/1.1", upstream: "http://172.25.0.103:8545/", host: "172.25.0.104:80"

Contributor:

Alarms would be raised on the Watcher and Child Chain /alarm.get endpoints. Perhaps logged as well.

Contributor (author):

I think the issue is in Nginx: during reorgs, when one node is paused, Nginx still tries to access it.

@@ -295,6 +291,10 @@ defmodule Itest.Poller do
Process.sleep(@sleep_retry_sec)
submit_typed(typed_data_signed, counter - 1)

%{"messages" => %{"code" => "operation:service_unavailable"}} ->
Contributor:

It didn't disappear?

end

defp hash(message), do: ExthCrypto.Hash.hash(message, ExthCrypto.Hash.kec())
def with_retries(func, total_time \\ 510, current_time \\ 0) do
Contributor:

Mixing private and public functions.
Also, what is the reason we need to retry? Is it so that all nodes unlock an account? Is there a better way to do this?

Contributor (author):

I didn't find any other way. It is related to the Nginx timeout: some requests cannot finish in 10 seconds. I set it to 10 because during a reorg Nginx continues to redirect to the paused node (until the first failure?); the paused node accepts a request but does not return any response for N (timeout limit) seconds.
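
(For context, a time-bounded wrapper matching the with_retries/3 signature quoted above might look roughly like this. It is a sketch only; the 1-second interval and the decision to rescue everything are assumptions, not necessarily what the PR does.)

# Sketch: keep calling func until it succeeds or roughly total_time seconds
# elapse, swallowing intermediate failures caused by Nginx timing out on the
# paused node.
def with_retries(func, total_time \\ 510, current_time \\ 0) do
  func.()
rescue
  error ->
    if current_time < total_time do
      Process.sleep(1_000)
      with_retries(func, total_time, current_time + 1)
    else
      reraise(error, __STACKTRACE__)
    end
end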

@ayrat555 force-pushed the add-reorged-docker-compose branch from 9ed6bb0 to 49976e5 (July 6, 2020 12:00)
@ayrat555 force-pushed the add-reorged-docker-compose branch from 49976e5 to 240005e (July 6, 2020 12:33)
@ayrat555 ayrat555 requested a review from InoMurko July 6, 2020 17:03
@@ -96,4 +100,19 @@ defmodule Itest.PlasmaFramework do
Map.put(acc, key, value)
end)
end

def with_retries(func, total_time \\ 120, current_time \\ 0) do
Contributor:

Mixing private and public functions.

amount
|> Currency.to_wei()
|> Client.deposit(alice_account, Itest.PlasmaFramework.vault(Currency.ether()))
end)
Contributor:

After the anonymous function, balance = Client.get_exact_balance(alice_account, expecting_amount) would definitely return the correct and LAST balance, because the reorg is finished at this point, correct?

Contributor (author):

Not really, it takes some time for the child chain to sync the latest data.
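
(In other words the balance read has to wait for the child chain to catch up. A rough sketch of polling around Client.get_exact_balance; the retry bound and sleep are arbitrary illustrations, and get_exact_balance may already poll internally.)

# Illustrative only: keep re-reading the balance until the child chain has
# synced far enough for the expected amount to show up.
defp wait_for_exact_balance(account, expecting_amount, retries \\ 60)
defp wait_for_exact_balance(_account, _expecting_amount, 0), do: {:error, :balance_not_synced}

defp wait_for_exact_balance(account, expecting_amount, retries) do
  case Client.get_exact_balance(account, expecting_amount) do
    %{"amount" => ^expecting_amount} = balance ->
      balance

    _ ->
      Process.sleep(1_000)
      wait_for_exact_balance(account, expecting_amount, retries - 1)
  end
end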

Comment on lines 30 to 56
if Application.get_env(:cabbage, :reorg) do
pause_container!(@node1)
unpause_container!(@node2)

Process.sleep(10_000)

func.()

Process.sleep(10_000)

pause_container!(@node2)
unpause_container!(@node1)

Process.sleep(30_000)

response = func.()

Process.sleep(30_000)

unpause_container!(@node2)
unpause_container!(@node1)

:ok = Poller.wait_until_peer_count(1)

Process.sleep(20_000)

response
Contributor:

I don't like these sleeps, because they're nondeterministic. Is there a way for us to make this less prone to error?

ayrat555 (Contributor, author), Jul 7, 2020:

I can request the block number from the nodes and wait for 5 generated blocks. I'll do this now.
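
(Something along these lines could replace the fixed sleeps, assuming plain JSON-RPC access to the geth nodes as in the earlier sketch; the helper names are illustrative.)

# Illustrative sketch: instead of sleeping, read eth_blockNumber and wait
# until the node has produced `n` new blocks past the starting height.
defp wait_for_blocks(rpc_url, n) do
  wait_until_height(rpc_url, block_number(rpc_url) + n)
end

defp wait_until_height(rpc_url, target) do
  if block_number(rpc_url) >= target do
    :ok
  else
    Process.sleep(1_000)
    wait_until_height(rpc_url, target)
  end
end

defp block_number(rpc_url) do
  body = Jason.encode!(%{jsonrpc: "2.0", method: "eth_blockNumber", params: [], id: 1})
  {:ok, %{body: response}} = HTTPoison.post(rpc_url, body, [{"Content-Type", "application/json"}])
  %{"result" => "0x" <> hex} = Jason.decode!(response)
  String.to_integer(hex, 16)
end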

InoMurko (Contributor) commented Jul 7, 2020

I've noticed another issue:

defthen ~r/^Alice should have "(?<amount>[^"]+)" ETH on the child chain$/,
        %{amount: amount},
        %{alice_account: alice_account} = state do
  geth_block_every = 1

  {:ok, response} =
    WatcherSecurityCriticalAPI.Api.Configuration.configuration_get(WatcherSecurityCriticalAPI.Connection.new())

  watcher_security_critical_config =
    WatcherSecurityCriticalConfiguration.to_struct(Jason.decode!(response.body)["data"])

  finality_margin_blocks = watcher_security_critical_config.deposit_finality_margin
  to_miliseconds = 1000

  finality_margin_blocks
  |> Kernel.*(geth_block_every)
  |> Kernel.*(to_miliseconds)
  |> Kernel.round()
  |> Process.sleep()

  expecting_amount = Currency.to_wei(amount)

  balance = Client.get_exact_balance(alice_account, expecting_amount)
  balance = balance["amount"]

  assert_equal(expecting_amount, balance, "For #{alice_account}")
  {:ok, state}
end

geth_block_every = 1 basically means we assume a block definitely arrives every second (which worked fine with Clique).

Perhaps a better approach, scalable at least, would be to record the current Ethereum height when the step starts and keep comparing the live Ethereum height against that starting height + watcher_security_critical_config.deposit_finality_margin; once the live height is above starting height + deposit_finality_margin, you're allowed to check the balance.

Or an even better approach would be to get the block number at which the deposit transaction was made and compare the current Ethereum height against that deposit height + watcher_security_critical_config.deposit_finality_margin; once the current height is above deposit height + deposit_finality_margin, you're allowed to check the balance.
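
(A hedged sketch of the second suggestion; the names are placeholders, and the Ethereum client call assumes Ethereumex, which may not be what the itests actually use.)

# Sketch: wait until the chain has advanced deposit_finality_margin blocks
# past the block that included the deposit, instead of sleeping a fixed time.
defp wait_for_deposit_finality(deposit_eth_height, deposit_finality_margin) do
  if current_eth_height() >= deposit_eth_height + deposit_finality_margin do
    :ok
  else
    Process.sleep(1_000)
    wait_for_deposit_finality(deposit_eth_height, deposit_finality_margin)
  end
end

defp current_eth_height do
  # assumes Ethereumex; swap in whatever RPC client the itests already use
  {:ok, "0x" <> hex} = Ethereumex.HttpClient.eth_block_number()
  String.to_integer(hex, 16)
end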

@ayrat555 ayrat555 requested a review from InoMurko July 7, 2020 16:19
ayrat555 (Contributor, author) commented Jul 8, 2020

@InoMurko how does it look?

@ayrat555 ayrat555 merged commit 8aebb53 into master Jul 9, 2020
@ayrat555 ayrat555 deleted the add-reorged-docker-compose branch July 9, 2020 07:41
@unnawut unnawut added the chore Technical work that does not affect service behaviour label Aug 28, 2020