test: improve robustness of connect_nodes() #30118

Merged on May 23, 2024 (1 commit)

Member

@furszy furszy commented May 15, 2024

Decoupled from #27837 because this can help others too. I found it while investigating a CI failure: https://cirrus-ci.com/task/5805115213348864?logs=ci#L3200.

The connect_nodes function in the test framework relies on a stable number of peer
connections to verify that the new connection between the nodes is successfully established.
This approach is fragile, as any of the peers involved in the process can drop, lose, or
create a connection at any step, causing subsequent wait_until checks to stall indefinitely
even when the peers in question were connected successfully.

This commit improves the situation by using the nodes' subversion and the connection
direction (inbound/outbound) to identify the exact peer connection and perform the
checks exclusively on it.
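
A minimal sketch of the identification step described above, written against plain dictionaries rather than a running node. The keys `subver`, `inbound`, and `bytesrecv_per_msg` are fields of the `getpeerinfo` output; the sample peer list and the subversion strings are made up for illustration, and this `find_conn` takes a ready-made peer list, so it is a stand-in for the helper of the same name in the PR rather than its actual code.

```python
def find_conn(peers, peer_subversion, inbound):
    """Return the first entry matching subversion and direction, else None."""
    return next(
        (p for p in peers
         if p["subver"] == peer_subversion and p["inbound"] == inbound),
        None,
    )


# Made-up getpeerinfo() output for node 0; subversion strings are illustrative.
peers_of_node0 = [
    {"id": 0, "subver": "/Satoshi:27.0.0(testnode1)/", "inbound": True},
    {"id": 1, "subver": "", "inbound": True},  # unrelated peer, no version received yet
    {"id": 2, "subver": "/Satoshi:27.0.0(testnode2)/", "inbound": False},
]

# Only the connection with the expected subversion AND direction is selected,
# no matter how many other peers come and go.
outbound_to_node2 = find_conn(peers_of_node0, "/Satoshi:27.0.0(testnode2)/", inbound=False)
inbound_from_node2 = find_conn(peers_of_node0, "/Satoshi:27.0.0(testnode2)/", inbound=True)
print(outbound_to_node2["id"], inbound_from_node2)
```

Because the lookup is keyed on the specific peer, unrelated entries in the list have no effect on the result.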

Contributor

DrahtBot commented May 15, 2024

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type Reviewers
ACK maflcko, AngusP, stratospher, achow101
Stale ACK rkrux

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #27837 (net: introduce block tracker to retry to download blocks after failure by furszy)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@DrahtBot DrahtBot added the Tests label May 15, 2024

@rkrux rkrux left a comment

tACK 59ba873

The build (make) succeeded, and all the functional tests passed.

Overall I agree with this change because it makes the connection verification depend on the connect_nodes() arguments a and b rather than on the state of the whole getpeerinfo() response, which seemed brittle, as mentioned in the PR description.

I can see connect_nodes() being used in numerous functional tests, thereby increasing robustness for all of them!

@furszy Were you able to identify the connections that were dropped in the logs of this CI run? https://cirrus-ci.com/task/5805115213348864?logs=ci#L3200

Member Author

@furszy furszy left a comment

@furszy Were you able to identify the connections that were dropped in the logs of this CI run? https://cirrus-ci.com/task/5805115213348864?logs=ci#L3200

Just one. One of the P2PInterface connections I create on the test side gets disconnected after advancing the node time prior to connecting the test nodes again. Need to expand the complete CI logs to find it.
But the fragility is easy to reproduce. Just launch a thread that disconnects a node before calling connect_nodes().
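
The fragility can be illustrated without threads or a running node. The sketch below compares a simplified count-based predicate with an identity-based one on two hand-written peer snapshots. Both helper functions, the snapshots, and the subversion strings are hypothetical; the count-based check is only a stand-in for "relies on a stable number of peer connections", not the code that was replaced.

```python
def count_based_connected(peers_before, peers_after):
    # Simplified stand-in: expects exactly one more peer than before.
    return len(peers_after) == len(peers_before) + 1


def identity_based_connected(peers_after, target_subver, inbound):
    # Looks only at the one connection we care about.
    return any(
        p["subver"] == target_subver and p["inbound"] == inbound
        for p in peers_after
    )


TARGET = "/Satoshi:27.0.0(testnode2)/"  # illustrative subversion string

# Snapshot taken before connecting: one unrelated test peer is attached.
before = [{"subver": "/unrelated-test-peer/", "inbound": True}]
# Snapshot afterwards: the unrelated peer dropped while the new outbound
# connection came up, so the total number of peers did not change.
after = [{"subver": TARGET, "inbound": False}]

count_result = count_based_connected(before, after)
identity_result = identity_based_connected(after, TARGET, inbound=False)
print(count_result, identity_result)
```

In this scenario the count-based predicate never becomes true even though the target connection exists, which is the kind of stall described in the PR description.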

@furszy furszy force-pushed the 2024_test_improve_connect_nodes branch from 59ba873 to e9cb116 Compare May 18, 2024 14:02
@furszy furszy mentioned this pull request May 18, 2024
18 tasks
Contributor

@AngusP AngusP left a comment

Approach ACK, some nits/questions

Tested, functional tests all pass

@furszy furszy force-pushed the 2024_test_improve_connect_nodes branch from e9cb116 to f4c588c Compare May 18, 2024 22:07
Contributor

@AngusP AngusP left a comment

tACK f4c588c

Tested, functional tests all pass

@DrahtBot DrahtBot requested a review from rkrux May 19, 2024 10:28
Contributor

@stratospher stratospher left a comment

tested ACK f4c588c.

very cool! i like how the logic depends only on the node we're connecting to.
reproduced the intermittent failure using this patch and verified that the new logic fixes it!

    self.nodes[0].add_p2p_connection(P2PInterface(), send_version=False, wait_for_verack=False)
    self.connect_nodes(0, 2)

also didn't observe that much of a time increase (only around 70 s) when running all the functional tests in my local config (without wallets).

Member

maflcko commented May 21, 2024

This approach is fragile, as any of the peers involved in the process can drop, lose, or
create a connection at any step, causing subsequent wait_until checks to stall indefinitely
even when the peers in question were connected successfully.

I'd say the tests should avoid racy code and deterministically execute the test code line-by-line, but enforcing this in connect_nodes is the wrong approach. So Concept ACK.

The 'connect_nodes' function in the test framework relies
on a stable number of peer connections to verify the new
connection between the nodes is successfully established.
This approach is fragile, as any of the peers involved in
the process can drop, lose, or create a connection at any
step, causing subsequent 'wait_until' checks to stall
indefinitely even when the peers in question are connected
successfully.

This commit improves the situation by using the nodes' subversion
and the connection direction (inbound/outbound) to identify the
exact peer connection and perform the checks exclusively on it.
@furszy furszy force-pushed the 2024_test_improve_connect_nodes branch from f4c588c to 6629d1d Compare May 21, 2024 13:59
Member

@maflcko maflcko left a comment

utACK 6629d1d

@@ -106,7 +106,7 @@ def __init__(self, i, datadir_path, *, chain, rpchost, timewait, timeout_factor,
 "-debugexclude=libevent",
 "-debugexclude=leveldb",
 "-debugexclude=rand",
-"-uacomment=testnode%d" % i,
+"-uacomment=testnode%d" % i, # required for subversion uniqueness across peers
Member

Suggested change
-"-uacomment=testnode%d" % i, # required for subversion uniqueness across peers
+f"-uacomment=testnode{i}", # required for subversion uniqueness across peers

style-nit, if you retouch
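
For reference, a small sketch showing that the two spellings in the suggestion produce the same argument string, and that the argument differs per node index. The helper names are hypothetical and exist only for this comparison.

```python
def uacomment_percent(i):
    # Old spelling: printf-style formatting.
    return "-uacomment=testnode%d" % i


def uacomment_fstring(i):
    # Suggested spelling: f-string.
    return f"-uacomment=testnode{i}"


# One argument per node index; each index yields a distinct comment.
args = [uacomment_fstring(i) for i in range(3)]
print(args)
```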

return peer['bytesrecv_per_msg'].pop(msg_type, 0) >= min_bytes_recv

self.wait_until(lambda: check_bytesrecv(find_conn(from_connection, to_connection_subver, inbound=False), 'verack', 21))
self.wait_until(lambda: check_bytesrecv(find_conn(to_connection, from_connection_subver, inbound=True), 'verack', 21))
Member

nit: As you remove the version check, I wonder if this one can be removed as well for the same reason? The final pong test should be necessary and sufficient already.

Member Author

nit: As you remove the version check, I wonder if this one can be removed as well for the same reason? The final pong test should be necessary and sufficient already.

I removed the version check because veracks are direct responses to version messages at the protocol level. I'm not sure we should rely on the pong alone because that might change over time?

Member

I'm not sure we should rely on the pong alone because that might change over time?

If it changes (for example the ping is sent before the verack), the test will already start to fail intermittently and will need to be adjusted anyway.

See the next line comment:

        # The message bytes are counted before processing the message, so make
        # sure it was fully processed by waiting for a ping.
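
A rough sketch of the point made in that code comment: received bytes show up in `bytesrecv_per_msg` before the message is processed, so a verack byte count alone does not prove the handshake has completed, whereas a pong can only follow it. The `check_bytesrecv` body mirrors the lines quoted in this review, except that it uses `.get` instead of `.pop` so the sample data is not mutated. `handshake_processed`, the `PONG_MIN_BYTES` threshold, and the sample byte counts are hypothetical.

```python
def check_bytesrecv(peer, msg_type, min_bytes_recv):
    assert peer is not None, "Error: peer disconnected"
    # The review shows .pop on a fresh getpeerinfo() result; .get is used
    # here so the hand-written sample dictionaries stay unchanged.
    return peer["bytesrecv_per_msg"].get(msg_type, 0) >= min_bytes_recv


PONG_MIN_BYTES = 1  # illustrative threshold only, not taken from the thread


def handshake_processed(peer):
    # Hypothetical helper: treat a received pong as the single readiness signal.
    return check_bytesrecv(peer, "pong", PONG_MIN_BYTES)


# Made-up byte counts for two moments in time on the same connection.
verack_only = {"bytesrecv_per_msg": {"version": 100, "verack": 21}}
verack_and_pong = {"bytesrecv_per_msg": {"version": 100, "verack": 21, "pong": 32}}

print(check_bytesrecv(verack_only, "verack", 21), handshake_processed(verack_only))
print(handshake_processed(verack_and_pong))
```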

Member Author

I'm not sure we should rely on the pong alone because that might change over time?

If it changes (for example the ping is sent before the verack), the test will already start to fail intermittently and will need to be adjusted anyway.

I don't think that's applicable here. We cannot freely change the initial negotiation phase (the version-verack window) without a BIP for the p2p protocol change. At the protocol level, only feature negotiation messages are allowed in this phase. Any other received message will be ignored, and it is a reason for banning the other party.

See the next line comment:

        # The message bytes are counted before processing the message, so make
        # sure it was fully processed by waiting for a ping.

What about including the flag indicating that the connection is ready (fSuccessfullyConnected) in the getpeerinfo response? It seems generally useful and would clean up all this code.

Member

I'm not sure we should rely on the pong alone because that might change over time?

If it changes (for example the ping is sent before the verack), the test will already start to fail intermittently and will need to be adjusted anyway.

I don't think that's applicable here. We cannot freely change the initial negotiation phase (the version-verack window) without a BIP for the p2p protocol change. At the protocol level, only feature negotiation messages are allowed in this phase. Any other received message will be ignored, and it is a reason for banning the other party.

Yes, that is what I am trying to say. The pong alone should be necessary and sufficient.

Member Author

Yes, that is what I am trying to say. The pong alone should be necessary and sufficient.

Ah ok. "goto the first message".. now we are in sync. Nice bikeshedding from my end.

So yeah, we could wait only for the pong or introduce a new field on the getpeerinfo output (fSuccessfullyConnected with another name like "handshake_completed").

Member

I think you'll have to tell maintainers if you want to address this nit, or leave it for a follow-up, otherwise they won't know whether it is fine to merge this from your side or not.

Member Author

I think you'll have to tell maintainers if you want to address this nit, or leave it for a follow-up, otherwise they won't know whether it is fine to merge this from your side or not.

ok, yes. I was thinking of the second approach, but let's move forward. This will let me un-draft #27837. Happy to review any follow-up doing this.
Thanks Marko for the ping and this discussion.

Member

Done in #30252, also mentioning fSuccessfullyConnected. Feel free to NACK/ACK/nit.

@DrahtBot DrahtBot requested review from AngusP and stratospher May 21, 2024 15:21
Contributor

@AngusP AngusP left a comment

re-ACK 6629d1d

Comment on lines +644 to +645
assert peer is not None, "Error: peer disconnected"
return peer['bytesrecv_per_msg'].pop(msg_type, 0) >= min_bytes_recv
Contributor

Style nit: Mixture of ' and " strings here and in a few places, though it was already mixed before -- " seems to be the style in this file

@stratospher
Contributor

reACK 6629d1d.

@achow101
Member

ACK 6629d1d

@achow101 achow101 merged commit e163d86 into bitcoin:master May 23, 2024
16 checks passed
@furszy furszy deleted the 2024_test_improve_connect_nodes branch May 23, 2024 14:10
UdjinM6 pushed a commit to UdjinM6/dash that referenced this pull request Oct 3, 2024
6629d1d test: improve robustness of connect_nodes() (furszy)

Pull request description:

  Decoupled from bitcoin#27837 because this can help others too. I found it while investigating a CI failure: https://cirrus-ci.com/task/5805115213348864?logs=ci#L3200.

  The `connect_nodes` function in the test framework relies on a stable number of peer
  connections to verify that the new connection between the nodes is successfully established.
  This approach is fragile, as any of the peers involved in the process can drop, lose, or
  create a connection at any step, causing subsequent `wait_until` checks to stall indefinitely
  even when the peers in question were connected successfully.

  This commit improves the situation by using the nodes' subversion and the connection
  direction (inbound/outbound) to identify the exact peer connection and perform the
  checks exclusively on it.

ACKs for top commit:
  stratospher:
    reACK 6629d1d.
  achow101:
    ACK 6629d1d
  maflcko:
    utACK 6629d1d
  AngusP:
    re-ACK 6629d1d

Tree-SHA512: 5f345c0ce49ea81b643e97c5cffd133e182838752c27592fcdeac14ad10919fb4b7ff38e289e42a7c3c638a170bd0d0b7a9cd493898997a2082a7b7ceef4aeeb
kwvg pushed a commit to kwvg/dash that referenced this pull request Oct 3, 2024
PastaPastaPasta added a commit to dashpay/dash that referenced this pull request Oct 4, 2024
, bitcoin#23774, bitcoin#25443, bitcoin#26138, bitcoin#26854, bitcoin#27128, bitcoin#27761, bitcoin#27863, bitcoin#28287, bitcoin#30118, partial bitcoin#22778 (auxiliary backports: part 16)

e458adb merge bitcoin#30118: improve robustness of connect_nodes() (UdjinM6)
ac94de2 merge bitcoin#28287: add `sendmsgtopeer` rpc and a test for net-level deadlock situation (Kittywhiskers Van Gogh)
d1fce0b fix: ensure that deadlocks are actually resolved (Kittywhiskers Van Gogh)
19e7bf6 merge bitcoin#27863: do not break when addr is not from a distinct network group (Kittywhiskers Van Gogh)
1adb9a2 merge bitcoin#27761: Log addresses of stalling peers (Kittywhiskers Van Gogh)
2854a6a merge bitcoin#27128: fix intermittent issue in `p2p_disconnect_ban` (Kittywhiskers Van Gogh)
d4b0fae merge bitcoin#26854: Fix intermittent timeout in p2p_permissions.py (Kittywhiskers Van Gogh)
892e329 merge bitcoin#26138: Avoid race in disconnect_nodes helper (Kittywhiskers Van Gogh)
d6ce037 merge bitcoin#25443: Fail if connect_nodes fails (Kittywhiskers Van Gogh)
60b5392 partial bitcoin#22778: Reduce resource usage for inbound block-relay-only connections (Kittywhiskers Van Gogh)
85c4aef merge bitcoin#23774: Add missing assert_equal import to p2p_add_connections.py (Kittywhiskers Van Gogh)
0354417 merge bitcoin#22777: don't request tx relay on feeler connections (Kittywhiskers Van Gogh)
7229eb0 merge bitcoin#23042: Avoid logging AlreadyHaveTx when disconnecting misbehaving peer (Kittywhiskers Van Gogh)
05395ff merge bitcoin#22817: Avoid race after connect_nodes (Kittywhiskers Van Gogh)

Pull request description:

  ## Additional Information

  * Depends on #6286

  * Depends on #6287

  * Depends on #6289

  * When backporting [bitcoin#28287](bitcoin#28287), `p2p_net_deadlock.py` relies on the function, `random_bytes()`, that is introduced in [bitcoin#25625](bitcoin#25625). Backporting [bitcoin#25625](bitcoin#25625) would attract changes outside the scope of this PR.

    In the interest of brevity, the changes that introduce `random_bytes()` have been included in [bitcoin#28287](bitcoin#28287) instead.

  ## Breaking Changes

  None expected.

  ## Checklist:

  - [x] I have performed a self-review of my own code
  - [x] I have commented my code, particularly in hard-to-understand areas **(note: N/A)**
  - [x] I have added or updated relevant unit/integration/functional/e2e tests
  - [x] I have made corresponding changes to the documentation **(note: N/A)**
  - [x] I have assigned this pull request to a milestone _(for repository code-owners and collaborators only)_

ACKs for top commit:
  UdjinM6:
    utACK e458adb
  PastaPastaPasta:
    utACK e458adb

Tree-SHA512: 48494004dddecb31c53f5e19ab0114b92ed7b4381c7977800fd49b7403222badbfdcfe46241e854f5b086c6f54a35f6483f91c6f047b7ac9b1e88e35bb32ad02
panleone pushed a commit to panleone/PIVX that referenced this pull request Nov 11, 2024
Fuzzbawls added a commit to PIVX-Project/PIVX that referenced this pull request Nov 15, 2024
5f8c06e test: Add possibility to skip awaiting for the connection (Alessandro Rezzi)
5374229 Merge bitcoin#30252: test: Remove redundant verack check (merge-script)
e7c21c7 Merge bitcoin#30118: test: improve robustness of connect_nodes() (Ava Chow)
9beef8d Merge bitcoin#26854: test: Fix intermittent timeout in p2p_permissions.py (MarcoFalke)
b3a3d78 Merge bitcoin#25443: test: Fail if connect_nodes fails (laanwj)
72709b9 test: Avoid connecting a peer to himself (Alessandro Rezzi)
77697b8 test: Do not connect the nodes in parallel (Alessandro Rezzi)
17c065d test: Avoid race after connect_nodes (MarcoFalke)
d73ac82 test: refactor connect_nodes and disconnect_nodes to take two indices instead of index + node (Alessandro Rezzi)
e19d72f Merge bitcoin#18866: test: Fix verack race to avoid intermittent test failures (MarcoFalke)

Pull request description:

  This PR solves the failures of `wallet_listtransactions.py`, such as the one at https://github.com/PIVX-Project/PIVX/actions/runs/11705208154/job/32601252220?pr=2947.

  They happen due to mempool sync timeout:
  ```
  2024-11-06T14:58:03.8489793Z AssertionError: Mempool sync timed out after 60s:
  2024-11-06T14:58:03.8490785Z   {'7364836e7eae24a75378b920373e303b99b4ff18db758defc4c057d784a43905'}
  ```

  The issue is that the connection between nodes is established after the transaction is sent, as we can see from the logs:
  ```
  2024-11-06T14:58:03.9085473Z 2024-11-06T14:57:02Z (mocktime: 2019-10-31T18:21:20Z) Relaying wtx 7364836e7eae24a75378b920373e303b99b4ff18db758defc4c057d784a43905
  ...
  2024-11-06T14:58:03.9103287Z 2024-11-06T14:57:02Z (mocktime: 2019-10-31T18:21:20Z) New outbound peer connected: version: 70927, blocks=200, peer=0
  ```

  Hence the newly connected node will never receive the transaction and the mempool will never be synced.

  This bug is fixed by ensuring that `connect_nodes` actually waits for the connection to be established. As a consequence of those checks, we can no longer connect nodes in parallel in `connect_nodes_clique` (which makes tests run slightly slower).
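
  A hedged sketch of what connecting a clique one pair at a time could look like once each connection call blocks until the link is verified. `clique_pairs`, `connect_clique_sequentially`, and the `connect_and_wait` callback are hypothetical names for illustration; this is not the actual `connect_nodes_clique` implementation.

```python
from itertools import combinations


def clique_pairs(num_nodes):
    """All unordered node index pairs (a, b) with a < b."""
    return list(combinations(range(num_nodes), 2))


def connect_clique_sequentially(num_nodes, connect_and_wait):
    # Each call is expected to block until the a<->b connection is verified,
    # so the next pair is only started after the previous one is established.
    for a, b in clique_pairs(num_nodes):
        connect_and_wait(a, b)


# Record the order of connection attempts with a stand-in callback.
calls = []
connect_clique_sequentially(3, lambda a, b: calls.append((a, b)))
print(calls)
```

  The number of blocking waits grows with the number of pairs, which is consistent with the note above that tests run slightly slower.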

ACKs for top commit: 5f8c06e
  Fuzzbawls:
    utACK 5f8c06e
  Duddino:
    utACK 5f8c06e
  Liquid369:
    tACK 5f8c06e

Tree-SHA512: 88007d7302f3b7c3c5b9d446e7d8acc959cd03b0bb27409b1633cb86c57905a4814be148e2e5f6ccaa1da17eccdd44f68f81d00299696d1f47f52f9b12b32ec7