Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rabbit_peer_discovery: Retry RPC calls #12801

Merged
merged 3 commits into from
Nov 25, 2024

Conversation

dumbbell
Copy link
Member

@dumbbell dumbbell commented Nov 25, 2024

Why

In CI, we observe some timeouts in the Erlang distribution connections between the temporary hidden node and the nodes it queries. This affects peer discovery obviously.

How

We introduce some query retries to reduce the risk of an incomplete query.

While here, we move the sorting of queried nodes from the query_node_props2/3 last clause (executed in the temporary hidden node) to the function setting the temporary hidden node and asking for these queries. This way the debug messages from that sorting are logged by RabbitMQ out of the box.

The branch contains two additional related commits:

  1. Remove the use of group leader proxy. It is useless after we used the standard_io connection option to start the temporary hidden node.
  2. Fix non-tail-recursive query_node_props2/3.

@dumbbell dumbbell self-assigned this Nov 25, 2024
[Why]
This was the first solution put in place to prevent that the temporary
hidden node connects to the node that started it to write any printed
messages. Because of this, the nodes that the temporary hidden node
queried found out about the parent node and they opened an Erlang
distribution connection to it. This polluted the known nodes list.

However later, the temporary hidden node was started with the
`standard_io` connection option. This prevented the temporary hidden
node from knowing about the node that started it, solving the problem in
a cleaner way.

[How]
This commit garbage-collects that piece of code that is now useless. It
makes the query code way simpler to understand.
[Why]
This impacts what is reported by the catch because it caught exceptions
emitted by code supposedly called later. An example is the assert
in `query_node_props2/3` last clause.
[Why]
In CI, we observe some timeouts in the Erlang distribution connections
between the temporary hidden node and the nodes it queries. This affects
peer discovery obviously.

[How]
We introduce some query retries to reduce the risk of an incomplete
query.

While here, we move the sorting of queried nodes from the
`query_node_props2/3` last clause (executed in the temporary hidden
node) to the function setting the temporary hidden node and asking for
these queries. This way the debug messages from that sorting are logged
by RabbitMQ out of the box.
@dumbbell dumbbell force-pushed the make-peer-disc-nodes-querying-more-resiliant branch from 8a4a684 to f6314d0 Compare November 25, 2024 15:16
@dumbbell dumbbell marked this pull request as ready for review November 25, 2024 15:38
@dumbbell dumbbell merged commit 60b074a into main Nov 25, 2024
274 checks passed
@dumbbell dumbbell deleted the make-peer-disc-nodes-querying-more-resiliant branch November 25, 2024 15:38
@michaelklishin michaelklishin added this to the 4.1.0 milestone Nov 25, 2024
michaelklishin added a commit that referenced this pull request Nov 25, 2024
rabbit_peer_discovery: Retry RPC calls (backport #12801)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants