Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rabbit_peer_discovery: Retry RPC calls (backport #12801) #12804

Merged
merged 3 commits into from
Nov 25, 2024

Conversation

mergify[bot]
Copy link

@mergify mergify bot commented Nov 25, 2024

Why

In CI, we observe some timeouts in the Erlang distribution connections between the temporary hidden node and the nodes it queries. This affects peer discovery obviously.

How

We introduce some query retries to reduce the risk of an incomplete query.

While here, we move the sorting of queried nodes from the query_node_props2/3 last clause (executed in the temporary hidden node) to the function setting the temporary hidden node and asking for these queries. This way the debug messages from that sorting are logged by RabbitMQ out of the box.

The branch contains two additional related commits:

  1. Remove the use of group leader proxy. It is useless after we used the standard_io connection option to start the temporary hidden node.
  2. Fix non-tail-recursive query_node_props2/3.
    This is an automatic backport of pull request rabbit_peer_discovery: Retry RPC calls #12801 done by Mergify.

[Why]
This was the first solution put in place to prevent that the temporary
hidden node connects to the node that started it to write any printed
messages. Because of this, the nodes that the temporary hidden node
queried found out about the parent node and they opened an Erlang
distribution connection to it. This polluted the known nodes list.

However later, the temporary hidden node was started with the
`standard_io` connection option. This prevented the temporary hidden
node from knowing about the node that started it, solving the problem in
a cleaner way.

[How]
This commit garbage-collects that piece of code that is now useless. It
makes the query code way simpler to understand.

(cherry picked from commit 62f22a7)
[Why]
This impacts what is reported by the catch because it caught exceptions
emitted by code supposedly called later. An example is the assert
in `query_node_props2/3` last clause.

(cherry picked from commit 4d4985f)
[Why]
In CI, we observe some timeouts in the Erlang distribution connections
between the temporary hidden node and the nodes it queries. This affects
peer discovery obviously.

[How]
We introduce some query retries to reduce the risk of an incomplete
query.

While here, we move the sorting of queried nodes from the
`query_node_props2/3` last clause (executed in the temporary hidden
node) to the function setting the temporary hidden node and asking for
these queries. This way the debug messages from that sorting are logged
by RabbitMQ out of the box.

(cherry picked from commit f6314d0)
@michaelklishin michaelklishin added this to the 4.0.5 milestone Nov 25, 2024
@michaelklishin michaelklishin merged commit 78aebbc into v4.0.x Nov 25, 2024
196 checks passed
@michaelklishin michaelklishin deleted the mergify/bp/v4.0.x/pr-12801 branch November 25, 2024 16:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants