Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bgpd: avoid clearing routes for peers that were never established #16271

Merged
merged 1 commit into from
Jun 26, 2024

Conversation

lsang6WIND
Copy link

Under heavy system load with many peers in passive mode and a large number of routes, bgpd can enter an infinite loop. This occurs while processing timeout BGP_OPEN messages, which prevents it from accepting new connections. The following log entries illustrate the issue:

bgpd[6151]: [VX6SM-8YE5W][EC 33554460] 3.3.2.224: nexthop_set failed, resetting connection - intf 0x0
bgpd[6151]: [P790V-THJKS][EC 100663299] bgp_open_receive: bgp_getsockname() failed for peer: 3.3.2.224
bgpd[6151]: [HTQD2-0R1WR][EC 33554451] bgp_process_packet: BGP OPEN receipt failed for peer: 3.3.2.224
... repeating

The issue occurs when bgpd handles a massive number of routes in the RIB while receiving numerous BGP_OPEN packets. If bgpd is overloaded, it fails to process these packets promptly, leading the remote peer to close the connection and resend BGP_OPEN packets.

When bgpd eventually starts processing these timeout BGP_OPEN packets, it finds the TCP connection closed by the remote peer, resulting in "bgp_stop()" being called. For each timeout peer, bgpd must iterate through the routing table, which is time-consuming and causes new incoming BGP_OPEN packets to timeout, perpetuating the infinite loop.

To address this issue, the code is modified to check if the peer has been established at least once before calling "bgp_clear_route_all()". This ensures that routes are only cleared for peers that had a successful session, preventing unnecessary iterations over the routing table for peers that never established a connection.

With this change, BGP_OPEN timeout messages may still occur, but in the worst case, bgpd will stabilize. Before this patch, bgpd could enter a loop where it was unable to accpet any new connections.

Copy link
Member

@ton31337 ton31337 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense.

Copy link
Member

@riw777 riw777 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

@fdumontet6WIND
Copy link
Contributor

improve performance

@ton31337
Copy link
Member

@Mergifyio backport dev/10.1

Copy link

mergify bot commented Jun 25, 2024

backport dev/10.1

✅ Backports have been created

@riw777
Copy link
Member

riw777 commented Jun 26, 2024

lint errors need to be fixed ... still trying to get ci to pass (it's failing in ospf)

Under heavy system load with many peers in passive mode and a large
number of routes, bgpd can enter an infinite loop. This occurs while
processing timeout BGP_OPEN messages, which prevents it from accepting
new connections. The following log entries illustrate the issue:
>bgpd[6151]: [VX6SM-8YE5W][EC 33554460] 3.3.2.224: nexthop_set failed, resetting connection - intf 0x0
>bgpd[6151]: [P790V-THJKS][EC 100663299] bgp_open_receive: bgp_getsockname() failed for peer: 3.3.2.224
>bgpd[6151]: [HTQD2-0R1WR][EC 33554451] bgp_process_packet: BGP OPEN receipt failed for peer: 3.3.2.224
... repeating

The issue occurs when bgpd handles a massive number of routes in the RIB
while receiving numerous BGP_OPEN packets. If bgpd is overloaded, it
fails to process these packets promptly, leading the remote peer to
close the connection and resend BGP_OPEN packets.

When bgpd eventually starts processing these timeout BGP_OPEN packets,
it finds the TCP connection closed by the remote peer, resulting in
"bgp_stop()" being called. For each timeout peer, bgpd must iterate
through the routing table, which is time-consuming and causes new
incoming BGP_OPEN packets to timeout, perpetuating the infinite loop.

To address this issue, the code is modified to check if the peer has
been established at least once before calling "bgp_clear_route_all()".
This ensures that routes are only cleared for peers that had a
successful session, preventing unnecessary iterations over the routing
table for peers that never established a connection.

With this change, BGP_OPEN timeout messages may still occur, but in the
worst case, bgpd will stabilize. Before this patch, bgpd could enter a
loop where it was unable to accpet any new connections.

Signed-off-by: Loïc Sang <loic.sang@6wind.com>
@lsang6WIND
Copy link
Author

Previous checks are all okay except for the linter.

The following topotest failures are not related to this PR:
bgp_gr_functionality_topo1.test_bgp_gr_functionality_topo1-3 test_BGP_GR_TC_31_1_p1
bgp_peer_type_multipath_relax.test_bgp_peer-type_multipath-relax test_bgp_peer_type_multipath_relax_test10

@riw777 riw777 merged commit 40f7926 into FRRouting:master Jun 26, 2024
11 checks passed
donaldsharp added a commit that referenced this pull request Jun 27, 2024
bgpd: avoid clearing routes for peers that were never established (backport #16271)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants