Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bgpmon]: Fix exception in bgpmon caused by duplicate bgp neighbor ID #6546

Merged
merged 3 commits into from
Jan 27, 2021

Conversation

bingwang-ms
Copy link
Contributor

@bingwang-ms bingwang-ms commented Jan 25, 2021

Signed-off-by: bingwang bingwang@microsoft.com

- Why I did it
It is possible that BGP neighbors in IPv4 and IPv6 address families share the same name.
For example, if belowing config is applied to creare a bgp monitor:

{
    "BGP_MONITORS": {
        "11.0.0.1": {
            "admin_status": "up",
                "asn": "4200065100",
                "holdtime": "10",
                "keepalive": "3",
                "local_addr": "10.1.0.32",
                "name": "bgp_monitor",
                "nhopself": "0",
                "rrclient": "0"
        }
    }
}

Then a bgp neighbor named 11.0.0.1 will exist in both ipv4 and ipv6 neighbors.

root@str-7260cx3-acs-2:/var/log# show ip bgp sum

IPv4 Unicast Summary:
BGP router identifier 10.1.0.32, local AS number 4200065100 vrf-id 0
BGP table version 45133
RIB entries 12807, using 2458944 bytes of memory
Peers 5, using 109080 KiB of memory
Peer groups 5, using 320 bytes of memory


Neighbhor      V          AS    MsgRcvd    MsgSent    TblVer    InQ    OutQ  Up/Down    State/PfxRcd    NeighborName
-----------  ---  ----------  ---------  ---------  --------  -----  ------  ---------  --------------  --------------
10.0.0.33      4  4200064600       9616      18386         0      0       0  02:33:56   6400            ARISTA01T1
10.0.0.35      4  4200064600       9616      19205         0      0       0  02:33:56   6400            ARISTA02T1
10.0.0.37      4  4200064600       9618      20598         0      0       0  02:33:55   6400            ARISTA03T1
10.0.0.39      4  4200064600       9616      17468         0      0       0  02:33:55   6400            ARISTA04T1
11.0.0.1       4  4200065100          0          0         0      0       0  never      Active          bgp_monitor

Total number of neighbors 5
root@str-7260cx3-acs-2:/var/log# show ipv6 bgp sum

IPv6 Unicast Summary:
BGP router identifier 10.1.0.32, local AS number 4200065100 vrf-id 0
BGP table version 36748
RIB entries 12807, using 2458944 bytes of memory
Peers 5, using 109080 KiB of memory
Peer groups 5, using 320 bytes of memory


Neighbhor      V          AS    MsgRcvd    MsgSent    TblVer    InQ    OutQ  Up/Down    State/PfxRcd    NeighborName
-----------  ---  ----------  ---------  ---------  --------  -----  ------  ---------  --------------  --------------
11.0.0.1       4  4200065100          0          0         0      0       0  never      Active          bgp_monitor
fc00::2a       4  4200064600       9617      21462         0      0       0  02:33:59   6400            ARISTA03T1
fc00::2e       4  4200064600       9618      20863         0      0       0  02:33:59   6400            ARISTA04T1
fc00::22       4  4200064600       9615      19880         0      0       0  02:34:00   6400            ARISTA01T1
fc00::26       4  4200064600       9616      20582         0      0       0  02:34:00   6400            ARISTA02T1

Total number of neighbors 5

However, such case is not handled in bgpmon.py, and an Exception will be raised because the same key is deleted twice.

Jan 25 06:40:56.540613 str-7260cx3-acs-2 INFO bgp#/supervisord: bgpmon Traceback (most recent call last):
Jan 25 06:40:56.540971 str-7260cx3-acs-2 INFO bgp#/supervisord: bgpmon   File "/usr/local/bin/bgpmon", line 8, in <module>
Jan 25 06:40:56.540971 str-7260cx3-acs-2 INFO bgp#/supervisord: bgpmon     sys.exit(main())
Jan 25 06:40:56.540971 str-7260cx3-acs-2 INFO bgp#/supervisord: bgpmon   File "/usr/local/lib/python3.7/dist-packages/bgpmon/bgpmon.py", line 167, in main
Jan 25 06:40:56.540971 str-7260cx3-acs-2 INFO bgp#/supervisord: bgpmon     bgp_state_get.update_neigh_states()
Jan 25 06:40:56.540971 str-7260cx3-acs-2 INFO bgp#/supervisord: bgpmon   File "/usr/local/lib/python3.7/dist-packages/bgpmon/bgpmon.py", line 143, in update_neigh_states
Jan 25 06:40:56.540971 str-7260cx3-acs-2 INFO bgp#/supervisord: bgpmon     del self.peer_state[peer]
Jan 25 06:40:56.540971 str-7260cx3-acs-2 INFO bgp#/supervisord: bgpmon KeyError: '11.0.0.1'

This commit will address the issue.

- How I did it
Use set instead of list to store neighbors' ID to avoid duplicate keys.

Please be noted that this PR only fix exception in bgpmon.py. Similar issue exists in STATE_DB because NEIGH_STATE_TABLE|x.x.x.x is used as keys in STATE_DB. If neighbors have duplicated keys, the value will be overwriten.

- How to verify it
Verified on A7260. No exception is observed after update.

- Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012

- Description for the changelog

No.

- A picture of a cute animal (not mandatory but encouraged)

It is possible that BGP neighbors in IPv4 and IPv6 address families
share the same name (such as bgp monitor). However, such case is not
handled in bgpmon, and an Exception will be raised. This commit will
address the issue.

Signed-off-by: bingwang <bingwang@microsoft.com>
@lguohan lguohan requested a review from gechiang January 25, 2021 08:40
@lguohan
Copy link
Collaborator

lguohan commented Jan 25, 2021

@gechiang , I think bgpmon still missing unit test. please add unit test to protect such behavior.

@bingwang-ms
Copy link
Contributor Author

@lguohan Please help to take a look at another related issue #6547 . Thanks

@gechiang
Copy link
Collaborator

@bingwang-ms Thanks for triaging this issue. I have a very basic question. Why is this "11.0.0.1" Neighbor being reported by FRR as part of the IPV6 unicast neighbor? Isn't this not a valid behavior that needs to be addressed first by FRR? Will this be investigated or raised as an issue with BGP FRR community? But in any case we need to beef up our code to prevent this from causing an issue.

As for the fix I am thinking we should convert the "peer list" to a "peer set". this way the duplicated entry(ies) will only be in the set ONCE. If we do this it will also avoid causing the STATE DB issue since we no longer have duplicated peers. Let's not try to append ip version to the key to solve the initial issue. Let me know if my suggestion is sound?
Thanks again!

Let me know if you agree with my suggested approach...
Thanks!

@bingwang-ms
Copy link
Contributor Author

@bingwang-ms Thanks for triaging this issue. I have a very basic question. Why is this "11.0.0.1" Neighbor being reported by FRR as part of the IPV6 unicast neighbor? Isn't this not a valid behavior that needs to be addressed first by FRR? Will this be investigated or raised as an issue with BGP FRR community? But in any case we need to beef up our code to prevent this from causing an issue.

As for the fix I am thinking we should convert the "peer list" to a "peer set". this way the duplicated entry(ies) will only be in the set ONCE. If we do this it will also avoid causing the STATE DB issue since we no longer have duplicated peers. Let's not try to append ip version to the key to solve the initial issue. Let me know if my suggestion is sound?
Thanks again!

Let me know if you agree with my suggested approach...
Thanks!

Thanks @gechiang . I'm not sure if the behavior is triggered by FRR or our scripts. So I filed a issue here.
For the second question, I think using a set instead of dict is not a good option. We may lose some neighbors if ipv4 and ipv6 neigbors share the same ID. In other words, if we are asure that duplicated keys should never exist in ipv4 and ipv6 neighbors, we don't need this fix.

@abdosi
Copy link
Contributor

abdosi commented Jan 26, 2021

@bingwang-ms Thanks for triaging this issue. I have a very basic question. Why is this "11.0.0.1" Neighbor being reported by FRR as part of the IPV6 unicast neighbor? Isn't this not a valid behavior that needs to be addressed first by FRR? Will this be investigated or raised as an issue with BGP FRR community? But in any case we need to beef up our code to prevent this from causing an issue.

As for the fix I am thinking we should convert the "peer list" to a "peer set". this way the duplicated entry(ies) will only be in the set ONCE. If we do this it will also avoid causing the STATE DB issue since we no longer have duplicated peers. Let's not try to append ip version to the key to solve the initial issue. Let me know if my suggestion is sound?
Thanks again!

Let me know if you agree with my suggested approach...
Thanks!

@gechiang we use v4 neighbor peering to advertise v6 routes.

@gechiang
Copy link
Collaborator

gechiang commented Jan 26, 2021

@abdosi Thanks for responding. So this is the same peer that got reported via IPV4 unicast as well as IPV6 unicast. Therefore one of this should be ignored. This come back to the suggestion on how we should address this duplication.
@bingwang-ms Since it is the same peer if we use the way you intent to solve the exception issue it will still result to having the same peer in the list (one with _ipv4 and one with _ipv6 for the same peer). Which one should we use to update the state DB? The purpose of this STATE DB is to report the state of the peer's state. If we were to change the handling to convert the "peer list" to a "peer set" then this duplication will be resolved as we will only have one single peer in the "peer set" and that is the one we can use for the State DB update operation.
let me know if you also agree?

@bingwang-ms
Copy link
Contributor Author

Thanks @abdosi and @gechiang for clarification. If the same ID in IPv4 neigh and IPv6 neigh refer to the same one, I think it's correct to ignore one. I will update the PR to address the issue. And items in STATE_DB are expecetd. Correct?

Signed-off-by: bingwang <bingwang@microsoft.com>
@bingwang-ms
Copy link
Contributor Author

Updated. Thanks

if len(data) > PIPE_BATCH_MAX_COUNT:
self.flush_pipe(data)
# If anything in the pipeline not yet flushed, flush them now
if len(data) > 0:
self.flush_pipe(data)
# Save the new List
self.peer_l = self.new_peer_l[:]
# Save the new dict
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall the change looks great. Just the comment here needs to be changed from "dict" to "set" to avoid confusion. Nice work overall! Thanks for the fix!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Updated.
So I'll close #6547. I think it's not an issue. Correct?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bingwang-ms Yes. #6547 is not an issue. Your PR fix will not allow duplicated entry in STATE DB to occur.
Thanks!

Signed-off-by: bingwang <bingwang@microsoft.com>
@lguohan lguohan changed the title Fix exception in bgpmon caused by duplicate bgp neighbor ID [bgpmon]: Fix exception in bgpmon caused by duplicate bgp neighbor ID Jan 26, 2021
@bingwang-ms
Copy link
Contributor Author

Retest this, please

@gechiang
Copy link
Collaborator

retest vsimage please

@lguohan
Copy link
Collaborator

lguohan commented Jan 26, 2021

/Azurepipelines run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@daall daall linked an issue Jan 26, 2021 that may be closed by this pull request
@gechiang gechiang merged commit 6fa807d into sonic-net:master Jan 27, 2021
abdosi pushed a commit that referenced this pull request Jan 28, 2021
…#6546)

* Fix exception in bgpmon caused by duplicate keys
It is possible that BGP neighbors in IPv4 and IPv6 address families
share the same name (such as bgp monitor). However, such case is not
handled in bgpmon, and an Exception will be raised. This commit will
address the issue by Using set instead of list to avoid duplicate keys.
lguohan pushed a commit that referenced this pull request Jan 28, 2021
…#6546)

* Fix exception in bgpmon caused by duplicate keys
It is possible that BGP neighbors in IPv4 and IPv6 address families
share the same name (such as bgp monitor). However, such case is not
handled in bgpmon, and an Exception will be raised. This commit will
address the issue by Using set instead of list to avoid duplicate keys.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NEIGH_STATE_TABLE in STATE_DB is possibly overwriten
6 participants