[202305][chassis-packet]: route_check fails on LC due to timeout on frr routes #17403

Closed
anamehra opened this issue Dec 4, 2023 · 8 comments
Assignees
Labels
Triaged this issue has been triaged

Comments

@anamehra
Contributor

anamehra commented Dec 4, 2023

Description

On 202305 chassis-packet, after the FRR route check was introduced in route_check.py (sonic-net/sonic-utilities#2762), route_check.py may take more than 2 minutes to finish. The current timeout is 2 minutes, so the route check fails, which affects the monit output and, in turn, the sonic-mgmt pretest check. Other test cases that rely on monit output may also be affected.

root@sfd-t2-lc0:/home/cisco# time route_check.py
Aborting routeCheck.py upon timeout signal after 120 seconds
[<FrameSummary file /usr/local/bin/route_check.py, line 810 in <module>>, <FrameSummary file /usr/local/bin/route_check.py, line 797 in main>, <FrameSummary file /usr/local/bin/route_check.py, line 745 in check_routes>, <FrameSummary file /usr/local/bin/route_check.py, line 537 in check_frr_pending_routes>, <FrameSummary file /usr/local/bin/route_check.py, line 345 in get_frr_routes>, <FrameSummary file /usr/lib/python3.9/subprocess.py, line 424 in check_output>, <FrameSummary file /usr/lib/python3.9/subprocess.py, line 507 in run>, <FrameSummary file /usr/lib/python3.9/subprocess.py, line 1121 in communicate>, <FrameSummary file /usr/local/bin/route_check.py, line 95 in handler>]
Traceback (most recent call last):                                                                                                                                                                                                                            
  File "/usr/local/bin/route_check.py", line 810, in <module>                                                                                                                                                                                                 
    sys.exit(main()[0])                                                                                                                                                                                                                                       
  File "/usr/local/bin/route_check.py", line 797, in main                                                                                                                                                                                                     
    ret, res= check_routes()                                                                                                                                                                                                                                  
  File "/usr/local/bin/route_check.py", line 745, in check_routes                                                                                                                                                                                             
    rt_frr_miss = check_frr_pending_routes()                                                                                                                                                                                                                  
  File "/usr/local/bin/route_check.py", line 537, in check_frr_pending_routes                                                                                                                                                                                 
    frr_routes = get_frr_routes()                                                                                                                                                                                                                             
  File "/usr/local/bin/route_check.py", line 345, in get_frr_routes                                                                                                                                                                                           
    output = subprocess.check_output('show ipv6 route json', shell=True)                                                                                                                                                                                      
  File "/usr/lib/python3.9/subprocess.py", line 424, in check_output                                                                                                                                                                                          
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,                                                                                                                                                                                          
  File "/usr/lib/python3.9/subprocess.py", line 507, in run                                                                                                                                                                                                   
    stdout, stderr = process.communicate(input, timeout=timeout)                                                                                                                                                                                              
  File "/usr/lib/python3.9/subprocess.py", line 1121, in communicate                                                                                                                                                                                          
    stdout = self.stdout.read()                                                                                                                                                                                                                               
  File "/usr/local/bin/route_check.py", line 96, in handler                                                                                                                                                                                                   
    raise Exception("timeout occurred")                                                                                                                                                                                                                       
Exception: timeout occurred                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                          
real    2m0.714s                                                                                                                                                                                                                                              
user    0m57.700s                                                                                                                                                                                                                                             
sys     0m2.939s 
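
For readers unfamiliar with the mechanism in the traceback above, here is a minimal sketch of how a signal-based timer can abort a blocking subprocess call. It is not the actual route_check.py code: the names `RouteCheckTimeout`, `_handler`, `run_with_alarm` and `demo` are hypothetical, and the slow command is simulated with a sleeping Python child rather than 'show ipv6 route json'. It assumes a Unix host and that the call is made from the main thread.

```python
import signal
import subprocess
import sys


class RouteCheckTimeout(Exception):
    """Hypothetical exception type; the traceback above shows a plain Exception."""


def _handler(signum, frame):
    # Runs in the main thread when the timer fires, interrupting the blocking read.
    raise RouteCheckTimeout("timeout occurred")


def run_with_alarm(cmd, timeout_s):
    """Run cmd and return its stdout, aborting if timeout_s seconds elapse first."""
    old_handler = signal.signal(signal.SIGALRM, _handler)
    signal.setitimer(signal.ITIMER_REAL, timeout_s)
    try:
        return subprocess.check_output(cmd)
    finally:
        # Always cancel the timer and restore the previous handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, old_handler)


def demo():
    # Stand-in for a slow command: a child process that sleeps for 5 seconds.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    try:
        run_with_alarm(slow, 0.2)
    except RouteCheckTimeout:
        return "timed out"
    return "finished"
```

With a timer like this, any single command that outlasts the remaining budget aborts the whole script, which matches the behaviour reported here.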

Steps to reproduce the issue:

Describe the results you received:

Describe the results you expected:

Output of show version:

sha1 from 202305 build:
d814cc41d [eventd]: Disabling eventd tests (#17053) (#17061)

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

@anamehra
Contributor Author

anamehra commented Dec 4, 2023

Hi @rlhui , @abdosi , @gechiang , for your viz. Thanks

@anamehra changed the title from "[chassis-packet]: route_check fails on LC due to timeout on frr routes" to "[202305][chassis-packet]: route_check fails on LC due to timeout on frr routes" Dec 13, 2023
@anamehra
Contributor Author

Hi @abdosi, on a chassis-packet LC it took around 2 minutes just to get the output of the 'show ip route json' and 'show ipv6 route json' commands.

root@sfd-lt2-lc0:/home/cisco# time show ipv6 route json >p                                                                      
                                                                                                                                
real    1m0.847s                                                                                                                
user    0m51.948s                                                                                                               
sys     0m2.226s                                                                                                                
root@sfd-lt2-lc0:/home/cisco# time show ip route json >p                                                                         
                                                                                                                                
real    1m12.786s                                                                                                               
user    0m58.159s                                                                                                               
sys     0m2.784s       

root@sfd-lt2-lc0:/home/cisco# show ip bgp su -d all

IPv4 Unicast Summary:
asic0: BGP router identifier 3.3.3.3, local AS number 65100 vrf-id 0
BGP table version 97803
asic1: BGP router identifier 3.3.3.4, local AS number 65100 vrf-id 0
BGP table version 103326
asic2: BGP router identifier 3.3.3.5, local AS number 65100 vrf-id 0
BGP table version 98074
RIB entries 308085, using 59152320 bytes of memory
Peers 48, using 35614848 KiB of memory
Peer groups 12, using 768 bytes of memory


Neighbhor      V     AS    MsgRcvd    MsgSent    TblVer    InQ    OutQ  Up/Down      State/PfxRcd  NeighborName
-----------  ---  -----  ---------  ---------  --------  -----  ------  ---------  --------------  -----------------
3.3.3.3        4  65100       2420       2555         0      0       0  01:47:07            34062  ASIC0
3.3.3.3        4  65100       2431       2433         0      0       0  01:47:40            34062  ASIC0
3.3.3.4        4  65100       2514       2381         0      0       0  01:45:06            50838  ASIC1
3.3.3.4        4  65100       2556       2423         0      0       0  01:47:07            50838  ASIC1
3.3.3.5        4  65100       2381       2514         0      0       0  01:45:05            34066  ASIC2
3.3.3.5        4  65100       2432       2430         0      0       0  01:47:39            34066  ASIC2
.
.

route_check.py has a 2-minute timer, and the show commands for IPv4 and IPv6 each took more than a minute.
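
The arithmetic can be spelled out with the wall-clock times measured above. This is only a rough budget check: it assumes both show commands run one after the other under the same 120-second timer (the traceback earlier in the thread only shows the IPv6 call), and it ignores the rest of the script's work.

```python
# Wall-clock ("real") times reported above, in seconds.
ipv6_json_s = 60 + 0.847    # show ipv6 route json: 1m0.847s
ipv4_json_s = 60 + 12.786   # show ip route json:   1m12.786s
TIMEOUT_S = 120             # route_check.py timer

total_s = ipv6_json_s + ipv4_json_s
overrun_s = total_s - TIMEOUT_S
print(f"total {total_s:.3f}s, over budget by {overrun_s:.3f}s")
# prints: total 133.633s, over budget by 13.633s
```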

cc: @stepanblyschak

@prabhataravind
Contributor

@deepak-singhal0408, could you please check? Per Arvind, the route scale might be higher on T2 so route_check takes more time to finish.

@prabhataravind added the Triaged label Dec 20, 2023
@abdosi
Contributor

abdosi commented Dec 30, 2023

@stepanblyschak can we add a check so that check_frr_pending_routes() (https://github.com/sonic-net/sonic-utilities/blob/master/scripts/route_check.py#L549) is called only if the FIB suppression feature is enabled, until we have a proper fix for this timeout? On a T2 topology we can have 30K+ routes, and this can trigger the 2-minute timeout.
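
A rough sketch of what such a guard could look like. This is illustrative only and not the code in sonic-utilities: the helper `is_fib_suppression_enabled`, the `device_metadata` dict, and the `suppress-fib-pending` key with value `enabled` are assumptions made for the example, and `check_frr_pending_routes` is passed in as a callable so the snippet stays self-contained.

```python
def is_fib_suppression_enabled(device_metadata):
    """Hypothetical helper: key name and value are assumptions, not confirmed here."""
    return device_metadata.get("suppress-fib-pending") == "enabled"


def check_routes(device_metadata, check_frr_pending_routes):
    """Run the expensive FRR pending-route check only when the feature is on."""
    rt_frr_miss = []
    if is_fib_suppression_enabled(device_metadata):
        # Only pay for the 'show ip/ipv6 route json' calls when they are needed.
        rt_frr_miss = check_frr_pending_routes()
    return rt_frr_miss
```

With the feature disabled, the slow show commands are never invoked, so the 2-minute timer would no longer be consumed by them.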

@abdosi
Contributor

abdosi commented Dec 30, 2023

@prsunny for viz.

@rlhui
Contributor

rlhui commented Jan 10, 2024

Closing, as this issue no longer exists on 202305 now that FIB suppression has been removed from that branch.

@rlhui
Contributor

rlhui commented Jan 10, 2024

The FIB suppression feature is reverted:
#17578
sonic-net/sonic-utilities#3093
sonic-net/sonic-swss#2997

@rlhui rlhui closed this as completed Jan 10, 2024
@stepanblyschak
Collaborator

stepanblyschak commented Jan 18, 2024

@anamehra @rlhui Could you please re-open it for master?

Projects
Status: Done
Development

No branches or pull requests

6 participants