[GCU] Ethernet lane update fails, hung on patch sorting step #2263

Open
isabelmsft opened this issue Jul 9, 2022 · 5 comments · May be fixed by #2273

@isabelmsft
Contributor

Description

GCU patch application to replace or update the lanes of an Ethernet interface never terminates; it hangs at the "Sorting patch updates" step.

Steps to reproduce the issue

  1. Apply patch to replace lanes for Ethernet interface

Describe the results you received

Patch application never terminates

Describe the results you expected

Successful patch application; the Ethernet interface lane updates go through.

Additional information you deem important (e.g. issue happens only occasionally)

Output of show version

SONiC Software Version: SONiC.internal-202205.57377412-84a9a7f11b
Distribution: Debian 11.3
Kernel: 5.10.0-12-2-amd64
Build commit: 84a9a7f11b
Build date: Wed Jul  6 12:59:01 UTC 2022
Built by: cloudtest@d7f67934c00000B



Platform: x86_64-mlnx_msn2700-r0
HwSKU: Mellanox-SN2700
ASIC: mellanox
ASIC Count: 1
Serial Number: MT2028X35367
Model Number: MSN2700-CS2FO
Hardware Revision: A2
Uptime: 00:27:50 up 2 days,  3:27,  3 users,  load average: 1.85, 1.39, 1.21
Date: Sat 09 Jul 2022 00:27:50

See compressed folder for sample patches and DUT config that result in this error
test_files.zip

@ghooo
Contributor

ghooo commented Jul 19, 2022

Investigated the issue and made a few observations:

  1. GCU has no timeout. There should be one, so that operations cannot run forever.
  2. The backlink of /PORT/Ethernet124 in /PORTCHANNEL/PortChannel104/members/0 failed to be retrieved because of the import sonic-portchannel in /usr/local/yang-models/sonic-vlan-sub-interface.yang ... This causes GCU to go in circles, trying to remove all referrers but failing.
  3. lanes is a create-only field, and its parent Ethernet124 has many references in the config (around 18). The code tries to delete the refs, then add them back, then delete them again, and so on, entering a lot of unnecessary code paths.
  4. There is a bug in JsonPointerFilter, specifically:
            suffix = token[1:] # the suffix will be `|...`
            prefix = token[:-1] # the prefix will be `...|`
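
Observation 1 above asks for a timeout on the sorting step. The following is a minimal sketch of one way to bound a search with a deadline. It is not GCU code: `bounded_search`, `SortTimeoutError`, and the `neighbors`/`is_goal` callbacks are hypothetical names introduced only for illustration.

```python
import time


class SortTimeoutError(Exception):
    """Raised when the (hypothetical) search exceeds its time budget."""


def bounded_search(start, neighbors, is_goal, timeout_s):
    """Depth-first search that aborts once timeout_s seconds have elapsed.

    neighbors(state) returns the states reachable in one move;
    is_goal(state) returns True when the search can stop.
    Returns the path to the goal, or None if the space is exhausted.
    """
    deadline = time.monotonic() + timeout_s
    stack = [(start, [start])]
    while stack:
        # Check the deadline on every iteration so that a cyclic search
        # space cannot keep the loop alive indefinitely.
        if time.monotonic() > deadline:
            raise SortTimeoutError(f"no result within {timeout_s}s")
        state, path = stack.pop()
        if is_goal(state):
            return path
        for nxt in neighbors(state):
            stack.append((nxt, path + [nxt]))
    return None
```

With a search space that loops forever (for example `neighbors=lambda s: [1 - s]` and a goal that is never reached), this raises `SortTimeoutError` instead of hanging. An iterative loop is used here rather than recursion so that a deep search does not hit Python's recursion limit.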
    

@ghooo
Contributor

ghooo commented Jul 27, 2022

Found another issue in BUFFER_PG: when deleting /BUFFER_PG/Ethernet124|0, /BUFFER_PG/Ethernet124|3-4 also gets deleted.

Assume the same full config as in the zip file attached to the issue description.

admin@str-msn2700-04:~$ sudo config apply-patch -i /FEATURE -i /AUTO_TECHSUPPORT -i /AUTO_TECHSUPPORT_FEATURE delete-buffer-pg.json 
Patch Applier: Patch application starting.
Patch Applier: Patch: [{"op": "remove", "path": "/BUFFER_PG/Ethernet120|0"}, {"op": "remove", "path": "/BUFFER_PG/Ethernet120|3-4"}]
Patch Applier: Getting current config db.
Patch Applier: Simulating the target full config after applying the patch.
Patch Applier: Validating target config does not have empty tables, since they do not show up in ConfigDb.
Patch Applier: Sorting patch updates.
[{"op": "remove", "path": "/BUFFER_PG/Ethernet120|0"}]
<class 'generic_config_updater.patch_sorter.RequiredValueMoveValidator'>
[{"op": "remove", "path": "/BUFFER_PG/Ethernet120|3-4"}]
<class 'generic_config_updater.patch_sorter.RequiredValueMoveValidator'>
[{"op": "remove", "path": "/BUFFER_PG/Ethernet120|0/profile"}]
<class 'generic_config_updater.patch_sorter.FullConfigMoveValidator'>
[{"op": "remove", "path": "/BUFFER_PG/Ethernet120|3-4/profile"}]
<class 'generic_config_updater.patch_sorter.FullConfigMoveValidator'>
[{"op": "replace", "path": "/PORT/Ethernet120/admin_status", "value": "down"}]
[{"op": "replace", "path": "/PORT/Ethernet120/admin_status", "value": "down"}]
  [{"op": "remove", "path": "/BUFFER_PG/Ethernet120|0"}]
  [{"op": "remove", "path": "/BUFFER_PG/Ethernet120|0"}]
    [{"op": "remove", "path": "/BUFFER_PG/Ethernet120|3-4"}]
    [{"op": "remove", "path": "/BUFFER_PG/Ethernet120|3-4"}]
      [{"op": "replace", "path": "/PORT/Ethernet120/admin_status", "value": "up"}]
      [{"op": "replace", "path": "/PORT/Ethernet120/admin_status", "value": "up"}]
Patch Applier: The patch was sorted into 4 changes:
Patch Applier:   * [{"op": "replace", "path": "/PORT/Ethernet120/admin_status", "value": "down"}]
Patch Applier:   * [{"op": "remove", "path": "/BUFFER_PG/Ethernet120|0"}]
Patch Applier:   * [{"op": "remove", "path": "/BUFFER_PG/Ethernet120|3-4"}]
Patch Applier:   * [{"op": "replace", "path": "/PORT/Ethernet120/admin_status", "value": "up"}]
Patch Applier: Applying 4 changes in order:
Patch Applier:   * [{"op": "replace", "path": "/PORT/Ethernet120/admin_status", "value": "down"}]
Patch Applier:   * [{"op": "remove", "path": "/BUFFER_PG/Ethernet120|0"}]
Patch Applier:   * [{"op": "remove", "path": "/BUFFER_PG/Ethernet120|3-4"}]
Failed to apply patch
Usage: config apply-patch [OPTIONS] PATCH_FILE_PATH
Try "config apply-patch -h" for help.

Error: can't remove a non-existent object 'Ethernet120|3-4'
admin@str-msn2700-04:~$

@ghooo
Contributor

ghooo commented Jul 27, 2022

Actually, removing just /BUFFER_PG/Ethernet120|3-4 is enough to cause the problem.

admin@str-msn2700-04:~$ sudo config apply-patch -i /FEATURE -i /AUTO_TECHSUPPORT -i /AUTO_TECHSUPPORT_FEATURE delete-buffer-pg-just-34.json 
Patch Applier: Patch application starting.
Patch Applier: Patch: [{"op": "remove", "path": "/BUFFER_PG/Ethernet120|3-4"}]
Patch Applier: Getting current config db.
Patch Applier: Simulating the target full config after applying the patch.
Patch Applier: Validating target config does not have empty tables, since they do not show up in ConfigDb.
Patch Applier: Sorting patch updates.
[{"op": "remove", "path": "/BUFFER_PG/Ethernet120|3-4"}]
<class 'generic_config_updater.patch_sorter.RequiredValueMoveValidator'>
[{"op": "remove", "path": "/BUFFER_PG/Ethernet120|3-4/profile"}]
<class 'generic_config_updater.patch_sorter.FullConfigMoveValidator'>
[{"op": "replace", "path": "/PORT/Ethernet120/admin_status", "value": "down"}]
[{"op": "replace", "path": "/PORT/Ethernet120/admin_status", "value": "down"}]
  [{"op": "remove", "path": "/BUFFER_PG/Ethernet120|3-4"}]
  [{"op": "remove", "path": "/BUFFER_PG/Ethernet120|3-4"}]
    [{"op": "replace", "path": "/PORT/Ethernet120/admin_status", "value": "up"}]
    [{"op": "replace", "path": "/PORT/Ethernet120/admin_status", "value": "up"}]
Patch Applier: The patch was sorted into 3 changes:
Patch Applier:   * [{"op": "replace", "path": "/PORT/Ethernet120/admin_status", "value": "down"}]
Patch Applier:   * [{"op": "remove", "path": "/BUFFER_PG/Ethernet120|3-4"}]
Patch Applier:   * [{"op": "replace", "path": "/PORT/Ethernet120/admin_status", "value": "up"}]
Patch Applier: Applying 3 changes in order:
Patch Applier:   * [{"op": "replace", "path": "/PORT/Ethernet120/admin_status", "value": "down"}]
Patch Applier:   * [{"op": "remove", "path": "/BUFFER_PG/Ethernet120|3-4"}]
Failed to apply patch
Usage: config apply-patch [OPTIONS] PATCH_FILE_PATH
Try "config apply-patch -h" for help.

Error: can't remove a non-existent object 'Ethernet120|3-4'
admin@str-msn2700-04:~$

@ghooo
Contributor

ghooo commented Jul 27, 2022

Actually, shutting down port Ethernet120 is what causes /BUFFER_PG/Ethernet120|3-4 to be deleted.

admin@str-msn2700-04:~$ sudo config apply-patch -i /FEATURE -i /AUTO_TECHSUPPORT -i /AUTO_TECHSUPPORT_FEATURE port-shutdown.json 
Patch Applier: Patch application starting.
Patch Applier: Patch: [{"op": "replace", "path": "/PORT/Ethernet120/admin_status", "value": "down"}]
Patch Applier: Getting current config db.
Patch Applier: Simulating the target full config after applying the patch.
Patch Applier: Validating target config does not have empty tables, since they do not show up in ConfigDb.
Patch Applier: Sorting patch updates.
[{"op": "replace", "path": "/PORT/Ethernet120/admin_status", "value": "down"}]
[{"op": "replace", "path": "/PORT/Ethernet120/admin_status", "value": "down"}]
Patch Applier: The patch was sorted into 1 change:
Patch Applier:   * [{"op": "replace", "path": "/PORT/Ethernet120/admin_status", "value": "down"}]
Patch Applier: Applying 1 change in order:
Patch Applier:   * [{"op": "replace", "path": "/PORT/Ethernet120/admin_status", "value": "down"}]
Patch Applier: Verifying patch updates are reflected on ConfigDB.
Failed to apply patch
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/config/main.py", line 1252, in apply_patch
    GenericUpdater().apply_patch(patch, config_format, verbose, dry_run, ignore_non_yang_tables, ignore_path)
  File "/usr/local/lib/python3.9/dist-packages/generic_config_updater/generic_updater.py", line 425, in apply_patch
    patch_applier.apply(patch)
  File "/usr/local/lib/python3.9/dist-packages/generic_config_updater/generic_updater.py", line 282, in apply
    self.execute_write_action(Decorator.apply, self, patch)
  File "/usr/local/lib/python3.9/dist-packages/generic_config_updater/generic_updater.py", line 295, in execute_write_action
    action(*args)
  File "/usr/local/lib/python3.9/dist-packages/generic_config_updater/generic_updater.py", line 239, in apply
    self.decorated_patch_applier.apply(patch)
  File "/usr/local/lib/python3.9/dist-packages/generic_config_updater/generic_updater.py", line 81, in apply
    raise GenericConfigUpdaterError(f"After applying patch to config, there are still some parts not updated")
generic_config_updater.gu_common.GenericConfigUpdaterError: After applying patch to config, there are still some parts not updated
Usage: config apply-patch [OPTIONS] PATCH_FILE_PATH
Try "config apply-patch -h" for help.

Error: After applying patch to config, there are still some parts not updated
admin@str-msn2700-04:~$ show runningconfiguration all > after4.json
admin@str-msn2700-04:~$ diff after4.json running-config-backup.json 
329a330,332
>         "Ethernet120|3-4": {
>             "profile": "pg_lossless_40000_300m_profile"
>         },
1731c1734
<             "admin_status": "down",
---
>             "admin_status": "up",
admin@str-msn2700-04:~$ 
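
The three logs above point to a side effect: bringing the port down removes the lossless BUFFER_PG entry before the sorted "remove" step gets to it. The sketch below is a toy model of that sequence, not GCU code. All function names are hypothetical, and the side effect is hard-coded as an assumption, since the thread does not say which component performs the deletion.

```python
class PatchError(Exception):
    """Raised when a change cannot be applied to the config."""


def set_admin_status(config, port, status):
    """Model an admin_status change with an ASSUMED side effect.

    Assumption for illustration only: bringing a port down also drops
    its "<port>|3-4" BUFFER_PG entry, as observed in the thread.
    """
    config["PORT"][port]["admin_status"] = status
    if status == "down":
        config["BUFFER_PG"].pop(f"{port}|3-4", None)


def remove_buffer_pg(config, key):
    """Remove a BUFFER_PG entry, failing if it is already gone."""
    if key not in config["BUFFER_PG"]:
        raise PatchError(f"can't remove a non-existent object '{key}'")
    del config["BUFFER_PG"][key]


def apply_sorted_changes(config):
    """Apply the three sorted changes from the log; return the error text, if any."""
    try:
        set_admin_status(config, "Ethernet120", "down")
        remove_buffer_pg(config, "Ethernet120|3-4")
        set_admin_status(config, "Ethernet120", "up")
    except PatchError as err:
        return str(err)
    return None


config = {
    "PORT": {"Ethernet120": {"admin_status": "up"}},
    "BUFFER_PG": {"Ethernet120|3-4": {"profile": "pg_lossless_40000_300m_profile"}},
}
print(apply_sorted_changes(config))
```

In this model the second step fails because the first step has already removed its target, and the port is left down because the final "up" step is never reached. That is consistent with the logs, where the run stops after the remove step.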

@isabelmsft
Contributor Author

isabelmsft commented Sep 29, 2022

The replace_lanes issue persists: sorting hangs and never finishes, resulting in Error: maximum recursion depth exceeded while calling a Python object
config_db.zip

To reproduce, apply the sample patch below to the attached config_db:

[
    {
        "op": "replace",
        "path": "/PORT/Ethernet0/lanes",
        "value": "0,1,2,103"
    },
    {
        "op": "replace",
        "path": "/PORT/Ethernet100/lanes",
        "value": "100,101,102,3"
    }
]
