container_checker on supervisor should check containers based on asic presence #11442

Merged
merged 5 commits into from
Aug 22, 2022

Conversation

anamehra
Contributor

Signed-off-by: anamehra <anamehra@cisco.com>

On Supervisor/RP card, some application containers may not run if
the asic is not present due to a missing Fabric card. The container checker
should skip those container instances.
Container instances which run only if asic present: swss, syncd, lldp,
teamd
Exception: All instances of database and bgp containers run irrespective
of asic presence.
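
The rule described above can be sketched as follows. This is only an illustration, not the code from this PR: the function name `expected_instances` and its parameters are hypothetical, and the instance naming (`swss0`, `swss1`, ...) is assumed to be the container name followed by the asic index.

```python
def expected_instances(containers, num_asics, present_asics,
                       always_running=frozenset({"database", "bgp"})):
    """Return the per-asic container instance names expected to be up.

    containers     -- names of containers that run once per asic
    num_asics      -- total asic slots (NUM_ASIC from asic.conf)
    present_asics  -- indices of asics whose fabric card is present
    always_running -- containers whose instances run regardless of presence
                      (taken from the description above; hypothetical default)
    """
    present = set(present_asics)
    expected = set()
    for name in containers:
        for asic_id in range(num_asics):
            # Skip instances tied to an absent asic unless the container
            # is in the exception list.
            if name in always_running or asic_id in present:
                expected.add(f"{name}{asic_id}")
    return expected


if __name__ == "__main__":
    # Supervisor with 4 asic slots; only asic 0 and asic 2 are present.
    for instance in sorted(expected_instances(["database", "swss", "syncd"], 4, [0, 2])):
        print(instance)
```

Passing the exception list as a parameter keeps the sketch usable if the set of always-running containers changes.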

Why I did it

On a supervisor card in a chassis, dockers such as syncd, teamd, swss, and lldp are created for each Switch Fabric card. However, not every chassis has all of its switch fabric cards present. In that case, dockers are created only for the Switch Fabric cards that are present.

The monit 'container_checker' fails in this scenario because it expects dockers for all Switch Fabrics (based on NUM_ASIC defined in the asic.conf file).

#8520 (comment)

How I did it

Get the asic presence list from CHASSIS_ASIC_TABLE in CHASSIS_STATE_DB. Use this list to check the state of the dockers expected to be up.
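
A minimal sketch of these two steps, under stated assumptions: the table keys are assumed to look like `asic0`, `asic1`, ..., and the database read itself is left out so the example stays self-contained. The helper names `asic_ids_from_keys` and `missing_containers` are hypothetical and are not taken from the PR.

```python
def asic_ids_from_keys(keys, prefix="asic"):
    """Convert table keys such as 'asic3' into sorted integer asic indices.

    The 'asicN' key format is an assumption; keys that do not match it
    are ignored rather than raising an error.
    """
    ids = []
    for key in keys:
        suffix = key[len(prefix):]
        if key.startswith(prefix) and suffix.isdigit():
            ids.append(int(suffix))
    return sorted(ids)


def missing_containers(expected, running):
    """Return the expected containers that are not currently running."""
    return sorted(set(expected) - set(running))


if __name__ == "__main__":
    # Keys as they might be read from CHASSIS_ASIC_TABLE (illustrative values).
    present = asic_ids_from_keys(["asic2", "asic0"])
    expected = {f"swss{i}" for i in present} | {f"syncd{i}" for i in present}
    running = {"swss0", "syncd0", "swss2"}
    print("present asics:", present)
    print("not running:", missing_containers(expected, running))
```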

How to verify it

  1. Bring up the Supervisor card with one or more fabric cards missing. Check the 'monit' process logs in syslog. monit should not report failures due to missing dockers for the asics on the fabric cards which are not present.
  2. Execute 'container_checker'. The command should not report failures due to missing dockers for the asics on the fabric cards which are not present.

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205

Description for the changelog

container_checker on supervisor should check containers based on asic presence

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@anamehra anamehra requested a review from lguohan as a code owner July 13, 2022 16:52
@anamehra
Contributor Author

Hi @rlhui @abdosi , FYI-
Please assign for review and add chassis tag.

@abdosi abdosi added Chassis 🤖 Modular chassis support Multi-ASIC labels Jul 13, 2022
    Added more comments.

Signed-off-by: anamehra <anamehra@cisco.com>
@anamehra anamehra requested a review from a team as a code owner July 26, 2022 22:00
@anamehra anamehra requested a review from abdosi July 26, 2022 23:03
@abdosi
Contributor

abdosi commented Jul 26, 2022

LGTM

@abdosi
Contributor

abdosi commented Jul 26, 2022

@judyjoseph / @arlakshm

Please take a look

@abdosi
Contributor

abdosi commented Jul 26, 2022

LGTM

bgp will be removed from the exception list. I am creating a PR to disable it in the supervisor feature list.

Signed-off-by: anamehra <anamehra@cisco.com>
@abdosi
Contributor

abdosi commented Aug 3, 2022

/azp run

@azure-pipelines

You have several pipelines (over 10) configured to build pull requests in this repository. Specify which pipelines you would like to run by using /azp run [pipelines] command. You can specify multiple pipelines using a comma separated list.

@abdosi
Contributor

abdosi commented Aug 5, 2022

/azp run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@linux-foundation-easycla

linux-foundation-easycla bot commented Aug 11, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

@lgtm-com

lgtm-com bot commented Aug 11, 2022

This pull request introduces 1 alert when merging bd30ff2 into 147f736 - view on LGTM.com

new alerts:

  • 1 for Unused import

    Removed daemon_base import
@arlakshm
Contributor

Changes look good to me. One comment: if a Fabric asic is not detected, will the CHASSIS_STATE_DB have that asic information? Will monit still be able to detect such errors?

@anamehra
Contributor Author

> Changes look good to me. One comment: if a Fabric asic is not detected, will the CHASSIS_STATE_DB have that asic information? Will monit still be able to detect such errors?

If a fabric card is not detected, the corresponding asic entry should not be present in CHASSIS_STATE_DB. The platform needs to make sure that the asic list includes only the detected asics. The get_all_asics() function, which the platform layer implements in modules.py, should return only detected asics. For Linecards, as asics are not removable, this may list all asics. But for the Supervisor, it should return only the asics present.
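
To illustrate the platform-side expectation, here is a rough sketch. `FabricCardSketch` is a hypothetical stand-in, not the real SONiC platform class; the `(asic index, PCI address)` tuple shape and the PCI addresses are assumptions used only for illustration.

```python
class FabricCardSketch:
    """Hypothetical stand-in for a platform module object."""

    def __init__(self, name, asics, present):
        self.name = name
        # Assumed shape: list of (asic index, PCI address) pairs.
        self._asics = list(asics)
        self._present = present

    def get_presence(self):
        return self._present

    def get_all_asics(self):
        # Report asics only when the card is actually detected, so that
        # absent fabric cards contribute no entries to the asic list.
        return list(self._asics) if self._present else []


def detected_asic_ids(modules):
    """Collect the asic indices reported by all modules, sorted."""
    ids = []
    for module in modules:
        for asic_id, _pci_address in module.get_all_asics():
            ids.append(asic_id)
    return sorted(ids)


if __name__ == "__main__":
    cards = [
        FabricCardSketch("FABRIC-CARD0", [(0, "0000:05:00.0")], present=True),
        FabricCardSketch("FABRIC-CARD1", [(1, "0000:07:00.0")], present=False),
    ]
    print(detected_asic_ids(cards))
```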

@anamehra
Contributor Author

Hi @abdosi , is there any known issue related to the build check failure? The errors do not look related to the changes in this PR, and I have seen the same errors reported in other PRs as well.

@abdosi abdosi merged commit f404ce6 into sonic-net:master Aug 22, 2022
yxieca pushed a commit that referenced this pull request Aug 26, 2022
… presence (#11442)

rlhui pushed a commit that referenced this pull request Feb 11, 2023
…13497)

Why I did it
On a supervisor card in a chassis, syncd/teamd/swss/lldp etc dockers are created for each Switch Fabric card. However, not all chassis would have all the switch fabric cards present. In this case, only dockers for Switch Fabrics present would be created.

system-health indicates errors in this scenario as it is expecting dockers for all Switch Fabrics (based on NUM_ASIC defined in asic.conf file).

system-health process error messages were also altered to indicate which container had the issue; multiple containers may run processes with the same name, which can result in identical system-health error messages, causing ambiguity.

How I did it
Port container_checker logic from #11442 into service_checker for system-health.

How to verify it
Bring up the Supervisor card with one or more fabric cards missing. Execute 'show system-health summary'. The command should not report failures due to missing dockers for the asics on the fabric cards which are not present.
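
The ambiguity mentioned above can be illustrated with a small sketch. The message wording here is made up for illustration and is not the text system-health actually emits; `describe_failure` is a hypothetical helper.

```python
def describe_failure(container, process):
    """Build an error string that names both the container and the process."""
    # Hypothetical wording; the point is only that the container is included.
    return f"Container '{container}': process '{process}' is not running"


if __name__ == "__main__":
    # Two per-asic containers running a process with the same name.
    failures = [("swss0", "orchagent"), ("swss1", "orchagent")]
    for container, process in failures:
        print(describe_failure(container, process))
```

Without the container name, both failures would produce the same message, so a reader could not tell which instance is affected.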
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Feb 17, 2023
…onic-net#13497)
mssonicbld pushed a commit that referenced this pull request Feb 17, 2023
…13497)
spilkey-cisco added a commit to spilkey-cisco/sonic-buildimage that referenced this pull request Feb 23, 2023
…onic-net#13497)
yxieca pushed a commit that referenced this pull request Feb 28, 2023
…13497) (#13966)
StormLiangMS pushed a commit to StormLiangMS/sonic-buildimage that referenced this pull request Mar 28, 2023
Related work items: sonic-net#276, sonic-net#305, sonic-net#332, sonic-net#338, sonic-net#339, sonic-net#1188, sonic-net#1192, sonic-net#1197, sonic-net#1206, sonic-net#1685, sonic-net#1690, sonic-net#1696, sonic-net#1699, sonic-net#1709, sonic-net#1727, sonic-net#1737, sonic-net#1741, sonic-net#1742, sonic-net#2511, sonic-net#2512, sonic-net#2532, sonic-net#2559, sonic-net#2626, sonic-net#2638, sonic-net#2645, sonic-net#2649, sonic-net#2660, sonic-net#2669, sonic-net#2670, sonic-net#2678, sonic-net#10084, sonic-net#11442, sonic-net#11873, sonic-net#12047, sonic-net#12110, sonic-net#12207, sonic-net#12529, sonic-net#12678, sonic-net#13235, sonic-net#13287, sonic-net#13372, sonic-net#13395, sonic-net#13456, sonic-net#13497, sonic-net#13522, sonic-net#13545, sonic-net#13547, sonic-net#13552, sonic-net#13569, sonic-net#13572, sonic-net#13578, sonic-net#13591, sonic-net#13611, sonic-net#13647, sonic-net#13649, sonic-net#13660, sonic-net#13710, sonic-net#13716, sonic-net#13724, sonic-net#13726, sonic-net#13732, sonic-net#13735, sonic-net#13739, sonic-net#13757, sonic-net#13786, sonic-net#13792, sonic-net#13800, sonic-net#13801, sonic-net#13802, sonic-net#13805, sonic-net#13806, sonic-net#13812, sonic-net#13814, sonic-net#13822, sonic-net#13831, sonic-net#13834, sonic-net#13847, sonic-net#13870, sonic-net#13882, sonic-net#13884, sonic-net#13885, sonic-net#13894, sonic-net#13895, sonic-net#13926, sonic-net#13932, sonic-net#13935, sonic-net#13942, sonic-net#13951, sonic-net#13953, sonic-net#13964
Development

Successfully merging this pull request may close these issues.

Monit container_checker swss/syncd/teamd containers not running error in supervisor
7 participants