Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[supervisor]monit container-checker failed due to unexpected "database-chassis" docker running #9043

Merged
merged 1 commit into from
Mar 4, 2022

Conversation

mlok-nokia
Copy link
Contributor

@mlok-nokia mlok-nokia commented Oct 22, 2021

Why I did it

Fixed the monit container_checker fails due to unexpected "database-chassis" docker running on Supervisor card in the VOQ chassis. fixes #9042

How I did it

Added database-chassis to the always running docker list if platform is supervisor card.

How to verify it

Execute the CLI command "sudo monit status container_checker"

admin@sonic:~$ sudo monit status container_checker
Monit 5.20.0 uptime: 22h 18m

Program 'container_checker'
  status                       Status ok
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  last exit value              0
  last output                  -
  data collected               Fri, 22 Oct 2021 15:12:50 

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106

Description for the changelog

A picture of a cute animal (not mandatory but encouraged)

@mlok-nokia mlok-nokia requested a review from lguohan as a code owner October 22, 2021 15:34
@lguohan lguohan added the Chassis 🤖 Modular chassis support label Oct 28, 2021
@mlok-nokia
Copy link
Contributor Author

@judyjoseph @arlakshm Please help to review this PR. Thanks.

@@ -56,6 +56,10 @@ def get_expected_running_containers():
else:
expected_running_containers.add(container_name)

if container_name == "database" and device_info.is_supervisor():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, I think it is better to add the database-chassis in the feature table for the supervisor card iso for adding the check here

Copy link
Contributor Author

@mlok-nokia mlok-nokia Nov 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, the feature table is in the init_cfg.json file which is generated while building the image. If we want to add the database-chassis docker to the feature table, we have to make the init_cfg.json as J2 template then allow the the database-chassis container to be inserted/rendered in the feature table for Supervisor card during runtime at system boot up. Is this what you suggested to do or another method?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

init_cfg.json is generated from init_cfg.json.j2, recently change was done to make dhcp_relay feature enabled or disabled based on the attribute in DEVICE_METADATA.
https://github.com/Azure/sonic-buildimage/blob/51a0aed02af5e2586b207b2493de28180dd22bed/files/build_templates/init_cfg.json.j2#L38

I was wondering if we can follow the same approach and enable database-chassis when switch_type in DEVICE_METADATA is fabric

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weeks ago, when I looked at this init_cfg.json.j2 with this change, it looks like there is an issue with this change. During image build, the init_cfg.json is generated with the following value for the "dhcp-relay".

 "dhcp_relay": {
            "state": "{% if not (DEVICE_METADATA is defined and DEVICE_METADATA['localhost'] is defined and DEVICE_METADATA['localhost']['type'] is defined and DEVICE_METADATA['localhost']['type'] != 'ToRRouter') %}enabled{% else %}disabled{% endif %}", false, "enabled")) %}",
            "has_timer" : false,
            "has_global_scope": false,
            "has_per_asic_scope": false,
            "auto_restart": "enabled",
            "high_mem_alert": "disabled"
        },

And, at boot up for the first time installation, the dhcp-relay state value is no longer rendered to either "enabled" or "disabled" when generated the default configuration. Based on what I saw, script "sonic-cfggen" will not render init_cfg.json file since it is using option "-j". if using option "-t" as template, it will work. Am I missing anything with this change? Or do you know who renders it before the "sonic-cfggen" combine it to the config_db.json?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arlakshm When the init_cfg,json with value ' "dhcp_relay": {"state": "{% if not (DEVICE_META....." ' is rendered to either "enabled" and "disabled"? I cannot find where the code renders the value of "state' for dhcp_relay to "enabled" or "disable". Please help me understand it and I can make the same design for the database-chassis and test it. Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abdosi - will you be able to help answer this? Thanks.

Copy link
Contributor Author

@mlok-nokia mlok-nokia Feb 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the investigation, I don't think we can use DEVICE_METATDATA|localhost|swtich_type to determine to enable the database-chassis or not. The dependency of database.service required 'database-chassis' to be started before the 'database'. Without start the database-chassis first, 'database' cannot be started and not able to load the DVEVICE_METADATA and the feature table

In file  database.service

 Wants=database-chassis.service
 After=database-chassis.service

@rlhui rlhui assigned abdosi and arlakshm and unassigned abdosi Dec 14, 2021
@mlok-nokia mlok-nokia changed the title {supervisor]monit container-checker failed due to unexpected "database-chassis" docker running [supervisor]monit container-checker failed due to unexpected "database-chassis" docker running Mar 2, 2022
@zhangyanzhao
Copy link
Collaborator

@rlhui will follow up @arlakshm

@arlakshm
Copy link
Contributor

arlakshm commented Mar 2, 2022

Hi @mlok-nokia, the container_checker was modified recently to support checking database container in multi asic platforms. Details in this PR #10067.

I think we can follow the same approach and add the database-chassis container to the always_running_containers if is_supervisor is True

@mlok-nokia
Copy link
Contributor Author

mlok-nokia commented Mar 2, 2022

Hi @mlok-nokia, the container_checker was modified recently to support checking database container in multi asic platforms. Details in this PR #10067.

I think we can follow the same approach and add the database-chassis container to the always_running_containers if is_supervisor is True

We can to do. The logic needs to be like below since Supervisor card can be multiasic too:

     if feature_table[container_name]["state"] == 'always_enabled':
           if container_name == "database-chassis":
                if device_info.is_supervisor():
                    always_running_containers.add(container_name) 
            elif multi_asic.is_multi_asic():
                if feature_table[container_name]["has_global_scope"] == "True":

@mlok-nokia
Copy link
Contributor Author

@arlakshm Please take a look at the reply which I made based on your suggestion -- adding the database-chassis in the init_cfg.json.j2 file and use the logic I mentioned in the reply. Is that correct? Thanks

@arlakshm
Copy link
Contributor

arlakshm commented Mar 3, 2022

Hi @mlok-nokia, the container_checker was modified recently to support checking database container in multi asic platforms. Details in this PR #10067.
I think we can follow the same approach and add the database-chassis container to the always_running_containers if is_supervisor is True

We can to do. The logic needs to be like below since Supervisor card can be multiasic too:

     if feature_table[container_name]["state"] == 'always_enabled':
           if container_name == "database-chassis":
                if device_info.is_supervisor():
                    always_running_containers.add(container_name) 
            elif multi_asic.is_multi_asic():
                if feature_table[container_name]["has_global_scope"] == "True":

Hi @mlok-nokia, the container_checker was modified recently to support checking database container in multi asic platforms. Details in this PR #10067.
I think we can follow the same approach and add the database-chassis container to the always_running_containers if is_supervisor is True

We can to do. The logic needs to be like below since Supervisor card can be multiasic too:

     if feature_table[container_name]["state"] == 'always_enabled':
           if container_name == "database-chassis":
                if device_info.is_supervisor():
                    always_running_containers.add(container_name) 
            elif multi_asic.is_multi_asic():
                if feature_table[container_name]["has_global_scope"] == "True":

@mlok-nokia. The feature table may not have 'database-chassis' container. I think we can add the check of the supervisor at the end, somtehing like this?

        if feature_table[container_name]["state"] == 'always_enabled':
            if multi_asic.is_multi_asic():
                if feature_table[container_name]["has_global_scope"] == "True":
                    always_running_containers.add(container_name)
                if feature_table[container_name]["has_per_asic_scope"] == "True":
                    num_asics = multi_asic.get_num_asics()
                    for asic_id in range(num_asics):
                        always_running_containers.add(container_name + str(asic_id))
            else:
                always_running_containers.add(container_name)

    if device_info.is_supervisor():
        always_running_containers.add(CHASSIS_DATABASE)

@mlok-nokia
Copy link
Contributor Author

Hi @mlok-nokia, the container_checker was modified recently to support checking database container in multi asic platforms. Details in this PR #10067.
I think we can follow the same approach and add the database-chassis container to the always_running_containers if is_supervisor is True

We can to do. The logic needs to be like below since Supervisor card can be multiasic too:

     if feature_table[container_name]["state"] == 'always_enabled':
           if container_name == "database-chassis":
                if device_info.is_supervisor():
                    always_running_containers.add(container_name) 
            elif multi_asic.is_multi_asic():
                if feature_table[container_name]["has_global_scope"] == "True":

Hi @mlok-nokia, the container_checker was modified recently to support checking database container in multi asic platforms. Details in this PR #10067.
I think we can follow the same approach and add the database-chassis container to the always_running_containers if is_supervisor is True

We can to do. The logic needs to be like below since Supervisor card can be multiasic too:

     if feature_table[container_name]["state"] == 'always_enabled':
           if container_name == "database-chassis":
                if device_info.is_supervisor():
                    always_running_containers.add(container_name) 
            elif multi_asic.is_multi_asic():
                if feature_table[container_name]["has_global_scope"] == "True":

@mlok-nokia. The feature table may not have 'database-chassis' container. I think we can add the check of the supervisor at the end, somtehing like this?

        if feature_table[container_name]["state"] == 'always_enabled':
            if multi_asic.is_multi_asic():
                if feature_table[container_name]["has_global_scope"] == "True":
                    always_running_containers.add(container_name)
                if feature_table[container_name]["has_per_asic_scope"] == "True":
                    num_asics = multi_asic.get_num_asics()
                    for asic_id in range(num_asics):
                        always_running_containers.add(container_name + str(asic_id))
            else:
                always_running_containers.add(container_name)

    if device_info.is_supervisor():
        always_running_containers.add(CHASSIS_DATABASE)

Ok. I will make the change as you suggested -- always add for the Supervisor card. Thanks

…e-chassis" docker running sonic-net#9042

 -- Added database-chassis to the expected docker list for the monit to check.

Signed-off-by: mlok <marty.lok@nokia.com>
@mlok-nokia mlok-nokia force-pushed the monit-container-chk branch from fa451f0 to 6befb6a Compare March 3, 2022 03:20
@mlok-nokia
Copy link
Contributor Author

@arlakshm I have made the change. Please review it. Thanks

@arlakshm arlakshm merged commit c40f04f into sonic-net:master Mar 4, 2022
arlakshm pushed a commit that referenced this pull request Mar 4, 2022
… "database-chassis" docker running #9042 (#9043)

Why I did it
Fixed the monit container_checker fails due to unexpected "database-chassis" docker running on Supervisor card in the VOQ chassis. fixes #9042

How I did it
Added database-chassis to the always running docker list if platform is supervisor card.

How to verify it
Execute the CLI command "sudo monit status container_checker"


Signed-off-by: mlok <marty.lok@nokia.com>
judyjoseph pushed a commit that referenced this pull request Mar 7, 2022
… "database-chassis" docker running #9042 (#9043)

Why I did it
Fixed the monit container_checker fails due to unexpected "database-chassis" docker running on Supervisor card in the VOQ chassis. fixes #9042

How I did it
Added database-chassis to the always running docker list if platform is supervisor card.

How to verify it
Execute the CLI command "sudo monit status container_checker"


Signed-off-by: mlok <marty.lok@nokia.com>
@mlok-nokia mlok-nokia deleted the monit-container-chk branch April 16, 2022 17:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[supervisor]monit container-checker failed due to unexpected "database-chassis" docker running
7 participants