Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

There is a problem that hostcfgd blocks forever and does not react to SIGTERM causing a delay in warm boot. #603

Closed
qiluo-msft opened this issue Apr 16, 2022 · 8 comments
Assignees

Comments

@qiluo-msft
Copy link
Contributor

There is a problem that hostcfgd blocks forever and does not react to SIGTERM causing a delay in warm boot.
When hitting this line https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-host-services/scripts/hostcfgd#L1241 .

Example:

>>> from swsscommon import swsscommon
>>> c=swsscommon.ConfigDBConnector()
>>> c.subscribe('A', lambda a: None)
>>> c.connect()
>>> c.listen()


^C^C^C^C

@qiluo-msft This looks like a swss-common issue.

This wasn't seen for some reason during 500 warm boot tests on 202012 since warm reboot is executed early enough before hostcfgd stucks in a listen().

Originally posted by @stepanblyschak in sonic-net/sonic-buildimage#10510 (comment)

@liuh-80
Copy link
Contributor

liuh-80 commented Apr 16, 2022

ACK, will start investigation this issue.

@liuh-80 liuh-80 self-assigned this Apr 16, 2022
@liuh-80
Copy link
Contributor

liuh-80 commented Apr 16, 2022

I found there are 2 place have infinite loop issue:

https://github.com/Azure/sonic-swss-common/blob/15c0f729648ee9f26ac9aea767d5712db34ca298/common/pubsub.cpp#L125

https://github.com/Azure/sonic-swss-common/blob/36e1f61691df7aa44782a377e4082cc4380f1018/common/configdb.h#L97

We need exit these infinite loop when receive SIGTERM by add a SIGTERM handler and check SIGTERM status in loop. also we may need check if other signal also need handle, for example SIGKILL

Also need check if some other place in swss common have same issue.

@liat-grozovik
Copy link
Collaborator

@qiluo-msft any reason the issue is not opened on buildimage for tracking? i believe we only track it and not submodules.

@liuh-80
Copy link
Contributor

liuh-80 commented Apr 19, 2022

To minimize code change and make code scalable, my proposal is:
1.Add a CancellationToken class.
2.And add new parameter to these existed API.
3.Code change in hostcfgd to use the new API with a signal handler.

@liuh-80
Copy link
Contributor

liuh-80 commented Apr 19, 2022

Here is a draft PR for this issue:

#606

from swsscommon import swsscommon

global_cancellation_token = swsscommon.SignalHandlerHelper.GetCancellationToken(2)
swsscommon.SignalHandlerHelper.RegisterSignalHandler(2)

c=swsscommon.ConfigDBConnector()
c.subscribe('A', lambda a: None)
c.connect()
c.listen(global_cancellation_token, None)
^C>>>

@liuh-80
Copy link
Contributor

liuh-80 commented Apr 19, 2022

Update: here is latest example:

from swsscommon import swsscommon

global_cancellation_token = swsscommon.CancellationToken()

swsscommon.SignalHandlerHelper.RegisterCancellationToken(2, global_cancellation_token)
swsscommon.SignalHandlerHelper.RegisterSignalHandler(2)

c=swsscommon.ConfigDBConnector()
c.subscribe('A', lambda a: None)
c.connect()
c.listen(global_cancellation_token, None)

^C>>>

@qiluo-msft
Copy link
Contributor Author

@qiluo-msft any reason the issue is not opened on buildimage for tracking? i believe we only track it and not submodules.

If it is obvious issue in submodule, we will move it to submodule's repo.

@liat-grozovik
Copy link
Collaborator

@liuh-80, @qiluo-msft as this one is blocking sonic-net/sonic-buildimage#10510. can you please see how we can expedite the solution and the review?

liat-grozovik pushed a commit to sonic-net/sonic-utilities that referenced this issue May 11, 2022
…ystemd (#2133)" (#2161)

This reverts commit 23e9398.

- What I did
Revert "[scripts/fast-reboot] Shutdown remaining containers through systemd (#2133)"

This reverted PR is part of a story that refactors warm/fast shutdown sequence to gracefully stop services instead of killing them without any ordering and dependency requirements which creates several issues and is error prone for the future.

This PR must come together with sonic-net/sonic-buildimage#10510.
However, #10510 is blocked due to an issue in swss-common sonic-net/sonic-swss-common#603
And a fix by MSFT is in review sonic-net/sonic-swss-common#606

I am reverting it because its dependency is still blocked and we cannot update submodule pointer. Once the dependency of the reverted PR is resolved, it shall be re-committed.
liat-grozovik pushed a commit to sonic-net/sonic-utilities that referenced this issue May 11, 2022
…ystemd (#2133)" (#2166)

- What I did
This reverts commit a5f55aa.

Revert "[scripts/fast-reboot] Shutdown remaining containers through systemd (#2133)"

This reverted PR is part of a story that refactors warm/fast shutdown sequence to gracefully stop services instead of killing them without any ordering and dependency requirements which creates several issues and is error prone for the future.

This PR must come together with sonic-net/sonic-buildimage#10510.
However, #10510 is blocked due to an issue in swss-common sonic-net/sonic-swss-common#603
And a fix by MSFT is in review sonic-net/sonic-swss-common#606

I am reverting it because its dependency is still blocked and we cannot update submodule pointer. Once the dependency of the reverted PR is resolved, it shall be re-committed.

- How I did it
git revert a5f55aa

- How to verify it
Run tests
liuh-80 added a commit that referenced this issue May 19, 2022
…token support. (#606)

Why I did it
    There are infinite loops inside PubSub::listen() method, so application using this method can't handle SIGTERM correctly.
    #603

How I did it
    Add following class:
    1. CancellationToken: this class will help exist the infinite loops when SIGTERM or other signal happen.
    2. SignalHandlerHelper: Provide a native signal handler.

How to verify it
   1. manually test.
   2. Pass all test case.
itamar-talmon pushed a commit to itamar-talmon/sonic-swss-common that referenced this issue Jul 19, 2022
…token support. (sonic-net#606)

Why I did it
    There are infinite loops inside PubSub::listen() method, so application using this method can't handle SIGTERM correctly.
    sonic-net#603

How I did it
    Add following class:
    1. CancellationToken: this class will help exist the infinite loops when SIGTERM or other signal happen.
    2. SignalHandlerHelper: Provide a native signal handler.

How to verify it
   1. manually test.
   2. Pass all test case.
@liuh-80 liuh-80 closed this as completed Jul 27, 2022
malletvapid23 added a commit to malletvapid23/Sonic-Utility that referenced this issue Aug 3, 2023
…ystemd (#2133)" (#2161)

This reverts commit 23e9398.

- What I did
Revert "[scripts/fast-reboot] Shutdown remaining containers through systemd (#2133)"

This reverted PR is part of a story that refactors warm/fast shutdown sequence to gracefully stop services instead of killing them without any ordering and dependency requirements which creates several issues and is error prone for the future.

This PR must come together with sonic-net/sonic-buildimage#10510.
However, #10510 is blocked due to an issue in swss-common sonic-net/sonic-swss-common#603
And a fix by MSFT is in review sonic-net/sonic-swss-common#606

I am reverting it because its dependency is still blocked and we cannot update submodule pointer. Once the dependency of the reverted PR is resolved, it shall be re-committed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants