Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Xcvrd should restart if any child thread crashes #326

Merged
merged 5 commits into from
Jan 19, 2023

Conversation

mihirpat1
Copy link
Contributor

@mihirpat1 mihirpat1 commented Dec 28, 2022

Description

Make all task workers as child thread of main process.
The main process should watch for any child thread crash and do suicide to restart itself. Also ensure the backtrace of crashing child thread is shown

Motivation and Context

Overall, the change set now adheres to a template for creating thread and handles exceptions followed by restarting xcvrd.
With this change set, 3 threads will be created:

  1. CmisManagerTask (This was formerly a process)
  2. DomInfoUpdateTask (This already exists as a thread)
  3. SfpStateUpdateTask (This was formerly a process)

All the above threads have now been named accordingly.
Once an exception is raised, the run() method will generate a traceback and will set the main_thread_stop event.
The xcvrd daemon keeps waiting until main_thread_stop event is generated. Once this event is generated, we first check if any thread is dead. If so, we capture the exception, evaluate if other threads have an exception generated as well and then force the xcvrd process to kill itself.
In case all threads are alive, we allow xcvrd daemon to gracefully exit.

Note - In order to kill SfpStateUpdateTask thread gracefully, we are manually raising an exception since SfpStateUpdateTask has a call to an API which could potentially sleep in the order of seconds and hence, could block the xcvrd daemon graceful shutdown process for a prolonged time.

How Has This Been Tested?

Please find the summary below.
1. Kill xcvrd
2. Restart xcvrd
3. Restart pmon
4. Kill 1 thread and ensure all threads are killed and xcvrd is respawned
5. Ensure backtrace is generated with a manual exception - CmisManagerTask (Please refer to the below note for details)
6. Ensure backtrace is generated with a manual exception - DomInfoUpdateTask (Please refer to the below note for details)
7. Ensure backtrace is generated with a manual exception - SfpStateUpdateTask (Please refer to the below note for details)
8. Ensure on Mellanox, no side effect is seen since they do not spawn CMIS thread

For a detailed UT, please refer to the below enclosure.
xcvrd_thread_support.txt

Additional Information (Optional)

Signed-off-by: Mihir Patel patelmi@microsoft.com

@mihirpat1 mihirpat1 marked this pull request as ready for review December 28, 2022 11:43
@prgeor
Copy link
Collaborator

prgeor commented Jan 3, 2023

@keboliu can you review

@prgeor prgeor added the xcvrd label Jan 3, 2023
self.exc = e
self.main_thread_stop_event.set()

def raise_exception(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add comment here why this exception is raised here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a comment now.

@prgeor
Copy link
Collaborator

prgeor commented Jan 3, 2023

@mihirpat1 can you check the behavior if you kill the threads with kill -9

@prgeor
Copy link
Collaborator

prgeor commented Jan 3, 2023

@jaganbal-a @shyam77git please review

Signed-off-by: Mihir Patel <patelmi@microsoft.com>
@mihirpat1
Copy link
Contributor Author

@mihirpat1 can you check the behavior if you kill the threads with kill -9

It seems we won't be able to kill threads using kill -9 since kill requires process ID as a parameter. I have attached UT for killing xcvrd process though.

@keboliu
Copy link
Collaborator

keboliu commented Jan 5, 2023

What if "CmisManagerTask " is disabled with intention? On the Mellanox platforms "CmisManagerTask" is disabled by setting the "skip_xcvrd_cmis_mgr" to "true".

@mihirpat1
Copy link
Contributor Author

What if "CmisManagerTask " is disabled with intention? On the Mellanox platforms "CmisManagerTask" is disabled by setting the "skip_xcvrd_cmis_mgr" to "true".

If "CmisManagerTask" is disabled with intention, we will not spawn thread for it. Also, I have attached the UT for it (#8 in xcvrd_thread_support.txt)

@@ -2199,9 +2247,10 @@ class DaemonXcvrd(daemon_base.DaemonBase):
def __init__(self, log_identifier, skip_cmis_mgr=False):
super(DaemonXcvrd, self).__init__(log_identifier)
self.stop_event = threading.Event()
self.sfp_error_event = multiprocessing.Event()
self.sfp_error_event = threading.Event()
Copy link

@jaganbal-a jaganbal-a Jan 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sfp_error_event event seems not handled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we are using it at the below:
https://github.com/sonic-net/sonic-platform-daemons/blob/master/sonic-xcvrd/xcvrd/xcvrd.py#L2391-L2392
if self.sfp_error_event.is_set():
sys.exit(SFP_SYSTEM_ERROR)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But that code will never get hit, when the event get set due to error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, since we are killing the parent process (xcvrd) if next_state is STATE_EXIT, the above code will never get hit. This seems like a day 1 behavior though.

threading.Thread.__init__(self)
self.name = "CmisManagerTask"
self.exc = None
self.task_stopping_event = threading.Event()
Copy link

@jaganbal-a jaganbal-a Jan 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please explain how this event created inside the thread will be used?

Copy link
Contributor Author

@mihirpat1 mihirpat1 Jan 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This event will be used to stop the handling the task_worker routine (in case of exception handling or graceful shutdown of the thread). The task_worker routine will be running in a while loop waiting for task_stopping_event to be set in addition to performing necessary actions.
Hence, once we want the thread to exit, we must stop the execution of task_worker routine so that it comes out of the while loop and allows us to proceed with exception handling or graceful shutdown.


# Stop the sfp state info update thread
if sfp_state_update.is_alive():
sfp_state_update.raise_exception()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will it be good to join(1) with timeout and raise exception if timeout expires and the thread is still alive?

Copy link
Contributor Author

@mihirpat1 mihirpat1 Jan 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel it's better to raise exception right away since based on my current analysis on Arista testbed, it takes about ~5s for the vendor API to come out of sleep. This timeout I assume could be dynamic based on the transceivers present on the device so I think having 1s delay will not give us a much advantage for join function with 1s timeout.

jaganbal-a
jaganbal-a previously approved these changes Jan 9, 2023
sonic-xcvrd/xcvrd/xcvrd.py Outdated Show resolved Hide resolved
vdahiya12
vdahiya12 previously approved these changes Jan 12, 2023
…sed to be skipped. Also, moidified join function to handle accordingly. Added logs for showing names of threads spawned

Signed-off-by: Mihir Patel <patelmi@microsoft.com>
@mihirpat1 mihirpat1 dismissed stale reviews from vdahiya12 and jaganbal-a via 311ed90 January 13, 2023 20:29
yxieca pushed a commit that referenced this pull request Jan 19, 2023
* Xcvrd should restart if any child thread crashes

Signed-off-by: Mihir Patel <patelmi@microsoft.com>

* Resolved test_SfpStateUpdateTask_task_run_stop test failure

* Added comment for raise_exception

Signed-off-by: Mihir Patel <patelmi@microsoft.com>

* Added check for avoiding cmis_manager.start() if CMIS thread is supposed to be skipped. Also, moidified join function to handle accordingly. Added logs for showing names of threads spawned

Signed-off-by: Mihir Patel <patelmi@microsoft.com>

Signed-off-by: Mihir Patel <patelmi@microsoft.com>
@prgeor
Copy link
Collaborator

prgeor commented Feb 2, 2023

@StormLiangMS please pick to 202211

StormLiangMS pushed a commit that referenced this pull request Feb 4, 2023
* Xcvrd should restart if any child thread crashes

Signed-off-by: Mihir Patel <patelmi@microsoft.com>

* Resolved test_SfpStateUpdateTask_task_run_stop test failure

* Added comment for raise_exception

Signed-off-by: Mihir Patel <patelmi@microsoft.com>

* Added check for avoiding cmis_manager.start() if CMIS thread is supposed to be skipped. Also, moidified join function to handle accordingly. Added logs for showing names of threads spawned

Signed-off-by: Mihir Patel <patelmi@microsoft.com>

Signed-off-by: Mihir Patel <patelmi@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants