-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[auto-techsupport] support techsupport generation on potential memory… #939
[auto-techsupport] support techsupport generation on potential memory… #939
Conversation
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
|
||
the SONiC techsupport is automatically generated. | ||
|
||
```mem_free_threshold``` is there to invoke dump when there is quite small amount of memory left that is needed to successfully execute "show techsupport". This is going to be 200 Mb by default, as at least 80-90 Mb takes "show techsupport" execution. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@qiluo-msft If user configures too low mem_free_thrshold the "show techsupport" will fail. Currently there is no upper limit of "show techsupport" peak memory consumption, which will make it hard to verify the user does not configure this threshold too low.
This invocation will make situation worse.
It is assumed that the situation is already critical and we try to take the dump as best effort.
BTW, the swap partition would make it at least possible to execute some commands, however, it is not configured in SONiC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for misleading. I am asking if the free memory suddenly drop to below 80 MB, show techsupport will always fail. This invocation will make situation worse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@qiluo-msft Do you mean that if suddenly an application perfroms a very big allocation so that we won't be able to perform "show techsupport". A similar situation might happen when the process leaks memory very quickly so that we don't have enough memory to inoke "show techsupport". In this design the dump collection is best effort, so we don't guaranty that the dump will be actually generated because of that. It is also assumed that if the system is running with that low free memory the situation is already critical (very likelly that reboot will happen) so that additional "show tech" does not help or make it worth, it is making the reboot happen earlier. If the user does want the system to stay with a very low free memory than the user has to disable this feature. Otherwise I don't see a way to handle such a situation.
if status != 0 for 10 times within 20 cycles then exec /usr/local/bin/mem_threshold_check_handler" | ||
``` | ||
|
||
The action is going to be ran only once the mem_check script detects memory usage above threshold. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@qiluo-msft I mean that the techsupport will run once if the script returns non zero status code for 10 times within 20 cycles. If memory usage is constantly too high for e.g 100 cycles, only 1 dump will be taken. If lets say the memory usage fluctuates, in example, 10 cycles too high in a row and then goes down for 10 cycles and then again up - there is a protection against too frequent auto tech dumps already present in the auto tech infrastructure.
|
||
the SONiC techsupport is automatically generated. | ||
|
||
```mem_free_threshold``` is there to invoke dump when there is quite small amount of memory left that is needed to successfully execute "show techsupport". This is going to be 200 Mb by default, as at least 80-90 Mb takes "show techsupport" execution. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@qiluo-msft updated in next patch
|
||
### Config DB | ||
|
||
#### AUTO_TECHSUPPORT Table | ||
``` | ||
key = "AUTO_TECHSUPPORT|global" | ||
state = "enabled" / "disabled" ; Enable this to make the Techsupport Invocation event driven based on core-dump generation | ||
available_mem_threshold = 1*2DIGIT ; Memory threshold; 0 to disable techsupport invocation on mem leak. | ||
min-available-mem = 1*5DIGIT ; Minimum free memory amount in Kb when techsupport will be executed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@qiluo-msft updated in next patch
|
||
### Config DB | ||
|
||
#### AUTO_TECHSUPPORT Table | ||
``` | ||
key = "AUTO_TECHSUPPORT|global" | ||
state = "enabled" / "disabled" ; Enable this to make the Techsupport Invocation event driven based on core-dump generation | ||
available_mem_threshold = 1*2DIGIT ; Memory threshold; 0 to disable techsupport invocation on mem leak. | ||
min-available-mem = 1*5DIGIT ; Minimum free memory amount in Kb when techsupport will be executed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@qiluo-msft Right, updated in next patch
|
||
### Config DB | ||
|
||
#### AUTO_TECHSUPPORT Table | ||
``` | ||
key = "AUTO_TECHSUPPORT|global" | ||
state = "enabled" / "disabled" ; Enable this to make the Techsupport Invocation event driven based on core-dump generation | ||
available_mem_threshold = 1*2DIGIT ; Memory threshold; 0 to disable techsupport invocation on mem leak. | ||
min-available-mem = 1*5DIGIT ; Minimum free memory amount in Kb when techsupport will be executed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If always > 1MB, use MB as unit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@qiluo-msft Agree, updated in next patch
@@ -226,27 +298,43 @@ module sonic-auto_techsupport { | |||
#### AUTO_TECHSUPPORT_DUMP_INFO Table |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@qiluo-msft There is a rate limiting mechanism already available in auto techsupport infrastructure that limits the number of dumps that can be generated.
@@ -36,13 +37,20 @@ Currently, techsupport is run by invoking `show techsupport` either by orchestra | |||
|
|||
However if the techsupport invocation can be made event-driven based on core dump generation, that would definitely improve the debuggability. That is the overall idea behind this HLD. All the high-level requirements are summarized in the next section | |||
|
|||
Another use case is to gather more information about the system in case there is a memory usage threshold crossed. | |||
SONiC dump generated after system reboots due to out of memory is not enough for debugging the issue | |||
as all the information about processes and their mem usage, smaps (/proc/PID/smaps) is lost. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@qiluo-msft The user can configure the "since" parameter in CFG_DB and achive that result. If "since" is set to "0 sec" that would mean that no logs, sai, swss recs are recorded.
if status != 0 for 10 times within 20 cycles then exec /usr/local/bin/mem_threshold_check_handler" | ||
``` | ||
|
||
The action is going to be ran only once the mem_check script detects memory usage above threshold. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@qiluo-msft There are "rate-limit-interval", "max-techsupport-limit" parameters in AUTO_TECHSUPPORT table which are already part of auto tech infrastructure that limit the number of dumps. This is only applicable for automatically generated dumps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Is auto rotation covered by design?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
@stepanblyschak please update PR description with the PRs list covering this feature. |
- What I did Implemented memory threashold check in auto techsupport feature according to sonic-net/SONiC#939. - How I did it Added two scripts. The check script and the handler script. Few modifications made in auto tech implementation. UT added. - How to verify it Run the action script and the handler script on the switch. Run UT. Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
#### Why I did it To support automatic techsupport invokation in case memory usage is too high. #### How I did it Implemented according to sonic-net/SONiC#939 #### How to verify it UT, manual test on the switch. *DEPENDS* on sonic-net/sonic-utilities#2116
Backport of #2116 - What I did Implemented memory threashold check in auto techsupport feature according to sonic-net/SONiC#939. - How I did it Added two scripts. The check script and the handler script. Few modifications made in auto tech implementation. UT added. - How to verify it Run the action script and the handler script on the switch. Run UT. Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
#### Why I did it To support automatic techsupport invokation in case memory usage is too high. #### How I did it Implemented according to sonic-net/SONiC#939 #### How to verify it UT, manual test on the switch. *DEPENDS* on sonic-net/sonic-utilities#2116
#### Why I did it To support automatic techsupport invokation in case memory usage is too high. #### How I did it Implemented according to sonic-net/SONiC#939 #### How to verify it UT, manual test on the switch. *DEPENDS* on sonic-net/sonic-utilities#2116
#### Why I did it To support automatic techsupport invokation in case memory usage is too high. #### How I did it Implemented according to sonic-net/SONiC#939 #### How to verify it UT, manual test on the switch. *DEPENDS* on sonic-net/sonic-utilities#2116
- What I did Implemented memory threashold check in auto techsupport feature according to sonic-net/SONiC#939. - How I did it Added two scripts. The check script and the handler script. Few modifications made in auto tech implementation. UT added. - How to verify it Run the action script and the handler script on the switch. Run UT. Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
… leaks