[auto-techsupport] support techsupport generation on potential memory… #939
Changes from all commits
ea5bc56
9cdfc05
991eba4
e178596
86c46a3
ad03b32
c7b9093
669409c
6a588d7
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -8,17 +8,17 @@ | |
* [2. High Level Requirements](#2-high-level-requirements) | ||
* [3. Core Dump Generation in SONiC](#3-core-dump-generation-in-sonic) | ||
* [4. Schema Additions](#4-schema-additions) | ||
* [5. CLI Enhancements](#5-cli-enhancements) | ||
* [6. Design](#6-design) | ||
* [6. CLI Enhancements](#5-cli-enhancements) | ||
* [7. Design](#6-design) | ||
* [6.1 Modifications to coredump-compress script](#61-Modifications-to-coredump-compress-script) | ||
* [6.2 coredump_gen_handler script](#62-coredump_gen_handler-script) | ||
* [6.3 Modifications to generate_dump script](#64-Modifications-to-generate-dump-script) | ||
* [6.4 techsupport_cleanup script](#65-techsupport_cleanup-script) | ||
* [6.5 Warmboot consideration](#65-Warmboot-consideration) | ||
* [6.6 MultiAsic consideration](#66-MultiAsic-consideration) | ||
* [6.7 Design choices for max-core-limit & max-techsupport-limit arguments](#67-Design-choices-for-max-core-limit-&-max-techsupport-limit-arguments) | ||
* [7. Test Plan](#7-Test-Plan) | ||
* [8. SONiC-to-SONiC Upgrade Considerations](#8-SONiC-to-SONiC-Upgrade-Considerations) | ||
* [8. Test Plan](#7-Test-Plan) | ||
* [9. SONiC-to-SONiC Upgrade Considerations](#8-SONiC-to-SONiC-Upgrade-Considerations) | ||
|
||
|
||
### Revision | ||
|
@@ -27,6 +27,7 @@ | |
| 1.0 | 06/22/2021 | Vivek Reddy Karri | Auto Invocation of Techsupport, triggered by a core dump | | ||
| 1.1 | TBD | Vivek Reddy Karri | Add the capability to Register/Deregister app extension to AUTO_TECHSUPPORT_FEATURE table | | ||
| 2.0 | TBD | Vivek Reddy Karri | Extending Support for Kernel Dumps | | ||
| 3.0 | 02/2022 | Stepan Blyshchak | Extending Support for memory usage threshold crossed | | ||
|
||
## About this Manual | ||
This document describes the details of the system which facilitates the auto techsupport invocation support in SONiC. The auto invocation is triggered when any process inside the docker crashes and a core dump is generated. | ||
|
@@ -36,13 +37,20 @@ Currently, techsupport is run by invoking `show techsupport` either by orchestra | |
|
||
However if the techsupport invocation can be made event-driven based on core dump generation, that would definitely improve the debuggability. That is the overall idea behind this HLD. All the high-level requirements are summarized in the next section | ||
|
||
Another use case is gathering more information about the system when a memory usage threshold is crossed. | ||
A SONiC dump generated after the system reboots due to an out-of-memory condition is not enough to debug the issue, | ||
because all information about processes and their memory usage, including smaps (/proc/PID/smaps), is lost. | ||
Therefore, once the system detects abnormal memory usage, a SONiC dump is generated automatically. | ||
|
||
## 2. High Level Requirements | ||
### Global Scope | ||
* Techsupport invocation should also be made event-driven based on core dump generation. | ||
* This is only applicable for the processes running inside the dockers. Does not apply for other processes. | ||
* init_cfg.json will be enhanced to include the "CONFIG" required for this feature (described in section 4) and is enabled by default. | ||
* To provide flexibility, a compile time flag "ENABLE_AUTO_TECH_SUPPORT" should be provided to enable/disable the "CONFIG" for this feature. | ||
* Users should have the ability to enable/disable this capability through CLI. | ||
* Techsupport invocation should also be made event-driven based on memory usage threshold crossing. | ||
* The memory usage threshold should be configurable system-wide and per container. | ||
|
||
### Configurable Params | ||
* A configurable "rate_limit_interval" should be introduced to limit the number of consecutive techsupport invocations. | ||
|
@@ -68,14 +76,59 @@ The naming format and compression is governed by the script `/usr/local/bin/core | |
|
||
Where `<comm>` is the command name associated with a process. The comm value of a running process can be read from the `/proc/[pid]/comm` file. | ||
|
||
## 4. Schema Additions | ||
## 4. Memory usage based techsupport invocation | ||
|
||
The SONiC techsupport is generated automatically when the following condition evaluates to true: | ||
``` | ||
(mem_usage > mem_usage_threshold || ${container}_mem_usage > ${container}_mem_usage_threshold) || mem_free <= mem_free_threshold | ||
``` | ||
|
||
where: | ||
|
||
* ```mem_usage``` is the total system memory used, as a percentage (derived from MemAvailable in /proc/meminfo); | ||
* ```mem_usage_threshold``` is the configured threshold (100 - available_mem_threshold); | ||
* ```${container}_mem_usage``` is the memory used by $container ("docker stats --no-stream --format {{.MemUsage}}" $container); | ||
* ```${container}_mem_usage_threshold``` is the configured memory threshold for $container (100 - ${container}_available_mem_threshold); | ||
* ```mem_free``` is the total memory minus the memory in use; | ||
* ```mem_free_threshold``` is the configured free memory threshold. | ||
|
||
```mem_free_threshold``` exists so that the dump is invoked while there is still enough memory left to execute "show techsupport" successfully. The default is 200 MB, since executing "show techsupport" itself takes at least 80-90 MB. | ||
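
As an illustration, the condition above can be written as a small Python predicate. This is a minimal sketch, not the actual script: the function name `should_invoke_techsupport` and its parameters are hypothetical, usage values are assumed to be percentages, and free memory is assumed to be in MB.

```python
def should_invoke_techsupport(mem_usage, mem_usage_threshold,
                              container_usage, container_thresholds,
                              mem_free_mb, mem_free_threshold_mb=200):
    """Evaluate the trigger condition described in this section.

    mem_usage / mem_usage_threshold: system-wide values in percent.
    container_usage / container_thresholds: dicts keyed by container name,
    values in percent. Containers without a threshold are ignored.
    mem_free_mb / mem_free_threshold_mb: free memory and its floor in MB.
    """
    system_over = mem_usage > mem_usage_threshold
    container_over = any(
        usage > container_thresholds[name]
        for name, usage in container_usage.items()
        if name in container_thresholds
    )
    low_free = mem_free_mb <= mem_free_threshold_mb
    return system_over or container_over or low_free
```

Note that with `available_mem_threshold` set to 0 the derived usage threshold becomes 100, so the usage comparison can never be true; under these assumptions that matches the "0 to disable" semantics without a special case.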
|
||
The check will be implemented as a script that is run periodically by monit: | ||
|
||
``` | ||
check program mem_checker with path "/usr/bin/mem_threshold_check" | ||
if status != 0 for 10 times within 20 cycles then exec "/usr/local/bin/mem_threshold_check_handler" | ||
``` | ||
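
A check script of this kind could look roughly like the sketch below. It is an assumption about the shape of such a script, not the real `/usr/bin/mem_threshold_check`: the helper names `parse_meminfo` and `check_status` are hypothetical, only the system-wide part of the condition is covered, and the defaults (10 % and 200 MB) are taken from the schema defaults in this document.

```python
def parse_meminfo(text):
    """Parse the content of /proc/meminfo into a dict of integer values."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            # Most entries are reported in kB; the unit suffix is dropped.
            values[key.strip()] = int(fields[0])
    return values


def check_status(meminfo_text, available_mem_threshold=10.0,
                 min_available_mem_mb=200):
    """Return 0 when memory looks healthy and 1 when a threshold is crossed."""
    info = parse_meminfo(meminfo_text)
    available_kb = info["MemAvailable"]
    available_pct = 100.0 * available_kb / info["MemTotal"]
    # usage > (100 - threshold) is equivalent to available < threshold.
    if available_mem_threshold > 0 and available_pct < available_mem_threshold:
        return 1
    if available_kb / 1024 <= min_available_mem_mb:
        return 1
    return 0

# A real script would end with something like:
#   sys.exit(check_status(open("/proc/meminfo").read()))
# so that monit sees a non-zero status when a threshold is crossed.
```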
|
||
The action runs only once, when the memory check script detects memory usage above the threshold. | ||
@qiluo-msft I mean that the techsupport will run once if the script returns a non-zero status code 10 times within 20 cycles. If memory usage is constantly too high for e.g. 100 cycles, only 1 dump will be taken. If, let's say, the memory usage fluctuates, for example 10 cycles too high in a row, then down for 10 cycles, and then up again, there is a protection against too frequent auto tech dumps already present in the auto tech infrastructure.
@qiluo-msft There are "rate-limit-interval" and "max-techsupport-limit" parameters in the AUTO_TECHSUPPORT table, already part of the auto tech infrastructure, that limit the number of dumps. This is only applicable for automatically generated dumps.
I see. Is auto rotation covered by design?
Yes |
||
|
||
The "10 times within 20 cycles" part is kept in sync with the mem_usage alert from the sonic-host monit configuration. | ||
Those values could be made configurable as well; however, only the threshold value is planned to be configurable. | ||
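
The "10 times within 20 cycles" behaviour, combined with the action running only once, can be modelled with a sliding window. The sketch below is a rough model of the behaviour described in this section, not monit's actual implementation; the class name `WindowedTrigger` is hypothetical.

```python
from collections import deque


class WindowedTrigger:
    """Fire once when a check fails `times` times within the last `cycles`."""

    def __init__(self, times=10, cycles=20):
        self.times = times
        self.window = deque(maxlen=cycles)  # keeps only the last `cycles` results
        self.fired = False

    def observe(self, failed):
        """Record one polling cycle; return True only when the action should run."""
        self.window.append(bool(failed))
        matched = sum(self.window) >= self.times
        if matched and not self.fired:
            self.fired = True
            return True
        if not matched:
            # Condition cleared, so a later crossing may fire again.
            self.fired = False
        return False
```

In this model, 100 consecutive failing cycles produce a single action; the action can run again only after the failure count in the window has dropped below 10.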
|
||
The rate limit and the maximum techsupport limit also apply to techsupport dumps generated by the memory check. | ||
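
A rate-limit check of this kind can be sketched as follows. The function name `is_rate_limited` is hypothetical, timestamps are assumed to be epoch seconds, and an interval of 0 is assumed to switch rate limiting off, which is one reading of "Configure 0 to explicitly disable".

```python
def is_rate_limited(now, last_invocation, rate_limit_interval):
    """Return True when a new techsupport invocation should be skipped.

    now, last_invocation: epoch seconds; last_invocation is None when no
    techsupport has been taken yet.
    rate_limit_interval: minimum seconds between two successive invocations;
    0 is treated as "no rate limiting" (an assumption).
    """
    if rate_limit_interval <= 0 or last_invocation is None:
        return False
    return (now - last_invocation) < rate_limit_interval
```

With the global default of 180 seconds, a memory-triggered invocation 100 seconds after the previous dump would be skipped, while one 200 seconds later would proceed.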
|
||
#### 202106 and older | ||
|
||
To support techsupport generation on memory leaks, a simple rule is added to monit: | ||
|
||
``` | ||
check system $HOST | ||
if memory usage > 90% for 10 times within 20 cycles then exec /usr/bin/generate_dump | ||
``` | ||
|
||
## 5. Schema Additions | ||
|
||
### Config DB | ||
|
||
#### AUTO_TECHSUPPORT Table | ||
``` | ||
key = "AUTO_TECHSUPPORT|global" | ||
state = "enabled" / "disabled" ; Enable this to make the Techsupport Invocation event driven based on core-dump generation | ||
available_mem_threshold = 1*2DIGIT ; Memory threshold; 0 to disable techsupport invocation on mem leak. | ||
min_available_mem = 1*5DIGIT ; Minimum free memory amount in MB when techsupport will be executed. | ||
rate_limit_interval = 1*5DIGIT ; Minimum Time in seconds, between two successive techsupport invocations. | ||
Manual Invocations will be considered as well in the calculation. | ||
Configure 0 to explicitly disable | ||
|
@@ -99,6 +152,7 @@ since = 1*32VCHAR; ; This limits the auto-invoke | |
``` | ||
key = feature name | ||
state = "enabled" / "disabled" ; Enable auto techsupport invocation on the critical processes running inside this feature | ||
available_mem_threshold = 1*2DIGIT ; Memory threshold; 0 to disable techsupport invocation on mem leak in this container. | ||
rate_limit_interval = 1*5DIGIT ; Rate limit interval for the corresponding feature. Configure 0 to explicitly disable | ||
``` | ||
|
||
|
@@ -143,6 +197,18 @@ module sonic-auto_techsupport { | |
type stypes:admin_mode; | ||
} | ||
|
||
leaf available_mem_threshold { | ||
description "Enable techsupport invocation on available memory threshold crossing; 0 to disable"; | ||
type decimal-repr; | ||
default 10.0; | ||
} | ||
|
||
leaf min_available_mem { | ||
description "Minimum free memory amount in MB when techsupport will be executed" | ||
type uint32; | ||
default 200; | ||
} | ||
|
||
leaf rate_limit_interval { | ||
description "Minimum time in seconds between two successive techsupport invocations. Configure 0 to explicitly disable"; | ||
type uint16; | ||
|
@@ -206,6 +272,12 @@ module sonic-auto_techsupport { | |
type stypes:admin_mode; | ||
} | ||
|
||
leaf available_mem_threshold { | ||
description "Enable techsupport invocation on available memory threshold crossing; 0 to disable"; | ||
type decimal-repr; | ||
default 10.0; | ||
} | ||
|
||
leaf rate_limit_interval { | ||
description "Rate limit interval for the corresponding feature. Configure 0 to explicitly disable"; | ||
type uint16; | ||
|
@@ -226,27 +298,43 @@ module sonic-auto_techsupport { | |
#### AUTO_TECHSUPPORT_DUMP_INFO Table | ||
@qiluo-msft There is a rate limiting mechanism already available in the auto techsupport infrastructure that limits the number of dumps that can be generated. |
||
``` | ||
key = Techsupport Dump Name | ||
core_dump = 1*64VCHAR ; Core Dump Name | ||
timestamp = 1*12DIGIT ; epoch of this record creation | ||
container_name = 1*64VCHAR ; Container in which the process crashed | ||
event_type = "core" / "memory" ; Type of event caused techsupport invocation | ||
core_dump = 1*64VCHAR ; Core Dump Name | ||
timestamp = 1*12DIGIT ; epoch of this record creation | ||
container_name = 1*64VCHAR ; Container in which the process crashed or the memory threshold was crossed. Unset when triggered from host. | ||
``` | ||
|
||
Eg: | ||
|
||
``` | ||
hgetall "AUTO_TECHSUPPORT_DUMP_INFO|sonic_dump_sonic_20210412_223645" | ||
1) "core_dump" | ||
2) "orchagent.1599047232.39.core" | ||
1) "event_type" | ||
2) "core" | ||
3) "core_dump" | ||
4) "orchagent.1599047232.39.core" | ||
5) "timestamp" | ||
6) "1599047233" | ||
7) "container_name" | ||
8) "swss" | ||
``` | ||
|
||
``` | ||
hgetall "AUTO_TECHSUPPORT_DUMP_INFO|sonic_dump_sonic_20210412_223123" | ||
1) "event_type" | ||
2) "memory" | ||
3) "timestamp" | ||
4) "1599047233" | ||
4) "1612045251" | ||
5) "container_name" | ||
6) "swss" | ||
``` | ||
|
||
|
||
## 5. CLI Enhancements. | ||
## 6. CLI Enhancements. | ||
|
||
### config cli | ||
``` | ||
config auto-techsupport global state <enabled/disabled> | ||
config auto-techsupport global available-mem-threshold <float upto two decimal places> | ||
config auto-techsupport global min-available-mem <uint32, in MB> | ||
config auto-techsupport global rate-limit-interval <uint16> | ||
config auto-techsupport global max-techsupport-limit <float upto two decimal places> | ||
config auto-techsupport global max-core-limit <float upto two decimal places> | ||
|
@@ -261,26 +349,26 @@ config auto-techsupport-feature delete restapi | |
|
||
``` | ||
admin@sonic:~$ show auto-techsupport global | ||
STATE RATE LIMIT INTERVAL (sec) MAX TECHSUPPORT LIMIT (%) MAX CORE SIZE (%) SINCE | ||
------- --------------------------- -------------------------- ------------------ ---------- | ||
enabled 180 10.0 5.0 2 days ago | ||
STATE RATE LIMIT INTERVAL (sec) MAX TECHSUPPORT LIMIT (%) MAX CORE SIZE (%) MEM THRESHOLD (%) MIN AVAILABLE MEM (MB) SINCE | ||
------- --------------------------- -------------------------- ------------------ ------------------ ------------------------ --------- | ||
enabled 180 10.0 5.0 10.0 200 2 days ago | ||
|
||
admin@sonic:~$ show auto-techsupport-feature | ||
FEATURE NAME STATE RATE LIMIT INTERVAL (sec) | ||
-------------- -------- -------------------------- | ||
bgp enabled 600 | ||
database enabled 600 | ||
dhcp_relay enabled 600 | ||
lldp enabled 600 | ||
macsec enabled 600 | ||
mgmt-framework enabled 600 | ||
nat enabled 600 | ||
pmon enabled 600 | ||
radv enabled 600 | ||
restapi disabled 800 | ||
sflow enabled 600 | ||
snmp enabled 600 | ||
swss disabled 800 | ||
FEATURE NAME STATE MEM THRESHOLD (%) RATE LIMIT INTERVAL (sec) | ||
-------------- -------- ------------------ -------------------------- | ||
bgp enabled 10.0 600 | ||
database enabled 10.0 600 | ||
dhcp_relay enabled 10.0 600 | ||
lldp enabled 10.0 600 | ||
macsec enabled 10.0 600 | ||
mgmt-framework enabled 10.0 600 | ||
nat enabled 10.0 600 | ||
pmon enabled 10.0 600 | ||
radv enabled 10.0 600 | ||
restapi disabled 10.0 800 | ||
sflow enabled 10.0 600 | ||
snmp enabled 10.0 600 | ||
swss disabled 10.0 800 | ||
|
||
|
||
admin@sonic:~$ show auto-techsupport history | ||
|
@@ -374,7 +462,7 @@ Enhance the existing techsupport sonic-mgmt test with the following cases. | |
| 2 | Check if the techsupport cleanup is working as expected | | ||
| 3 | Check if the global rate-limit-interval & per-process rate-limit-interval is working as expected | | ||
| 4 | Check if the core-dump cleanup is working as expected | | ||
|
||
| 5 | Check if the techsupport is generated when reaching memory threshold | | ||
## 8. SONiC-to-SONiC Upgrade Considerations | ||
|
||
The default config required for auto_techsupport is present in the init_cfg.json. Therefore, when a clean installation of SONiC is performed, the configuration is found in the config DB and the feature is active. | ||
|
@@ -391,6 +479,7 @@ Load this Example config provided below to enable the feature. Each of the field | |
"rate_limit_interval": "180", | ||
"max_techsupport_limit": "10.0", | ||
"max_core_limit": "5.0", | ||
"available_mem_threshold": "10.0", | ||
"since": "2 days ago" | ||
} | ||
}, | ||
|
@@ -461,4 +550,8 @@ Load this Example config provided below to enable the feature. Each of the field | |
} | ||
} | ||
} | ||
``` | ||
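
Before loading a config like the one above, the memory-related fields could be sanity-checked against the ranges implied by the schema. The sketch below is illustrative only: `validate_global_config` is a hypothetical helper, it assumes Config DB values arrive as strings, and the bounds are taken from the types shown in this document (a percentage for `available_mem_threshold`, `uint32` for `min_available_mem`, `uint16` for `rate_limit_interval`).

```python
def validate_global_config(cfg):
    """Return a list of problems found in an AUTO_TECHSUPPORT|global entry."""
    problems = []

    try:
        threshold = float(cfg.get("available_mem_threshold", "10.0"))
        if not 0.0 <= threshold < 100.0:
            problems.append("available_mem_threshold must be in [0, 100)")
    except ValueError:
        problems.append("available_mem_threshold must be a number")

    # Unsigned integer fields and their upper bounds.
    bounds = (("min_available_mem", 2**32 - 1),
              ("rate_limit_interval", 2**16 - 1))
    for field, upper in bounds:
        raw = cfg.get(field)
        if raw is None:
            continue  # field not present, nothing to check
        if not raw.isdigit() or int(raw) > upper:
            problems.append(f"{field} must be an integer in [0, {upper}]")

    return problems
```

An empty list means the checked fields look sane; fields not listed in the sketch are ignored rather than rejected.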
|
||
# Open question | ||
|
||
1. Is 10 % free memory/90 % used memory threshold a reasonable default? |
Instead of invoking a full `show techsupport`, can we just save volatile information before reboot? For example, it makes no sense to save syslog during the memory pressure time. We could collect them after reboot. #Closed
@qiluo-msft The user can configure the "since" parameter in CFG_DB and achieve that result. If "since" is set to "0 sec", no logs, sai, or swss recs are recorded.