[auto-techsupport] support techsupport generation on potential memory… #939

Merged
merged 9 commits into from
Mar 20, 2022
157 changes: 125 additions & 32 deletions doc/auto_techsupport_and_coredump_mgmt.md
Expand Up @@ -8,17 +8,17 @@
* [2. High Level Requirements](#2-high-level-requirements)
* [3. Core Dump Generation in SONiC](#3-core-dump-generation-in-sonic)
* [4. Schema Additions](#4-schema-additions)
* [5. CLI Enhancements](#5-cli-enhancements)
* [6. Design](#6-design)
* [6. CLI Enhancements](#6-cli-enhancements)
* [7. Design](#6-design)
* [6.1 Modifications to coredump-compress script](#61-Modifications-to-coredump-compress-script)
* [6.2 coredump_gen_handler script](#62-coredump_gen_handler-script)
* [6.3 Modifications to generate_dump script](#64-Modifications-to-generate-dump-script)
* [6.4 techsupport_cleanup script](#65-techsupport_cleanup-script)
* [6.5 Warmboot consideration](#65-Warmboot-consideration)
* [6.6 MultiAsic consideration](#66-MultiAsic-consideration)
* [6.7 Design choices for max-core-limit & max-techsupport-limit arguments](#67-Design-choices-for-max-core-limit-&-max-techsupport-limit-arguments)
* [7. Test Plan](#7-Test-Plan)
* [8. SONiC-to-SONiC Upgrade Considerations](#8-SONiC-to-SONiC-Upgrade-Considerations)
* [8. Test Plan](#7-Test-Plan)
* [9. SONiC-to-SONiC Upgrade Considerations](#8-SONiC-to-SONiC-Upgrade-Considerations)


### Revision
Expand All @@ -27,6 +27,7 @@
| 1.0 | 06/22/2021 | Vivek Reddy Karri | Auto Invocation of Techsupport, triggered by a core dump |
| 1.1 | TBD | Vivek Reddy Karri | Add the capability to Register/Deregister app extension to AUTO_TECHSUPPORT_FEATURE table |
| 2.0 | TBD | Vivek Reddy Karri | Extending Support for Kernel Dumps |
| 3.0 | 02/2022 | Stepan Blyshchak | Extending Support for memory usage threshold crossing |

## About this Manual
This document describes the details of the system which facilitates the auto techsupport invocation support in SONiC. The auto invocation is triggered when any process inside the docker crashes and a core dump is generated.
Expand All @@ -36,13 +37,20 @@ Currently, techsupport is run by invoking `show techsupport` either by orchestra

However, if the techsupport invocation can be made event-driven based on core dump generation, that would improve debuggability. That is the overall idea behind this HLD. All the high-level requirements are summarized in the next section.

Another use case is gathering more information about the system when a memory usage threshold is crossed.
A SONiC dump generated after the system reboots due to an out-of-memory condition is not enough to debug the issue,
because all information about processes and their memory usage, including smaps (/proc/PID/smaps), is lost.
Contributor

@qiluo-msft qiluo-msft Mar 10, 2022

lost

Instead of invoking a full show techsupport, can we just save volatile information before reboot? For example, it makes no sense to save syslog during the memory pressure time. We could collect them after reboot. #Closed

Contributor Author

@qiluo-msft The user can configure the "since" parameter in CFG_DB and achieve that result. If "since" is set to "0 sec", no logs, sai, or swss recs are recorded.

Once the system detects abnormal memory usage, a SONiC dump is generated automatically.

## 2. High Level Requirements
### Global Scope
* Techsupport invocation should also be made event-driven based on core dump generation.
* This is only applicable for the processes running inside the dockers. Does not apply for other processes.
* init_cfg.json will be enhanced to include the "CONFIG" required for this feature (described in section 4) and is enabled by default.
* To provide flexibility, a compile time flag "ENABLE_AUTO_TECH_SUPPORT" should be provided to enable/disable the "CONFIG" for this feature.
* Users should have the ability to enable/disable this capability through CLI.
* Techsupport invocation should also be made event-driven based on memory usage threshold crossing.
* The memory usage threshold should be configurable system-wide and per container.

### Configurable Params
* A configurable "rate_limit_interval" should be introduced to limit the number of consecutive techsupport invocations.
Expand All @@ -68,14 +76,59 @@ The naming format and compression is governed by the script `/usr/local/bin/core

Where `<comm>` is the command name associated with a process. The comm value of a running process can be read from the `/proc/[pid]/comm` file.

## 4. Schema Additions
## 4. Memory usage based techsupport invocation

If the following condition resolves to true:
```
(mem_usage > mem_usage_threshold || ${container}_mem_usage > ${container}_mem_usage_threshold) || mem_free <= mem_free_threshold
```

where:
* ```mem_usage``` is the total system memory in use, as a percentage (computed from MemAvailable in /proc/meminfo),
* ```mem_usage_threshold``` is the configured usage threshold, (100 - available_mem_threshold),
* ```${container}_mem_usage``` is the memory used by $container ("docker stats --no-stream --format {{.MemUsage}}" $container),
* ```${container}_mem_usage_threshold``` is the configured usage threshold for $container, (100 - available_mem_threshold of $container),
* ```mem_free``` is the total memory minus the memory in use,
* ```mem_free_threshold``` is the free memory threshold (min_available_mem),

the SONiC techsupport is automatically generated.
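
As a rough illustration of how the per-container figure could be turned into a percentage, the sketch below parses a `{{.MemUsage}}` string of the form "used / limit". The function name, the accepted unit set, and the sample strings are assumptions made for this example; they are not taken from the actual check script.

```python
import re

# Binary unit multipliers. The accepted unit set is an assumption
# about the "used / limit" strings printed by docker stats.
_UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}
_SIZE = r"([\d.]+)\s*([KMGT]iB|B)"


def container_mem_usage_percent(mem_usage):
    """Convert a 'used / limit' string into used memory as a percentage."""
    match = re.fullmatch(rf"\s*{_SIZE}\s*/\s*{_SIZE}\s*", mem_usage)
    if match is None:
        raise ValueError(f"unrecognized MemUsage value: {mem_usage!r}")
    used = float(match.group(1)) * _UNITS[match.group(2)]
    limit = float(match.group(3)) * _UNITS[match.group(4)]
    if limit == 0:
        raise ValueError("memory limit is zero")
    return 100.0 * used / limit


if __name__ == "__main__":
    # Hypothetical sample value, not captured from a real device.
    print(container_mem_usage_percent("512MiB / 1GiB"))
```

The resulting percentage can then be compared against (100 - available_mem_threshold) for the container.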

```mem_free_threshold``` ensures that the dump is invoked while there is still enough memory left to execute "show techsupport" successfully. The default is 200 MB, since executing "show techsupport" takes at least 80-90 MB.
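
The condition can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the actual `mem_threshold_check` script: the function names, the percentage units for usage, and the rule that a threshold of 0 disables only the corresponding percentage check are all assumptions made for the example.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info


def should_invoke_techsupport(meminfo, available_mem_threshold,
                              min_available_mem_mb,
                              container_usage=None,
                              container_thresholds=None):
    """Return True when any part of the condition holds.

    Thresholds are percentages of *available* memory, so the usage
    threshold is 100 minus the configured value. Assumption: a value
    of 0 disables that percentage check.
    """
    total_kb = meminfo["MemTotal"]
    available_kb = meminfo["MemAvailable"]
    mem_usage = 100.0 * (total_kb - available_kb) / total_kb
    mem_free_mb = available_kb / 1024.0

    # System-wide usage check.
    if available_mem_threshold and mem_usage > 100.0 - available_mem_threshold:
        return True
    # Per-container usage checks (usage given in percent).
    for name, usage in (container_usage or {}).items():
        threshold = (container_thresholds or {}).get(name, 0)
        if threshold and usage > 100.0 - threshold:
            return True
    # Absolute floor that keeps enough memory for "show techsupport".
    return mem_free_mb <= min_available_mem_mb
```

With the defaults from this document (10 % available memory, 200 MB floor), a system with 8 GB total and 4 GB available would not trigger a dump, while one with 400 MB available would.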

The check will be implemented as a script that is run by monit periodically:

```
check program mem_checker with path "/usr/bin/mem_threshold_check"
if status != 0 for 10 times within 20 cycles then exec "/usr/local/bin/mem_threshold_check_handler"
```

The action is run only once the memory check script detects memory usage above the threshold.
Contributor

@qiluo-msft qiluo-msft Mar 10, 2022

once

Do you mean once after each bootup? Considering some fluctuations and finally a memory exhausting, you will keep only the first fluctuation "show techsupport" file? #Closed

Contributor Author

@stepanblyschak stepanblyschak Mar 16, 2022

@qiluo-msft I mean that the techsupport will run once if the script returns a non-zero status code 10 times within 20 cycles. If memory usage is constantly too high for, e.g., 100 cycles, only 1 dump will be taken. If the memory usage fluctuates, for example, 10 cycles too high in a row, then down for 10 cycles, and then up again, there is already protection against too frequent auto tech dumps in the auto tech infrastructure.

Contributor

@qiluo-msft qiluo-msft Mar 10, 2022

threshold

If auto invoking too many times, do we have file rotation? Do we only rotate the auto dump files and keep user manual dump? #Closed

Contributor Author

@qiluo-msft There are "rate-limit-interval", "max-techsupport-limit" parameters in AUTO_TECHSUPPORT table which are already part of auto tech infrastructure that limit the number of dumps. This is only applicable for automatically generated dumps.

Contributor

I see. Is auto rotation covered by design?

Contributor Author

Yes


The "10 times within 20 cycles" part is kept in sync with the mem_usage alert from the sonic-host monit configuration.
Those values could be made configurable; however, only the threshold value is planned to be configurable.

The rate limit as well as the techsupport maximum limit apply to techsupport dumps generated by the memory check.
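
The "10 times within 20 cycles" behavior can be modeled with a small sliding window. The sketch below is only a simulation of the rule as described in this section and in the review discussion (one action per continuous period above the threshold); the class name and the fire-on-transition behavior are assumptions for illustration, not monit's implementation.

```python
from collections import deque


class FailureWindow:
    """Tracks whether at least `times` of the last `cycles` checks failed."""

    def __init__(self, times=10, cycles=20):
        self.times = times
        self.results = deque(maxlen=cycles)  # True for each failed cycle
        self.matched = False

    def record(self, exit_status):
        """Record one check cycle.

        Returns True only on the cycle where the rule starts matching,
        so a long run of failures triggers the action a single time.
        """
        self.results.append(exit_status != 0)
        matched = sum(self.results) >= self.times
        fire = matched and not self.matched
        self.matched = matched
        return fire


if __name__ == "__main__":
    window = FailureWindow()
    fired = [window.record(1) for _ in range(100)]
    print(fired.count(True))
```

If usage drops long enough for the rule to stop matching and then rises again, the window fires a second time; at that point the existing rate limit in the auto techsupport infrastructure decides whether another dump is actually taken.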

#### 202106 and older

To support techsupport generation on memory leaks, a simple monit rule is added:

```
check system $HOST
if memory usage > 90% for 10 times within 20 cycles then exec /usr/bin/generate_dump
```

## 5. Schema Additions

### Config DB

#### AUTO_TECHSUPPORT Table
```
key = "AUTO_TECHSUPPORT|global"
state = "enabled" / "disabled" ; Enable this to make the Techsupport Invocation event driven based on core-dump generation
available_mem_threshold = 1*2DIGIT ; Memory threshold; 0 to disable techsupport invocation on mem leak.
min_available_mem = 1*5DIGIT ; Minimum free memory amount in MB when techsupport will be executed.
rate_limit_interval = 1*5DIGIT ; Minimum Time in seconds, between two successive techsupport invocations.
Manual Invocations will be considered as well in the calculation.
Configure 0 to explicitly disable
Expand All @@ -99,6 +152,7 @@ since = 1*32VCHAR; ; This limits the auto-invoke
```
key = feature name
state = "enabled" / "disabled" ; Enable auto techsupport invocation on the critical processes running inside this feature
available_mem_threshold = 1*2DIGIT ; Memory threshold; 0 to disable techsupport invocation on mem leak in this container.
rate_limit_interval = 1*5DIGIT ; Rate limit interval for the corresponding feature. Configure 0 to explicitly disable
```

Expand Down Expand Up @@ -143,6 +197,18 @@ module sonic-auto_techsupport {
type stypes:admin_mode;
}

leaf available_mem_threshold {
description "Enable techsupport invocation on available memory threshold crossing; 0 to disable";
type decimal-repr;
default 10.0;
}

leaf min_available_mem {
description "Minimum free memory amount in MB when techsupport will be executed";
type uint32;
default 200;
}

leaf rate_limit_interval {
description "Minimum time in seconds between two successive techsupport invocations. Configure 0 to explicitly disable";
type uint16;
Expand Down Expand Up @@ -206,6 +272,12 @@ module sonic-auto_techsupport {
type stypes:admin_mode;
}

leaf available_mem_threshold {
description "Enable techsupport invocation on available memory threshold crossing; 0 to disable";
type decimal-repr;
default 10.0;
}

leaf rate_limit_interval {
description "Rate limit interval for the corresponding feature. Configure 0 to explicitly disable";
type uint16;
Expand All @@ -226,27 +298,43 @@ module sonic-auto_techsupport {
#### AUTO_TECHSUPPORT_DUMP_INFO Table
Contributor

@qiluo-msft qiluo-msft Mar 10, 2022

Table

Is there an entry number limit in this table? #Closed

Contributor Author

@qiluo-msft There is a rate limiting mechanism already available in auto techsupport infrastructure that limits the number of dumps that can be generated.

```
key = Techsupport Dump Name
core_dump = 1*64VCHAR ; Core Dump Name
timestamp = 1*12DIGIT ; epoch of this record creation
container_name = 1*64VCHAR ; Container in which the process crashed
event_type = "core" / "memory" ; Type of event that caused the techsupport invocation
core_dump = 1*64VCHAR ; Core Dump Name
timestamp = 1*12DIGIT ; epoch of this record creation
container_name = 1*64VCHAR ; Container in which the process crashed or the memory threshold was crossed. Unset when triggered from host.
```

Eg:

```
hgetall "AUTO_TECHSUPPORT_DUMP_INFO|sonic_dump_sonic_20210412_223645"
1) "core_dump"
2) "orchagent.1599047232.39.core"
1) "event_type"
2) "core"
2) "core_dump"
3) "orchagent.1599047232.39.core"
4) "timestamp"
5) "1599047233"
6) "container_name"
7) "swss"
```

```
hgetall "AUTO_TECHSUPPORT_DUMP_INFO|sonic_dump_sonic_20210412_223123"
1) "event_type"
2) "memory"
3) "timestamp"
4) "1599047233"
4) "1612045251"
5) "container_name"
6) "swss"
```
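
To make the difference between the two event types concrete, the sketch below builds the field set for one AUTO_TECHSUPPORT_DUMP_INFO entry. It is an illustration only: the helper name `build_dump_info` and its validation rules are assumptions based on the schema above, not code from the feature.

```python
import time


def build_dump_info(event_type, container_name=None, core_dump=None,
                    timestamp=None):
    """Build the fields of one AUTO_TECHSUPPORT_DUMP_INFO entry."""
    if event_type not in ("core", "memory"):
        raise ValueError(f"unknown event_type: {event_type!r}")
    if event_type == "core" and not core_dump:
        raise ValueError("a core event needs a core dump name")

    if timestamp is None:
        timestamp = int(time.time())
    entry = {"event_type": event_type, "timestamp": str(timestamp)}
    # core_dump is only meaningful for core events.
    if event_type == "core":
        entry["core_dump"] = core_dump
    # container_name is left unset when the event is triggered from the host.
    if container_name:
        entry["container_name"] = container_name
    return entry


if __name__ == "__main__":
    print(build_dump_info("memory", container_name="swss", timestamp=1612045251))
```

A memory event triggered by the system-wide threshold would carry only `event_type` and `timestamp`.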


## 5. CLI Enhancements.
## 6. CLI Enhancements.

### config cli
```
config auto-techsupport global state <enabled/disabled>
config auto-techsupport global available-mem-threshold <float up to two decimal places>
config auto-techsupport global min-available-mem <uint32>
config auto-techsupport global rate-limit-interval <uint16>
config auto-techsupport global max-techsupport-limit <float upto two decimal places>
config auto-techsupport global max-core-limit <float upto two decimal places>
Expand All @@ -261,26 +349,26 @@ config auto-techsupport-feature delete restapi

```
admin@sonic:~$ show auto-techsupport global
STATE RATE LIMIT INTERVAL (sec) MAX TECHSUPPORT LIMIT (%) MAX CORE SIZE (%) SINCE
------- --------------------------- -------------------------- ------------------ ----------
enabled 180 10.0 5.0 2 days ago
STATE RATE LIMIT INTERVAL (sec) MAX TECHSUPPORT LIMIT (%) MAX CORE SIZE (%) MEM THRESHOLD (%) MEM THRESHOLD (%) SINCE
------- --------------------------- -------------------------- ------------------ ------------------ ------------------- ---------
enabled 180 10.0 5.0 10.0 10.0 2 days ago

admin@sonic:~$ show auto-techsupport-feature
FEATURE NAME STATE RATE LIMIT INTERVAL (sec)
-------------- -------- --------------------------
bgp enabled 600
database enabled 600
dhcp_relay enabled 600
lldp enabled 600
macsec enabled 600
mgmt-framework enabled 600
nat enabled 600
pmon enabled 600
radv enabled 600
restapi disabled 800
sflow enabled 600
snmp enabled 600
swss disabled 800
FEATURE NAME STATE MEM THRESHOLD (%) RATE LIMIT INTERVAL (sec)
-------------- -------- ------------------ --------------------------
bgp enabled 10.0 600
database enabled 10.0 600
dhcp_relay enabled 10.0 600
lldp enabled 10.0 600
macsec enabled 10.0 600
mgmt-framework enabled 10.0 600
nat enabled 10.0 600
pmon enabled 10.0 600
radv enabled 10.0 600
restapi disabled 10.0 800
sflow enabled 10.0 600
snmp enabled 10.0 600
swss disabled 10.0 800


admin@sonic:~$ show auto-techsupport history
Expand Down Expand Up @@ -374,7 +462,7 @@ Enhance the existing techsupport sonic-mgmt test with the following cases.
| 2 | Check if the techsupport cleanup is working as expected |
| 3 | Check if the global rate-limit-interval & per-process rate-limit-interval is working as expected |
| 4 | Check if the core-dump cleanup is working as expected |

| 5 | Check if the techsupport dump is generated when the memory threshold is reached |

## 8. SONiC-to-SONiC Upgrade Considerations

The default config required for auto_techsupport is present in the init_cfg.json. Therefore, when a clean installation of SONiC is performed, the configuration is found in the config DB and the feature is active.
Expand All @@ -391,6 +479,7 @@ Load this Example config provided below to enable the feature. Each of the field
"rate_limit_interval": "180",
"max_techsupport_limit": "10.0",
"max_core_limit": "5.0",
"available_mem_threshold": "10.0",
"since": "2 days ago"
}
},
Expand Down Expand Up @@ -461,4 +550,8 @@ Load this Example config provided below to enable the feature. Each of the field
}
}
}
```

# Open question

1. Is a threshold of 10% free memory (90% used memory) a reasonable default?