Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Syslog rate limit design #1049

Merged
merged 8 commits into from
Nov 8, 2022

Conversation

Junchao-Mellanox
Copy link
Contributor

@Junchao-Mellanox Junchao-Mellanox commented Aug 8, 2022

Overview:

Logging in SONiC is organized with rsyslogd. Each container has its own rsyslogd instance plus a daemon running on a host. The rsyslogd instance which is running on the host is used to collect the messages from within containers and store them at /var/log/syslog path. Rsyslog config file are generated from templates:

Currently, each container has hardcoded message rate limiting to avoid receiving flooded log messages:

$SystemLogRateLimitInterval 300
$SystemLogRateLimitBurst 20000

There is no rate limiting configured on host side for now.

The SystemLogRateLimitInterval determines the amount of time that is being measured for rate limiting. The SystemLogRateLimitBurst defines the amount of messages, that have to occur in the time limit of SystemLogRateLimitInterval, to trigger rate limiting. For example, SystemLogRateLimitInterval=300, SystemLogRateLimitBurst=20000, it means that if one daemon generate more than 20000 messages in 300 seconds, rsyslogd will start to drop messages after that(FIFO).

This feature allows user to configure SystemLogRateLimitInterval and SystemLogRateLimitBurst for host, containers.

Related PR:

PR title state context
[sonic-host-services] Support host side rate limit configuration GitHub issue/pull request detail GitHub pull request check contexts
[sonic-utilities] Support host side rate limit configuration GitHub issue/pull request detail GitHub pull request check contexts
[YANG] Support host side rate limit configuration GitHub issue/pull request detail GitHub pull request check contexts
[infra] Support syslog rate limit configuration GitHub issue/pull request detail GitHub pull request check contexts
[containercfgd] Add containercfgd and syslog rate limit configuration support GitHub issue/pull request detail GitHub pull request check contexts
[sonic-mgmt] Add test case for syslog rate limit feature GitHub issue/pull request detail GitHub pull request check contexts

@Junchao-Mellanox
Copy link
Contributor Author

Hi @wen587 , comment fixed. Could you please review again?

@wen587 wen587 requested a review from qiluo-msft August 15, 2022 04:19
@wen587
Copy link
Contributor

wen587 commented Aug 15, 2022

LGTM

doc/syslog/syslog-rate-limit-design.md Outdated Show resolved Hide resolved
doc/syslog/syslog-rate-limit-design.md Outdated Show resolved Hide resolved
@Junchao-Mellanox Junchao-Mellanox requested review from venkatmahalingam, lguohan and nazariig and removed request for venkatmahalingam September 6, 2022 06:50
Copy link
Contributor

@wen587 wen587 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Please ask for more approvals to merge.

@Junchao-Mellanox
Copy link
Contributor Author

@venkatmahalingam , @lguohan , @madhupalu , @tzack000 , could you please review and sign-off?


![container_config_on_fly](/doc/syslog/images/container_rate_limit_config_flow.svg).

> Note: according to test, syslog rate limit configuration on host side would not affect container side.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Junchao-Mellanox IMHO, it's bit confusing comparing to how it's done in AUTO TECHSUPPORT, where global configuration overrides feature rate limit configuration. Here, using 'GLOBAL', does not reflect the actual intentions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about HOST?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Junchao-Mellanox unfortunately i don't have neither a strong opinion here, nor a good suggestion


![cli_rate_limit_config_flow](/doc/syslog/images/cli_rate_limit_config_flow.svg)

#### Host side flow
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Junchao-Mellanox this approach introduces double host side configuration: first time with rsyslog-config service and second time with hostcfgd

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, if configuration does not change, hostcfgd will not restart rsyslog

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Junchao-Mellanox how do you plan to achieve that? The only reliable way which i see is to parse the config_db.json (taking into consideration also init_cfg.json) and fill daemon's internal cache on start

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will parse /etc/rsyslog.conf for the first time, and comparing its configuration with CONFIG DB value.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Junchao-Mellanox what is the guarantee that rsyslog.conf will be up-to-date before the hostcfgd is started?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check hostcfgd logic: https://github.com/sonic-net/sonic-host-services/blob/bc8698d1d760fefedaeb4742ad19b25ef2b3c17b/scripts/hostcfgd#L1661. Hostcfgd will delay itself by waiting for other system services. Thus, rsyslog-config.service and rsyslog.service should be started before hostcfgd,


![host_config_on_fly](/doc/syslog/images/host_rate_limit_config_flow.svg).

#### Container side flow
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Junchao-Mellanox this approach introduces double container side configuration: first time with container preStartAction and second time with containercfgd.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, if configuration does not change, hostcfgd will not restart rsyslog

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Junchao-Mellanox how do you plan to achieve that? The only reliable way which i see is to parse the config_db.json (taking into consideration also init_cfg.json) and fill daemon's internal cache on start

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will parse /etc/rsyslog.conf for the first time, and comparing its configuration with CONFIG DB value.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Junchao-Mellanox makes sense


#### CLI change

Config rate limit:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Junchao-Mellanox why not to have like this:

config
|--- syslog
     |--- rate-limit
          |--- host --interval <interval> --burst <burst>
          |--- container <service_name> --interval <interval> --burst <burst>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. But I don't see an obviously benefit comparing to the current design. As the code has been implemented for a long time, I would like to keep it as rate-limit-host and rate-limit-container

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Junchao-Mellanox ok. Got it.


> Note: set interval or burst to 0 will disable rate limit.

Show rate limit:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Junchao-Mellanox why not to have like this:

show
|--- syslog
     |--- rate-limit
          |--- host
          |--- container <service_name>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. But I don't see an obviously benefit comparing to the current design. As the code has been implemented for a long time, I would like to keep it as rate-limit-host and rate-limit-container

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Junchao-Mellanox ok. Got it.

@Junchao-Mellanox Junchao-Mellanox requested review from nazariig and removed request for venkatmahalingam October 18, 2022 08:32
@liat-grozovik
Copy link
Collaborator

Last call on HLD review, @tzack000 and @venkatmahalingam if you have further comments this is the time. Otherwise I will cont with merging it.
PR should be ready very soon and will be tracked with this PR description.

@zhangyanzhao
Copy link
Collaborator

@qiluo-msft can you please merge this PR? Thanks.

@liat-grozovik liat-grozovik merged commit bd2cfde into sonic-net:master Nov 8, 2022
@kannankvs
Copy link
Collaborator

@Junchao-Mellanox @liat-grozovik The document seems to have special characters and hence the page deployment workflow is failing. Can you check if the failure on page deployment workflow is not due to this merge ?

@Junchao-Mellanox
Copy link
Contributor Author

@Junchao-Mellanox @liat-grozovik The document seems to have special characters and hence the page deployment workflow is failing. Can you check if the failure on page deployment workflow is not due to this merge ?

Hi, what do you mean by "page deployment workflow"? where can I get the log?

@zhangyanzhao
Copy link
Collaborator

@StormLiangMS can you please back port this to 202211? Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.