[memory_monitoring] Enhance monitoring the memory usage of containers #19179

Merged (26 commits) on Sep 6, 2024

@FengPan-Frank (Contributor) commented Jun 3, 2024

Why I did it

We need to restrict the memory usage of individual containers. Reading the cgroup subsystem files is more reliable than running "docker stats", because the command stops responding once a container hits its hard memory limit.

Work item tracking
  • Microsoft ADO (number only):28334202

How I did it

Instead of depending on the output of docker stats, the background script memory_checker calculates a container's memory usage from values read from the cgroup subsystem files /sys/fs/cgroup/memory/docker/<container_name>/memory.usage_in_bytes and /sys/fs/cgroup/memory/docker/<container_name>/memory.stat.

According to the official Docker documentation (https://docs.docker.com/engine/reference/commandline/stats/#extended-description), the memory usage that docker stats reports for a container equals the total memory usage minus the cache usage, so memory_checker applies the same subtraction to stay consistent with the docker stats output.
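The calculation described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper names `parse_stat_field` and `container_memory_usage` are hypothetical, the statistics file is opened under its cgroup v1 kernel name `memory.stat`, and the choice of `total_inactive_file` as the "cache" field is an assumption, since the description only says "cache usage".

```python
from pathlib import Path

# Assumed cgroup v1 layout taken from the PR description; the directory
# name under docker/ is supplied by the caller.
CGROUP_DOCKER_ROOT = Path("/sys/fs/cgroup/memory/docker")


def parse_stat_field(stat_text, field):
    """Return the integer value of `field` from memory.stat-style text.

    Each line of the file has the form "<name> <value>".
    """
    for line in stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == field:
            return int(parts[1])
    raise KeyError(field)


def container_memory_usage(container_dir, root=CGROUP_DOCKER_ROOT,
                           cache_field="total_inactive_file"):
    """Return total usage minus cache usage, in bytes.

    `cache_field` is an assumption: the PR text only says "cache usage".
    """
    base = Path(root) / container_dir
    total = int((base / "memory.usage_in_bytes").read_text().strip())
    cache = parse_stat_field((base / "memory.stat").read_text(), cache_field)
    # Guard against a negative result if the two files are read at
    # slightly different moments.
    return max(total - cache, 0)
```

Because this only reads two small files, it does not depend on the Docker daemon responding, which is the motivation stated in "Why I did it".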

How to verify it

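Verified locally. Because this is only an internal change to how container memory usage is obtained, the comparison below shows the new memory_checker against the previous implementation based on "docker stats --no-stream --format {{.MemUsage}} telemetry".

<img width="799" alt="image" src="https://github.com/sonic-net/sonic-buildimage/assets/97083744/3807fc7f-cfc2-4e2f-a078-eaf08b68f803">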
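A comparison like this one could be automated along the following lines. This is a hedged sketch, not part of the PR: `mem_usage_to_bytes` and `within_tolerance` are hypothetical helpers, and the code assumes the `{{.MemUsage}}` string has the form "used / limit" with binary unit suffixes (B, KiB, MiB, GiB, TiB). Since docker stats prints a rounded value, the check uses a relative tolerance rather than exact equality.

```python
import re

# Binary unit multipliers assumed for the docker stats MemUsage string.
_UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}


def mem_usage_to_bytes(mem_usage):
    """Convert the 'used' half of e.g. '52MiB / 7.6GiB' to bytes."""
    used = mem_usage.split("/")[0].strip()
    match = re.fullmatch(r"([0-9]*\.?[0-9]+)\s*([KMGT]iB|B)", used)
    if match is None:
        raise ValueError(f"unrecognized MemUsage value: {mem_usage!r}")
    return int(float(match.group(1)) * _UNITS[match.group(2)])


def within_tolerance(cgroup_bytes, docker_stats_text, rel_tol=0.05):
    """True if the cgroup-derived value is close to the docker stats value."""
    docker_bytes = mem_usage_to_bytes(docker_stats_text)
    return abs(cgroup_bytes - docker_bytes) <= rel_tol * max(docker_bytes, 1)
```

The 5% default tolerance is arbitrary; the two readings are taken at different moments, so some drift is expected.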

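Added unit test code. Because this repo currently has no build-time UT, it was verified manually as shown below:
<img width="1121" alt="image" src="https://github.com/user-attachments/assets/2c7ce241-7967-41ee-a2e9-4bdb2e43f8c2">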

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@liushilongbuaa (Contributor)

/azpw ms_conflict -f

1 similar comment

@FengPan-Frank (Contributor Author)

/azpw ms_conflict -f

@FengPan-Frank (Contributor Author)

/azpw ms_checker -f

@liushilongbuaa (Contributor)

/azpw ms_checker

@FengPan-Frank (Contributor Author)

/azpw ms_checker

@qiluo-msft (Collaborator)

/azpw ms_checker

@qiluo-msft (Collaborator)

/azpw ms_conflict

@qiluo-msft merged commit 0f5166d into sonic-net:master on Sep 6, 2024 (23 checks passed)
FengPan-Frank added a commit to FengPan-Frank/sonic-buildimage that referenced this pull request Sep 9, 2024
…sonic-net#19179)

FengPan-Frank added a commit to FengPan-Frank/sonic-buildimage that referenced this pull request Sep 9, 2024
…sonic-net#19179)

mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Sep 11, 2024
…sonic-net#19179)

@mssonicbld (Collaborator)

Cherry-pick PR to 202311: #20234

mssonicbld pushed a commit that referenced this pull request Sep 12, 2024
…#19179)

vvolam pushed a commit to vvolam/sonic-buildimage that referenced this pull request Sep 12, 2024
…sonic-net#19179)

@zbud-msft (Contributor)

Hi @FengPan-Frank, publish_events is now deleted in this PR, which is not expected.

@FengPan-Frank (Contributor Author)

> Hi @FengPan-Frank, publish_events is now deleted in this PR, which is not expected.

@zbud-msft Sorry, it seems the publish_events call was missed. May I ask which test case failed, so that I can check it when I update this part of the code?

qiluo-msft pushed a commit that referenced this pull request Sep 23, 2024
### Why I did it

#19179 removed the call to publish_events that is made when a container's memory usage exceeds the threshold, causing test_events to fail.

### How I did it

Add back the call to publish_events.

#### How to verify it

Manual test