Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rsyslog service restart during eventd service start causes failures #20544

Closed
stepanblyschak opened this issue Oct 18, 2024 · 4 comments · Fixed by #20947
Closed

rsyslog service restart during eventd service start causes failures #20544

stepanblyschak opened this issue Oct 18, 2024 · 4 comments · Fixed by #20947
Assignees
Labels
Triaged this issue has been triaged

Comments

@stepanblyschak
Copy link
Collaborator

Description

Rsyslog service restart by eventd service causes service start failures:

Steps to reproduce the issue:

  1. Reboot

Describe the results you received:

2024 Oct 15 00:33:28.682673 sonic NOTICE admin: Starting bgp service...
2024 Oct 15 00:33:28.712769 sonic NOTICE admin: Warm boot flag: bgp true.
2024 Oct 15 00:33:28.718818 sonic NOTICE admin: Fast boot flag: bgp .
2024 Oct 15 00:33:28.781090 sonic INFO systemd[1]: rsyslog.service: Deactivated successfully.
2024 Oct 15 00:33:28.781536 sonic INFO systemd[1]: Stopped rsyslog.service - System Logging Service.
2024 Oct 15 00:33:28.822264 sonic INFO systemd[1]: Starting rsyslog.service - System Logging Service...
2024 Oct 15 00:33:28.825426 sonic NOTICE systemd[1]: rsyslog.service: Main process exited, code=exited, status=1/FAILURE
2024 Oct 15 00:33:28.825539 sonic WARNING systemd[1]: rsyslog.service: Failed with result 'exit-code'.
2024 Oct 15 00:33:28.825905 sonic ERR systemd[1]: Failed to start rsyslog.service - System Logging Service.
2024 Oct 15 00:33:28.827440 sonic ERR systemctl[6191]: Job for rsyslog.service failed because the control process exited with error code.
2024 Oct 15 00:33:28.827512 sonic ERR systemctl[6191]: See "systemctl status rsyslog.service" and "journalctl -xeu rsyslog.service" for details.
2024 Oct 15 00:33:28.828924 sonic NOTICE systemd[1]: eventd.service: Control process exited, code=exited, status=1/FAILURE
2024 Oct 15 00:33:28.829152 sonic WARNING systemd[1]: eventd.service: Failed with result 'exit-code'.
2024 Oct 15 00:33:28.829769 sonic ERR systemd[1]: Failed to start eventd.service - EVENTD container.
2024 Oct 15 00:33:28.880062 sonic INFO bgp.sh[6707]: Starting existing bgp container with HWSKU Mellanox-SN4700-O8V48
2024 Oct 15 00:33:29.090871 sonic INFO systemd[1]: rsyslog.service: Scheduled restart job, restart counter is at 1.
2024 Oct 15 00:33:29.091220 sonic INFO systemd[1]: Stopped rsyslog.service - System Logging Service.

When doing "warm-reboot" after the switch has booted up there is a crash of eventd:

2024 Oct 15 00:46:53.199914 sonic NOTICE root: Stopping database service...
2024 Oct 15 00:46:53.231922 sonic NOTICE root: DEVICE_SUBTYPE: , CONFIG_KNOB:
2024 Oct 15 00:46:53.234147 sonic NOTICE root: Warm boot flag: database true.
2024 Oct 15 00:46:53.236090 sonic NOTICE root: Fast boot flag: database true.
2024 Oct 15 00:46:53.238770 sonic NOTICE root: Killing Docker database...
2024 Oct 15 00:46:53.285571 sonic INFO containerd[1275]: time="2024-10-15T00:46:53.285451683+03:00" level=info msg="shim disconnected" id=bfab5e784ebb87a07890b6761f46297ec7c94b5a8a245f4a7871
e1f61234d982
2024 Oct 15 00:46:53.285678 sonic INFO containerd[1275]: time="2024-10-15T00:46:53.285514300+03:00" level=warning msg="cleaning up after shim disconnected" id=bfab5e784ebb87a07890b6761f46297
ec7c94b5a8a245f4a7871e1f61234d982 namespace=moby
2024 Oct 15 00:46:53.285732 sonic INFO containerd[1275]: time="2024-10-15T00:46:53.285534422+03:00" level=info msg="cleaning up dead shim"
2024 Oct 15 00:46:53.285788 sonic INFO dockerd[1358]: time="2024-10-15T00:46:53.285562129+03:00" level=info msg="ignoring event" container=bfab5e784ebb87a07890b6761f46297ec7c94b5a8a245f4a787
1e1f61234d982 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
2024 Oct 15 00:46:53.294631 sonic INFO containerd[1275]: time="2024-10-15T00:46:53.294547337+03:00" level=warning msg="cleanup warnings time=\"2024-10-15T00:46:53+03:00\" level=info msg=\"st
arting signal loop\" namespace=moby pid=43027 runtime=io.containerd.runc.v2\n"
2024 Oct 15 00:46:53.305913 sonic INFO systemd[1]: var-lib-docker-overlay2-26686d5cfd186c57cfae80bef2ac2042c2c3870822b0022ed85f3141aa7aab1e-merged.mount: Deactivated successfully.
2024 Oct 15 00:46:53.309127 sonic INFO database.sh[43006]: database
2024 Oct 15 00:46:53.309423 sonic INFO database.sh[4774]: 137
2024 Oct 15 00:46:53.311147 sonic INFO systemd[1]: database.service: Deactivated successfully.
2024 Oct 15 00:46:53.311402 sonic INFO systemd[1]: Stopped database.service - Database container.
2024 Oct 15 00:46:53.311537 sonic NOTICE rsyslog_plugin: :- publish: EVENT_PUBLISHED: {"sonic-events-host:event-stopped-ctr":{"ctr_name":"Database","timestamp":"2024-10-15T00:46:53.311471Z"}
}
2024 Oct 15 00:46:53.312052 sonic INFO dockerd[1358]: time="2024-10-15T00:46:53.311974412+03:00" level=info msg="Processing signal 'terminated'"
2024 Oct 15 00:46:53.312112 sonic INFO systemd[1]: Stopping docker.service - Docker Application Container Engine...
2024 Oct 15 00:46:53.314406 sonic INFO eventd#eventd#supervisord: eventd terminate called after throwing an instance of 'swss::RedisError'
2024 Oct 15 00:46:53.314406 sonic INFO eventd#eventd#supervisord: eventd   what():  RedisError: Failed to redisGetReply in RedisPipeline::pop, err=3: errstr=Server closed the connection
2024 Oct 15 00:46:53.867509 sonic INFO eventd#eventd#supervisord 2024-10-14 21:46:53,865 WARN exited: eventd (terminated by SIGABRT (core dumped); not expected)
2024 Oct 15 00:46:53.867509 sonic INFO eventd#eventd#supervisord 2024-10-14 21:46:53,866 WARN received SIGTERM indicating exit request

Describe the results you expected:

No errors during system start.

Output of show version:

SONiC Software Version: SONiC.202405_RC.31-d2ba308bd_Internal
SONiC OS Version: 12
Distribution: Debian 12.7
Kernel: 6.1.0-22-2-amd64
Build commit: d2ba308bd
Build date: Sun Oct 13 14:15:01 UTC 2024
Built by: sw-r2d2-bot@r-build-sonic-ci03-241

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

@zbud-msft
Copy link
Contributor

Looking into this issue

@zbud-msft zbud-msft self-assigned this Oct 21, 2024
@vdahiya12 vdahiya12 added the Triaged this issue has been triaged label Oct 23, 2024
@zbud-msft
Copy link
Contributor

zbud-msft commented Oct 28, 2024

Updated ETA: 12/5

@zbud-msft
Copy link
Contributor

@stepanblyschak What is the current impact of this issue? What blockers is it causing?

@stepanblyschak
Copy link
Collaborator Author

@zbud-msft

  1. This causes additional CPU load on system start as noted in PR comment.

  2. Due to rsyslog restart there could be missing syslog during boot which makes it harder to debug boot and upgrade issues.

  3. Also, I see container creation failures (syncd or could be any other docker container) which seem related to rsyslog service restart. This issue has low reproduce rate (we are testing hundreds of reboot/upgrade iterations), but can be reliably reproduced with the following steps:

Create a service and a script that create docker container:

/home/admin/test.sh

#!/bin/bash -xe

docker create --name abc docker-syncd-mlnx || { rc=$?; echo "RC: $rc" >> /tmp/log ; exit $rc; }
docker rm -f abc

a corresponding test.service file:

[Unit]
Description=syncd service
Requires=database.service
After=database.service
After=swss.service



[Service]
User=root
Environment=sonic_asic_platform=mellanox
ExecStartPre=/home/admin/test.sh
ExecStart=/bin/sleep inf
ExecStop=

TimeoutStartSec=480


[Install]
WantedBy=sonic.target

In one terminal run (this is to replicate what eventd service would do):

while :; do systemctl reset-failed rsyslog  ;  systemctl restart rsyslog;   echo done; done

In another terminal restart test.service:

while :; do systemctl reset-failed test; systemctl restart test ; echo done ; done

In third terminal overwrite rsyslog.conf (this will be done by rsyslog-config service which is in race with eventd):

while :; do
    sonic-cfggen -d -t /usr/share/sonic/templates/rsyslog.conf.j2     -a "{\"udp_server_ip\": \"$udp_server_ip\", \"hostname\": \"$hostname\", \"docker0_ip\": \"$docker0_ip\"}"     > /etc/rsyslog.conf
done

Eventually test.service crashes and the following is printed to /tmp/log:

RC: 141

Indicating docker create command crashed with SIGPIPE. This seems like some other bug exposed by your changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Triaged this issue has been triaged
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants