Fixes problems with database - swss - syncd synchronization. #110

@vitaliy-senchyshyn vitaliy-senchyshyn commented Feb 3, 2017

This PR fixes problems with database - swss - syncd synchronization.

There are two problems:

  1. The database is flushed in swss.service ExecStartPre without checking that the redis server has already started. Sometimes the flush is executed too early, which makes swss fail. Here is the error log for this case:

Feb 2 12:56:23 switch2 INFO docker[798]: Could not connect to Redis at 127.0.0.1:6379: Connection refused
Feb 2 12:56:23 switch2 INFO docker[798]: Could not connect to Redis at 127.0.0.1:6379: Connection refused
Feb 2 12:56:23 switch2 NOTICE systemd[1]: swss.service: control process exited, code=exited status=1
Feb 2 12:56:23 switch2 ERR systemd[1]: Failed to start switch state service container.
Feb 2 12:56:23 switch2 NOTICE systemd[1]: Unit swss.service entered failed state.

To solve this, a bash loop is added to swss.service that checks whether the redis server is up using the redis-cli ping command. If it is not, the loop sleeps for a second before the next try.
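
The retry logic described above can be sketched as a small shell function. This is an illustration only, not the code merged in this PR: `wait_until` is a hypothetical helper, and the attempt limit is an added assumption (the loop in the PR retries forever).

```shell
# wait_until (hypothetical helper): run a command once per second until it
# succeeds, giving up after a fixed number of attempts.
#   $1   = maximum number of attempts (assumption; the PR's loop is unbounded)
#   $2.. = command to probe with, e.g. redis-cli ping
wait_until() {
    _wu_left="$1"
    shift
    while [ "$_wu_left" -gt 0 ]; do
        # Stop as soon as the probed command exits with status 0.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        _wu_left=$((_wu_left - 1))
        sleep 1
    done
    return 1
}

# Example (not executed here): wait up to 30 seconds for the redis server.
# wait_until 30 redis-cli ping
```

The sketch relies on `redis-cli ping` exiting with a non-zero status while the server refuses connections, which matches the `status=1` seen in the log above.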

  2. The second problem is a race condition between the database flush performed in swss.service and the HIDDEN variable set by syncd. Both services are started simultaneously, and sometimes the flush in swss is performed after the set in syncd. When the orchagent start script then checks the HLEN of the HIDDEN variable, it is always zero, so orchagent does not start.

As a solution, syncd.service is made dependent on swss.service and is started only after swss.service has started.
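
One way to confirm the resulting ordering on a running box is to inspect the `After=` property that systemd reports for the unit. The following is a rough sketch: `unit_starts_after` is a hypothetical helper, and it only checks ordering (`After=`), not the `Requires=` dependency.

```shell
# unit_starts_after (hypothetical helper): succeed if unit $2 is listed in the
# After= property of unit $1, as reported by `systemctl show`.
unit_starts_after() {
    systemctl show -p After "$1" \
        | sed 's/^After=//' \
        | tr ' ' '\n' \
        | grep -Fxq "$2"
}

# Example (not executed here):
# unit_starts_after syncd.service swss.service && echo "syncd is ordered after swss"
```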

@@ -5,6 +5,8 @@ After=database.service

[Service]
User={{ sonicadmin_user }}
+# Wait for redis server start before database clean by checking the server listening port 6379
+ExecStartPre=/bin/bash -c "while true; do if [ -n \"$(netstat -l | grep 6379)\" ]; then break; fi; sleep 1; done"
Contributor

How about using "nc -z -w 5 127.0.0.1 6379" to check if the port is open? There could be a port 36379 that matches your criteria.

Author

@lguohan netstat output looks as follows on the box:

admin@switch2:~$ netstat -l
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 localhost:6379 *:* LISTEN

So I guess it's better to check for ":6379". Then it will cover all the cases.
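
The difference between the patterns under discussion can be illustrated with a short sketch. `listens_on_port` is a hypothetical helper, not part of this PR; it reads netstat-style lines on stdin and matches the port only when it follows a colon and is followed by whitespace, so neither 36379 nor 63790 would match.

```shell
# listens_on_port (hypothetical helper): read `netstat -ln`-style lines on
# stdin and succeed if some address on a line ends in exactly ":<port>".
#   $1 = numeric port
listens_on_port() {
    # The leading colon rejects 36379; the trailing whitespace rejects 63790.
    grep -Eq ":$1[[:space:]]"
}

# Example (not executed here):
# netstat -ln | listens_on_port 6379 && echo "redis port is open"
```

A plain `grep 6379` accepts a line containing `localhost:36379`, which is the false positive raised in the review.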

Contributor

Redis CLI has a ping command.

Author

okay, let's check with PING

Contributor

lguohan commented Feb 3, 2017

On my box there are many more ports.

acsadmin@CCPSCH01030BBLF:~$ sudo netstat -l
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 *:ssh *:* LISTEN
tcp 0 0 localhost:4000 *:* LISTEN
tcp 0 0 localhost:zebra *:* LISTEN
tcp 0 0 localhost:6379 *:* LISTEN
tcp 0 0 localhost:bgpd *:* LISTEN
tcp 0 0 *:bgp *:* LISTEN
tcp6 0 0 [::]:ssh [::]:* LISTEN
tcp6 0 0 [::]:bgp [::]:* LISTEN
udp 0 0 *:33222 *:*
udp 0 0 localhost:syslog *:*
udp 0 0 *:57900 *:*
udp 0 0 *:37743 *:*
udp 0 0 *:50129 *:*
udp 0 0 *:42391 *:*
udp 0 0 *:58961 *:*
udp 0 0 *:22173 *:*
udp 0 0 *:42721 *:*
udp 0 0 *:47221 *:*
udp 0 0 *:43373 *:*
udp 0 0 *:59820 *:*
udp 0 0 *:bootpc *:*
udp 0 0 ccpsch01030bblf.phx:ntp *:*
udp 0 0 localhost:ntp *:*
udp 0 0 ccpsch01030bblf-lo:snmp *:*
udp 0 0 ccpsch01030bblf.ph:snmp *:*
udp 0 0 localhost:snmp *:*
udp6 0 0 [::]:53888 [::]:*
udp6 0 0 [::]:38078 [::]:*
udp6 0 0 localhost:ntp [::]:*
raw6 0 0 [::]:ipv6-icmp [::]:* 7
Active UNIX domain sockets (only servers)
Proto RefCnt Flags Type State I-Node Path
unix 2 [ ACC ] STREAM LISTENING 1290 /run/systemd/private
unix 2 [ ACC ] STREAM LISTENING 11289 /var/run/docker.sock
unix 2 [ ACC ] STREAM LISTENING 9042 /var/lib/docker/network/files/738130dd349f83861d47520786267cc26128056961ac2ae54121c475a5f99610.sock
unix 2 [ ACC ] SEQPACKET LISTENING 1315 /run/udev/control
unix 2 [ ACC ] STREAM LISTENING 1318 /run/systemd/journal/stdout
unix 2 [ ACC ] STREAM LISTENING 182052 /var/run/supervisor.sock.1
unix 2 [ ACC ] STREAM LISTENING 181439 /var/run/quagga/zserv.api
unix 2 [ ACC ] STREAM LISTENING 181441 /var/run/quagga/zebra.vty
unix 2 [ ACC ] STREAM LISTENING 180016 /var/run/lldpd.socket
unix 2 [ ACC ] STREAM LISTENING 180768 /var/run/quagga/bgpd.vty
unix 2 [ ACC ] STREAM LISTENING 149167 /var/run/redis/redis.sock
unix 2 [ ACC ] STREAM LISTENING 27286466 /var/run/sswsyncd/sswsyncd.socket
unix 2 [ ACC ] STREAM LISTENING 181808 /var/run/supervisor.sock.1
unix 2 [ ACC ] STREAM LISTENING 10462 /var/run/docker/libcontainerd/docker-containerd.sock
unix 2 [ ACC ] STREAM LISTENING 181191 /var/agentx/master
unix 2 [ ACC ] STREAM LISTENING 36911 /var/opt/quest/vas/vasd/.vasd_62
unix 2 [ ACC ] STREAM LISTENING 35980 /var/opt/quest/vas/vasd/.vasd40_ipc_sock
unix 2 [ ACC ] STREAM LISTENING 34857 /var/opt/quest/vas/vasd/.vasd_63
unix 2 [ ACC ] STREAM LISTENING 34563 /var/opt/quest/vas/vasd/.vasd_65
unix 2 [ ACC ] STREAM LISTENING 35982 /var/opt/quest/vas/vasd/.vasd_64

Author

On mine too. I only posted part of the output.

@lguohan lguohan merged commit a2ab261 into sonic-net:master Feb 3, 2017
-Requires=database.service
-After=database.service
+Requires=database.service swss.service
+After=database.service swss.service
Contributor

syncd does not depend on swss, and syncd should start before swss.
@vitaliy-senchyshyn @lguohan

Contributor

swss depends on syncd and swss starts after syncd

Collaborator

swss.service is clearing the database

Contributor

OK, I will test this.

lguohan added a commit to lguohan/sonic-buildimage that referenced this pull request Feb 9, 2017
lguohan added a commit to sonic-net/sonic-buildimage that referenced this pull request Feb 9, 2017