Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warm reboot: Support vlanmgrd process warm restart #550

Merged
merged 7 commits into from
Aug 16, 2018
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 3 additions & 2 deletions cfgmgr/Makefile.am
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
INCLUDES = -I $(top_srcdir) -I $(top_srcdir)/orchagent
INCLUDES = -I $(top_srcdir) -I $(top_srcdir)/orchagent -I $(top_srcdir)/warmrestart
CFLAGS_SAI = -I /usr/include/sai

bin_PROGRAMS = vlanmgrd intfmgrd buffermgrd
Expand All @@ -9,7 +9,8 @@ else
DBGFLAGS = -g
endif

vlanmgrd_SOURCES = vlanmgrd.cpp vlanmgr.cpp $(top_srcdir)/orchagent/orch.cpp $(top_srcdir)/orchagent/request_parser.cpp shellcmd.h
vlanmgrd_SOURCES = vlanmgrd.cpp vlanmgr.cpp $(top_srcdir)/orchagent/orch.cpp $(top_srcdir)/orchagent/request_parser.cpp \
shellcmd.h $(top_srcdir)/warmrestart/warm_restart.cpp $(top_srcdir)/warmrestart/warm_restart.h
vlanmgrd_CFLAGS = $(DBGFLAGS) $(AM_CFLAGS) $(CFLAGS_COMMON) $(CFLAGS_SAI)
vlanmgrd_CPPFLAGS = $(DBGFLAGS) $(AM_CFLAGS) $(CFLAGS_COMMON) $(CFLAGS_SAI)
vlanmgrd_LDADD = -lswsscommon
Expand Down
76 changes: 68 additions & 8 deletions cfgmgr/vlanmgr.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@
#include "exec.h"
#include "tokenize.h"
#include "shellcmd.h"
#include "warm_restart.h"

using namespace std;
using namespace swss;
Expand All @@ -26,11 +27,26 @@ VlanMgr::VlanMgr(DBConnector *cfgDb, DBConnector *appDb, DBConnector *stateDb, c
m_statePortTable(stateDb, STATE_PORT_TABLE_NAME),
m_stateLagTable(stateDb, STATE_LAG_TABLE_NAME),
m_stateVlanTable(stateDb, STATE_VLAN_TABLE_NAME),
m_stateVlanMemberTable(stateDb, STATE_VLAN_MEMBER_TABLE_NAME),
m_appVlanTableProducer(appDb, APP_VLAN_TABLE_NAME),
m_appVlanMemberTableProducer(appDb, APP_VLAN_MEMBER_TABLE_NAME)
{
SWSS_LOG_ENTER();

if (WarmStart::isWarmStart())
{
const std::string cmds = std::string("")
+ IP_CMD + " link show " + DOT1Q_BRIDGE_NAME + " 2>/dev/null";

std::string res;
int ret = swss::exec(cmds, res);
if (ret == 0)
{
// Don't reset vlan aware bridge upon swss docker warm restart.
SWSS_LOG_INFO("vlanmgrd warm start, skipping bridge create");
return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am not sure if this approach is bullet proof.

what if we later change vlan_filtering option, or enable more option for the bridge. it could happen that older version does not have that option, but new vlanmgrd will enable that option, but the warm reboot will miss it.

I think the right approach should still remove all of them and add new, this is mainly control plane, then we can still achieve non data plane disruption.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removal of vlan doesn't affect data plane directly, but BGP docker and BGP will be affected and cause route flapping.

If there is vlan_filtering option change though unlikely for now, probably the easier way is to handle that explicitly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not particularly worry about vlan_filtering, I am more worry about future bridge attribute, maybe disable unknown multicast, unknown unicast options.

for bgp docker, I think we can do docker pause.

https://docs.docker.com/engine/reference/commandline/pause/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete/create of bridge and vlan is done in linux kernel, and zebra listen on that, pausing docker in this case may trigger unknown side effect since we don't know the exact timing of netlink message. It also makes interface handling more complex.

For disabling unknown multicast, unknown unicast, probably we should add configuration option for them, that was in the original vlan trunk pull request. We may bring it back and refine the change later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my point is that in the future you never know what you are going to add for the bridge. Therefore, we should create exactly the same one as we create in cold boot.

To ensure that, the cleanest approach is to remove and recreate, then we can share the same code path as the cold boot.

if this can cause control plane disruption, we should then stop the bgp container and do the bgp gr.


In reply to: 205924420 [](ancestors = 205924420)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides system level warm reboot, we want to support docker warm restart. We intend to have same code path for cold boot and warm boot whenever possible.

For this case, it is in constructor phase. I don't see why new option has to be put here instead of as a configuration option. Also we use docker to separate the services, stopping other docker when doing operation in one docker doesn't seem that clean.

}
}
// Initialize Linux dot1q bridge and enable vlan filtering
// The command should be generated as:
// /bin/bash -c "/sbin/ip link del Bridge 2>/dev/null ;
Expand Down Expand Up @@ -236,6 +252,19 @@ void VlanMgr::doVlanTask(Consumer &consumer)
vector<FieldValueTuple> fvVector;
string members;

/*
* Don't program vlan again if state is already set.
* will hit this for docker warm restart.
* Just set the internal data structure and remove the request.
*/
if (isVlanStateOk(key))
{
m_vlans.insert(key);
it = consumer.m_toSync.erase(it);
SWSS_LOG_DEBUG("%s already created", kfvKey(t).c_str());
continue;
}

/* Add host VLAN when it has not been created. */
if (m_vlans.find(key) == m_vlans.end())
{
Expand Down Expand Up @@ -355,6 +384,18 @@ bool VlanMgr::isVlanStateOk(const string &alias)
return false;
}

bool VlanMgr::isVlanMemberStateOk(const string &vlanMemberKey)
{
vector<FieldValueTuple> temp;

if (m_stateVlanMemberTable.get(vlanMemberKey, temp))
{
SWSS_LOG_DEBUG("%s is ready", vlanMemberKey.c_str());
return true;
}
return false;
}

/*
* members is grouped in format like
* "Ethernet1,Ethernet2,Ethernet3,Ethernet4,Ethernet5,Ethernet6,
Expand Down Expand Up @@ -442,6 +483,13 @@ void VlanMgr::doVlanMemberTask(Consumer &consumer)
// TODO: store port/lag/VLAN data in local data structure and perform more validations.
if (op == SET_COMMAND)
{
if (isVlanMemberStateOk(kfvKey(t)))
{
SWSS_LOG_DEBUG("%s already set", kfvKey(t).c_str());
it = consumer.m_toSync.erase(it);
continue;
}

/* Don't proceed if member port/lag is not ready yet */
if (!isMemberStateOk(port_alias) || !isVlanStateOk(vlan_alias))
{
Expand Down Expand Up @@ -474,24 +522,36 @@ void VlanMgr::doVlanMemberTask(Consumer &consumer)
key += DEFAULT_KEY_SEPARATOR;
key += port_alias;
m_appVlanMemberTableProducer.set(key, kfvFieldsValues(t));

vector<FieldValueTuple> fvVector;
FieldValueTuple s("state", "ok");
fvVector.push_back(s);
m_stateVlanMemberTable.set(kfvKey(t), fvVector);
}
it = consumer.m_toSync.erase(it);
}
else if (op == DEL_COMMAND)
{
removeHostVlanMember(vlan_id, port_alias);
key = VLAN_PREFIX + to_string(vlan_id);
key += DEFAULT_KEY_SEPARATOR;
key += port_alias;
m_appVlanMemberTableProducer.del(key);
if (isVlanMemberStateOk(kfvKey(t)))
{
removeHostVlanMember(vlan_id, port_alias);
key = VLAN_PREFIX + to_string(vlan_id);
key += DEFAULT_KEY_SEPARATOR;
key += port_alias;
m_appVlanMemberTableProducer.del(key);
m_stateVlanMemberTable.del(kfvKey(t));
}
else
{
SWSS_LOG_DEBUG("%s doesn't exist", kfvKey(t).c_str());
}
SWSS_LOG_DEBUG("%s", (dumpTuple(consumer, t)).c_str());
it = consumer.m_toSync.erase(it);
}
else
{
SWSS_LOG_ERROR("Unknown operation type %s", op.c_str());
it = consumer.m_toSync.erase(it);
}
/* Other than the case of member port/lag is not ready, no retry will be performed */
it = consumer.m_toSync.erase(it);
}
}

Expand Down
3 changes: 2 additions & 1 deletion cfgmgr/vlanmgr.h
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@ class VlanMgr : public Orch
ProducerStateTable m_appVlanTableProducer, m_appVlanMemberTableProducer;
Table m_cfgVlanTable, m_cfgVlanMemberTable;
Table m_statePortTable, m_stateLagTable;
Table m_stateVlanTable;
Table m_stateVlanTable, m_stateVlanMemberTable;
std::set<std::string> m_vlans;

void doTask(Consumer &consumer);
Expand All @@ -38,6 +38,7 @@ class VlanMgr : public Orch
bool isMemberStateOk(const string &alias);
bool isVlanStateOk(const string &alias);
bool isVlanMacOk();
bool isVlanMemberStateOk(const string &vlanMemberKey);
};

}
Expand Down
3 changes: 3 additions & 0 deletions cfgmgr/vlanmgrd.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@
#include "producerstatetable.h"
#include "vlanmgr.h"
#include "shellcmd.h"
#include "warm_restart.h"

using namespace std;
using namespace swss;
Expand Down Expand Up @@ -56,6 +57,8 @@ int main(int argc, char **argv)
DBConnector appDb(APPL_DB, DBConnector::DEFAULT_UNIXSOCKET, 0);
DBConnector stateDb(STATE_DB, DBConnector::DEFAULT_UNIXSOCKET, 0);

WarmStart::checkWarmStart("vlanmgrd");

/*
* swss service starts after interfaces-config.service which will have
* switch_mac set.
Expand Down
202 changes: 202 additions & 0 deletions tests/test_warm_reboot.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,202 @@
from swsscommon import swsscommon
import os
import re
import time
import json

# Get restart count of all processes supporting warm restart
def swss_get_RestartCount(state_db):
restart_count = {}
warmtbl = swsscommon.Table(state_db, swsscommon.STATE_WARM_RESTART_TABLE_NAME)
keys = warmtbl.getKeys()
assert len(keys) != 0
for key in keys:
(status, fvs) = warmtbl.get(key)
assert status == True
for fv in fvs:
if fv[0] == "restart_count":
restart_count[key] = int(fv[1])
print(restart_count)
return restart_count

# function to check the restart count incremented by 1 for all processes supporting warm restart
def swss_check_RestartCount(state_db, restart_count):
warmtbl = swsscommon.Table(state_db, swsscommon.STATE_WARM_RESTART_TABLE_NAME)
keys = warmtbl.getKeys()
print(keys)
assert len(keys) > 0
for key in keys:
(status, fvs) = warmtbl.get(key)
assert status == True
for fv in fvs:
if fv[0] == "restart_count":
assert int(fv[1]) == restart_count[key] + 1
elif fv[0] == "state":
assert fv[1] == "reconciled"

def check_port_oper_status(appl_db, port_name, state):
portTbl = swsscommon.Table(appl_db, swsscommon.APP_PORT_TABLE_NAME)
(status, fvs) = portTbl.get(port_name)
assert status == True

oper_status = "unknown"
for v in fvs:
if v[0] == "oper_status":
oper_status = v[1]
break
assert oper_status == state

# function to check the restart count incremented by 1 for a single process
def swss_app_check_RestartCount_single(state_db, restart_count, name):
warmtbl = swsscommon.Table(state_db, swsscommon.STATE_WARM_RESTART_TABLE_NAME)
keys = warmtbl.getKeys()
print(keys)
print(restart_count)
assert len(keys) > 0
for key in keys:
if key != name:
continue
(status, fvs) = warmtbl.get(key)
assert status == True
for fv in fvs:
if fv[0] == "restart_count":
assert int(fv[1]) == restart_count[key] + 1
elif fv[0] == "state":
assert fv[1] == "reconciled"
def create_entry(tbl, key, pairs):
fvs = swsscommon.FieldValuePairs(pairs)
tbl.set(key, fvs)

# FIXME: better to wait until DB create them
time.sleep(1)

def create_entry_tbl(db, table, key, pairs):
tbl = swsscommon.Table(db, table)
create_entry(tbl, key, pairs)

def del_entry_tbl(db, table, key):
tbl = swsscommon.Table(db, table)
tbl._del(key)

def create_entry_pst(db, table, key, pairs):
tbl = swsscommon.ProducerStateTable(db, table)
create_entry(tbl, key, pairs)

def how_many_entries_exist(db, table):
tbl = swsscommon.Table(db, table)
return len(tbl.getKeys())


def test_VlanMgrdWarmRestart(dvs):

conf_db = swsscommon.DBConnector(swsscommon.CONFIG_DB, dvs.redis_sock, 0)
appl_db = swsscommon.DBConnector(swsscommon.APPL_DB, dvs.redis_sock, 0)
state_db = swsscommon.DBConnector(swsscommon.STATE_DB, dvs.redis_sock, 0)

dvs.runcmd("ifconfig Ethernet16 0")
dvs.runcmd("ifconfig Ethernet20 0")

dvs.runcmd("ifconfig Ethernet16 up")
dvs.runcmd("ifconfig Ethernet20 up")

time.sleep(1)

# enable warm restart
# TODO: use cfg command to config it
create_entry_tbl(
conf_db,
swsscommon.CFG_WARM_RESTART_TABLE_NAME, "swss",
[
("enable", "true"),
]
)

# create vlan
create_entry_tbl(
conf_db,
"VLAN", "Vlan16",
[
("vlanid", "16"),
]
)
# create vlan
create_entry_tbl(
conf_db,
"VLAN", "Vlan20",
[
("vlanid", "20"),
]
)
# create vlan member entry in config db. Don't use Ethernet0/4/8/12 as IP configured on them in previous testing.
create_entry_tbl(
conf_db,
"VLAN_MEMBER", "Vlan16|Ethernet16",
[
("tagging_mode", "untagged"),
]
)

create_entry_tbl(
conf_db,
"VLAN_MEMBER", "Vlan20|Ethernet20",
[
("tagging_mode", "untagged"),
]
)

time.sleep(1)

dvs.runcmd("ifconfig Vlan16 11.0.0.1/29 up")
dvs.runcmd("ifconfig Vlan20 11.0.0.9/29 up")

dvs.servers[4].runcmd("ifconfig eth0 11.0.0.2/29")
dvs.servers[4].runcmd("ip route add default via 11.0.0.1")

dvs.servers[5].runcmd("ifconfig eth0 11.0.0.10/29")
dvs.servers[5].runcmd("ip route add default via 11.0.0.9")

time.sleep(1)

# Ping should work between servers via vs vlan interfaces
ping_stats = dvs.servers[4].runcmd("ping -c 1 11.0.0.10")
time.sleep(1)

tbl = swsscommon.Table(appl_db, "NEIGH_TABLE")
(status, fvs) = tbl.get("Vlan16:11.0.0.2")
assert status == True

(status, fvs) = tbl.get("Vlan20:11.0.0.10")
assert status == True


bv_before = dvs.runcmd("bridge vlan")
print(bv_before)

restart_count = swss_get_RestartCount(state_db)

dvs.runcmd(['sh', '-c', 'pkill -x vlanmgrd; cp /var/log/swss/sairedis.rec /var/log/swss/sairedis.rec.b; echo > /var/log/swss/sairedis.rec'])
dvs.runcmd(['sh', '-c', 'supervisorctl start vlanmgrd'])
time.sleep(2)

bv_after = dvs.runcmd("bridge vlan")
assert bv_after == bv_before

# No create/set/remove operations should be passed down to syncd for vlanmgr warm restart
num = dvs.runcmd(['sh', '-c', 'grep \|c\| /var/log/swss/sairedis.rec | wc -l'])
assert num == '0\n'
num = dvs.runcmd(['sh', '-c', 'grep \|s\| /var/log/swss/sairedis.rec | wc -l'])
assert num == '0\n'
num = dvs.runcmd(['sh', '-c', 'grep \|r\| /var/log/swss/sairedis.rec | wc -l'])
assert num == '0\n'

#new ip on server 5
dvs.servers[5].runcmd("ifconfig eth0 11.0.0.11/29")

# Ping should work between servers via vs vlan interfaces
ping_stats = dvs.servers[4].runcmd("ping -c 1 11.0.0.11")

# new neighbor learn on VS
(status, fvs) = tbl.get("Vlan20:11.0.0.11")
assert status == True

swss_app_check_RestartCount_single(state_db, restart_count, "vlanmgrd")