Warm reboot for general Orch #572

qiluo-msft · 2018-08-10T01:04:18Z

The non-warm reboot behavior is backward compatible and was tested in the lab. Sample log:

Aug 10 21:06:19.241459 sonic NOTICE swss/orchagent: :- bake: Add warm input: SWITCH_TABLE, 0
Aug 10 21:06:19.241535 sonic NOTICE swss/orchagent: :- bake: Add warm input: CRM, 1
Aug 10 21:06:19.242007 sonic NOTICE swss/orchagent: :- bake: Add warm input: BUFFER_PG, 33
Aug 10 21:06:19.242105 sonic NOTICE swss/orchagent: :- bake: Add warm input: BUFFER_POOL, 4
Aug 10 21:06:19.242185 sonic NOTICE swss/orchagent: :- bake: Add warm input: BUFFER_PORT_EGRESS_PROFILE_LIST, 1
Aug 10 21:06:19.242273 sonic NOTICE swss/orchagent: :- bake: Add warm input: BUFFER_PORT_INGRESS_PROFILE_LIST, 1
Aug 10 21:06:19.242353 sonic NOTICE swss/orchagent: :- bake: Add warm input: BUFFER_PROFILE, 7
Aug 10 21:06:19.242424 sonic NOTICE swss/orchagent: :- bake: Add warm input: BUFFER_QUEUE, 2
Aug 10 21:06:19.242518 sonic NOTICE swss/orchagent: :- bake: foundPortConfigDone = 0
Aug 10 21:06:19.242617 sonic NOTICE swss/orchagent: :- bake: foundPortInitDone = 0
Aug 10 21:06:19.242689 sonic NOTICE swss/orchagent: :- bake: m_portTable->getKeys 0
Aug 10 21:06:19.242750 sonic NOTICE swss/orchagent: :- bake: No port table, fallback to cold start
Aug 10 21:06:19.242835 sonic NOTICE swss/orchagent: :- bake: Add warm input: INTF_TABLE, 0
Aug 10 21:06:19.243010 sonic NOTICE swss/orchagent: :- bake: Add warm input: NEIGH_TABLE, 0
Aug 10 21:06:19.243191 sonic NOTICE swss/orchagent: :- bake: Add warm input: ROUTE_TABLE, 0
Aug 10 21:06:19.243396 sonic NOTICE swss/orchagent: :- bake: Add warm input: COPP_TABLE, 0
Aug 10 21:06:19.243575 sonic NOTICE swss/orchagent: :- bake: Add warm input: TUNNEL_DECAP_TABLE, 0
Aug 10 21:06:19.243665 sonic NOTICE swss/orchagent: :- bake: Add warm input: DSCP_TO_TC_MAP, 1
Aug 10 21:06:19.243771 sonic NOTICE swss/orchagent: :- bake: Add warm input: MAP_PFC_PRIORITY_TO_QUEUE, 1
Aug 10 21:06:19.243851 sonic NOTICE swss/orchagent: :- bake: Add warm input: PFC_PRIORITY_TO_PRIORITY_GROUP_MAP, 1
Aug 10 21:06:19.243944 sonic NOTICE swss/orchagent: :- bake: Add warm input: PORT_QOS_MAP, 1
Aug 10 21:06:19.244029 sonic NOTICE swss/orchagent: :- bake: Add warm input: QUEUE, 4
Aug 10 21:06:19.244130 sonic NOTICE swss/orchagent: :- bake: Add warm input: SCHEDULER, 3
Aug 10 21:06:19.244206 sonic NOTICE swss/orchagent: :- bake: Add warm input: TC_TO_PRIORITY_GROUP_MAP, 1
Aug 10 21:06:19.244285 sonic NOTICE swss/orchagent: :- bake: Add warm input: TC_TO_QUEUE_MAP, 1
Aug 10 21:06:19.244371 sonic NOTICE swss/orchagent: :- bake: Add warm input: WRED_PROFILE, 2
Aug 10 21:06:19.244434 sonic NOTICE swss/orchagent: :- bake: Add warm input: MIRROR_SESSION, 1
Aug 10 21:06:19.244505 sonic NOTICE swss/orchagent: :- bake: Add warm input: ACL_RULE, 9
Aug 10 21:06:19.244600 sonic NOTICE swss/orchagent: :- bake: Add warm input: ACL_TABLE, 4
Aug 10 21:06:19.244686 sonic NOTICE swss/orchagent: :- bake: Add warm input: LAG_TABLE, 0
Aug 10 21:06:19.244760 sonic NOTICE swss/orchagent: :- bake: Add warm input: FDB_TABLE, 0
Aug 10 21:06:19.244821 sonic NOTICE swss/orchagent: :- bake: Add warm input: VRF, 0
Aug 10 21:06:19.244881 sonic NOTICE swss/orchagent: :- bake: Add warm input: FLEX_COUNTER_TABLE, 3
Aug 10 21:06:19.244940 sonic NOTICE swss/orchagent: :- bake: Add warm input: PFC_WD_TABLE, 0

The idea is a best-effort warm reboot based on leftover entries in the ConfigDB/AppDB/StateDB tables.

During a cold reboot, the tables either hold initial data or are empty. In either case, the initial data is popped and moved to the corresponding consumer's m_toSync, and later processed by doTask(). The original behavior is therefore kept.

During a warm reboot, the tables hold initial data. The initial data is popped and moved to the corresponding consumer's m_toSync, and later processed by the warm reboot logic (TODO). The warm reboot has not yet been tested end to end.
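
A minimal sketch of the refill step described above, for illustration only. The `Table`, `Consumer` and `Orch` types here are simplified stand-ins, not the real swss classes; only the names `refillToSync()`, `bake()` and `m_toSync` are taken from this PR, and their real signatures may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Simplified stand-ins (assumptions), not the real swss types.
using FieldValues = std::map<std::string, std::string>;
using Table = std::map<std::string, FieldValues>;  // key -> field/value pairs

struct Consumer {
    std::string name;
    const Table* table;                            // leftover entries found in the DB
    std::map<std::string, FieldValues> m_toSync;   // pending work for doTask()

    // Copy every existing table entry into m_toSync.
    // Returns how many entries were newly added.
    std::size_t refillToSync() {
        assert(table != nullptr);
        std::size_t added = 0;
        for (const auto& kv : *table) {
            if (m_toSync.emplace(kv.first, kv.second).second) {
                ++added;
            }
        }
        return added;
    }
};

struct Orch {
    std::vector<Consumer> consumers;
    virtual ~Orch() = default;

    // Default bake(): refill every consumer from its table and log the count.
    virtual bool bake() {
        for (auto& c : consumers) {
            std::size_t refilled = c.refillToSync();
            std::printf("bake: Add warm input: %s, %zu\n", c.name.c_str(), refilled);
        }
        return true;
    }
};
```

With an empty table the refill adds nothing, which matches the "either case" wording for cold reboot: the consumer simply starts with an empty m_toSync.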

Signed-off-by: Qi Luo <qiluo-msft@users.noreply.github.com>

jipanyang · 2018-08-10T01:24:08Z

What is the reason you kept re-implementing the functions that have been covered in other PRs?

More explanation about the overlapped function:

All the configDB cases are already handled by the few lines of code changed at this link:
https://github.com/Azure/sonic-swss/pull/554/files#diff-59ec609e195cd92cf17a204ea7b57feeR335

Eventually we want to handle the case of configDB changes, probably by freezing configDB at a pre-warm-restart checkpoint, so that no config change happens during the restart window. #Resolved

qiluo-msft · 2018-08-10T04:06:22Z

First, the credit for implementing warm reboot should go to you. My implementation is just inspired by your solution. I acknowledge that your solution works, and so does mine. There are some design principles we should consider when comparing them.

The Consumer can restore its previous state not because it has a table in CONFIG_DB, but because it has a SubscriberStateTable.
The SubscriberStateTable by design has two data sources and two phases: first the initial table content, and later field-value updates. The first data source and phase fit the design of Orch::addExistingData() and Consumer::refillToSync() perfectly, so this is a natural extension. Ref: https://github.com/Azure/sonic-swss-common/blob/master/common/subscriberstatetable.cpp
The second phase of SubscriberStateTable is for config updates. By design, this should not be handled by warm reboot, but processed normally after warm reboot.
SubscriberStateTable/ConsumerTable/ConsumerStateTable are on the same abstraction level, so state restoring should be implemented on that same level. The best implementation would use if-else branching or virtual functions.

I should also point out some not-so-good design choices I found:

At the abstraction level above Consumer, the user should not have to care much about the underlying redis database index and table name.
Introducing new collaboration between processes, such as freezing redis or sending a special warm reboot event. I don't mean these are totally forbidden, but they are not necessary for warm reboot. The design of every component can handle the use case well with some reasonable extension like this PR.

Let me know if you find more design or implementation issues in this PR.

In reply to: 411946969 [](ancestors = 411946969)

jipanyang · 2018-08-10T04:41:28Z

Your change doesn't solve the problem of configDB changes during the restart window.
SubscriberStateTable will pick up whatever is available in the configDB table at the moment orchagent starts.

Unless configDB adopts a write/read mechanism like producerState/consumerState, freezing configDB or simply avoiding changes to it during warm restart are the only solutions I can think of.

The changes made here have the same effect of restoring the SubscriberStateTable consumer as the link I provided, though with a lot more lines of code, including the ripple effect on all other cases.

#Resolved

qiluo-msft · 2018-08-10T05:09:41Z

I agree "Your change doesn't solve the problem of configDB change during restart window." So I am open to freezing redis if needed.

Could you explain "ripple effect on all other cases"? What exactly is wrong? A use case or code discussion may be helpful here.

In reply to: 411973453 [](ancestors = 411973453)

jipanyang · 2018-08-10T07:27:37Z

If you take this approach, every other orch that uses a configDB table has to make similar changes.

The original problem has been solved with 5 lines of code changes; now you want to solve the same problem again with probably more than 100 lines of code, but without any real extra benefit.

Maybe I missed something, but my personal opinion is that this is a typical case of over-engineering.

#Resolved

lguohan · 2018-08-10T08:07:30Z

In warm reboot, if we sync once with config db and then do not process any new config db changes until we are done with warm reboot, that seems like it will work. If this is already addressed in #554, what is the problem we are trying to solve in this PR? #Resolved
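
As an illustration of the "sync once, then hold new config changes until warm reboot is done" idea, here is a small sketch. `ConfigGate` and its methods are hypothetical names for this example; neither this PR nor #554 is claimed to implement it this way.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical gate: while restoration is in progress, changes are queued;
// once it finishes, queued changes are applied in arrival order.
class ConfigGate {
public:
    using Change = std::pair<std::string, std::string>;  // key, value

    explicit ConfigGate(std::function<void(const Change&)> apply)
        : m_apply(std::move(apply)) {}

    // Called for every config change seen after the initial sync.
    void onChange(const Change& c) {
        if (m_restoring) {
            m_deferred.push_back(c);
        } else {
            m_apply(c);
        }
    }

    // Called once warm reboot restoration has finished.
    void finishWarmReboot() {
        m_restoring = false;
        for (const auto& c : m_deferred) {
            m_apply(c);
        }
        m_deferred.clear();
    }

    std::size_t deferredCount() const { return m_deferred.size(); }

private:
    std::function<void(const Change&)> m_apply;
    bool m_restoring = true;
    std::vector<Change> m_deferred;
};
```

This keeps the gating inside one process. Freezing configDB itself, as suggested earlier in the thread, would instead need coordination with whatever writes to the database.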

jipanyang · 2018-08-10T19:10:41Z

If you really want to go with this approach, I'm fine with it, but please have all the configDB/stateDB cases handled in the PR, so that the solution is complete and integration tests can be run to find any potential issues.

#Resolved

qiluo-msft · 2018-08-10T21:27:27Z

I finally get what you are worried about. The suggestion is pretty good, and I have added a new iteration. The PR scope is now broader: it provides a general solution for all the Orch derived classes.

In reply to: 411999953 [](ancestors = 411999953)

qiluo-msft · 2018-08-10T21:31:55Z

The configDB/stateDB cases are now handled in the PR. The warm reboot logic is a TODO item. All the cold start behavior is kept.

In reply to: 412178195 [](ancestors = 412178195)

qiluo-msft · 2018-08-10T21:37:18Z

Better design.

In reply to: 412009165 [](ancestors = 412009165)

lguohan · 2018-08-10T22:18:54Z

orchagent/orch.cpp

+        }
+
+        size_t refilled = consumer->refillToSync();
+        SWSS_LOG_NOTICE("Add warm input: %s, %zd", executorName.c_str(), refilled);


Is DEBUG level better? #Resolved

I prefer NOTICE since it is important and a one-time thing. There are about 35 lines in total.

In reply to: 209397186 [](ancestors = 209397186)

lguohan · 2018-08-10T22:19:46Z

@jipanyang to review the new design. #Resolved

jipanyang · 2018-08-11T00:28:23Z

The new design looks good to me.

Some extra comments here:
When the implementation is complex enough, there is always room for improvement or an alternative way of implementing it. For major functions in the pending PRs, if you think you can do them better, that is great; we do want a better system. But please have the courtesy to put a note in the original PR first, so people know what to expect and can avoid duplicated effort. #Resolved

qiluo-msft · 2018-08-11T00:30:50Z

Thanks Jipan for reviewing. Will do.

In reply to: 412237391 [](ancestors = 412237391)

baxia-lan · 2023-12-05T18:51:35Z

orchagent/portsorch.h

@@ -56,7 +56,7 @@ class PortsOrch : public Orch, public Subject
    bool isInitDone();

    map<string, Port>& getAllPorts();
-    bool bake();
+    bool bake() override;


Why does PortsOrch have its own bake() instead of using Orch::bake()?
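
One possible reading, based only on the sample log in the PR description (the `foundPortConfigDone`, `foundPortInitDone` and `m_portTable->getKeys` lines followed by "No port table, fallback to cold start"): the port table needs extra checks before its leftover entries can be trusted. The sketch below is a guess at that shape, not the merged code. The member names, the exact conditions and the meaning of the return value are all assumptions.

```cpp
#include <cassert>
#include <cstddef>

// Simplified stand-in for the base class (assumption).
struct Orch {
    bool baked = false;
    virtual ~Orch() = default;

    // Generic behavior: refill consumers from leftover table entries.
    virtual bool bake() {
        baked = true;
        return true;
    }
};

// Hypothetical PortsOrch that refuses to warm start without a complete port table.
struct PortsOrch : Orch {
    bool foundPortConfigDone = false;  // assumed marker in the port table
    bool foundPortInitDone = false;    // assumed marker in the port table
    std::size_t portKeys = 0;          // number of keys found in the port table

    bool bake() override {
        if (!foundPortConfigDone || !foundPortInitDone || portKeys == 0) {
            // No usable port table: fall back to cold start.
            return false;
        }
        return Orch::bake();
    }
};
```

The `override` specifier in the diff makes the compiler verify that `PortsOrch::bake()` really replaces a virtual `Orch::bake()` with the same signature.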

qiluo-msft added 2 commits August 10, 2018 00:56

Warm reboot for BufferOrch

8228420

Signed-off-by: Qi Luo <qiluo-msft@users.noreply.github.com>

Add log for warm reboot

e65828f

Signed-off-by: Qi Luo <qiluo-msft@users.noreply.github.com>

qiluo-msft added 2 commits August 10, 2018 20:01

Add bake() interface

1acd885

Signed-off-by: Qi Luo <qiluo-msft@users.noreply.github.com>

OrchDaemon supports warm start

1146af1

Signed-off-by: Qi Luo <qiluo-msft@users.noreply.github.com>

qiluo-msft changed the title ~~Warm reboot for BufferOrch~~ Warm reboot for general Orch Aug 10, 2018

lguohan reviewed Aug 10, 2018

lguohan approved these changes Aug 11, 2018

qiluo-msft merged commit c429ddb into sonic-net:master Aug 11, 2018

qiluo-msft deleted the qiluo/warmbuffer branch August 11, 2018 00:46

stcheng added the Enhancement ➕ label Aug 29, 2018

EdenGri pushed a commit to EdenGri/sonic-swss that referenced this pull request Feb 28, 2022

[fast-reboot] Stop services after killing containers to prevent autom…

ee56d54

…atic restart (sonic-net#572)

baxia-lan reviewed Dec 5, 2023
