Warm reboot for general Orch #572

Merged
merged 4 commits into from
Aug 11, 2018

qiluo-msft
Contributor

@qiluo-msft qiluo-msft commented Aug 10, 2018

The non-warm-reboot behavior is backward compatible and has been tested in the lab. Sample log:

Aug 10 21:06:19.241459 sonic NOTICE swss/orchagent: :- bake: Add warm input: SWITCH_TABLE, 0
Aug 10 21:06:19.241535 sonic NOTICE swss/orchagent: :- bake: Add warm input: CRM, 1
Aug 10 21:06:19.242007 sonic NOTICE swss/orchagent: :- bake: Add warm input: BUFFER_PG, 33
Aug 10 21:06:19.242105 sonic NOTICE swss/orchagent: :- bake: Add warm input: BUFFER_POOL, 4
Aug 10 21:06:19.242185 sonic NOTICE swss/orchagent: :- bake: Add warm input: BUFFER_PORT_EGRESS_PROFILE_LIST, 1
Aug 10 21:06:19.242273 sonic NOTICE swss/orchagent: :- bake: Add warm input: BUFFER_PORT_INGRESS_PROFILE_LIST, 1
Aug 10 21:06:19.242353 sonic NOTICE swss/orchagent: :- bake: Add warm input: BUFFER_PROFILE, 7
Aug 10 21:06:19.242424 sonic NOTICE swss/orchagent: :- bake: Add warm input: BUFFER_QUEUE, 2
Aug 10 21:06:19.242518 sonic NOTICE swss/orchagent: :- bake: foundPortConfigDone = 0
Aug 10 21:06:19.242617 sonic NOTICE swss/orchagent: :- bake: foundPortInitDone = 0
Aug 10 21:06:19.242689 sonic NOTICE swss/orchagent: :- bake: m_portTable->getKeys 0
Aug 10 21:06:19.242750 sonic NOTICE swss/orchagent: :- bake: No port table, fallback to cold start
Aug 10 21:06:19.242835 sonic NOTICE swss/orchagent: :- bake: Add warm input: INTF_TABLE, 0
Aug 10 21:06:19.243010 sonic NOTICE swss/orchagent: :- bake: Add warm input: NEIGH_TABLE, 0
Aug 10 21:06:19.243191 sonic NOTICE swss/orchagent: :- bake: Add warm input: ROUTE_TABLE, 0
Aug 10 21:06:19.243396 sonic NOTICE swss/orchagent: :- bake: Add warm input: COPP_TABLE, 0
Aug 10 21:06:19.243575 sonic NOTICE swss/orchagent: :- bake: Add warm input: TUNNEL_DECAP_TABLE, 0
Aug 10 21:06:19.243665 sonic NOTICE swss/orchagent: :- bake: Add warm input: DSCP_TO_TC_MAP, 1
Aug 10 21:06:19.243771 sonic NOTICE swss/orchagent: :- bake: Add warm input: MAP_PFC_PRIORITY_TO_QUEUE, 1
Aug 10 21:06:19.243851 sonic NOTICE swss/orchagent: :- bake: Add warm input: PFC_PRIORITY_TO_PRIORITY_GROUP_MAP, 1
Aug 10 21:06:19.243944 sonic NOTICE swss/orchagent: :- bake: Add warm input: PORT_QOS_MAP, 1
Aug 10 21:06:19.244029 sonic NOTICE swss/orchagent: :- bake: Add warm input: QUEUE, 4
Aug 10 21:06:19.244130 sonic NOTICE swss/orchagent: :- bake: Add warm input: SCHEDULER, 3
Aug 10 21:06:19.244206 sonic NOTICE swss/orchagent: :- bake: Add warm input: TC_TO_PRIORITY_GROUP_MAP, 1
Aug 10 21:06:19.244285 sonic NOTICE swss/orchagent: :- bake: Add warm input: TC_TO_QUEUE_MAP, 1
Aug 10 21:06:19.244371 sonic NOTICE swss/orchagent: :- bake: Add warm input: WRED_PROFILE, 2
Aug 10 21:06:19.244434 sonic NOTICE swss/orchagent: :- bake: Add warm input: MIRROR_SESSION, 1
Aug 10 21:06:19.244505 sonic NOTICE swss/orchagent: :- bake: Add warm input: ACL_RULE, 9
Aug 10 21:06:19.244600 sonic NOTICE swss/orchagent: :- bake: Add warm input: ACL_TABLE, 4
Aug 10 21:06:19.244686 sonic NOTICE swss/orchagent: :- bake: Add warm input: LAG_TABLE, 0
Aug 10 21:06:19.244760 sonic NOTICE swss/orchagent: :- bake: Add warm input: FDB_TABLE, 0
Aug 10 21:06:19.244821 sonic NOTICE swss/orchagent: :- bake: Add warm input: VRF, 0
Aug 10 21:06:19.244881 sonic NOTICE swss/orchagent: :- bake: Add warm input: FLEX_COUNTER_TABLE, 3
Aug 10 21:06:19.244940 sonic NOTICE swss/orchagent: :- bake: Add warm input: PFC_WD_TABLE, 0

The idea is a best-effort warm reboot based on leftover entries in the ConfigDB/AppDB/StateDB tables.

During a cold reboot, the tables either hold initial data or are empty. In either case, the initial data is popped and moved to the corresponding consumer's m_toSync, and later processed by doTask(). The original behavior is therefore kept.

During a warm reboot, the tables hold initial data. The initial data is popped and moved to the corresponding consumer's m_toSync, and later processed by the warm reboot logic (TODO). The warm reboot path has not been tested end to end.
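
The flow described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: FakeTable, FieldValues, and drain() are hypothetical stand-ins, and only the names refillToSync(), bake(), doTask(), and m_toSync are taken from the description and the reviewed snippet.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <map>
#include <string>
#include <utility>
#include <vector>

using FieldValues = std::map<std::string, std::string>;

// Hypothetical stand-in for a redis-backed table holding leftover entries.
struct FakeTable {
    std::string name;
    std::deque<std::pair<std::string, FieldValues>> leftover;
};

// Simplified consumer: owns the m_toSync staging map for one table.
class Consumer {
public:
    explicit Consumer(FakeTable& table) : m_table(table) {}

    const std::string& tableName() const { return m_table.name; }

    // Pop every leftover entry into m_toSync; return how many were popped.
    std::size_t refillToSync() {
        std::size_t popped = 0;
        while (!m_table.leftover.empty()) {
            auto entry = std::move(m_table.leftover.front());
            m_table.leftover.pop_front();
            m_toSync[entry.first] = std::move(entry.second);  // a later entry for the same key wins
            ++popped;
        }
        assert(m_table.leftover.empty());
        return popped;
    }

    // Hypothetical stand-in for per-consumer processing: consume everything staged.
    std::size_t drain() {
        std::size_t processed = m_toSync.size();
        m_toSync.clear();
        return processed;
    }

private:
    FakeTable& m_table;
    std::map<std::string, FieldValues> m_toSync;
};

// Simplified Orch: bake() refills every consumer before the main loop runs.
class Orch {
public:
    void addConsumer(FakeTable& table) { m_consumers.emplace_back(table); }

    // Returns per-table counts, mirroring the "Add warm input" log lines.
    std::map<std::string, std::size_t> bake() {
        std::map<std::string, std::size_t> counts;
        for (auto& consumer : m_consumers) {
            counts[consumer.tableName()] = consumer.refillToSync();
        }
        return counts;
    }

    // Process whatever bake() staged; returns the number of entries handled.
    std::size_t doTask() {
        std::size_t processed = 0;
        for (auto& consumer : m_consumers) processed += consumer.drain();
        return processed;
    }

private:
    std::vector<Consumer> m_consumers;
};
```

In this sketch the cold and warm paths differ only in what would consume m_toSync afterwards; the staging step is identical, which is the point the description makes.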

Signed-off-by: Qi Luo <qiluo-msft@users.noreply.github.com>
Signed-off-by: Qi Luo <qiluo-msft@users.noreply.github.com>
@jipanyang
Contributor

jipanyang commented Aug 10, 2018

What is the reason you keep re-implementing functions that are already covered in other PRs?

More explanation about the overlapping functionality:

All the configDB cases are already handled by the few lines of code changed at this link:
https://github.com/Azure/sonic-swss/pull/554/files#diff-59ec609e195cd92cf17a204ea7b57feeR335

Eventually we want to handle the case of configDB changes, probably by freezing configDB at the pre-warm-restart checkpoint, so that no config change happens during the restart window. #Resolved

@qiluo-msft
Contributor Author

qiluo-msft commented Aug 10, 2018

First, the credit for implementing warm reboot should go to you. My implementation is inspired by your solution. I admit your solution works, and so does mine. There are some design principles we should consider for comparison.

  1. The Consumer can restore its previous state not because it has a table in CONFIG_DB, but because it has a SubscriberStateTable.
  2. The SubscriberStateTable by design has two data sources and two phases: first the initial table content, and later the field value updates. The first data source and phase fit the design of Orch::addExistingData() and Consumer::refillToSync() perfectly, so it is a natural extension. Ref: https://github.com/Azure/sonic-swss-common/blob/master/common/subscriberstatetable.cpp
  3. The second phase of SubscriberStateTable is for config updates. By design, this should not be handled by warm reboot, but processed normally after warm reboot.
  4. SubscriberStateTable/ConsumerTable/ConsumerStateTable are on the same level of abstraction. State restoring should be implemented on that same level. The best implementation would be if-else branching or virtual functions.
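
To make the two phases in points 2 and 3 concrete, here is a toy model. It is only a sketch under my own assumptions: TwoPhaseSource, Change, notify(), and inInitialPhase() are hypothetical names for illustration and do not reproduce the swss-common SubscriberStateTable API.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// One table change; op is "SET" or "DEL" in this toy model.
struct Change {
    std::string key;
    std::string op;
};

// Hypothetical two-phase source: an initial snapshot first, live updates after.
class TwoPhaseSource {
public:
    explicit TwoPhaseSource(const std::vector<std::string>& existingKeys) {
        // Phase 1 data: everything already in the table when we subscribe.
        for (const auto& key : existingKeys) m_initial.push_back({key, "SET"});
    }

    // Phase 2 data: a change notification that arrives after subscription.
    void notify(Change change) { m_updates.push_back(std::move(change)); }

    bool inInitialPhase() const { return !m_initial.empty(); }

    // Drain the snapshot first; only once it is empty, drain live updates.
    std::deque<Change> pops() {
        std::deque<Change> out;
        if (!m_initial.empty()) {
            out.swap(m_initial);
            assert(m_initial.empty());
            return out;
        }
        out.swap(m_updates);
        return out;
    }

private:
    std::deque<Change> m_initial;
    std::deque<Change> m_updates;
};
```

Under this model, state restoring would consume only the first pops() result, and everything returned afterwards would go through the normal config-update path.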

I should point out the not-so-good design choices I found:

  1. At the abstraction level above Consumer, the user should not care too much about the underlying redis database index and table name.
  2. Introducing new collaboration between processes, such as freezing redis or sending a special warm reboot event. I don't mean these are totally forbidden, but for warm reboot they are not necessary. The design of every component can handle the use case well, with some reasonable extension like this PR.

Let me know if you find more design or implementation issues in this PR.


In reply to: 411946969

@jipanyang
Contributor

jipanyang commented Aug 10, 2018

Your change doesn't solve the problem of configDB changes during the restart window.
SubscriberStateTable will pick up whatever is available in the configDB table the moment orchagent starts.

Unless configDB deploys a write/read mechanism like producerState/consumerState, freezing configDB or simply avoiding changes to it during warm restart are the only solutions I can think of.

The changes made here have the same effect of restoring the SubscriberStateTable consumer as the link provided, though with many more lines of code, including the ripple effect on all other cases.

#Resolved

@qiluo-msft
Contributor Author

I agree that "Your change doesn't solve the problem of configDB change during restart window." So I am open to freezing redis if needed.

Could you explain "ripple effect on all other cases"? What exactly is wrong? A use case or code discussion may be helpful here.


In reply to: 411973453

@jipanyang
Contributor

jipanyang commented Aug 10, 2018

If you take this approach, every other orch that uses a configDB table has to make similar changes.

The original problem has been solved with 5 lines of code changes. Now you want to solve the same problem again with probably more than 100 lines of code, but without any real extra benefit.

Maybe I missed something, but my personal opinion is that this is a typical case of over-engineering.

#Resolved

@lguohan
Contributor

lguohan commented Aug 10, 2018

In warm reboot, if we sync once with config DB and then do not process any new config DB changes until we are done with warm reboot, that seems like it will work. If this is already addressed in #554, what problem are we trying to solve in this PR? #Resolved
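
The "sync once, then hold back new config changes until warm reboot is done" idea can be sketched as a small gate. This is only an illustration under assumptions of mine: ConfigGate, onConfigChange(), and finishWarmRestart() are hypothetical names, and nothing here is confirmed to be what either PR implements.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical gate: defers config changes while a warm restart is in progress.
class ConfigGate {
public:
    using Handler = std::function<void(const std::string&)>;

    explicit ConfigGate(Handler handler) : m_handler(std::move(handler)) {}

    // While restoring, changes are only queued; afterwards they are applied directly.
    void onConfigChange(const std::string& key) {
        if (m_restoring) {
            m_deferred.push_back(key);
        } else {
            m_handler(key);
        }
    }

    // End the restart window and replay deferred changes in arrival order.
    // Returns how many deferred changes were replayed.
    std::size_t finishWarmRestart() {
        m_restoring = false;
        std::vector<std::string> deferred;
        deferred.swap(m_deferred);
        assert(m_deferred.empty());
        for (const auto& key : deferred) m_handler(key);
        return deferred.size();
    }

private:
    Handler m_handler;
    bool m_restoring = true;  // assumption: the gate starts inside the restart window
    std::vector<std::string> m_deferred;
};
```

Deferring rather than dropping is a design choice of the sketch: a change made during the restart window is still applied, just after state restoration completes.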

@jipanyang
Contributor

jipanyang commented Aug 10, 2018

If you really want to go with this approach, I'm fine with it, but please have all the configDB/stateDB cases handled in the PR, so that the solution is complete and integration tests can be done to find any potential issues.

#Resolved

Signed-off-by: Qi Luo <qiluo-msft@users.noreply.github.com>
Signed-off-by: Qi Luo <qiluo-msft@users.noreply.github.com>
@qiluo-msft qiluo-msft changed the title from "Warm reboot for BufferOrch" to "Warm reboot for general Orch" Aug 10, 2018
@qiluo-msft
Contributor Author

I finally get what you are worried about. The suggestion is pretty good, and I added a new iteration. The PR scope is now broader: it provides a general solution for all the Orch-derived classes.


In reply to: 411999953

@qiluo-msft
Contributor Author

The configDB/stateDB cases are already handled in the PR. The warm reboot logic is a TODO item. All the cold start behavior is kept.


In reply to: 412178195

@qiluo-msft
Contributor Author

Better design.


In reply to: 412009165

}

size_t refilled = consumer->refillToSync();
SWSS_LOG_NOTICE("Add warm input: %s, %zd", executorName.c_str(), refilled);
Contributor

@lguohan lguohan Aug 10, 2018

Is DEBUG level better? #Resolved

Contributor Author

I prefer NOTICE since it is important and a one-time thing. The total is about 35 lines.


In reply to: 209397186

@lguohan
Contributor

lguohan commented Aug 10, 2018

@jipanyang to review the new design. #Resolved

@jipanyang
Contributor

jipanyang commented Aug 11, 2018

The new design looks good to me.

Some extra comments here:
When the implementation is complex enough, there is always room for improvement or an alternative way of implementing it. For major functions in pending PRs, if you think you can do them better, that is great; we do want a better system. But please have the courtesy to put a note in the original PR first, so people know what to expect and can avoid duplicated effort. #Resolved

@qiluo-msft
Contributor Author

Thanks Jipan for reviewing. Will do.


In reply to: 412237391

@qiluo-msft qiluo-msft merged commit c429ddb into sonic-net:master Aug 11, 2018
@qiluo-msft qiluo-msft deleted the qiluo/warmbuffer branch August 11, 2018 00:46
EdenGri pushed a commit to EdenGri/sonic-swss that referenced this pull request Feb 28, 2022
@@ -56,7 +56,7 @@ class PortsOrch : public Orch, public Subject
bool isInitDone();

map<string, Port>& getAllPorts();
- bool bake();
+ bool bake() override;
Contributor

Why does PortsOrch have its own bake() instead of using Orch::bake()?
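
One plausible reading, based only on the sample log in the PR description ("foundPortConfigDone = 0", "foundPortInitDone = 0", "No port table, fallback to cold start"), is that the port table needs a completeness check before its leftover state can be trusted. The sketch below illustrates that pattern with a virtual override. It is an assumption-laden illustration, not the actual PortsOrch code: OrchSketch and PortsOrchSketch are hypothetical classes, the marker key names "PortConfigDone" and "PortInitDone" are inferred from the log variable names, and clearing the keys on fallback is my own guess.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical base class: the generic bake() accepts whatever is in the tables.
class OrchSketch {
public:
    virtual ~OrchSketch() = default;

    // Returns true when leftover state was accepted as warm input.
    virtual bool bake() {
        m_warm = true;
        return true;
    }

    bool warm() const { return m_warm; }

protected:
    bool m_warm = false;
};

// Hypothetical override: refuse leftover port state unless both markers exist.
class PortsOrchSketch : public OrchSketch {
public:
    explicit PortsOrchSketch(std::vector<std::string> portTableKeys)
        : m_keys(std::move(portTableKeys)) {}

    bool bake() override {
        bool foundPortConfigDone = false;
        bool foundPortInitDone = false;
        for (const auto& key : m_keys) {
            if (key == "PortConfigDone") foundPortConfigDone = true;  // assumed marker name
            if (key == "PortInitDone") foundPortInitDone = true;      // assumed marker name
        }
        if (!foundPortConfigDone || !foundPortInitDone) {
            // Incomplete port table: drop it and fall back to cold start (assumed behavior).
            m_keys.clear();
            assert(m_keys.empty());
            return false;
        }
        return OrchSketch::bake();  // markers present: reuse the generic path
    }

private:
    std::vector<std::string> m_keys;
};
```

Called through a base-class reference, the override still runs, which is what marking bake() as override in the diff above guarantees at compile time.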

5 participants