-
Notifications
You must be signed in to change notification settings - Fork 543
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Warm reboot for general Orch #572
Conversation
Signed-off-by: Qi Luo <qiluo-msft@users.noreply.github.com>
Signed-off-by: Qi Luo <qiluo-msft@users.noreply.github.com>
What is the reason you kept re-implementing the functions that have been covered in other PRs? More explanation about the overlapped function: All the configDB cases have been handled in the few lines of code change as the link: Eventually we want to handle the case of configDB change, probably via freezing configDB at pre-warm restart check point, so no config change will happen during the restart window. #Resolved |
First the credit for implementing warm reboot should be given to you. My implementation is just inspired by your solution. I admitted your solution is working, also mine. There are some design principle we should consider for comparison.
I should point not-that-good design choices I found:
Let me know if you find more design or implementation issues in this PR. In reply to: 411946969 [](ancestors = 411946969) |
Your change doesn't solve the problem of configDB change during restart window. Unless configDB deploy write/read mechanism like producerState/consumerState, freezing configDB, or simply avoid changing it during warm restart are the solution I could think of. The changes made here has the same effect of restoring SubscriberStateTable consumer as the Link provided, though with a lot more lines of code including the ripple effect on all other cases. #Resolved |
I agree "Your change doesn't solve the problem of configDB change during restart window." So I am open with freezing redis if needed. Could you explain "ripple effect on all other cases"? What exactly is wrong? A use case or code discussion may be helpful here. In reply to: 411973453 [](ancestors = 411973453) |
If you take this approach, all other orch which use configDB table have to make similar changes. The original problem has been solved with 5 lines of code changes, now you want to solve the same problem again with probably more than 100 lines of code, but without any real extra benefit. Maybe I missed something, but my personal opinion is that this is a typical case of over-engineering. #Resolved |
in warm reboot, if we sync once with config db, and do then do not proceed any new config db changes until we are done with warm reboot. that seems will work. if this is already addressed in #554, what is problem we are trying to solve in this PR? #Resolved |
If you really want to go with this approach, I'm fine with it, but please have all the configDB/stateDB cases handled in the PR. So the solution is complete and integration test may be done to figure out any potential issues. #Resolved |
Signed-off-by: Qi Luo <qiluo-msft@users.noreply.github.com>
Signed-off-by: Qi Luo <qiluo-msft@users.noreply.github.com>
I finally get what you are worried about. The suggestion is pretty good, and I add new iteration. Now the PR scope is changed to broader, which provides a general solution for all the Orch derived classes. In reply to: 411999953 [](ancestors = 411999953) |
Already handled configDB/stateDB cases handled in the PR. The warm reboot logic is TODO item. All the cold start behavior is kept. In reply to: 412178195 [](ancestors = 412178195) |
Better design. In reply to: 412009165 [](ancestors = 412009165) |
} | ||
|
||
size_t refilled = consumer->refillToSync(); | ||
SWSS_LOG_NOTICE("Add warm input: %s, %zd", executorName.c_str(), refilled); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DEBUG LEVEL IS better? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer NOTICE since it is important and one time thing. The total lines are about 35.
In reply to: 209397186 [](ancestors = 209397186)
@jipanyang to review the new design. #Resolved |
The new design looks good to me. Some extra comments here: |
Thanks Jipan for reviewing. Will do. In reply to: 412237391 [](ancestors = 412237391) |
@@ -56,7 +56,7 @@ class PortsOrch : public Orch, public Subject | |||
bool isInitDone(); | |||
|
|||
map<string, Port>& getAllPorts(); | |||
bool bake(); | |||
bool bake() override; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why PortsOrch has its own bake() instead of using Orch::bake()?
The non-warm reboot behavior is backward compatible, and tested in lab. Sample log
The idea is best effort warm reboot based on left over entries in ConfigDB/AppDB/StateDB tables.
During a cold reboot, there are initial data in tables or they are empty. In either case, the initial data is popped and moved to corresponding consumer's m_toSync, and later processed by doTask(). So keep original behavior.
During a warm reboot, there are initial data in tables. The initial data is popped and moved to corresponding consumer's m_toSync, and later processed by warm reboot logic (TODO). The warm reboot is not end-to-end tested.