-
Notifications
You must be signed in to change notification settings - Fork 880
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add network restore to support docker live restore container #1244
Conversation
Corresponding changes to @coolljt0725 docker/docker commit can be found at https://github.com/aboch/docker/commit/5f3a9a7f2ed5df06aca88a28ac26ca59a9486069:
|
@@ -198,7 +201,7 @@ func (c *controller) sandboxCleanup() { | |||
dbExists: true, | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aboch let's do
// restore sb.config
sb.processOptions(option...)
sb.restorePath()
here
instead of at https://github.com/docker/libnetwork/pull/1244/files#diff-ffc9c06e306285f214919ed203907012R707
because sb.Key()
relay on sb.config.useDefaultSandBox
I had this change on local but forget to push to github in time :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I am refactoring that code. I am now restoring in cleanupSandboxes()
Thanks for the suggestion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aboch nice refactor :)
@aboch should we also rename the |
|
@coolljt0725 AFAIK |
@chenchun yes |
@aboch |
Ah, thanks @coolljt0725 That needs attention. Will think about it tomorrow. |
@@ -1009,6 +1012,10 @@ func (d *driver) CreateEndpoint(nid, eid string, ifInfo driverapi.InterfaceInfo, | |||
} | |||
} | |||
|
|||
if err := d.storeUpdate(endpoint); err != nil { | |||
logrus.Warnf("Failed to save bridge endpoint %s to store: %v", ep.id[0:7], err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we return error here given we will invoke necessary rollback functions on error ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The roll-back function should not be invoked given the if scope of err
.
I do not want to fail the container start if we cannot save it to the bridge store.
The container may not even live long enough to go through a daemon reload.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a problem in failing the operation if the storeUpdate
fails ?
I think that will be consistent with other store failures in the create path.
(Its understandable that we can warn on Delete - since there is no recovery path).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can fail, just that we will fail even if the user did not start the daemon with the restore flag.
Anyhow, I think that if the save to store fails for this, it would fail for many others, so I do not think the system would be usable anyway. I will make the change.
some test cases about this on daemon side is moby/moby@4cfcd60, I'll add more test cases tomorrow. I'm busy today and got no time to test this and look into this, I'll do this tomorrow. |
// TODO: Do we need this change ? | ||
// check if the rule exists | ||
var ( | ||
doCheck bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@coolljt0725 Do you think we still need this change ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aboch This is not need any more since we will remove the iptables during bridge driver initialization in https://github.com/docker/libnetwork/blob/master/drivers/bridge/bridge.go#L381
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that's what I thought. Thanks for confirming.
@aboch overall I think the refactor has simplified the interactions. Yes, it adds the states to the store. I guess that is okay given the requirement. @coolljt0725 wdyt ? |
@mavenugo I'm agree with this refactor. it's much more simple 👍 |
@coolljt0725 Please let us know if you got a chance to test this out and / or added more test cases |
restoring the overlay network sandbox has some issues. After daemon restart starting another container on an existing network fails. Will update the PR when its addressed. |
@aboch For overlay networks, after a daemon restart I was getting this error when starting a new container..
This was a because of a bug if the osl sandbox restore logic. Its a one liner.. You can make this change in the overlay commit included in the PR |
Thanks @sanimej will update shortly |
@mavenugo working on it :) |
sb.isStub = false | ||
isRestore = true | ||
opts := val.([]SandboxOption) | ||
sb.processOptions(opts...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing sb.restorePath()
here, this will cause docker network connect /disconnect failed on old running container
docker network connect foo 882d7eb8a982 Error response from daemon: rename /var/lib/docker/containers/882d7eb8a982986ac37d75dc2ca5fe18cc0cdc37c8307124de73aace68954362/hash708549390 : no such file or directory
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Verified.
The daemon tests case for networking restore is coolljt0725/docker@52d5507 |
if key == "" { | ||
return fmt.Errorf("failed to find old sandbox key %s", osl.GenerateKey(fmt.Sprintf("%d-", n.initEpoch)+n.id)) | ||
} | ||
sbox, nerr := osl.NewSandbox(key, false, true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be osl.NewSandbox(key, true, true)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
I'll test |
@aboch @coolljt0725 @sanimej thanks a lot for reducing the complexity and making it clean/modular. 👍 LGTM |
@sanimej I got this error |
ifaceOptions[j] = append(ifaceOptions[j], sb.osSbox.InterfaceOptions().MacAddress(i.mac)) | ||
} | ||
Ifaces[fmt.Sprintf("%s+%s", i.srcName, i.dstPrefix)] = ifaceOptions[j] | ||
if joinInfo != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing LinkLocalAddresses?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Saw this, but it is not really needed. FWIW only i.addr is needed as it is being evaluated during the restore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For consistency, I added the link-local addresses. Thanks.
one more thing I concern about is shall we pass the active endpoints to driver on driver initialization so that we can clean up used driver endpoints? |
@coolljt0725 Also, as you know, the libnetwork core performs stale endpoint and sandbox cleanup on bootup and also makes sure that the driver endpoint states are cleaned up (via the deleteEndpoint call) before cleaning up its own states. Hence, the cleanup functionality at bootup should handle the stale endpoint issues for ungraceful restart scenarios. Do you see it behaving otherwise ? Did you hit the |
@coolljt0725 Agree with @mavenugo, libnetwork endpoint removal from store will happen only if driver endpoint removal has succeeded. And libnetwork core will perform stale endpoint removal at each boot. So this ensures we won;t end up in acase where drivers has stale endpoints but libnetwork does not. |
@coolljt0725 I have not seen this problem and I have tested several cases in overlay. Given few changes/refactoring have happened, make sure to start from a clean state. I updated https://github.com/aboch/docker/commits/rst with the latest changes and add the --live-restore commit changes. Did you do anything special or just start a container on overlay network after a daemon restart ? |
I just can't run a container with a overlay network without daemon restart |
I didn't see this error in my testing.
Do you mean containers work after a daemon restart ? If you didn't see it before and it happens now it could be because of the new changes. Can you try from a clean state and with a new network ? |
Thanks a lot for all the hard work on this everyone <3 |
Signed-off-by: Lei Jitang <leijitang@huawei.com>
Signed-off-by: Alessandro Boch <aboch@docker.com>
Signed-off-by: Alessandro Boch <aboch@docker.com>
Signed-off-by: Alessandro Boch <aboch@docker.com>
- also in overlay/encryprion.go Signed-off-by: Alessandro Boch <aboch@docker.com>
Signed-off-by: Santhosh Manohar <santhosh@docker.com>
Signed-off-by: Alessandro Boch <aboch@docker.com>
Tested bridge/overlay network with https://github.com/aboch/docker/commits/rst manually. Everything seems fine except that |
Thanks @chenchun. If it is ok with you, we can take care of the duplicate iptables rules as a subsequent bug fix. |
That's ok. |
@coolljt0725 The error you were hitting |
@coolljt0725 @aboch @sanimej @mavenugo Thanks for the marathon effort in getting this done in time before 1.12-RC1. The design and changes LGTM. |
This is an extract from #1135 retaining the changes in libnetwork core to support the network restore, with few modifications:
Plus other commits which bring the endpoint persist logic to the existing network drivers and which reserve with the ipam driver the addresses of the existing local endpoints.
Note: this change has the advantage it can be merged irrespective of docker side changes
@coolljt0725 changes in overlay driver to restore the osl sandbox will be pushed as part of a separate commit.Signed-off-by: Lei Jitang leijitang@huawei.com
Signed-off-by: Alessandro Boch aboch@docker.com