Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix statefulset Pod recreate case #827

Merged
merged 1 commit into from
Jun 15, 2020
Merged

Conversation

tnqn
Copy link
Member

@tnqn tnqn commented Jun 11, 2020

When deleting a StatefulSet Pod with 0 second grace period, the Pod
might be removed from the Kubernetes API very quickly and a new Pod is
created immediately, leading to kubelet processing the deletion of the
previous Pod and the add of the new Pod simultaneously, so CNI cannot
assume PodNamespace+PodName must be unique.

Previously antrea-agent computed the network device name by hashing
PodNamespace and PodName, the CNI DEL of the previous Pod could destroy
the network device of the new Pod if it's executed after the new one's
CNI ADD.

To fix it, antrea-agent must no longer use PodNamespace+PodName as the
identifier of the Pod's network device. This patch uses (sandbox)
container ID as part of the network device name, and uses container ID
as the key when caching interfaceConfig, therefore multiple Pods that
have the same name will be handled individually without interfering with
each other.

To be compatible with network devices created by previous versions, CNI
DEL handler doesn't assume how the network device name is computed and
always looks up network config by container ID.

For #785

@antrea-bot
Copy link
Collaborator

Thanks for your PR.
Unit tests and code linters are run automatically every time the PR is updated.
E2e, conformance and network policy tests can only be triggered by a member of the vmware-tanzu organization. Regular contributors to the project should join the org.

The following commands are available:

  • /test-e2e: to trigger e2e tests.
  • /skip-e2e: to skip e2e tests.
  • /test-conformance: to trigger conformance tests.
  • /skip-conformance: to skip conformance tests.
  • /test-networkpolicy: to trigger networkpolicy tests.
  • /skip-networkpolicy: to skip networkpolicy tests.
  • /test-windows-conformance: to trigger windows conformance tests.
  • /skip-windows-conformance: to skip windows conformance tests.
  • /test-all: to trigger all tests.
  • /skip-all: to skip all tests.

These commands can only be run by members of the vmware-tanzu organization.

@tnqn tnqn force-pushed the ss-race branch 5 times, most recently from 6533e72 to c970d3a Compare June 11, 2020 10:09
@tnqn
Copy link
Member Author

tnqn commented Jun 11, 2020

/test-all

@tnqn tnqn force-pushed the ss-race branch 3 times, most recently from b680239 to 5c36540 Compare June 11, 2020 13:51
@tnqn
Copy link
Member Author

tnqn commented Jun 11, 2020

/test-all

@tnqn tnqn force-pushed the ss-race branch 2 times, most recently from edbdc80 to c846ed5 Compare June 11, 2020 15:26
@tnqn tnqn changed the title WIP: Fix statefulset Pod recreate case Fix statefulset Pod recreate case Jun 11, 2020
@tnqn
Copy link
Member Author

tnqn commented Jun 11, 2020

/test-all

@tnqn
Copy link
Member Author

tnqn commented Jun 11, 2020

/test-all

@jayunit100
Copy link
Contributor

AWESOME ! thanks quan !

@jayunit100
Copy link
Contributor

is there a docker image pushed somewhere that i can test for this ?

@tnqn
Copy link
Member Author

tnqn commented Jun 11, 2020

is there a docker image pushed somewhere that i can test for this ?

@jayunit100 Yes, "qtian/antrea-ubuntu:v0.8.0-dev-a7a41c3" has the commit. I'm testing it with both containerd and dockerd. Both have passed several times consistently.
It would be great if you can try it on your cluster in case of any configuration difference that could affect the result.

@tnqn
Copy link
Member Author

tnqn commented Jun 11, 2020

I have run "certified-conformance" with containerd 3 times and docker 1 time, all of them passed.
@wenyingd helped verify all sig-windows tests passed.

Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, only a few nits with respect to comments. Thanks for fixing this so quickly and I like the switch to the K8s cache library for the interface store.

I'm pretty sure it is the case after reviewing the diff, but I just want to confirm / call out that:

  • there is no change no the information we store in OVSDB (good thing we were storing the container ID to start with)
  • there is no impact on the Antrea upgrade process: new version will be able to handle Cmd Checks and Cmd Dels for exiting Pods for which the networking was configured by a previous version of Antrea

There are a few typos in the commit message I think:

  • s/graceful period/grace period
  • s/Previously antrea-agent computes/Previously antrea-agent computed
  • s/This patches uses/This patch uses
  • s/then multiple Pods that have the name/therefore multiple Pods that have the same name
  • s/always look up network config by container ID/always looks up network config by container ID

pkg/agent/cniserver/server_windows.go Outdated Show resolved Hide resolved
pkg/agent/cniserver/server_windows.go Outdated Show resolved Hide resolved
pkg/agent/interfacestore/interface_cache.go Outdated Show resolved Hide resolved
pkg/agent/util/net.go Outdated Show resolved Hide resolved
pkg/agent/util/net.go Outdated Show resolved Hide resolved
pkg/agent/util/net.go Outdated Show resolved Hide resolved
pkg/agent/cniserver/pod_configuration.go Outdated Show resolved Hide resolved
@@ -33,6 +34,10 @@ const (

type InterfaceType uint8

func (t InterfaceType) String() string {
return strconv.Itoa(int(t))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this ever used for user-facing logs, or just for the cache index? If it is used for user-facing logs, maybe we could have a switch statement and return a more user-friendly string?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently only for cache index.

@tnqn tnqn force-pushed the ss-race branch 2 times, most recently from 46f4557 to 0f8c711 Compare June 15, 2020 04:30
Copy link
Member Author

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure it is the case after reviewing the diff, but I just want to confirm / call out that:

  • there is no change no the information we store in OVSDB (good thing we were storing the container ID to start with)
  • there is no impact on the Antrea upgrade process: new version will be able to handle Cmd Checks and Cmd Dels for exiting Pods for which the networking was configured by a previous version of Antrea

Both are correct. CmdDel and CmdCheck won't even read PodNamespace and PodName arguments and will use ContainerID to look up InterfaceCache to get the OVS port and hostIface.

Thanks a lot for your careful review. Addressed all the comments.

pkg/agent/cniserver/server_windows.go Outdated Show resolved Hide resolved
pkg/agent/cniserver/server_windows.go Outdated Show resolved Hide resolved
pkg/agent/interfacestore/interface_cache.go Outdated Show resolved Hide resolved
@@ -33,6 +34,10 @@ const (

type InterfaceType uint8

func (t InterfaceType) String() string {
return strconv.Itoa(int(t))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently only for cache index.

pkg/agent/util/net.go Outdated Show resolved Hide resolved
pkg/agent/util/net.go Outdated Show resolved Hide resolved
pkg/agent/util/net.go Outdated Show resolved Hide resolved
When deleting a StatefulSet Pod with 0 second grace period, the Pod
might be removed from the Kubernetes API very quickly and a new Pod is
created immediately, leading to kubelet processing the deletion of the
previous Pod and the add of the new Pod simultaneously, so CNI cannot
assume PodNamespace+PodName must be unique.

Previously antrea-agent computed the network device name by hashing
PodNamespace and PodName, the CNI DEL of the previous Pod could destroy
the network device of the new Pod if it's executed after the new one's
CNI ADD.

To fix it, antrea-agent must no longer use PodNamespace+PodName as the
identifier of the Pod's network device. This patch uses (sandbox)
container ID as part of the network device name, and uses container ID
as the key when caching interfaceConfig, therefore multiple Pods that
have the same name will be handled individually without interfering with
each other.

To be compatible with network devices created by previous versions, CNI
DEL handler doesn't assume how the network device name is computed and
always looks up network config by container ID.
// See https://github.com/kubernetes/kubernetes/issues/57253#issuecomment-358897721.
_, found := pc.ifaceStore.GetContainerInterface(podName, podNameSpace)
_, found := pc.ifaceStore.GetContainerInterface(containerID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that this containerID is the workload container on Windows? For I didn't see the logics to find the infraContainerID before.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, createOVSPort is false for workload container, it doesn't try to create ovs port for workload container whether the port exists or not.

@tnqn
Copy link
Member Author

tnqn commented Jun 15, 2020

/test-all

@tnqn
Copy link
Member Author

tnqn commented Jun 15, 2020

/test-networkpolicy

Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@antoninbas
Copy link
Contributor

Merging this for 0.7.2 release

@antoninbas antoninbas merged commit e77a28a into antrea-io:master Jun 15, 2020
Copy link
Contributor

@jianjuns jianjuns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for not reviewing in time. I have no extra comments for the change, but have a question - I remember we had another approach of using Pod UID to the identifier which might be simpler, but I forgot how we get Pod UID in CNI. Probably let us discuss offline on this.

@antoninbas
Copy link
Contributor

@jianjuns related to this kubernetes/kubernetes#62900 ?

@jianjuns
Copy link
Contributor

@jianjuns related to this kubernetes/kubernetes#62900 ?

Yes I know this issue, but remember we used to have another solution to get Pod UID. Maybe I am wrong.
BTW, my main concern for this change is that going forward many features and more code might need to know about the possibility of multiple Pods using the same name (though seems the code change is simpler than I thought).

@tnqn
Copy link
Member Author

tnqn commented Jun 16, 2020

BTW, my main concern for this change is that going forward many features and more code might need to know about the possibility of multiple Pods using the same name (though seems the code change is simpler than I thought).

Even if we are able to get Pod UID from CNI requests in future, the fact that multiple Pods using the same name may coexist at some point is still valid and the code must be aware of it.

@jianjuns
Copy link
Contributor

Even if we are able to get Pod UID from CNI requests in future, the fact that multiple Pods using the same name may coexist at some point is still valid and the code must be aware of it.

I meant if Agent knows about Pod UID, all inputs/code can use Pod UID as the key, and even Controller can add Pod UID to Agent messages. Then most code (except CLI/API) needs not to handle duplicated names.

@tnqn
Copy link
Member Author

tnqn commented Jun 16, 2020

Even if we are able to get Pod UID from CNI requests in future, the fact that multiple Pods using the same name may coexist at some point is still valid and the code must be aware of it.

I meant if Agent knows about Pod UID, all inputs/code can use Pod UID as the key, and even Controller can add Pod UID to Agent messages. Then most code (except CLI/API) needs not to handle duplicated names.

Agree, but it doesn't affect the change made in this PR: not to compute network interface name from Pod name, not to use Pod name as the primary key of config cache?

@jianjuns
Copy link
Contributor

Yes. I just meant the part of changing NetworkPolicyController etc. to update for multiple Pods with the same name. Anyway, as do not have a way to know Pod UID from CNI, probably let us keep the current approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants