Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add host-local IPAM GC on startup #5660

Merged
merged 2 commits into from
Nov 15, 2023

Conversation

antoninbas
Copy link
Contributor

During CNIServer reconciliation, we perform host-local IPAM garbage collection (GC) by comparing the set of IPs allocated to local Pods and the set of IPs currently reserved by the plugin. We release any IP reserved by the plugin that is not in-use by a local Pod. The purpose is to avoid leaking IP addresses when there is a bug in the container runtime, which has happened in the past.

Two key design choices that were made:

  • We do not invoke CNI DEL to release IPs, instead we access the host-local data which is persisted on the Node, and modify it as needed.
  • We do not rely on the interface store (as persisted to OVSDB) to determine the set of IPs that may have been leaked. In case of an Antrea bug, it could be possible (although unlikely) for an IP to still be allocated by host-local but be missing from the interface store. Intead, we list all allocated IPs from the host-local data (an allocated IP corresponds to one disk file).

This approach is essentially the same as our existing script: https://github.com/antrea-io/antrea/blob/main/hack/gc-host-local.sh

Fixes #4326

@antoninbas
Copy link
Contributor Author

/test-all
/test-windows-all

@antoninbas antoninbas marked this pull request as ready for review November 13, 2023 22:35
@antoninbas antoninbas requested a review from tnqn November 13, 2023 22:35
@antoninbas antoninbas added action/release-note Indicates a PR that should be included in release notes. area/component/cni Issues or PRs related to the cni component labels Nov 13, 2023
return fmt.Errorf("path '%s' is not a directory: %w", dir, err)
}

lk, err := NewFileLock(dataDir)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a pretty bad typo, fixed

return fmt.Errorf("error when gathering IP filenames in the host-local data directory: %w", err)
}

allocatedIPs := sets.New[string]()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it seems enough to use a bool to track if it ever fails to release an IP.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

"github.com/alexflint/go-filemutex"
)

// This code was copied from https://github.com/containernetworking/plugins/blob/v1.3.0/plugins/ipam/host-local/backend/disk/lock.go
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we import github.com/containernetworking/plugins/plugins/ipam/host-local/backend given the package is already a dependency?

antrea/go.mod

Line 22 in 4a52bca

github.com/containernetworking/plugins v1.1.1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we actually use it pretty extensively. Changed to an import.

During CNIServer reconciliation, we perform host-local IPAM garbage
collection (GC) by comparing the set of IPs allocated to local Pods and
the set of IPs currently reserved by the plugin. We release any IP
reserved by the plugin that is not in-use by a local Pod. The purpose is
to avoid leaking IP addresses when there is a bug in the container
runtime, which has happened in the past.

Two key design choices that were made:
* We do not invoke CNI DEL to release IPs, instead we access the
  host-local data which is persisted on the Node, and modify it as
  needed.
* We do not rely on the interface store (as persisted to OVSDB) to
  determine the set of IPs that may have been leaked. In case of an
  Antrea bug, it could be possible (although unlikely) for an IP to
  still be allocated by host-local but be missing from the interface
  store. Intead, we list all allocated IPs from the host-local data (an
  allocated IP corresponds to one disk file).

This approach is essentially the same as our existing script:
https://github.com/antrea-io/antrea/blob/main/hack/gc-host-local.sh

Fixes antrea-io#4326

Signed-off-by: Antonin Bas <abas@vmware.com>
@antoninbas antoninbas force-pushed the cniserver-ipam-gc-2 branch 2 times, most recently from 37a66c4 to 23c2725 Compare November 14, 2023 19:56
@antoninbas antoninbas requested a review from tnqn November 14, 2023 19:57
@antoninbas antoninbas force-pushed the cniserver-ipam-gc-2 branch 2 times, most recently from 6b33047 to 17f523c Compare November 14, 2023 23:18
tnqn
tnqn previously approved these changes Nov 15, 2023
Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

pkg/agent/cniserver/ipam/ipam_delegator.go Outdated Show resolved Hide resolved
Signed-off-by: Antonin Bas <abas@vmware.com>
Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@antoninbas
Copy link
Contributor Author

/test-all
/test-windows-all

@antoninbas antoninbas merged commit 6acd0d9 into antrea-io:main Nov 15, 2023
53 of 59 checks passed
@antoninbas antoninbas deleted the cniserver-ipam-gc-2 branch November 15, 2023 18:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
action/release-note Indicates a PR that should be included in release notes. area/component/cni Issues or PRs related to the cni component
Projects
None yet
Development

Successfully merging this pull request may close these issues.

IP may leak after agent restart
2 participants