
WIP: Add yaml manifest to cleanup all Antrea resources #307

Closed
wants to merge 1 commit

Conversation

antoninbas (Contributor)

This is still very much a work in progress; I'm opening this PR to
gather feedback on the approach.

For someone deleting Antrea, the steps would be as follows:

  • `kubectl delete -f <path to antrea.yml>`
  • `kubectl apply -f <path to antrea-cleanup.yml>`
  • check that the job has completed with `kubectl -n kube-system get jobs`
  • `kubectl delete -f <path to antrea-cleanup.yml>`

The cleanup manifest creates a DaemonSet that will perform the necessary
deletion tasks on each Node. After the tasks have been completed, the
"status" is reported to the cleanup controller through a custom
resource. Once the controller has received enough statuses (or after a
timeout of 1 minute), the controller job completes and the user can
delete the cleanup manifest.
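
To make the coordination concrete, here is a minimal sketch in Go of how the controller-side wait could work; the `cleanupStatus` type, the `waitForCleanup` helper, and the polling interval are illustrative assumptions, not the actual code in this PR.

```go
package main

import (
	"fmt"
	"time"
)

// cleanupStatus is a hypothetical stand-in for the custom resource that each
// cleanup agent creates once it has finished its deletion tasks on a Node.
type cleanupStatus struct {
	NodeName string
	Success  bool
}

// waitForCleanup polls for status reports until every Node has reported or
// the timeout (1 minute, per the description above) expires, whichever comes first.
func waitForCleanup(expectedNodes int, poll func() []cleanupStatus) []cleanupStatus {
	const timeout = 1 * time.Minute
	deadline := time.Now().Add(timeout)
	for {
		statuses := poll()
		if len(statuses) >= expectedNodes || time.Now().After(deadline) {
			return statuses
		}
		time.Sleep(2 * time.Second)
	}
}

func main() {
	// In the real controller, poll would list the status custom resources
	// through the Kubernetes API; here we fake two Nodes reporting success.
	fake := func() []cleanupStatus {
		return []cleanupStatus{
			{NodeName: "node-1", Success: true},
			{NodeName: "node-2", Success: true},
		}
	}
	statuses := waitForCleanup(2, fake)
	fmt.Printf("received %d status reports\n", len(statuses))
}
```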

Known remaining items:

  • place cleanup binaries (antrea-cleanup-agent and
    antrea-cleanup-controller) in a separate docker image to avoid
    increasing the size of the main Antrea docker image
  • generate manifest with kustomize?
  • find a way to test this as part of CI?
  • update documentation
  • additional cleanup tasks: as of now we only take care of deleting the
    OVS bridge (see the sketch after this list)
  • place cleanup CRD in non-default namespace
  • use kubebuilder instead of the code generator directly (related to
    #16); we probably want to punt this task to a future PR.
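
For the OVS bridge item above, the agent-side task could boil down to something like the following Go sketch; the direct `ovs-vsctl` invocation and the hard-coded bridge name are assumptions for illustration, since the actual implementation would go through Antrea's OVS abstractions and configuration.

```go
package main

import (
	"fmt"
	"os/exec"
)

// deleteOVSBridge removes the Antrea OVS bridge on the local Node by shelling
// out to ovs-vsctl. "--if-exists" makes the call idempotent, so re-running the
// cleanup DaemonSet on an already-cleaned Node is harmless.
func deleteOVSBridge(bridge string) error {
	out, err := exec.Command("ovs-vsctl", "--if-exists", "del-br", bridge).CombinedOutput()
	if err != nil {
		return fmt.Errorf("failed to delete OVS bridge %s: %v (%s)", bridge, err, out)
	}
	return nil
}

func main() {
	// "br-int" is used here as a placeholder bridge name; a real cleanup
	// agent would read the name from the Antrea configuration instead.
	if err := deleteOVSBridge("br-int"); err != nil {
		fmt.Println(err)
	}
}
```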

See #181

@antrea-bot (Collaborator)

Thanks for your PR.
Unit tests and code linters are run automatically every time the PR is updated.
E2e, conformance and network policy tests can only be triggered by a member of the vmware-tanzu organization. Regular contributors to the project should join the org.

The following commands are available:

  • /test-e2e: to trigger e2e tests.
  • /skip-e2e: to skip e2e tests.
  • /test-conformance: to trigger conformance tests.
  • /skip-conformance: to skip conformance tests.
  • /test-networkpolicy: to trigger networkpolicy tests.
  • /skip-networkpolicy: to skip networkpolicy tests.
  • /test-all: to trigger all tests.
  • /skip-all: to skip all tests.

These commands can only be run by members of the vmware-tanzu organization.

@jianjuns (Contributor)

@antoninbas: do you think we can reuse the AgentInfo and ControllerInfo CRDs for the cleanup task status, instead of adding another CRD?

Also two other thoughts:

  1. We need to provide a way to clean up a Node when K8s is down. Should we consider antctl for that?
  2. We might not be able to do cleanup with a DS/Pod for Windows. In that case, do you assume some PowerShell scripts to start the cleanup service?
    I am wondering if it would be easier to let the Agent perform the cleanup in that case.

@jianjuns (Contributor)

  2. We might not be able to do cleanup with a DS/Pod for Windows. In that case, do you assume some PowerShell scripts to start the cleanup service?
    I am wondering if it would be easier to let the Agent perform the cleanup in that case.

I thought it over. For Windows, we should use PowerShell to install Controllers and Agents, so it makes sense to use PowerShell to uninstall and start the cleanup as well. But in that case, we do not need coordination between the cleanup-agent and cleanup-controller; we just need a one-off execution that cleans up both Controller and Agent resources.
@wenyingd might have a better idea.

@antoninbas (Contributor, Author)

@jianjuns FYI, I'm planning on starting work again on this PR this week and addressing these 2 comments from you:

  • try to reuse existing CRD API groups to report status back to the deletion controller
  • provide a "local" cleanup mechanism on a single Node when K8s is down (through antctl)
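
For the antctl-based local cleanup, here is a minimal sketch of what such a subcommand could look like, assuming a cobra-based CLI; the `cleanup` command name, its description, and the placeholder body are all illustrative and not code from this PR.

```go
package main

import (
	"fmt"

	"github.com/spf13/cobra"
)

// newCleanupCommand sketches a hypothetical "antctl cleanup" subcommand for
// removing Antrea state from the local Node when the K8s API is unreachable.
func newCleanupCommand() *cobra.Command {
	return &cobra.Command{
		Use:   "cleanup",
		Short: "Remove Antrea state from the local Node, even if the K8s API is unreachable",
		RunE: func(cmd *cobra.Command, args []string) error {
			// A real implementation would call shared cleanup logic
			// (e.g. a pkg/agent/cleanup package) rather than duplicating it here.
			fmt.Println("running local Antrea cleanup...")
			return nil
		},
	}
}

func main() {
	root := &cobra.Command{Use: "antctl"}
	root.AddCommand(newCleanupCommand())
	if err := root.Execute(); err != nil {
		fmt.Println(err)
	}
}
```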

@jianjuns (Contributor)

@jianjuns FYI, I'm planning on starting work again on this PR this week and addressing these 2 comments from you:

Sounds great!

* try to reuse existing CRD API groups to report status back to the deletion controller

* provide a "local" cleanup mechanism on a single Node when K8s is down (through antctl)

Will your cleanup DS also execute antctl?

@antoninbas (Contributor, Author)

Will your cleanup DS also execute antctl?

My idea was that they would share their implementation in pkg/agent/cleanup, but that the DS would not invoke antctl directly.
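
To illustrate the shared-implementation idea, here is a minimal sketch of what a hypothetical pkg/agent/cleanup package could expose so that both the cleanup DaemonSet binary and antctl call the same code; the `Task` interface and `Run` function are assumptions, not the actual code in this PR.

```go
// Hypothetical pkg/agent/cleanup: a single entry point that both the cleanup
// DaemonSet binary and an "antctl cleanup" command could call, so neither
// needs to shell out to the other.
package cleanup

import "fmt"

// Task is one idempotent cleanup step, e.g. deleting the OVS bridge.
type Task interface {
	Name() string
	Run() error
}

// Run executes all registered cleanup tasks and stops at the first failure.
func Run(tasks []Task) error {
	for _, t := range tasks {
		fmt.Printf("running cleanup task %q\n", t.Name())
		if err := t.Run(); err != nil {
			return fmt.Errorf("cleanup task %q failed: %w", t.Name(), err)
		}
	}
	return nil
}
```

Each caller would assemble its own []Task slice (for example, the OVS bridge deletion on Linux) and invoke cleanup.Run, so the deletion logic lives in one place.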

@antrea-bot (Collaborator)

Can one of the admins verify this patch?

Base automatically changed from master to main January 26, 2021 00:00
@github-actions (Contributor)

This PR is stale because it has been open 180 days with no activity. Remove stale label or comment, or this will be closed in 180 days

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 25, 2021
@github-actions github-actions bot closed this Dec 21, 2021