Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

☂️ Introduce Gardener Node Agent (a.k.a. replace cloud-config-downloader) #8023

Closed
55 tasks done
majst01 opened this issue Jun 5, 2023 · 16 comments
Closed
55 tasks done
Assignees
Labels
area/ipcei IPCEI (Important Project of Common European Interest) area/quality Output qualification (tests, checks, scans, automation in general, etc.) related area/robustness Robustness, reliability, resilience related kind/enhancement Enhancement, improvement, extension kind/epic Large multi-story topic

Comments

@majst01
Copy link
Contributor

majst01 commented Jun 5, 2023

How to categorize this issue?

/area quality robustness
/kind enhancement epic

What would you like to be added:

Replace the cloud-config-download/executor which are written in the bash programming language with a binary written in go which is a kubernetes controller based on controller-runtime.

This was implemented up to a working implementation during the Gardener Hackathon 2023 in Leverkusen

https://github.com/gardener-community/hackathon

Why is this needed:

With the new Architecture we gain a lot, let's describe the most important gains here.

Developer Productivity

Because we all develop in go day by day, writing business logic in bash is difficult, hard to maintain, almost impossible to test. Getting rid of almost all bash scripts which are currently in use for this very important part of the cluster creation process will enhance the speed of adding new features and removing bugs.

Speed

Until now, the cloud-config-downloader runs in a loop every 60sec to check if something changed on the shoot which requires modifications on the worker node. This produces a lot of unneeded traffic on the api-server and wastes time, it will sometimes take up to 60sec until a desired modification is started on the worker node.
By using the controller-runtime we can watch for the node, theOSC in the secret, and the shoot-access-token in the secret. If any of these object changed, and only then, the required action will take effect immediately.
This will speed up operations and will reduce the load on the api-server of the shoot dramatically.

Scalability

Actually the cloud-config-downloader add a random wait time before restarting the kubelet in case the kubelet was updated or a configuration change was made to it. This is required to reduce the load on the API server and the traffic on the internet uplink. It also reduces the overall downtime of the services in the cluster because every kubelet restart takes a node for several seconds into NotReady state which eventually interrupts service availability.

Decision was made to keep the existing jitter mechanism which calculates the kubelet-download-and-restart-delay-seconds on the controller itself.

Correctness

The configuration of the cloud-config-downloader is actually done by placing a file for every configuration item on the disk on the worker node. This was done because parsing the content of a single file and using this as a value in bash reduces to something like VALUE=$(cat /the/path/to/the/file). Simple but lacks validation, type safety and whatnot.
With the gardener-node-agent we introduce a new API which is then stored in the gardener-node-agent secret and stored on disc in a single yaml file for comparison with the previous known state. This brings all benefits of type safe configuration.
Because actual and previous configuration are compared, removed files and units are also removed and stopped on the worker if removed from the OSC.

Availability

Previously the cloud-config-downloader simply restarted the systemd-units on every change to the OSC, regardless which of the services changed. The gardener-node-agent first checks which systemd-unit was changed, and will only restart these. This will remove unneeded kubelet restarts.

Goals

  • create a new shoot which only uses the new gardener-node-agent
  • update existing workers from cloud-config-downloader to gardener-node-agent
  • same behavior when applying the OSC as before.

Tasks (basic functionality)

OCI Image Support

A decision was made to go ahead with the current approach by extracting the binaries (kubelet, gardener-node-agent) out of docker images as currently implemented. Down the way towards GA of gardener-node-agent, the OCI artifact support is implemented and right before GA, this support is merged. The existing docker images of all still supported kubernetes/kubelet versions need to be pushed as OCI artifact and of course the binary of the gardener-node-agent.

Update: Not needed since we decided to use native containerd functionality for extracting binaries from images, see #8678.

No modifications are made to cloud-config-downloader to support OCI artifacts.

@gardener-prow gardener-prow bot added area/quality Output qualification (tests, checks, scans, automation in general, etc.) related area/robustness Robustness, reliability, resilience related kind/enhancement Enhancement, improvement, extension kind/epic Large multi-story topic labels Jun 5, 2023
@Gerrit91
Copy link
Contributor

Gerrit91 commented Jun 5, 2023

👍 for this stuff!

I would like to propose another future goal, which can be turned into an own issue after this one was resolved. The idea is to drop the OSC with -original suffix (purpose reconcile) and migrate its contents into the NodeAgentConfiguration. The -downloader OSC can remain as is.

This way, we can move on to a more concise interface for gardener-node-agent, which has several advantages:

  • The new interface enhances the visibility and testability for the features that are currently wrapped implicitly inside the OSC's file and unit definitions, making them easier to parametrize and to implement
  • Gardener and Extensions should not need to struggle with OS-/node-internals, only the gardener-node-agent should implement this kind of code
  • Extensions that want to hook into node-specific configuration do not need to agree on conventions (an example might be the registry-extension, which wants to add mirror configuration for containerd and has to put it to a specific path in order not to conflict with other extensions that want to modify the containerd configuration as well).
  • The node-agent can directly read its configuration from the secret instead of the described approach where the configuration is wrapped inside the OSC, which the OSC controller writes to the disk, which will then notify other controllers do apply possible changes.

Here is an example of how the configuration could be extended:

type NodeAgentConfiguration struct {
	metav1.TypeMeta

	// APIServer contains the connection configuration for the gardener-node-agent to
	// access the shoot api server.
	APIServer APIServer `json:"apiServer"`

	// ConfigurationSecretName defines the name of the secret in the shoot cluster, which contains
	// the configuration for the gardener-node-agent.
	ConfigurationSecretName string `json:"configSecretName"`
	// TokenSecretName defines the name of the secret in the shoot cluster, which contains
	// the projected shoot access token for the gardener-node-agent.
	TokenSecretName string `json:"tokenSecretName"`

	// Image is the docker image reference to the gardener-node-agent.
	Image string `json:"image"`
	// HyperkubeImage is the docker image reference to the hyperkube containing kubelet.
	HyperkubeImage string `json:"hyperkubeImage"`

	// KubernetesVersion contains the kubernetes version of the kubelet, used for annotating
	// the corresponding node resource with a kubernetes version annotation.
	KubernetesVersion string `json:"kubernetesVersion"`

	// ContainerdConfiguration configures the containerd runtime according to the given configuration.
	// +optional
	ContainerdConfiguration *ContainerdConfiguration `json:"containerdConfiguration,omitempty`

	// ValitailConfiguration configures valitail according to the given configuration.
	// +optional
	ValitailConfiguration *ValitailConfiguration `json:"valitailConfiguration,omitempty`
    
	// KubeletDataVolumeSize sets the data volume size of an unformatted disk on the worker node,
	// which is the be used for /var/lib on the worker.
	// +optional
	KubeletDataVolumeSize *int64 `json:"kubeletDataVolumeSize,omitempty"`
}

type ContainerdConfiguration struct {
	// RegistryMirrors a list of registry mirror to configure for containerd.
	RegistryMirrors []string `json:"registryMirrors,omitempty`
	// MonitorEnabled deploys a service to monitor the containerd service.
	MonitorEnabled bool `json:"monitorEnabled`
	// LogRotationPeriod is cron schedule for when logs of containerd a rotated.
	LogRotationPeriod string `json:"logRotationPeriod"`

	// ... fields have to be worked out in the future
}

type ValitailConfiguration struct {
	// ... fields have to be worked out in the future
}

Do you think this is feasible?

vknabel added a commit to metal-stack/gardener that referenced this issue Jun 27, 2023
vknabel added a commit to metal-stack/gardener that referenced this issue Jun 27, 2023
vknabel added a commit to metal-stack/gardener that referenced this issue Jul 19, 2023
vknabel added a commit to metal-stack/gardener that referenced this issue Jul 19, 2023
vknabel added a commit to metal-stack/gardener that referenced this issue Jul 19, 2023
vknabel added a commit to metal-stack/gardener that referenced this issue Jul 25, 2023
vknabel added a commit to metal-stack/gardener that referenced this issue Jul 31, 2023
vknabel added a commit to metal-stack/gardener that referenced this issue Sep 11, 2023
@rfranzke rfranzke pinned this issue Sep 11, 2023
@vknabel
Copy link
Contributor

vknabel commented Sep 15, 2023

As promised I rebased all currently known future changes from the hackathon following the current PR.
Here you can have a look on the additional changes. As I could not come up with an ideal order of upcoming PRs based on the controllers yet, I did not create isolated commits (although it would be easier to cherry-pick and rebase a controller for a pr).

Sadly the codebase has diverged and I couldn't fix all compile issues by myself.

I hope this serves well as a starting point

vknabel added a commit to metal-stack/gardener that referenced this issue Sep 20, 2023
vknabel added a commit to metal-stack/gardener that referenced this issue Sep 21, 2023
vknabel added a commit to metal-stack/gardener that referenced this issue Sep 25, 2023
vknabel added a commit to metal-stack/gardener that referenced this issue Oct 2, 2023
vknabel added a commit to metal-stack/gardener that referenced this issue Oct 4, 2023
vknabel added a commit to metal-stack/gardener that referenced this issue Oct 4, 2023
gardener-prow bot pushed a commit that referenced this issue Oct 4, 2023
* Introduce lib for gardener-node-agent #8023

* Apply suggestions from code review

Co-authored-by: Johannes Scheerer <johannes.scheerer@sap.com>

* Revisit first parts of the node agent concept

* Rephrase reason comparision

* Speed benefits mostly in large clusters

* Remove commented imports

* refactor(nodeagent): rename extractTarGz

* fix(nodeagent): pick newest file from layers

* fix(nodeagent): dropped projected info for token

That the token is projected doesn't matter.

* fix(nodeagent): removed empty tests

Nothing to test here

* fix(nodeagent): mirror v1alpha1 changes

* refactor(nodeagent): dbus logs, events and naming

* feat(nodeagent): validate for supported kubernetesversion

* feat(nodeagent): improved coverage for config validation

* revert(nodeagent): fake dbus tests did not provide any value

* docs(nodeagent): fix config registration docs

* fix(docs): reorder the basic design and postpone installation

* docs(nodeagent): binary path

* docs(nodeagent): be more explicit between cloud config and osc

* docs(nodeagent): link operatingsystemconfig extension

* docs(nodeagent): future development section

Removes the TODO inside the Scalability section and appends it in a separate section.

* fix(codegen): generate nodeagent

* fix(nodeagent): fix checks

* Update pkg/nodeagent/apis/config/validation/validation_test.go

Co-authored-by: Oliver Götz <47362717+oliver-goetz@users.noreply.github.com>

* docs(nodeagent): rename architecture svg

* docs(nodeagent): improved wording

* fix(nodeagent): camel cased validation

* fix(nodeagent): wording

* docs(nodeagent): prefer `kubelet`

* docs(nodeagent): `kube-apiserver`

* fix(nodeagent): validation test specs

* fix(dbus): remove empty suite

* refactor(dbus): typo and formatting

* fix(nodeagent): extract secure from remote

* Apply suggestions from code review

Co-authored-by: Rafael Franzke <rafael.franzke@sap.com>

* docs(nodeagent): rephrase gardener community

* docs(nodeagent): remove mentioning of supported archs

* refactor(nodeagent): rename api types

* fix(nodeagent): lowercase kubelet data volume size

* refactor(nodeagent): validation naming and formatting

* refactor(nodeagent): binary cabundle

* refactor(nodeagent): use semver for Kubernetes Version

* fix(nodeagent): remove unused fake dbus

Currently it is unused. In an upcoming PR it will be reintroduced by future controllers.

* Apply suggestions from code review

Co-authored-by: Johannes Scheerer <johannes.scheerer@sap.com>

* tmp: delete controller-registration to regenerate

* chore: generate

* docs(nodeagent): corrected architecture diagram

* revert: controller registration due to tar incompatabilities

* docs(nodeagent): wording architecture diagram

* fix(generate): add trailing newline for controller registration

* feat(nodeagent): test registry extraction

* fix(nodeagent): lint

* Update pkg/nodeagent/apis/config/validation/validation.go

Co-authored-by: Johannes Scheerer <johannes.scheerer@sap.com>

* PR review feedback

---------

Co-authored-by: Johannes Scheerer <johannes.scheerer@sap.com>
Co-authored-by: Oliver Götz <47362717+oliver-goetz@users.noreply.github.com>
Co-authored-by: Rafael Franzke <rafael.franzke@sap.com>
rfranzke pushed a commit to rfranzke/gardener that referenced this issue Oct 5, 2023
@rfranzke
Copy link
Member

rfranzke commented Oct 9, 2023

/assign @oliver-goetz @rfranzke

@oliver-goetz
Copy link
Member

/area ipcei

@rfranzke
Copy link
Member

All tasks have been completed.
/close

@gardener-prow gardener-prow bot closed this as completed Apr 12, 2024
Copy link
Contributor

gardener-prow bot commented Apr 12, 2024

@rfranzke: Closing this issue.

In response to this:

All tasks have been completed.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ipcei IPCEI (Important Project of Common European Interest) area/quality Output qualification (tests, checks, scans, automation in general, etc.) related area/robustness Robustness, reliability, resilience related kind/enhancement Enhancement, improvement, extension kind/epic Large multi-story topic
Projects
None yet
Development

No branches or pull requests

6 participants