☂️ Introduce Gardener Node Agent (a.k.a. replace `cloud-config-downloader`) #8023
👍 for this stuff! I would like to propose another future goal, which can be turned into its own issue after this one is resolved. The idea is to drop the OSC. This way, we can move on to a more concise interface for gardener-node-agent, which has several advantages:
Here is an example of how the configuration could be extended:

type NodeAgentConfiguration struct {
	metav1.TypeMeta
	// APIServer contains the connection configuration for the gardener-node-agent to
	// access the shoot API server.
	APIServer APIServer `json:"apiServer"`
	// ConfigurationSecretName defines the name of the secret in the shoot cluster, which contains
	// the configuration for the gardener-node-agent.
	ConfigurationSecretName string `json:"configSecretName"`
	// TokenSecretName defines the name of the secret in the shoot cluster, which contains
	// the projected shoot access token for the gardener-node-agent.
	TokenSecretName string `json:"tokenSecretName"`
	// Image is the container image reference of the gardener-node-agent.
	Image string `json:"image"`
	// HyperkubeImage is the container image reference of the hyperkube containing the kubelet.
	HyperkubeImage string `json:"hyperkubeImage"`
	// KubernetesVersion contains the Kubernetes version of the kubelet, used for annotating
	// the corresponding node resource with a Kubernetes version annotation.
	KubernetesVersion string `json:"kubernetesVersion"`
	// ContainerdConfiguration configures the containerd runtime according to the given configuration.
	// +optional
	ContainerdConfiguration *ContainerdConfiguration `json:"containerdConfiguration,omitempty"`
	// ValitailConfiguration configures valitail according to the given configuration.
	// +optional
	ValitailConfiguration *ValitailConfiguration `json:"valitailConfiguration,omitempty"`
	// KubeletDataVolumeSize sets the data volume size of an unformatted disk on the worker node,
	// which is to be used for /var/lib on the worker.
	// +optional
	KubeletDataVolumeSize *int64 `json:"kubeletDataVolumeSize,omitempty"`
}

type ContainerdConfiguration struct {
	// RegistryMirrors is a list of registry mirrors to configure for containerd.
	RegistryMirrors []string `json:"registryMirrors,omitempty"`
	// MonitorEnabled deploys a service to monitor the containerd service.
	MonitorEnabled bool `json:"monitorEnabled"`
	// LogRotationPeriod is the cron schedule for when logs of containerd are rotated.
	LogRotationPeriod string `json:"logRotationPeriod"`
	// ... fields have to be worked out in the future
}

type ValitailConfiguration struct {
	// ... fields have to be worked out in the future
}

Do you think this is feasible?
Co-authored by: @mjast01 @Gerrit91
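For illustration only, here is a minimal sketch of how gardener-node-agent could consume such a configuration once it has been written to disk from the referenced secret. The file path and the reduced struct are assumptions for the example, not part of the proposal or the actual API package:

```go
package main

import (
	"fmt"
	"os"

	"sigs.k8s.io/yaml"
)

// NodeAgentConfiguration mirrors a subset of the proposed API above;
// the field selection here is illustrative only.
type NodeAgentConfiguration struct {
	ConfigurationSecretName string `json:"configSecretName"`
	TokenSecretName         string `json:"tokenSecretName"`
	Image                   string `json:"image"`
	HyperkubeImage          string `json:"hyperkubeImage"`
	KubernetesVersion       string `json:"kubernetesVersion"`
}

func main() {
	// Hypothetical location where the agent persists the config fetched from the secret.
	raw, err := os.ReadFile("/var/lib/gardener-node-agent/config.yaml")
	if err != nil {
		panic(err)
	}

	var cfg NodeAgentConfiguration
	// sigs.k8s.io/yaml converts YAML to JSON first, so the json struct tags apply.
	if err := yaml.Unmarshal(raw, &cfg); err != nil {
		panic(err)
	}

	fmt.Printf("using hyperkube image %q for kubelet %s\n", cfg.HyperkubeImage, cfg.KubernetesVersion)
}
```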
As promised, I rebased all currently known future changes from the hackathon following the current PR. Sadly, the codebase has diverged and I couldn't fix all compile issues by myself.
I hope this serves well as a starting point.
* Introduce lib for gardener-node-agent #8023
* Apply suggestions from code review (Co-authored-by: Johannes Scheerer <johannes.scheerer@sap.com>)
* Revisit first parts of the node agent concept
* Rephrase reason comparision
* Speed benefits mostly in large clusters
* Remove commented imports
* refactor(nodeagent): rename extractTarGz
* fix(nodeagent): pick newest file from layers
* fix(nodeagent): dropped projected info for token (that the token is projected doesn't matter)
* fix(nodeagent): removed empty tests (nothing to test here)
* fix(nodeagent): mirror v1alpha1 changes
* refactor(nodeagent): dbus logs, events and naming
* feat(nodeagent): validate for supported kubernetesversion
* feat(nodeagent): improved coverage for config validation
* revert(nodeagent): fake dbus tests did not provide any value
* docs(nodeagent): fix config registration docs
* fix(docs): reorder the basic design and postpone installation
* docs(nodeagent): binary path
* docs(nodeagent): be more explicit between cloud config and osc
* docs(nodeagent): link operatingsystemconfig extension
* docs(nodeagent): future development section (removes the TODO inside the Scalability section and appends it in a separate section)
* fix(codegen): generate nodeagent
* fix(nodeagent): fix checks
* Update pkg/nodeagent/apis/config/validation/validation_test.go (Co-authored-by: Oliver Götz <47362717+oliver-goetz@users.noreply.github.com>)
* docs(nodeagent): rename architecture svg
* docs(nodeagent): improved wording
* fix(nodeagent): camel cased validation
* fix(nodeagent): wording
* docs(nodeagent): prefer `kubelet`
* docs(nodeagent): `kube-apiserver`
* fix(nodeagent): validation test specs
* fix(dbus): remove empty suite
* refactor(dbus): typo and formatting
* fix(nodeagent): extract secure from remote
* Apply suggestions from code review (Co-authored-by: Rafael Franzke <rafael.franzke@sap.com>)
* docs(nodeagent): rephrase gardener community
* docs(nodeagent): remove mentioning of supported archs
* refactor(nodeagent): rename api types
* fix(nodeagent): lowercase kubelet data volume size
* refactor(nodeagent): validation naming and formatting
* refactor(nodeagent): binary cabundle
* refactor(nodeagent): use semver for Kubernetes Version
* fix(nodeagent): remove unused fake dbus (currently it is unused; in an upcoming PR it will be reintroduced by future controllers)
* Apply suggestions from code review (Co-authored-by: Johannes Scheerer <johannes.scheerer@sap.com>)
* tmp: delete controller-registration to regenerate
* chore: generate
* docs(nodeagent): corrected architecture diagram
* revert: controller registration due to tar incompatabilities
* docs(nodeagent): wording architecture diagram
* fix(generate): add trailing newline for controller registration
* feat(nodeagent): test registry extraction
* fix(nodeagent): lint
* Update pkg/nodeagent/apis/config/validation/validation.go (Co-authored-by: Johannes Scheerer <johannes.scheerer@sap.com>)
* PR review feedback

Co-authored-by: Johannes Scheerer <johannes.scheerer@sap.com>
Co-authored-by: Oliver Götz <47362717+oliver-goetz@users.noreply.github.com>
Co-authored-by: Rafael Franzke <rafael.franzke@sap.com>
/assign @oliver-goetz @rfranzke
/area ipcei
All tasks have been completed.
@rfranzke: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
How to categorize this issue?
/area quality robustness
/kind enhancement epic
What would you like to be added:
Replace the cloud-config-downloader/executor, which are written in the `bash` programming language, with a binary written in `go` that is a Kubernetes controller based on controller-runtime. This was implemented up to a working state during the Gardener Hackathon 2023 in Leverkusen: https://github.com/gardener-community/hackathon
Why is this needed:
With the new architecture we gain a lot; let's describe the most important gains here.
Developer Productivity
Because we all develop in `go` day by day, writing business logic in `bash` is difficult, hard to maintain, and almost impossible to test. Getting rid of almost all `bash` scripts which are currently in use for this very important part of the cluster creation process will enhance the speed of adding new features and removing bugs.
Speed
Until now, the `cloud-config-downloader` runs in a loop every 60s to check if something changed on the shoot which requires modifications on the worker node. This produces a lot of unneeded traffic on the API server and wastes time: it can take up to 60s until a desired modification is started on the worker node.
By using controller-runtime, we can watch the `node`, the `OSC` in the `secret`, and the shoot-access-token in the `secret`. If, and only if, any of these objects changes, the required action takes effect immediately. This will speed up operations and reduce the load on the shoot's API server dramatically.
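As a rough sketch of such a watch-based setup with controller-runtime: the reconciler name is illustrative, and a real implementation would restrict the watch to the relevant secret instead of all secrets; this is not the actual gardener-node-agent code.

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// secretReconciler reacts immediately to changes of the watched secret
// (e.g. the OSC secret) instead of polling on a fixed 60s interval.
type secretReconciler struct {
	client client.Client
}

func (r *secretReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	secret := &corev1.Secret{}
	if err := r.client.Get(ctx, req.NamespacedName, secret); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Apply the desired changes to the node here (write files, restart units, ...).
	return ctrl.Result{}, nil
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}

	// The watch triggers reconciliation as soon as a secret changes.
	r := &secretReconciler{client: mgr.GetClient()}
	if err := ctrl.NewControllerManagedBy(mgr).For(&corev1.Secret{}).Complete(r); err != nil {
		panic(err)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```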
Scalability
Actually, the `cloud-config-downloader` adds a random wait time before restarting the `kubelet` in case the `kubelet` was updated or a configuration change was made to it. This is required to reduce the load on the API server and the traffic on the internet uplink. It also reduces the overall downtime of the services in the cluster, because every `kubelet` restart takes a node into `NotReady` state for several seconds, which eventually interrupts service availability.
The decision was made to keep the existing jitter mechanism, which calculates the kubelet-download-and-restart-delay-seconds on the controller itself.
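A small sketch of such a jittered delay using the `wait.Jitter` helper from apimachinery; the base delay and maximum factor below are illustrative values, not the ones Gardener actually uses.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	// Base delay before restarting the kubelet after a change (illustrative value).
	base := 30 * time.Second

	// wait.Jitter returns a duration in [base, base+maxFactor*base),
	// spreading kubelet restarts across the nodes of a cluster.
	delay := wait.Jitter(base, 1.0)

	fmt.Printf("restarting kubelet in %s\n", delay)
	time.Sleep(delay)
}
```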
Correctness
The configuration of the `cloud-config-downloader` is actually done by placing a file for every configuration item on the disk of the worker node. This was done because parsing the content of a single file and using it as a value in `bash` reduces to something like `VALUE=$(cat /the/path/to/the/file)`. Simple, but it lacks validation, type safety, and whatnot.
With the `gardener-node-agent` we introduce a new API which is then stored in the `gardener-node-agent` `secret` and persisted on disk in a single YAML file for comparison with the previously known state. This brings all the benefits of type-safe configuration. Because the actual and the previous configuration are compared, files and units that were removed from the `OSC` are also removed and stopped on the worker.
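A minimal sketch of that comparison, assuming both the previous and the current state are available as maps of file paths to content; the types and function names are illustrative, not the actual implementation.

```go
package main

import "fmt"

// removedPaths returns all paths that exist in the previous state but not in
// the current one; these files would be deleted from the worker node.
func removedPaths(previous, current map[string]string) []string {
	var removed []string
	for path := range previous {
		if _, ok := current[path]; !ok {
			removed = append(removed, path)
		}
	}
	return removed
}

func main() {
	previous := map[string]string{
		"/etc/systemd/system/foo.service": "unit-content-v1",
		"/var/lib/kubelet/config.yaml":    "kubelet-config-v1",
	}
	current := map[string]string{
		"/var/lib/kubelet/config.yaml": "kubelet-config-v1",
	}

	// Prints: [/etc/systemd/system/foo.service]
	fmt.Println(removedPaths(previous, current))
}
```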
Availability
Previously, the `cloud-config-downloader` simply restarted the `systemd-units` on every change to the `OSC`, regardless of which of the services changed. The `gardener-node-agent` first checks which systemd unit was changed and only restarts those. This removes unneeded `kubelet` restarts.
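A sketch of restarting only the changed units via systemd's D-Bus API using the go-systemd library; detecting changed units by comparing unit contents in plain maps is an illustrative simplification of what the real controller does.

```go
package main

import (
	"context"
	"fmt"

	"github.com/coreos/go-systemd/v22/dbus"
)

// changedUnits returns the names of units whose content differs between the
// previous and the current configuration (illustrative helper).
func changedUnits(previous, current map[string]string) []string {
	var changed []string
	for name, content := range current {
		if previous[name] != content {
			changed = append(changed, name)
		}
	}
	return changed
}

func main() {
	ctx := context.Background()

	// Requires access to the system bus, i.e. running on the node with sufficient privileges.
	conn, err := dbus.NewWithContext(ctx)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	previous := map[string]string{"kubelet.service": "v1", "containerd.service": "v1"}
	current := map[string]string{"kubelet.service": "v2", "containerd.service": "v1"}

	// Restart only the units that actually changed instead of all of them.
	for _, unit := range changedUnits(previous, current) {
		result := make(chan string, 1)
		if _, err := conn.RestartUnitContext(ctx, unit, "replace", result); err != nil {
			panic(err)
		}
		fmt.Printf("restart of %s: %s\n", unit, <-result)
	}
}
```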
Goals
Tasks (basic functionality)
- `gardener-node-agent` #8249
- `cmd/gardener-node-agent/app.go` with `manager.Manager` initialization #8627
- `Secret` controller #8629
- `Node` controller #8632
- `OperatingSystemConfig` controller #8683
- `/pkg/node-agent/registry` #8678
- `OperatingSystemConfig` API with unit files and `imageRef` file type #8707
- `files` handling in `OperatingSystemConfig` API #8814
- `OperatingSystemConfig` components always specify `filePaths` in their units #8827
- `filePaths` when merging units #8838
- `token` controller to fetch also other tokens (`valitail`) #8839
- `Lease` controller for reporting health/liveness (similar to the `Lease` the `kubelet` writes) #8767
- `Lease`s to `kube-system` namespace #8893
- `Lease` cache when `Node` has already been registered #8901
- `/var/lib/gardener-node-agent/tmp` #8894
- `dbus` interface should not block and wait for result channel forever #8907
- `10s` #8919
- `gardener-node-init.service` when gardener-node-agent is active #9096
- `gardener-node-agent` `gardener-node-init` script and unit #8726

Seed/Shoot Controllers Adaptation
- `OperatingSystemConfig` API #8756
- `gardener-node-agent` unit #8836
- `gardener-node-agent` into `gardenlet`'s `Shoot` controller #8847
- `gardenlet` health checks to consider `gardener-node-agent`'s `Lease` object for the `EveryNodeReady` condition #9073
- `cloud-config-downloader` unit (sample) - if they do, then try to remove these or enable `gardener-node-init` instead):
  - Prepare `OperatingSystemConfig` API for changes needed for `gardener-node-agent` #8647
  - `provider-local`: Prepare `OperatingSystemConfig` API for changes needed for `gardener-node-agent` #8647 and [provider-local] Adapt to changes needed for `gardener-node-agent` #8834
  - `os-gardenlinux`: Adapt to changes needed for `gardener-node-agent` gardener-extension-os-gardenlinux#130, released with `v0.22.0`
  - `os-suse-chost`: Adapt to changes needed for `gardener-node-agent` gardener-extension-os-suse-chost#112, released with `v1.24.0`
  - `os-coreos`: Adapt to changes needed for `gardener-node-agent` gardener-extension-os-coreos#80, released with `v1.19.0`
  - `os-ubuntu`: Adapt to changes needed for `gardener-node-agent` gardener-extension-os-ubuntu#99, released with `v1.24.0`
  - `os-metal`: Adapt to changes needed for `gardener-node-agent` metal-stack/os-metal-extension#33, released with `v?` (PR is ready, but no need to wait for it as part of this umbrella issue, so let's consider it done)
  - `os-k3os` (not needed since it was never used anywhere and is outdated anyways)

`gardener-node-agent`
- `gardener-node-agent` #8249
- Extra build job to publish `gardener-node-agent` and potentially `kubelet`/`kubectl` as OCI Image, see below section (not needed since we decided to use native `containerd` functionality for extracting binaries from images, see [node-agent] Refactor `/pkg/node-agent/registry` #8678)
- Build `gardener-node-agent` container image in Prow (multi-arch!) (we decided to drop Prow builds, see Stop building `gardener/gardener` dev-images in prow ci-infra#981 and Remove `.ci/resources.yaml` when concourse dev-builds are enabled #8740 for details)
- `OperatingSystemConfig` reconciliation delays evenly #8885
- `Secret` permissions and cache #8857
- `{kubelet,containerd}-monitor` health check bash scripts to Golang #8786
- `gardener-node-agent` in `provider-extensions` setup #9048 (follow-up of Explicitly disable `UseGardenerNodeAgent` feature gate in local extensions setup #8905)
- `gardener` Linux user when nodes are bootstrapping #9077
- `hostname` in `gardener-node-agent` #9114
- `hostname-check` controller #9280
- `UseGardenerNodeAgent` feature gate in `gardenlet` controlling whether `gardener-node-{init,agent}` should be used: Prepare `OperatingSystemConfig` API for changes needed for `gardener-node-agent` #8647
- `UseGardenerNodeAgent` feature gate to beta #9161
- `UseGardenerNodeAgent` feature gate to GA #9208
- `nerdctl` workaround for local OS image gardener-attic/machine-controller-manager-provider-local#37
- `UseGardenerNodeAgent` feature gate, `cloud-config-{downloader,executor}`, and `oscommon` package 🧹 #9477 (`v1.92` got released)
- `os-gardenlinux`: Cleanup legacy code for `cloud-config-downloader` scenario gardener-extension-os-gardenlinux#161
- `os-suse-chost`: Cleanup legacy code for `cloud-config-downloader` scenario gardener-extension-os-suse-chost#138
- `os-coreos`: Cleanup legacy code for `cloud-config-downloader` scenario gardener-extension-os-coreos#100
- `os-ubuntu`: Cleanup legacy code for `cloud-config-downloader` scenario gardener-extension-os-ubuntu#126
- `provider-aws`: Cleanup handling for disabled `UseGardenerNodeAgent` feature gate gardener-extension-provider-aws#920

OCI Image Support
A decision was made to go ahead with the current approach of extracting the binaries (kubelet, gardener-node-agent) out of Docker images, as currently implemented. On the way towards GA of gardener-node-agent, the OCI artifact support is implemented, and right before GA this support is merged. The existing Docker images of all still supported Kubernetes/kubelet versions need to be pushed as OCI artifacts, and of course the binary of the gardener-node-agent.
Update: Not needed since we decided to use native `containerd` functionality for extracting binaries from images, see #8678.
No modifications are made to cloud-config-downloader to support OCI artifacts.
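As a rough illustration of the chosen direction (extracting a binary from an image layer rather than adding OCI artifact support), here is a stdlib-only sketch that walks a gzipped layer tarball and copies out one file. In practice the layer blob would come from containerd's content store; the paths and file names below are made up, and the real code in `/pkg/node-agent/registry` may work differently.

```go
package main

import (
	"archive/tar"
	"compress/gzip"
	"fmt"
	"io"
	"os"
	"strings"
)

// extractFromLayer scans a gzipped image layer tarball and writes the entry
// whose path ends with "want" to dst, returning whether it was found.
func extractFromLayer(layer io.Reader, want, dst string) (bool, error) {
	gz, err := gzip.NewReader(layer)
	if err != nil {
		return false, err
	}
	defer gz.Close()

	tr := tar.NewReader(gz)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return false, nil // not found in this layer
		}
		if err != nil {
			return false, err
		}
		if hdr.Typeflag != tar.TypeReg || !strings.HasSuffix(hdr.Name, want) {
			continue
		}
		out, err := os.OpenFile(dst, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0755)
		if err != nil {
			return false, err
		}
		defer out.Close()
		_, err = io.Copy(out, tr)
		return err == nil, err
	}
}

func main() {
	// Hypothetical: a layer blob previously fetched from containerd's content store.
	f, err := os.Open("/tmp/layer.tar.gz")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	found, err := extractFromLayer(f, "usr/local/bin/kubelet", "/opt/bin/kubelet")
	if err != nil {
		panic(err)
	}
	fmt.Println("kubelet extracted:", found)
}
```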