Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[node-agent] Introduce gardener-node-init script and unit #8726

Merged
merged 13 commits into from
Nov 3, 2023

Conversation

rfranzke
Copy link
Member

How to categorize this PR?

/area dev-productivity
/kind enhancement

What this PR does / why we need it:
This PR introduces the bash script ("gardener-node-init") which will replace the cloud-config-downloader script when new shoot worker nodes are bootstrapped. It uses ctr images {pull,mount} to copy the gardener-node-agent binary from its container image to the node, and then runs gardener-node-agent bootstrap. This is a command which

  • (optional) format kubelet data volume (similar to how it's done today)
  • creates the gardener-node-agent.service systemd unit file (the "real" gardener-node-agent)
  • enables and starts this gardener-node-agent.service
  • disables and stops itself (the gardener-node-init.service)

From then on, gardener-node-agent takes care of reconciling the OSC secret with its OperatingSystemConfig controller (ref #8683). For self-upgrades, it may only restart itself at the very end of its reconciliation (after it persisted the "last applied OSC file") to prevent infinite loops.

Which issue(s) this PR fixes:
Part of #8023

Special notes for your reviewer:
On the way, I added a few more minor enhancements:

  • added missing client connection defaulting for GNA component config
  • extracted common run command initialisation code into reusable function

/cc @oliver-goetz

Release note:

NONE

@gardener-prow gardener-prow bot requested a review from oliver-goetz October 29, 2023 18:01
@gardener-prow gardener-prow bot added area/dev-productivity Developer productivity related (how to improve development) kind/enhancement Enhancement, improvement, extension cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Oct 29, 2023
Copy link
Member

@ScheererJ ScheererJ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the change, in particular the refactoring of the commands.

Initial feedback by @oliver-goetz and myself, 2 commits still to review.

pkg/nodeagent/bootstrap/bootstrap.go Outdated Show resolved Hide resolved
Copy link
Member

@ScheererJ ScheererJ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finished review together with @oliver-goetz and found a few more things.

@rfranzke
Copy link
Member Author

rfranzke commented Nov 1, 2023

@ScheererJ @oliver-goetz Thanks for your reviews. I have addressed the feedback, and I have also added a few more things that I found while integrating this into gardenlet. In a nutshell:

  • both gardener-node-{init,agent} now gracefully shut down
  • file permissions adapted
  • ctr images pull and the containerd extractor should respect the hosts directory on the node to consider registry mirrors
  • containerd extractor needs to copy the file from an image into a temp file first and then rename it (direct copy leads to device or resource busy error)
  • kubeconfig was dropped and component config was enhanced with KAPI url + CA bundle (this simplifies the handling, especially for the migration from cloud-config-downloader later on

Please take another look and let me know what you think.

@rfranzke rfranzke requested a review from ScheererJ November 1, 2023 22:16
@rfranzke rfranzke force-pushed the gna/node-init branch 2 times, most recently from ae1923a to ef86300 Compare November 2, 2023 08:05
rfranzke and others added 11 commits November 3, 2023 09:17
- client connection
- servers
- logging settings
Otherwise, GNA would restart itself too early, effectively triggering an infinite loop because it would always consider itself as "changed unit".
We already followed the same approach in GNA's `node` controller.

Co-Authored-By: Oliver Götz <47362717+oliver-goetz@users.noreply.github.com>
This code is the very same for almost each binary, so let's extract it into a function to not duplicate it again and again.
Co-Authored-By: Oliver Götz <47362717+oliver-goetz@users.noreply.github.com>
Co-Authored-By: Oliver Götz <47362717+oliver-goetz@users.noreply.github.com>
Co-Authored-By: Oliver Götz <47362717+oliver-goetz@users.noreply.github.com>
- OSC controller calls the root `context.CancelFunc` of the command
- Bootstrap package no longer stops gardener-node-init unit
- gardener-node-init unit no longer gets restarted each 30s but is a oneshot which only restarts on failure
Also add mutex to fake DBus to prevent flakes since the OSC controller
in GNA is running systemd commands in parallel (and in the fake DBus
they all write to the same slice in parallel)
Without this, registry mirrors in the hosts dir will not be respected.

`ctr images pull` as well as the `docker.ConfigureHosts` function already ignore "not found" errors (in case the provided hosts dir does not exist), so we can safely specify it unconditionally
and then renames it - a direct copy while a process is running will not work (`device or resource busy`)
Copy link
Member

@ScheererJ ScheererJ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found a typo.

docs/extensions/operatingsystemconfig.md Outdated Show resolved Hide resolved
@rfranzke rfranzke requested a review from ScheererJ November 3, 2023 08:59
This is much simpler compared to passing all potential arguments
Copy link
Member

@ScheererJ ScheererJ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@gardener-prow gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label Nov 3, 2023
Copy link
Contributor

gardener-prow bot commented Nov 3, 2023

LGTM label has been added.

Git tree hash: c80b97bc54a42a109d9ae1a648163fc94dc1a014

Copy link
Contributor

gardener-prow bot commented Nov 3, 2023

@rfranzke: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-gardener-apidiff a9d26a2 link false /test pull-gardener-apidiff

Full PR test history. Your PR dashboard. Command help for this repository.
Please help us cut down on flakes by linking this test failure to an open flake report or filing a new flake report if you can't find an existing one. Also see our testing guideline for how to avoid and hunt flakes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@oliver-goetz
Copy link
Member

awesome 🚀
/approve

Copy link
Contributor

gardener-prow bot commented Nov 3, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: oliver-goetz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 3, 2023
@gardener-prow gardener-prow bot merged commit 4e17e44 into gardener:master Nov 3, 2023
1 check passed
@rfranzke rfranzke deleted the gna/node-init branch November 3, 2023 13:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/dev-productivity Developer productivity related (how to improve development) cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. kind/enhancement Enhancement, improvement, extension lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants