Fix bug that caused agents to bypass local loadbalancer #10280

Merged (1 commit) on Jun 4, 2024

Conversation

@brandond (Member) commented Jun 3, 2024

Proposed Changes

Fix bug that caused agents to bypass local loadbalancer

If proxy.SetAPIServerPort was called multiple times, all calls after the first one would cause the apiserver address to be set to the default server address, bypassing the local load-balancer. This was most likely to occur on RKE2, where the supervisor may be up for a longer period of time before it is ready to manage node password secrets, causing the agent to retry. K3s itself never takes the affected code path, since its apiserver and supervisor always use the same port.

I also added some comments to the code as I was stepping through it trying to figure out what's going on.
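
Roughly, the shape of the fix is to make SetAPIServerPort a no-op once the port has already been configured, so a retry can no longer re-point the address at the default server. The sketch below is illustrative only, not the actual code: the apiServerEnabled, apiServerURL, and lbEnabled fields and the log messages are taken from the diff, while the supervisorHost field and the URL construction are placeholders.

// Illustrative sketch of the fix, not the actual k3s implementation. Field
// names other than apiServerEnabled, apiServerURL, and lbEnabled are placeholders.
package proxy

import (
    "fmt"

    "github.com/sirupsen/logrus"
)

type proxy struct {
    apiServerEnabled bool
    apiServerURL     string
    lbEnabled        bool
    supervisorHost   string // hypothetical field, used only for this sketch
}

// SetAPIServerPort configures how the agent reaches the apiserver. The fix is
// the early return: once the port has been set, later calls (for example when
// the agent retries because the supervisor was not yet ready to manage node
// password secrets) no longer reset the address to the default server and
// bypass the local load-balancer.
func (p *proxy) SetAPIServerPort(port int, isIPv6 bool) error {
    if p.apiServerEnabled {
        logrus.Debugf("Supervisor proxy apiserver port already set")
        return nil
    }

    host := p.supervisorHost
    if isIPv6 {
        host = "[" + host + "]"
    }

    // Point the apiserver URL at the (possibly load-balanced) address. The real
    // implementation also retargets the local load-balancer when it is enabled.
    p.apiServerURL = fmt.Sprintf("https://%s:%d", host, port)
    p.apiServerEnabled = true

    logrus.Debugf("Supervisor proxy apiserver port changed; apiserver=%s lb=%v", p.apiServerURL, p.lbEnabled)
    return nil
}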

Types of Changes

bugfix

Verification

Testing

Linked Issues

User-Facing Change

Further Comments

@brandond requested a review from a team as a code owner on June 3, 2024, 22:17
@VestigeJ previously approved these changes on Jun 3, 2024 and left a comment:

LGTM


codecov bot commented Jun 3, 2024

Codecov Report

Attention: Patch coverage is 26.08696% with 17 lines in your changes missing coverage. Please review.

Project coverage is 41.69%. Comparing base (79ba10f) to head (967cc14).

Files Patch % Lines
pkg/agent/proxy/apiproxy.go 10.00% 9 Missing ⚠️
pkg/agent/config/config.go 11.11% 5 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10280      +/-   ##
==========================================
- Coverage   47.89%   41.69%   -6.20%     
==========================================
  Files         177      177              
  Lines       14792    14801       +9     
==========================================
- Hits         7085     6172     -913     
- Misses       6362     7450    +1088     
+ Partials     1345     1179     -166     
Flag Coverage Δ
e2etests 36.43% <13.04%> (-10.03%) ⬇️
inttests 37.00% <8.69%> (+0.04%) ⬆️
unittests 11.33% <13.04%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.


If proxy.SetAPIServerPort was called multiple times, all calls after the
first one would cause the apiserver address to be set to the default
server address, bypassing the local load-balancer. This was most likely
to occur on RKE2, where the supervisor may be up for a period of time
before it is ready to manage node password secrets, causing the agent
to retry.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
@manuelbuil (Contributor) left a comment:

LGTM, and that's a great catch! Just a few comments. BTW, is it possible to de-collocate kube-api and the supervisor in K3s via config?

@@ -132,29 +133,35 @@ func (p *proxy) setSupervisorPort(addresses []string) []string {
// load-balancer, and the address of this load-balancer is returned instead of the actual apiserver
// addresses.
func (p *proxy) SetAPIServerPort(port int, isIPv6 bool) error {
if p.apiServerEnabled {
logrus.Debugf("Supervisor proxy apiserver port already set")

Contributor commented:

IIUC, this is the agent proxy, not the supervisor proxy. By supervisor proxy, I understand the one we use for egressSelector in the supervisor

@brandond (Member, Author) replied:

We call this the "supervisor proxy" because it is a proxy for retrieving supervisor (and now apiserver as well) addresses. This is covered in the doc comment:

// NewSupervisorProxy sets up a new proxy for retrieving supervisor and apiserver addresses. If
// lbEnabled is true, a load-balancer is started on the requested port to connect to the supervisor
// address, and the address of this local load-balancer is returned instead of the actual supervisor
// and apiserver addresses.
// NOTE: This is a proxy in the API sense - it returns either actual server URLs, or the URL of the
// local load-balancer. It is not actually responsible for proxying requests at the network level;
// this is handled by the load-balancers that the proxy optionally steers connections towards.
func NewSupervisorProxy(ctx context.Context, lbEnabled bool, dataDir, supervisorURL string, lbServerPort int, isIPv6 bool) (Proxy, error) {
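
To make the abstraction a bit more concrete, here is a rough usage sketch rather than code from this PR: the import path, placeholder paths, URLs, and ports, and the error handling are assumptions; only NewSupervisorProxy, SetAPIServerPort, and APIServerURL come from the code shown in this review:

package main

import (
    "context"
    "fmt"

    "github.com/k3s-io/k3s/pkg/agent/proxy"
)

func main() {
    ctx := context.Background()

    // With lbEnabled=true, the proxy starts a local load-balancer for the
    // supervisor address and hands its address back to callers instead of the
    // real supervisor/apiserver addresses.
    p, err := proxy.NewSupervisorProxy(ctx, true, "/var/lib/rancher/k3s", "https://server-1:9345", 6444, false)
    if err != nil {
        panic(err)
    }

    // Tell the proxy which port the apiserver listens on. With the fix in this
    // PR, calling this again later is a no-op instead of re-pointing the
    // address at the default server.
    if err := p.SetAPIServerPort(6443, false); err != nil {
        panic(err)
    }

    // Callers always ask the proxy for addresses, so they get either the real
    // URL or the local load-balancer URL without having to know which.
    fmt.Println(p.APIServerURL())
}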

Contributor replied:

Ah, OK. It can easily be confusing: the supervisor proxy carries both the supervisor and the kube-api addresses and load-balancers.

}

logrus.Debugf("Supervisor proxy apiserver port changed; apiserver=%s lb=%v", p.apiServerURL, p.lbEnabled)

Contributor commented:

Same comment. I think supervisor proxy is confusing here

@brandond (Member, Author) commented Jun 4, 2024:

Yeah, it is a little confusing that we have so many proxies. This one is a proxy for retrieving supervisor and apiserver addresses without having to know whether the load-balancer is enabled; it provides a layer of abstraction that lets us avoid hardcoding addresses.

We have a similar "proxy" for etcd here as well:

// NewETCDProxy initializes a new proxy structure that contain a load balancer
// which listens on port 2379 and proxy between etcd cluster members
func NewETCDProxy(ctx context.Context, supervisorPort int, dataDir, etcdURL string, isIPv6 bool) (Proxy, error) {

Contributor replied:

ok, thanks for the clarification

kubeconfigKubelet := filepath.Join(envInfo.DataDir, "agent", "kubelet.kubeconfig")
- if err := deps.KubeConfig(kubeconfigKubelet, proxy.APIServerURL(), serverCAFile, clientKubeletCert, clientKubeletKey); err != nil {
+ if err := deps.KubeConfig(kubeconfigKubelet, apiServerURL, serverCAFile, clientKubeletCert, clientKubeletKey); err != nil {

Contributor commented:

Out of curiosity: given that the bug in how we set proxy.APIServerURL() is fixed, both proxy.APIServerURL() and apiServerURL will point to the same server, right? The former goes via the load-balancer, but at this "booting" step I expect it to only have one server backend.

@brandond (Member, Author) commented Jun 4, 2024:

Yeah, but I figured it made a little more sense to just retrieve it once and then use a local variable. It shouldn't really make a difference either way, since the APIServerURL function itself just exposes a field on the proxy struct.

Contributor replied:

Agreed. I just wanted to double-check that my understanding was correct :)
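
As a side note, the pattern under discussion is just hoisting the lookup into a local and reusing it wherever a kubeconfig is written. A hedged sketch follows; the helper function name, import paths, and certificate arguments are placeholders for illustration, and only the deps.KubeConfig argument order and proxy.APIServerURL come from the diff:

package config

import (
    "path/filepath"

    // Import paths are assumptions for this sketch, not verified against the repo layout.
    "github.com/k3s-io/k3s/pkg/agent/proxy"
    "github.com/k3s-io/k3s/pkg/daemons/control/deps"
)

// writeKubeletKubeconfig is a hypothetical helper illustrating the change:
// read the proxied apiserver URL once into a local and reuse that value,
// instead of calling proxy.APIServerURL() at each call site. APIServerURL
// just exposes a field on the proxy struct, so this is a readability choice
// rather than a behavioral one.
func writeKubeletKubeconfig(dataDir, serverCAFile, clientKubeletCert, clientKubeletKey string, p proxy.Proxy) error {
    apiServerURL := p.APIServerURL()

    kubeconfigKubelet := filepath.Join(dataDir, "agent", "kubelet.kubeconfig")
    return deps.KubeConfig(kubeconfigKubelet, apiServerURL, serverCAFile, clientKubeletCert, clientKubeletKey)
}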

@brandond requested review from manuelbuil and a team on June 4, 2024, 16:52