Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow Elastic Agent in different namespace than Elastic Stack. #7353

Closed

Conversation

naemono
Copy link
Contributor

@naemono naemono commented Nov 29, 2023

resolves #7352

What does this change?

Both root and non-root, fleet and non-fleet Agent installations were being blocked from Agent running in a different namespace because of a legacy association check. This legacy check checked that the agent was running in the same namespace as Elasticsearch. The actual issue turned out to be when Agent is not running in the same namespace as Fleet server when fleet mode is enabled. This change enables Agent to run in any namespace regardless of where Fleet or Elasticsearch is deployed.

Interesting background information.

In the current ECK version when fleet mode is enabled the Agent that is associated with the Fleet Server is actually mounting a secret created/owned by the Fleet Server Agent's association with Elasticsearch, not the Agent associated with Fleet Server, which is a bit odd, and was where the underlying issue stems from.

Changes:

  • The association controller for Agent -> Fleet can now retrieve this "transitive" association, and ensure that the CA is created within the same namespace as Agent
  • The Agent controller now checks when building the pod template if the Elasticsearch association is a "transitive" association, and if so it modifies the volume name to be consistent with what the association controller creates.

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
@naemono naemono added >bug Something isn't working >enhancement Enhancement of existing functionality v2.11.0 labels Nov 29, 2023
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Copy link
Contributor

@barkbay barkbay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we also need to adjust the documentation, ECK 2.10.0 does not allow Agent and ES to run in different namespaces.

@pebrc
Copy link
Collaborator

pebrc commented Nov 30, 2023

There is one thing I do not understand.

I think it is fine to remove the restriction for those use cases where users follow our non-root recipe which explicitly configures Kibana to propagate the ES CA through the ssl.certificate_authorities attribute in kibana.yml

What I struggle to understand is how our standard Fleet/Agent recipes are going to work with this restriction removed:
https://github.com/elastic/cloud-on-k8s/blob/main/config/recipes/elastic-agent/fleet-kubernetes-integration.yaml for example. The operator will try to transitively discover the Elasticsearch association for Agent (Agent -> Fleet -> ES) and mount the CA secret referenced in Fleets (!) association configuration in Agent. If I am not mistaken that means that at least Fleet and the Agents have to be in the same namespace?

@thbkrkr
Copy link
Contributor

thbkrkr commented Nov 30, 2023

buildkite test this -f p=gke,t=TestFleetKubernetesIntegrationRecipe

@thbkrkr
Copy link
Contributor

thbkrkr commented Nov 30, 2023

Could we update TestFleetKubernetesIntegrationRecipe or add another e2e test to cover the scenario where different namespaces are used?

@pebrc
Copy link
Collaborator

pebrc commented Nov 30, 2023

What I struggle to understand is how our standard Fleet/Agent recipes are going to work with this restriction removed:
https://github.com/elastic/cloud-on-k8s/blob/main/config/recipes/elastic-agent/fleet-kubernetes-integration.yaml for example. The operator will try to transitively discover the Elasticsearch association for Agent (Agent -> Fleet -> ES) and mount the CA secret referenced in Fleets (!) association configuration in Agent. If I am not mistaken that means that at least Fleet and the Agents have to be in the same namespace?

As suspected this does not work, pods are stuck in ContainerCreating

 Warning  FailedMount  25s (x7 over 57s)  kubelet            MountVolume.SetUp failed for volume "elasticsearch-certs" : secret "fleet-server-agent-es-default-elasticsearch-ca" not found      

My setup https://gist.github.com/pebrc/930b57bc69c0ad6d483ee86b17ad1acd

P.S. it does work if Fleet server and agents share a namespace. But I think we should make sure it works in all combinations e.g. by copying the CA secret if it does not exist.

@naemono
Copy link
Contributor Author

naemono commented Nov 30, 2023

My setup https://gist.github.com/pebrc/930b57bc69c0ad6d483ee86b17ad1acd

I think I know the scenario where this differs from all of the tests I ran yesterday. Fleet is not in the same namespace as Agent. In all of mine and @barkbay testing, Elasticsearch was in one namespace, Kibana was in another, and Fleet and Agents were in a third...

I'll update the docs in this PR to indicate that, and add back this restriction to be very specific about what it's restricting.

Update documentation
Update tests.

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
@naemono
Copy link
Contributor Author

naemono commented Dec 1, 2023

I'll update the docs in this PR to indicate that, and add back this restriction to be very specific about what it's restricting.

This has been updated. Do we want to add additional e2e tests for this?

@pebrc
Copy link
Collaborator

pebrc commented Dec 1, 2023

e.g. by copying the CA secret if it does not exist.

Should we not try to lift the restriction completely as I suggested by copying the secret into the agents namespace if it does not exist there?

@thbkrkr
Copy link
Contributor

thbkrkr commented Dec 1, 2023

Do we want to add additional e2e tests for this?

Yes, but I would be more in favor of modifying an existing one like TestFleetMode?

We should update the step in the Agent builder that verifies that Elasticsearch is ready to use the Elasticsearch namespace instead of the Agent namespace.

diff --git a/test/e2e/test/agent/steps.go b/test/e2e/test/agent/steps.go
index 4d6fe93d9..48fc04367 100644
--- a/test/e2e/test/agent/steps.go
+++ b/test/e2e/test/agent/steps.go
@@ -97,9 +97,9 @@ func (b Builder) CreationTestSteps(k *test.K8sClient) test.StepList {
                // This is to improve test stability see https://github.com/elastic/cloud-on-k8s/issues/7172
                Name: "Wait for Elasticsearch to be green",
                Test: test.UntilSuccess(func() error {
-                     for _, ref := range b.Agent.Spec.ElasticsearchRefs {
+                     for _, esRef := range b.Agent.Spec.ElasticsearchRefs {
                        var es esv1.Elasticsearch
-                         if err := k.Client.Get(context.Background(), ref.WithDefaultNamespace(b.Agent.Namespace).NamespacedName(), &es); err != nil {
+                         if err := k.Client.Get(context.Background(), esRef.NamespacedName(), &es); err != nil {
                            return err
                        }
                        if es.Status.Health != esv1.ElasticsearchGreenHealth {

@naemono
Copy link
Contributor Author

naemono commented Dec 1, 2023

But I think we should make sure it works in all combinations e.g. by copying the CA secret if it does not exist.

My apologies, I missed reading this last sentence. I'm handling this now.

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
@naemono naemono changed the title Allow Elastic Agent in different namespace than Elasticsearch WIP: Allow Elastic Agent in different namespace than Elastic Stack. Dec 4, 2023
@naemono
Copy link
Contributor Author

naemono commented Dec 4, 2023

Moving this back to a WIP as I work through some changes.

Status:

  • Seems to work with Agent in different namespace that Fleet server. Not sure I love the approach/changes to enable it.

Todo:

  • Manual Testing of different permutations.
  • Writing e2e test(s)
  • Potentially adjusting approach taken.
  • Update Docs

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
@naemono
Copy link
Contributor Author

naemono commented Dec 5, 2023

buildkite test this -f E2E_TAGS=agent,p=gke -m s=8.11.1

Copy link
Collaborator

@pebrc pebrc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! The created secret made my setup with Agents and Fleet server in different namespaces work.

pkg/controller/association/reconciler.go Show resolved Hide resolved
pkg/controller/association/controller/agent_fleetserver.go Outdated Show resolved Hide resolved
pkg/controller/association/reconciler.go Show resolved Hide resolved
@naemono
Copy link
Contributor Author

naemono commented Dec 7, 2023

There certainly are some additional issues with this, as I'm seeing when running new e2e tests:

[2023-12-07T03:16:10.306+00:00][INFO ][plugins.security.authentication] Authentication attempt failed: {"error":{"root_cause":[{"type":"security_exception","reason":"unable to authenticate user [e2e-venus-test-agent-fleet-fs-c5vz-agent-kb-user] for REST request [/_security/_authenticate]","header":{"WWW-Authenticate":["Basic realm=\"security\" charset=\"UTF-8\"","Bearer realm=\"security\"","ApiKey"]}}],"type":"security_exception","reason":"unable to authenticate user [e2e-venus-test-agent-fleet-fs-c5vz-agent-kb-user] for REST request [/_security/_authenticate]","header":{"WWW-Authenticate":["Basic realm=\"security\" charset=\"UTF-8\"","Bearer realm=\"security\"","ApiKey"]}},"status":401}

I'll work through these, and get these new e2e tests functional/updated.

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
@naemono naemono changed the title WIP: Allow Elastic Agent in different namespace than Elastic Stack. Allow Elastic Agent in different namespace than Elastic Stack. Dec 7, 2023
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
…secret watched.

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Fix labels for transitive associations.
Fix ensuring secrets associated with transitive associations are deleted.

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Add back log statement.
Add comment.

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
@naemono naemono requested review from pebrc and barkbay December 11, 2023 14:38
Copy link
Collaborator

@pebrc pebrc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is seems to be an issue with the garbage collection in this PR. In my testing it just spins out of control deleting and recreating secrets in a frenzy.

More generally I think we want to have an implementation where specifics of one association (here agent/fleet server) do not bleed into the generic association reconciler code. I see two options: 1. handling the secret copying outside of the association mechanism directly in the agent controller 2. similar to your approach doing it in the association controller but in a slightly different and simpler way that does not try to mimic an association.

I am including a diff because I went down the route of experimenting in code when reviewing your PR because I got confused myself by the complexities of our association code and it was easier for me reason about it in code. That does not mean that you have to blindly follow the implementation I propose here:

diff --git a/pkg/controller/agent/pod.go b/pkg/controller/agent/pod.go
index 7a7825d2e..4b38e7988 100644
--- a/pkg/controller/agent/pod.go
+++ b/pkg/controller/agent/pod.go
@@ -333,15 +333,15 @@ func applyRelatedEsAssoc(agent agentv1alpha1.Agent, esAssociation commonv1.Assoc
 		return builder, nil
 	}
 
-	esRef := esAssociation.AssociationRef()
-	if !esRef.IsExternal() && !agent.Spec.FleetServerEnabled && agent.Namespace != esRef.Namespace {
-		// check agent and ES share the same namespace
-		return nil, fmt.Errorf(
-			"agent namespace %s is different than referenced Elasticsearch namespace %s, this is not supported yet",
-			agent.Namespace,
-			esAssociation.AssociationRef().Namespace,
-		)
-	}
+	// esRef := esAssociation.AssociationRef()
+	// if !esRef.IsExternal() && !agent.Spec.FleetServerEnabled && agent.Namespace != esRef.Namespace {
+	// 	// check agent and ES share the same namespace
+	// 	return nil, fmt.Errorf(
+	// 		"agent namespace %s is different than referenced Elasticsearch namespace %s, this is not supported yet",
+	// 		agent.Namespace,
+	// 		esAssociation.AssociationRef().Namespace,
+	// 	)
+	// }
 
 	// no ES CA to configure, skip
 	assocConf, err := esAssociation.AssociationConf()
diff --git a/pkg/controller/association/controller/agent_fleetserver.go b/pkg/controller/association/controller/agent_fleetserver.go
index 491d31cdd..59ed8c433 100644
--- a/pkg/controller/association/controller/agent_fleetserver.go
+++ b/pkg/controller/association/controller/agent_fleetserver.go
@@ -30,6 +30,7 @@ func AddAgentFleetServer(mgr manager.Manager, accessReviewer rbac.AccessReviewer
 		AssociationName:           "agent-fleetserver",
 		AssociatedShortName:       "agent",
 		AssociationType:           commonv1.FleetServerAssociationType,
+		AdditionalSecrets:         addtionalSecrets,
 		Labels: func(associated types.NamespacedName) map[string]string {
 			return map[string]string{
 				AgentAssociationLabelName:      associated.Name,
@@ -45,6 +46,47 @@ func AddAgentFleetServer(mgr manager.Manager, accessReviewer rbac.AccessReviewer
 	})
 }
 
+func addtionalSecrets(c k8s.Client, assoc commonv1.Association) ([]types.NamespacedName, error) {
+	associated := assoc.Associated()
+	var agent agentv1alpha1.Agent
+	nsn := types.NamespacedName{Namespace: associated.GetNamespace(), Name: associated.GetName()}
+	if err := c.Get(context.Background(), nsn, &agent); err != nil {
+		return nil, err
+	}
+	fleetServerRef := assoc.AssociationRef()
+	if !fleetServerRef.IsDefined() {
+		return nil, nil
+	}
+	fleetServer := agentv1alpha1.Agent{}
+	if err := c.Get(context.Background(), fleetServerRef.NamespacedName(), &fleetServer); err != nil {
+		return nil, err
+	}
+
+	// If the Fleet Server Agent is not associated with an Elasticsearch cluster
+	// (potentially because of a manual setup) we should do nothing.
+	if len(fleetServer.Spec.ElasticsearchRefs) == 0 {
+		return nil, nil
+	}
+	esAssociation, err := association.SingleAssociationOfType(fleetServer.GetAssociations(), commonv1.ElasticsearchAssociationType)
+	if err != nil {
+		return nil, err
+	}
+
+	// TODO optimisation: if both agent and ES are same namespace no copying needed
+
+	conf, err := esAssociation.AssociationConf()
+	if err != nil {
+		return nil, err
+	}
+	if conf == nil || !conf.CACertProvided {
+		return nil, nil
+	}
+	return []types.NamespacedName{{
+		Namespace: fleetServer.Namespace,
+		Name:      conf.CASecretName,
+	}}, nil
+}
+
 func getFleetServerExternalURL(c k8s.Client, assoc commonv1.Association) (string, error) {
 	fleetServerRef := assoc.AssociationRef()
 	if !fleetServerRef.IsDefined() {
diff --git a/pkg/controller/association/dynamic_watches.go b/pkg/controller/association/dynamic_watches.go
index 05ba2a7e1..753610639 100644
--- a/pkg/controller/association/dynamic_watches.go
+++ b/pkg/controller/association/dynamic_watches.go
@@ -35,6 +35,12 @@ func serviceWatchName(associated types.NamespacedName) string {
 	return fmt.Sprintf("%s-%s-svc-watch", associated.Namespace, associated.Name)
 }
 
+// serviceWatchName returns the name of the watch setup on the custom service to be used to make requests to the
+// referenced resource.
+func additionalSecretWatchName(associated types.NamespacedName) string {
+	return fmt.Sprintf("%s-%s-secrets-watch", associated.Namespace, associated.Name)
+}
+
 // reconcileWatches sets up dynamic watches for:
 // * the referenced resource(s) managed or not by ECK (e.g. Elasticsearch for Kibana -> Elasticsearch associations)
 // * the CA secret of the referenced resource in the referenced resource namespace
@@ -93,17 +99,31 @@ func (r *Reconciler) reconcileWatches(associated types.NamespacedName, associati
 		}
 	}
 
+	if r.AdditionalSecrets != nil {
+		if err := ReconcileGenericWatch(associated, managedElasticRef, r.watches.Secrets, additionalSecretWatchName(associated), func() ([]types.NamespacedName, error) {
+			var toWatch []types.NamespacedName
+			for _, association := range associations {
+				secs, err := r.AdditionalSecrets(r.Client, association)
+				if err != nil {
+					return nil, err
+				}
+				toWatch = append(toWatch, secs...)
+			}
+			return toWatch, nil
+		}); err != nil {
+			return err
+		}
+	}
+
 	return nil
 }
 
-// ReconcileWatch sets or removes `watchName` watch in `dynamicRequest` based on `associated` and `associations` and
-// `watchedFunc`. No watch is added if watchedFunc(association) refers to an empty namespaced name.
-func ReconcileWatch(
+func ReconcileGenericWatch(
 	associated types.NamespacedName,
 	associations []commonv1.Association,
 	dynamicRequest *watches.DynamicEnqueueRequest,
 	watchName string,
-	watchedFunc func(association commonv1.Association) types.NamespacedName,
+	watchedFunc func() ([]types.NamespacedName, error),
 ) error {
 	if len(associations) == 0 {
 		// clean up if there are none
@@ -111,23 +131,41 @@ func ReconcileWatch(
 		return nil
 	}
 
-	emptyNamespacedName := types.NamespacedName{}
-
-	toWatch := make([]types.NamespacedName, 0, len(associations))
-	for _, association := range associations {
-		watchedNamespacedName := watchedFunc(association)
-		if watchedNamespacedName != emptyNamespacedName {
-			toWatch = append(toWatch, watchedFunc(association))
-		}
+	watched, err := watchedFunc()
+	if err != nil {
+		return err
 	}
-
 	return dynamicRequest.AddHandler(watches.NamedWatch{
 		Name:    watchName,
-		Watched: toWatch,
+		Watched: watched,
 		Watcher: associated,
 	})
 }
 
+// ReconcileWatch sets or removes `watchName` watch in `dynamicRequest` based on `associated` and `associations` and
+// `watchedFunc`. No watch is added if watchedFunc(association) refers to an empty namespaced name.
+func ReconcileWatch(
+	associated types.NamespacedName,
+	associations []commonv1.Association,
+	dynamicRequest *watches.DynamicEnqueueRequest,
+	watchName string,
+	watchedFunc func(association commonv1.Association) types.NamespacedName,
+) error {
+
+	return ReconcileGenericWatch(associated, associations, dynamicRequest, watchName, func() ([]types.NamespacedName, error) {
+		emptyNamespacedName := types.NamespacedName{}
+
+		toWatch := make([]types.NamespacedName, 0, len(associations))
+		for _, association := range associations {
+			watchedNamespacedName := watchedFunc(association)
+			if watchedNamespacedName != emptyNamespacedName {
+				toWatch = append(toWatch, watchedFunc(association))
+			}
+		}
+		return toWatch, nil
+	})
+}
+
 // RemoveWatch removes `watchName` watch from `dynamicRequest`.
 func RemoveWatch(dynamicRequest *watches.DynamicEnqueueRequest, watchName string) {
 	dynamicRequest.RemoveHandlerForKey(watchName)
diff --git a/pkg/controller/association/reconciler.go b/pkg/controller/association/reconciler.go
index f31cb953e..2239ddd9c 100644
--- a/pkg/controller/association/reconciler.go
+++ b/pkg/controller/association/reconciler.go
@@ -60,6 +60,8 @@ type AssociationInfo struct { //nolint:revive
 	AssociationName string
 	// AssociatedShortName is the short name of the associated resource type (eg. "kb").
 	AssociatedShortName string
+
+	AdditionalSecrets func(c k8s.Client, assoc commonv1.Association) ([]types.NamespacedName, error)
 	// Labels are labels set on all resources created for association purpose. Note that some resources will be also
 	// labelled with AssociationResourceNameLabelName and AssociationResourceNamespaceLabelName in addition to any
 	// labels provided here.
@@ -258,6 +260,19 @@ func (r *Reconciler) reconcileAssociation(ctx context.Context, association commo
 		return commonv1.AssociationPending, err // maybe not created yet
 	}
 
+	if r.AdditionalSecrets != nil {
+		additionalSecrets, err := r.AdditionalSecrets(r.Client, association)
+		if err != nil {
+			return commonv1.AssociationPending, err // maybe not created yet
+		}
+		for _, sec := range additionalSecrets {
+			if err := copySecret(ctx, r.Client, association.GetNamespace(), sec); err != nil {
+				return commonv1.AssociationPending, err
+			}
+		}
+
+	}
+
 	url, err := r.AssociationInfo.ExternalServiceURL(r.Client, association)
 	if err != nil {
 		// the Service may not have been created by the resource controller yet
diff --git a/pkg/controller/association/secret.go b/pkg/controller/association/secret.go
index 618153e0f..317626fac 100644
--- a/pkg/controller/association/secret.go
+++ b/pkg/controller/association/secret.go
@@ -14,10 +14,12 @@ import (
 	"net/http"
 
 	corev1 "k8s.io/api/core/v1"
+	"k8s.io/apimachinery/pkg/types"
 	"k8s.io/client-go/util/jsonpath"
 
 	commonv1 "github.com/elastic/cloud-on-k8s/v2/pkg/apis/common/v1"
 	"github.com/elastic/cloud-on-k8s/v2/pkg/controller/common/certificates"
+	"github.com/elastic/cloud-on-k8s/v2/pkg/controller/common/reconciler"
 	"github.com/elastic/cloud-on-k8s/v2/pkg/controller/common/version"
 	"github.com/elastic/cloud-on-k8s/v2/pkg/controller/elasticsearch/client"
 	"github.com/elastic/cloud-on-k8s/v2/pkg/utils/k8s"
@@ -189,3 +191,20 @@ func filterManagedElasticRef(associations []commonv1.Association) []commonv1.Ass
 	}
 	return r
 }
+
+func copySecret(ctx context.Context, client k8s.Client, targetNamespace string, source types.NamespacedName) error {
+	var original corev1.Secret
+	if err := client.Get(ctx, source, &original); err != nil {
+		return err
+	}
+	var expected corev1.Secret
+	expected.Data = original.Data
+	expected.Labels = original.Labels
+	expected.Annotations = original.Annotations
+	expected.Name = original.Name
+	expected.Namespace = targetNamespace
+
+	_, err := reconciler.ReconcileSecret(ctx, client, expected, nil)
+	return err
+
+}

The idea is to introduce a new concept "additional secrets" that we can spec out in the association controllers (following your lead with the "transitive associations"). It serves exactly one purpose to instruct the assoc controller to copy the secrets we indicate into the target namespace and add a watch for them so that they are updated when the original changes.
What is missing from my implementation draft is that we need to rotate the agent pods when the CA secret changes (this might actually be a bug in the current implementation not sure) but it could go into applyRelatedEsAssoc. The second thing that is missing is the optimisation to not do any of the copying when everything is in one namespace.

@@ -108,7 +108,13 @@ func internalReconcile(params Params) (*reconciler.Results, agentv1alpha1.AgentS

configHash := fnv.New32a()
var fleetCerts *certificates.CertificatesSecret
if params.Agent.Spec.FleetServerEnabled && params.Agent.Spec.HTTP.TLS.Enabled() {
if params.Agent.Spec.HTTP.TLS.Enabled() && params.Agent.Spec.FleetModeEnabled() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we generating certificates for non Fleet server agents?

Comment on lines +358 to +360
// In the case of a transitive association, no CA is configured in the annotation, so we need to
// use the method used within the Association controller to generate the expected secret name.
caSecretName = association.CACertSecretName(esAssociation, "agent-fleetserver")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you think this is true? In case of a transitive association this would be the relationship between FleetServer and Elasticsearch and if that relationship is suing self-signed certs managed by ECK there should be a CA. I am confused as to why this is needed. Can you explain?

Comment on lines +115 to +121
if a.AssociationType == commonv1.FleetServerAssociationType && association.AssociationType() == commonv1.ElasticsearchAssociationType {
// we are dealing with a transitive association, the labels are different
associationResourceNameLabelName = eslabel.ClusterNameLabelName
associationResourceNamespaceLabelName = eslabel.ClusterNamespaceLabelName
// unfortunately the constant cannot be used here as it would create a dependency cycle.
associatedLabels["agentassociation.k8s.elastic.co/type"] = commonv1.ElasticsearchAssociationType
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this import cycle is a telltale sign that this code should not be here. I don't understand why we are now special casing Fleet in the generic association controller. If that's what we want to do it would have been easier to just copy the CA secret in the agent controller if ES and the Agents are not in the same namespace. And set up a dynamic watch there.

Comment on lines +52 to +53
transitiveAssociatedLabels["agentassociation.k8s.elastic.co/type"] = commonv1.ElasticsearchAssociationType
labelSets = append(labelSets, transitiveAssociatedLabels)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be in the generic association controller code

esAssociation, err = association.SingleAssociationOfType(fs.GetAssociations(), commonv1.ElasticsearchAssociationType)
// We copy the Fleet Server Refs to the Agent so that the association appears to come from
// the Elastic Agent, not the Fleet Server and is named appropriately.
agent.Spec.ElasticsearchRefs = fs.Spec.ElasticsearchRefs
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

@naemono
Copy link
Contributor Author

naemono commented Dec 12, 2023

In my testing it just spins out of control deleting and recreating secrets in a frenzy.

@pebrc I hadn't seen this in my testing, which is odd.

because I got confused myself by the complexities of our association code and it was easier for me reason about it in code

Yeah, as mentioned my experience in this are of this codebase was limited, and it was a bit confusing for me as well. I'm glad to see I'm not the only one.

Why do you think this is true? In case of a transitive association this would be the relationship between FleetServer and Elasticsearch and if that relationship is suing self-signed certs managed by ECK there should be a CA. I am confused as to why this is needed. Can you explain?

This is where I think some of the confusion is coming into play, and I'd like to try and explain what I'm seeing with an example. What I'm seeing is this:

  1. The Fleetserver->ES association creates a secret for Fleet server to use that includes the CA of ES.
  2. The Elastic Agent associated with Fleetserver fails to come online when associated with a fleet server in a different namespace as the secret mount configured for the Elastic Agent is referencing a secret that exists in the Fleet Server namespace.

I'll work through the the given diff and update soon... Thanks for the detailed review @pebrc

@naemono
Copy link
Contributor Author

naemono commented Dec 12, 2023

More specifics:

With the following (truncated) manifest:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch
  namespace: default
---
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kibana
  namespace: default
---
apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: fleet-server
  namespace: server
spec:
  version: 8.10.4
  kibanaRef:
    name: kibana
    namespace: default
  elasticsearchRefs:
  - name: elasticsearch
    namespace: default
  mode: fleet
  fleetServerEnabled: true
  policyID: eck-fleet-server
  deployment:
    replicas: 1
    podTemplate:
      spec:
        volumes:
        - name: agent-data
          emptyDir: {}
        serviceAccountName: fleet-server
        automountServiceAccountToken: true
        securityContext:
          runAsUser: 0
---
apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata: 
  name: elastic-agent
  namespace: agents
spec:
  version: 8.10.4
  kibanaRef:
    name: kibana
    namespace: default
  fleetServerRef: 
    name: fleet-server
    namespace: server
  mode: fleet
  policyID: eck-agent
  daemonSet:
    podTemplate:
      spec:
        volumes:
        - name: agent-data
          emptyDir: {}
        serviceAccountName: elastic-agent
        hostNetwork: true
        dnsPolicy: ClusterFirstWithHostNet
        automountServiceAccountToken: true
        securityContext:
          runAsUser: 0

The fleet => es manifest/association creates a CA secret fleet-server-agent-es-default-elasticsearch-ca which is added as a mount in the Fleet Agent spec:

❯ kc get deploy -n server fleet-server-agent -o yaml | yq '.spec.template.spec.volumes[2]'
name: elasticsearch-certs
secret:
  defaultMode: 420
  optional: false
  secretName: fleet-server-agent-es-default-elasticsearch-ca

The agent => fleet manifest causes this mount/volume to be created in the Agent daemonset spec which does not exist in the agents namespace (it's "owned" by the fleet=>es association, not the Agent with fleetserverRef).

❯ kc get ds -n agents elastic-agent-agent -o yaml | yq '.spec.template.spec.volumes[2]'
name: elasticsearch-certs
secret:
  defaultMode: 420
  optional: false
  secretName: fleet-server-agent-es-default-elasticsearch-ca

Which causes the following error:

24s         Warning   FailedMount           pod/elastic-agent-agent-9gjxx   MountVolume.SetUp failed for volume "elasticsearch-certs" : secret "fleet-server-agent-es-default-elasticsearch-ca" not found

This is why I was attempting to make this association be handled in the agent=>fleet association, and not the agent=>es (fleet-server) as it should be owned by that association since the CA is needed for the Agents themselves that are connected to Fleet when they need to communication with Elasticsearch.

@naemono
Copy link
Contributor Author

naemono commented Dec 13, 2023

Closing in favor of #7382

@naemono naemono closed this Dec 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug Something isn't working >enhancement Enhancement of existing functionality v2.11.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Running Agent as non root in a dedicated namespace is not possible
4 participants