-
Notifications
You must be signed in to change notification settings - Fork 717
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow Elastic Agent in different namespace than Elastic Stack. #7353
Allow Elastic Agent in different namespace than Elastic Stack. #7353
Conversation
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we also need to adjust the documentation, ECK 2.10.0 does not allow Agent and ES to run in different namespaces.
There is one thing I do not understand. I think it is fine to remove the restriction for those use cases where users follow our non-root recipe which explicitly configures Kibana to propagate the ES CA through the What I struggle to understand is how our standard Fleet/Agent recipes are going to work with this restriction removed: |
buildkite test this -f p=gke,t=TestFleetKubernetesIntegrationRecipe |
Could we update |
As suspected this does not work, pods are stuck in
My setup https://gist.github.com/pebrc/930b57bc69c0ad6d483ee86b17ad1acd P.S. it does work if Fleet server and agents share a namespace. But I think we should make sure it works in all combinations e.g. by copying the CA secret if it does not exist. |
I think I know the scenario where this differs from all of the tests I ran yesterday. Fleet is not in the same namespace as Agent. In all of mine and @barkbay testing, Elasticsearch was in one namespace, Kibana was in another, and Fleet and Agents were in a third... I'll update the docs in this PR to indicate that, and add back this restriction to be very specific about what it's restricting. |
Update documentation Update tests. Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
This has been updated. Do we want to add additional e2e tests for this? |
Should we not try to lift the restriction completely as I suggested by copying the secret into the agents namespace if it does not exist there? |
Yes, but I would be more in favor of modifying an existing one like We should update the step in the Agent builder that verifies that Elasticsearch is ready to use the Elasticsearch namespace instead of the Agent namespace. diff --git a/test/e2e/test/agent/steps.go b/test/e2e/test/agent/steps.go
index 4d6fe93d9..48fc04367 100644
--- a/test/e2e/test/agent/steps.go
+++ b/test/e2e/test/agent/steps.go
@@ -97,9 +97,9 @@ func (b Builder) CreationTestSteps(k *test.K8sClient) test.StepList {
// This is to improve test stability see https://github.com/elastic/cloud-on-k8s/issues/7172
Name: "Wait for Elasticsearch to be green",
Test: test.UntilSuccess(func() error {
- for _, ref := range b.Agent.Spec.ElasticsearchRefs {
+ for _, esRef := range b.Agent.Spec.ElasticsearchRefs {
var es esv1.Elasticsearch
- if err := k.Client.Get(context.Background(), ref.WithDefaultNamespace(b.Agent.Namespace).NamespacedName(), &es); err != nil {
+ if err := k.Client.Get(context.Background(), esRef.NamespacedName(), &es); err != nil {
return err
}
if es.Status.Health != esv1.ElasticsearchGreenHealth {
|
My apologies, I missed reading this last sentence. I'm handling this now. |
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Moving this back to a WIP as I work through some changes. Status:
Todo:
|
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
buildkite test this -f E2E_TAGS=agent,p=gke -m s=8.11.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! The created secret made my setup with Agents and Fleet server in different namespaces work.
There certainly are some additional issues with this, as I'm seeing when running new e2e tests:
I'll work through these, and get these new e2e tests functional/updated. |
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
…secret watched. Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Fix labels for transitive associations. Fix ensuring secrets associated with transitive associations are deleted. Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Add back log statement. Add comment. Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is seems to be an issue with the garbage collection in this PR. In my testing it just spins out of control deleting and recreating secrets in a frenzy.
More generally I think we want to have an implementation where specifics of one association (here agent/fleet server) do not bleed into the generic association reconciler code. I see two options: 1. handling the secret copying outside of the association mechanism directly in the agent controller 2. similar to your approach doing it in the association controller but in a slightly different and simpler way that does not try to mimic an association.
I am including a diff because I went down the route of experimenting in code when reviewing your PR because I got confused myself by the complexities of our association code and it was easier for me reason about it in code. That does not mean that you have to blindly follow the implementation I propose here:
diff --git a/pkg/controller/agent/pod.go b/pkg/controller/agent/pod.go
index 7a7825d2e..4b38e7988 100644
--- a/pkg/controller/agent/pod.go
+++ b/pkg/controller/agent/pod.go
@@ -333,15 +333,15 @@ func applyRelatedEsAssoc(agent agentv1alpha1.Agent, esAssociation commonv1.Assoc
return builder, nil
}
- esRef := esAssociation.AssociationRef()
- if !esRef.IsExternal() && !agent.Spec.FleetServerEnabled && agent.Namespace != esRef.Namespace {
- // check agent and ES share the same namespace
- return nil, fmt.Errorf(
- "agent namespace %s is different than referenced Elasticsearch namespace %s, this is not supported yet",
- agent.Namespace,
- esAssociation.AssociationRef().Namespace,
- )
- }
+ // esRef := esAssociation.AssociationRef()
+ // if !esRef.IsExternal() && !agent.Spec.FleetServerEnabled && agent.Namespace != esRef.Namespace {
+ // // check agent and ES share the same namespace
+ // return nil, fmt.Errorf(
+ // "agent namespace %s is different than referenced Elasticsearch namespace %s, this is not supported yet",
+ // agent.Namespace,
+ // esAssociation.AssociationRef().Namespace,
+ // )
+ // }
// no ES CA to configure, skip
assocConf, err := esAssociation.AssociationConf()
diff --git a/pkg/controller/association/controller/agent_fleetserver.go b/pkg/controller/association/controller/agent_fleetserver.go
index 491d31cdd..59ed8c433 100644
--- a/pkg/controller/association/controller/agent_fleetserver.go
+++ b/pkg/controller/association/controller/agent_fleetserver.go
@@ -30,6 +30,7 @@ func AddAgentFleetServer(mgr manager.Manager, accessReviewer rbac.AccessReviewer
AssociationName: "agent-fleetserver",
AssociatedShortName: "agent",
AssociationType: commonv1.FleetServerAssociationType,
+ AdditionalSecrets: addtionalSecrets,
Labels: func(associated types.NamespacedName) map[string]string {
return map[string]string{
AgentAssociationLabelName: associated.Name,
@@ -45,6 +46,47 @@ func AddAgentFleetServer(mgr manager.Manager, accessReviewer rbac.AccessReviewer
})
}
+func addtionalSecrets(c k8s.Client, assoc commonv1.Association) ([]types.NamespacedName, error) {
+ associated := assoc.Associated()
+ var agent agentv1alpha1.Agent
+ nsn := types.NamespacedName{Namespace: associated.GetNamespace(), Name: associated.GetName()}
+ if err := c.Get(context.Background(), nsn, &agent); err != nil {
+ return nil, err
+ }
+ fleetServerRef := assoc.AssociationRef()
+ if !fleetServerRef.IsDefined() {
+ return nil, nil
+ }
+ fleetServer := agentv1alpha1.Agent{}
+ if err := c.Get(context.Background(), fleetServerRef.NamespacedName(), &fleetServer); err != nil {
+ return nil, err
+ }
+
+ // If the Fleet Server Agent is not associated with an Elasticsearch cluster
+ // (potentially because of a manual setup) we should do nothing.
+ if len(fleetServer.Spec.ElasticsearchRefs) == 0 {
+ return nil, nil
+ }
+ esAssociation, err := association.SingleAssociationOfType(fleetServer.GetAssociations(), commonv1.ElasticsearchAssociationType)
+ if err != nil {
+ return nil, err
+ }
+
+ // TODO optimisation: if both agent and ES are same namespace no copying needed
+
+ conf, err := esAssociation.AssociationConf()
+ if err != nil {
+ return nil, err
+ }
+ if conf == nil || !conf.CACertProvided {
+ return nil, nil
+ }
+ return []types.NamespacedName{{
+ Namespace: fleetServer.Namespace,
+ Name: conf.CASecretName,
+ }}, nil
+}
+
func getFleetServerExternalURL(c k8s.Client, assoc commonv1.Association) (string, error) {
fleetServerRef := assoc.AssociationRef()
if !fleetServerRef.IsDefined() {
diff --git a/pkg/controller/association/dynamic_watches.go b/pkg/controller/association/dynamic_watches.go
index 05ba2a7e1..753610639 100644
--- a/pkg/controller/association/dynamic_watches.go
+++ b/pkg/controller/association/dynamic_watches.go
@@ -35,6 +35,12 @@ func serviceWatchName(associated types.NamespacedName) string {
return fmt.Sprintf("%s-%s-svc-watch", associated.Namespace, associated.Name)
}
+// serviceWatchName returns the name of the watch setup on the custom service to be used to make requests to the
+// referenced resource.
+func additionalSecretWatchName(associated types.NamespacedName) string {
+ return fmt.Sprintf("%s-%s-secrets-watch", associated.Namespace, associated.Name)
+}
+
// reconcileWatches sets up dynamic watches for:
// * the referenced resource(s) managed or not by ECK (e.g. Elasticsearch for Kibana -> Elasticsearch associations)
// * the CA secret of the referenced resource in the referenced resource namespace
@@ -93,17 +99,31 @@ func (r *Reconciler) reconcileWatches(associated types.NamespacedName, associati
}
}
+ if r.AdditionalSecrets != nil {
+ if err := ReconcileGenericWatch(associated, managedElasticRef, r.watches.Secrets, additionalSecretWatchName(associated), func() ([]types.NamespacedName, error) {
+ var toWatch []types.NamespacedName
+ for _, association := range associations {
+ secs, err := r.AdditionalSecrets(r.Client, association)
+ if err != nil {
+ return nil, err
+ }
+ toWatch = append(toWatch, secs...)
+ }
+ return toWatch, nil
+ }); err != nil {
+ return err
+ }
+ }
+
return nil
}
-// ReconcileWatch sets or removes `watchName` watch in `dynamicRequest` based on `associated` and `associations` and
-// `watchedFunc`. No watch is added if watchedFunc(association) refers to an empty namespaced name.
-func ReconcileWatch(
+func ReconcileGenericWatch(
associated types.NamespacedName,
associations []commonv1.Association,
dynamicRequest *watches.DynamicEnqueueRequest,
watchName string,
- watchedFunc func(association commonv1.Association) types.NamespacedName,
+ watchedFunc func() ([]types.NamespacedName, error),
) error {
if len(associations) == 0 {
// clean up if there are none
@@ -111,23 +131,41 @@ func ReconcileWatch(
return nil
}
- emptyNamespacedName := types.NamespacedName{}
-
- toWatch := make([]types.NamespacedName, 0, len(associations))
- for _, association := range associations {
- watchedNamespacedName := watchedFunc(association)
- if watchedNamespacedName != emptyNamespacedName {
- toWatch = append(toWatch, watchedFunc(association))
- }
+ watched, err := watchedFunc()
+ if err != nil {
+ return err
}
-
return dynamicRequest.AddHandler(watches.NamedWatch{
Name: watchName,
- Watched: toWatch,
+ Watched: watched,
Watcher: associated,
})
}
+// ReconcileWatch sets or removes `watchName` watch in `dynamicRequest` based on `associated` and `associations` and
+// `watchedFunc`. No watch is added if watchedFunc(association) refers to an empty namespaced name.
+func ReconcileWatch(
+ associated types.NamespacedName,
+ associations []commonv1.Association,
+ dynamicRequest *watches.DynamicEnqueueRequest,
+ watchName string,
+ watchedFunc func(association commonv1.Association) types.NamespacedName,
+) error {
+
+ return ReconcileGenericWatch(associated, associations, dynamicRequest, watchName, func() ([]types.NamespacedName, error) {
+ emptyNamespacedName := types.NamespacedName{}
+
+ toWatch := make([]types.NamespacedName, 0, len(associations))
+ for _, association := range associations {
+ watchedNamespacedName := watchedFunc(association)
+ if watchedNamespacedName != emptyNamespacedName {
+ toWatch = append(toWatch, watchedFunc(association))
+ }
+ }
+ return toWatch, nil
+ })
+}
+
// RemoveWatch removes `watchName` watch from `dynamicRequest`.
func RemoveWatch(dynamicRequest *watches.DynamicEnqueueRequest, watchName string) {
dynamicRequest.RemoveHandlerForKey(watchName)
diff --git a/pkg/controller/association/reconciler.go b/pkg/controller/association/reconciler.go
index f31cb953e..2239ddd9c 100644
--- a/pkg/controller/association/reconciler.go
+++ b/pkg/controller/association/reconciler.go
@@ -60,6 +60,8 @@ type AssociationInfo struct { //nolint:revive
AssociationName string
// AssociatedShortName is the short name of the associated resource type (eg. "kb").
AssociatedShortName string
+
+ AdditionalSecrets func(c k8s.Client, assoc commonv1.Association) ([]types.NamespacedName, error)
// Labels are labels set on all resources created for association purpose. Note that some resources will be also
// labelled with AssociationResourceNameLabelName and AssociationResourceNamespaceLabelName in addition to any
// labels provided here.
@@ -258,6 +260,19 @@ func (r *Reconciler) reconcileAssociation(ctx context.Context, association commo
return commonv1.AssociationPending, err // maybe not created yet
}
+ if r.AdditionalSecrets != nil {
+ additionalSecrets, err := r.AdditionalSecrets(r.Client, association)
+ if err != nil {
+ return commonv1.AssociationPending, err // maybe not created yet
+ }
+ for _, sec := range additionalSecrets {
+ if err := copySecret(ctx, r.Client, association.GetNamespace(), sec); err != nil {
+ return commonv1.AssociationPending, err
+ }
+ }
+
+ }
+
url, err := r.AssociationInfo.ExternalServiceURL(r.Client, association)
if err != nil {
// the Service may not have been created by the resource controller yet
diff --git a/pkg/controller/association/secret.go b/pkg/controller/association/secret.go
index 618153e0f..317626fac 100644
--- a/pkg/controller/association/secret.go
+++ b/pkg/controller/association/secret.go
@@ -14,10 +14,12 @@ import (
"net/http"
corev1 "k8s.io/api/core/v1"
+ "k8s.io/apimachinery/pkg/types"
"k8s.io/client-go/util/jsonpath"
commonv1 "github.com/elastic/cloud-on-k8s/v2/pkg/apis/common/v1"
"github.com/elastic/cloud-on-k8s/v2/pkg/controller/common/certificates"
+ "github.com/elastic/cloud-on-k8s/v2/pkg/controller/common/reconciler"
"github.com/elastic/cloud-on-k8s/v2/pkg/controller/common/version"
"github.com/elastic/cloud-on-k8s/v2/pkg/controller/elasticsearch/client"
"github.com/elastic/cloud-on-k8s/v2/pkg/utils/k8s"
@@ -189,3 +191,20 @@ func filterManagedElasticRef(associations []commonv1.Association) []commonv1.Ass
}
return r
}
+
+func copySecret(ctx context.Context, client k8s.Client, targetNamespace string, source types.NamespacedName) error {
+ var original corev1.Secret
+ if err := client.Get(ctx, source, &original); err != nil {
+ return err
+ }
+ var expected corev1.Secret
+ expected.Data = original.Data
+ expected.Labels = original.Labels
+ expected.Annotations = original.Annotations
+ expected.Name = original.Name
+ expected.Namespace = targetNamespace
+
+ _, err := reconciler.ReconcileSecret(ctx, client, expected, nil)
+ return err
+
+}
The idea is to introduce a new concept "additional secrets" that we can spec out in the association controllers (following your lead with the "transitive associations"). It serves exactly one purpose to instruct the assoc controller to copy the secrets we indicate into the target namespace and add a watch for them so that they are updated when the original changes.
What is missing from my implementation draft is that we need to rotate the agent pods when the CA secret changes (this might actually be a bug in the current implementation not sure) but it could go into applyRelatedEsAssoc
. The second thing that is missing is the optimisation to not do any of the copying when everything is in one namespace.
@@ -108,7 +108,13 @@ func internalReconcile(params Params) (*reconciler.Results, agentv1alpha1.AgentS | |||
|
|||
configHash := fnv.New32a() | |||
var fleetCerts *certificates.CertificatesSecret | |||
if params.Agent.Spec.FleetServerEnabled && params.Agent.Spec.HTTP.TLS.Enabled() { | |||
if params.Agent.Spec.HTTP.TLS.Enabled() && params.Agent.Spec.FleetModeEnabled() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we generating certificates for non Fleet server agents?
// In the case of a transitive association, no CA is configured in the annotation, so we need to | ||
// use the method used within the Association controller to generate the expected secret name. | ||
caSecretName = association.CACertSecretName(esAssociation, "agent-fleetserver") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you think this is true? In case of a transitive association this would be the relationship between FleetServer and Elasticsearch and if that relationship is suing self-signed certs managed by ECK there should be a CA. I am confused as to why this is needed. Can you explain?
if a.AssociationType == commonv1.FleetServerAssociationType && association.AssociationType() == commonv1.ElasticsearchAssociationType { | ||
// we are dealing with a transitive association, the labels are different | ||
associationResourceNameLabelName = eslabel.ClusterNameLabelName | ||
associationResourceNamespaceLabelName = eslabel.ClusterNamespaceLabelName | ||
// unfortunately the constant cannot be used here as it would create a dependency cycle. | ||
associatedLabels["agentassociation.k8s.elastic.co/type"] = commonv1.ElasticsearchAssociationType | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this import cycle is a telltale sign that this code should not be here. I don't understand why we are now special casing Fleet in the generic association controller. If that's what we want to do it would have been easier to just copy the CA secret in the agent controller if ES and the Agents are not in the same namespace. And set up a dynamic watch there.
transitiveAssociatedLabels["agentassociation.k8s.elastic.co/type"] = commonv1.ElasticsearchAssociationType | ||
labelSets = append(labelSets, transitiveAssociatedLabels) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not be in the generic association controller code
esAssociation, err = association.SingleAssociationOfType(fs.GetAssociations(), commonv1.ElasticsearchAssociationType) | ||
// We copy the Fleet Server Refs to the Agent so that the association appears to come from | ||
// the Elastic Agent, not the Fleet Server and is named appropriately. | ||
agent.Spec.ElasticsearchRefs = fs.Spec.ElasticsearchRefs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
@pebrc I hadn't seen this in my testing, which is odd.
Yeah, as mentioned my experience in this are of this codebase was limited, and it was a bit confusing for me as well. I'm glad to see I'm not the only one.
This is where I think some of the confusion is coming into play, and I'd like to try and explain what I'm seeing with an example. What I'm seeing is this:
I'll work through the the given diff and update soon... Thanks for the detailed review @pebrc |
More specifics: With the following (truncated) manifest:
The fleet => es manifest/association creates a CA secret
The agent => fleet manifest causes this mount/volume to be created in the Agent daemonset spec which does not exist in the
Which causes the following error:
This is why I was attempting to make this association be handled in the agent=>fleet association, and not the agent=>es (fleet-server) as it should be owned by that association since the CA is needed for the Agents themselves that are connected to Fleet when they need to communication with Elasticsearch. |
Closing in favor of #7382 |
resolves #7352
What does this change?
Both root and non-root, fleet and non-fleet Agent installations were being blocked from Agent running in a different namespace because of a legacy association check. This legacy check checked that the agent was running in the same namespace as Elasticsearch. The actual issue turned out to be when Agent is not running in the same namespace as Fleet server when fleet mode is enabled. This change enables Agent to run in any namespace regardless of where Fleet or Elasticsearch is deployed.
Interesting background information.
In the current ECK version when fleet mode is enabled the Agent that is associated with the Fleet Server is actually mounting a secret created/owned by the Fleet Server Agent's association with Elasticsearch, not the Agent associated with Fleet Server, which is a bit odd, and was where the underlying issue stems from.
Changes: