APM config from local configuration before enrollment is lost #5204
I believe that I have identified the issue. Below shows the flow and where the issue exists:
It is possible for steps 5 and 6 to get crossed depending on timing, which can very rarely cause APM tracing to work, as @simitt noted once. There are a few possible solutions:
Looking for input on the proper solution from @elastic/elastic-agent-control-plane |
Moved this to the Elastic Agent repository because it is an Elastic Agent issue, not a Fleet Server or APM issue. |
Additional option: elastic-agent/internal/pkg/agent/cmd/container.go Lines 201 to 214 in 66c2483
Which uses the |
Thanks for the investigation, @blakerouse, and thanks for the additional option, @michel-laterman. I'm adding this issue to our next sprint, which starts Monday, and assigning it to @michel-laterman, looking at other priorities and capacity. So let's pick it up then. |
That could work if I lean towards option 3, but I don't know if we want that behavior. Seems okay to me, but might not be what we want for the product (aka. allowing all options to be used locally before the policy is applied on top). |
Interestingly, we already attempted to do this in #4166; see the
This feels like it is along the lines of what users want to happen for Fleet-managed agents: their initial configuration turns something on, and then if Fleet doesn't set it, or also turns it on, it stays enabled. There is no window where a feature is briefly disabled during the transition from standalone to Fleet during enrollment. |
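The "stays enabled" behavior described above — the local config acts as the base and the Fleet policy is layered on top, so a locally enabled feature survives unless the policy explicitly changes it — can be sketched as a recursive map merge. This is an illustrative sketch only, not the agent's actual implementation:

```go
package main

import "fmt"

// mergeOnTop overlays the Fleet policy onto the local config:
// keys set in the policy win, keys present only in the local
// config survive, and nested maps are merged recursively.
func mergeOnTop(local, policy map[string]any) map[string]any {
	out := make(map[string]any, len(local))
	for k, v := range local {
		out[k] = v
	}
	for k, pv := range policy {
		if pm, ok := pv.(map[string]any); ok {
			if lm, ok := out[k].(map[string]any); ok {
				out[k] = mergeOnTop(lm, pm)
				continue
			}
		}
		out[k] = pv
	}
	return out
}

func main() {
	// Locally enabled tracing before enrollment.
	local := map[string]any{
		"agent": map[string]any{
			"monitoring": map[string]any{"traces": true},
		},
	}
	// Fleet policy that does not mention traces at all.
	policy := map[string]any{
		"agent": map[string]any{
			"monitoring": map[string]any{"enabled": true},
		},
	}
	merged := mergeOnTop(local, policy)
	mon := merged["agent"].(map[string]any)["monitoring"].(map[string]any)
	fmt.Println(mon["traces"], mon["enabled"]) // traces stays enabled
}
```

With this semantic there is no window where the locally enabled feature flips off during enrollment, since the policy never replaces the whole config wholesale.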
Discussed this in today's team meeting and @cmacknz suggested an alternative approach that might solve the problem for APM Server in ESS. To be clear, this issue here is still pointing to a larger issue we have with config replacement so it needs to be tackled per the options being discussed in the last few comments but it might not be as high priority if we can implement @cmacknz's proposed approach for APM Server in ESS. That approach is basically to set the APM tracing configuration as part of the APM Server policy in ESS. I'm going to test it using the overrides API and verify that the config is passed all the way down to the APM Server component by looking at the diagnostic. cc: @simitt |
I tested this today and it should work. I created a new policy (because I couldn't override the preset
Then, using the policy override API, I set
So I believe the solution would be to update the cc: @simitt |
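To make the discussion concrete: the setting at the heart of this issue is the agent's self-monitoring traces switch, which in a standalone agent config looks roughly like this (a sketch — the exact keys the policy schema ends up permitting are whatever the schema change discussed below settles on):

```yaml
agent.monitoring:
  # Ship the agent's own APM traces — the setting that is lost today
  # when the locally configured value is replaced at enrollment.
  traces: true
```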
The agent policy config schema would also need to be updated to allow
Without this change, the config including |
Thanks @kpollich. Here is the PR to allow |
@ycombinator thanks for looking into alternatives here. We are very eager to finally get apm tracing enabled for apm-server, however, I am a bit worried about building a snowflake solution here, and would prefer a general fix, to avoid any conflicts with future changes that might not consider this specific solution. |
Hi @simitt, sorry I wasn't clearer in #5204 (comment), but the proposed approach is not a snowflake or workaround solution. @cmacknz can keep me honest but the thinking here is that since the APM Server in ESS is part of a Fleet-managed EA, it makes sense for any configuration for that APM Server to come from the Fleet-managed policy (as opposed to from the EA policy that's locally on disk). |
Yes, as of now |
Thanks for the clarifications and timelines! |
I tested on ESS QA with a However, I'm not 100% sure if this is sufficient. @juliaElastic do we also need to inject the following section under
If so, where would one get the |
@ycombinator Yeah, it's needed I think. There is a logic in the cloud repo that sets these values, the issue is they are not applied in agent. https://github.com/elastic/cloud/blob/master/scala-services/runner/src/main/scala/no/found/runner/allocation/stateless/ApmDockerContainer.scala#L434 |
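Based on the container code linked above, the section to inject under `agent.monitoring` would presumably look something like the following. This is illustrative only; the concrete values are substituted by the cloud control plane, and the placeholder names are the ones that appear later in this thread:

```yaml
agent.monitoring:
  apm:
    hosts:
      - "{{ elastic.apm.serverUrl }}"
    environment: {{ elastic.apm.environment }}
    secret_token: {{ elastic.apm.secretToken }}
```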
Thanks @juliaElastic.
In that case, why am I not seeing them in the "Elastic Cloud Agent policy" (see the screenshot in #5204 (comment))? Do I need to do some extra configuration elsewhere to have these values show up in the policy?
Yes, I see the temporary workaround in Agent code in: elastic-agent/internal/pkg/agent/application/apm_config_modifier.go Lines 111 to 114 in fd477ec
Once we can confirm that Agent is able to receive the values from Fleet (for that I need to know the answer to the questions above), we can work on making the necessary changes in Agent to remove the temporary workaround. |
@ycombinator AFAIK these APM configs are not added to the agent policy, but directly to |
@juliaElastic Right, that's what we are trying to move away from and have the APM configuration be part of the Fleet-managed Agent policy instead 🙂. See #5204 (comment). So is there some way we can make that happen? I was able to get
|
I see, it seems an APM config is already there in the cloud policy based on a conditional, so maybe we just need to tweak the condition to enable it on all internal ESS clusters: https://github.com/elastic/cloud-assets/blob/4e9cf8979f57fd08db8a7ebb2b476b852fbd72bf/stackpack/kibana/config/kibana.yml#L315-L343 |
@simitt is probably the best person to answer this question. |
The apm data should be sent to the internal Elastic cluster, for support engineers and developers to leverage the data for troubleshooting. |
Thanks @simitt. @juliaElastic @nchaulet Given this, do you know what the policy should use for the values of |
BTW, since I've been testing on ESS QA and feature freeze for |
I'm not sure how to reference the internal Elastic cluster in kibana config. @simitt Could you help with that? |
I think @AlexP-Elastic might be able to provide the details here, as ES and Kibana are already shipping tracing data to cloud regional clusters. |
The Elastic cluster is injected by the control plane into the templated config files in the stack pack: https://github.com/elastic/cloud/blob/master/scala-services/runner/src/main/scala/no/found/runner/allocation/stateless/KibanaDockerContainer.scala#L561-L589 for regional values, and for global values. So, assuming I'm understanding correctly and you want some Kibana code to inject them into the APM policy, the easiest would be if you could reference (there's an additional complication in getting the fields that are needed to bypass IP filtering) |
Are these values already available to reference in kibana.yml, or is a code change needed to make them work? |
These values are already in the Kibana YAML (for versions of Kibana that support them). You can't see them in the stackpack you linked because they are injected by the control plane infrastructure explicitly (not via the templating we use for defaults) |
I tested locally to set monitoring in a preconfigured policy with overrides as Nicolas suggested here: #5204 (comment) |
@AlexP-Elastic I tested the change in the latest 8.16-SNAPSHOT and it seems the substitutions of
|
Oh no, I just realized I was guilty of totally not understanding what you were doing and giving your last PR a distracted LGTM instead of reading it :( Sorry about that. What I thought you were proposing to do was to have Kibana inject the fields into the policy (and the PR https://github.com/elastic/cloud-assets/pull/1573/files was just adding some placeholder fields). What it actually does:

```yaml
agent.monitoring:
  apm:
    hosts:
      - "{{ elastic.apm.serverUrl }}"
    environment: {{ elastic.apm.environment }}
    secret_token: {{ elastic.apm.secretToken }}
```

doesn't work at all because
Hang on, I need to think about this a moment, now that I know you are trying to do it all via templates and not via code in Kibana |
@AlexP-Elastic No worries, we could inject the fields from kibana too, I'm just not sure where to take these values from in kibana. |
This is one option: https://github.com/elastic/cloud/pull/131470/files .. I think this is preferable to writing code inside Kibana, if we have to do it using the existing settings I think my preferred architectural solution would be for the APM container to take the values injected into |
Yeah I'm not sure if we could change the preconfiguration or call the kibana Fleet API from the APM container to modify the cloud agent policy. |
@juliaElastic - Is this blocked because the root cause fix here would involve making changes to APM itself, or because we are waiting on https://github.com/elastic/cloud/pull/131470? |
I'm waiting to see if https://github.com/elastic/cloud/pull/131470 works, otherwise we will need to get some help from the APM team to see if we can add the config from the APM container. |
I'm just testing https://github.com/elastic/cloud/pull/131470 now, I think we (control plane) are happy to go forward with this as the plan, so once it's working we'll get it merged (hopefully by the end of the week) and you can follow the |
@juliaElastic Sorry for the delay, https://github.com/elastic/cloud/pull/131470 is now merged and in QA, so you can re-create your https://github.com/elastic/cloud/pull/131470 PR vs and then the next day we can actually test it out in QA |
@AlexP-Elastic Thanks, I created a pr: https://github.com/elastic/cloud-assets/pull/1588 |
@juliaElastic - yep I meant to create a PR Actually I just found out that we've branched master -> 9.x and 8.x to 8.x, so you'll need to issue the same PR against the |
Tested today in cloud QA by creating an 8.16-SNAPSHOT deployment; I'm seeing the APM config in the cloud agent policy. Though when I looked up traces on the APM server the metrics were sent to, I'm not seeing any fleet-server traces from the test deployment. Checking fleet-server logs to see what happens. Seeing "monitoring server started successfully" in the fleet-server logs:
Am I missing something here? I don't see any APM-related errors in the logs. EDIT: Never mind, I found the fleet-server traces by searching on one trace id from the logs; I didn't find them earlier because the deployment id is not on fleet-server traces.
|
Anything else to do before we close the issue? We can add a sample rate to the cloud config when this is done: #5211 |
@juliaElastic The sampling rate being optional, I am not sure we want to add it by default. I would rather consider this issue done and create a follow-up PR later on if we need to add a sampling rate. |
+1 to consider this done |
Very cool to see this done, @juliaElastic and @AlexP-Elastic. Thank you! Agreed on adding |
We discovered an issue with the default APM config injected by cloud to internal ESS clusters: https://elasticco.atlassian.net/browse/CP-3464
It seems this APM config is not applied on newly created clusters, and fleet-server traces are not being sent to https://overview.elastic-cloud.com/app/r/s/JIEzg
It is not clear if the issue is on elastic-agent or fleet-server side.
Related doc: https://github.com/elastic/kibana/blob/main/x-pack/plugins/fleet/dev_docs/apm_tracing.md
Originally posted by @juliaElastic in elastic/fleet-server#3328 (comment)