Update - Resolved
It was the same as this issue: JVM metrics not showing up until 10 minutes after service starts up
Basically, the JMX version of the agent calls JMXFetch. That package looks for beans/metrics immediately after starting up, and then only checks again after 600 seconds (hence the 10 minute delay).
There is a config option, refresh_beans_initial, which allows you to set the amount of time before it does that first refresh. We set it to 30 seconds, and the problem was resolved. The documentation for that config option is here: https://docs.datadoghq.com/integrations/java/?tab=host#configuration-options
Note that the documentation does not specify where to put the config. For posterity, we put it in the conf.d/confluent_platform.d/conf.yaml file under instances.
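A minimal sketch of where that option goes (the host/port values here are placeholders, not our real settings):
```yaml
# conf.d/confluent_platform.d/conf.yaml -- relevant part only (sketch)
instances:
  - host: localhost              # placeholder: JMX host of the Java container
    port: 31001                  # placeholder: JMX remote port exposed by the app
    refresh_beans_initial: 30    # first bean refresh after 30s instead of the 600s default
```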
After the fix, you can see that a new deployment starts sending metrics much earlier than in the example below:
Scenario / Issue Description
We have an AWS ECS Cluster running a service with Fargate tasks
In each task we have a Java application in one container, and a DataDog agent running as a sidecar
When a new task is deployed, metrics that would be reported by the confluent_platform integration are not collected for about 10 minutes
This causes the metrics to be under-reported for that period
Example of how it looks during a scale-up from 2 to 3 tasks:
This example uses a metric from our confluent_platform integration. Although the graph suggests a drop in the number of records consumed, they are actually being consumed by the new task. Approximately 10 minutes after the scale-up, the metrics begin to be reported correctly.
More info
Above Example
Here is the query used in the above example: sum:confluent.kafka.consumer.fetch_topic.records_consumed_rate{$service,$env} by {client-id}
Note that the service and env tags are set by the DD_DOGSTATSD_TAGS environment variable on the agent, and client-id comes from the confluent_platform integration. We have looked at whether the metrics are being sent without tags, and don't think that's the case.
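For reference, those tags come from the agent container's environment, roughly like this (a CloudFormation-style sketch; the tag values are placeholders, not our real task definition):
```yaml
# Sketch of the Datadog sidecar's environment (CloudFormation-style YAML);
# tag values are placeholders.
Environment:
  - Name: DD_DOGSTATSD_TAGS
    Value: "service:my-service env:prod"   # supplies the $service / $env tags
  - Name: DD_DOGSTATSD_TAG_CARDINALITY
    Value: "orchestrator"
```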
Scope of Missing Metrics
This seems to only impact metrics from our confluent_platform integration with the DD agent (e.g. the example above with confluent.kafka.consumer.fetch_topic.records_consumed_rate).
Custom metrics collected with the dogstatsd client show up right away.
Metrics collected directly from AWS, such as ecs.fargate.mem.usage, appear for the new task straight away.
See this pic filtering by task ARN:
Config
Our ECS cluster is not set up in a way that allows us to run exec functions on containers, so we haven't been able to send a flare. Here is some of the config:
We use the datadog/agent:latest-jmx image, and we copy this conf.yaml into the conf.d/confluent_platform.d/ directory on the agent:
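The integration config looks roughly like this (a sketch; the host/port values are placeholders rather than our real settings):
```yaml
# conf.d/confluent_platform.d/conf.yaml (sketch; values are placeholders)
init_config:
  is_jmx: true                   # confluent_platform is a JMX-based check

instances:
  - host: localhost              # JMX host of the Java container in the same task
    port: 31001                  # JMX remote port exposed by the app
```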
Logging
Looking at the built-in logging of ECS, we see that the ContainerKnownStatus for the container stays in PENDING for an extended time:
11:13:24.000 AM
ContainerKnownStatus: PENDING
ContainerName: jmx-dd-sidecar
...
11:25:53.000 AM
ContainerKnownStatus: PENDING
ContainerName: jmx-dd-sidecar
11:26:50.000 AM
ContainerKnownStatus: RUNNING
ContainerName: jmx-dd-sidecar
In this example, metrics started coming through at about 11:21:00, a few minutes before the ContainerKnownStatus went to RUNNING. This is currently the most promising lead we are trying to follow up on.
Approaches we tried
We used the dependsOn behaviour of ECS to set up a dependency of our Java app on the dd-agent sidecar (a rough sketch is shown below). The agent would report healthy and our service would start up as normal, but it didn't affect the behaviour.
We looked at the agent config, but the closest thing we could find was DD_DOGSTATSD_TAG_CARDINALITY: 'orchestrator', which was already in our config.
Due to not being able to run exec in the containers, we haven't been able to run agent status. We are in the process of setting up a cron job to log it so that we have more information.
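For reference, the dependsOn setup we tried looks roughly like this (a CloudFormation-style sketch; the application container name is a placeholder):
```yaml
# Sketch of the container dependency (CloudFormation-style YAML);
# "java-app" is a placeholder for our application container.
ContainerDefinitions:
  - Name: java-app
    DependsOn:
      - ContainerName: jmx-dd-sidecar   # the Datadog agent sidecar
        Condition: HEALTHY              # wait until the agent reports healthy
  - Name: jmx-dd-sidecar
    Image: datadog/agent:latest-jmx
```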
Existing tickets
We tried looking into existing tickets, but didn't find anything that matched this behaviour.
This one looked the most similar in terms of the pattern displayed: under reporting of count metrics when using a sidecar in aws fargate with metrics using DogStatsD and multiple tasks per service #3159. Quoting from that ticket:
we found that during deploys any tags that were gained via auto-discovery would have a drop-out. basically the metrics would come in, but without the normal tagging due to auto-discovery ramping up. our solution was to setup the tags in the datadog config on startup rather than use auto-discovery
But in that example the cause appeared to be a slow startup of auto-discovery, which is not the case for us.
This one also looked similar: Metrics without ECS tags. Quoting from that ticket:
it seems the default behavior of the datadog agent in fargate is to drop metrics that would get reported by the same version datadog agent running on an ecs ec2 cluster. And that is exactly what we saw when we moved a service from an ecs ec2 cluster to fargate
But that one didn't match the delay, and had to do with the host tag, which we are not using.