Delay / missing metrics from agent running as AWS ECS / Fargate sidecar #16620

Closed
credpath-seek opened this issue Apr 17, 2023 · 0 comments

credpath-seek commented Apr 17, 2023

Update - Resolved

It was the same as this issue:
JVM metrics not showing up until 10 minutes after service starts up

Basically the jmx flavour of the agent runs jmxfetch. That package looks for beans/metrics immediately after starting up, and then only checks again after 600 seconds (hence the 10 minute delay), so any beans the application registers after that first check are missed until the next refresh.

There is a config option refresh_beans_initial which allows you to set the amount of time before it does the first refresh. We set it to 30 seconds, and the problem was resolved.

The documentation for that config option is here: https://docs.datadoghq.com/integrations/java/?tab=host#configuration-options

Note that the documentation does not specify where to put the config. For posterity, we put it in the conf.d/confluent_platform.d/conf.yaml file under instances as follows:

init_config:
    is_jmx: true
    collect_default_metrics: true
    new_gc_metrics: true
    service_check_prefix: confluent
instances:
  - host: localhost
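    # without this, the next bean refresh only happens after 600 seconds (the 10 minute delay)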
    refresh_beans_initial: 30
    port: 9999

After the fix, you can see that the new deployment starts sending metrics much earlier than in the example below:
[screenshot: new deployment sending metrics shortly after startup]

Scenario / Issue Description

  • We have an AWS ECS Cluster running a service with Fargate tasks
  • In each task we have a Java application in one container, and a DataDog agent running as a sidecar
  • When a new task is deployed, metrics that would be reported by the confluent_platform integration are not collected for about 10 minutes
  • This causes the metrics to be under-reported for that period

Example of how it looks during a scale-up from 2 to 3 tasks:
[screenshot: records_consumed_rate during the scale-up from 2 to 3 tasks]

This example is using a metric from our confluent_platform integration. Although the metric indicates a drop in the number of records consumed, the records are in fact being consumed by the new task. Approximately 10 minutes after the scale-up, the metrics begin to be reported correctly.

More info

Above Example

Here is the query used in the above example: sum:confluent.kafka.consumer.fetch_topic.records_consumed_rate{$service,$env} by {client-id}
Note that the service and env tags are set via the DD_DOGSTATSD_TAGS environment variable on the agent, and client-id comes from the confluent_platform integration. We have checked whether the metrics are being sent without tags, and don't think that's the case.

Scope of Missing Metrics

  • This seems to only impact metrics from our confluent_platform integration with the DD agent (e.g. the example above with confluent.kafka.consumer.fetch_topic.records_consumed_rate).
  • Custom metrics collected with the DogStatsD client show up right away
  • Metrics collected from AWS directly such as ecs.fargate.mem.usage appear for the new task straight away.
    See this pic filtering by task ARN:
    [screenshot: ecs.fargate.mem.usage filtered by task ARN]

Config

Our ECS cluster is not set up in a way that allows us to run exec commands on containers, so we haven't been able to send a flare. Here is some of the config:

  • We copy this conf.yaml into the conf.d/confluent_platform.d/ directory on the agent (the instance points at the JMX endpoint our Java container exposes on port 9999; see the sketch after this list):
init_config:
    is_jmx: true
    collect_default_metrics: true
    new_gc_metrics: true
    service_check_prefix: confluent
instances:
  - host: localhost
    port: 9999
  • This is some of the env on the agent:
      ECS_FARGATE:                    'true'
      DD_APM_ENABLED:                 'true'
      DD_TRACE_ANALYTICS:             'true'
      DD_LOG_LEVEL:                   'WARN'
      DD_DOGSTATSD_TAGS:              'service:service-name env:staging'
      DD_DOGSTATSD_NON_LOCAL_TRAFFIC: 'true'
      DD_DOGSTATSD_PORT:              '8125'
      DD_AGENT_HOST:                  '127.0.0.1'
      DD_CHECKS_TAG_CARDINALITY:      'orchestrator'
      DD_DOGSTATSD_TAG_CARDINALITY:   'orchestrator'
  • We use the datadog/agent:latest-jmx image
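
For context, host: localhost / port: 9999 in the check config works because containers in the same Fargate task share a network namespace, so the sidecar can reach the Java container's JMX endpoint over localhost. A rough sketch of how the JVM can be told to expose that endpoint via the ECS container definition (the container name, image placeholder, and the JAVA_TOOL_OPTIONS approach are illustrative; this assumes a standard unauthenticated, local-only JMX setup, and the exact flags depend on how the app enables JMX):

    {
        "name": "java-app",
        "image": "<our-app-image>",
        "environment": [
            {
                "name": "JAVA_TOOL_OPTIONS",
                "value": "-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9999 -Dcom.sun.management.jmxremote.rmi.port=9999 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Djava.rmi.server.hostname=localhost"
            }
        ]
    }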

Logging

Looking at the built-in logging of ECS, we see that the ContainerKnownStatus for the container stays in PENDING for an extended time:

11:13:24.000 AM
   ContainerKnownStatus: PENDING
   ContainerName: jmx-dd-sidecar

...

11:25:53.000 AM
   ContainerKnownStatus: PENDING
   ContainerName: jmx-dd-sidecar

11:26:50.000 AM	
   ContainerKnownStatus: RUNNING
   ContainerName: jmx-dd-sidecar

In this example, metrics started coming through at about 11:21:00, a few minutes before the ContainerKnownStatus went to RUNNING. This is currently the most promising lead we are following up on.

Approaches we tried

  • We used the dependsOn behaviour of ECS to set up a dependency of our Java app on the dd-agent sidecar (see the sketch after this list). The agent would report healthy and our service would start up as normal, but it didn't affect the behaviour.
  • We looked at the agent config, but the closest thing we could find was to set DD_DOGSTATSD_TAG_CARDINALITY: 'orchestrator', which was already in our config.
  • Due to not being able to run exec in the containers, we haven't been able to run agent status. We are in the process of setting up a cron job to log it so that we can have more information.
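
For reference, the dependsOn entry in the task definition looks roughly like this (jmx-dd-sidecar is the sidecar container name from the logs below; the app container name is illustrative, and we show the HEALTHY condition to match the agent reporting healthy as described above):

    {
        "name": "java-app",
        "dependsOn": [
            {
                "containerName": "jmx-dd-sidecar",
                "condition": "HEALTHY"
            }
        ]
    }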

Existing tickets

We tried looking into existing tickets, but didn't find anything that matched the same behaviour.

This one looked the most similar in terms of the pattern displayed:
under reporting of count metrics when using a sidecar in aws fargate with metrics using DogStatsD and multiple tasks per service #3159

we found that during deploys any tags that were gained via auto-discovery would have a drop-out. basically the metrics would come in, but without the normal tagging due to auto-discovery ramping up. our solution was to set up the tags in the datadog config on startup rather than use auto-discovery

But in that example it appeared to be due to a slow startup of auto-discovery, which is not the case for us.

This one also looked similar:
Metrics without ECS tags

it seems the default behavior of the datadog agent in fargate is to drop metrics that would get reported by the same version datadog agent running on an ecs ec2 cluster. And that is exactly what we saw when we moved a service from an ecs ec2 cluster to fargate

But that one didn't match the delay, and had to do with the host tag, which we are not using.
