Delay / missing metrics from agent running as AWS ECS / Fargate sidecar #16620

Closed
credpath-seek opened this issue Apr 17, 2023 · 0 comments

credpath-seek commented Apr 17, 2023

Update - Resolved

It was the same as this issue:
JVM metrics not showing up until 10 minutes after service starts up

Basically the jmx flavour of the agent runs jmxfetch. That package looks for beans/metrics immediately after starting up, and then only checks again after 600 seconds (hence the 10 minute delay), so any beans the application registers after that first check are missed until the next refresh.

There is a config option refresh_beans_initial which allows you to set the amount of time before it does the first refresh. We set it to 30 seconds, and the problem was resolved.

The documentation for that config option is here: https://docs.datadoghq.com/integrations/java/?tab=host#configuration-options

Note that the documentation does not specify where to put the config. For posterity, we put it in the conf.d/confluent_platform.d/conf.yaml file under instances as follows:

init_config:
    is_jmx: true
    collect_default_metrics: true
    new_gc_metrics: true
    service_check_prefix: confluent
instances:
  - host: localhost
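    # without this, the next bean refresh only happens after 600 seconds (the 10 minute delay)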
    refresh_beans_initial: 30
    port: 9999

After the fix, you can see that the new deployment starts sending metrics much earlier than in the example below:
[screenshot: new deployment sending metrics shortly after startup]

Scenario / Issue Description

  • We have an AWS ECS Cluster running a service with Fargate tasks
  • In each task we have a Java application in one container, and a DataDog agent running as a sidecar
  • When a new task is deployed, metrics that would be reported by the confluent_platform integration are not collected for about 10 minutes
  • This causes the metrics to be under-reported for that period

Example of how it looks during a scale-up from 2 to 3 tasks:
[screenshot: records_consumed_rate during the scale-up from 2 to 3 tasks]

This example is using a metric from our confluent_platform integration. Although the metric indicates a drop in the number of records consumed, the records are in fact being consumed by the new task. Approximately 10 minutes after the scale-up, the metrics begin to be reported correctly.

More info

Above Example

Here is the query used in the above example: sum:confluent.kafka.consumer.fetch_topic.records_consumed_rate{$service,$env} by {client-id}
Note that the service and env tags are set via the DD_DOGSTATSD_TAGS environment variable on the agent, and client-id comes from the confluent_platform integration. We have checked whether the metrics are being sent without tags, and don't think that's the case.

Scope of Missing Metrics

  • This seems to only impact metrics from our confluent_platform integration with the DD agent (e.g. the example above with confluent.kafka.consumer.fetch_topic.records_consumed_rate).
  • Custom metrics collected with the DogStatsD client show up right away
  • Metrics collected from AWS directly such as ecs.fargate.mem.usage appear for the new task straight away.
    See this pic filtering by task ARN:
    [screenshot: ecs.fargate.mem.usage filtered by task ARN]

Config

Our ECS cluster is not set up in a way that allows us to run exec commands on containers, so we haven't been able to send a flare. Here is some of the config:

  • We copy this conf.yaml into the conf.d/confluent_platform.d/ directory on the agent (the instance points at the JMX endpoint our Java container exposes on port 9999; see the sketch after this list):
init_config:
    is_jmx: true
    collect_default_metrics: true
    new_gc_metrics: true
    service_check_prefix: confluent
instances:
  - host: localhost
    port: 9999
  • This is some of the env on the agent:
      ECS_FARGATE:                    'true'
      DD_APM_ENABLED:                 'true'
      DD_TRACE_ANALYTICS:             'true'
      DD_LOG_LEVEL:                   'WARN'
      DD_DOGSTATSD_TAGS:              'service:service-name env:staging'
      DD_DOGSTATSD_NON_LOCAL_TRAFFIC: 'true'
      DD_DOGSTATSD_PORT:              '8125'
      DD_AGENT_HOST:                  '127.0.0.1'
      DD_CHECKS_TAG_CARDINALITY:      'orchestrator'
      DD_DOGSTATSD_TAG_CARDINALITY:   'orchestrator'
  • We use the datadog/agent:latest-jmx image
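
For context, host: localhost / port: 9999 in the check config works because containers in the same Fargate task share a network namespace, so the sidecar can reach the Java container's JMX endpoint over localhost. A rough sketch of how the JVM can be told to expose that endpoint via the ECS container definition (the container name, image placeholder, and the JAVA_TOOL_OPTIONS approach are illustrative; this assumes a standard unauthenticated, local-only JMX setup, and the exact flags depend on how the app enables JMX):

    {
        "name": "java-app",
        "image": "<our-app-image>",
        "environment": [
            {
                "name": "JAVA_TOOL_OPTIONS",
                "value": "-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9999 -Dcom.sun.management.jmxremote.rmi.port=9999 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Djava.rmi.server.hostname=localhost"
            }
        ]
    }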

Logging

Looking at the built-in logging of ECS, we see that the ContainerKnownStatus for the container stays in PENDING for an extended time:

11:13:24.000 AM
   ContainerKnownStatus: PENDING
   ContainerName: jmx-dd-sidecar

...

11:25:53.000 AM
   ContainerKnownStatus: PENDING
   ContainerName: jmx-dd-sidecar

11:26:50.000 AM	
   ContainerKnownStatus: RUNNING
   ContainerName: jmx-dd-sidecar

In this example, metrics started coming through at about 11:21:00, a few minutes before the ContainerKnownStatus went to RUNNING. This is currently the most promising lead we are following up on.

Approaches we tried

  • We used the dependsOn behaviour of ECS to set up a dependency of our Java app on the dd-agent sidecar (see the sketch after this list). The agent would report healthy and our service would start up as normal, but it didn't affect the behaviour.
  • We looked at the agent config, but the closest thing we could find was to set DD_DOGSTATSD_TAG_CARDINALITY: 'orchestrator', which was already in our config.
  • Due to not being able to run exec in the containers, we haven't been able to run agent status. We are in the process of setting up a cron job to log it so that we can have more information.
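
For reference, the dependsOn entry in the task definition looks roughly like this (jmx-dd-sidecar is the sidecar container name from the logs below; the app container name is illustrative, and we show the HEALTHY condition to match the agent reporting healthy as described above):

    {
        "name": "java-app",
        "dependsOn": [
            {
                "containerName": "jmx-dd-sidecar",
                "condition": "HEALTHY"
            }
        ]
    }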

Existing tickets

We tried looking into existing tickets, but didn't find anything that matched the same behaviour.

This one looked the most similar in terms of the pattern displayed:
under reporting of count metrics when using a sidecar in aws fargate with metrics using DogStatsD and multiple tasks per service #3159

we found that during deploys any tags that were gained via auto-discovery would have a drop-out. basically the metrics would come in, but without the normal tagging due to auto-discovery ramping up. our solution was to set up the tags in the datadog config on startup rather than use auto-discovery

But in that example it appeared to be due to a slow startup of auto-discovery, which is not the case for us.

This one also looked similar:
Metrics without ECS tags

it seems the default behavior of the datadog agent in fargate is to drop metrics that would get reported by the same version datadog agent running on an ecs ec2 cluster. And that is exactly what we saw when we moved a service from an ecs ec2 cluster to fargate

But that one didn't match the delay, and had to do with the host tag, which we are not using.
