
Unable to set sample_rate for nginx from v1.1.4 onward #148

Open
tahnik opened this issue Jul 20, 2020 · 7 comments

Comments

@tahnik

tahnik commented Jul 20, 2020

I am using nginx-opentracing with a Node.js backend. At first I tried v1.1.2 of dd-opentracing, which uses the sample_rate from dd-config.json when dd.priority.sample is set to false. However, it doesn't set any x-datadog- headers, so the sample rate is not propagated to the backend. From v1.1.4 onward, the backend does get the headers and correctly applies the sample rate based on the sampling_rules. However, nginx itself stays at a 100% trace rate, so nginx appears to ignore both sample_rate and sampling_rules. How do I control the nginx sample rate in that case?
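
For reference, the kind of v1.1.2-style config I'm describing looked roughly like this (just a sketch using the option names as I described them above; the values are examples, not my real setup):

{
  "service": "nginx",
  "agent_host": "dd-agent",
  "agent_port": 8126,
  "sample_rate": 0.1,
  "dd.priority.sample": false
}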

I know it sounds like an nginx-opentracing issue, but it seems to originate from dd-opentracing, so I am opening the issue here.

Thank you.

@cgilmour
Contributor

Hi @tahnik, you've opened the issue on the correct project.
There were changes in v1.1.4 related to sampling, and this sounds like a bug to me - an oversight in the sampling_rules implementation.

A workaround in the meantime is to use sampling rules in your JSON config file. The example below will sample 10% of initiated traces, and will send the x-datadog-sampling-priority header when propagating the trace.

{
  "service": "nginx",
  "operation_name_override": "nginx.handle",
  "agent_host": "dd-agent",
  "agent_port": 8126,
  "sampling_rules": [{"sample_rate": 0.1}]
}
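
If it helps, here's a rough sketch of how a JSON config like that usually gets wired into nginx through the nginx-opentracing module (the module, plugin, and config paths and the upstream name below are placeholders for whatever your installation uses):

# nginx.conf (sketch only; adjust paths to your installation)
load_module modules/ngx_http_opentracing_module.so;

http {
  # Load the Datadog OpenTracing plugin with the JSON config shown above.
  opentracing_load_tracer /usr/local/lib/libdd_opentracing_plugin.so /etc/nginx/dd-config.json;
  opentracing on;

  server {
    listen 80;

    location / {
      # Forward the x-datadog-* headers to the upstream service.
      opentracing_propagate_context;
      proxy_pass http://node-backend;
    }
  }
}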

Let me know if that works for you, but we can keep the issue open until the bug is fixed.

@tahnik
Author

tahnik commented Jul 21, 2020

Thank you for your prompt response @cgilmour. As I mentioned, setting sampling_rules does send the x-datadog-sampling-priority header correctly. However, nginx itself is still being sampled at 100%. Here is the test setup I have:

nginx ----> node server

If I set the sampling rules to "sampling_rules": [{"sample_rate": 0.0}], the node server gets x-datadog-sampling-priority: 0 and therefore doesn't send any traces, which I am happy with. But nginx keeps sending every single trace. It seems like nginx itself is not following the rule at all.

@cgilmour
Contributor

Right, so what you used to see with priority sampling disabled and a sampling rate set globally was a lower number of traces sent to the agent.

The new behavior is that all traces are sent to the agent, even when the sampling priority is 0.
This is intentional, and the agent uses that to capture metrics about number of requests, errors, and a handful of other things.

However, not all traces are sent from the agent to datadog.
Traces with sampling priority: 1 should be sent from agent to datadog.
Also, a low rate (1-2 per second) of traces with sampling priority: 0 will get sent to datadog.
Usually these traces have the error tag on them, or have a higher than expected latency.

The remainder are dropped, but get counted via the metrics, so that the service pages can still show total requests, errors, latency, and endpoint information.

Does that match what you're seeing for the nginx side of things?

@tahnik
Author

tahnik commented Jul 22, 2020

I see, that means that nginx is always being sampled at 100%, but the agent should not be sending all of those traces on to Datadog. Although I will have to confirm this, wouldn't collecting traces for every single request hurt nginx performance quite a bit?

@cgilmour
Contributor

I see, that means that nginx is always being sampled at 100%.

From one perspective, yes - nginx sent trace data to the agent 100% of the time.
From the overall perspective, no - the trace doesn't get sent outside of your system to datadog.
The sampling decision is made in one place, and the agent applies that decision in a different place.

In terms of performance, yes it will have an impact but the exact amount needs to be measured.
The CPU overhead should be relatively consistent when measured against traffic rate, and memory overhead should be measured against request duration.
Internal benchmarking from a while ago showed roughly 10-15% overhead on CPU utilization and 0.5-1KB per request memory overhead.

There's a plan to do some benchmarking and optimizing at some point, to get some updated numbers on those things. The urgency is quite low though because nginx performance with tracing enabled has not been highlighted as an issue.

@jgulick48

I'm going to add my experience here as well. In a high-traffic datacenter (40k req/s), a sample rate of 1 (100%) results in about 100GB per hour of trace data. As we are working on adopting APM, this immediate jump in Datadog spend isn't maintainable. Looking through the documentation and the information here, I'm not able to find a way to change the sample rate when the Datadog agent and nginx are deployed to a Kubernetes cluster via Helm. In the current setup, 100% of the spans are sent to the Datadog agent, and 100% of those spans are then also sent to Datadog rather than being sampled out.

@dgoffredo
Contributor

dgoffredo commented Sep 2, 2021

There is the CPU/memory/IO cost of sending traces to the local datadog agent, and separately there is the internet bandwidth cost of the agent sending traces to datadog.

Setting sample_rate to a number closer to zero will reduce the latter cost, but not the former. It sounds like you could avoid the extra spend, @jgulick48, by adjusting the sampling rate.
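
For example, reusing the workaround config from earlier in this thread (the host and port values are placeholders for your environment), a rule like this would keep roughly 1% of traces:

{
  "service": "nginx",
  "agent_host": "dd-agent",
  "agent_port": 8126,
  "sampling_rules": [{"sample_rate": 0.01}]
}

nginx would still send trace data for every request to the local agent (the former cost), but the agent should only forward about 1% of those traces on to Datadog (the latter cost).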
