
Unable to set sample_rate for nginx from v1.1.4 onward #148

Open
tahnik opened this issue Jul 20, 2020 · 7 comments

Comments

@tahnik

tahnik commented Jul 20, 2020

I am using nginx-opentracing with a Node.js backend. At first I tried v1.1.2 of dd-opentracing, which uses the sample_rate from dd-config.json when dd.priority.sample is set to false. However, it doesn't set any x-datadog- headers, so the sample rate is not propagated to the backend. From v1.1.4 onward, the backend does get the headers and correctly applies the sample rate based on the sampling_rules. However, nginx itself stays at a 100% trace rate, so nginx appears to ignore both sample_rate and sampling_rules. How do I control the nginx sample rate in that case?
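
For reference, the kind of v1.1.2-style config I'm describing looked roughly like this (just a sketch using the option names as I described them above; the values are examples, not my real setup):

{
  "service": "nginx",
  "agent_host": "dd-agent",
  "agent_port": 8126,
  "sample_rate": 0.1,
  "dd.priority.sample": false
}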

I know it sounds like an nginx-opentracing issue, but it seems to originate from dd-opentracing, so I am opening the issue here.

Thank you.

@cgilmour
Contributor

Hi @tahnik, you've opened the issue on the correct project.
There were changes in v1.1.4 related to sampling, and this sounds like a bug to me - an oversight in the sampling_rules implementation.

A workaround in the meantime is to use sampling rules in your JSON config file. The example below will sample 10% of initiated traces, and will send the x-datadog-sampling-priority header when propagating the trace.

{
  "service": "nginx",
  "operation_name_override": "nginx.handle",
  "agent_host": "dd-agent",
  "agent_port": 8126,
  "sampling_rules": [{"sample_rate": 0.1}]
}
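
If it helps, here's a rough sketch of how a JSON config like that usually gets wired into nginx through the nginx-opentracing module (the module, plugin, and config paths and the upstream name below are placeholders for whatever your installation uses):

# nginx.conf (sketch only; adjust paths to your installation)
load_module modules/ngx_http_opentracing_module.so;

http {
  # Load the Datadog OpenTracing plugin with the JSON config shown above.
  opentracing_load_tracer /usr/local/lib/libdd_opentracing_plugin.so /etc/nginx/dd-config.json;
  opentracing on;

  server {
    listen 80;

    location / {
      # Forward the x-datadog-* headers to the upstream service.
      opentracing_propagate_context;
      proxy_pass http://node-backend;
    }
  }
}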

Let me know if that works for you, but we can keep the issue open until the bug is fixed.

@tahnik
Author

tahnik commented Jul 21, 2020

Thank you for your prompt response @cgilmour. As I mentioned, setting sampling_rules does send the x-datadog-sampling-priority header correctly. However, nginx itself is still being sampled at 100%. Here is the test setup I have:

nginx ----> node server

If I set the sampling rules to "sampling_rules": [{"sample_rate": 0.0}], the node server gets x-datadog-sampling-priority: 0 and therefore doesn't send any traces, which I am happy with. But nginx keeps sending every single trace. It seems like nginx itself is not following the rule at all.

@cgilmour
Contributor

Right, so what you used to see with priority sampling disabled and a sampling rate set globally was a lower number of traces sent to the agent.

The new behavior is that all traces are sent to the agent, even when the sampling priority is 0.
This is intentional, and the agent uses that to capture metrics about number of requests, errors, and a handful of other things.

However, not all traces are sent from the agent to datadog.
Traces with sampling priority: 1 should be sent from agent to datadog.
Also, a low rate (1-2 per second) of traces with sampling priority: 0 will get sent to datadog.
Usually these traces have the error tag on them, or have a higher than expected latency.

The remainder are dropped, but get counted via the metrics, so that the service pages can still show total requests, errors, latency, and endpoint information.

Does that match what you're seeing for the nginx side of things?

@tahnik
Author

tahnik commented Jul 22, 2020

I see, that means that nginx is always being sampled at 100%, but the agent should not be sending all of those traces on to Datadog. Although I will have to confirm this, wouldn't collecting traces for every single request hurt nginx performance quite a bit?

@cgilmour
Contributor

I see, that means that nginx is always being sampled at 100%.

From one perspective, yes - nginx sent trace data to the agent 100% of the time.
From the overall perspective, no - the trace doesn't get sent outside of your system to datadog.
The sampling decision is made in one place, and the agent applies that decision in a different place.

In terms of performance, yes it will have an impact but the exact amount needs to be measured.
The CPU overhead should be relatively consistent when measured against traffic rate, and memory overhead should be measured against request duration.
Internal benchmarking from a while ago showed roughly 10-15% overhead on CPU utilization and 0.5-1KB per request memory overhead.

There's a plan to do some benchmarking and optimizing at some point, to get some updated numbers on those things. The urgency is quite low though because nginx performance with tracing enabled has not been highlighted as an issue.

@jgulick48

I'm going to add my experience here as well. In a high-traffic datacenter (40k req/s), a sample rate of 1 (100%) results in about 100GB per hour of trace data. As we are working on adopting APM, this immediate jump in Datadog spend isn't maintainable. Looking through the documentation and the information here, I'm not able to find a way to change the sample rate when the Datadog agent and nginx are deployed to a Kubernetes cluster via Helm. In the current setup, 100% of the spans are sent to the Datadog agent, and 100% of those spans are then also sent to Datadog rather than being sampled out.

@dgoffredo
Contributor

dgoffredo commented Sep 2, 2021

There is the CPU/memory/IO cost of sending traces to the local datadog agent, and separately there is the internet bandwidth cost of the agent sending traces to datadog.

Setting sample_rate to a number closer to zero will reduce the latter cost, but not the former. It sounds like you could avoid the extra spend, @jgulick48, by adjusting the sampling rate.
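
For example, reusing the workaround config from earlier in this thread (the host and port values are placeholders for your environment), a rule like this would keep roughly 1% of traces:

{
  "service": "nginx",
  "agent_host": "dd-agent",
  "agent_port": 8126,
  "sampling_rules": [{"sample_rate": 0.01}]
}

nginx would still send trace data for every request to the local agent (the former cost), but the agent should only forward about 1% of those traces on to Datadog (the latter cost).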
