Reporting client data loss stats to Jaeger backend #2005

Closed
3 of 4 tasks
yurishkuro opened this issue Jan 6, 2020 · 8 comments
Labels
  • changelog:new-feature (Change that should be called out as new feature in CHANGELOG)
  • enhancement
  • help wanted (Features that maintainers are willing to accept but do not have cycles to implement)

Comments

@yurishkuro
Member

yurishkuro commented Jan 6, 2020

Requirement - what kind of business use case are you trying to solve?

We want to have accurate global measurements of the potential data loss in Jaeger clients.

Problem - what in Jaeger blocks you from solving the requirement?

Although Jaeger clients internally support integrations with metrics backends, not all users actually configure them. Even when they do, the metrics are often namespaced to the application, e.g. by including a service=xyz tag or using a custom prefix, which makes them more difficult to observe across the site.

Proposal - what do you suggest to solve the problem or improve the existing situation?

Allow clients to report certain data loss metrics when submitting span batches to Jaeger backend. The data elements should include:

  • batch sequence number to detect possible dropped packets (especially with UDP transports)
  • count of spans dropped by the client due to internal queue overflow
  • count of spans dropped by the client due to exceeding max packet size (e.g. 65000 bytes cap for UDP transport)

Some Jaeger clients already include a client-uuid string tag in the Process, which would allow the agent/collector to attribute the reported stats to a specific client and emit consistent metrics.

@ghost ghost added the needs-triage label Jan 6, 2020
@yurishkuro yurishkuro self-assigned this Jan 6, 2020
@yurishkuro yurishkuro changed the title Reporting some client stats to Jaeger backend Reporting client data loss stats to Jaeger backend Jan 6, 2020
@yurishkuro
Member Author

@jpkrohling @pavolloffay fyi

@jpkrohling
Contributor

I wouldn't expect that extracting metrics from clients is difficult, especially because applications should be exposing their own metrics anyway. Once metrics are collected, it should also be easy to get aggregated values with or without service names using any reasonably modern metrics backend, like Prometheus.

But if you do need this internally, this might be a good indication that other people might benefit from it.

I do have one question, though:

> batch sequence number to detect possible dropped packets (especially with UDP transports)

What do you have in mind for syncing this across all collectors in a cluster?

@yurishkuro
Member Author

yurishkuro commented Jan 7, 2020

> I wouldn't expect that extracting metrics from clients is difficult

We indeed found this quite challenging, given how heterogeneous Uber's ecosystem is. Some apps, for example, are still using statsd-style metrics that don't even support tags, and many more apps simply do not configure Jaeger clients with a metrics client because they are not interested in those metrics. So we want a more guaranteed way of collecting critical metrics like data loss (I don't want to replicate all 20 client metrics here).

> What do you have in mind for syncing this across all collectors in a cluster?

I guess this approach will only work with the agents, where there is no need for syncing: each agent remembers the max batch seqNo it received from each unique client and can use it to emit a metric of the total number of batches the client thinks were sent. If batches arrive out of order (i.e. with a smaller seqNo than what the agent remembers), it is safe to ignore them for the purpose of the metrics, because we will "catch up" once a batch with a higher seqNo arrives, since all counters included in the batch are cumulative.

@jpkrohling
Contributor

> apps simply do not configure Jaeger clients with a metrics client, because they are not interested in those metrics

I can see that, but aren't you then forcing the application into doing something it doesn't want, assuming this would be on by default?

> I guess this approach will only work with the agents, where there is no need for syncing: each agent remembers the max batch seqNo it received from each unique client and can use it to emit a metric of the total number of batches the client thinks were sent.

Before merging the PRs for the clients, can we get an experimental version of the agent (draft PR would be fine), with a concrete example?

@yurishkuro
Member Author

I don't plan on merging any PRs until I have an end-to-end solution working.

@yurishkuro
Member Author

@jpkrohling fyi #2010

@jpkrohling
Contributor

I left a review already. Since it's marked as WIP, I'm not sure you want another one yet.

@yurishkuro yurishkuro removed their assignment Nov 30, 2020
@yurishkuro yurishkuro added the help wanted Features that maintainers are willing to accept but do not have cycles to implement label Nov 30, 2020
@yurishkuro
Member Author

Since Jaeger clients are deprecated, this is unlikely to move further, so closing.
