
otelgrpc high metric cardinality from rpc.server.duration #3071

Closed
rcrowe opened this issue Dec 8, 2022 · 18 comments · Fixed by #4322

rcrowe commented Dec 8, 2022

Since upgrading to v1.12.0/0.37.0/0.6.0, and specifically #2700, we've seen an explosion in the cardinality of our metrics.

  1. The metric is recorded:

     ```go
     cfg.rpcServerDuration.Record(ctx, int64(elapsedTime), attr...)
     ```

  2. The attributes are taken from spanInfo:

     ```go
     name, attr := spanInfo(info.FullMethod, peerFromCtx(ctx))
     ```

  3. Peer attributes are added:

     ```go
     return []attribute.KeyValue{
     	semconv.NetPeerIPKey.String(host),
     	semconv.NetPeerPortKey.String(port),
     }
     ```

Is the intention that these types of high-cardinality issues are handled in userland, by us using a view (or blocking at the ingestion layer) to strip labels from certain metrics after we find them problematic? On one hand I can understand that being acceptable; on the other, this feels like pushing a problem downstream to consumers who will suddenly get hit by a metric that is most likely going to grow very wide.

CC @fatsheep9146


rcrowe commented Dec 8, 2022

I missed https://github.com/open-telemetry/opentelemetry-specification/blob/56b71774c7ea694e0e492c2ae1120e6df59e5dd6/specification/metrics/semantic_conventions/rpc-metrics.md#attributes, which seems to suggest this is working as expected.

> To avoid high cardinality, implementations should prefer the most stable of net.peer.name or net.sock.peer.addr, depending on expected deployment profile. For many cloud applications, this is likely net.peer.name as names can be recycled even across re-instantiation of a server with a different ip.


MrAlias commented Dec 8, 2022

> I missed https://github.com/open-telemetry/opentelemetry-specification/blob/56b71774c7ea694e0e492c2ae1120e6df59e5dd6/specification/metrics/semantic_conventions/rpc-metrics.md#attributes, which seems to suggest this is working as expected.
>
> > To avoid high cardinality, implementations should prefer the most stable of net.peer.name or net.sock.peer.addr, depending on expected deployment profile. For many cloud applications, this is likely net.peer.name as names can be recycled even across re-instantiation of a server with a different ip.

Correct, this does look to be something required by the specification. It might be worth opening an issue in the specification repository if you think this shouldn't be a required attribute.


lucasoares commented Jan 31, 2023

How do I remove this metric? It is exposing 7000 metrics in the exporter. It should be disabled by default.


regeda commented Feb 3, 2023

After the v0.38.0 release, I see the same issue with the otelhttp package. We are avoiding this upgrade because it drastically increases memory consumption.


MrAlias commented Feb 3, 2023

You can use a View to drop the aggregation entirely or filter the attributes causing your cardinality explosion.
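For the first option, a minimal sketch of dropping the instrument entirely with a View. The instrument name and the SDK surface are assumptions: this uses the current go.opentelemetry.io/otel/sdk/metric API, where the drop aggregation is metric.AggregationDrop{} (older releases exposed it as aggregation.Drop{} in the sdk/metric/aggregation package):

```go
import "go.opentelemetry.io/otel/sdk/metric"

// newMeterProvider drops rpc.server.duration entirely via a View,
// so no aggregation state (and no memory) is kept for it.
func newMeterProvider(reader metric.Reader) *metric.MeterProvider {
	return metric.NewMeterProvider(
		metric.WithReader(reader),
		metric.WithView(metric.NewView(
			metric.Instrument{Name: "rpc.server.duration"},
			metric.Stream{Aggregation: metric.AggregationDrop{}},
		)),
	)
}
```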


regeda commented Feb 3, 2023

> You can use a View to drop the aggregation entirely or filter the attributes causing your cardinality explosion.

Could you share an example of how to drop certain attributes from the aggregation?


MrAlias commented Feb 3, 2023

> > You can use a View to drop the aggregation entirely or filter the attributes causing your cardinality explosion.
>
> Could you share an example of how to drop certain attributes from the aggregation?

The linked documentation has an example that should help as a basis:


You'll need to update the match criteria to specifically match what you want, though.


regeda commented Feb 6, 2023

I got it working with the following example (thanks @MrAlias):

```go
import (
	client_prom "github.com/prometheus/client_golang/prometheus"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/prometheus"
	"go.opentelemetry.io/otel/metric/global"
	"go.opentelemetry.io/otel/sdk/instrumentation"
	"go.opentelemetry.io/otel/sdk/metric"
)

// allowedAttr returns a filter that keeps only the listed attribute keys.
func allowedAttr(v ...string) attribute.Filter {
	m := make(map[string]struct{}, len(v))
	for _, s := range v {
		m[s] = struct{}{}
	}
	return func(kv attribute.KeyValue) bool {
		_, ok := m[string(kv.Key)]
		return ok
	}
}

func setGlobalMeterProvider() error {
	exporter, err := prometheus.New(
		prometheus.WithRegisterer(client_prom.DefaultRegisterer),
	)
	if err != nil {
		return err
	}
	global.SetMeterProvider(metric.NewMeterProvider(
		metric.WithReader(exporter),
		metric.WithView(
			metric.NewView(
				metric.Instrument{Scope: instrumentation.Scope{Name: "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"}},
				metric.Stream{
					AttributeFilter: allowedAttr(
						"http.method",
						"http.status_code",
						"http.target",
					),
				},
			),
		),
	))
	return nil
}
```


regeda commented Feb 7, 2023

Another issue :(

The new otelhttp v0.38.0 uses go.opentelemetry.io/otel/semconv/v1.17.0/httpconv.ServerRequest to generate attributes: https://github.com/open-telemetry/opentelemetry-go-contrib/blob/main/instrumentation/net/http/otelhttp/handler.go#L218.

This function writes the whole request URI (including the query string) into http.target. It affects the cardinality of the metric and the server allocates a lot of memory if the query string is constantly random.


regeda commented Feb 7, 2023

Also, the view's override doesn't help much because the internal aggregation still happens with the following attributes, which impact memory consumption, and there is no option to prevent it:

- peer port
- user agent
- request uri

Contributor

MrAlias commented Feb 7, 2023

> Also, the view's override doesn't help much because the internal aggregation still happens with the following attributes, which impact memory consumption, and there is no option to prevent it:
>
> - peer port
> - user agent
> - request uri

Can you expand on this? I'm not sure how an attribute filter wouldn't help prevent this. The filter wraps the aggregation, so attributes filtered out should never reach the underlying aggregation storage.


regeda commented Feb 8, 2023

> > Also, the view's override doesn't help much because the internal aggregation still happens with the following attributes, which impact memory consumption, and there is no option to prevent it:
> >
> > - peer port
> > - user agent
> > - request uri
>
> Can you expand on this? I'm not sure how an attribute filter wouldn't help prevent this. The filter wraps the aggregation, so attributes filtered out should never reach the underlying aggregation storage.

I think the problem is here: https://github.com/open-telemetry/opentelemetry-go/blob/main/sdk/metric/internal/filter.go#L67

The filter uses the original attribute set as the key in the seen map.

```go
func (f *filter[N]) Aggregate(measurement N, attr attribute.Set) {
	// TODO (#3006): drop stale attributes from seen. // <-- not yet implemented
	f.Lock()
	defer f.Unlock()
	fAttr, ok := f.seen[attr]
	fmt.Println(len(f.seen)) // <-- seen is growing with a random attribute set
	if !ok {
		fAttr, _ = attr.Filter(f.filter)
		f.seen[attr] = fAttr // <-- memory leak
	}
	f.aggregator.Aggregate(measurement, fAttr)
}
```

See open-telemetry/opentelemetry-go#3006

This is my test case:

```go
import (
	// ...
	"math/rand"
	// ...
)

func TestRandomFilterAggregate(t *testing.T) {
	f := NewFilter[int64](&testStableAggregator[int64]{}, testAttributeFilter)
	for i := 0; i < 1000; i++ {
		set := attribute.NewSet(
			attribute.Int("foo", rand.Int()),
		)
		f.Aggregate(1, set)
	}
}
```

Another problem is with seen map[attribute.Set]attribute.Set: because attribute.Set is not a comparable type, the following test will consume 2x the memory even if the attribute set is the same:

```go
func TestRandomFilterAggregate(t *testing.T) {
	f := NewFilter[int64](&testStableAggregator[int64]{}, testAttributeFilter)
	for i := 0; i < 1000; i++ {
		attr := attribute.Int("foo", rand.Int())
		set1 := attribute.NewSet(attr)
		f.Aggregate(1, set1)
		set2 := attribute.NewSet(attr) // same content, separately built Set
		f.Aggregate(1, set2)
	}
}
```

That is my proposal open-telemetry/opentelemetry-go#3695

@lucasoares

I just want to be able to disable 100% of grpc and http metrics from ever being created, to not have any kind of impact on my application.

Any way to do it? I don't want to have to do bench tests because of these third-party metrics...


MrAlias commented Feb 8, 2023

> I just want to be able to disable 100% of grpc and http metrics from ever being created, to not have any kind of impact on my application.
>
> Any way to do it? I don't want to have to do bench tests because of these third-party metrics...

#3071 (comment)


liufuyang commented Feb 20, 2023

@MrAlias I think I've encountered the same memory issue. Has open-telemetry/opentelemetry-go#3695 fixed it, and does the current release include it? Thank you :)

Ah, sorry, I just saw https://github.com/open-telemetry/opentelemetry-go/milestone/40. It looks like we are very close to releasing a fix for this.

Thank you very much, and I will test again when 0.37 is out 👍

@ericwenn

Hey, I'm seeing this issue in services using both otelhttp and otelgrpc.
Filtering attributes with views and using the unreleased fix for memory consumption in this issue alleviates the immediate problems.

However, based on reading the specification, I see two possibly incorrect things.

  1. For HTTP (otelhttp), some attributes are required only on the client while optional on the server. Specifically, net.peer.port and http.client_id are not required for servers, but are collected on servers here.

  2. For gRPC (otelgrpc), the specification makes no distinction between client and server attributes, but using the HTTP metrics as ground truth, the net.peer.port attribute should not be collected for servers in gRPC either.

@MrAlias I can submit an issue in the specification repo for (2), but I think (1) should be fixed by the Go implementation.
Am I missing something?

@jaronoff97

Hello all, we're running a collector right now and we're seeing this issue cause a massive memory leak when converting the metrics to Prometheus. This stems from the new usage of the meter provider here in the collector.

IMO this is a massive issue because it means someone could cause OOMs/dropped data by flooding the collector with bad RPCs. Would it be possible to make the inclusion of net.sock.peer.addr, net.sock.peer.port, etc. optional? These attributes are useful for traces, but cause a massive cardinality explosion in metrics. Because the collector also exports Prometheus histograms, the cardinality we observe is |net.sock.peer.addr| x |net.sock.peer.port| x |buckets|, which results in millions of series.

Users of the collector are unable to implement a custom view as recommended in this thread, because the code is already built with this. Given that other users are experiencing this issue, it felt prudent to upstream this request to the source.


mackjmr commented Sep 12, 2023

I experienced this when running opentelemetry-demo.

Here is a graph showing the memory usage of the collector pod before/after implementing the filter from #3071 (comment) in productcatalogservice (fix pushed at 17h25).

IMO, net.sock.peer.port should be disabled by default, given the cardinality and the significant collector memory usage it leads to when enabled.

cc @pellared as it seems like you are working on a fix to this issue in #3730.
