otelhttp grpc ResourceExhausted - resources not being freed #3536

Open
KasonBraley opened this issue Mar 6, 2023 · 10 comments

@KasonBraley
Contributor

KasonBraley commented Mar 6, 2023

When wrapping an HTTP server with the otelhttp handler, these errors occur and do not stop. The exported message size continues to grow and is never freed:

context deadline exceeded: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (5325126 vs. 4194304)
context deadline exceeded: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (7255095 vs. 4194304)
context deadline exceeded: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (9129195 vs. 4194304)
context deadline exceeded: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (10831746 vs. 4194304)
context deadline exceeded: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (12704602 vs. 4194304)
...
context deadline exceeded: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (33605778 vs. 4194304)

Full stacktrace (seems to be related to metrics):

{
    "level": "error",
    "ts": 1678141390.0813837,
    "caller": "reporting/otel.go:37",
    "msg": "context deadline exceeded: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (33605778 vs. 4194304)",
    "stacktrace": "main.initOTEL.func1\n\t/go/src/app/services/reporting/otel.go:37\ngo.opentelemetry.io/otel.ErrorHandlerFunc.Handle\n\t/go/pkg/mod/go.opentelemetry.io/otel@v1.14.0/error_handler.go:37\ngo.opentelemetry.io/otel.(*delegator).Handle\n\t/go/pkg/mod/go.opentelemetry.io/otel@v1.14.0/handler.go:42\ngo.opentelemetry.io/otel.Handle\n\t/go/pkg/mod/go.opentelemetry.io/otel@v1.14.0/handler.go:95\ngo.opentelemetry.io/otel/sdk/metric.(*periodicReader).run\n\t/go/pkg/mod/go.opentelemetry.io/otel/sdk/metric@v0.37.0/periodic_reader.go:169\ngo.opentelemetry.io/otel/sdk/metric.NewPeriodicReader.func1\n\t/go/pkg/mod/go.opentelemetry.io/otel/sdk/metric@v0.37.0/periodic_reader.go:128"
}

How the otelhttp handler is being set up:

func (s *Server) routes() {
	mux := http.NewServeMux()

	mux.Handle("/route1", otelhttp.WithRouteTag("/route1", http.HandlerFunc(s.handleRoute1)))
	mux.Handle("/route2", otelhttp.WithRouteTag("/route2", http.HandlerFunc(s.handleRoute2)))
	mux.Handle("/route3", otelhttp.WithRouteTag("/route3", http.HandlerFunc(s.handleRoute3)))
	mux.Handle("/route4", otelhttp.WithRouteTag("/route4", http.HandlerFunc(s.handleRoute4)))

	s.Router = mux
}

func NewServer() (*Server, error) {
	srv := &Server{}

	srv.routes()
	return srv, nil
}

func main() {
	srv, err := NewServer()
	if err != nil {
		logger.Fatal(err)
	}

	s := &http.Server{
		Addr:         fmt.Sprintf(":%s", cfg.Port),
		Handler:      otelhttp.NewHandler(srv.Router, "server"),
		ReadTimeout:  30 * time.Second,
		WriteTimeout: 10 * time.Second,
		ErrorLog:     serverLogger,
	}

	logger.Fatal(s.ListenAndServe())
}

Metric setup:

func initMetrics(logger *zap.SugaredLogger, res *resource.Resource) (func(context.Context) error, error) {
	endpoint := os.Getenv("OTEL_METRICS_ENDPOINT")

	// If endpoint is not set, assume metrics should be disabled
	if endpoint == "" {
		global.SetMeterProvider(metric.NewNoopMeterProvider())
		return nil, nil
	}

	ctx, cancel := context.WithTimeout(context.Background(), time.Second*10)
	defer cancel()

	exporter, err := otlpmetricgrpc.New(
		ctx,
		otlpmetricgrpc.WithInsecure(),
		otlpmetricgrpc.WithEndpoint(endpoint),
		otlpmetricgrpc.WithDialOption(grpc.WithBlock()),
	)
	if err != nil {
		return nil, fmt.Errorf("failed to connect to OTEL collector at %q. Error: %w", endpoint, err)
	}

	meterProvider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exporter)),
		sdkmetric.WithResource(res),
	)

	global.SetMeterProvider(meterProvider)

	// enable runtime metrics
	if err := runtime.Start(); err != nil {
		return nil, err
	}

	logger.Info("metrics registered")
	return meterProvider.Shutdown, nil
}

Is there something I am doing wrong in the setup with otelhttp?

@alatgithub

Looks like the current message size limit is 4194304 bytes (4 MiB). Is this a property you can set larger, say to 10 MiB (10485760)?

@Aneurysm9
Member

IIRC that limit is on the receiving side. For the OTel collector that would be max_recv_msg_size_mib. See https://github.com/open-telemetry/opentelemetry-collector/tree/main/config/configgrpc#server-configuration for further details.

@MrAlias added the bug and instrumentation: otelhttp labels Mar 8, 2023
@KasonBraley
Contributor Author

IIRC that limit is on the receiving side. For the OTel collector that would be max_recv_msg_size_mib. See https://github.com/open-telemetry/opentelemetry-collector/tree/main/config/configgrpc#server-configuration for further details.

I don't think upping that limit would solve the root issue of the message continuing to grow on error and not releasing its resources.

@dmathieu
Member

dmathieu commented Mar 9, 2023

The message keeps growing because of the failures. It can't release resources, as that data failed to be sent.

@KasonBraley
Contributor Author

KasonBraley commented Mar 9, 2023

The message keeps growing because of the failures. It can't release resources, as that data failed to be sent.

@dmathieu so this is expected behavior? If so, I would have expected it to have a limit on the number of retries it attempts so it doesn't just keep growing indefinitely.

Do you have any recommendations on next steps? I would like to prevent this situation from occurring, as we currently cannot use otelhttp in production because of this bug. Is there any way I can find out what is so large and why it's failing?

@Aneurysm9
Member

I'm not sure that the message size is growing because of export failures. Each export should be independent. Errors will be reported and then the next time the reader collects and exports metrics it starts collecting into a new metricdata.ResourceMetrics. open-telemetry/opentelemetry-go#3760 adds a pool for those to allow re-use, but they should still be re-populated based on the aggregation of all metrics to be exported at that time. (it was also just merged, so it's not the cause of this issue)

Are you able to increase the limit on the collector side such that the metrics can be received correctly there and inspect what is being sent, or use the STDOUT exporter? This sounds like there may be a cardinality issue with some of the metrics collected.
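
For anyone who wants to go the STDOUT route to see what is actually in the payload, here is a minimal sketch of swapping the OTLP/gRPC exporter for the stdout exporter in an initMetrics-style setup, assuming go.opentelemetry.io/otel/exporters/stdout/stdoutmetric at a version matching sdk/metric v0.37.x:

// Sketch only: replace the OTLP/gRPC exporter with the stdout exporter so the
// exported metrics (and their attribute sets) can be inspected locally.
package main

import (
	"context"
	"encoding/json"
	"log"
	"os"

	"go.opentelemetry.io/otel/exporters/stdout/stdoutmetric"
	"go.opentelemetry.io/otel/metric/global"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	// Pretty-printed JSON makes it easier to spot which instrument accumulates attribute sets.
	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", "  ")

	exp, err := stdoutmetric.New(stdoutmetric.WithEncoder(enc))
	if err != nil {
		log.Fatal(err)
	}

	// Same periodic reader wiring as the OTLP path, so collection behavior is unchanged.
	mp := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exp)),
	)
	global.SetMeterProvider(mp)
	defer func() { _ = mp.Shutdown(context.Background()) }()

	// ... run the instrumented application here ...
}

A few collection cycles of this output should make it obvious which instrument is contributing the most attribute sets.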

@KasonBraley
Contributor Author

Are you able to increase the limit on the collector side such that the metrics can be received correctly there and inspect what is being sent, or use the STDOUT exporter? This sounds like there may be a cardinality issue with some of the metrics collected.

Sorry, but probably not anytime soon. Since we've only been able to reproduce this in production, we've switched back to the Prometheus client for metric collection and turned off the otlpmetricgrpc exporter, so that we can still keep otelhttp enabled for traces only.

I can give some additional context in the meantime if it helps, though. We had otelhttp enabled with metrics and tracing being sent to an OTel collector for quite a while, along with various other OTel metrics (runtime, pgx, sql, redis, etc.), without issue. We then updated to the newer metrics API about a month ago, and that's when we started seeing this ResourceExhausted issue. Unfortunately, we haven't spent any time trying to pinpoint the exact version that caused it.

But these are the versions we were on when it worked:

	go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.35.0
	go.opentelemetry.io/contrib/instrumentation/runtime v0.36.0
	go.opentelemetry.io/contrib/propagators/aws v1.10.0
	go.opentelemetry.io/otel v1.10.0
	go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc v0.31.0
	go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.10.0
	go.opentelemetry.io/otel/metric v0.32.0
	go.opentelemetry.io/otel/sdk v1.10.0
	go.opentelemetry.io/otel/sdk/metric v0.31.0
	go.opentelemetry.io/otel/trace v1.10.0
	google.golang.org/grpc v1.49.0

The issue started occurring after one of the upgrades below; as mentioned before, we haven't spent the time to dissect which version caused it.

Upgraded to:

	go.opentelemetry.io/contrib/instrumentation/github.com/gocql/gocql/otelgocql v0.38.0
	go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.38.0
	go.opentelemetry.io/contrib/instrumentation/runtime v0.38.0
	go.opentelemetry.io/contrib/propagators/aws v1.13.0
	go.opentelemetry.io/otel v1.12.0
	go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc v0.35.0
	go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.12.0
	go.opentelemetry.io/otel/metric v0.35.0
	go.opentelemetry.io/otel/sdk v1.12.0
	go.opentelemetry.io/otel/sdk/metric v0.35.0
	go.opentelemetry.io/otel/trace v1.12.0
	google.golang.org/grpc v1.52.3

And then about a week or so later:

	go.opentelemetry.io/otel/sdk v1.13.0
	go.opentelemetry.io/otel/sdk/metric v0.36.0

@chchaffin

I have run into the exact same issue as @KasonBraley. Previous versions of OTEL worked but v0.34.0 does not.

@chchaffin

I did some digging into this and found that the growth of the rpc.server.duration metric seems to be the root cause; see commit 04c5dcb.

I created a custom exporter, logged the metrics in our production environment, and noticed that a new attribute value was created for net.peer.port on each RPC. A small sample of the output log:

{"Key":"net.peer.port","Value":{"Type":"STRING","Value":"59570"}},{"Key":"rpc.grpc.status_code","Value":{"Type":"INT64","Value":0}},{"Key":"rpc.method","Value":{"Type":"STRING","Value":"Check"}},{"Key":"rpc.service","Value":{"Type":"STRING","Value":"grpc.health.v1.Health"}},{"Key":"rpc.system","Value":{"Type":"STRING","Value":"grpc"}}],"StartTime":"2023-04-11T19:29:41.712276063Z","Time":"2023-04-11T19:31:56.712209998Z","Count":1,"Bounds":[0,5,10,25,50,75,100,250,500,750,1000,2500,5000,7500,10000],"BucketCounts":[1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"Min":0,"Max":0,"Sum":0},{"Attributes":[{"Key":"net.peer.ip","Value":{"Type":"STRING","Value":"*****"}},{"Key":"net.peer.port","Value":{"Type":"STRING","Value":"55753"}},{"Key":"rpc.grpc.status_code","Value":{"Type":"INT64","Value":0}},{"Key":"rpc.method","Value":{"Type":"STRING","Value":"Directions"}},{"Key":"rpc.service","Value":{"Type":"STRING","Value":"*****"}},{"Key":"rpc.system","Value":{"Type":"STRING","Value":"grpc"}}],"StartTime":"2023-04-11T19:29:41.712276063Z","Time":"2023-04-11T19:31:56.712209998Z","Count":10,"Bounds":[0,5,10,25,50,75,100,250,500,750,1000,2500,5000,7500,10000],"BucketCounts":[0,0,0,0,4,5,1,0,0,0,0,0,0,0,0,0],"Min":31,"Max":85,"Sum":518},{"Attributes":[{"Key":"net.peer.ip","Value":{"Type":"STRING","Value":"127.0.0.6"}},{"Key":"net.peer.port","Value":{"Type":"STRING","Value":"60669"}},{"Key":"rpc.grpc.status_code","Value":{"Type":"INT64","Value":5}},{"Key":"rpc.method","Value":{"Type":"STRING","Value":"Place"}},{"Key":"rpc.service","Value":{"Type":"STRING","Value":"*****"}},{"Key":"rpc.system","Value":{"Type":"STRING","Value":"grpc"}}],"StartTime":"2023-04-11T19:29:41.712276063Z","Time":"2023-04-11T19:31:56.712209998Z","Count":4,"Bounds":[0,5,10,25,50,75,100,250,500,750,1000,2500,5000,7500,10000],"BucketCounts":[0,0,0,0,2,0,1,1,0,0,0,0,0,0,0,0],"Min":32,"Max":114,"Sum":288},{"Attributes":[{"Key":"net.peer.ip","Value":{"Type":"STRING","Value":"127.0.0.1"}},{"Key":"net.peer.port","Value":{"Type":"STRING","Value":"55386"}},

I put ***** around things that are specific to my company. Maybe this is an issue specific to AWS/cloud environments since the port seems to be different on each call?
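
If the per-connection port is indeed the problem, one SDK-side mitigation (without changing otelgrpc itself) would be a view that keeps rpc.server.duration but filters out the net.peer.port attribute. A minimal sketch, assuming sdk/metric v0.37.x, where Stream exposes an AttributeFilter; the instrument and attribute names are taken from the log sample above:

// Sketch only: strip the high-cardinality net.peer.port attribute from
// rpc.server.duration with an SDK view; all other attributes are kept.
package main

import (
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric/global"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	keepAllButPeerPort := sdkmetric.NewView(
		sdkmetric.Instrument{Name: "rpc.server.duration"},
		sdkmetric.Stream{
			// Return true to keep an attribute, false to drop it.
			AttributeFilter: func(kv attribute.KeyValue) bool {
				return kv.Key != "net.peer.port"
			},
		},
	)

	// Reader and resource options would be added exactly as in the issue's initMetrics.
	mp := sdkmetric.NewMeterProvider(sdkmetric.WithView(keepAllButPeerPort))
	global.SetMeterProvider(mp)
}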

I'm willing to make the change necessary if we can come to some agreement on how to proceed. Can we simply make the metric customizable? Some sort of optional With* function approach?

@Aneurysm9 please advise. Thanks in advance!

@chchaffin

I made a custom exporter that drops the rpc.server.duration metric, and I haven't seen the ResourceExhausted issue crop up since. It's been a little over two days; normally I'd see the issue come back within 24 hours of a redeploy or restart. I'm fairly certain the issue is with Docker containers in a cloud environment: the ports aren't static, which leads to high cardinality. I'll work on a change this weekend.
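
For reference, the same effect as that custom exporter can likely be achieved with an SDK view that gives the instrument a drop aggregation, so it is never collected or exported in the first place. A minimal sketch, again assuming sdk/metric v0.37.x and its aggregation package:

// Sketch only: drop rpc.server.duration entirely via a view, as an alternative
// to filtering it in a custom exporter; every other instrument is unaffected.
package main

import (
	"go.opentelemetry.io/otel/metric/global"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/metric/aggregation"
)

func main() {
	dropRPCDuration := sdkmetric.NewView(
		sdkmetric.Instrument{Name: "rpc.server.duration"},
		sdkmetric.Stream{Aggregation: aggregation.Drop{}},
	)

	// Reader and resource options as in the original initMetrics.
	mp := sdkmetric.NewMeterProvider(sdkmetric.WithView(dropRPCDuration))
	global.SetMeterProvider(mp)
}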
