otelhttp grpc ResourceExhausted - resources not being freed #3536

Open
KasonBraley opened this issue Mar 6, 2023 · 10 comments

@KasonBraley
Contributor

KasonBraley commented Mar 6, 2023

When wrapping an HTTP server with the otelhttp handler, these errors occur and do not stop. The exported message size continues to grow and is never freed:

context deadline exceeded: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (5325126 vs. 4194304)
context deadline exceeded: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (7255095 vs. 4194304)
context deadline exceeded: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (9129195 vs. 4194304)
context deadline exceeded: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (10831746 vs. 4194304)
context deadline exceeded: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (12704602 vs. 4194304)
...
context deadline exceeded: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (33605778 vs. 4194304)

Full stacktrace (seems to be related to metrics):

{
    "level": "error",
    "ts": 1678141390.0813837,
    "caller": "reporting/otel.go:37",
    "msg": "context deadline exceeded: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (33605778 vs. 4194304)",
    "stacktrace": "main.initOTEL.func1\n\t/go/src/app/services/reporting/otel.go:37\ngo.opentelemetry.io/otel.ErrorHandlerFunc.Handle\n\t/go/pkg/mod/go.opentelemetry.io/otel@v1.14.0/error_handler.go:37\ngo.opentelemetry.io/otel.(*delegator).Handle\n\t/go/pkg/mod/go.opentelemetry.io/otel@v1.14.0/handler.go:42\ngo.opentelemetry.io/otel.Handle\n\t/go/pkg/mod/go.opentelemetry.io/otel@v1.14.0/handler.go:95\ngo.opentelemetry.io/otel/sdk/metric.(*periodicReader).run\n\t/go/pkg/mod/go.opentelemetry.io/otel/sdk/metric@v0.37.0/periodic_reader.go:169\ngo.opentelemetry.io/otel/sdk/metric.NewPeriodicReader.func1\n\t/go/pkg/mod/go.opentelemetry.io/otel/sdk/metric@v0.37.0/periodic_reader.go:128"
}

How the otelhttp handler is being set up:

func (s *Server) routes() {
	mux := http.NewServeMux()

	mux.Handle("/route1", otelhttp.WithRouteTag("/route1", http.HandlerFunc(s.handleRoute1)))
	mux.Handle("/route2", otelhttp.WithRouteTag("/route2", http.HandlerFunc(s.handleRoute2)))
	mux.Handle("/route3", otelhttp.WithRouteTag("/route3", http.HandlerFunc(s.handleRoute3)))
	mux.Handle("/route4", otelhttp.WithRouteTag("/route4", http.HandlerFunc(s.handleRoute4)))

	s.Router = mux
}

func NewServer() (*Server, error) {
	srv := &Server{}

	srv.routes()
	return srv, nil
}

func main() {
	srv, err := NewServer()
	if err != nil {
		logger.Fatal(err)
	}

	s := &http.Server{
		Addr:         fmt.Sprintf(":%s", cfg.Port),
		Handler:      otelhttp.NewHandler(srv.Router, "server"),
		ReadTimeout:  30 * time.Second,
		WriteTimeout: 10 * time.Second,
		ErrorLog:     serverLogger,
	}

	logger.Fatal(s.ListenAndServe())
}

Metric setup:

func initMetrics(logger *zap.SugaredLogger, res *resource.Resource) (func(context.Context) error, error) {
	endpoint := os.Getenv("OTEL_METRICS_ENDPOINT")

	// If endpoint is not set, assume metrics should be disabled
	if endpoint == "" {
		global.SetMeterProvider(metric.NewNoopMeterProvider())
		return nil, nil
	}

	ctx, cancel := context.WithTimeout(context.Background(), time.Second*10)
	defer cancel()

	exporter, err := otlpmetricgrpc.New(
		ctx,
		otlpmetricgrpc.WithInsecure(),
		otlpmetricgrpc.WithEndpoint(endpoint),
		otlpmetricgrpc.WithDialOption(grpc.WithBlock()),
	)
	if err != nil {
		return nil, fmt.Errorf("failed to connect to OTEL collector at %q. Error: %w", endpoint, err)
	}

	meterProvider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exporter)),
		sdkmetric.WithResource(res),
	)

	global.SetMeterProvider(meterProvider)

	// enable runtime metrics
	if err := runtime.Start(); err != nil {
		return nil, err
	}

	logger.Info("metrics registered")
	return meterProvider.Shutdown, nil
}

Is there something I am doing wrong in the setup with otelhttp?

@alatgithub

Looks like the current message size limit is 4194304 bytes (4 MiB). Is this a property you can set larger, say to 10 MiB (10485760)?

@Aneurysm9
Member

IIRC that limit is on the receiving side. For the OTel collector that would be max_recv_msg_size_mib. See https://github.com/open-telemetry/opentelemetry-collector/tree/main/config/configgrpc#server-configuration for further details.

@MrAlias added the bug and instrumentation: otelhttp labels Mar 8, 2023
@KasonBraley
Contributor Author

IIRC that limit is on the receiving side. For the OTel collector that would be max_recv_msg_size_mib. See https://github.com/open-telemetry/opentelemetry-collector/tree/main/config/configgrpc#server-configuration for further details.

I don't think upping that limit would solve the root issue of the message continuing to grow on error and not releasing its resources.

@dmathieu
Member

dmathieu commented Mar 9, 2023

The message keeps growing because of the failures. It can't release resources, as that data failed to be sent.

@KasonBraley
Contributor Author

KasonBraley commented Mar 9, 2023

The message keeps growing because of the failures. It can't release resources, as that data failed to be sent.

@dmathieu so this is expected behavior? If so, I would have expected it to have a limit on the number of retries it attempts so it doesn't just keep growing indefinitely.

Do you have any recommendations on next steps? I would like to prevent this situation from occurring, as we currently cannot use otelhttp in production because of this bug. Is there any way I can find out what is so large and why it's failing?

@Aneurysm9
Member

I'm not sure that the message size is growing because of export failures. Each export should be independent. Errors will be reported and then the next time the reader collects and exports metrics it starts collecting into a new metricdata.ResourceMetrics. open-telemetry/opentelemetry-go#3760 adds a pool for those to allow re-use, but they should still be re-populated based on the aggregation of all metrics to be exported at that time. (it was also just merged, so it's not the cause of this issue)

Are you able to increase the limit on the collector side such that the metrics can be received correctly there and inspect what is being sent, or use the STDOUT exporter? This sounds like there may be a cardinality issue with some of the metrics collected.
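
For anyone who wants to go the STDOUT route to see what is actually in the payload, here is a minimal sketch of swapping the OTLP/gRPC exporter for the stdout exporter in an initMetrics-style setup, assuming go.opentelemetry.io/otel/exporters/stdout/stdoutmetric at a version matching sdk/metric v0.37.x:

// Sketch only: replace the OTLP/gRPC exporter with the stdout exporter so the
// exported metrics (and their attribute sets) can be inspected locally.
package main

import (
	"context"
	"encoding/json"
	"log"
	"os"

	"go.opentelemetry.io/otel/exporters/stdout/stdoutmetric"
	"go.opentelemetry.io/otel/metric/global"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	// Pretty-printed JSON makes it easier to spot which instrument accumulates attribute sets.
	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", "  ")

	exp, err := stdoutmetric.New(stdoutmetric.WithEncoder(enc))
	if err != nil {
		log.Fatal(err)
	}

	// Same periodic reader wiring as the OTLP path, so collection behavior is unchanged.
	mp := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exp)),
	)
	global.SetMeterProvider(mp)
	defer func() { _ = mp.Shutdown(context.Background()) }()

	// ... run the instrumented application here ...
}

A few collection cycles of this output should make it obvious which instrument is contributing the most attribute sets.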

@KasonBraley
Contributor Author

Are you able to increase the limit on the collector side such that the metrics can be received correctly there and inspect what is being sent, or use the STDOUT exporter? This sounds like there may be a cardinality issue with some of the metrics collected.

Sorry, but probably not anytime soon. Since we've only been able to reproduce this in production, we've switched back to the Prometheus client for metric collection and turned off the otlpmetricgrpc exporter, so that we can still keep otelhttp enabled for traces only.

I can give some additional context in the meantime if it helps, though. We had otelhttp enabled with metrics and tracing being sent to an OTel collector for quite a while, along with various other OTel metrics (runtime, pgx, sql, redis, etc.), without issue. We then updated to the newer metrics API about a month ago, and that's when we started seeing this ResourceExhausted issue. Unfortunately, we haven't spent any time trying to pinpoint the exact version that caused it.

But these are the versions we were on when it worked:

	go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.35.0
	go.opentelemetry.io/contrib/instrumentation/runtime v0.36.0
	go.opentelemetry.io/contrib/propagators/aws v1.10.0
	go.opentelemetry.io/otel v1.10.0
	go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc v0.31.0
	go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.10.0
	go.opentelemetry.io/otel/metric v0.32.0
	go.opentelemetry.io/otel/sdk v1.10.0
	go.opentelemetry.io/otel/sdk/metric v0.31.0
	go.opentelemetry.io/otel/trace v1.10.0
	google.golang.org/grpc v1.49.0

The issue started occurring after one of the upgrades below; as mentioned before, we haven't spent the time to dissect which version caused it.

Upgraded to:

	go.opentelemetry.io/contrib/instrumentation/github.com/gocql/gocql/otelgocql v0.38.0
	go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.38.0
	go.opentelemetry.io/contrib/instrumentation/runtime v0.38.0
	go.opentelemetry.io/contrib/propagators/aws v1.13.0
	go.opentelemetry.io/otel v1.12.0
	go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc v0.35.0
	go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.12.0
	go.opentelemetry.io/otel/metric v0.35.0
	go.opentelemetry.io/otel/sdk v1.12.0
	go.opentelemetry.io/otel/sdk/metric v0.35.0
	go.opentelemetry.io/otel/trace v1.12.0
	google.golang.org/grpc v1.52.3

And then about a week or so later:

	go.opentelemetry.io/otel/sdk v1.13.0
	go.opentelemetry.io/otel/sdk/metric v0.36.0

@chchaffin

I have run into the exact same issue as @KasonBraley. Previous versions of OTEL worked but v0.34.0 does not.

@chchaffin

I did some digging into this and found that the growth of the rpc.server.duration metric seems to be the root cause; see commit 04c5dcb.

I created a custom exporter, logged the metrics in our production environment, and noticed that a new attribute value was created for net.peer.port on each RPC. A small sample of the output log:

{"Key":"net.peer.port","Value":{"Type":"STRING","Value":"59570"}},{"Key":"rpc.grpc.status_code","Value":{"Type":"INT64","Value":0}},{"Key":"rpc.method","Value":{"Type":"STRING","Value":"Check"}},{"Key":"rpc.service","Value":{"Type":"STRING","Value":"grpc.health.v1.Health"}},{"Key":"rpc.system","Value":{"Type":"STRING","Value":"grpc"}}],"StartTime":"2023-04-11T19:29:41.712276063Z","Time":"2023-04-11T19:31:56.712209998Z","Count":1,"Bounds":[0,5,10,25,50,75,100,250,500,750,1000,2500,5000,7500,10000],"BucketCounts":[1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"Min":0,"Max":0,"Sum":0},{"Attributes":[{"Key":"net.peer.ip","Value":{"Type":"STRING","Value":"*****"}},{"Key":"net.peer.port","Value":{"Type":"STRING","Value":"55753"}},{"Key":"rpc.grpc.status_code","Value":{"Type":"INT64","Value":0}},{"Key":"rpc.method","Value":{"Type":"STRING","Value":"Directions"}},{"Key":"rpc.service","Value":{"Type":"STRING","Value":"*****"}},{"Key":"rpc.system","Value":{"Type":"STRING","Value":"grpc"}}],"StartTime":"2023-04-11T19:29:41.712276063Z","Time":"2023-04-11T19:31:56.712209998Z","Count":10,"Bounds":[0,5,10,25,50,75,100,250,500,750,1000,2500,5000,7500,10000],"BucketCounts":[0,0,0,0,4,5,1,0,0,0,0,0,0,0,0,0],"Min":31,"Max":85,"Sum":518},{"Attributes":[{"Key":"net.peer.ip","Value":{"Type":"STRING","Value":"127.0.0.6"}},{"Key":"net.peer.port","Value":{"Type":"STRING","Value":"60669"}},{"Key":"rpc.grpc.status_code","Value":{"Type":"INT64","Value":5}},{"Key":"rpc.method","Value":{"Type":"STRING","Value":"Place"}},{"Key":"rpc.service","Value":{"Type":"STRING","Value":"*****"}},{"Key":"rpc.system","Value":{"Type":"STRING","Value":"grpc"}}],"StartTime":"2023-04-11T19:29:41.712276063Z","Time":"2023-04-11T19:31:56.712209998Z","Count":4,"Bounds":[0,5,10,25,50,75,100,250,500,750,1000,2500,5000,7500,10000],"BucketCounts":[0,0,0,0,2,0,1,1,0,0,0,0,0,0,0,0],"Min":32,"Max":114,"Sum":288},{"Attributes":[{"Key":"net.peer.ip","Value":{"Type":"STRING","Value":"127.0.0.1"}},{"Key":"net.peer.port","Value":{"Type":"STRING","Value":"55386"}},

I put ***** around things that are specific to my company. Maybe this is an issue specific to AWS/cloud environments since the port seems to be different on each call?
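
If the per-connection port is indeed the problem, one SDK-side mitigation (without changing otelgrpc itself) would be a view that keeps rpc.server.duration but filters out the net.peer.port attribute. A minimal sketch, assuming sdk/metric v0.37.x, where Stream exposes an AttributeFilter; the instrument and attribute names are taken from the log sample above:

// Sketch only: strip the high-cardinality net.peer.port attribute from
// rpc.server.duration with an SDK view; all other attributes are kept.
package main

import (
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric/global"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	keepAllButPeerPort := sdkmetric.NewView(
		sdkmetric.Instrument{Name: "rpc.server.duration"},
		sdkmetric.Stream{
			// Return true to keep an attribute, false to drop it.
			AttributeFilter: func(kv attribute.KeyValue) bool {
				return kv.Key != "net.peer.port"
			},
		},
	)

	// Reader and resource options would be added exactly as in the issue's initMetrics.
	mp := sdkmetric.NewMeterProvider(sdkmetric.WithView(keepAllButPeerPort))
	global.SetMeterProvider(mp)
}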

I'm willing to make the change necessary if we can come to some agreement on how to proceed. Can we simply make the metric customizable? Some sort of optional With* function approach?

@Aneurysm9 please advise. Thanks in advance!

@chchaffin

I made a custom exporter that drops the rpc.server.duration metric, and I haven't seen the ResourceExhausted issue crop up since. It's been a little over two days; normally I'd see the issue come back within 24 hours of a redeploy or restart. I'm fairly certain the issue is with Docker containers in a cloud environment: the ports aren't static, which leads to high cardinality. I'll work on a change this weekend.
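
For reference, the same effect as that custom exporter can likely be achieved with an SDK view that gives the instrument a drop aggregation, so it is never collected or exported in the first place. A minimal sketch, again assuming sdk/metric v0.37.x and its aggregation package:

// Sketch only: drop rpc.server.duration entirely via a view, as an alternative
// to filtering it in a custom exporter; every other instrument is unaffected.
package main

import (
	"go.opentelemetry.io/otel/metric/global"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/metric/aggregation"
)

func main() {
	dropRPCDuration := sdkmetric.NewView(
		sdkmetric.Instrument{Name: "rpc.server.duration"},
		sdkmetric.Stream{Aggregation: aggregation.Drop{}},
	)

	// Reader and resource options as in the original initMetrics.
	mp := sdkmetric.NewMeterProvider(sdkmetric.WithView(dropRPCDuration))
	global.SetMeterProvider(mp)
}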
