OOM on unrecoverable errors even with memory_limiter configured #1121

Closed
kbrockhoff opened this issue Jun 15, 2020 · 10 comments
Labels: bug (Something isn't working), help wanted (Good issue for contributors to OpenTelemetry Service to pick up), priority:p1 (High)

Comments

@kbrockhoff
Member

Describe the bug
A Kubernetes container running otelcol ends up in CrashLoopBackOff on repeated unrecoverable errors from an exporter. This happens with the memory_limiter in place, which works without error when data is flowing normally.

Steps to reproduce
Push enough data to AWS X-Ray that it starts returning ThrottlingExceptions.

What did you expect to see?
Span data is dropped and the otelcol instance keeps running without issues.

What did you see instead?
Constant restarts of K8s pod.

What version did you use?
Version: Custom build from commit f640b71 of opentelemetry-collector-contrib

What config did you use?
Config:

receivers:
  opencensus:
    endpoint: ":55678"
    max_recv_msg_size_mib: 16
    max_concurrent_streams: 16
  otlp:
    endpoint: ":55680"
    max_recv_msg_size_mib: 16
    max_concurrent_streams: 16
  
processors:
  memory_limiter:
    ballast_size_mib: 192
    check_interval: 5s
    limit_mib: 448
    spike_limit_mib: 64
  queued_retry:
    num_workers: 16
    queue_size: 8192
    retry_on_failure: true
  batch:
    send_batch_size: 48
    timeout: 15s

exporters:
  awsxray:
    num_workers: 8
    endpoint: ""
    request_timeout_seconds: 30
    max_retries: 2
    no_verify_ssl: false
    local_mode: false
    resource_arn: "arn:aws:eks:{{ .Values.Region }}:{{ .Values.AWS_ACCOUNT_ID }}:cluster/{{ .Values.CLUSTER_NAME }}"
    role_arn: ""

extensions:
  health_check:
    port: 13133
  zpages:
    endpoint: ":55679"

service:
  extensions: [zpages, health_check]
  pipelines:
    traces:
      receivers: [opencensus, otlp]
      processors: [memory_limiter, batch, queued_retry]
      exporters: [awsxray]

Environment
OS: Amazon Linux 2
Compiler: Go 1.14.2

Additional context

2020-06-12T18:30:49.872Z	WARN	queuedprocessor/queued_processor.go:175	Unrecoverable bad data error	{"component_kind": "processor", "component_type": "queued_retry", "component_name": "queued_retry", "#spans": 916, "error": "ThrottlingException: Rate exceeded\n\tstatus code: 400, request id: 56eb8255-dc18-4fd9-8548-4ab086f7565c"}
go.opentelemetry.io/collector/processor/queuedprocessor.(*queuedSpanProcessor).processItemFromQueue
	/go/pkg/mod/go.opentelemetry.io/collector@v0.3.1-0.20200522130256-daf2fc71ac65/processor/queuedprocessor/queued_processor.go:175
go.opentelemetry.io/collector/processor/queuedprocessor.(*queuedSpanProcessor).Start.func1
	/go/pkg/mod/go.opentelemetry.io/collector@v0.3.1-0.20200522130256-daf2fc71ac65/processor/queuedprocessor/queued_processor.go:86
github.com/jaegertracing/jaeger/pkg/queue.(*BoundedQueue).StartConsumers.func1
	/go/pkg/mod/github.com/jaegertracing/jaeger@v1.17.0/pkg/queue/bounded_queue.go:77
2020-06-12T18:30:50.073Z	WARN	queuedprocessor/queued_processor.go:175	Unrecoverable bad data error	{"component_kind": "processor", "component_type": "queued_retry", "component_name": "queued_retry", "#spans": 713, "error": "ThrottlingException: Rate exceeded\n\tstatus code: 400, request id: defbd58d-8b87-4f3c-a2d5-f0a1ff3327a4"}
go.opentelemetry.io/collector/processor/queuedprocessor.(*queuedSpanProcessor).processItemFromQueue
	/go/pkg/mod/go.opentelemetry.io/collector@v0.3.1-0.20200522130256-daf2fc71ac65/processor/queuedprocessor/queued_processor.go:175
go.opentelemetry.io/collector/processor/queuedprocessor.(*queuedSpanProcessor).Start.func1
	/go/pkg/mod/go.opentelemetry.io/collector@v0.3.1-0.20200522130256-daf2fc71ac65/processor/queuedprocessor/queued_processor.go:86
github.com/jaegertracing/jaeger/pkg/queue.(*BoundedQueue).StartConsumers.func1
	/go/pkg/mod/github.com/jaegertracing/jaeger@v1.17.0/pkg/queue/bounded_queue.go:77
2020-06-12T18:30:50.075Z	WARN	queuedprocessor/queued_processor.go:175	Unrecoverable bad data error	{"component_kind": "processor", "component_type": "queued_retry", "component_name": "queued_retry", "#spans": 678, "error": "ThrottlingException: Rate exceeded\n\tstatus code: 400, request id: 27700543-f30f-49c4-bc5d-19c27b225309"}
go.opentelemetry.io/collector/processor/queuedprocessor.(*queuedSpanProcessor).processItemFromQueue
	/go/pkg/mod/go.opentelemetry.io/collector@v0.3.1-0.20200522130256-daf2fc71ac65/processor/queuedprocessor/queued_processor.go:175
go.opentelemetry.io/collector/processor/queuedprocessor.(*queuedSpanProcessor).Start.func1
	/go/pkg/mod/go.opentelemetry.io/collector@v0.3.1-0.20200522130256-daf2fc71ac65/processor/queuedprocessor/queued_processor.go:86
github.com/jaegertracing/jaeger/pkg/queue.(*BoundedQueue).StartConsumers.func1
	/go/pkg/mod/github.com/jaegertracing/jaeger@v1.17.0/pkg/queue/bounded_queue.go:77
2020-06-12T18:30:50.272Z	WARN	queuedprocessor/queued_processor.go:175	Unrecoverable bad data error	{"component_kind": "processor", "component_type": "queued_retry", "component_name": "queued_retry", "#spans": 2005, "error": "ThrottlingException: Rate exceeded\n\tstatus code: 400, request id: 03052edb-a4de-4cb4-813b-b8096ce47f42"}
@kbrockhoff kbrockhoff added the bug Something isn't working label Jun 15, 2020
@flands flands added this to the GA 1.0 milestone Jun 18, 2020
@tigrannajaryan tigrannajaryan added the help wanted Good issue for contributors to OpenTelemetry Service to pick up label Sep 2, 2020
@tigrannajaryan tigrannajaryan modified the milestones: GA 1.0, Backlog Oct 19, 2020
@jpkrohling
Member

@kbrockhoff, what do your resource settings for the deployment look like? I'll try to simulate this with another exporter, as I don't have an AWS account to test with. Would you be able to get me a series of pprof snapshots, say one every minute if it takes a few minutes to reach this state, or one every hour otherwise? That way I can see what's getting stuck in memory in your case and compare it with what I'm trying to reproduce locally.
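
For reference, one way to collect such a series of snapshots is to poll the pprof extension's heap endpoint on a timer. The sketch below is a minimal example and assumes the extension is listening on ":1777", as in the configs later in this thread; the saved files can then be inspected and compared with `go tool pprof`.

```go
// Minimal heap-snapshot collector: fetches one heap profile per minute from
// the collector's pprof extension (assumed to listen on localhost:1777).
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	for i := 0; ; i++ {
		resp, err := http.Get("http://localhost:1777/debug/pprof/heap")
		if err != nil {
			fmt.Fprintln(os.Stderr, "fetching heap profile failed:", err)
		} else {
			out, err := os.Create(fmt.Sprintf("heap-%03d.pprof", i))
			if err == nil {
				io.Copy(out, resp.Body) // save the snapshot for later comparison
				out.Close()
			}
			resp.Body.Close()
		}
		time.Sleep(time.Minute)
	}
}
```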

@jpkrohling
Member

jpkrohling commented Nov 9, 2020

I need the information requested in the previous comment to confirm, but I think this can be reproduced with a setup like this:

  1. podman run -p 14250:14250 -p 16686:16686 jaegertracing/all-in-one --memory.max-traces=1000
  2. go run ./cmd/otelcol --config local/config.memlimit.yaml --mem-ballast-size-mib 192 (on github.com/open-telemetry/opentelemetry-collector)
  3. tracegen -duration 5m -otlp-insecure (tracegen from contrib)

And with this config file:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: ":55680"
        max_recv_msg_size_mib: 16
        max_concurrent_streams: 16

processors:
  memory_limiter:
    ballast_size_mib: 192
    check_interval: 5s
    limit_mib: 448
    spike_limit_mib: 64
  queued_retry:
    num_workers: 16
    queue_size: 8192
    retry_on_failure: true
  batch:
    send_batch_size: 48
    timeout: 15s

exporters:
  jaeger:
    endpoint: localhost:14250
    insecure: true

extensions:
  health_check:
    port: 13133
  zpages:
    endpoint: ":55679"
  pprof:
    endpoint: ":1777"

service:
  extensions: [zpages, health_check, pprof]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, queued_retry]
      exporters: [jaeger]

This runs with a very low memory profile while Jaeger is up, but stopping the Jaeger container causes the queued retry to jump to very high levels of memory consumption, depending on the in-flight data. I've seen numbers between ~400Mi and ~1Gi, which would certainly explain your OOMKilled pods in Kubernetes, depending on your limits configuration.

Note that there are two queued retries: one as a processor (deprecated) and one inside the exporter. Disabling both will cause the memory profile to be at the normal levels even when the exporter is failing:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: ":55680"
        max_recv_msg_size_mib: 16
        max_concurrent_streams: 16

processors:
  memory_limiter:
    ballast_size_mib: 192
    check_interval: 5s
    limit_mib: 448
    spike_limit_mib: 64
  batch:
    send_batch_size: 48
    timeout: 15s

exporters:
  jaeger:
    endpoint: localhost:14250
    insecure: true
    retry_on_failure:
      enabled: false

extensions:
  health_check:
    port: 13133
  zpages:
    endpoint: ":55679"
  pprof:
    endpoint: ":1777"

service:
  extensions: [zpages, health_check, pprof]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [jaeger]

In the example above, note that I disabled the Jaeger exporter's queued retry and removed the queued_retry processor entirely. And here's the catch: the memory limiter only applies to data entering the pipeline; data that is already in the pipeline isn't evicted, even while it is being retried. In the tests above, I confirmed that the memory limiter did stop new data from being added to the pipeline, sending this back to the client:

2020/11/09 17:29:34 rpc error: code = Unknown desc = data dropped due to high memory usage

That all said, I think this is actually working as designed, but we should definitely talk about improving this design, as it's clearly not enough to prevent OOMs.
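
To make the entry-only behaviour concrete, here is a simplified, hypothetical sketch (not the collector's actual code): the gate is consulted only when a batch first arrives, so batches it refuses never enter, while batches admitted earlier and now parked in a retry queue keep holding memory until the queue gives up on them.

```go
// Simplified sketch of an entry-only memory gate; names and sizes are
// illustrative, not taken from the collector.
package main

import (
	"errors"
	"fmt"
	"runtime"
)

var errDataDropped = errors.New("data dropped due to high memory usage")

type entryGate struct {
	limitBytes uint64 // roughly analogous to limit_mib, in bytes
}

// admit is called only for data entering the pipeline; batches already handed
// to a retry queue are never re-checked.
func (g *entryGate) admit() error {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	if ms.Alloc > g.limitBytes {
		return errDataDropped
	}
	return nil
}

func main() {
	gate := &entryGate{limitBytes: 448 * 1024 * 1024}
	var retryQueue [][]byte // stands in for the exporter/processor retry queue

	for i := 0; i < 10; i++ {
		if err := gate.admit(); err != nil {
			fmt.Println("refused new batch:", err)
			continue
		}
		// Admitted batches sit here while the backend is failing; the gate
		// above never evicts them, so memory keeps climbing between checks.
		retryQueue = append(retryQueue, make([]byte, 1<<20))
	}
	fmt.Printf("batches held in retry queue: %d\n", len(retryQueue))
}
```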

@jpkrohling
Member

@kbrockhoff, did you have a chance to get the requested info?

@jpkrohling
Member

jpkrohling commented Nov 13, 2020

@tigrannajaryan, @pjanotti: looking at the original PR that introduced this processor (#498), I see that the case where exporters are unavailable is explicitly listed there as a motivation for using this processor. Given that this is exactly the case here, it appears that either the design of this processor should be changed or the processor should be removed, as it's not very effective. In fact, it causes a false sense of security at the moment.

@tigrannajaryan
Member

In the tests above, I confirmed that the memory limiter did stop new data from being added to the pipeline, sending this back to the client:

@jpkrohling If the memory limiter stops new data from being added, why is it not effective at limiting the data? Is new data being created elsewhere?

things that are already in the pipeline aren't evicted, even if they are being retried.

They should eventually be evicted once the maximum number of retries is reached. It may take a while, but in the meantime no new data should be added, so I don't understand why the limiter is not effective.

That all said, I think this is actually working as designed, but we should definitely talk about improving this design, as it's clearly not enough to prevent OOMs.
In fact, it causes a false sense of security at the moment.

We definitely need to fix it, but for now I don't understand what the design flaw is.

@jpkrohling
Member

why is it not effective at limiting the data?

It does stop, but only after the process's memory usage has exploded way above the limit set in the memory limiter. People using the memory limiter who set the limit to, say, 1 GiB would assume the process won't ever go much over ~1 GiB, but it can actually go over double that very easily, causing their watchdog processes (like Kubernetes requests/limits) to kill the process.

@kbrockhoff
Member Author

I no longer have access to the k8s pod config used at that time. Knowing how I usually set this, I am fairly confident the pod memory limit was set to 512m.

@tigrannajaryan
Member

but it can actually go over double of that very easily

Why does this happen? Limiter kicks in too late?

@jpkrohling
Member

Why does this happen? Limiter kicks in too late?

Yes, I believe so. In the example I was testing, the check happens every 5s, which is already too long for high-traffic scenarios. And I'm not sure we'd want more frequent checks than that, especially for high-throughput collectors. I would love to come up with a proposal, but I can't think of one that would be satisfactory for the case that was tested here.
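
As a rough illustration of why the interval matters, assuming a polling design along the lines described above: the limit is only consulted when a ticker fires, so anything admitted between two ticks is invisible to the check, and at high throughput a 5s interval can overshoot the limit well before the next tick runs.

```go
// Illustrative polling check; not the collector's implementation.
package main

import (
	"runtime"
	"sync/atomic"
	"time"
)

// startChecker flips `refuse` to true once heap usage crosses limitBytes, but
// only when the ticker fires, so the decision can lag by up to one interval.
func startChecker(limitBytes uint64, interval time.Duration, refuse *atomic.Bool) {
	go func() {
		t := time.NewTicker(interval)
		defer t.Stop()
		for range t.C {
			var ms runtime.MemStats
			runtime.ReadMemStats(&ms)
			refuse.Store(ms.Alloc > limitBytes)
		}
	}()
}

func main() {
	var refuse atomic.Bool
	startChecker(448*1024*1024, 5*time.Second, &refuse)
	// Receivers would consult refuse.Load() before admitting each batch; any
	// data accepted between two ticks is never counted against the limit.
	time.Sleep(6 * time.Second)
}
```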

tigrannajaryan added a commit to tigrannajaryan/opentelemetry-collector that referenced this issue Dec 4, 2020
Contributes to open-telemetry#1121

The memory limiter processor previously had only one limit. When that
limit was exceeded it continuously called GC, which caused huge CPU
consumption when the check interval was small and forced the use of large
check intervals. That in turn made the response to growing memory usage
lethargic, so the memory limiter was not very effective when memory usage
grew rapidly (e.g. during a big spike or when the backend was down).

I changed the memory limiter logic to be based on two thresholds: soft and
hard. Below the soft threshold the memory limiter is fully disabled. Between
the soft and hard limits it begins dropping incoming data but does not
perform GC. Only when the hard limit is exceeded do we perform GC. The net
result is that actual memory usage is capped at the level set by the soft
limit and fluctuates between the soft and hard limits as garbage is created
and collected. Correspondingly, GC runs much less frequently, only when the
hard limit is reached, and each such GC immediately collects a significant
amount of garbage (bringing memory usage back close to the soft limit), so
no further GC calls are needed for quite some time.

I did some performance tests with the old and new approaches with a 4000 MiB
limit, 100,000 spans per second, and the exporter completely blocked
(no backend).

With the old approach a 100 ms check interval causes about 450% CPU usage
once the memory limit is hit (below the limit, CPU usage is around 50%).
Here is an extract of performance test output showing the moment when the
limiter is hit:
```
2020/12/03 20:20:47 Agent RAM (RES):3296 MiB, CPU:44.4% | Sent:   7022700 items | Received:         0 items (0/sec)
2020/12/03 20:20:50 Agent RAM (RES):3481 MiB, CPU:43.0% | Sent:   7322500 items | Received:         0 items (0/sec)
2020/12/03 20:20:53 Agent RAM (RES):3681 MiB, CPU:41.6% | Sent:   7614100 items | Received:         0 items (0/sec)
2020/12/03 20:20:56 Agent RAM (RES):3703 MiB, CPU:47.7% | Sent:   7863600 items | Received:         0 items (0/sec)
2020/12/03 20:20:59 Agent RAM (RES):3028 MiB, CPU:47.0% | Sent:   8062700 items | Received:         0 items (0/sec)
2020/12/03 20:21:02 Agent RAM (RES):3644 MiB, CPU:246.9% | Sent:   8331600 items | Received:         0 items (0/sec) <-- likely a regular GC, not at limit yet
2020/12/03 20:21:05 Agent RAM (RES):3555 MiB, CPU:72.8% | Sent:   8620500 items | Received:         0 items (0/sec)
2020/12/03 20:21:08 Agent RAM (RES):3717 MiB, CPU:57.5% | Sent:   8895500 items | Received:         0 items (0/sec)
2020/12/03 20:21:11 Agent RAM (RES):3877 MiB, CPU:126.9% | Sent:   9172900 items | Received:         0 items (0/sec) <-- hit limit
2020/12/03 20:21:14 Agent RAM (RES):3900 MiB, CPU:127.6% | Sent:   9461100 items | Received:         0 items (0/sec)
2020/12/03 20:21:17 Agent RAM (RES):3918 MiB, CPU:201.7% | Sent:   9728900 items | Received:         0 items (0/sec)
2020/12/03 20:21:20 Agent RAM (RES):3938 MiB, CPU:326.0% | Sent:   9994700 items | Received:         0 items (0/sec)
2020/12/03 20:21:23 Agent RAM (RES):3951 MiB, CPU:470.8% | Sent:  10253200 items | Received:         0 items (0/sec)
2020/12/03 20:21:26 Agent RAM (RES):3955 MiB, CPU:440.0% | Sent:  10504400 items | Received:         0 items (0/sec)
2020/12/03 20:21:29 Agent RAM (RES):3961 MiB, CPU:451.0% | Sent:  10766200 items | Received:         0 items (0/sec)
2020/12/03 20:21:32 Agent RAM (RES):3965 MiB, CPU:465.8% | Sent:  11008400 items | Received:         0 items (0/sec)
2020/12/03 20:21:35 Agent RAM (RES):3974 MiB, CPU:423.6% | Sent:  11272700 items | Received:         0 items (0/sec)
```
This made even a 1-second interval unusable with the old approach, and we
had to choose a longer interval to avoid such performance degradation.

With the new approach, under the exact same conditions and a 100 ms check
interval, CPU usage is 50% below the memory limits and rises to 68% when the
hard limit is hit. With a 1-second check interval there is no measurable
increase in CPU usage when the memory limiter is hit (unlike the 9x CPU
increase with the old approach).

Here is an extract of performance test output showing the moment when the limiter
is hit:
```
2020/12/03 20:28:35 Agent RAM (RES):1888 MiB, CPU:48.2% | Sent:   3796400 items | Received:         0 items (0/sec)
2020/12/03 20:28:38 Agent RAM (RES):2029 MiB, CPU:47.1% | Sent:   4088600 items | Received:         0 items (0/sec)
2020/12/03 20:28:41 Agent RAM (RES):2197 MiB, CPU:48.3% | Sent:   4388200 items | Received:         0 items (0/sec)
2020/12/03 20:28:44 Agent RAM (RES):2370 MiB, CPU:45.7% | Sent:   4679900 items | Received:         0 items (0/sec)
2020/12/03 20:28:47 Agent RAM (RES):2558 MiB, CPU:49.0% | Sent:   4972200 items | Received:         0 items (0/sec)
2020/12/03 20:28:50 Agent RAM (RES):2771 MiB, CPU:47.4% | Sent:   5260700 items | Received:         0 items (0/sec)
2020/12/03 20:28:53 Agent RAM (RES):2921 MiB, CPU:133.3% | Sent:   5547500 items | Received:         0 items (0/sec)
2020/12/03 20:28:56 Agent RAM (RES):2922 MiB, CPU:50.1% | Sent:   5846700 items | Received:         0 items (0/sec)
2020/12/03 20:28:59 Agent RAM (RES):2957 MiB, CPU:43.6% | Sent:   6131700 items | Received:         0 items (0/sec)
2020/12/03 20:29:02 Agent RAM (RES):3144 MiB, CPU:50.0% | Sent:   6419400 items | Received:         0 items (0/sec)
2020/12/03 20:29:05 Agent RAM (RES):3328 MiB, CPU:49.0% | Sent:   6719100 items | Received:         0 items (0/sec)
2020/12/03 20:29:08 Agent RAM (RES):3488 MiB, CPU:38.6% | Sent:   7007300 items | Received:         0 items (0/sec)
2020/12/03 20:29:11 Agent RAM (RES):3667 MiB, CPU:42.0% | Sent:   7306700 items | Received:         0 items (0/sec)
2020/12/03 20:29:14 Agent RAM (RES):3813 MiB, CPU:37.4% | Sent:   7577700 items | Received:         0 items (0/sec)
2020/12/03 20:29:17 Agent RAM (RES):3802 MiB, CPU:170.9% | Sent:   7860100 items | Received:         0 items (0/sec) <-- hit hard limit
2020/12/03 20:29:20 Agent RAM (RES):3882 MiB, CPU:68.1% | Sent:   8160000 items | Received:         0 items (0/sec)
2020/12/03 20:29:23 Agent RAM (RES):4007 MiB, CPU:42.3% | Sent:   8447900 items | Received:         0 items (0/sec)
2020/12/03 20:29:26 Agent RAM (RES):4007 MiB, CPU:39.3% | Sent:   8747800 items | Received:         0 items (0/sec)
2020/12/03 20:29:29 Agent RAM (RES):4008 MiB, CPU:34.3% | Sent:   9038400 items | Received:         0 items (0/sec)
2020/12/03 20:29:32 Agent RAM (RES):4009 MiB, CPU:39.9% | Sent:   9317200 items | Received:         0 items (0/sec)
```
This is a dramatically better picture than with the old approach.

With a 1-second interval the memory limiter's impact on CPU is not
measurable with the new approach, whereas the old approach still showed
several times higher CPU usage when the limit was hit.

This makes small check intervals practical and allows incoming surges of
data to be suppressed effectively.
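
A condensed sketch of the two-threshold decision described above (illustrative only, not the actual implementation): do nothing below the soft limit, refuse incoming data between the soft and hard limits without forcing GC, and call runtime.GC() only once the hard limit is crossed.

```go
// Illustrative two-threshold limiter; thresholds are example values.
package main

import (
	"errors"
	"fmt"
	"runtime"
)

var errDataRefused = errors.New("data refused: memory usage above soft limit")

type twoThresholdLimiter struct {
	softBytes, hardBytes uint64
}

func (l *twoThresholdLimiter) check() error {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	switch {
	case ms.Alloc <= l.softBytes:
		return nil // below the soft limit the limiter is fully disabled
	case ms.Alloc <= l.hardBytes:
		return errDataRefused // drop incoming data, but do not force GC
	default:
		runtime.GC() // hard limit reached: one GC reclaims a large batch of garbage
		return errDataRefused
	}
}

func main() {
	l := &twoThresholdLimiter{softBytes: 400 << 20, hardBytes: 448 << 20}
	if err := l.check(); err != nil {
		fmt.Println(err) // a receiver would return this error instead of admitting data
	}
}
```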
tigrannajaryan added a commit to tigrannajaryan/opentelemetry-collector that referenced this issue Dec 4, 2020
tigrannajaryan added a commit to tigrannajaryan/opentelemetry-collector that referenced this issue Dec 9, 2020
tigrannajaryan added a commit to tigrannajaryan/opentelemetry-collector that referenced this issue Dec 9, 2020
tigrannajaryan added a commit to tigrannajaryan/opentelemetry-collector that referenced this issue Dec 15, 2020
tigrannajaryan added a commit to tigrannajaryan/opentelemetry-collector that referenced this issue Dec 15, 2020
tigrannajaryan added a commit that referenced this issue Dec 15, 2020
@tigrannajaryan
Member

Closing for now; this should be fixed, or at least mitigated, by #2250.
We can reopen if more work is needed.

swiatekm pushed a commit to swiatekm/opentelemetry-collector that referenced this issue Oct 9, 2024