
PushController.tick keeps the GIL for a time depending on the number of labels and metrics #1142

Closed
haarakdy opened this issue Jun 21, 2022 · 4 comments
Labels: bug

Comments

@haarakdy

I am using PrometheusRemoteWriteMetricsExporter and Logz.io
In my tests OpenTelemetry's tick() holds the GIL for up to 10 ms when the number of metrics is 20 and the average number of labels is 6.
I measure GIL availability using a loop like the one in the class below. The py-spy record ... confirms the result.
My application has a constraint on response latency. The constraint is not very tight: I have a budget of a few hundred milliseconds, and OpenTelemetry's tick is not the only loop in the system. But if I use more than a few metrics I start to see the impact.

from threading import Thread
from time import sleep, time
from typing import List


class Ping:
    def __init__(self, sleep_time) -> None:
        # A sorted list of a few samples of the worst-case
        # thread scheduler latency
        self.max_latencies: List[float] = 10 * [0.0]
        # A sum of all latencies for mean calculations
        self.latency_accumulator = 0.0
        # Number of tests
        self.count = 0
        self.completed = False
        self.sleep_time = sleep_time

        # Start test
        self.task_ping = Thread(target=self._ping)
        self.task_ping.daemon = True
        self.task_ping.start()

    def complete(self):
        self.completed = True
        self.task_ping.join()

    @property
    def max_latency(self) -> float:
        return self.max_latencies[-1]

    @property
    def mean_latency(self) -> float:
        return self.latency_accumulator / self.count

    def _ping(self) -> None:
        """
        Repeatedly call sleep(), measure the time the thread actually
        spends sleeping, and collect the latency (the time slippage).
        """
        before = time()
        while not self.completed:
            sleep(self.sleep_time)

            # Under ideal preemptive scheduling the time spent sleeping
            # equals the argument. In a real system the thread scheduler
            # adds latency, and in Python the latency grows further when
            # other threads hold the GIL.
            after = time()
            latency = after - before - self.sleep_time
            # Keep a few worst case latencies
            if latency > self.max_latencies[0]:
                # replace the lowest value
                self.max_latencies[0] = latency
                self.max_latencies.sort()
            self.latency_accumulator += latency
            self.count += 1
            before = after
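
For reference, a minimal usage sketch of the class above (the 1 ms interval and the 30 s run time are arbitrary examples, not values from my tests):

# Hypothetical usage: watch scheduler latency while the application
# and the exporter are running.
ping = Ping(sleep_time=0.001)
sleep(30)  # let the workload and the exporter's tick() run for a while
ping.complete()
print(f"max latency:  {ping.max_latency * 1000:.1f} ms")
print(f"mean latency: {ping.mean_latency * 1000:.3f} ms")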

This patch works around the problem for me by capping the number of export records converted per tick (records beyond the first 10 are dropped):

diff --git a/exporter/prometheus_remote_write/__init__.py b/exporter/prometheus_remote_write/__init__.py
index 1a75ea8..7bc47cd 100644
--- a/exporter/prometheus_remote_write/__init__.py
+++ b/exporter/prometheus_remote_write/__init__.py
@@ -14,6 +14,7 @@
 
 import logging
 import re
+from itertools import islice
 from typing import Dict, Sequence
 
 import requests
@@ -181,7 +182,7 @@ class PrometheusRemoteWriteMetricsExporter(MetricsExporter):
         self, export_records: Sequence[ExportRecord]
     ) -> Sequence[TimeSeries]:
         timeseries = []
-        for export_record in export_records:
+        for export_record in islice(export_records, 10):
             aggregator_type = type(export_record.aggregator)
             converter = self.converter_map.get(aggregator_type)
             if converter:

haarakdy commented Jun 21, 2022

I can quickly reproduce the problem by generating random label values in the metric. Why does the exporter's performance depend on the number of distinct values of a label?
The memory in use also grows as the set of possible label values expands.
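
A minimal sketch of how I generate the random label values (the counter instrument itself is assumed to be created with the old metrics API that this exporter uses; only the add(value, labels) call matters here, and generate_load is a made-up helper name):

import random
from time import sleep

# Pool of possible label values; the larger the pool, the more distinct
# label sets (and aggregators) the SDK ends up tracking.
LABEL_VALUES = [f"value-{i}" for i in range(1000)]

def generate_load(counter, iterations: int = 10_000) -> None:
    # Each add() uses a randomly chosen label value, so the export
    # batch keeps growing with the number of distinct label sets seen.
    for _ in range(iterations):
        labels = {"host": random.choice(LABEL_VALUES)}
        counter.add(1, labels)
        sleep(0.001)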


ocelotl commented Jun 21, 2022

@haarakdy you are using the old Prometheus remote write exporter; we have to bring it back on top of the new metrics API/SDK, and this is being tracked in #933.


haarakdy commented Jul 5, 2022

Is there a simple way to run OpenTelemetry in a dedicated Python instance, a dedicated process? Ultimately I want the UpDownCounter.add() API to copy the data to shared memory, send an event, and return. The OpenTelemetry process would pick the data up from the shared memory and perform its magic.
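
A rough sketch of the pattern I have in mind, using multiprocessing as a stand-in for the shared memory and event (the MetricProxy class and the (name, value, labels) tuple are made up for illustration; this is not an existing OpenTelemetry API):

import multiprocessing as mp


class MetricProxy:
    """Hypothetical thin wrapper: add() only enqueues the data and returns."""

    def __init__(self, queue, name: str) -> None:
        self._queue = queue
        self._name = name

    def add(self, value, labels=None) -> None:
        # Cheap in the application process: no aggregation, no export.
        self._queue.put((self._name, value, labels or {}))


def telemetry_worker(queue) -> None:
    # Runs in a dedicated process; this is where the real OpenTelemetry
    # SDK would aggregate and export, without touching the app's GIL.
    while True:
        item = queue.get()
        if item is None:  # shutdown sentinel
            break
        name, value, labels = item
        # e.g. real_updowncounter.add(value, labels) would go here


if __name__ == "__main__":
    queue = mp.Queue()
    worker = mp.Process(target=telemetry_worker, args=(queue,), daemon=True)
    worker.start()

    requests_counter = MetricProxy(queue, "requests")
    requests_counter.add(1, {"path": "/health"})

    queue.put(None)  # stop the worker
    worker.join()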

@srikanthccv

@haarakdy you are using the old exporter, which is no longer supported or maintained. Please track the progress in #933.

> Is there a simple way to run OpenTelemetry in a dedicated Python instance, a dedicated process? Ultimately I want the UpDownCounter.add() API to copy the data to shared memory, send an event, and return. The OpenTelemetry process would pick the data up from the shared memory and perform its magic.

Not sure what you are referring to here. Please create another feature-request issue to start the discussion.

srikanthccv closed this as not planned on Sep 9, 2022