Observing delays in otel collector extension #1497
Hi @bhaskarbanerjee, is it possible to set
Thanks for the quick response @serkan-ozal. Here it is:
2024-08-26T13:39:55.810-05:00 | {"level":"warn","ts":1724697595.8106759,"caller":"localhostgate/featuregate.go:63","msg":"The default endpoints for all servers in components will change to use localhost instead of 0.0.0.0 in a future version. Use the feature gate to preview the new default.","feature gate ID":"component.UseLocalHostAsDefaultHost"}
THIS IS WHERE WE SEE A DELAY OF ~2-4 SECONDS.
2024-08-26T13:39:57.228-05:00 | EXTENSION Name: collector State: Ready Events: [INVOKE, SHUTDOWN]
Here is my config.
@bhaskarbanerjee In my simple example, I have seen runtime and function init logs between those collector extension logs. To check this, you may also put print debug log statements at the first and last lines (and also at some other important points) of the handler.py file to measure the load start/end timings of your handler module/file (not the handler function, but the handler file); see the sketch below.
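A minimal sketch of that measurement, assuming a plain Python Lambda whose handler lives in handler.py (the handler and print statements here are illustrative, not part of the original code):

```python
# handler.py -- illustrative only. Timestamps printed at module import
# time show how long loading this file (including its imports) takes,
# separately from the time spent inside the handler function itself.
import time

print(f"handler.py load start: {time.time()}")

# ... heavy imports / SDK initialization would normally go here ...

def lambda_handler(event, context):
    # Invocation-time work is measured separately from module load time.
    print(f"handler invoked: {time.time()}")
    return {"statusCode": 200, "body": "hello"}

print(f"handler.py load end: {time.time()}")
```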
Thanks @serkan-ozal. I have other Lambdas too which are not using OpenTelemetry, and in those cases there is no such wait time. Here is an example using the New Relic layer and another layer from HashiCorp in the same Lambda.
In this case the Lambda lifecycle API shows it took 5 ms to start after the extensions came to the ready state, which is quite acceptable.
Here are the logs from my sample Node.js Lambda: With OTEL SDK + User Handler Init
Time diff between collector logs:
With OTEL SDK + No User Handler Init
Time diff between collector logs:
No OTEL SDK + No User Handler Init
Time diff between collector logs:
So, in summary, as you can see from the logs, the time diff between collector logs depends heavily on the OTEL SDK init and user handler init delays. So, in your example, since the handler itself is just a simple hello world function without initializing any 3rd party dependency, I think the 2-4 second diff between collector logs is caused by OTEL Python SDK initialization. I will also rerun this benchmark with a sample Python Lambda and share the results. I am not a Python expert, but maybe this type of auto instrumentation is heavyweight for the Python runtime:
@serkan-ozal Much thanks, I really appreciate your help. qq - what config of the collector are you running with? Is it the default one? I am gathering evidence on my end. Will share soon.
@bhaskarbanerjee default one
@bhaskarbanerjee I am not sure, but this type of auto instrumentation (it checks the dependencies and forks another process from the main one to run the real app) seems not lightweight to me: https://github.com/open-telemetry/opentelemetry-python-contrib/blob/main/opentelemetry-instrumentation/src/opentelemetry/instrumentation/auto_instrumentation/__init__.py Maybe this is a question to be asked of the OTEL Python folks.
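For context, a highly simplified sketch of that wrap-and-re-exec pattern (this is not the actual OTel source; the bootstrap path and function name below are illustrative assumptions):

```python
# Conceptual sketch only -- not the real opentelemetry-instrument code.
# The general pattern: put a directory containing a sitecustomize.py
# (which performs instrumentation at interpreter startup) on PYTHONPATH,
# then replace the current process with the real application. The wrapper
# itself adds startup cost (argument parsing, dependency checks, exec)
# before the app ever runs.
import os
import shutil
import sys

def run_wrapped(command_args):
    bootstrap_dir = "/opt/python/otel_bootstrap"  # hypothetical path
    existing = os.environ.get("PYTHONPATH", "")
    os.environ["PYTHONPATH"] = os.pathsep.join(
        p for p in [bootstrap_dir, existing] if p
    )

    executable = shutil.which(command_args[0])
    if executable is None:
        sys.exit(f"command not found: {command_args[0]}")
    # execl replaces this process with the real application process.
    os.execl(executable, executable, *command_args[1:])
```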
@bhaskarbanerjee I think my theory about
Agreed on both your points. I have already added logs to find out more about several other inits and calls. I will share those tests after I run the scenarios we discussed this morning.
@serkan-ozal Here are my test results with a Lambda auto instrumented in Node.js v18
Inference:
Thanks for the detailed benchmark @bhaskarbanerjee. I have also reproduced similar cold start delays with the Python example and verified that Python auto instrumentation is the reason. Will look into it further tomorrow by profiling the Python auto instrumentation script. But initially, as far as I can see from the Python auto instrumentation, it
Auto instrumentation is definitely a cold start penalty in JS. The more instrumentations are enabled, the higher the penalty. This is because we wrap require and intercept all module loads to determine whether each module needs to be monkeypatched. It used to be much worse when each instrumentation maintained its own wrapper; now there is a single global wrapper that instrumentations register with. There may be additional optimizations that can be done, but the mechanism will always have some module load penalty. edit: tl;dr because Node auto instrumentation uses require-in-the-middle, each module load is slightly slower than it would be without auto instrumentation enabled.
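The same idea, sketched as a rough Python analog (illustrative only; Node.js uses require-in-the-middle, but the cost model -- a check on every module load -- is the same, and the module list below is hypothetical):

```python
# Illustrative Python analog of intercepting every module load to decide
# whether it should be patched. Each import now pays a small lookup cost,
# which is the kind of per-module overhead described above.
import importlib.abc
import sys

MODULES_TO_PATCH = {"requests", "botocore"}  # hypothetical target list

class PatchCheckFinder(importlib.abc.MetaPathFinder):
    def find_spec(self, fullname, path, target=None):
        if fullname in MODULES_TO_PATCH:
            # A real implementation would return a loader that wraps the
            # module after it is executed; here we only note the match.
            print(f"would patch: {fullname}")
        # Returning None lets the normal import machinery handle the load.
        return None

# Installing the finder first means it sees every subsequent import.
sys.meta_path.insert(0, PatchCheckFinder())
```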
Thanks @dyladan. Would choosing the instrumentations explicitly lower the cold starts? Or are there any other ways apart from manual instrumentation? Do we have any tests conducted/documented by the community to determine the performance difference between AUTO instrumentation and MANUAL instrumentation?
For the Python runtime, I have analyzed the OTEL Python auto instrumentation script and verified that the following parts have a significant effect on cold start (for a 512 MB Lambda function):
I am not sure whether there is anything we can do for
But instead of auto discovery of the OTEL packages, I have switched to a static list of instrumentation packages, like we already do in the Node.js layer, and it helped me cut 750-1000 ms from the cold start delay. A sketch of the static-list approach is below. If you are also interested in testing my optimized layer (which is experimental, so please don't use it on prod :) ), I can share its ARN so you can benchmark it too.
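A hedged sketch of that static-list approach: instead of discovering every installed opentelemetry-instrumentation-* package via entry points, instrument a known, fixed set of libraries. The instrumentor classes below come from opentelemetry-python-contrib, but exactly which ones belong in the list depends on the function's dependencies, so treat this as an assumption rather than the layer's actual code:

```python
# Static instrumentation sketch: each call monkeypatches only one library,
# so the cost is bounded and predictable compared to scanning all installed
# distributions at startup. Requires the corresponding
# opentelemetry-instrumentation-* packages to be installed.
from opentelemetry.instrumentation.botocore import BotocoreInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

def instrument_statically():
    BotocoreInstrumentor().instrument()
    RequestsInstrumentor().instrument()

instrument_statically()
```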
According to my cold start analysis, I have created these issues in the OTEL Python repository to discuss potential solutions:
Hi @serkan-ozal Sorry, I was out for a few days. Can you please share which instrumentations you have included? I have tried setting OTEL_PYTHON_DISABLED_INSTRUMENTATIONS to
Describe the bug
We are using the otel collector in decouple mode and observing that the extension is taking > 2s to start. On further investigation, we are noticing a delay of ~1.5-2s between the log statements
Everything is ready. Begin running and processing data.
and EXTENSION Name: collector State: Ready Events: [INVOKE, SHUTDOWN]
Is so much delay expected?