Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cold start detection and marking #52

Merged
merged 1 commit into from
May 20, 2021
Merged

Conversation

tom-pytel
Copy link
Contributor

@tom-pytel tom-pytel commented May 17, 2021

I don't think I've seen this mechanism in the other agents but I was asked to implement this. A cold start is when an instance is spun up to serve a request, for this reason the response time of the endpoint includes the spinup time of the instance, which is not ideal. This PR detects this condition and marks spans to which it applies.

Cold start detection is as follows: First span to run within 1 second of skywalking init is considered a cold start. This span gets the tag 'coldStart' set to 'true'. This span also optionally gets the text '<cold>' appended to the endpoint name if env var SW_COLD_ENDPOINT is set to 'true' or config option 'coldEndpoint', the default is off (normal endpoint names). A dummy span running first due to ignore is considered a valid first run and eats the cold start. Note: 'coldStart' tag is not added at all if cold start is not detected, which means most spans will not have anything extra added.

The metrics of the endpoint that was cold-started do not change (much) so this tag is not for the endpoint itself but for analysis of spans upstream which will be significantly slower due to the cold start. The visual part via 'coldEndpoint' is more for quick visual analysis in the trace view.

The target environment for this was Azure Functions which we are working on now but the mechanism is general so it should apply in any environment where instances are spun up on demand (Kubernetes). Performance-wise it is just a flag check so impact should be nothing.

@kezhenxu94 kezhenxu94 requested review from kezhenxu94 and wu-sheng May 17, 2021 13:04
@tom-pytel
Copy link
Contributor Author

Forgot to update tests for coldStart tag.

@tom-pytel tom-pytel force-pushed the master branch 2 times, most recently from 1ab8e47 to 1b168e7 Compare May 17, 2021 17:04
README.md Outdated Show resolved Hide resolved
@kezhenxu94
Copy link
Member

@wu-sheng WDYT of the cold start concept and mechanism?

@wu-sheng
Copy link
Member

In the code start case, is there any chance to separate the booting process and request handling process? These are typical differences in most cases I am familiar with. Or do you mean, this code starting begins after receiving a request? I don't know this kind of case.

@tom-pytel
Copy link
Contributor Author

tom-pytel commented May 19, 2021

In the code start case, is there any chance to separate the booting process and request handling process? These are typical differences in most cases I am familiar with. Or do you mean, this code starting begins after receiving a request? I don't know this kind of case.

The coldStart tag doesn't really give information about the endpoint where it applies in isolation but rather explains the slower than normal execution of an upstream parent span which waits on this child. For one cold started child alone there is no way to determine the length of the bootup since the span starts once the boot is complete and in theory will execute more or less the same as if it was not a cold start (except for any unfilled caches and things like that). If you have both cold and non-cold versions of an endpoint with their parent then you can estimate the time of bootup alone as shown below:

              0   1   2   3   4

/parent       +-------+
/child            +---+

/parent       +---------------+
/child<cold>              +---+

child boot time = (/parent - /child<cold>) - (/parent - /child) = (4 - 1) - (2 - 1) = 2

Of course you have to select the parent times which have cold vs. non-cold children. The /child and /child are subtracted from /parent because they MAY have statistically significant difference in execution time..

In any case this is just something we do on our side because someone requested it, thought it might be useful info for your side as well.

@tom-pytel tom-pytel force-pushed the master branch 3 times, most recently from f0dfe41 to 314f195 Compare May 19, 2021 13:13
@tom-pytel tom-pytel closed this May 19, 2021
@tom-pytel tom-pytel reopened this May 19, 2021
@tom-pytel
Copy link
Contributor Author

Closed accidentally...

@wu-sheng
Copy link
Member

I am good with a new endpoint with <cold> suffix.
But this PR reminds me, I can't find the plugin dev doc about APIs explanation in NodeJS agent. Could you submit an issue to track this, and try to submit a PR to fix?

@tom-pytel
Copy link
Contributor Author

tom-pytel commented May 19, 2021

I am good with a new endpoint with <cold> suffix.

That part is optional for visual inspection as well as separating statistics for warm vs. cold endpoints (or maybe help with some DB analysis queries as well). What is always added is the coldStart tag.

But this PR reminds me, I can't find the plugin dev doc about APIs explanation in NodeJS agent. Could you submit an issue to track this, and try to submit a PR to fix?

This is more for @kezhenxu94 I think? Or if you want me to document this then tell me where @kezhenxu94.

Copy link
Member

@kezhenxu94 kezhenxu94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kezhenxu94 kezhenxu94 merged commit 8e72aad into apache:master May 20, 2021
@kezhenxu94 kezhenxu94 mentioned this pull request May 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants