Segment reference and reporting overhaul #50

tom-pytel · 2021-05-12T12:09:54Z

There are many things in this PR most of which were discussed with @wu-sheng in a separate channel, they are as follows:

Spans which are ignored due to suffix or endpoint name pattern match will propagate their ignored status downstream via dummy span and sending an empty "sw8" header. If such an empty header is received then a dummy context and span is created instead of a new actively traced segment. This is only a behavior change for this node agent and is compatible with other language agents, they will just act as they have always done.
If a span is created in code which is an async child of a span, which has already finished and been sent to the server, then it will now correctly create a new segment and reference the segment which created it and keep the parent-child relationship in the trace. Previously behavior was undefined mostly losing the new span.
The method of limiting the number of segments which are recorded has been changed from post-record check to pre-record check to avoid unnecessary instrumentation if the number of segments already exceeds the threshold and the data would be thrown away anyway. This also allows propagation of ignored status of segment downstream via mechanism described above to avoid broken "VNode" segments in trace view.
Setting logging functions to no-ops when logging level indicates since the logging module seems to be stringifying everything which is sent to it regardless of logging level set, which led to up to 2x performance hit in extreme case of very simple endpoints.
Search for top-level module now stops when it finds skywalking since an app may be run by a framework which imports it into its own node context and would wind up with the wrong top-level directory to import from.
"peer" removed from Context.newExitSpan() and is now treated just like the other parameters.

wu-sheng · 2021-05-12T12:51:59Z

Setting logging functions to no-ops when logging level indicates since the logging module seems to be stringifying everything which is sent to it regardless of logging level set, which led to up to 2x performance hit in extreme case of very simple endpoints.

Personally, not about this PR, I think in the future, nodejs/python could adopt log service rather than span's log, which Java agent has done as log gRPC reporter. In there, you still could have all context and collaboration between tracing and logging, but as log could be sent as soon as possible separately, it has much better performance.
Of course, follow the log level makes sense. And should consider dynamic-configuration feature implementation in the agent to change the behavior dynamically.

tom-pytel · 2021-05-12T12:53:34Z

Setting logging functions to no-ops when logging level indicates since the logging module seems to be stringifying everything which is sent to it regardless of logging level set, which led to up to 2x performance hit in extreme case of very simple endpoints.

Personally, not about this PR, I think in the future, nodejs/python could adopt log service rather than span's log, which Java agent has done as log gRPC reporter. In there, you still could have all context and collaboration between tracing and logging, but as log could be sent as soon as possible separately, it has much better performance.
Of course, follow the log level makes sense. And should consider dynamic-configuration feature implementation in the agent to change the behavior dynamically.

It seems to me like the current logging was just a quick initial draft to get something working, not sure if it was meant to last?

kezhenxu94 · 2021-05-12T12:56:40Z

The logging changes in this PR is all about the agent itself, not the instrumented services. They're totally different things.

wu-sheng · 2021-05-12T12:57:18Z

It seems to me like the current logging was just a quick initial draft to get something working, not sure if it was meant to last?

Span's logs are something you want to inject dynamically in the plugin, real log protocol(https://github.com/apache/skywalking-data-collect-protocol/blob/master/logging/Logging.proto) should be used to collect user(target app)'s logs. Make sense?

wu-sheng · 2021-05-12T12:59:28Z

The logging changes in this PR is all about the agent itself, not the instrumented services. They're totally different things.

OK, got it.

wu-sheng · 2021-05-12T13:00:55Z

"peer" removed from Context.newExitSpan() and is now treated just like the other parameters.

If there is no peer, the sw8 header could not be assembled correctly. How do the test cases work?

tom-pytel · 2021-05-12T13:06:34Z

"peer" removed from Context.newExitSpan() and is now treated just like the other parameters.

If there is no peer, the sw8 header could not be assembled correctly. How do the test cases work?

There is a peer it is just being set explicitly after the call rather than passing as parameter to newExitSpan() and being set in this function. A minor change to allow consolidating common functionality of newExitSpan, newEntrySpan and newLocalSpan.

wu-sheng · 2021-05-12T13:17:02Z

"peer" removed from Context.newExitSpan() and is now treated just like the other parameters.

If there is no peer, the sw8 header could not be assembled correctly. How do the test cases work?

There is a peer it is just being set explicitly after the call rather than passing as parameter to newExitSpan() and being set in this function. A minor change to allow consolidating common functionality of newExitSpan, newEntrySpan and newLocalSpan.

This was designed from day one, because you want to make sure you have this, rather than facing no-peer in creating headers. Do we expect to have an error if it is really missed?

wu-sheng · 2021-05-12T13:21:26Z

The method of limiting the number of segments

I don't know nodejs agent core. But please make sure all sampling mechanisms optional and could be assigned through agent#start

tom-pytel · 2021-05-12T13:22:51Z

The method of limiting the number of segments

I don't know nodejs agent core. But please make sure all sampling mechanisms optional and could be assigned through agent#start

Which sampling mechanism are you referring to?

tom-pytel · 2021-05-12T13:24:28Z

There is a peer it is just being set explicitly after the call rather than passing as parameter to newExitSpan() and being set in this function. A minor change to allow consolidating common functionality of newExitSpan, newEntrySpan and newLocalSpan.

This was designed from day one, because you want to make sure you have this, rather than facing no-peer in creating headers. Do we expect to have an error if it is really missed?

All the plugins set it, and the chance of not setting it as a programming error is the same as the chance of not setting any other parameter like layer or component ID. All of which are non-desirable cases obviously but we don't give those parameters any special treatment...

wu-sheng · 2021-05-12T13:25:09Z

Which sampling mechanism are you referring to?

I mean, this is actually a sampling mechanism. Typically, it is called adoptive sampling. Because sampling bases on the volume of network and backend capabilities.

tom-pytel · 2021-05-12T13:28:09Z

Which sampling mechanism are you referring to?

I mean, this is actually a sampling mechanism. Typically, it is called adoptive sampling. Because sampling bases on the volume of network and backend capabilities.

You mean the "throttling" or "limiting" mechanism I changed? Doesn't matter what its called, it is settable, it uses the same SW_AGENT_MAX_BUFFER_SIZE or config.maxBufferSize to set the sampling size as the previous method, so no change there.

wu-sheng · 2021-05-12T13:40:25Z

Which sampling mechanism are you referring to?

I mean, this is actually a sampling mechanism. Typically, it is called adaptive sampling. Because sampling bases on the volume of network and backend capabilities.

You mean the "throttling" or "limiting" mechanism I changed? Doesn't matter what its called, it is settable, it uses the same SW_AGENT_MAX_BUFFER_SIZE or config.maxBufferSize to set the sampling size as the previous method, so no change there.

I think should have a setting called config.sampling=adaptive, then your codes work. Also, we could extend to have config-sampling=rate etc.

tom-pytel · 2021-05-12T13:45:00Z

I think should have a setting called config.sampling=adaptive, then your codes work. Also, we could extend to have config-sampling=rate etc.

In this PR I only changed how the current sampling method is implemented, not how it is defined, but those are all ideas for future PR extensions. One idea might be a general rule system for this and the various "ignore" options which specifies which endpoints to record and which to exclude which might include sampling frequency and max counts. But again, that is for a future PR...

wu-sheng · 2021-05-12T13:47:38Z

I think should have a setting called config.sampling=adaptive, then your codes work. Also, we could extend to have config-sampling=rate etc.

In this PR I only changed how the current sampling method is implemented, not how it is defined, but those are all ideas for future PR extensions. One idea might be a general rule system for this and the various "ignore" options which specifies which endpoints to record and which to exclude which might include sampling frequency and max counts. But again, that is for a future PR...

Could you update the document about this?

tom-pytel · 2021-05-12T13:53:57Z

SW_AGENT_MAX_BUFFER_SIZE is already documented in the README and I didn't change how it behaves, just how it is implemented internally. Or do you mean another document? Though you are right and I will add about how the ignore option is propagated downstream since people might otherwise be confused.

wu-sheng · 2021-05-12T13:56:39Z

Or do you mean another document? Though you are right and I will add about how the ignore option is propagated downstream since people might otherwise be confused.

Yes, I mean about how and when this works.

tom-pytel · 2021-05-12T19:43:20Z

@kezhenxu94: I've added two small cleanups:

Moved ContextManager.spansDup() from the Context.new?Span() functions to the Context.start() function to be more symmetric with stop(), async() and resync() but also because the spansDup() needs to happen before the spans.push() and there is no guarantee the user will call start() just after the Context.new?Span() call.
Removed the Buffer class since without the size check it is just a wrapper for a JS array.

tom-pytel added 5 commits May 11, 2021 09:13

fixed disabled logging performance hit

f6ee6a9

fix find top module if running in framework

dd856c4

peer removed from Context.newExitSpan()

ee165c6

minor ExitSpan fix to set operation from self

b2e3d28

ignore propagation, span child of finished span

4bf24c4

kezhenxu94 added core enhancement New feature or request labels May 12, 2021

kezhenxu94 added this to the 0.3.0 milestone May 12, 2021

changed method of restricting number of segments

0a3ba60

tom-pytel force-pushed the master branch from 06db610 to 0a3ba60 Compare May 12, 2021 14:02

wu-sheng requested a review from kezhenxu94 May 12, 2021 14:16

moved spansDup() to SpanContext.start()

056ca22

removed unnecessary Buffer.ts

803ff63

ContextCarrier fix when there is no clientAddress

2886801

kezhenxu94 approved these changes May 14, 2021

View reviewed changes

kezhenxu94 merged commit 090072e into apache:master May 14, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Segment reference and reporting overhaul #50

Segment reference and reporting overhaul #50

tom-pytel commented May 12, 2021 •

edited

Loading

wu-sheng commented May 12, 2021

tom-pytel commented May 12, 2021

kezhenxu94 commented May 12, 2021

wu-sheng commented May 12, 2021

wu-sheng commented May 12, 2021

wu-sheng commented May 12, 2021

tom-pytel commented May 12, 2021 •

edited

Loading

wu-sheng commented May 12, 2021

wu-sheng commented May 12, 2021

tom-pytel commented May 12, 2021

tom-pytel commented May 12, 2021

wu-sheng commented May 12, 2021

tom-pytel commented May 12, 2021

wu-sheng commented May 12, 2021

tom-pytel commented May 12, 2021

wu-sheng commented May 12, 2021

tom-pytel commented May 12, 2021

wu-sheng commented May 12, 2021

tom-pytel commented May 12, 2021

Segment reference and reporting overhaul #50

Segment reference and reporting overhaul #50

Conversation

tom-pytel commented May 12, 2021 • edited Loading

wu-sheng commented May 12, 2021

tom-pytel commented May 12, 2021

kezhenxu94 commented May 12, 2021

wu-sheng commented May 12, 2021

wu-sheng commented May 12, 2021

wu-sheng commented May 12, 2021

tom-pytel commented May 12, 2021 • edited Loading

wu-sheng commented May 12, 2021

wu-sheng commented May 12, 2021

tom-pytel commented May 12, 2021

tom-pytel commented May 12, 2021

wu-sheng commented May 12, 2021

tom-pytel commented May 12, 2021

wu-sheng commented May 12, 2021

tom-pytel commented May 12, 2021

wu-sheng commented May 12, 2021

tom-pytel commented May 12, 2021

wu-sheng commented May 12, 2021

tom-pytel commented May 12, 2021

tom-pytel commented May 12, 2021 •

edited

Loading

tom-pytel commented May 12, 2021 •

edited

Loading