Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to determine the parent context of the second of two logically sequential requests #892

Open
spockz opened this issue Apr 7, 2019 · 15 comments

Comments

@spockz
Copy link

spockz commented Apr 7, 2019

This is an issue for discussing the proposal mentioned in gitter to consider using the context of the request(s) logically preceding a request as its parent context.

Given the following pseudo-code, it is – intuitively – obvious that the parent span of both call1 and call2 should be the same as there is no dependency.

val firstCall: Future[Rep] = call1()
val secondCall             = call2()

Now imagine the following pseudo-code where the second call (call2) depends on the outcome of call1. Most clients/instrumentations will still use the same parent span for both calls. This makes observing behaviour correctly hard because it requires insight into the code to see what was actually written.

val firstCall: Future[Rep] = call1()
val secondCall             = firstCall.map(firstResult => call2(firstResult+1))

As one of the use-cases of using tracing is to determine why and how a certain response came to be, my proposal is to use to the context of call1 as the parent of call2 when call2 depends on the response of call1.

As I know of no instrumentations that actually do this, I'm wondering why. Is it incredibly stupid to do this, or just hard?


We can extend the discussion on what the parent contexts should be for the following case:

val firstCall: Future[Rep] = call1()
val secondCall             = call2()
val thirdCall              = join(firstCall, secondCall).map((firstResult, secondResult) => call3(firstRresult+secondResult))

In this case I would argue that call3 should have both the contexts of call1 and call2 as parent. Multiple parents is something we know from events already.

@codefromthecrypt
Copy link
Member

copying some folks: @openzipkin/core @openzipkin/armeria @yschimke @bsideup @marcingrzejszczak
and also some farther from this code for hopefully more point of view: @Xylus @wu-sheng

so this topic "async callback tracing" has been debated a lot which is why I asked to raise this as an issue.. firstly to this repo as we have a test case we can proceed with. The question is that when you are an RPC async response or error callback, what should the parent be?

I think when I asked the same in javascript once, @eirslett mentioned that result callbacks on a client should infer the parent from the caller of the client. This would cause less confusion with any child spans that are incidental to the network request (such as DNS). It also reads more like sequential code.

For example, let's say you have an RPC by itself, and it makes a DNS or related activity. This is clearly a causally linked child. A follow-up RPC request, such as a http redirect, is interestingly not the child of the first RPC request, as it would confuse this and mess up aggregations such as counts (as clients are not expected to be children of clients). For this reason, when we have a chance of a retry we create an "application span" which is the parent of the network requests.

It would follow to me that business code should not use the client call they have a data dependency on as the parent context. Not only could that result in a tree depth of hundreds or thousands, and confuse the structure of the trace, but it would create different tree pattern just on accident of using async style. Let's take a look at sequential vs async wrt data dependencies.

ex the following async call:

val firstCall: Future[Rep] = call1()
val secondCall                 = firstCall.map(firstResult => call2(firstResult+1))

is the async variant of the following top-to-bottom code

val firstResult = call1()
return call2(firstResult+1))

I'd like to use this as evidence that the response/error callback should not need to look to see if a value is read in order to decide what the parent is. A data dependency is hard to track down anyway, we try to do this in messaging (as the message itself is the context), but in RPC, we don't and if we started to now, I do think our traces would end up very deep when people are making a lot of RPC calls and this would be unintuitive and break analysis.

fight me :D

@devinsba
Copy link
Member

Chiming in with a small idea, this seems like the correct use case for the OT follows-from relationship type. That way parent-child is maintained, but the relationship is also captured, in that you can account for why the second span is paused

@codefromthecrypt
Copy link
Member

@devinsba let's say we made the parent the ID of the client on the response chain, adding a tag about the relationship, wouldn't you still have an ancesteral issue regarding RPC analysis? A client would be calling a client possibly 100 in a straight line (staircase of a very deep hierarchy trace).. even if we made some changes here, I think it would break amazon X-Ray and others. Those who had the whole trace could correct it back to the real parent, but those who didn't or didn't update to that server-side code will look like broken instrumentation.

If instead, we were thinking about the follows-from being a separate span ID (like tag="http.response_span=xxxx" in the callback), that's something we could add in instrumentation (leaving the real parent alone). This probably means co-propagating the real parent, vs the response callback depending on how we do this. we could try to be clever about it.

However we think about this, we should be careful to not misinterpret the data or use some OT concept.. it is often discussed before practiced. I'm not sure which backends (certainly not any clouds) support this. I'm pretty sure "follows-from" was about layout concerns with messaging processing and discussed back when appdash was a thing. Here's an issue about that here openzipkin/zipkin#1701 cc @tacigar I do think the message processing case with or without a follows-from ref could be done more elegantly.

Back to RPC I think the closest match to the problem space is the async sequence ID which is something pinpoint uses to clarify scheduling order. Again, that's a separate thing.

One reason I raised this issue in a concrete place, like instrumentation, was to show not tell. We have a great chance to avoid mistakes some "standards" things make, but showing exactly what happens when modeling is changed without a way to revert it. Remember, we have access to the in memory storage with has a dependency linker. It shouldn't be too hard to show exactly the concerns I've mentioned which gives folks a chance to mitigate them in worst case a FinishedSpanHandler, which rewrites the IDs back :P

To get towards concrete..
We know tools do things like count RPCs (X-Ray does this), and moving ID layout to depth vs breadth will break that in a fairly major way. We also know some look at trace depth as a signal, and also this will break that. We also know that it has always been the case that client calling client is a bug, and this breaks that :) I'm probably forgetting some things..

to show not tell could be to make a test case, which insulates us from the breaking, but allow folks to play what'if games. If we couple what ifs with a trace image and some tests that still pass on dependency linking etc or a change list to fix them, we could move carefully if this way is forward.

@devinsba
Copy link
Member

I actually meant exactly what you said, I think it should be a tag that potentially we assign special meaning to in the UI, but leaves the data model alone. So in the example given:

val firstCall: Future[Rep] = call1()
val secondCall                 = firstCall.map(firstResult => call2(firstResult+1))

The spans for clients call1 and call2 share the same parent, but the span for call2 has a tag followsFrom=${call1.spanId}. Though there is the case where call2 waits for multiple futures so maybe that is a list, though last write wins should also work here

@codefromthecrypt
Copy link
Member

codefromthecrypt commented Apr 11, 2019

@devinsba ack.

I still would hesitate using "followsFrom" word, for this unless it is well defined and also well used in practice.. we could figure that out by looking at all the source code made by OT servers (there are few of them, might be as easy as looking at jaeger ui as I know no other that would have likely used this besides the defunct appdash). Remember people use brave to send to jaeger, and we don't want to create junk there.. I doubt they register multiple parents in any way. The linked span thing in OC may also have completely different semantics.. avoiding naming this tag something used for other things could be a good strategy as it is easy enough to rename a tag to something else.

To make a concrete experiment, I think we'd need something more tough than 2 calls as the edge cases come out in parallel or nested N calls hung off the response thread.

I suspect to get the immediate parent ID, we may need some trickery as our "current span" will still the real parent and we have no idea if a callback will ever have another client span invoked. It might smell like co-propagation, just in case a client wants this data. This is the instrumentation tricky part.. maybe a special "extra" could be made to stuff the "last client" for this reason.. wdyt?

@yschimke
Copy link

I don’t have much skin in this game but strongly agree on using with using FollowsFrom from reading the spec.

Without the spec, whatever you do is just another divergent practice. Following it you can achieve what you want in practice but when implementations don’t align at least you can raise a bug.

Sent with GitHawk

@codefromthecrypt
Copy link
Member

codefromthecrypt commented Apr 12, 2019 via email

@codefromthecrypt
Copy link
Member

here are some examples of follows-from that aren't test cases (there aren't many!)

instana's tracer throwing it away

  @Override
  public Tracer.SpanBuilder addReference(String referenceType, SpanContext referencedContext) {
    if (References.CHILD_OF.equals(referenceType)) {
      return asChildOf(referencedContext);
    } else if (References.FOLLOWS_FROM.equals(referenceType)) {
      // TODO: Currently the FOLLOWS_FROM semantic is not fully supported, but its better to have CHILD_OF than nothing
      return asChildOf(referencedContext);
    } else {
      return this;
    }
}

jaeger.. only a month ago handling it.. and treating exactly the same as child of
jaegertracing/jaeger-ui#335

  computeAncestors(rootSpan: Span) {
    let currentSpan: Span = rootSpan;
    while (currentSpan && Array.isArray(currentSpan.references) && currentSpan.references.length) {
      const { refType, span } = currentSpan.references[0] || {};
      if (span && (refType === 'FOLLOWS_FROM' || refType === 'CHILD_OF')) {
        this.ancestorIds.push(span.spanID);
        currentSpan = span;
      } else {
        break;
      }
    }
  }

The point is that we can't blindly look at something mentioned as spec as actually having practice or being meaningfully evaluated.

@yschimke
Copy link

Google Search and the first spec that comes up :) . Please don't take this as any strong stance on my part. I got asked a question, with a suggested solution, I googled the solution and it appears to be defined in a relevant spec.

https://github.com/opentracing/specification/blob/master/specification.md
https://github.com/opentracing/opentracing-go/blob/master/tracer.go

         // FollowsFromRef refers to a parent Span that does not depend in any way
	// on the result of the new child Span. For instance, one might use
	// FollowsFromRefs to describe pipeline stages separated by queues,
	// or a fire-and-forget cache insert at the tail end of a web request.
	//
	// A FollowsFromRef Span is part of the same logical trace as the new Span:
	// i.e., the new Span is somehow caused by the work of its FollowsFromRef.
	//
	// All of the following could be valid timing diagrams for children that
	// "FollowFrom" a parent.
	//
	//     [-Parent Span-]  [-Child Span-]
	//
	//
	//     [-Parent Span--]
	//      [-Child Span-]
	//
	//
	//     [-Parent Span-]
	//                 [-Child Span-]
	//
	// See http://opentracing.io/spec/
	//
	// See opentracing.FollowsFrom()

@codefromthecrypt
Copy link
Member

codefromthecrypt commented Apr 12, 2019 via email

@yschimke
Copy link

On reading it, it also made sense

Sent with GitHawk

@codefromthecrypt
Copy link
Member

anyone want to run a test with the OT approach then? I think there are folks who want to.

@devinsba
Copy link
Member

Actually based on the fact they have actually defined the meaning of that reference type, it was a TODO last I saw, this is not a case that matches that semantic anyway. As the parent does depend on the result. This would be a different reference type, one where a spans duration depends on sibling. (The normal sync model being the timestamp depending on a sibling)

@jcchavezs
Copy link
Contributor

jcchavezs commented Apr 25, 2019

Jumping late into this discussion but here are my 3p:

  1. I might be wrong but the described name in followsFrom from OT is the opposite as the behaviour we describe here. As @adriancole mentions, it is more for the sake of saying "this child span could fail in the future but that won't affect the output of the parent".
  2. I usually don't like synthetic tags as the proposed followsFrom (If something I would call it based_on or comes_from) but I have some questions here: what if you launch two parallel request R1 and R2 (let's say in gorutines) and after those two you decide to launch the third one R3, thus R1, R2 and R3 will have the same parent but R3 will depend on R1 and R2 (no dependency between R1 and R2), then the depends relationship will be N:N because the output of R1 can be used for deciding R3 and a potential R4.
  3. The way I see extra (if you are talking about extra fields in the context) is more about outbound propagation but the way I see it, you won't propagate it to other hops.

Also cc @ghermeto as he might have some valuable input about this.

@ghermeto
Copy link
Member

When you have two promises called one after the other, just means they are sequential, not that one is the parent of the other. You can visualize the parent. So, this:

function parent() {
   request1().then(result => request2() );
}

can be written as:

async function parent() {
    await request1();
   request2();
}

which shows that just because they are sequential doesn’t imply hierarchy…

I agree with Adrian statement that client calls are not expected to be children of client.

IMO followsFrom is to be used in the case of "stages separated by queues", which is very similar to the case we discussed on SF, which was a stream processor type of architecture (where a message is sent, processed, triggers a new message and so on...). On this case I wouldn't just see them as sequential, but as parent/child where the child can happen after the parent has finished:

[-Parent Span-]
                [-Child Span-]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants