
Distributed Tracing

Regunath B edited this page Jun 30, 2015 · 17 revisions

Distributed tracing is a particularly useful feature to have in a micro-services environment. Phantom's support for it is based on Twitter's Zipkin, an open source implementation of Google's Dapper paper. Zipkin is implemented in Scala. Phantom uses Brave, a Java distributed tracing implementation that is compatible with Zipkin. The Collector, Query and Web interfaces of Zipkin are deployed as-is following the Zipkin Install instructions.

The tracing instrumentation using the Brave libraries in Phantom is shown in the design diagram below: Phantom tracing design

Phantom can participate in a distributed trace (or initiate one), wherein all calls to deployed handlers are instrumented to emit spans. The core libraries take care of copying the relevant span information across threads (Netty worker, Hystrix command processor), and this support extends to the Http, Thrift and Command protocols. For Http, the trace and parent span information is also copied onto Http request headers so that downstream services invoked by Phantom handlers can also participate in the trace initiated by Phantom or its client.
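As an illustration of what header propagation amounts to, the sketch below copies the standard Zipkin/B3 trace headers from an inbound request onto an outbound one. The helper class is hypothetical and not part of Phantom's actual code; only the `X-B3-*` header names are the well-known Zipkin conventions:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of B3 trace propagation: copy the trace identifiers that
// arrived on the inbound request onto the outbound request, so that a
// downstream service can join the same trace. The header names are the
// standard Zipkin/B3 ones; this helper class is hypothetical and not
// part of Phantom.
class B3HeaderPropagation {

    // Copies any B3 headers present on the inbound request to the outbound
    // one. The inbound span id becomes the outbound parent span id; the
    // caller supplies a freshly generated span id for the new client span.
    static Map<String, String> propagate(Map<String, String> inbound,
                                         String newSpanId) {
        Map<String, String> outbound = new HashMap<>();
        if (inbound.containsKey("X-B3-TraceId")) {
            outbound.put("X-B3-TraceId", inbound.get("X-B3-TraceId"));
            // the current span becomes the parent of the downstream span
            outbound.put("X-B3-ParentSpanId", inbound.get("X-B3-SpanId"));
            outbound.put("X-B3-SpanId", newSpanId);
        }
        if (inbound.containsKey("X-B3-Sampled")) {
            // the sampling decision travels with the trace
            outbound.put("X-B3-Sampled", inbound.get("X-B3-Sampled"));
        }
        return outbound;
    }
}
```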

Tracing instrumentation

The following core components in Phantom are instrumented in order to support tracing:

  • Socket Listeners i.e. Netty Channel Handlers - these components are instrumented to start a new server trace, disable tracing for the request, or participate in a trace initiated by the client application. They also take care of copying active trace information to handler requests that originate here.
  • Executors i.e. Hystrix Command wrappers for handlers - these components are instrumented to emit client traces for all outgoing calls to services. The Http implementation additionally copies active trace information onto request headers.
  • Task context (as described in Nesting calls) - copies active trace information to nested calls on handlers when TaskResult executeCommand(String commandName, byte[] data, Map<String,String> params) is called. A sample trace with nested calls is shown in the trace diagram below: Nested tracing
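The nested-call pattern above can be sketched as follows. Only the executeCommand signature matches the one quoted from the wiki; TaskResult, TaskContext and the handler are simplified stand-ins for illustration, since the instrumented task context is expected to carry the active trace information onto the nested invocation transparently:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins for Phantom's task types; only the executeCommand
// signature matches the one quoted above, the rest is illustrative.
class TaskResult {
    final byte[] data;
    TaskResult(byte[] data) { this.data = data; }
}

interface TaskContext {
    TaskResult executeCommand(String commandName, byte[] data,
                              Map<String, String> params);
}

// A handler making a nested call: the instrumented task context copies the
// active trace information onto the nested invocation, so the caller only
// supplies its own parameters.
class UppercaseHandler {
    TaskResult handle(TaskContext context, byte[] payload) {
        Map<String, String> params = new HashMap<>();
        params.put("mode", "fast"); // hypothetical handler parameter
        // nested call; the span for it is emitted by the instrumented executor
        return context.executeCommand("uppercase", payload, params);
    }
}
```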

Phantom also provides support for initializing traces in Java Servlet containers. This is available as a servlet filter, ServletTraceFilter, that may be suitably initialized and configured using standard filter definitions in web.xml deployment configuration files. This feature is useful when tracing Web APIs using Phantom, as explained below:

  • Edit the web application/API web.xml and register Phantom web app event listener (WebContextLoaderListener) as follows:
<listener>
    <listener-class>com.flipkart.phantom.runtime.impl.spring.web.WebContextLoaderListener</listener-class>
</listener> 

This listener locates a file named common-web-config.xml in the configuration deployment directory i.e. /resources/external and, if present, creates a Spring Application Context containing the common proxy and web beans that serves as the parent context for the web application concerned. common-web-config.xml contains the tracing filter declaration and the other event producer-consumer beans required by tracing. A sample is available here: Web API Tracing
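The shape of that declaration is sketched below. This is purely illustrative: the class name and property wiring are assumptions, and the actual configuration should be taken from the linked Web API Tracing sample. The one requirement shown is a bean named ServletTraceFilter, which the DelegatingFilterProxy registered in web.xml (next step) delegates to:

```xml
<!-- Illustrative sketch only: the class name below is hypothetical; consult
     the linked Web API Tracing sample for the actual declaration. The bean
     id "ServletTraceFilter" must match the filter-name used with
     DelegatingFilterProxy in web.xml. -->
<bean id="ServletTraceFilter" class="com.flipkart.phantom.runtime.impl.server.ServletTraceFilter">
    <!-- assumed wiring to the shared span collector bean -->
    <property name="eventDispatchingSpanCollector" ref="eventDispatchingSpanCollector"/>
</bean>
```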

  • Again edit the web application/API web.xml and add the Spring DelegatingFilterProxy as a filter and specify URL pattern mapping:
<filter>
    <filter-name>ServletTraceFilter</filter-name>
    <filter-class>org.springframework.web.filter.DelegatingFilterProxy</filter-class>
</filter>
<filter-mapping>
    <filter-name>ServletTraceFilter</filter-name>
    <url-pattern>/apis</url-pattern>
</filter-mapping>	

This will ensure that all requests matching the URL pattern /apis are intercepted and traced by the Phantom tracing filter. Note that Spring's DelegatingFilterProxy delegates to a Spring-managed bean whose name matches the filter-name (here ServletTraceFilter), so a bean by that name must exist in the application context.

Span Collectors

Phantom uses the Trooper Event framework libraries to publish service proxy events to logical VM endpoints like evt://com.flipkart.phantom.events.HTTP_HANDLER. These events carry useful information about Hystrix outcomes such as Success, Failure, Fallback, Thread pool rejection etc. The events are consumed from the endpoints by consumers like the RequestLogger, which logs the events to file.

The tracing implementation, i.e. a Brave SpanCollector, piggybacks on this mechanism to emit service proxy events containing Zipkin Span information to the logical VM endpoint evt://com.flipkart.phantom.events.TRACING_COLLECTOR. This indirection provides a useful abstraction between span emission and span collection/forwarding.

The PushToZipkinEventConsumer class is an Event consumer that consumes the emitted Span data from the evt://com.flipkart.phantom.events.TRACING_COLLECTOR endpoint and forwards it to any Thrift interface that can receive and store Span data, such as the Brave Flume agent or the Zipkin Collector process. A sample configuration for this span collector is available in the Http proxy with tracing example, as shown below:

<bean id="zipkinCollector" class="com.flipkart.phantom.event.consumer.PushToZipkinEventConsumer">
    <property name="requestLogger" ref="commonRequestLogger"/>
    <property name="spanCollector">
        <bean class="com.flipkart.phantom.task.impl.collector.DelegatingZipkinSpanCollector">
            <property name="zipkinCollectorHost" value="localhost"/>
            <property name="zipkinCollectorPort" value="9410"/>                
        </bean>
    </property>
    <property name="subscriptions">
        <list>
            <value>evt://com.flipkart.phantom.events.TRACING_COLLECTOR</value>
        </list>
    </property>
</bean>

Controlling sampling

Sampling of requests is essential for low-overhead distributed tracing, as described by Zipkin and Dapper. It also makes it feasible to leave tracing instrumentation turned on by default in production deployments while incurring minimal data capture and storage overheads. Sampling is controlled using one or both of the mechanisms described below:

  • For traces originating outside Phantom (say in the client app) - In this scenario, the trace is initiated externally and Phantom only participates in it by emitting spans for handler calls. Tracing can be switched on/off by setting the request header X-B3-Sampled to true or false. Additionally, each deployed handler may be configured with a Brave TraceFilter implementation to override emitting spans for that handler.
  • For traces originating within Phantom (say when a Http, Thrift or Command proxy is deployed) - This is controlled by configuring a Brave TraceFilter implementation on Phantom's protocol-specific Netty channel handlers, i.e. RoutingHttpChannelHandler or one of its sub-types, AsyncCommandProcessingChannelHandler, CommandProcessingChannelHandler, ThriftChannelHandler. Tracing is turned off by default for these handlers. It may be turned on with a suitable implementation like the Brave ZooKeeper-based trace filter ZooKeeperSamplingTraceFilter, or a simple FixedSampleRateTraceFilter as in the Http proxy with tracing example shown below:
<bean id="httpRequestHandler" class="com.flipkart.phantom.runtime.impl.server.netty.handler.http.HttpChannelHandler" scope="prototype">
    <property name="defaultChannelGroup" ref="defaultChannelGroup"/>
    <property name="repository" ref="httpProxyRepository"/>
    <property name="defaultProxy" value="defaultProxy" />
    <property name="eventProducer" ref="serviceProxyEventProducer"/>
    <property name="eventDispatchingSpanCollector" ref="eventDispatchingSpanCollector"/>
    <property name="traceFilter">
        <!-- Trace every call -->
        <bean class="com.github.kristofa.brave.FixedSampleRateTraceFilter">
            <constructor-arg index="0" value="1"/>
        </bean>
    </property>
</bean>
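The constructor argument to FixedSampleRateTraceFilter above is the sample rate: a rate of 1 traces every call (as the comment notes), a rate of N traces roughly one call in every N, and a rate of 0 or less disables tracing. The standalone sketch below illustrates these fixed-rate semantics; it is an illustration of the idea, not Brave's actual implementation:

```java
import java.util.concurrent.atomic.AtomicLong;

// Standalone illustration of fixed-rate sampling semantics; this is not
// Brave's implementation.
// rate <= 0 : trace nothing; rate == 1 : trace everything;
// rate == N : trace one request in every N.
class FixedRateSampler {
    private final int rate;
    private final AtomicLong counter = new AtomicLong();

    FixedRateSampler(int rate) { this.rate = rate; }

    boolean shouldTrace() {
        if (rate <= 0) return false;
        // trace the 1st, (N+1)th, (2N+1)th ... request
        return counter.getAndIncrement() % rate == 0;
    }
}
```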