Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add inbound transport as label to collector metrics #1446

Merged
merged 14 commits into from
Apr 6, 2019
Merged

Add inbound transport as label to collector metrics #1446

merged 14 commits into from
Apr 6, 2019

Conversation

guanw
Copy link
Contributor

@guanw guanw commented Mar 26, 2019

Which problem is this PR solving?

Resolves #1444

Short description of the changes

  • Adding metrics on grpc/http/tchannel endpoints for collector

@guanw guanw changed the title Judew collector endpoints metrics adding collector endpoints metrics Mar 26, 2019
Copy link
Member

@yurishkuro yurishkuro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need a separate counter. We already have metrics partitioned by the span format

jaeger_collector_traces_received_total{debug="false",format="jaeger",svc="other-services"} 0
jaeger_collector_traces_received_total{debug="false",format="unknown",svc="other-services"} 0
jaeger_collector_traces_received_total{debug="false",format="zipkin",svc="other-services"} 0
jaeger_collector_traces_received_total{debug="true",format="jaeger",svc="other-services"} 0
jaeger_collector_traces_received_total{debug="true",format="unknown",svc="other-services"} 0
jaeger_collector_traces_received_total{debug="true",format="zipkin",svc="other-services"} 0

I would simply add transport as another dimension here.

cmd/collector/app/metrics.go Outdated Show resolved Hide resolved
@guanw
Copy link
Contributor Author

guanw commented Mar 26, 2019

Also this build error doesn't seem to happen when I build locally, any idea?

/home/travis/gopath/src/github.com/jaegertracing/jaeger/vendor/go.uber.org/zap/logger.go:225 +0x96 github.com/jaegertracing/jaeger/cmd/ingester/app/consumer.newDeadlockDetector.func1(0xc0ffffffff) /home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/ingester/app/consumer/deadlock_detector.go:71 +0x4d8 github.com/jaegertracing/jaeger/cmd/ingester/app/consumer.(*deadlockDetector).start.func1(0xc000162bd8, 0xc00016e3e0) /home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/ingester/app/consumer/deadlock_detector.go:163 +0x288 created by github.com/jaegertracing/jaeger/cmd/ingester/app/consumer.(*deadlockDetector).start /home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/ingester/app/consumer/deadlock_detector.go:152 +0x346 FAIL github.com/jaegertracing/jaeger/cmd/ingester/app/consumer 3.025s make: *** [cover] Error 1 The command "if [ "$TESTS" == true ]; then make test-ci ; else echo 'skipping tests'; fi" exited with 2.

@guanw
Copy link
Contributor Author

guanw commented Mar 27, 2019

I don't think we need a separate counter. We already have metrics partitioned by the span format

jaeger_collector_traces_received_total{debug="false",format="jaeger",svc="other-services"} 0
jaeger_collector_traces_received_total{debug="false",format="unknown",svc="other-services"} 0
jaeger_collector_traces_received_total{debug="false",format="zipkin",svc="other-services"} 0
jaeger_collector_traces_received_total{debug="true",format="jaeger",svc="other-services"} 0
jaeger_collector_traces_received_total{debug="true",format="unknown",svc="other-services"} 0
jaeger_collector_traces_received_total{debug="true",format="zipkin",svc="other-services"} 0

I would simply add transport as another dimension here.

@yurishkuro It already seems too many dimensions here. Do we really want to add on top of that? Eventually we are only interested in knowing where spans are coming from.

@jpkrohling
Copy link
Contributor

It already seems too many dimensions here. Do we really want to add on top of that? Eventually we are only interested in knowing where spans are coming from.

Yes, we want another dimension there. Otherwise, we'll end up with two metrics reporting the same numbers.

@yurishkuro
Copy link
Member

Yes, add as another dimension. It's still the same metric, span count, no matter how we slice and dice it.

Because format and transport are static dimensions, we need to avoid string concatenations and mutexes.

@guanw
Copy link
Contributor Author

guanw commented Mar 28, 2019

hmmm. The thing I notice is: other-services are too broad and I cannot pin a transport dimension to this category. I will have to leave it as blank. But this could be problematic say if requests from services that directly writing to collectors come in later and was simply grouped as other-services? Then we still have no way of figuring out whether we are receiving requests from those services? @yurishkuro @jpkrohling What do you think?

Also it seems both tchannel and http are invoking SubmitZipkinBatch and SubmitBatches? Any suggestion on differentiating calls in tchannel and http? I'm thinking passing a key-value pair in context but not sure this is the best practice.

e.g

ctx = context.WithValue(ctx, "transport", "HTTP" or "TChannel")

and just get it later in span_handler.go. Please advise.

@vprithvi
Copy link
Contributor

vprithvi commented Apr 1, 2019

I just took a look at this and feel that we should add metrics directly to the outer level handlers. Specifically, adding metrics to http_handler.go, grpc_handler.go, and span_handler.go.

I don't understand why span_processor.go requires knowledge of where the spans come from. (I'm not even sure why it cares about the format - Zipkin vs Jaeger).

Right now, http_handler creates batches which are sent to span_handler, which then sends it to span_processor. By refactoring http_handler to call ProcessSpans directly, we will avoid double counting metrics.

@yurishkuro Are there any arguments of keeping these metrics in span_processor?

@yurishkuro
Copy link
Member

@vprithvi I think the main argument is the explosion of different metrics names instead of just partitioning them along the natural dimensions. There's really only one metric, number of spans received. We can partition it by span format, span submission protocol, etc.

Having said that, spans received by service is also just a dimension of the above metric, but we pulled it separately because when we only need to plot the total traffic, we need to aggregate across a lot of dimensions.

@codecov
Copy link

codecov bot commented Apr 1, 2019

Codecov Report

Merging #1446 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1446      +/-   ##
==========================================
+ Coverage   99.82%   99.82%   +<.01%     
==========================================
  Files         173      173              
  Lines        8162     8179      +17     
==========================================
+ Hits         8148     8165      +17     
  Misses          7        7              
  Partials        7        7
Impacted Files Coverage Δ
cmd/collector/app/tchannel_handler.go 100% <100%> (ø) ⬆️
cmd/collector/app/metrics.go 100% <100%> (ø) ⬆️
cmd/collector/app/zipkin/http_handler.go 100% <100%> (ø) ⬆️
cmd/collector/app/span_processor.go 100% <100%> (ø) ⬆️
cmd/collector/app/options.go 100% <100%> (ø) ⬆️
cmd/collector/app/grpc_handler.go 100% <100%> (ø) ⬆️
cmd/collector/app/thrift_span_handler.go 100% <100%> (ø) ⬆️
cmd/collector/app/http_handler.go 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8b2a2c1...7ca10ee. Read the comment docs.

@guanw
Copy link
Contributor Author

guanw commented Apr 1, 2019

@yurishkuro Could you take a look at travis-ci? There are two tests I can't seem to pass somehow..

@yurishkuro
Copy link
Member

NB: please update the PR title to follow guidelines for commit messages.

Jude Wang added 7 commits April 4, 2019 13:11
Signed-off-by: Jude Wang <judew@uber.com>
Signed-off-by: Jude Wang <judew@uber.com>
Signed-off-by: Jude Wang <judew@uber.com>
Signed-off-by: Jude Wang <judew@uber.com>
Signed-off-by: Jude Wang <judew@uber.com>
Signed-off-by: Jude Wang <judew@uber.com>
Signed-off-by: Jude Wang <judew@uber.com>
@@ -104,20 +109,27 @@ func newMetricsBySvc(factory metrics.Factory, category string) metricsBySvc {
}

func newCountsBySvc(factory metrics.Factory, category string, maxServiceNames int) countsBySvc {
// Add 3 to maxServiceNames threshold to compensate for extra slots taken by transport types
maxServiceNames = maxServiceNames + 3
Copy link
Contributor Author

@guanw guanw Apr 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this is a good idea. I'm combining otherServices with transport types so this makes the user-defined counter space smaller. So I'm compensating 3 here to make the test pass. The difference is that when user have numOfServices > maxServiceNames they might notice in a metric graph like grafana the real total number of serviceNames shows maxServiceNames + 3

@guanw guanw changed the title adding collector endpoints metrics adding collector transport metrics Apr 4, 2019
@guanw
Copy link
Contributor Author

guanw commented Apr 4, 2019

@yurishkuro is there any way for me to restart crossdock test on travis. It's running fine on my local environment.

@yurishkuro
Copy link
Member

crossdock is successful, the other tests are failing

@guanw
Copy link
Contributor Author

guanw commented Apr 5, 2019

crossdock is successful, the other tests are failing

For the first one I checked unit test&code coverage under cmd/ingester/app/consumer/ but both seem alright.

The second failure I ran make install-ci locally it also passed..

Am I missing anything?

@jpkrohling
Copy link
Contributor

There seems to be a deadlock in an ingester test.
=== RUN   TestSaramaConsumerWrapper_start_Messages
2019-04-04T18:44:13.167Z	DEBUG	consumer/deadlock_detector.go:151	Starting global deadlock detector
2019-04-04T18:44:13.168Z	INFO	consumer/consumer.go:77	Starting main loop
2019-04-04T18:44:13.168Z	INFO	consumer/consumer.go:111	Starting message handler	{"partition": 316}
2019-04-04T18:44:13.169Z	DEBUG	consumer/consumer.go:134	Got msg	{"msg": {"Key":null,"Value":null,"Topic":"morekuzambu","Partition":316,"Offset":1,"Timestamp":"0001-01-01T00:00:00Z","BlockTimestamp":"0001-01-01T00:00:00Z","Headers":null}}
2019-04-04T18:44:13.169Z	INFO	consumer/processor_factory.go:62	Creating new processors	{"partition": 316}
2019-04-04T18:44:13.169Z	DEBUG	processor/parallel_processor.go:50	Spawning goroutines to process messages	{"num_routines": 1}
2019-04-04T18:44:13.170Z	INFO	consumer/consumer.go:155	Closing partition consumer	{"partition": 316}
2019-04-04T18:44:13.171Z	INFO	consumer/consumer.go:131	Message channel closed. 	{"partition": 316}
2019-04-04T18:44:13.171Z	DEBUG	processor/parallel_processor.go:75	Initiated shutdown of processor goroutines
2019-04-04T18:44:13.171Z	INFO	consumer/consumer.go:158	Closed partition consumer	{"partition": 316}
2019-04-04T18:44:13.171Z	INFO	processor/parallel_processor.go:78	Completed shutdown of processor goroutines
2019-04-04T18:44:13.171Z	DEBUG	consumer/deadlock_detector.go:195	Closing deadlock detector	{"partition": 316}
2019-04-04T18:44:13.171Z	INFO	consumer/consumer.go:155	Closing partition consumer	{"partition": 316}
2019-04-04T18:44:13.172Z	INFO	consumer/deadlock_detector.go:115	Closing ticker routine	{"partition": 316}
2019-04-04T18:44:13.172Z	INFO	consumer/consumer.go:158	Closed partition consumer	{"partition": 316}
2019-04-04T18:44:13.172Z	INFO	consumer/consumer.go:162	Starting error handler	{"partition": 316}
2019-04-04T18:44:16.170Z	PANIC	consumer/deadlock_detector.go:71	No messages processed in the last check interval	{"partition": -1, "stack": "goroutine 7 [running]:\ngh.neting.cc/jaegertracing/jaeger/cmd/ingester/app/consumer.newDeadlockDetector.func1(0xc0ffffffff)\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/ingester/app/consumer/deadlock_detector.go:73 +0x2d4\ngh.neting.cc/jaegertracing/jaeger/cmd/ingester/app/consumer.(*deadlockDetector).start.func1(0xc00017a968, 0xc00000e300)\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/ingester/app/consumer/deadlock_detector.go:163 +0x288\ncreated by github.com/jaegertracing/jaeger/cmd/ingester/app/consumer.(*deadlockDetector).start\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/ingester/app/consumer/deadlock_detector.go:152 +0x346\n\ngoroutine 1 [chan receive]:\ntesting.(*T).Run(0xc0000d0400, 0xfb6220, 0x28, 0xfc9810, 0x1)\n\t/home/travis/.gimme/versions/go1.12.1.linux.amd64/src/testing/testing.go:917 +0x693\ntesting.runTests.func1(0xc0000d0400)\n\t/home/travis/.gimme/versions/go1.12.1.linux.amd64/src/testing/testing.go:1157 +0xa9\ntesting.tRunner(0xc0000d0400, 0xc000137d48)\n\t/home/travis/.gimme/versions/go1.12.1.linux.amd64/src/testing/testing.go:865 +0x164\ntesting.runTests(0xc0000b2f40, 0x18b55a0, 0x15, 0x15, 0x0)\n\t/home/travis/.gimme/versions/go1.12.1.linux.amd64/src/testing/testing.go:1155 +0x524\ntesting.(*M).Run(0xc000120200, 0x0)\n\t/home/travis/.gimme/versions/go1.12.1.linux.amd64/src/testing/testing.go:1072 +0x2ec\nmain.main()\n\t_testmain.go:140 +0x335\n\ngoroutine 6 [semacquire]:\nsync.runtime_Semacquire(0xc00000e468)\n\t/home/travis/.gimme/versions/go1.12.1.linux.amd64/src/runtime/sema.go:56 +0x39\nsync.(*WaitGroup).Wait(0xc00000e460)\n\t/home/travis/.gimme/versions/go1.12.1.linux.amd64/src/sync/waitgroup.go:130 +0xb2\ngh.neting.cc/jaegertracing/jaeger/cmd/ingester/app/consumer.(*Consumer).Close(0xc00017a8f0, 0xc000170300, 0xe9d480)\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/ingester/app/consumer/consumer.go:102 +0x175\ngh.neting.cc/jaegertracing/jaeger/cmd/ingester/app/consumer.TestSaramaConsumerWrapper_start_Messages(0xc000170300)\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/ingester/app/consumer/consumer_test.go:165 +0xb1e\ntesting.tRunner(0xc000170300, 0xfc9810)\n\t/home/travis/.gimme/versions/go1.12.1.linux.amd64/src/testing/testing.go:865 +0x164\ncreated by testing.(*T).Run\n\t/home/travis/.gimme/versions/go1.12.1.linux.amd64/src/testing/testing.go:916 +0x65b\n\ngoroutine 8 [chan receive]:\ngh.neting.cc/jaegertracing/jaeger/cmd/ingester/app/consumer.(*Consumer).Start.func1(0xc00017a8f0)\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/ingester/app/consumer/consumer.go:78 +0x289\ncreated by github.com/jaegertracing/jaeger/cmd/ingester/app/consumer.(*Consumer).Start\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/ingester/app/consumer/consumer.go:76 +0x79\n\ngoroutine 10 [semacquire]:\nsync.runtime_SemacquireMutex(0xc00017a9a4, 0x900000000)\n\t/home/travis/.gimme/versions/go1.12.1.linux.amd64/src/runtime/sema.go:71 +0x3d\nsync.(*Mutex).Lock(0xc00017a9a0)\n\t/home/travis/.gimme/versions/go1.12.1.linux.amd64/src/sync/mutex.go:134 +0x1e2\ngh.neting.cc/jaegertracing/jaeger/cmd/ingester/app/consumer.(*Consumer).handleErrors(0xc00017a8f0, 0x13c, 0xc000034660)\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/ingester/app/consumer/consumer.go:163 +0x1ce\ncreated by github.com/jaegertracing/jaeger/cmd/ingester/app/consumer.(*Consumer).Start.func1\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/ingester/app/consumer/consumer.go:92 +0x26b\n"}
github.com/jaegertracing/jaeger/cmd/ingester/app/consumer.newDeadlockDetector.func1
	/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/ingester/app/consumer/deadlock_detector.go:71
github.com/jaegertracing/jaeger/cmd/ingester/app/consumer.(*deadlockDetector).start.func1
	/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/ingester/app/consumer/deadlock_detector.go:163
panic: No messages processed in the last check interval
goroutine 7 [running]:
github.com/jaegertracing/jaeger/vendor/go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0002ba000, 0xc000196200, 0x2, 0x2)
	/home/travis/gopath/src/github.com/jaegertracing/jaeger/vendor/go.uber.org/zap/zapcore/entry.go:229 +0x70f
github.com/jaegertracing/jaeger/vendor/go.uber.org/zap.(*Logger).Panic(0xc00017e120, 0xfbbbcd, 0x30, 0xc000196200, 0x2, 0x2)
	/home/travis/gopath/src/github.com/jaegertracing/jaeger/vendor/go.uber.org/zap/logger.go:225 +0x96
github.com/jaegertracing/jaeger/cmd/ingester/app/consumer.newDeadlockDetector.func1(0xc0ffffffff)
	/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/ingester/app/consumer/deadlock_detector.go:71 +0x4d8
github.com/jaegertracing/jaeger/cmd/ingester/app/consumer.(*deadlockDetector).start.func1(0xc00017a968, 0xc00000e300)
	/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/ingester/app/consumer/deadlock_detector.go:163 +0x288
created by github.com/jaegertracing/jaeger/cmd/ingester/app/consumer.(*deadlockDetector).start
	/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/ingester/app/consumer/deadlock_detector.go:152 +0x346
FAIL	github.com/jaegertracing/jaeger/cmd/ingester/app/consumer	3.026s
make: *** [cover] Error 1
The command "if [ "$TESTS" == true ]; then make test-ci ; else echo 'skipping tests'; fi" exited with 2.
And something related to Travis itself
grouped write of manifest, lock and vendor: error while writing out vendor tree: failed to write dep tree: failed to export golang.org/x/net: failed to fetch source for https://go.googlesource.com/net: unable to get repository: Cloning into '/home/travis/gopath/pkg/dep/sources/https---go.googlesource.com-net'...
POST git-upload-pack (491 bytes)
remote: RESOURCE_EXHAUSTED Resource has been exhausted (e.g. check quota).
[type.googleapis.com/google.rpc.QuotaFailure]
violations {
  subject: "IP 104.154.113.151"
  description: "Short term bandwidth rate limit exceeded"
}
fatal: protocol error: bad pack header
: command failed: [git clone --recursive -v --progress https://go.googlesource.com/net /home/travis/gopath/pkg/dep/sources/https---go.googlesource.com-net]: exit status 128
make: *** [install] Error 1
The command "make install-ci" failed and exited with 2 during .

I restarted both jobs. Let's see if it helps.

@guanw
Copy link
Contributor Author

guanw commented Apr 5, 2019

There seems to be a deadlock in an ingester test.
And something related to Travis itself
I restarted both jobs. Let's see if it helps.

hmmm. Interesting.. the second round passed and I ran some go test -race github.com/jaegertracing/jaeger/cmd/ingester/<subdirectory> but they all ran fine.

@guanw
Copy link
Contributor Author

guanw commented Apr 5, 2019

@jpkrohling @yurishkuro Can either of you pick up the review now that tests are passing?

cmd/collector/app/metrics.go Outdated Show resolved Hide resolved
cmd/collector/app/grpc_handler.go Outdated Show resolved Hide resolved
cmd/collector/app/metrics.go Outdated Show resolved Hide resolved
Jude Wang and others added 7 commits April 5, 2019 14:58
Signed-off-by: Jude Wang <judew@uber.com>
Signed-off-by: Jude Wang <judew@uber.com>
Signed-off-by: Jude Wang <judew@uber.com>
Signed-off-by: Yuri Shkuro <ys@uber.com>
Signed-off-by: Yuri Shkuro <ys@uber.com>
Signed-off-by: Yuri Shkuro <ys@uber.com>
@yurishkuro yurishkuro changed the title adding collector transport metrics Add inbound transport as label to collector metrics Apr 6, 2019
@yurishkuro yurishkuro merged commit dbb6ea7 into jaegertracing:master Apr 6, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants