Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v2] Configure health check extension for all configs #5861

Merged
merged 32 commits into from
Aug 23, 2024

Conversation

Wise-Wizard
Copy link
Contributor

@Wise-Wizard Wise-Wizard commented Aug 18, 2024

Which problem is this PR solving?

Part of #5633, part of #5859

Description of the changes

  • Integrate health check extension to monitor and report Jaeger V2 component's health
  • Enhance all-in-one CI test to ping the new health port

How was this change tested?

The changes were tested by running the following command:

make test
CI actions and new Unit Tests

Checklist

  • I have read CONTRIBUTING_GUIDELINES.md
  • I have signed all commits
  • I have added unit tests for the new functionality
  • I have run lint and test steps successfully
    • for jaeger: make lint test
    • for jaeger-ui: yarn lint and yarn test

Signed-off-by: Wise-Wizard <saransh.shankar@gmail.com>
@Wise-Wizard Wise-Wizard requested a review from a team as a code owner August 18, 2024 05:00
@dosubot dosubot bot added the v2 label Aug 18, 2024
@Wise-Wizard
Copy link
Contributor Author

Wise-Wizard commented Aug 18, 2024

Hi, so the health check configuration should be added to every config file individually itself right?

Copy link

codecov bot commented Aug 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.79%. Comparing base (8f2543c) to head (b1d0431).
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #5861   +/-   ##
=======================================
  Coverage   96.79%   96.79%           
=======================================
  Files         342      342           
  Lines       16525    16525           
=======================================
  Hits        15996    15996           
  Misses        341      341           
  Partials      188      188           
Flag Coverage Δ
badger_v1 8.05% <ø> (ø)
badger_v2 1.81% <ø> (ø)
cassandra-3.x-v1 16.61% <ø> (ø)
cassandra-3.x-v2 1.74% <ø> (ø)
cassandra-4.x-v1 16.61% <ø> (ø)
cassandra-4.x-v2 1.74% <ø> (ø)
elasticsearch-6.x-v1 18.77% <ø> (-0.02%) ⬇️
elasticsearch-7.x-v1 18.84% <ø> (+0.01%) ⬆️
elasticsearch-8.x-v1 19.03% <ø> (+0.01%) ⬆️
elasticsearch-8.x-v2 1.81% <ø> (+0.01%) ⬆️
grpc_v1 9.52% <ø> (ø)
grpc_v2 7.13% <ø> (-0.02%) ⬇️
kafka-v1 9.74% <ø> (ø)
kafka-v2 1.81% <ø> (ø)
memory_v2 1.81% <ø> (ø)
opensearch-1.x-v1 18.89% <ø> (ø)
opensearch-2.x-v1 18.88% <ø> (-0.02%) ⬇️
opensearch-2.x-v2 1.80% <ø> (-0.02%) ⬇️
unittests 95.27% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@yurishkuro
Copy link
Member

Yes. And you need your mention it in the service section.

Signed-off-by: Wise-Wizard <saransh.shankar@gmail.com>
@yurishkuro
Copy link
Member

See #5859

@Wise-Wizard
Copy link
Contributor Author

See #5859

Step 1 is done, I will proceed with e2e tests now.

Signed-off-by: Wise-Wizard <saransh.shankar@gmail.com>
@yurishkuro yurishkuro added the changelog:exprimental Change to an experimental part of the code label Aug 19, 2024
Signed-off-by: Wise-Wizard <saransh.shankar@gmail.com>
Signed-off-by: Wise-Wizard <saransh.shankar@gmail.com>
cmd/jaeger/collector-with-kafka.yaml Show resolved Hide resolved
cmd/jaeger/ingester-remote-storage.yaml Show resolved Hide resolved
cmd/jaeger/ingester-remote-storage.yaml Outdated Show resolved Hide resolved
cmd/jaeger/internal/integration/e2e_integration.go Outdated Show resolved Hide resolved
Wise-Wizard and others added 5 commits August 21, 2024 15:38
Signed-off-by: Wise-Wizard <saransh.shankar@gmail.com>
Signed-off-by: Wise-Wizard <saransh.shankar@gmail.com>
Signed-off-by: Wise-Wizard <saransh.shankar@gmail.com>
Signed-off-by: Wise-Wizard <saransh.shankar@gmail.com>
Signed-off-by: Yuri Shkuro <github@ysh.us>
@yurishkuro
Copy link
Member

error in the logs

  2024-08-21T21:59:58.834Z	error	extensions/extensions.go:52	Failed to start extension	{"kind": "extension", "name": "healthcheckv2", "error": "listen tcp 127.0.0.1:13132: bind: address already in use"}

Signed-off-by: Yuri Shkuro <github@ysh.us>
@Wise-Wizard
Copy link
Contributor Author

kafka tests are green, but grpc v2 is consistently failing. It's also missing logs from one of the binaries.

image
This is the error that is being shown.
Does this have to do something with the healthcheck extension?

Signed-off-by: Wise-Wizard <saransh.shankar@gmail.com>
Signed-off-by: Wise-Wizard <saransh.shankar@gmail.com>
Wise-Wizard and others added 2 commits August 22, 2024 17:09
Signed-off-by: Wise-Wizard <saransh.shankar@gmail.com>
@Wise-Wizard
Copy link
Contributor Author

kafka tests are green, but grpc v2 is consistently failing. It's also missing logs from one of the binaries.

image This is the error that is being shown. Does this have to do something with the healthcheck extension?

All remaining gRPC storage tests are being passed. Only the getServices operation is adding jaeger to the response for some reason.

@yurishkuro
Copy link
Member

While the test is running you can query it for traces with jaeger service name to see which endpoint is causing it to appear. It's possible that it comes from the health extension, although it would be strange since we did not enable global tracer, only jtracer for our parts.

@Wise-Wizard
Copy link
Contributor Author

Hi @yurishkuro, So I ran the gRPC integration test locally and the logs are perfect and no error is being reported in the test.
The jaeger variable is not being added in getService operation when running the storage-integration-test locally.
image
image

Signed-off-by: Yuri Shkuro <github@ysh.us>
@yurishkuro
Copy link
Member

I added logging of the traces corresponding to the service names that are retrieved, and turns out the "offending" service name "jaeger" is coming from traces on the write path

    integration.go:164: span: Service: jaeger, TraceID: 0e5d9afa746db440f417202ece817ba7, 
            Operation: jaeger.storage.v1.SpanWriterPlugin/WriteSpan

This is actually (a) a great find, and (b) a problem, because we should never be tracing the write path due to the risk of infinite self-trace loop.

@yurishkuro
Copy link
Member

what is weird is why this doesn't seem to happen when running the test locally...

Signed-off-by: Yuri Shkuro <github@ysh.us>
Signed-off-by: Yuri Shkuro <github@ysh.us>
Signed-off-by: Yuri Shkuro <github@ysh.us>
Signed-off-by: Yuri Shkuro <github@ysh.us>
Signed-off-by: Yuri Shkuro <github@ysh.us>
Signed-off-by: Yuri Shkuro <github@ysh.us>
Signed-off-by: Yuri Shkuro <github@ysh.us>
Signed-off-by: Yuri Shkuro <github@ysh.us>
Signed-off-by: Yuri Shkuro <github@ysh.us>
@yurishkuro yurishkuro merged commit 18cb683 into jaegertracing:main Aug 23, 2024
45 checks passed
@Wise-Wizard
Copy link
Contributor Author

Yes exactly, I had added the same logs in order to debug it, but surprisingly locally all the tests were being passed.

@yurishkuro
Copy link
Member

yurishkuro commented Aug 23, 2024

It's a timing issue. The traces used in the unit tests are directly written to the pipeline and quickly end up in the storage. But the extraneous traces from "jaeger" first go through SDK's batch exporter, which buffers spans for some time before flushing. In local tests those "jaeger" traces don't make it in time for the test traces to be detected. I was able to reproduce it locally with this diff:

diff --git a/cmd/jaeger/internal/integration/e2e_integration.go b/cmd/jaeger/internal/integration/e2e_integration.go
index 55f1517f..c65ca281 100644
--- a/cmd/jaeger/internal/integration/e2e_integration.go
+++ b/cmd/jaeger/internal/integration/e2e_integration.go
@@ -184,6 +184,8 @@ func (s *E2EStorageIntegration) doHealthCheck(t *testing.T) bool {
 		t.Logf("Received non-K status %s: %s", healthResponse.Status, string(body))
 		return false
 	}
+	t.Logf("%s is ready, but let's sleep a bit", s.BinaryName)
+	time.Sleep(30*time.Second + 3*time.Second)
 	return true
 }

diff --git a/cmd/jaeger/internal/integration/grpc_test.go b/cmd/jaeger/internal/integration/grpc_test.go
index 4f7c1cee..6cf0cddf 100644
--- a/cmd/jaeger/internal/integration/grpc_test.go
+++ b/cmd/jaeger/internal/integration/grpc_test.go
@@ -4,11 +4,9 @@
 package integration

 import (
-	"fmt"
 	"testing"

 	"github.com/jaegertracing/jaeger/plugin/storage/integration"
-	"github.com/jaegertracing/jaeger/ports"
 )

 type GRPCStorageIntegration struct {
@@ -33,7 +31,7 @@ func TestGRPCStorage(t *testing.T) {
 		E2EStorageIntegration: E2EStorageIntegration{
 			ConfigFile: "../../config-remote-storage.yaml",
 			// TODO this should be removed in favor of default health check endpoint
-			HealthCheckEndpoint: fmt.Sprintf("http://localhost:%d/", ports.QueryHTTP),
 		},
 	}
 	s.CleanUp = s.cleanUp

@Wise-Wizard
Copy link
Contributor Author

It's a timing issue. The traces used in the unit tests are directly written to the pipeline and quickly end up in the storage. But the extraneous traces from "jaeger" first go through SDK's batch exporter, which buffers spans for some time before flushing. In local tests those "jaeger" traces don't make it in time for the test traces to be detected. I was able to reproduce it locally with this diff:

diff --git a/cmd/jaeger/internal/integration/e2e_integration.go b/cmd/jaeger/internal/integration/e2e_integration.go
index 55f1517f..c65ca281 100644
--- a/cmd/jaeger/internal/integration/e2e_integration.go
+++ b/cmd/jaeger/internal/integration/e2e_integration.go
@@ -184,6 +184,8 @@ func (s *E2EStorageIntegration) doHealthCheck(t *testing.T) bool {
 		t.Logf("Received non-K status %s: %s", healthResponse.Status, string(body))
 		return false
 	}
+	t.Logf("%s is ready, but let's sleep a bit", s.BinaryName)
+	time.Sleep(30*time.Second + 3*time.Second)
 	return true
 }

diff --git a/cmd/jaeger/internal/integration/grpc_test.go b/cmd/jaeger/internal/integration/grpc_test.go
index 4f7c1cee..6cf0cddf 100644
--- a/cmd/jaeger/internal/integration/grpc_test.go
+++ b/cmd/jaeger/internal/integration/grpc_test.go
@@ -4,11 +4,9 @@
 package integration

 import (
-	"fmt"
 	"testing"

 	"github.com/jaegertracing/jaeger/plugin/storage/integration"
-	"github.com/jaegertracing/jaeger/ports"
 )

 type GRPCStorageIntegration struct {
@@ -33,7 +31,7 @@ func TestGRPCStorage(t *testing.T) {
 		E2EStorageIntegration: E2EStorageIntegration{
 			ConfigFile: "../../config-remote-storage.yaml",
 			// TODO this should be removed in favor of default health check endpoint
-			HealthCheckEndpoint: fmt.Sprintf("http://localhost:%d/", ports.QueryHTTP),
 		},
 	}
 	s.CleanUp = s.cleanUp

Ah, makes sense now.

JaredTan95 pushed a commit to JaredTan95/jaeger that referenced this pull request Aug 28, 2024
…5861)

**Which problem is this PR solving?**

Part of jaegertracing#5633, part of jaegertracing#5859

**Description of the changes**
* Integrate health check extension to monitor and report Jaeger V2
component's health
* Enhance all-in-one CI test to ping the new health port

**How was this change tested?**

The changes were tested by running the following command:

```bash
make test
```
```bash
CI actions and new Unit Tests
```
**Checklist**

- [x] I have read
[CONTRIBUTING_GUIDELINES.md](https://github.com/jaegertracing/jaeger/blob/master/CONTRIBUTING_GUIDELINES.md)
- [x] I have signed all commits
- [x] I have added unit tests for the new functionality
- [x] I have run lint and test steps successfully
  - `for jaeger: make lint test`
  - `for jaeger-ui: yarn lint` and `yarn test`

---------

Signed-off-by: Wise-Wizard <saransh.shankar@gmail.com>
Signed-off-by: Yuri Shkuro <github@ysh.us>
Co-authored-by: Yuri Shkuro <yurishkuro@users.noreply.github.com>
Co-authored-by: Yuri Shkuro <github@ysh.us>
Signed-off-by: Jared Tan <jian.tan@daocloud.io>
mahadzaryab1 pushed a commit to mahadzaryab1/jaeger that referenced this pull request Aug 31, 2024
…5861)

**Which problem is this PR solving?**

Part of jaegertracing#5633, part of jaegertracing#5859

**Description of the changes**
* Integrate health check extension to monitor and report Jaeger V2
component's health
* Enhance all-in-one CI test to ping the new health port

**How was this change tested?**

The changes were tested by running the following command:

```bash
make test
```
```bash
CI actions and new Unit Tests
```
**Checklist**

- [x] I have read
[CONTRIBUTING_GUIDELINES.md](https://github.com/jaegertracing/jaeger/blob/master/CONTRIBUTING_GUIDELINES.md)
- [x] I have signed all commits
- [x] I have added unit tests for the new functionality
- [x] I have run lint and test steps successfully
  - `for jaeger: make lint test`
  - `for jaeger-ui: yarn lint` and `yarn test`

---------

Signed-off-by: Wise-Wizard <saransh.shankar@gmail.com>
Signed-off-by: Yuri Shkuro <github@ysh.us>
Co-authored-by: Yuri Shkuro <yurishkuro@users.noreply.github.com>
Co-authored-by: Yuri Shkuro <github@ysh.us>
Signed-off-by: Mahad Zaryab <mahadzaryab1@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
changelog:exprimental Change to an experimental part of the code v2
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants