Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

App-shared port without DNS server is not necessarily an issue #4315

Merged
merged 1 commit into from
Oct 2, 2024

Conversation

milan-zededa
Copy link
Contributor

Not having DNS servers configured (while link is OK and IP address is assigned), is not necessarily a failure for an app-shared interface. Some applications might not require DNS for external hostnames and may only use internal DNS servers to resolve the names of other apps running on the same device. However, we should still issue a warning about this potential issue.

Not having DNS servers configured (while link is OK and IP address
is assigned), is not necessarily a failure for an app-shared interface.
Some applications might not require DNS for external hostnames
and may only use internal DNS servers to resolve the names of other apps
running on the same device. However, we should still issue a warning about
this potential issue.

Signed-off-by: Milan Lenco <milan@zededa.com>
@milan-zededa milan-zededa added the stable Should be backported to stable release(s) label Oct 1, 2024
@milan-zededa
Copy link
Contributor Author

@OhmSpectator It seems that we have caught some panic in the PSI unit test:

=== FAIL: agentlog TestPsiEveIntegratedStartTwice (10.08s)
    agentlog_test.go:624: could not start the integrated psi collector API: could not start the server in 10 seconds

=== FAIL: agentlog TestPsiEveIntegratedStopUnstarted (unknown)
	github.com/lf-edge/eve/pkg/pillar/agentlog	coverage: 24.8% of statements
panic: test timed out after 10m0s
running tests:
	TestPsiEveIntegratedStopUnstarted (9m43s)

https://github.com/lf-edge/eve/actions/runs/11123159400/job/30906006512?pr=4315

@OhmSpectator
Copy link
Member

This code fails:

	go agentlog.ListenDebug(logObj, "stackdump", "memdump")
	started := false
	// Wait for ListenDebug to start http service (try to get 200 response)
	// Limit the number of retries to avoid infinite loop, try for 10 seconds
	for i := 0; i < 100; i++ {
		resp, err := http.Get("http://127.0.0.1:6543")
		if err == nil && resp.StatusCode == 200 {
			started = true
			break
		}
		time.Sleep(100 * time.Millisecond)
	}
	if !started {
		return nil, fmt.Errorf("could not start the server in 10 seconds")
	

May it be somehow related to the changes in the PR?

@christoph-zededa, do we have other unit tests for the Debug Server? It fails to start for some reason...

@OhmSpectator
Copy link
Member

@milan-zededa, could you reproduce it locally?
Usually for that, I do:

cd pkg/pillar
make enter-docker-dev
... it enters container here
(container) cd pillar/agentlog
(container) go test -v

@christoph-zededa
Copy link
Contributor

@christoph-zededa, do we have other unit tests for the Debug Server? It fails to start for some reason...

We have only https://github.com/lf-edge/eve/blob/master/pkg/pillar/agentlog/http-debug_test.go#L66

Also I would not call this a unit test ...

@OhmSpectator
Copy link
Member

@christoph-zededa, do we have other unit tests for the Debug Server? It fails to start for some reason...

We have only https://github.com/lf-edge/eve/blob/master/pkg/pillar/agentlog/http-debug_test.go#L66

Also I would not call this a unit test ...

Ooph... Can we automate all the steps mentioned in the comments?.. Ok, @milan-zededa, this one is also to be checked if the failure is consistent... I'll try locally as well.

@christoph-zededa
Copy link
Contributor

Also I see:

2024-10-01T10:20:24.2670978Z ✖  agentlog (10m0.131s)

So the test times out ..., too.

@OhmSpectator
Copy link
Member

Also I see:

2024-10-01T10:20:24.2670978Z ✖  agentlog (10m0.131s)

So the test times out ..., too.

Only in this PR, or the master as well?

@christoph-zededa
Copy link
Contributor

Only in this PR, or the master as well?

I doubt it is in master otherwise we would have seen failed tests.

@OhmSpectator
Copy link
Member

Only in this PR, or the master as well?

I doubt it is in master otherwise we would have seen failed tests.

Then, this PR somehow affected our debug server started by Pillar...

@christoph-zededa
Copy link
Contributor

Then, this PR somehow affected our debug server started by Pillar...

Or the test is flaky ;-)

@OhmSpectator
Copy link
Member

I tried it locally and it ran successfully...

Copy link
Member

@OhmSpectator OhmSpectator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I restarted the tests here, and they also work fine...

✓ agentlog (24.775s)

Hm... Interesting; what would be a reason for this long server start?... @christoph-zededa, could we add a basic HttpDebug test that we can run in CI?

@christoph-zededa
Copy link
Contributor

I restarted the tests here, and they also work fine...

✓ agentlog (24.775s)

Hm... Interesting; what would be a reason for this long server start?... @christoph-zededa, could we add a basic HttpDebug test that we can run in CI?

seems it was in this mutex here: https://github.com/lf-edge/eve/actions/runs/11123159400/job/30906006512?pr=4315#step:3:583 (agentlog_test.go:408 )

I am not sure if another test helps, your test is already testing it as well and testing that the golang http server works is already covered by their (=golang standard lib) tests enough.

@OhmSpectator
Copy link
Member

I restarted the tests here, and they also work fine...

✓ agentlog (24.775s)

Hm... Interesting; what would be a reason for this long server start?... @christoph-zededa, could we add a basic HttpDebug test that we can run in CI?

seems it was in this mutex here: https://github.com/lf-edge/eve/actions/runs/11123159400/job/30906006512?pr=4315#step:3:583 (agentlog_test.go:408 )

I am not sure if another test helps, your test is already testing it as well and testing that the golang http server works is already covered by their (=golang standard lib) tests enough.

Okay, I see one case where we can get a deadlock here. I will fix it later. But it could have happened already when HttpSerrver failed to start in 10 seconds (I do not unlock the lock in this case and return).

@OhmSpectator
Copy link
Member

Also, can the lock still be locked if a signal interrupts the previous instance of the server running in a test? I'll create a ticket for that. I hate "easy go testing".

@christoph-zededa
Copy link
Contributor

Also, can the lock still be locked if a signal interrupts the previous instance of the server running in a test?

Like f.e. SIGTERM? The mutex is only valid during runtime.

@OhmSpectator
Copy link
Member

Also, can the lock still be locked if a signal interrupts the previous instance of the server running in a test?

Like f.e. SIGTERM? The mutex is only valid during runtime.

Hm... Then I have no idea how we could deadlock there... Wanna take a look? =)

@OhmSpectator
Copy link
Member

Yeah, indeed, it was the case. The server has not started in one test, forgot to unlock the mutex, and failed the next test in deadlock. I'll fix the deadlock soon.
I still have a question: why were 10 seconds insufficient for the server to start? Should we increase it time, @christoph-zededa? Or can it be the regular problem described by @milan-zededa before of the runner that does not give control to the VM?

@OhmSpectator
Copy link
Member

#4316 is here

@christoph-zededa
Copy link
Contributor

I still have a question: why were 10 seconds insufficient for the server to start?

Would be good to have a profile or more log messages to see what is going on.

@OhmSpectator
Copy link
Member

Would be good to have a profile or more log messages to see what is going on.

Weren't you able to reproduce it locally with the HttpDebug test you mentioned before?

@OhmSpectator
Copy link
Member

@milan-zededa Sorry for discussing all that time a failure in the HttpDebug test)) I'll also take a look at the code of the PR itself))

Copy link
Member

@OhmSpectator OhmSpectator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running the Eden tests.

Copy link
Contributor

@eriknordmark eriknordmark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@milan-zededa
Copy link
Contributor Author

Looks like tests already run and succeeded: https://github.com/lf-edge/eve/actions/runs/11134071507
No idea why they are not linked from this PR and instead it is shown as Skipped.

@OhmSpectator OhmSpectator merged commit e7bb230 into lf-edge:master Oct 2, 2024
63 of 70 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stable Should be backported to stable release(s)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants