Disable APF #4042

Closed

johnbelamaric
Contributor

Fixes #4023

The upstream bug is patched in kubernetes/kubernetes, but the fix has not yet been cherry-picked to the release branches. We can try re-enabling APF (API Priority and Fairness) once that is done or once we have upgraded to the 1.29 packages. Because Porch is an aggregated API server, APF in the primary API server may already protect Porch, but that has not been verified.

Signed-off-by: John Belamaric <jbelamaric@google.com>
@johnbelamaric
Contributor Author

Note: I am verifying that this fixes the problem; we should know by tomorrow. Without this change, we see a crash about 2-3 times per day.

@johnbelamaric
Contributor Author

johnbelamaric commented Sep 19, 2023

Ok, this sort of worked. No more APF errors, but I still saw one restart. Investigating further, I see memory spikes, so my guess is that the restart is an OOM kill. Here is the instance with APF disabled:

Screenshot from 2023-09-19 09-16-38

Looking back in time, we see regular spikes like these:

Memory:
Screenshot from 2023-09-19 09-18-42

CPU:
Screenshot from 2023-09-19 09-19-57

Is there something that happens periodically in Porch? I don't see anything particularly unusual in the logs, but I do see a lot of the "overnotifying" and "sending watch" messages. Here is the histogram of log entries:

Screenshot from 2023-09-19 09-26-30

This does seem to correlate with the other spikes.
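
For anyone wanting to reproduce the histogram outside the logging console, a minimal sketch of the bucketing step could look like the following. It assumes each log line starts with an ISO-8601 timestamp followed by a space; that format and the function name `histogram_by_minute` are illustrative assumptions, not Porch's actual log format or tooling.

```python
from collections import Counter
from datetime import datetime


def histogram_by_minute(lines):
    """Count log entries per minute.

    Assumption: every line begins with an ISO-8601 timestamp, then a space.
    Lines that do not match are skipped rather than treated as errors.
    """
    counts = Counter()
    for line in lines:
        stamp, _, _ = line.partition(" ")
        try:
            ts = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # no parseable timestamp at the start of the line
        # Truncate to the minute so entries fall into one-minute buckets.
        counts[ts.replace(second=0, microsecond=0)] += 1
    return counts
```

Sorting the resulting counter by key gives a time series that can be compared against the memory and CPU graphs.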

@johnbelamaric
Contributor Author

So my current hypothesis is this:

  • we saw APF errors because APF was actually getting triggered and had this bug
  • it was getting triggered because we start hammering the API server, probably due to a watch storm of some kind

Watch events sent just before the crash account for more than 85% of the log entries: over 400k in the 30 minutes before the crash. So something is wrong somewhere in that watch code.
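
The share quoted above can be checked with a short script. This is only a sketch under stated assumptions: log entries are already parsed into `(timestamp, message)` pairs, and watch-related entries are recognized by the substrings "sending watch" and "overnotifying" mentioned earlier in this thread. The exact log text may differ, and `watch_share` is a hypothetical helper, not part of Porch.

```python
from datetime import datetime, timedelta

# Assumption: substrings taken from the messages quoted in this thread.
WATCH_MARKERS = ("sending watch", "overnotifying")


def watch_share(entries, crash_time, window=timedelta(minutes=30)):
    """Return (watch_count, total_count, fraction) for the window before a crash.

    entries: iterable of (datetime, message) pairs.
    Only entries with crash_time - window <= timestamp < crash_time are counted.
    """
    start = crash_time - window
    total = 0
    watch = 0
    for ts, msg in entries:
        if start <= ts < crash_time:
            total += 1
            if any(marker in msg for marker in WATCH_MARKERS):
                watch += 1
    fraction = watch / total if total else 0.0
    return watch, total, fraction
```

A fraction that stays high across several crashes would support the watch-storm hypothesis; a high value before a single crash is suggestive but not conclusive.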

@johnbelamaric
Contributor Author

I don't think this is needed; #4048 is better.

/close

Development

Successfully merging this pull request may close these issues.

porch: porch-server crash with "Unable to derive new concurrency limits"