
Increase default MAXLOAD #123

Closed
LukeHandle opened this issue Oct 6, 2017 · 12 comments

@LukeHandle
Contributor

Hey there,

I see in #100 we made changes to the default MAXLOAD variable:

NEW: MAXLOAD - 10*cpu(dynamic by default, was 1000)
OLD: MAXLOAD="1000"

This had previously been set over in 4c03a4c with the logic that we'd rather have logging in the event of an issue, and a subsequent crash, than massive empty gaps that leave us guessing about what might have been going on.

While it's fair to say that recap and the tools it triggers will contribute to the load, the "black box" style record of why we crashed is the preferred default. I've spoken with Man Chung and he still believes that this value is too low - though he might chime in with more words.

@man-chung
Contributor

I think the dynamic MAXLOAD value will be too low. A single-socket quad core will be 40, or 80 with HT. Even a hex core with HT will only be 120. I was on a server the other day with a load of 160, and it was still perfectly workable on the SSH command line.
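The "10 × logical CPUs" dynamic default discussed above could be sketched roughly like this; the variable names are illustrative, not recap's actual code:

```shell
# Sketch of a dynamic MAXLOAD derived from the logical CPU count
# (siblings), assuming a factor of 10 as described in the thread.
cpus=$(nproc 2>/dev/null || grep -c '^processor' /proc/cpuinfo)
MAXLOAD=$((cpus * 10))
echo "dynamic MAXLOAD=${MAXLOAD}"
```

On the quad-core-with-HT box described above, `nproc` reports 8 logical CPUs, giving a MAXLOAD of 80.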

Recap runs via cron, so a load average spike could push the load above the threshold between recap's runs - resulting in no useful logs.

I'd take the trade-off of the server falling over a little sooner if I stand a better chance of capturing the root cause.

Does anyone know a formula that starts with a high result but grows more slowly as the base variable increases?

E.g.:

  • 1 sibling = 100
  • 5 siblings = 250
  • 10 siblings = 300
  • 20 siblings = 325

That sort of growing decay?
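One family of curves with that shape is a saturating formula. The constants below are purely hypothetical, chosen only to roughly track the numbers sketched above; they are not anything recap implements:

```shell
# Hypothetical saturating formula: MAXLOAD = 350 - 250 / siblings.
# Starts at 100 for one sibling and levels off toward 350 as the
# sibling count grows - a "growing decay" as described above.
maxload_for() {
    awk -v n="$1" 'BEGIN { printf "%d\n", 350 - 250 / n }'
}

maxload_for 1    # -> 100
maxload_for 5    # -> 300
maxload_for 20   # -> 337
```

The exact constants would need tuning, but the asymptote keeps the limit bounded even for very high sibling counts.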

@tonyskapunk
Contributor

Hey @LukeHandle thanks for opening this issue.

The main reason for the change was to get a dynamic limit, attempting to be closer to real scenarios, as the old one was generic and too high.

@man-chung thanks for pointing out the load of 160; that's a good example. The suggestion you provide is also quite valuable - I'm pretty sure we can accommodate something like that.

@tonyskapunk tonyskapunk added this to the 1.2.0 milestone Oct 6, 2017
@tonyskapunk
Contributor

Thinking about the ratio, I dug into the number of siblings detected on the most modern CPUs, and it can get really high - for example, a CPU with 28 cores exposes 56 siblings with HT[0]. The limit could end up above 3000!

Based on the feedback provided here, I'm thinking that an alternative is to always attempt to run recap.

  • The purpose of recap is to obtain a snapshot of the resources at run time, and if the server is experiencing high load (higher than the limit set), it would be quite useful to obtain that information instead of recap bailing out.
  • The purpose of the load evaluation is to prevent recap from adding more load to the system.

The alternative is to run recap with a timeout, in case the execution does not complete within a time frame, and to prevent other attempts from firing up. This way, if the server is under a load high enough to prevent recap from completing, the process will be terminated.

By default recap is executed every 5 minutes; the timeout could be set to half that time.
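For example, with the default 5-minute schedule, the cron entry could wrap recap in coreutils `timeout(1)` capped at 150 seconds. The path and crontab layout below are assumptions for illustration, not recap's shipped configuration:

```
# /etc/cron.d/recap (illustrative): run every 5 minutes, killed after 150s
*/5 * * * * root timeout 150 /usr/sbin/recap
```

`timeout` exits with status 124 when the command is killed, which also makes timed-out runs easy to spot.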

Thoughts?

[0] https://www.intel.com/content/www/us/en/products/processors/xeon/scalable/platinum-processors/platinum-8176f.html

@man-chung
Contributor

Unless you want to go back to looking purely at the number of cores.

In my mind, the simplest way is to run recap until the entire box goes down. That would be preferable to no data at all. In the end, the result is the same. I really don't think recap will bring a box down if it wasn't already headed that way anyway.

So I am onboard if you either:

  • Added execution timeout logic, or

  • Added lock file control, or

  • Added a pid/process check.

@tonyskapunk
Contributor

recap already prevents new processes from running via lock-file control. :D

Hypothetically, the timeout would only ensure that a run eventually ends.
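As a rough illustration of lock-file control of this kind, here is a sketch using `flock(1)` from util-linux; the paths and messages are illustrative, not recap's actual implementation:

```shell
#!/bin/sh
# Skip this run if a previous instance still holds the lock.
LOCKFILE="${LOCKFILE:-/tmp/recap.lock}"

exec 9>"$LOCKFILE"            # open (and create) the lock file on fd 9
if ! flock -n 9; then         # non-blocking: fail fast if already locked
    echo "recap already running; skipping" >&2
    exit 0
fi

echo "lock acquired; running reports"
# ... reports would run here; the lock is released when fd 9 closes ...
```

Because the kernel releases the lock when the process exits, even a crashed run can't wedge future runs, unlike a stale pid file.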

@LukeHandle
Contributor Author

So either:

  • Adding the execution timeout and removing the load check
  • Setting the default load check to 9999
  • Or did I misunderstand? The latter is more in line with what we've previously done - is there a particular benefit to the former?

@tonyskapunk
Contributor

@LukeHandle right, but I'm mostly looking into the first option, based on the purposes included below. Adding the timeout and removing the load check is a way to implement this.
If we set it to 9999, its purpose is pretty much the same as not evaluating it in the first place.

As mentioned before, I think the idea behind the implementation we currently have in place comes down to these two purposes:

  • The purpose of recap is to obtain a snapshot of the resources at run time, and if the server is experiencing high load (higher than the limit set), it would be quite useful to obtain that information instead of recap bailing out.
  • The purpose of the load evaluation is to prevent recap from adding more load to the system.

If a low limit (dynamic or fixed) is preventing recap from running, then the first purpose is not being met. This is a good reason to either increase or remove the load check.

If recap contributes to the load on the server, that is not good either, but I agree with what @man-chung commented:

I really don't think recap will bring a box down if it wasn't already headed that way anyway.

Now, the suggestion of adding a timeout is just in case there is enough load on a server to prevent recap from finishing before the next run starts.

In summary, I think we should:

  • Get rid of the load check
  • Possibly add a timeout in the cronjob.

Thoughts?

@man-chung
Contributor

I vote to remove the load check and implement the timeout, because:

  • Remove load check - it was probably going to crash anyway.
  • Add timeout - if the load was that high, it would probably never have completed anyway.

@LukeHandle
Contributor Author

Now, the suggestion of adding timeout is just in case there is such load on a server that could prevent recap from finishing before the next run starts.

Is this preferred over waiting for the existing run to exit? Do we think that if it got stuck, it will not get stuck the next time around? Either way, we won't launch another attempt, as the lock exists.

What advantage is there in also adding the timeout?

@tonyskapunk
Contributor

The main advantage I see in using a timeout is as a safety net in case recap gets stuck indefinitely.

It's a bit difficult to think of a concrete example, but let's say recap gets stuck on the fdisk report; that implies some of the previous reports produced logs. If we time it out, the next run will produce the other reports and potentially get stuck again on the fdisk one. Without the timeout, it would hang indefinitely without producing any further reports at all.

I'll start coding the deprecation of the load check then :), unless there is any other reason not to do so.

@tonyskapunk
Contributor

Please take some time to review #130 with the proposed implementation of what was discussed here.

@tonyskapunk
Contributor

Fixed in #130.
