Increase default MAXLOAD #123
Comments
I think the dynamic MAXLOAD value will be too low. A single-processor quad core will be 40, or 80 with HT. Even a hex core with HT will only be 120. I was on a server the other day with a load of 160, and it was still perfectly workable on the SSH command line. Recap is run via cron, so a load average spike might push the load average above the threshold at recap's interval, resulting in no useful logs. I'd take the trade-off of the server falling over that little bit sooner if I stand a better chance of capturing the root cause. Does anyone know a formula that gives a high result to begin with, but a not-so-high result as you increase your base variables? EG |
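As an illustration of the kind of formula asked about above, here is a minimal sketch and not recap's actual calculation. The factor of 10 per logical CPU is only inferred from the 40, 80 and 120 figures quoted in the comment, and the square-root formula, its `scale` constant and both function names are hypothetical.

```python
import math


def linear_maxload(logical_cpus, per_cpu=10):
    """Limit that grows in direct proportion to the CPU count.

    per_cpu=10 is an assumption inferred from the figures in the thread.
    """
    return per_cpu * logical_cpus


def sublinear_maxload(logical_cpus, scale=40.0):
    """Hypothetical limit that grows with the square root of the CPU count."""
    return round(scale * math.sqrt(logical_cpus))


if __name__ == "__main__":
    # Compare both limits for a few machine sizes.
    for cpus in (4, 8, 12, 56):
        print(cpus, linear_maxload(cpus), sublinear_maxload(cpus))
```

With these assumed constants the two limits meet at 16 logical CPUs: below that the square-root version is more generous (80 instead of 40 on a quad core), and above it the limit grows more slowly than the linear one.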
Hey @LukeHandle, thanks for opening this issue. The main reason for the change was to get a dynamic limit that is closer to real scenarios, as the old one was generic and too high. @man-chung, thanks for pointing out the load of 160, that's a good example. The suggestion you provide is also quite valuable, and I'm pretty sure we can accommodate something like that. |
Thinking about the ratio, I dug into the number of siblings detected on the most modern CPUs, and it can get really high, for example a CPU with 28 cores, 56 cores and HT[0]. The limit could be above 3000! Based on the feedback provided here, I'm thinking that an alternative is to always attempt to run recap.
The alternative is to run recap with By default recap is executed every 5min, the timeout could be set to half that time. Thoughts? |
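The "half the interval" idea above can be sketched as follows. This is only an illustration under stated assumptions: `run_with_timeout` is a hypothetical helper, the 300-second default stands for the 5-minute cron interval mentioned in the comment, and the command it runs is a stand-in rather than recap itself.

```python
import subprocess
import sys


def run_with_timeout(command, interval_seconds=300):
    """Run command, giving up after half of interval_seconds.

    Returns the exit code, or None if the run was stopped by the timeout.
    """
    timeout_seconds = interval_seconds / 2
    try:
        completed = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return None
    return completed.returncode


if __name__ == "__main__":
    # Stand-in for a recap run; a real cron job would call the recap script.
    quick = [sys.executable, "-c", "print('report written')"]
    print(run_with_timeout(quick))
```

A run that finishes in time returns its exit code, while a run that hangs is ended well before the next scheduled run would start.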
Unless you want to go back to simply looking at the number of cores. In my mind, the simplest approach is to run recap until the entire box goes down. That would be preferable to no data at all. In the end, the result is the same. I really don't think recap will bring a box down if it wasn't already headed that way anyway. So I am on board if you either:
|
Hypothetically the timeout would only ensure a run would end. |
So either:
|
@LukeHandle right, but mostly looking into the first option, based on the purposes included below. Adding As mentioned before, I think the idea behind the implementation we currently have in place has to deal with these two purposes:
If having a low limit (dynamic or fixed) is preventing recap from running, then the first purpose is not being met. This is a good reason to either increase or remove the load check. If recap contributes to load on the server, that is not good either, but I agree with what @man-chung commented:
Now, the suggestion of adding In summary, I think we should:
Thoughts? |
I vote to remove the load check and implement a timeout, because:
|
Is this preferred over waiting for the existing run to exit? Do we think that if it got stuck, it will not be stuck the next time we run? Either way, we won't be launching another attempt as the lock exists. What advantage is there from also adding the timeout? |
The main advantage I see with the use of It is a bit difficult to think of a formal example, but let's say I'll start coding the deprecation of load then :), unless there is any other reason not to do so. |
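To picture how the lock and a timeout discussed above interact, here is a minimal sketch. It is not recap's actual locking code: the lock-file approach, the `try_acquire` and `release` helpers and the `example.lock` name are all hypothetical.

```python
import os
import tempfile


def try_acquire(lock_path):
    """Create the lock file atomically; return False if it already exists."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))
    return True


def release(lock_path):
    """Remove the lock file so the next scheduled run can start."""
    os.remove(lock_path)


if __name__ == "__main__":
    lock_path = os.path.join(tempfile.mkdtemp(), "example.lock")
    print(try_acquire(lock_path))  # first run takes the lock
    print(try_acquire(lock_path))  # next run is skipped while it is held
    release(lock_path)             # a run ended by the timeout frees the lock
    print(try_acquire(lock_path))  # the following run can start again
```

Under these assumptions the lock alone only prevents overlapping runs. If a run hangs forever, every later run is skipped, whereas a timeout bounds how long the lock can be held.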
Please take some time to review #130 with the proposed implementation of what was discussed in here. |
fixed in #130 |
Hey there,
I see in #100 we made changes to the default MAXLOAD variable:
This had previously been set over in 4c03a4c with the logic that we'd rather have logging in the event of an issue and further crash than massive empty gaps where we are left guessing what might have been going on.
While it is agreeable that recap and the tools it triggers will contribute to the load, the "black box" style information about why we crashed is the preferred default. I've spoken with Man Chung and he still believes that this value is too low, though he might chime in with more words.