
Increase default MAXLOAD #123

Closed
LukeHandle opened this issue Oct 6, 2017 · 12 comments

@LukeHandle
Contributor

Hey there,

I see in #100 we made changes to the default MAXLOAD variable:

NEW: MAXLOAD - 10*cpu(dynamic by default, was 1000)
OLD: MAXLOAD="1000"

This had previously been set over in 4c03a4c with the logic that we'd rather have logging in the event of an issue, and a subsequent crash, than massive empty gaps that leave us guessing about what might have been going on.

While it's fair to say that recap and the tools it triggers will contribute to the load, the "black box" style record of why we crashed is the preferred default. I've spoken with Man Chung and he still believes that this value is too low - though he might chime in with more words.

@man-chung
Contributor

I think the dynamic MAXLOAD value will be too low. A single-socket quad core will be 40, or 80 with HT. Even a hex core with HT will only be 120. I was on a server the other day with a load of 160, and it was still perfectly workable on the SSH command line.
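The "10 × logical CPUs" dynamic default discussed above could be sketched roughly like this; the variable names are illustrative, not recap's actual code:

```shell
# Sketch of a dynamic MAXLOAD derived from the logical CPU count
# (siblings), assuming a factor of 10 as described in the thread.
cpus=$(nproc 2>/dev/null || grep -c '^processor' /proc/cpuinfo)
MAXLOAD=$((cpus * 10))
echo "dynamic MAXLOAD=${MAXLOAD}"
```

On the quad-core-with-HT box described above, `nproc` reports 8 logical CPUs, giving a MAXLOAD of 80.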

Recap runs via cron, so a load average spike could push the load above the threshold between recap's runs - resulting in no useful logs.

I'd take the trade-off of the server falling over a little sooner if I stand a better chance of capturing the root cause.

Does anyone know a formula that starts with a high result but grows more slowly as the base variable increases?

E.g.:

  • 1 sibling = 100
  • 5 siblings = 250
  • 10 siblings = 300
  • 20 siblings = 325

That sort of growing decay?
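One family of curves with that shape is a saturating formula. The constants below are purely hypothetical, chosen only to roughly track the numbers sketched above; they are not anything recap implements:

```shell
# Hypothetical saturating formula: MAXLOAD = 350 - 250 / siblings.
# Starts at 100 for one sibling and levels off toward 350 as the
# sibling count grows - a "growing decay" as described above.
maxload_for() {
    awk -v n="$1" 'BEGIN { printf "%d\n", 350 - 250 / n }'
}

maxload_for 1    # -> 100
maxload_for 5    # -> 300
maxload_for 20   # -> 337
```

The exact constants would need tuning, but the asymptote keeps the limit bounded even for very high sibling counts.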

@tonyskapunk
Contributor

Hey @LukeHandle thanks for opening this issue.

The main reason for the change was to get a dynamic limit, attempting to be closer to real scenarios, as the old one was generic and too high.

@man-chung thanks for pointing out the load of 160; that's a good example. The suggestion you provide is also quite valuable - I'm pretty sure we can accommodate something like that.

@tonyskapunk tonyskapunk added this to the 1.2.0 milestone Oct 6, 2017
@tonyskapunk
Contributor

Thinking about the ratio, I dug into the number of siblings detected on the most modern CPUs, and it can get really high - for example, a CPU with 28 cores exposes 56 siblings with HT[0]. The limit could end up above 3000!

Based on the feedback provided here, I'm thinking that an alternative is to always attempt to run recap.

  • The purpose of recap is to obtain a snapshot of the resources at run time, and if the server is experiencing high load (higher than the limit set), it would be quite useful to obtain that information instead of recap bailing out.
  • The purpose of the load evaluation is to prevent recap from adding more load to the system.

The alternative is to run recap with a timeout, in case the execution does not complete within a time frame, and to prevent other attempts from firing up. This way, if the server is under a load high enough to prevent recap from completing, the process will be terminated.

By default recap is executed every 5 minutes; the timeout could be set to half that time.
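For example, with the default 5-minute schedule, the cron entry could wrap recap in coreutils `timeout(1)` capped at 150 seconds. The path and crontab layout below are assumptions for illustration, not recap's shipped configuration:

```
# /etc/cron.d/recap (illustrative): run every 5 minutes, killed after 150s
*/5 * * * * root timeout 150 /usr/sbin/recap
```

`timeout` exits with status 124 when the command is killed, which also makes timed-out runs easy to spot.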

Thoughts?

[0] https://www.intel.com/content/www/us/en/products/processors/xeon/scalable/platinum-processors/platinum-8176f.html

@man-chung
Contributor

Unless you want to go back to looking purely at the number of cores.

In my mind, the simplest way is to run recap until the entire box goes down. That would be preferable to no data at all. In the end, the result is the same. I really don't think recap will bring a box down if it wasn't already headed that way anyway.

So I am onboard if you either:

  • Added execution timeout logic, or

  • Added lock file control, or

  • Added a pid/process check.

@tonyskapunk
Contributor

recap already prevents new processes from running via lock-file control. :D

Hypothetically, the timeout would only ensure that a run eventually ends.
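As a rough illustration of lock-file control of this kind, here is a sketch using `flock(1)` from util-linux; the paths and messages are illustrative, not recap's actual implementation:

```shell
#!/bin/sh
# Skip this run if a previous instance still holds the lock.
LOCKFILE="${LOCKFILE:-/tmp/recap.lock}"

exec 9>"$LOCKFILE"            # open (and create) the lock file on fd 9
if ! flock -n 9; then         # non-blocking: fail fast if already locked
    echo "recap already running; skipping" >&2
    exit 0
fi

echo "lock acquired; running reports"
# ... reports would run here; the lock is released when fd 9 closes ...
```

Because the kernel releases the lock when the process exits, even a crashed run can't wedge future runs, unlike a stale pid file.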

@LukeHandle
Contributor Author

So either:

  • Adding the execution timeout and removing the load check
  • Setting the default load check to 9999
  • Or did I misunderstand? The latter is more in line with what we've previously done - is there a particular benefit to the former?

@tonyskapunk
Contributor

@LukeHandle right, but I'm mostly looking into the first option, based on the purposes included below. Adding the timeout and removing the load check is a way to implement this.
If we set it to 9999, its purpose is pretty much the same as not evaluating it in the first place.

As mentioned before, I think the idea behind the implementation we currently have in place comes down to these two purposes:

  • The purpose of recap is to obtain a snapshot of the resources at run time, and if the server is experiencing high load (higher than the limit set), it would be quite useful to obtain that information instead of recap bailing out.
  • The purpose of the load evaluation is to prevent recap from adding more load to the system.

If a low limit (dynamic or fixed) is preventing recap from running, then the first purpose is not being met. This is a good reason to either increase or remove the load check.

If recap contributes to the load on the server, that is not good either, but I agree with what @man-chung commented:

I really don't think recap will bring a box down if it wasn't already headed that way anyway.

Now, the suggestion of adding a timeout is just in case there is enough load on a server to prevent recap from finishing before the next run starts.

In summary, I think we should:

  • Get rid of the load check
  • Possibly add a timeout in the cronjob.

Thoughts?

@man-chung
Contributor

I vote to remove the load check and implement the timeout, because:

  • Remove load check - it was probably going to crash anyway.
  • Add timeout - if the load was that high, it would probably never have completed anyway.

@LukeHandle
Contributor Author

Now, the suggestion of adding timeout is just in case there is such load on a server that could prevent recap from finishing before the next run starts.

Is this preferred over waiting for the existing run to exit? Do we think that if it got stuck, it will not get stuck the next time around? Either way, we won't launch another attempt, as the lock exists.

What advantage is there in also adding the timeout?

@tonyskapunk
Contributor

The main advantage I see in using a timeout is as a safety net in case recap gets stuck indefinitely.

It's a bit difficult to think of a concrete example, but let's say recap gets stuck on the fdisk report; that implies some of the previous reports produced logs. If we time it out, the next run will produce the other reports and potentially get stuck again on the fdisk one. Without the timeout, it would hang indefinitely without producing any further reports at all.

I'll start coding the deprecation of the load check then :), unless there is any other reason not to do so.

@tonyskapunk
Contributor

Please take some time to review #130 with the proposed implementation of what was discussed here.

@tonyskapunk
Contributor

Fixed in #130.
