feat: Rework second-copy scaling logic #49
Conversation
Motivation

Model-mesh will currently ensure that there are two copies of any model which has been used "recently", to make them "HA" and minimize the chance of disruption if a single pod dies. However, this has recently proved problematic in one case involving a periodic invocation of a large number of models (for example once per day). Second copies of all of them are loaded within a small timeframe, filling the cache and evicting other recently-used models. In this case there's little value in having second copies since the usage is isolated. A better heuristic is needed for deciding when redundant copies should be added.

Modifications

- Remove the current logic related to second-copy triggering. This includes methods invoked on the inference request path and from the regular janitor task.
- Piggy-back on the existing frequent rate-tracking task to keep track of "iteration numbers" in which single-copy models are loaded, and only trigger a second copy when there's a prior usage more than 7 minutes but less than 40 minutes ago. If the usage is confined to a < 7 min window it could be isolated; if usages are > 40 min apart, the value of a second copy is lower (the probability of pod death causing disruption is minimal).
- Bypass the second-copy triggering altogether if the cache is full and the LRU age is low.
- More aggressively scale down second copies for inactive models, especially if the cache is full / LRU is recent.
- Update unit tests to reflect the new behaviour.

Result

More effective use of available model cache space, which should result in lower model memory requirements for many use cases.

Note that this does not affect auto-scaling behaviour based on request load. A single copy may now scale up immediately if subjected to sufficient load, independent of the aforementioned redundant-copy logic.

Signed-off-by: Nick Hill <nickhill@us.ibm.com>
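The 7–40 minute window described above can be sketched in terms of rate-tracking iterations. This is a minimal illustration only, assuming a 10-second task period; the class and method names are hypothetical, not the actual model-mesh code.

```java
// Hypothetical sketch of the iteration-window heuristic; constant values
// mirror the defaults in this PR, but names are illustrative.
public class SecondCopySketch {
    // Assuming the rate-tracking task runs every 10 seconds,
    // 7 minutes ~= 42 iterations and 40 minutes ~= 240 iterations
    static final long MIN_AGE_ITERS = 42;
    static final long MAX_AGE_ITERS = 240;

    /**
     * True if the model's prior single-copy usage falls inside the
     * [MIN_AGE, MAX_AGE] iteration window, so a second copy is warranted.
     */
    static boolean shouldAddSecondCopy(long currentIter, long lastUsedIter) {
        long age = currentIter - lastUsedIter;
        return age >= MIN_AGE_ITERS && age <= MAX_AGE_ITERS;
    }
}
```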
public static final long SECOND_COPY_ADD_AGE_UNUSED_MS = 10 * 60_000L; // 10mins
public static final long SECOND_COPY_ADD_AGE_USED_MS = 12 * 3600_000L; // 12hours
final int SECOND_COPY_MAX_AGE_ITERS = Integer.getInteger("mm.max_second_copy_age_itrs", 240);
final int SECOND_COPY_MIN_AGE_ITERS = Integer.getInteger("mm.min_second_copy_age_itrs", 42);
Interesting, these time units are in iterations but the constants below are in ms. Any chance this can be refactored to take in ms (or minutes) here, and have the conversion to iterations done taking the rate-checking task frequency into account?
@joerunde good question, I had considered this but it gets a bit messy depending on the granularity of the task frequency. I will have a go at making the change though since you've suggested it!
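The conversion being discussed could look something like the following. This is an illustrative sketch of the approach joerunde suggests, not the model-mesh implementation; the class, method, and parameter names are hypothetical.

```java
// Illustrative conversion from a configured age to rate-task iterations,
// so configuration stays in time units while internals use iterations.
public class IterConversion {
    /**
     * Convert an age in minutes to iterations of a task that runs every
     * taskPeriodMs, rounding up so a coarse task period never undershoots
     * the configured age.
     */
    static int minutesToIterations(long minutes, long taskPeriodMs) {
        long ms = minutes * 60_000L;
        return (int) ((ms + taskPeriodMs - 1) / taskPeriodMs);
    }
}
```

With a 10-second task period this reproduces the defaults above (7 min → 42 iterations, 40 min → 240 iterations).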
/* Last-used age after which second model copies will be removed
 * (2->1 scaledown). Note this is a maximum and a smaller value might
 * be used if the global age of the cache is small (less than 10x this value).
Should this heuristic also be configurable? For the large model use case, users expect the global cache age to be in days, and it would be a really bad thing if it ever went down to 1 hour.

I guess the rest of this fix aims to avoid the 1->2 scaleup to help the global cache age stay long in the first place, but in the case where enough of these larger models are scaled up to 2 copies anyway, maybe it would be useful to make this definition of "small" configurable, so that it could be set to like... 12 hours? 24 hours?

Or, maybe it would be enough just to make this value (SECOND_COPY_REMOVE_MAX_AGE_MS) configurable, so users could more aggressively clean out second copies where they know models are large and cache space is at a premium?
I'm hoping this is a reasonable balance for now, we can always revisit exposing additional levers if it turns out to be needed.
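The scale-down cap described in the code comment above can be sketched roughly as follows. This is a minimal illustration of the stated relationship (configured max age, reduced when the global cache age is under 10x that value); the names and exact arithmetic are hypothetical, not the PR's code.

```java
// Sketch of the 2->1 scale-down threshold: the configured max last-used
// age applies, but when the cache's global (oldest-entry) age is less than
// 10x that value, a proportionally smaller threshold is used instead.
public class ScaleDownSketch {
    static long effectiveRemoveAgeMs(long maxAgeMs, long globalCacheAgeMs) {
        // If the whole cache is churning faster than 10x the configured max,
        // remove second copies sooner so they don't crowd out first copies
        return Math.min(maxAgeMs, globalCacheAgeMs / 10);
    }
}
```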
final long before = logger.isDebugEnabled() ? nanoTime() : 0L;
boolean success = false;
What does success mean? Is this true both if we don't need to scale and didn't, or needed to scale and did?
No, it's more to ensure we don't progress the common lastCheckTime if we didn't reset all of the individual counters of the cache entries used within the last time interval, i.e. if there was an unexpected exception meaning that we aborted prematurely. But looking closer at the nesting/ordering I don't think it's really necessary, so I'll just remove it. Thanks for asking about it!
// Model was used within the target range [MIN_AGE, MAX_AGE] iterations ago
// so trigger loading of a second copy

// Don't do it if > 90% full and cache is younger than 6 hours
🤔🤔🤔
I guess 90% seems reasonable, except maybe in the case where models are really small, like <0.01% of the cache each?
Should the 6 hours here be configurable too?
Super unclear where to draw the line between having way too many levers and having a user complain that 6 hours here is not great because they want their global lru to stay in the days
> I guess 90% seems reasonable, except maybe in the case where models are really small, like <0.1% of the cache each?

This would still be relatively close to full, and the cutoff would only apply if all of the very large number of models had been used recently (so still kind of reasonable behaviour).

> Super unclear where to draw the line between having way too many levers and having a user complain that 6 hours here is not great because they want their global lru to stay in the days

Exactly, an overarching goal is to avoid/minimize this kind of configuration as much as possible and just have it adapt and work well depending on the various workload parameters in each case.
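The bypass being debated above can be expressed compactly. The 90% and 6-hour thresholds come from the code comment in this PR; the class and parameter names are hypothetical, for illustration only.

```java
// Minimal sketch of the scale-up bypass: skip the 1->2 scale-up when the
// cache is over 90% full and its oldest (LRU) entry is younger than 6 hours,
// since adding copies would just evict other recently-used models.
public class BypassSketch {
    static boolean bypassSecondCopy(double occupancyFraction, long lruAgeMs) {
        return occupancyFraction > 0.90 && lruAgeMs < 6 * 3_600_000L;
    }
}
```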
Test looks okay, I'll trust that the internals work! Just kinda wondering if there's a way to influence all these heuristics, maybe with like a
Signed-off-by: Nick Hill <nickhill@us.ibm.com>
Thanks @joerunde, I've made a few updates addressing and/or inspired by your comments. In particular, having the add-copy LRU threshold be based on the corresponding max lastUsed age, to ensure it makes sense relative to the 10% LRU scale-down logic. PTAL
Looks like you have addressed Joe's comments. Thanks, @njhill! Some parameters could become configurable later if needed. We'd also need to clearly document the impact when we provide more knobs for users to turn.
lgtm too!
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: chinhuang007, joerunde, njhill The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /lgtm
Motivation

Somehow I failed to push a prepared final commit to #49's branch before it was merged. Embarrassingly, this includes removal of a stdout debug message.

Modification

- Adjust min LRU for second copy scaleup: make it a function of SECOND_COPY_MAX_AGE_SECS to ensure that there won't be flip-flopping with scaling down.
- Remove spurious debug sysout.

Result

Second copy scaling thresholds should be better aligned, and debug messages will stop spewing to stdout.

Signed-off-by: Nick Hill <nickhill@us.ibm.com>
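The threshold alignment described above could be sketched as below. This is a purely illustrative guess at the shape of the relationship (deriving the scale-up LRU floor from the scale-down max age so the two rules can't flip-flop); the factor and names are hypothetical, not the PR's actual values.

```java
// Illustrative alignment of the 1->2 scale-up LRU threshold with the
// 2->1 scale-down max age: only add a copy when the cache's LRU age
// already exceeds the age at which that copy would be removed again.
public class ThresholdAlignment {
    static long minLruForScaleUpMs(long secondCopyMaxAgeSecs) {
        // A freshly added second copy should not be immediately eligible
        // for removal under the scale-down rule
        return secondCopyMaxAgeSecs * 1000L;
    }
}
```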