
feat: Rework second-copy scaling logic #49

Merged: 2 commits into main from slower-scale2 on Aug 2, 2022

Conversation

@njhill (Member) commented Jul 29, 2022

Motivation

Model-mesh will currently ensure that there are two copies of any model that has been used "recently", to make them "HA" and minimize the chance of disruption if a single pod dies. However, this has recently proved problematic in a case involving periodic invocation of a large number of models (for example, once per day). Second copies of all of these models are loaded within a small timeframe, filling the cache and evicting other recently used models. In this case there's little value in having second copies, since the usage is isolated.

A better heuristic is needed for deciding when redundant copies should be added.

Modifications

  • Remove the current logic related to second-copy triggering, including the methods invoked on the inference request path and from the regular janitor task.
  • Piggy-back on the existing frequent rate-tracking task to keep track of the "iteration numbers" in which single-copy models are loaded, and only trigger a second copy when there is a prior usage more than 7 minutes but less than 40 minutes ago. If the usage is confined to a window of less than 7 minutes it could be isolated; if usages are more than 40 minutes apart, the value of a second copy is lower (the probability of a pod death causing disruption is minimal). See the sketch after this list.
  • Bypass second-copy triggering altogether if the cache is full and the global LRU is recent.
  • More aggressively scale down second copies of inactive models, especially when the cache is full / the LRU is recent.
  • Update unit tests to reflect the new behaviour.
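
To make the window heuristic above concrete, here is a minimal sketch. It is illustrative only: the class and method names are hypothetical, and it assumes the rate-tracking task runs roughly every 10 seconds, which is what makes 42 iterations roughly 7 minutes and 240 iterations roughly 40 minutes.

// Hypothetical sketch, not the actual model-mesh code. Iteration numbers are
// assumed to advance once per rate-tracking task run (~10s apart).
class SecondCopyTriggerSketch {
    static final int MIN_AGE_ITERS = 42;   // ~7 minutes at a 10s task period
    static final int MAX_AGE_ITERS = 240;  // ~40 minutes at a 10s task period

    /** Should a single-copy model be given a second copy this iteration? */
    static boolean shouldAddSecondCopy(int currentIter, int lastUsedIter) {
        int age = currentIter - lastUsedIter;
        // <= MIN: usage may be an isolated burst; >= MAX: the risk of disruption
        // from losing the single copy is considered low enough to skip.
        return age > MIN_AGE_ITERS && age < MAX_AGE_ITERS;
    }
}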

Result

More effective use of the available model cache space, which should result in lower model memory requirements for many use cases.

Note that this does not affect auto-scaling behaviour based on request load: a single copy may still scale up immediately if subjected to sufficient load, independent of the redundant-copy logic described above.

Signed-off-by: Nick Hill <nickhill@us.ibm.com>

public static final long SECOND_COPY_ADD_AGE_UNUSED_MS = 10 * 60_000L; // 10mins
public static final long SECOND_COPY_ADD_AGE_USED_MS = 12 * 3600_000L; // 12hours
final int SECOND_COPY_MAX_AGE_ITERS = Integer.getInteger("mm.max_second_copy_age_itrs", 240);
final int SECOND_COPY_MIN_AGE_ITERS = Integer.getInteger("mm.min_second_copy_age_itrs", 42);
@joerunde Aug 1, 2022

Interesting, these time units are in iterations but the constants below are in ms. Any chance this can be refactored to take in ms (or minutes) here, and have the conversion to iterations done taking the rate checking task frequency into account?

@njhill (Member, Author)

@joerunde good question, I had considered this but it gets a bit messy depending on the granularity of the task frequency. I will have a go at making the change though since you've suggested it!
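
For reference, a minimal sketch of what that refactor could look like, assuming a fixed rate-check task period. The property names and the 10-second period are illustrative assumptions, not the actual model-mesh configuration.

// Hypothetical sketch: derive iteration counts from minute-based settings
// using the rate-check task period, rather than configuring iterations directly.
class IterationConversionSketch {
    static final long RATE_CHECK_PERIOD_MS = 10_000L; // assumed task period

    static int minutesToIterations(long minutes) {
        // Round up so a configured window is never silently shortened
        return (int) ((minutes * 60_000L + RATE_CHECK_PERIOD_MS - 1) / RATE_CHECK_PERIOD_MS);
    }

    public static void main(String[] args) {
        // With a 10s period: 7 minutes -> 42 iterations and 40 minutes -> 240
        // iterations, matching the current iteration-based defaults above.
        int minAgeIters = minutesToIterations(Long.getLong("mm.min_second_copy_age_mins", 7L));
        int maxAgeIters = minutesToIterations(Long.getLong("mm.max_second_copy_age_mins", 40L));
        System.out.println(minAgeIters + " " + maxAgeIters);
    }
}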


/* Last-used age after which second model copies will be removed
 * (2->1 scaledown). Note this is a maximum and a smaller value might
- * be used if the global age of the cache is small (less than 5x this value).
+ * be used if the global age of the cache is small (less than 10x this value).
@joerunde Aug 1, 2022

Should this heuristic also be configurable? For the large model use case, users expect the global cache age to be in days, and it would be a really bad thing if it ever went down to 1 hour.

I guess the rest of this fix aims to avoid the 1->2 scaleup to help the global cache age stay long in the first place, but in the case where enough of these larger models get scaled up to 2 copies anyway, maybe it would be useful to make this notion of "small" configurable, so that it could be set to, like... 12 hours? 24 hours?

Or, maybe it would be enough just to make this value (SECOND_COPY_REMOVE_MAX_AGE_MS) configurable, so users could more aggressively clean out second copies where they know models are large and cache space is at a premium?

@njhill (Member, Author)

I'm hoping this is a reasonable balance for now; we can always revisit exposing additional levers if it turns out to be needed.


final long before = logger.isDebugEnabled() ? nanoTime() : 0L;
boolean success = false;

What does success mean? Is this true both if we don't need to scale and didn't, or needed to scale and did?

@njhill (Member, Author) Aug 1, 2022

No, it's more to ensure we don't progress the common lastCheckTime if we didn't reset all of the individual counters of the cache entries used within the last time interval, i.e. if an unexpected exception caused us to abort prematurely. But looking closer at the nesting/ordering, I don't think it's really necessary, so I'll just remove it. Thanks for asking about it!

// Model was used within the target range [MIN_AGE, MAX_AGE] iterations ago
// so trigger loading of a second copy

// Don't do it if > 90% full and cache is younger than 6 hours
@joerunde Aug 1, 2022

🤔🤔🤔

I guess 90% seems reasonable, except maybe in the case where models are really small, like <0.01% of the cache each?

Should the 6 hours here be configurable too?

Super unclear where to draw the line between having way too many levers and having a user complain that 6 hours here is not great because they want their global lru to stay in the days

@njhill (Member, Author)

> I guess 90% seems reasonable, except maybe in the case where models are really small, like <0.01% of the cache each?

This would still be relatively close to full, and the cutoff would only apply if all of the very large number of models had been used recently (so still kind of reasonable behaviour).

> Super unclear where to draw the line between having way too many levers and having a user complain that 6 hours here is not great because they want their global lru to stay in the days

Exactly, an overarching goal is to avoid/minimize this kind of configuration as much as possible and just have it adapt and work well depending on the various workload parameters in each case.
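
For concreteness, a minimal sketch of the guard being discussed. The 90% / 6-hour thresholds come from the code comment above; the class, method, and parameter names are hypothetical.

// Hypothetical sketch of the 1->2 scale-up bypass, not the actual implementation.
class ScaleUpGuardSketch {
    static final double NEARLY_FULL_FRACTION = 0.90;          // "> 90% full"
    static final long MIN_GLOBAL_LRU_AGE_MS = 6L * 3600_000L; // "younger than 6 hours"

    /** Skip adding a second copy when cache space is at a premium. */
    static boolean suppressSecondCopy(long usedCacheUnits, long totalCacheUnits,
                                      long globalLruAgeMs) {
        boolean nearlyFull = usedCacheUnits > NEARLY_FULL_FRACTION * totalCacheUnits;
        boolean lruRecent = globalLruAgeMs < MIN_GLOBAL_LRU_AGE_MS;
        // Only suppress when both hold: a nearly-full cache with a young LRU means
        // a second copy would likely evict something that was itself used recently.
        return nearlyFull && lruRecent;
    }
}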

@joerunde commented Aug 1, 2022

Test looks okay, I'll trust that the internals work!

Just kinda wondering if there's a way to influence all these heuristics, maybe with a TARGET_MINIMUM_LRU configuration or something, so we could trigger different behavior in the small, lightweight-model case vs. the large, hard-to-load-model case without having a hundred different configurations exposed.

@njhill (Member, Author) commented Aug 2, 2022

Thanks @joerunde, I've made a few updates addressing and/or inspired by your comments. In particular, the add-copy LRU threshold is now based on the corresponding max lastUsed age, to ensure it makes sense relative to the 10% LRU scale-down logic. PTAL

@chinhuang007 left a comment

Looks like you have addressed Joe's comments. Thanks, @njhill! Some parameters could become configurable later if needed. We also need to clearly document the impact when we provide more knobs for users to turn.

@joerunde left a comment

lgtm too!

@kserve-oss-bot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chinhuang007, joerunde, njhill


@chinhuang007

/lgtm

@kserve-oss-bot merged commit 2790ef2 into main on Aug 2, 2022
@njhill deleted the slower-scale2 branch on August 2, 2022 at 17:02
njhill added a commit that referenced this pull request Aug 12, 2022
Motivation

Somehow I failed to push a prepared final commit to #49's branch before it was merged. Embarrassingly, this included the removal of a stdout debug message.

Modification

- Adjust min LRU for second copy scaleup: Make it a function of SECOND_COPY_MAX_AGE_SECS to ensure that there won't be flip-flopping with scaling down
- Remove spurious debug sysout

Result

Second-copy scaling thresholds should be better aligned, and debug messages no longer spew to stdout.

Signed-off-by: Nick Hill <nickhill@us.ibm.com>
kserve-oss-bot pushed a commit that referenced this pull request Aug 12, 2022