
feat: Rework second-copy scaling logic #49

Merged: 2 commits into main from slower-scale2 on Aug 2, 2022

Conversation

@njhill (Member) commented Jul 29, 2022

Motivation

Model-mesh will currently ensure that there are two copies of any model that has been used "recently", to make them "HA" and minimize the chance of disruption if a single pod dies. However, this has recently proved problematic in a case involving periodic invocation of a large number of models (for example, once per day). Second copies of all of these models are loaded within a small timeframe, filling the cache and evicting other recently used models. In this case there's little value in having second copies, since the usage is isolated.

A better heuristic is needed for deciding when redundant copies should be added.

Modifications

  • Remove the current logic related to second-copy triggering, including the methods invoked on the inference request path and from the regular janitor task.
  • Piggy-back on the existing frequent rate-tracking task to keep track of the "iteration numbers" in which single-copy models are loaded, and only trigger a second copy when there is a prior usage more than 7 minutes but less than 40 minutes ago. If the usage is confined to a window of less than 7 minutes it could be isolated; if usages are more than 40 minutes apart, the value of a second copy is lower (the probability of a pod death causing disruption is minimal). See the sketch after this list.
  • Bypass second-copy triggering altogether if the cache is full and the global LRU is recent.
  • More aggressively scale down second copies of inactive models, especially when the cache is full / the LRU is recent.
  • Update unit tests to reflect the new behaviour.
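
To make the window heuristic above concrete, here is a minimal sketch. It is illustrative only: the class and method names are hypothetical, and it assumes the rate-tracking task runs roughly every 10 seconds, which is what makes 42 iterations roughly 7 minutes and 240 iterations roughly 40 minutes.

// Hypothetical sketch, not the actual model-mesh code. Iteration numbers are
// assumed to advance once per rate-tracking task run (~10s apart).
class SecondCopyTriggerSketch {
    static final int MIN_AGE_ITERS = 42;   // ~7 minutes at a 10s task period
    static final int MAX_AGE_ITERS = 240;  // ~40 minutes at a 10s task period

    /** Should a single-copy model be given a second copy this iteration? */
    static boolean shouldAddSecondCopy(int currentIter, int lastUsedIter) {
        int age = currentIter - lastUsedIter;
        // <= MIN: usage may be an isolated burst; >= MAX: the risk of disruption
        // from losing the single copy is considered low enough to skip.
        return age > MIN_AGE_ITERS && age < MAX_AGE_ITERS;
    }
}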

Result

More effective use of the available model cache space, which should result in lower model memory requirements for many use cases.

Note that this does not affect auto-scaling behaviour based on request load: a single copy may still scale up immediately if subjected to sufficient load, independent of the redundant-copy logic described above.

Signed-off-by: Nick Hill <nickhill@us.ibm.com>

public static final long SECOND_COPY_ADD_AGE_UNUSED_MS = 10 * 60_000L; // 10mins
public static final long SECOND_COPY_ADD_AGE_USED_MS = 12 * 3600_000L; // 12hours
final int SECOND_COPY_MAX_AGE_ITERS = Integer.getInteger("mm.max_second_copy_age_itrs", 240);
final int SECOND_COPY_MIN_AGE_ITERS = Integer.getInteger("mm.min_second_copy_age_itrs", 42);
@joerunde Aug 1, 2022

Interesting, these time units are in iterations but the constants below are in ms. Any chance this can be refactored to take in ms (or minutes) here, and have the conversion to iterations done taking the rate checking task frequency into account?

@njhill (Member, Author)

@joerunde good question, I had considered this but it gets a bit messy depending on the granularity of the task frequency. I will have a go at making the change though since you've suggested it!
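
For reference, a minimal sketch of what that refactor could look like, assuming a fixed rate-check task period. The property names and the 10-second period are illustrative assumptions, not the actual model-mesh configuration.

// Hypothetical sketch: derive iteration counts from minute-based settings
// using the rate-check task period, rather than configuring iterations directly.
class IterationConversionSketch {
    static final long RATE_CHECK_PERIOD_MS = 10_000L; // assumed task period

    static int minutesToIterations(long minutes) {
        // Round up so a configured window is never silently shortened
        return (int) ((minutes * 60_000L + RATE_CHECK_PERIOD_MS - 1) / RATE_CHECK_PERIOD_MS);
    }

    public static void main(String[] args) {
        // With a 10s period: 7 minutes -> 42 iterations and 40 minutes -> 240
        // iterations, matching the current iteration-based defaults above.
        int minAgeIters = minutesToIterations(Long.getLong("mm.min_second_copy_age_mins", 7L));
        int maxAgeIters = minutesToIterations(Long.getLong("mm.max_second_copy_age_mins", 40L));
        System.out.println(minAgeIters + " " + maxAgeIters);
    }
}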


/* Last-used age after which second model copies will be removed
 * (2->1 scaledown). Note this is a maximum and a smaller value might
- * be used if the global age of the cache is small (less than 5x this value).
+ * be used if the global age of the cache is small (less than 10x this value).
@joerunde Aug 1, 2022

Should this heuristic also be configurable? For the large model use case, users expect the global cache age to be in days, and it would be a really bad thing if it ever went down to 1 hour.

I guess the rest of this fix aims to avoid the 1->2 scaleup to help the global cache age stay long in the first place, but in the case where enough of these larger models get scaled up to 2 copies anyway, maybe it would be useful to make this notion of "small" configurable, so that it could be set to, like... 12 hours? 24 hours?

Or, maybe it would be enough just to make this value (SECOND_COPY_REMOVE_MAX_AGE_MS) configurable, so users could more aggressively clean out second copies where they know models are large and cache space is at a premium?

@njhill (Member, Author)

I'm hoping this is a reasonable balance for now; we can always revisit exposing additional levers if it turns out to be needed.


final long before = logger.isDebugEnabled() ? nanoTime() : 0L;
boolean success = false;

What does success mean? Is this true both if we don't need to scale and didn't, or needed to scale and did?

@njhill (Member, Author) Aug 1, 2022

No, it's more to ensure we don't progress the common lastCheckTime if we didn't reset all of the individual counters of the cache entries used within the last time interval, i.e. if an unexpected exception caused us to abort prematurely. But looking closer at the nesting/ordering, I don't think it's really necessary, so I'll just remove it. Thanks for asking about it!

// Model was used within the target range [MIN_AGE, MAX_AGE] iterations ago
// so trigger loading of a second copy

// Don't do it if > 90% full and cache is younger than 6 hours
@joerunde Aug 1, 2022

🤔🤔🤔

I guess 90% seems reasonable, except maybe in the case where models are really small, like <0.01% of the cache each?

Should the 6 hours here be configurable too?

Super unclear where to draw the line between having way too many levers and having a user complain that 6 hours here is not great because they want their global lru to stay in the days

@njhill (Member, Author)

> I guess 90% seems reasonable, except maybe in the case where models are really small, like <0.01% of the cache each?

This would still be relatively close to full, and the cutoff would only apply if all of the very large number of models had been used recently (so still kind of reasonable behaviour).

> Super unclear where to draw the line between having way too many levers and having a user complain that 6 hours here is not great because they want their global lru to stay in the days

Exactly, an overarching goal is to avoid/minimize this kind of configuration as much as possible and just have it adapt and work well depending on the various workload parameters in each case.
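
For concreteness, a minimal sketch of the guard being discussed. The 90% / 6-hour thresholds come from the code comment above; the class, method, and parameter names are hypothetical.

// Hypothetical sketch of the 1->2 scale-up bypass, not the actual implementation.
class ScaleUpGuardSketch {
    static final double NEARLY_FULL_FRACTION = 0.90;          // "> 90% full"
    static final long MIN_GLOBAL_LRU_AGE_MS = 6L * 3600_000L; // "younger than 6 hours"

    /** Skip adding a second copy when cache space is at a premium. */
    static boolean suppressSecondCopy(long usedCacheUnits, long totalCacheUnits,
                                      long globalLruAgeMs) {
        boolean nearlyFull = usedCacheUnits > NEARLY_FULL_FRACTION * totalCacheUnits;
        boolean lruRecent = globalLruAgeMs < MIN_GLOBAL_LRU_AGE_MS;
        // Only suppress when both hold: a nearly-full cache with a young LRU means
        // a second copy would likely evict something that was itself used recently.
        return nearlyFull && lruRecent;
    }
}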

@joerunde commented Aug 1, 2022

Test looks okay, I'll trust that the internals work!

Just kinda wondering if there's a way to influence all these heuristics, maybe with a TARGET_MINIMUM_LRU configuration or something, so we could trigger different behavior in the small, lightweight-model case vs. the large, hard-to-load-model case without having a hundred different configurations exposed.

@njhill (Member, Author) commented Aug 2, 2022

Thanks @joerunde, I've made a few updates addressing and/or inspired by your comments. In particular, the add-copy LRU threshold is now based on the corresponding max lastUsed age, to ensure it makes sense relative to the 10% LRU scale-down logic. PTAL

@chinhuang007 left a comment

Looks like you have addressed Joe's comments. Thanks, @njhill! Some parameters could become configurable later if needed. We also need to clearly document the impact when we provide more knobs for users to turn.

@joerunde left a comment

lgtm too!

@kserve-oss-bot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chinhuang007, joerunde, njhill


@chinhuang007

/lgtm

@kserve-oss-bot merged commit 2790ef2 into main on Aug 2, 2022
@njhill deleted the slower-scale2 branch on August 2, 2022 at 17:02
njhill added a commit that referenced this pull request Aug 12, 2022
Motivation

Somehow I failed to push a prepared final commit to #49's branch before it was merged. Embarrassingly, this included the removal of a stdout debug message.

Modification

- Adjust min LRU for second copy scaleup: Make it a function of SECOND_COPY_MAX_AGE_SECS to ensure that there won't be flip-flopping with scaling down
- Remove spurious debug sysout

Result

Second-copy scaling thresholds should be better aligned, and debug messages no longer spew to stdout.

Signed-off-by: Nick Hill <nickhill@us.ibm.com>
kserve-oss-bot pushed a commit that referenced this pull request Aug 12, 2022