
Illustrate how we use queues #71

Closed
wants to merge 2 commits

Conversation

kaituo
Collaborator

@kaituo kaituo commented May 25, 2021

Note: since there are a lot of dependencies, I only list the main class and test code to save reviewers' time. The build will fail due to the missing dependencies, so this PR is for review only and will not be merged. I will open one big combined PR at the end and merge it once all review PRs are approved. The code is currently missing unit tests; I am posting PRs now to meet the cutoff date (June 1), and will add unit tests, run performance tests, and fix bugs before the official release.

Description

We have created multiple queues for rate-limiting expensive requests. This PR illustrates how we actually use these queues.

  1. We store as many frequently used entity models in a cache as the memory limit (10% of the heap) allows. If an entity feature is a cache hit, we use the in-memory model to detect anomalies and record results via the result write queue.
  2. If an entity feature is a cache miss, we check whether there is free memory or whether another entity's model can be evicted. If an in-memory entity's access frequency is lower than the cache-miss entity's, we replace the lower-frequency entity's model with the higher-frequency entity's model. To load the higher-frequency entity's model, we first check whether a checkpoint exists on disk by sending a request to the checkpoint read queue. If a checkpoint exists, we load it into memory, perform detection, and save the result via the result write queue. Otherwise, we enqueue a cold start request to the cold start queue for model training. If training succeeds, we save the learned model via the checkpoint write queue.
  3. We also have a cold entity queue configured for cold entities; its model training and inference run serially, one after another, to limit resource usage.
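The hit/miss flow above can be sketched as a toy model. This is a minimal illustration only: the class name, the string-placeholder "models", and the plain FIFO queues are all hypothetical stand-ins, not the plugin's actual cache or rate-limiting queue implementations.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch of the cache hit/miss flow; names are illustrative.
class EntityCacheSketch {
    // per-entity access frequency; stands in for the real cache's priority tracking
    private final Map<String, Integer> frequency = new HashMap<>();
    // entity id -> in-memory model (placeholder string instead of a real model)
    private final Map<String, String> models = new HashMap<>();
    private final int capacity;

    // rate-limiting queues, simplified to FIFO queues of entity ids
    final Queue<String> checkpointReadQueue = new ArrayDeque<>();
    final Queue<String> coldStartQueue = new ArrayDeque<>();
    final Queue<String> resultWriteQueue = new ArrayDeque<>();

    EntityCacheSketch(int capacity) {
        this.capacity = capacity;
    }

    /** Process one feature sample for an entity. */
    void process(String entity, boolean checkpointOnDisk) {
        int freq = frequency.merge(entity, 1, Integer::sum);
        if (models.containsKey(entity)) {
            // cache hit: detect with the in-memory model, enqueue the result write
            resultWriteQueue.add(entity);
            return;
        }
        // cache miss: admit if there is room, otherwise try to evict a
        // lower-frequency entity's model
        if (models.size() >= capacity) {
            String victim = lowestFrequencyEntity();
            if (frequency.get(victim) >= freq) {
                // miss entity is not hotter than anything in memory;
                // it stays cold (handled by the cold entity queue)
                return;
            }
            models.remove(victim);
        }
        if (checkpointOnDisk) {
            // restore the saved model via the checkpoint read queue, then detect
            checkpointReadQueue.add(entity);
            models.put(entity, "model-" + entity);
            resultWriteQueue.add(entity);
        } else {
            // no saved model: request training through the cold start queue
            coldStartQueue.add(entity);
        }
    }

    private String lowestFrequencyEntity() {
        String victim = null;
        for (String e : models.keySet()) {
            if (victim == null || frequency.get(e) < frequency.get(victim)) {
                victim = e;
            }
        }
        return victim;
    }
}
```

With capacity 1, a hit on a cached entity only touches the result write queue; a miss on a colder entity is dropped (left to the cold entity queue), and a miss on a hotter entity with no checkpoint evicts the victim and enqueues a cold start request.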

Testing done:

  1. Manual tests using 10 HCAD detectors and 12,000 entities in a 3 node cluster.

Check List

  • [ Y ] Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@kaituo kaituo closed this Jun 22, 2021
kaituo added a commit that referenced this pull request Jul 12, 2021
This PR is a conglomerate of the following PRs.

#60
#64
#65
#67
#68
#69
#70
#71
#74
#75
#76
#77
#78
#79
#82
#83
#84
#92
#94
#93
#95
kaituo#1
kaituo#2
kaituo#3
kaituo#4
kaituo#5
kaituo#6
kaituo#7
kaituo#8
kaituo#9
kaituo#10

This spreadsheet contains the mappings from files to PR number (bug fix in my AD fork and tests are not included):
https://gist.github.com/kaituo/9e1592c4ac4f2f449356cb93d0591167
ohltyler pushed a commit that referenced this pull request Sep 1, 2021