
Illustrate how we use queues #71

Closed
wants to merge 2 commits

Conversation

kaituo
Collaborator

@kaituo kaituo commented May 25, 2021

Note: since there are a lot of dependencies, I only list the main class and test code to save reviewers' time. The build will fail due to the missing dependencies, so this PR is for review only and will not be merged. I will open one big combined PR at the end and merge it once all review PRs are approved. The code is currently missing unit tests; I am posting PRs now to meet the cutoff date (June 1), and will add unit tests, run performance tests, and fix bugs before the official release.

Description

We have created multiple queues for rate-limiting expensive requests. This PR illustrates how we actually use these queues.

  1. We store as many frequently used entity models in a cache as the memory limit (10% of the heap) allows. If an entity feature is a cache hit, we use the in-memory model to detect anomalies and record results via the result write queue.
  2. If an entity feature is a cache miss, we check whether there is free memory or whether another entity's model can be evicted. If an in-memory entity's access frequency is lower than the cache-miss entity's, we replace the lower-frequency entity's model with the higher-frequency entity's model. To load the higher-frequency entity's model, we first check whether a checkpoint exists on disk by sending a request to the checkpoint read queue. If a checkpoint exists, we load it into memory, perform detection, and save the result via the result write queue. Otherwise, we enqueue a cold start request to the cold start queue for model training. If training succeeds, we save the learned model via the checkpoint write queue.
  3. We also have a cold entity queue configured for cold entities; its model training and inference run serially, one after another, to limit resource usage.
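The hit/miss flow above can be sketched as a toy model. This is a minimal illustration only: the class name, the string-placeholder "models", and the plain FIFO queues are all hypothetical stand-ins, not the plugin's actual cache or rate-limiting queue implementations.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch of the cache hit/miss flow; names are illustrative.
class EntityCacheSketch {
    // per-entity access frequency; stands in for the real cache's priority tracking
    private final Map<String, Integer> frequency = new HashMap<>();
    // entity id -> in-memory model (placeholder string instead of a real model)
    private final Map<String, String> models = new HashMap<>();
    private final int capacity;

    // rate-limiting queues, simplified to FIFO queues of entity ids
    final Queue<String> checkpointReadQueue = new ArrayDeque<>();
    final Queue<String> coldStartQueue = new ArrayDeque<>();
    final Queue<String> resultWriteQueue = new ArrayDeque<>();

    EntityCacheSketch(int capacity) {
        this.capacity = capacity;
    }

    /** Process one feature sample for an entity. */
    void process(String entity, boolean checkpointOnDisk) {
        int freq = frequency.merge(entity, 1, Integer::sum);
        if (models.containsKey(entity)) {
            // cache hit: detect with the in-memory model, enqueue the result write
            resultWriteQueue.add(entity);
            return;
        }
        // cache miss: admit if there is room, otherwise try to evict a
        // lower-frequency entity's model
        if (models.size() >= capacity) {
            String victim = lowestFrequencyEntity();
            if (frequency.get(victim) >= freq) {
                // miss entity is not hotter than anything in memory;
                // it stays cold (handled by the cold entity queue)
                return;
            }
            models.remove(victim);
        }
        if (checkpointOnDisk) {
            // restore the saved model via the checkpoint read queue, then detect
            checkpointReadQueue.add(entity);
            models.put(entity, "model-" + entity);
            resultWriteQueue.add(entity);
        } else {
            // no saved model: request training through the cold start queue
            coldStartQueue.add(entity);
        }
    }

    private String lowestFrequencyEntity() {
        String victim = null;
        for (String e : models.keySet()) {
            if (victim == null || frequency.get(e) < frequency.get(victim)) {
                victim = e;
            }
        }
        return victim;
    }
}
```

With capacity 1, a hit on a cached entity only touches the result write queue; a miss on a colder entity is dropped (left to the cold entity queue), and a miss on a hotter entity with no checkpoint evicts the victim and enqueues a cold start request.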

Testing done:

  1. Manual tests using 10 HCAD detectors and 12,000 entities in a 3 node cluster.

Check List

  • [ Y ] Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@kaituo kaituo closed this Jun 22, 2021
kaituo added a commit that referenced this pull request Jul 12, 2021
This PR is a conglomerate of the following PRs.

#60
#64
#65
#67
#68
#69
#70
#71
#74
#75
#76
#77
#78
#79
#82
#83
#84
#92
#94
#93
#95
kaituo#1
kaituo#2
kaituo#3
kaituo#4
kaituo#5
kaituo#6
kaituo#7
kaituo#8
kaituo#9
kaituo#10

This spreadsheet contains the mappings from files to PR number (bug fix in my AD fork and tests are not included):
https://gist.github.com/kaituo/9e1592c4ac4f2f449356cb93d0591167
ohltyler pushed a commit that referenced this pull request Sep 1, 2021