
multi-category support, rate limiting, and pagination #121

Merged
merged 13 commits into from
Jul 12, 2021

Conversation

Collaborator

@kaituo kaituo commented Jul 6, 2021

Description

This PR is a conglomerate of the following PRs.

#60
#64
#65
#67
#68
#69
#70
#71
#74
#75
#76
#77
#78
#79
#82
#83
#84
#92
#94
#93
#95
kaituo#1
kaituo#2
kaituo#3
kaituo#4
kaituo#5
kaituo#6
kaituo#7
kaituo#8
kaituo#9
kaituo#10

This spreadsheet maps files to PR numbers (bug fixes in my AD fork and tests are not included):
https://gist.github.com/kaituo/9e1592c4ac4f2f449356cb93d0591167

Issues Resolved

#85 (rate limiting)
#87 (multiple category fields)
#86 (pagination)

Check List

  • [x] Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

kaituo added 12 commits July 6, 2021 15:17
EntityColdStartWorker used COLD_ENTITY_QUEUE_CONCURRENCY instead of ENTITY_COLD_START_QUEUE_CONCURRENCY. This PR fixes the bug.

This PR also changes AnomalyDetectorSettings.java by
* removing the unused setting COLD_ENTITY_QUEUE_CONCURRENCY
* renaming the setting from entity_coldstart to entity_cold_start to be consistent

Testing done:
1. Added unit tests for the changed code.
We schedule the next run if there are more requests than a single batch allows. Previously, we used filteredRequests.size > batchSize to decide whether a next run is required. But filteredRequests may be small even though many requests remain in the queue. This PR fixes the bug by using requestSize > batchSize instead.

This PR also changes the return type of pullRequests, as we don't use its return value.

Testing done:
1. Added unit tests for the changed code.
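
To illustrate the fix, here is a minimal sketch; the method and field names (pullRequests, requestQueue, scheduleNextRun) are illustrative, not the worker's exact API:

void processBatch() {
    int requestSize = requestQueue.size();                    // requests waiting before this batch
    List<Request> filtered = filter(pullRequests(batchSize)); // filtering may drop e.g. expired requests
    execute(filtered);
    // Old check: filtered.size() > batchSize, which is often false after filtering,
    // so follow-up runs were skipped while the queue was still full.
    // New check: compare the pre-filter request count against the batch size.
    if (requestSize > batchSize) {
        scheduleNextRun();
    }
}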
The stats API broadcasts requests to all nodes and renders node responses using toXContent. For the local node, the stats API calls toXContent on the node response directly. For a remote node, the coordinating node gets serialized content from ADStatsNodeResponse.writeTo, deserializes it, and renders the result using toXContent. ADStatsNodeResponse.writeTo uses StreamOutput::writeGenericValue, which only recognizes built-in types (e.g., String or primitive types). It throws exceptions for types such as java.time.Instant and Entity. This PR fixes the bug by converting non-built-in types to built-in types in the node stats response.

This PR also removes the model id from Entity.toString, as it is random across runs, and keeps only the attributes. This helps locate the same node in the hash ring.

Testing done:
* Verified stats API response on a multi-node cluster for a multiple-category field detector.
* Added unit tests.
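
The conversion can be sketched roughly as follows; the map and stream names (statsMap, out) are assumptions for illustration, while StreamOutput::writeGenericValue and java.time.Instant are the APIs actually involved:

// Replace values writeGenericValue cannot serialize with built-in equivalents.
Map<String, Object> converted = new HashMap<>();
for (Map.Entry<String, Object> entry : statsMap.entrySet()) {
    Object value = entry.getValue();
    if (value instanceof Instant) {
        converted.put(entry.getKey(), ((Instant) value).toEpochMilli()); // long is a built-in type
    } else if (value instanceof Entity) {
        converted.put(entry.getKey(), value.toString());                 // attributes only, no model id
    } else {
        converted.put(entry.getKey(), value);
    }
}
out.writeGenericValue(converted);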
…ing once

Currently, I prune overflow requests over multiple loop iterations. This PR calculates how many requests to prune directly, avoiding the while loop.

This PR also:
* Moved the hardcoded worker names into variables for code reuse.
* Fixed a bug in RateLimitedRequestWorker's constructor, which did not assign variables in the setting update consumer.

Testing done:
* added unit tests for related changes.
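
For illustration (the queue and limit names are hypothetical, not the worker's actual fields), the idea is to compute the number of requests to evict once, instead of re-checking the size in a while loop:

// Evict a fixed number of requests computed up front.
int overflow = requestQueue.size() - maxQueuedRequests;
for (int i = 0; i < overflow; i++) {
    requestQueue.poll(); // remove from the low-priority end (illustrative)
}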
PriorityTracker uses a ConcurrentSkipListSet to record the priorities/frequencies of entities. I didn't realize that ConcurrentSkipListSet.first() and ConcurrentSkipListSet.last() throw an exception when the set is empty. This PR adds an empty check.

Also, we want PriorityCache.selectUpdateCandidate to be side-effect free. Thus, this PR replaces the computeBufferIfAbsent method call with activeEnities.get, as computeBufferIfAbsent has the side effect of creating a CacheBuffer.

Testing done:
* created unit tests for related changes.
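
For reference, ConcurrentSkipListSet.first() and last() throw NoSuchElementException on an empty set, so the guard looks roughly like this (the field and method names are illustrative, not the actual PriorityTracker code):

import java.util.NoSuchElementException;
import java.util.Optional;
import java.util.concurrent.ConcurrentSkipListSet;

Optional<String> smallestPriority(ConcurrentSkipListSet<String> priorities) {
    try {
        // isEmpty() alone is not enough in a concurrent set: another thread can
        // empty it between the check and the read, so also catch the exception.
        return priorities.isEmpty() ? Optional.empty() : Optional.of(priorities.first());
    } catch (NoSuchElementException e) {
        return Optional.empty();
    }
}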
We throw an EndRunException (endNow = true) whenever there is a SearchPhaseExecutionException. EndRunException (endNow = true) stops a detector immediately. But these query exceptions can happen during the starting phase of the OpenSearch process and may therefore be false positives. This PR fixes that by setting the endNow parameter of these EndRunExceptions to false, so the detector is not stopped immediately.

Testing done:
* created unit tests for related changes.
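
A rough sketch of the change (the exact EndRunException constructor arguments and the surrounding helper names are illustrative, not the plugin's actual signatures):

try {
    searchFeatureData(detector, listener);
} catch (SearchPhaseExecutionException e) {
    // endNow = false: the query failure may be transient (e.g., the node is
    // still starting up), so do not stop the detector immediately.
    throw new EndRunException(detector.getDetectorId(), "Fail to execute the query", e, false);
}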
@kaituo kaituo changed the title Multi cat5 multi-category support, rate limiting, and pagination Jul 6, 2021
@ohltyler ohltyler added the v1.1.0 label Jul 8, 2021

if (entityName != null && entityValue != null) {
// single-stream profile request:
// GET _opendistro/_anomaly_detection/detectors/<detectorId>/_profile/init_progress?category_field=<field-name>&entity=<value>
Member

nit: _opendistro -> _plugins

Collaborator Author

fixed

return Entity.createSingleAttributeEntity(detectorId, entityName, entityValue);
} else if (request.hasContent()) {
/* HCAD profile request:
* GET _opendistro/_anomaly_detection/detectors/<detectorId>/_profile/init_progress
Member

nit: _opendistro -> _plugins

Collaborator Author

fixed

@ylwu-amzn
Collaborator

Some high-level questions that were not covered in the previous PR review.

The current design dispatches entities to different worker nodes with consistent hashing. Each worker node caches entities in a queue, runs the RCF model, then bulk writes anomaly results to the index. If the current detection interval times out, we just drop all remaining entities in the queue to avoid carrying them over to the next interval. It's possible that some worker nodes receive more entities that need to run longer, for example because they need to load checkpoints. So the unbalanced load may cause some worker nodes to drop more entities. But this seems not a big concern, as we don't guarantee all entities will be detected; other conditions, such as a full worker-node queue, can also cause entities to be dropped.

I have some concern about horizontal scalability. For example, if a user scales a cluster from 10 data nodes to 100 data nodes, there will be 100 worker nodes running entities, which means we may send out 100 bulk write AD result requests concurrently. But the AD result index shard setting stays the same, which may bring a higher write load, and the user's general bulk write requests may be impacted. Please confirm with PM/SDM whether this horizontal scalability concern is reasonable.

@kaituo
Collaborator Author

kaituo commented Jul 8, 2021


Will confirm with Sean about the horizontal scalability issue. Result writes are not a problem, as we are going to drop results when the index pressure is high. But requests like reading checkpoints are impacted when:
1) traffic is random (like IP addresses), and
2) there are more entities than the available memory can host.

@codecov-commenter

Codecov Report

Merging #121 (69490ff) into main (f7e34e5) will decrease coverage by 3.20%.
The diff coverage is 68.59%.

Impacted file tree graph

@@             Coverage Diff              @@
##               main     #121      +/-   ##
============================================
- Coverage     79.30%   76.09%   -3.21%     
- Complexity     2692     2884     +192     
============================================
  Files           243      264      +21     
  Lines         11132    12435    +1303     
  Branches       1014     1222     +208     
============================================
+ Hits           8828     9463     +635     
- Misses         1888     2455     +567     
- Partials        416      517     +101     
Flag Coverage Δ
plugin 76.09% <68.59%> (-3.21%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...va/org/opensearch/ad/AnomalyDetectorJobRunner.java 75.00% <0.00%> (ø)
.../java/org/opensearch/ad/AnomalyDetectorRunner.java 44.70% <0.00%> (-1.08%) ⬇️
...in/java/org/opensearch/ad/EntityProfileRunner.java 6.85% <0.00%> (-71.32%) ⬇️
...ain/java/org/opensearch/ad/caching/DoorKeeper.java 94.11% <ø> (ø)
...in/java/org/opensearch/ad/constant/CommonName.java 66.66% <ø> (ø)
...g/opensearch/ad/transport/ADStatsNodeResponse.java 100.00% <ø> (ø)
...ch/ad/transport/handler/DetectionStateHandler.java 87.50% <ø> (ø)
...ad/transport/handler/MultiEntityResultHandler.java 13.51% <0.00%> (-9.75%) ⬇️
...c/main/java/org/opensearch/ad/util/ParseUtils.java 53.75% <0.00%> (-2.98%) ⬇️
...in/java/org/opensearch/ad/feature/ScriptMaker.java 11.11% <11.11%> (ø)
... and 114 more

@kaituo
Collaborator Author

kaituo commented Jul 8, 2021


@ylwu-amzn, I asked Sean. He said to create an issue to track it (#125) and not block this code commit.

@kaituo
Collaborator Author

kaituo commented Jul 8, 2021

@ylwu-amzn, please also comment on the issue I created with solutions.

@kaituo kaituo merged commit ea22d59 into opensearch-project:main Jul 12, 2021
@kaituo kaituo added the feature new feature label Aug 12, 2021
ohltyler pushed a commit that referenced this pull request Sep 1, 2021