
multi-category support, rate limiting, and pagination #121

Merged
merged 13 commits into from
Jul 12, 2021

Conversation

Collaborator

@kaituo kaituo commented Jul 6, 2021

Description

This PR is a conglomerate of the following PRs.

#60
#64
#65
#67
#68
#69
#70
#71
#74
#75
#76
#77
#78
#79
#82
#83
#84
#92
#94
#93
#95
kaituo#1
kaituo#2
kaituo#3
kaituo#4
kaituo#5
kaituo#6
kaituo#7
kaituo#8
kaituo#9
kaituo#10

This spreadsheet maps files to PR numbers (bug fixes in my AD fork and tests are not included):
https://gist.github.com/kaituo/9e1592c4ac4f2f449356cb93d0591167

Issues Resolved

#85 (rate limiting)
#87 (multiple category fields)
#86 (pagination)

Check List

  • [x] Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

kaituo added 12 commits July 6, 2021 15:17
EntityColdStartWorker used COLD_ENTITY_QUEUE_CONCURRENCY instead of ENTITY_COLD_START_QUEUE_CONCURRENCY. This PR fixes the bug.

This PR also changes AnomalyDetectorSettings.java by
* removing the unused setting COLD_ENTITY_QUEUE_CONCURRENCY
* renaming the setting from entity_coldstart to entity_cold_start to be consistent

Testing done:
1. Added unit tests for the changed code.
We schedule the next run if there are more requests than a single batch allows. Previously, we used filteredRequests.size > batchSize to decide whether a next run is required. But filteredRequests may be small even though many requests remain in the queue. This PR fixes the bug by using requestSize > batchSize instead.

This PR also changes the return type of pullRequests, as we don't use its return value.

Testing done:
1. Added unit tests for the changed code.
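
To illustrate the fix, here is a minimal sketch; the method and field names (pullRequests, requestQueue, scheduleNextRun) are illustrative, not the worker's exact API:

void processBatch() {
    int requestSize = requestQueue.size();                    // requests waiting before this batch
    List<Request> filtered = filter(pullRequests(batchSize)); // filtering may drop e.g. expired requests
    execute(filtered);
    // Old check: filtered.size() > batchSize, which is often false after filtering,
    // so follow-up runs were skipped while the queue was still full.
    // New check: compare the pre-filter request count against the batch size.
    if (requestSize > batchSize) {
        scheduleNextRun();
    }
}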
The stats API broadcasts requests to all nodes and renders node responses using toXContent. For the local node, the stats API calls toXContent on the node response directly. For a remote node, the coordinating node gets serialized content from ADStatsNodeResponse.writeTo, deserializes it, and renders the result using toXContent. ADStatsNodeResponse.writeTo uses StreamOutput::writeGenericValue, which only recognizes built-in types (e.g., String or primitive types). It throws exceptions for types such as java.time.Instant and Entity. This PR fixes the bug by converting non-built-in types to built-in types in the node stats response.

This PR also removes the model id from Entity.toString, as it is random across runs, and keeps only the attributes. This helps locate the same node in the hash ring.

Testing done:
* Verified stats API response on a multi-node cluster for a multiple-category field detector.
* Added unit tests.
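
The conversion can be sketched roughly as follows; the map and stream names (statsMap, out) are assumptions for illustration, while StreamOutput::writeGenericValue and java.time.Instant are the APIs actually involved:

// Replace values writeGenericValue cannot serialize with built-in equivalents.
Map<String, Object> converted = new HashMap<>();
for (Map.Entry<String, Object> entry : statsMap.entrySet()) {
    Object value = entry.getValue();
    if (value instanceof Instant) {
        converted.put(entry.getKey(), ((Instant) value).toEpochMilli()); // long is a built-in type
    } else if (value instanceof Entity) {
        converted.put(entry.getKey(), value.toString());                 // attributes only, no model id
    } else {
        converted.put(entry.getKey(), value);
    }
}
out.writeGenericValue(converted);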
…ing once

Currently, I prune overflow requests over multiple loop iterations. This PR calculates how many requests to prune directly, avoiding the while loop.

This PR also:
* Moved the hardcoded worker names into variables for code reuse.
* Fixed a bug in RateLimitedRequestWorker's constructor, which did not assign variables in the setting update consumer.

Testing done:
* added unit tests for related changes.
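
For illustration (the queue and limit names are hypothetical, not the worker's actual fields), the idea is to compute the number of requests to evict once, instead of re-checking the size in a while loop:

// Evict a fixed number of requests computed up front.
int overflow = requestQueue.size() - maxQueuedRequests;
for (int i = 0; i < overflow; i++) {
    requestQueue.poll(); // remove from the low-priority end (illustrative)
}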
PriorityTracker uses a ConcurrentSkipListSet to record the priorities/frequencies of entities. I didn't realize that ConcurrentSkipListSet.first() and ConcurrentSkipListSet.last() throw an exception when the set is empty. This PR adds an empty check.

Also, we want PriorityCache.selectUpdateCandidate to be side-effect free. Thus, this PR replaces the computeBufferIfAbsent method call with activeEnities.get, as computeBufferIfAbsent has the side effect of creating a CacheBuffer.

Testing done:
* created unit tests for related changes.
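
For reference, ConcurrentSkipListSet.first() and last() throw NoSuchElementException on an empty set, so the guard looks roughly like this (the field and method names are illustrative, not the actual PriorityTracker code):

import java.util.NoSuchElementException;
import java.util.Optional;
import java.util.concurrent.ConcurrentSkipListSet;

Optional<String> smallestPriority(ConcurrentSkipListSet<String> priorities) {
    try {
        // isEmpty() alone is not enough in a concurrent set: another thread can
        // empty it between the check and the read, so also catch the exception.
        return priorities.isEmpty() ? Optional.empty() : Optional.of(priorities.first());
    } catch (NoSuchElementException e) {
        return Optional.empty();
    }
}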
We throw an EndRunException (endNow = true) whenever there is a SearchPhaseExecutionException. EndRunException (endNow = true) stops a detector immediately. But these query exceptions can happen during the starting phase of the OpenSearch process and may therefore be false positives. This PR fixes that by setting the endNow parameter of these EndRunExceptions to false, so the detector is not stopped immediately.

Testing done:
* created unit tests for related changes.
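
A rough sketch of the change (the exact EndRunException constructor arguments and the surrounding helper names are illustrative, not the plugin's actual signatures):

try {
    searchFeatureData(detector, listener);
} catch (SearchPhaseExecutionException e) {
    // endNow = false: the query failure may be transient (e.g., the node is
    // still starting up), so do not stop the detector immediately.
    throw new EndRunException(detector.getDetectorId(), "Fail to execute the query", e, false);
}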
@kaituo kaituo changed the title Multi cat5 multi-category support, rate limiting, and pagination Jul 6, 2021
@ohltyler ohltyler added the v1.1.0 label Jul 8, 2021

if (entityName != null && entityValue != null) {
// single-stream profile request:
// GET _opendistro/_anomaly_detection/detectors/<detectorId>/_profile/init_progress?category_field=<field-name>&entity=<value>
Member

nit: _opendistro -> _plugins

Collaborator Author

fixed

return Entity.createSingleAttributeEntity(detectorId, entityName, entityValue);
} else if (request.hasContent()) {
/* HCAD profile request:
* GET _opendistro/_anomaly_detection/detectors/<detectorId>/_profile/init_progress
Member

nit: _opendistro -> _plugins

Collaborator Author

fixed

@ylwu-amzn
Collaborator

Some high-level questions that were not covered in the previous PR review.

The current design dispatches entities to different worker nodes with consistent hashing. Each worker node caches entities in a queue, runs the RCF model, then bulk writes anomaly results to the index. If the current detection interval times out, we just drop all remaining entities in the queue to avoid carrying them over to the next interval. It's possible that some worker nodes receive more entities that need to run longer, for example because they need to load checkpoints. So the unbalanced load may cause some worker nodes to drop more entities. But this seems not a big concern, as we don't guarantee all entities will be detected; other conditions, such as a full worker-node queue, can also cause entities to be dropped.

I have some concern about horizontal scalability. For example, if a user scales a cluster from 10 data nodes to 100 data nodes, there will be 100 worker nodes running entities, which means we may send out 100 bulk write AD result requests concurrently. But the AD result index shard setting stays the same, which may bring a higher write load, and the user's general bulk write requests may be impacted. Please confirm with PM/SDM whether this horizontal scalability concern is reasonable.

@kaituo
Collaborator Author

kaituo commented Jul 8, 2021


Will confirm with Sean about the horizontal scalability issue. Result writes are not a problem, as we are going to drop results when the index pressure is high. But requests like reading checkpoints are impacted when:
1) traffic is random (like IP addresses), and
2) there are more entities than the available memory can host.

@codecov-commenter

Codecov Report

Merging #121 (69490ff) into main (f7e34e5) will decrease coverage by 3.20%.
The diff coverage is 68.59%.

Impacted file tree graph

@@             Coverage Diff              @@
##               main     #121      +/-   ##
============================================
- Coverage     79.30%   76.09%   -3.21%     
- Complexity     2692     2884     +192     
============================================
  Files           243      264      +21     
  Lines         11132    12435    +1303     
  Branches       1014     1222     +208     
============================================
+ Hits           8828     9463     +635     
- Misses         1888     2455     +567     
- Partials        416      517     +101     
Flag Coverage Δ
plugin 76.09% <68.59%> (-3.21%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...va/org/opensearch/ad/AnomalyDetectorJobRunner.java 75.00% <0.00%> (ø)
.../java/org/opensearch/ad/AnomalyDetectorRunner.java 44.70% <0.00%> (-1.08%) ⬇️
...in/java/org/opensearch/ad/EntityProfileRunner.java 6.85% <0.00%> (-71.32%) ⬇️
...ain/java/org/opensearch/ad/caching/DoorKeeper.java 94.11% <ø> (ø)
...in/java/org/opensearch/ad/constant/CommonName.java 66.66% <ø> (ø)
...g/opensearch/ad/transport/ADStatsNodeResponse.java 100.00% <ø> (ø)
...ch/ad/transport/handler/DetectionStateHandler.java 87.50% <ø> (ø)
...ad/transport/handler/MultiEntityResultHandler.java 13.51% <0.00%> (-9.75%) ⬇️
...c/main/java/org/opensearch/ad/util/ParseUtils.java 53.75% <0.00%> (-2.98%) ⬇️
...in/java/org/opensearch/ad/feature/ScriptMaker.java 11.11% <11.11%> (ø)
... and 114 more

@kaituo
Collaborator Author

kaituo commented Jul 8, 2021


@ylwu-amzn, I asked Sean. He said to create an issue to track it (#125) and not block this code commit.

@kaituo
Collaborator Author

kaituo commented Jul 8, 2021

@ylwu-amzn, please also comment on the issue I created with solutions.

@kaituo kaituo merged commit ea22d59 into opensearch-project:main Jul 12, 2021
@kaituo kaituo added the feature new feature label Aug 12, 2021
ohltyler pushed a commit that referenced this pull request Sep 1, 2021