Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modify Entity to support multi-category field #78

Closed
wants to merge 4 commits into from

Conversation

kaituo
Copy link
Collaborator

@kaituo kaituo commented May 26, 2021

Note: since there are a lot of dependencies, I only list the main class and test code to save reviewers' time. The build will fail due to missing dependencies. I will use that PR just for review. will not merge it. Will have a big one in the end and merge once after all review PRs get approved. Now the code is missing unit tests. Will add unit tests, run performance tests, and fix bugs before the official release.

Description

This PR modified Entity to include multiple categorical fields. One key change is to not use categorical values directly in the model document Id. The problem is that HCAD v1 uses the categorical value as part of the model document Id, but ES’s document Id can be at most 512 bytes. Categorical values are usually less than 256 characters, but can grow to 32766 in theory. HCAD v1 skips an entity if the entity's name is more than 256 characters. We cannot do that in v2 as that can reject a lot of entities. To overcome the obstacle, we hash categorical values to a 128-bit string (like SHA-1 that git uses) and use the hash as part of the model document Id.

We have choices to make regarding when to use the hash as part of a model document Id: for all HC detectors or a HC detector with multiple categorical fields. The challenge lies in providing backward compatibility of looking for a model checkpoint in the case of a HC detector with one categorical field. If using hashes for all HC detectors, we need two get requests to ensure that a model checkpoint exists. One uses the document Id without a hash, while one uses the document Id with a hash. The dual get requests are ineffective. If limiting hashes to a HC detector with multiple categorical fields, there is no backward compatibility issue. However, the code will be branchy. One may wonder if backward compatibility can be ignored; indeed, the old checkpoints will be gone after a transition period during upgrading. During the transition period, HC detectors can experience unnecessary cold starts as if the detectors were just started. Checkpoint index size can double if every model has two model documents. The transition period can be as long as 3 days since our checkpoint retention period is 3 days. There is no perfect solution. We prefer limiting hashes to an HC detector with multiple categorical fields as its customer impact is none.

Testing done:

  1. Manually tested the model id of multi-category entity works, and the single-category entity does not break.

Check List

  • [ Y ] Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

This PR modified Entity to include multiple categorical fields.  One key change is to not use categorical values directly in the model document Id. The problem is that HCAD v1 uses the categorical value as part of the model document Id, but ES’s document Id can be at most 512 bytes. Categorical values are usually less than 256 characters, but can grow to 32766 in theory. HCAD v1 skips an entity if the entity's name is more than 256 characters. We cannot do that in v2 as that can reject a lot of entities. To overcome the obstacle, we hash categorical values to a 128-bit string (like SHA-1 that git uses) and use the hash as part of the model document Id.

We have choices to make regarding when to use the hash as part of a model document Id: for all HC detectors or a HC detector with multiple categorical fields. The challenge lies in providing backward compatibility of looking for a model checkpoint in the case of a HC detector with one categorical field. If using hashes for all HC detectors, we need two get requests to ensure that a model checkpoint exists. One uses the document Id without a hash, while one uses the document Id with a hash. The dual get requests are ineffective. If limiting hashes to a HC detector with multiple categorical fields, there is no backward compatibility issue. However, the code will be branchy. One may wonder if backward compatibility can be ignored; indeed, the old checkpoints will be gone after a transition period during upgrading. During the transition period, HC detectors can experience unnecessary cold starts as if the detectors were just started. Checkpoint index size can double if every model has two model documents. The transition period can be as long as 3 days since our checkpoint retention period is 3 days. There is no perfect solution. We prefer limiting hashes to an HC detector with multiple categorical fields as its customer impact is none.

Testing done:
1. Manually tested the model id of multi-category entity works, and the single-category entity does not break.
@kaituo kaituo requested review from ylwu-amzn and ohltyler May 26, 2021 22:11
Copy link
Member

@ohltyler ohltyler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -136,14 +144,17 @@ private void onFailure(Exception e, ActionListener<List<AnomalyResult>> listener
// TODO return exception like IllegalArgumentException to explain data is not enough for preview
// This also requires front-end change to handle error message correspondingly
// We return empty list for now to avoid breaking front-end
if (e instanceof OpenSearchSecurityException) {
listener.onFailure(e);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If exception is OpenSearchSecurityException, will call onFailure first, then call onResponse, it will throw connection closed exception.

Add return in this if block, or put listener.onResponse(Collections.emptyList()); in else block.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, fixed this one and another similar bug.

Copy link
Collaborator

@ylwu-amzn ylwu-amzn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kaituo kaituo closed this Jun 22, 2021
kaituo added a commit that referenced this pull request Jul 12, 2021
This PR is a conglomerate of the following PRs.

#60
#64
#65
#67
#68
#69
#70
#71
#74
#75
#76
#77
#78
#79
#82
#83
#84
#92
#94
#93
#95
kaituo#1
kaituo#2
kaituo#3
kaituo#4
kaituo#5
kaituo#6
kaituo#7
kaituo#8
kaituo#9
kaituo#10

This spreadsheet contains the mappings from files to PR number (bug fix in my AD fork and tests are not included):
https://gist.github.com/kaituo/9e1592c4ac4f2f449356cb93d0591167
ohltyler pushed a commit that referenced this pull request Sep 1, 2021
This PR is a conglomerate of the following PRs.

#60
#64
#65
#67
#68
#69
#70
#71
#74
#75
#76
#77
#78
#79
#82
#83
#84
#92
#94
#93
#95
kaituo#1
kaituo#2
kaituo#3
kaituo#4
kaituo#5
kaituo#6
kaituo#7
kaituo#8
kaituo#9
kaituo#10

This spreadsheet contains the mappings from files to PR number (bug fix in my AD fork and tests are not included):
https://gist.github.com/kaituo/9e1592c4ac4f2f449356cb93d0591167
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants