
[Transform] Finetune Schedule to be less noisy on retry and retry slower #88531

Merged


@hendrikmuhs hendrikmuhs commented Jul 14, 2022

Reduce the number of log and audit messages when the same failure happens several times in a row,
and change the minimum wait time for retrying to 5s.

This fine-tunes the changes made in #84657, which made retrying independent of frequency. As a result, retries trigger faster than before, with the side effect of producing more noise. For this change I investigated the log/audit behavior and adjusted it.
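A minimal sketch of how a minimum retry delay can be enforced independently of the transform frequency. Only `MIN_DELAY_MILLIS` and its 5s value come from this PR; the class name `RetryDelay`, the exponential backoff, and `MAX_DELAY_MILLIS` are assumptions made for illustration and are not taken from the actual scheduler code.

```java
import java.time.Duration;

/** Hypothetical sketch; not the actual Elasticsearch scheduler code. */
final class RetryDelay {
    // From the PR description: minimum wait before a retry, raised from 1s to 5s.
    static final long MIN_DELAY_MILLIS = Duration.ofSeconds(5).toMillis();
    // Assumed upper bound, chosen only for illustration.
    static final long MAX_DELAY_MILLIS = Duration.ofHours(1).toMillis();

    /**
     * Delay before the next retry after failureCount consecutive failures.
     * The exponential backoff is an assumption; only the lower bound comes from the PR.
     * Note that the configured frequency is not an input at all.
     */
    static long delayMillis(int failureCount) {
        // Cap the exponent so the shift cannot overflow a long.
        int exponent = Math.max(0, Math.min(failureCount, 30));
        long backoff = 1000L << exponent;
        // Clamp into [MIN_DELAY_MILLIS, MAX_DELAY_MILLIS].
        return Math.max(MIN_DELAY_MILLIS, Math.min(backoff, MAX_DELAY_MILLIS));
    }
}
```

With these assumed values, the first two retries wait the 5s minimum, and later retries back off further until they reach the assumed cap.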

Notes:

  • the change consists mostly of test changes; the important code changes are:
    • set MIN_DELAY_MILLIS to 5s (from 1s)
    • dedup logs/audits based on exception class
      • the 1st (retry-able) failure gets logged/audited, subsequent failures only if they differ from the last one. The last retry is logged as well
      • exceptions are usually wrapped in a SearchPhaseExecutionException, so it is important to unwrap this as a 1st step. Just taking the message from the unwrapped class would still produce a lot of noise, because the messages can differ slightly even though they have the same root cause. Using the class name of the unwrapped exception seems like a pragmatic way to dedup them.
      • I moved the last error into the context. Note: there is no need to persist it, since we do not stop the transform after such a failure; the error is kept in memory only
  • Testing:
  • Log/Audit message:
    • "Transform encountered an exception: [{Exception}]; Will automatically retry [1/10]"
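The dedup rule described in the notes above could look roughly like the following. This is a sketch under stated assumptions, not the code from this PR: `FailureDeduplicator`, `WrapperException` (a stand-in for a wrapper such as SearchPhaseExecutionException), the `audits` list, and the reset in `onSuccess` are all hypothetical. Filling the `{Exception}` placeholder with the unwrapped exception's message is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of class-name-based dedup; not the actual transform code. */
final class FailureDeduplicator {
    /** Stand-in for a wrapping exception type (assumption, for self-containment). */
    static final class WrapperException extends RuntimeException {
        WrapperException(Throwable cause) {
            super("wrapped", cause);
        }
    }

    private final int maxRetries;
    // Class name of the last failure; kept in memory only, never persisted.
    private String lastFailureClass;
    // Stands in for the log/audit output.
    final List<String> audits = new ArrayList<>();

    FailureDeduplicator(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    /** Unwrap the wrapper (if any) so the root failure type drives the dedup. */
    static Throwable unwrap(Throwable t) {
        while (t instanceof WrapperException && t.getCause() != null) {
            t = t.getCause();
        }
        return t;
    }

    /** Emit a message only for a new failure class or for the last retry. */
    void onFailure(Throwable failure, int retryCount) {
        Throwable root = unwrap(failure);
        String failureClass = root.getClass().getName();
        boolean differs = !failureClass.equals(lastFailureClass);
        boolean lastRetry = retryCount == maxRetries;
        if (differs || lastRetry) {
            audits.add("Transform encountered an exception: [" + root.getMessage()
                + "]; Will automatically retry [" + retryCount + "/" + maxRetries + "]");
        }
        lastFailureClass = failureClass;
    }

    /** Reset after a successful run so the next failure is reported again (assumption). */
    void onSuccess() {
        lastFailureClass = null;
    }
}
```

Comparing class names rather than messages means two failures such as "shard 1 failed" and "shard 2 failed" of the same exception type count as the same failure and produce a single audit entry.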

relates #84657

(marked as non-issue as this is an addition to #84657)

… change

the minimum wait time for retrying to 5s
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Jul 14, 2022
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

}

@Override
public void assertMatched() {
    assertThat("expected to see " + expectedName + " but did not", saw, equalTo(true));
    assertThat("expected to see " + expectedName + " but did not", count, equalTo(expectedCount));
Contributor


It would be good to distinguish the case where we don't see the message at all from the case where we see it too many times. Otherwise, if we ever get this test failure, we won't know which scenario was to blame.

In fact, reading further, the error message in MultipleSeenAuditExpectation is ideal, so maybe just delete this class and always use MultipleSeenAuditExpectation with an expected count of 1.

Author


I realize that I changed the way it worked. Before, it just flipped a bool, so the check was "seen at least once"; with the count I changed it to "exactly once". However, the tests pass, and "exactly once" is probably what we want anyway.

I will merge the 2 classes and keep the constructor that defaults to "seen once" as a convenience. That way it does not trigger further refactorings.
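For illustration, a merged expectation along these lines might look like the sketch below. It is self-contained and does not use the real test framework classes: `SeenAuditExpectation`, `match`, and the exact failure messages are hypothetical. The point is only that "never seen" and "seen the wrong number of times" fail with different messages, and that a convenience constructor defaults to exactly once.

```java
/** Hypothetical sketch of a merged audit expectation; names and messages are illustrative. */
final class SeenAuditExpectation {
    private final String expectedName;
    private final String expectedMessage;
    private final int expectedCount;
    private int count;

    /** Convenience constructor: expect the message exactly once. */
    SeenAuditExpectation(String expectedName, String expectedMessage) {
        this(expectedName, expectedMessage, 1);
    }

    SeenAuditExpectation(String expectedName, String expectedMessage, int expectedCount) {
        this.expectedName = expectedName;
        this.expectedMessage = expectedMessage;
        this.expectedCount = expectedCount;
    }

    /** Count every audit message that equals the expected one. */
    void match(String message) {
        if (expectedMessage.equals(message)) {
            count++;
        }
    }

    /** Fail with a message that says whether the audit was missing or miscounted. */
    void assertMatched() {
        if (count == 0 && expectedCount > 0) {
            throw new AssertionError("expected to see " + expectedName + " but did not");
        }
        if (count != expectedCount) {
            throw new AssertionError("expected to see " + expectedName + " " + expectedCount
                + " times but saw it " + count + " times");
        }
    }
}
```

Existing call sites that only pass a name and a message keep working through the two-argument constructor, which is why the merge does not force further refactoring.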

Contributor

@droberts195 droberts195 left a comment


LGTM

@hendrikmuhs hendrikmuhs merged commit 9a0f05f into elastic:master Jul 14, 2022
Labels
:ml/Transform Transform >non-issue Team:ML Meta label for the ML team v8.4.0