[ML] Add DatafeedTimingStats to datafeed GetDatafeedStatsAction.Response #43045
Conversation
Pinging @elastic/ml-core
All datafeed BWC tests are still muted (I think) due to the date_histogram aggregation interval change. If they weren't, I think they would fail due to streaming issues, since the 7.3.0 code does not have the change yet.
> All datafeed BWC tests are still muted (I think) due to the date_histogram aggregation interval change. If they weren't, I think they would fail due to streaming issues, since the 7.3.0 code does not have the change yet.

I'm not sure I understand. Which 'streaming issues' do you have in mind?
@przemekwitek this is the streaming issue I was talking about: https://github.com/elastic/elasticsearch/pull/43045/files#diff-34cde95ae0dc7e6c0e10460382efb116R151 This code does not exist in 7.3.0 yet, but a stream from a 7.3.0 node is expected to have this format.
I left a few comments about not changing the rules on which indices each job's results could be in. I think this is important because it's complicated enough at the moment, and we have a plan to make it easier for users to set up rolling results indices.
Do you think it will be ok if I just backport this PR to 7.3 shortly after it is merged to master?
@przemekwitek all the datafeed BWC tests are currently muted due to another issue. I would just make sure they are muted before backporting. If your BWC tests pass here in the PR, then it should be fine to merge.
Those doubles should not be optional in the stream in/out methods.
run elasticsearch-ci/2
Two last questions around design:
- Why aren't we using the search time returned in the search payload?
- Should we reset the stats when a datafeed is deleted and recreated but the job remains the same?
```java
public <T, R> R executeWithReporting(Function<T, R> function, T argument) {
    Instant before = clock.instant();
    R result = function.apply(argument);
    Instant after = clock.instant();
```
So, this is an interesting predicament. I can understand the reason for wanting "wall clock time" for the entire function, but strictly speaking, this is NOT the search time. The search time is returned in the search payload and could be gathered from there.
Though I also see the benefit of having the entire method wrapped.
Unsure if the complexity is necessary, though. Statistically, both the real search time and the time spent in this method should have 100% correlation and give the same indication of run time.
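The trade-off above can be sketched with a minimal, self-contained wrapper. All class and accessor names here are hypothetical; the real implementation lives in `DatafeedTimingStatsReporter`:

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.util.function.Function;

// Minimal sketch (hypothetical names) of the wall-clock wrapper discussed
// above. It measures elapsed time around the whole call, which includes
// serialization and transport overhead, not just the "took" time reported
// inside the search response payload.
class TimingReporter {
    private final Clock clock;
    private double totalTimeMs = 0.0;
    private long callCount = 0;

    TimingReporter(Clock clock) {
        this.clock = clock;
    }

    <T, R> R executeWithReporting(Function<T, R> function, T argument) {
        Instant before = clock.instant();
        R result = function.apply(argument);
        Instant after = clock.instant();
        totalTimeMs += Duration.between(before, after).toMillis();
        callCount++;
        return result;
    }

    double getTotalTimeMs() { return totalTimeMs; }
    long getCallCount() { return callCount; }
}
```

Injecting a `Clock` rather than calling `Instant.now()` directly keeps the wrapper testable with a fixed or ticking clock.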
Done.
Thanks for this suggestion Ben. The interface of DatafeedTimingStatsReporter is much cleaner now.
```java
    .setSize(1)
    .setIndicesOptions(IndicesOptions.lenientExpandOpen())
    .setQuery(QueryBuilders.idsQuery().addIds(DatafeedTimingStats.documentId(jobId)))
    .addSort(SortBuilders.fieldSort(DatafeedTimingStats.TOTAL_SEARCH_TIME_MS.getPreferredName()).order(SortOrder.DESC));
```
I don't understand how this is `latest`. I thought `TOTAL_SEARCH_TIME_MS` was a summation of the individual search times.
Ah, I see, the assumption is that it is always monotonically increasing.
Do we account for the datafeed being deleted and a new one being created? Either we should be keeping the previous datafeed's search time total, or we should restart the summation back at zero.
If we want an actual summation of search time for the lifetime of the job, then continuing to increase the stats from a previously deleted datafeed makes sense.
If we want an accurate search time per datafeed, then this search would fail, as we could restart the search time statistics.
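The assumption under discussion (sort descending on a monotonically increasing total to find the latest document) can be illustrated with a small self-contained sketch; the record and class names here are hypothetical:

```java
import java.util.Comparator;
import java.util.List;

// Each record stands in for one persisted timing stats document; a job's
// results can be spread over several indices, so copies with different
// totals may coexist. Because the total only ever grows, the document
// with the largest value is also the most recently written one.
record TimingStatsDoc(String index, double totalSearchTimeMs) {}

class LatestTimingStats {
    // Equivalent to sorting on TOTAL_SEARCH_TIME_MS descending and taking
    // the first hit (setSize(1) in the real search request).
    static TimingStatsDoc latest(List<TimingStatsDoc> docs) {
        return docs.stream()
            .max(Comparator.comparingDouble(TimingStatsDoc::totalSearchTimeMs))
            .orElseThrow();
    }
}
```

If the counter were ever reset (for example by deleting and recreating the datafeed), this selection would pick the stale document with the larger total, which is exactly the concern raised above.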
This is an interesting point. The datafeed timing stats are persisted with the job's ID and are deleted when the job is deleted. So internally they are associated with the job rather than the datafeed. But then they're reported via the datafeed stats endpoint, so externally they appear to be associated with the datafeed.
As long as you always create and delete the job and datafeed together in quick succession it works. But if the datafeed is deleted and recreated independently then it doesn't. To make that work the lifecycle of the datafeed stats needs to be moved from the job to the datafeed. But that is also hard because the datafeed does not have an index - only the job has an index. Even this is complicated as the job results can be spread over multiple indices, so deletion of the timing stats is not deleting a single document in a single index but potentially multiple documents each with the same ID but in different indices.
@przemekwitek please can you have a look at the delete datafeed code and see how hard it would be to do a delete-by-query to delete the datafeed timing stats when a datafeed is deleted? I think if it's not too hard we should do this, because externally the stats are part of the datafeed. But if it's hard let me know and we can think again.
> I don't understand how this is `latest`. I thought `TOTAL_SEARCH_TIME_MS` was a summation of the individual search times. Ah, I see, the assumption that it is always monotonically increasing.

That's correct.
To make it more explicit I could introduce another field, `last_modification_timestamp`. BTW, this sounds like something every document could benefit from. Is such functionality built-in, or does such a field have to be added on a per-index basis?

> Do we account for the datafeed being deleted, and a new one being created? Either we should be keeping the previous datafeed's search time total, or we should restart the summation back at zero.
> If we want an actual summation of search time for the lifetime of the job, then continuing to increase the stats from a previously deleted datafeed makes sense.
> If we want an accurate search time per datafeed, then this search would fail, as we could restart the search time statistics.

Good point; so far I was treating the job-datafeed pair as a unit, so I have to revisit that.
> This is an interesting point. The datafeed timing stats are persisted with the job's ID and are deleted when the job is deleted. So internally they are associated with the job rather than the datafeed. But then they're reported via the datafeed stats endpoint, so externally they appear to be associated with the datafeed.
> As long as you always create and delete the job and datafeed together in quick succession it works. But if the datafeed is deleted and recreated independently then it doesn't. To make that work the lifecycle of the datafeed stats needs to be moved from the job to the datafeed. But that is also hard because the datafeed does not have an index - only the job has an index.

Correct, I did not want to introduce yet another index, especially since we do not plan to foster job-datafeed divergence.

> Even this is complicated as the job results can be spread over multiple indices, so deletion of the timing stats is not deleting a single document in a single index but potentially multiple documents each with the same ID but in different indices.
> @przemekwitek please can you have a look at the delete datafeed code and see how hard it would be to do a delete-by-query to delete the datafeed timing stats when a datafeed is deleted? I think if it's not too hard we should do this, because externally the stats are part of the datafeed. But if it's hard let me know and we can think again.

I'll do that.
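The multiple-indices wrinkle can be sketched with a toy in-memory model. The real code uses a delete-by-query; everything below (class, method, index, and document names) is hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model: each index is a map of document ID -> source. Because a job's
// results can be spread over several indices, deleting the timing stats
// means removing the same document ID from every index, which is why a
// delete-by-query is needed rather than a single-index, single-document delete.
class TimingStatsDeleter {
    static int deleteByQuery(Map<String, Map<String, String>> indices, String docId) {
        int deleted = 0;
        for (Map<String, String> index : indices.values()) {
            if (index.remove(docId) != null) {
                deleted++;
            }
        }
        return deleted;
    }
}
```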
This is now done.
PTAL
* Use DeleteByQueryAction to delete DatafeedTimingStats
* Refine the new tests in DatafeedJobsIT test suite
```yaml
- match: { datafeeds.0.datafeed_id: "datafeed-1"}
- match: { datafeeds.0.state: "stopped"}
- match: { datafeeds.0.timing_stats.job_id: "get-datafeed-stats-1"}
- match: { datafeeds.0.timing_stats.search_count: 1}
```
Is there something that ensures that the timing stats document is searchable in the index after stopping a datafeed? I couldn't see this but maybe I'm just missing it (in which case the rest of this comment is irrelevant).
In other words, what stops the following sequence of events?
1. Datafeed is started
2. Datafeed does a search
3. Datafeed indexes a timing stats document, which sits in the translog and is not yet committed to Lucene
4. Datafeed is stopped
5. Get datafeed stats searches the index, and does not find the most recent timing stats document containing details of the search
6. Translog is flushed and the latest timing stats becomes searchable
The test might be asserting in between steps 5 and 6. Of course, it's possible that the document becomes searchable before the get datafeed stats request runs, so then the test will pass.
There are two approaches for solving this:
1. On stopping a datafeed, persist the most recent timing stats again with refresh=immediate.
2. Use refresh=immediate every time datafeed timing stats are persisted.

I prefer option 1, but if it's really hard then option 2 is probably OK, as the volume is not that high.
I didn't want to make this (already big) PR any bigger, so I implemented approach 2 and added a TODO to revisit it in the future.
Is that ok?
BTW, I was not aware of the translog.
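The visibility gap described in the sequence above can be modelled with a toy index (hypothetical names, no Elasticsearch APIs): a write lands in the translog immediately but only becomes searchable after a refresh, and refresh=immediate forces that refresh at write time.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of the race between indexing and searchability discussed above.
class ToyIndex {
    private final List<String> translog = new ArrayList<>();
    private final Set<String> searchable = new HashSet<>();

    void index(String docId, boolean refreshImmediate) {
        translog.add(docId); // written durably, but not yet visible to searches
        if (refreshImmediate) {
            refresh();
        }
    }

    void refresh() {
        searchable.addAll(translog);
        translog.clear();
    }

    boolean isSearchable(String docId) {
        return searchable.contains(docId);
    }
}
```

In this model, a test that searches between the write and the refresh observes stale results, which is the flakiness the review comment warns about.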
```java
datafeedConfigBuilder -> {
    String jobId = datafeedConfigBuilder.build().getJobId();
    JobDataDeleter jobDataDeleter = new JobDataDeleter(client, jobId);
    jobDataDeleter.deleteDatafeedTimingStats(
```
I think the timing stats should only be deleted if the job ID is changed. As it stands some small change like increasing the query delay slightly will reset the timing stats.
Good catch.
Done + added test
LGTM
Thanks for all the iterations. Let's merge this and do any further tweaks in the follow-up PR that adds an average search time per bucket.
This PR introduces the DatafeedTimingStats class, which will hold various timing-related metrics useful for debugging.
Initially, the only metric available is totalSearchTimeMs, which accumulates the total time spent by the datafeed searching indices.
Related issue: #29857