[CI] FullClusterRestartIT.testSnapshotRestore breaks other tests by wiping out persistent tasks #36816
Pinging @elastic/ml-core
Muted in 06b175d
AwaitsFix: #36816 This test fails quite reliably.
I have this one failing on 6.6 and a few times on 6.x (build stats). I wonder if we should back-port this change?
One of the 6.x failures is https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+bwc-tests/376/consoleText The error is:
The root cause seems to be a race. From
...
...
...
Or in plain English:
I worry that this is a real bug that would affect customers on 6.7. Immediately before the first attempt at starting the job we have:
We rewrite the persistent tasks in that migration step, and there could be a race there. Thankfully the step of migrating configs for open jobs is new in 6.7, so not released yet. We need to scrutinise in detail the code that rewrites the persistent tasks when migrating the open job configs before 6.7 is released. The race is probably not that likely: the BWC tests had iterated from 5.6.0 through 5.6.10 before hitting this problem, and there's no reason to think it's unique to upgrading from 5.6.10, so that suggests something like a 9% chance of the race occurring (1 hit in 11 starting versions).
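To make the suspected lost update concrete, here is a minimal, self-contained Java sketch. None of the class or variable names below are Elasticsearch APIs, and the real cluster-state machinery applies updates on a single master thread, so the actual race (if there is one) would have to come from an update task computing its changes from a stale copy of the state; the sketch only shows why a read-modify-write over the persistent tasks is fragile.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative model only: a "migration" thread rewrites every existing persistent
// task while an "open job" thread adds a new task in between. Because the migration
// writes back a map built from a stale read, the newly added task is lost.
public class PersistentTaskRaceSketch {

    // Stand-in for the persistent tasks section of the cluster state.
    static final AtomicReference<Map<String, String>> tasks =
            new AtomicReference<>(new HashMap<>());

    public static void main(String[] args) throws InterruptedException {
        Thread migration = new Thread(() -> {
            Map<String, String> before = tasks.get();               // stale read
            Map<String, String> rewritten = new HashMap<>();
            before.forEach((id, params) -> rewritten.put(id, params + "-migrated"));
            sleep(50);                                               // widen the race window
            tasks.set(rewritten);                                    // clobbers anything added meanwhile
        });

        Thread openJob = new Thread(() -> {
            sleep(10);
            Map<String, String> withJob = new HashMap<>(tasks.get());
            withJob.put("job-task", "params");                       // task for the job being opened
            tasks.set(withJob);
        });

        migration.start();
        openJob.start();
        migration.join();
        openJob.join();

        // With these sleeps this prints an empty map: the "job-task" entry was
        // wiped by the migration's write-back.
        System.out.println("persistent tasks after both updates: " + tasks.get());
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```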
@droberts195 https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+bwc-tests/394/console
and also: Should we mute this or are you close to a fix?
@cbuescher please can you mute it in master and 6.x? I think it's going to be at least a day before I can have a fix ready to merge.
Mutes on master and 6.x
I found the problem, and thankfully it is not in the code that we expect production users will run. The problem is that testSnapshotRestore restores a snapshot including the global cluster state, which resets the persistent tasks and wipes out the ones that other tests rely on. (The fact that restoring global cluster state resets persistent tasks to whatever state they were in at the time the snapshot was taken has been the case ever since persistent tasks were introduced. So this is an effect that production users can experience, but ML config migration doesn't change it one way or the other.) The tests don't run in parallel, so you may think that they should not interact in this way. But they run sequentially against the same cluster, and the restored state persists into the tests that follow. The possible solutions I can think of are:
* I don't think this is a good option, as meddling with persistent tasks could be affecting other tests too. For example, #32773 might be caused by the same side effects of testSnapshotRestore.
For future reference, the log message that alerted me to what was happening was |
For test purposes, option 2 seems the easiest/safest. Since we know it could drastically alter other tests, keeping it isolated seems best. Options 1 and 4 would both work but seem more fragile? Option 5 would probably be a nice feature to have in general; I could see people wanting to selectively restore different parts of the cluster state. But it's probably a bigger-impact change, and it wouldn't help retroactively.
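For what it's worth, the restore API already lets callers skip the cluster-wide state: passing `include_global_state: false` restores the indices without touching persistent tasks. A minimal sketch with the plain JDK HTTP client, assuming a cluster on localhost:9200 and repository/snapshot names "repo"/"snap" plus an index pattern that are purely illustrative, not taken from the test:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Restore only the indices from a snapshot, leaving the current global cluster state
// (index templates, persistent tasks, etc.) as it is. Names below are illustrative.
public class RestoreWithoutGlobalState {
    public static void main(String[] args) throws Exception {
        String body = "{ \"indices\": \"restored-*\", \"include_global_state\": false }";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/_snapshot/repo/snap/_restore"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

The same flag exists on snapshot creation, so if the test's snapshot doesn't need to carry global state at all, it could avoid capturing the persistent tasks in the first place.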
Apologies, this was a red herring. The real culprit was unrelated to this issue
which causes all tests that implement
Pinging @elastic/ml-core (Team:ML)
This issue has been closed because it has been open for too long with no activity. Any muted tests that were associated with this issue have been unmuted. If the tests begin failing again, a new issue will be opened, and they may be muted again.
This issue is getting re-opened because there are still AwaitsFix mutes for the given test. It will likely be closed again in the future.
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/867/console
Reproduces for me with a variety of seeds.
I'm going to mute the test.