[CI] Failures on 6.x for 60_ml_config_migration/Test old cluster jobs and datafeeds and delete them #36810
Pinging @elastic/ml-core
@davidkyle I assigned this to you since it seems to have come from #36698.
I think this is a case of an old bug, but one made far more likely to happen by the timing changes of having to do a search for the job configuration rather than getting it from the cluster state. This goes back to the age-old inconsistency that when jobs open for the first time they are in the OPENING state, whereas a job that relocates from one node to another stays OPENED. We have a check that datafeed persistent tasks cannot be assigned to a node until the associated job is assigned to a node. We also have a check that an assigned datafeed cannot be run until the associated job is in the OPENED state. Two possible solutions are:
The piece of code I referred to is at lines 488 to 491 in 0000e7c.
Although option 1 (setting the job state to OPENING) …
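To make the two checks described above concrete, here is a minimal Java sketch of the gating logic; the class, enum, and method names are simplified and hypothetical, not the actual Elasticsearch source.

```java
// Minimal sketch of the two checks described above (hypothetical names,
// not the real Elasticsearch classes).
public final class DatafeedChecks {

    enum JobState { OPENING, OPENED, CLOSING, CLOSED, FAILED }

    // Check 1: a datafeed persistent task can only be assigned to a node
    // once the associated job task has itself been assigned to a node.
    static boolean canAssignDatafeed(String jobAssignedNode) {
        return jobAssignedNode != null;
    }

    // Check 2: an assigned datafeed may only start pulling data once the
    // associated job has reached the OPENED state.
    static boolean canRunDatafeed(JobState jobState) {
        return jobState == JobState.OPENED;
    }
}
```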
We already had logic to stop datafeeds running against jobs that were OPENING, but a job that relocates from one node to another while OPENED stays OPENED, and this could cause the datafeed to fail when it sent data to the OPENED job on its new node before it had a corresponding autodetect process. This change extends the check to stop datafeeds running when their job is OPENING _or_ stale (i.e. has not had its status reset since relocating to a different node). Relates elastic#36810
This problem has been dealt with elsewhere via the concept of "stale" task status. The task status of a job task stores an allocation ID, as does the task itself, and if these differ then we know the task has relocated. Then we can interpret the reported status value differently. This tactic has already been used elsewhere, and I extended its use to the datafeed manager in #37227. #37227 should stop the problem occurring in the fully upgraded cluster. In the mixed cluster it may still happen if a job gets relocated to a node on the old version that doesn't have the fix. I'll think more carefully about this before unmuting the test.
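As an illustration of the staleness idea, here is a hedged sketch using simplified, hypothetical types rather than the real Elasticsearch classes: the status records the allocation ID under which it was written, and a mismatch with the task's current allocation ID means the task has relocated since the status was last set.

```java
// Sketch of a "stale" task status (hypothetical, simplified types).
public final class JobTaskStatus {

    enum JobState { OPENING, OPENED, CLOSING, CLOSED, FAILED }

    private final JobState state;
    private final long allocationId; // allocation ID when this status was written

    JobTaskStatus(JobState state, long allocationId) {
        this.state = state;
        this.allocationId = allocationId;
    }

    // The status is stale if the persistent task has been re-allocated since
    // the status was written, i.e. the allocation IDs no longer match.
    boolean isStale(long currentAllocationId) {
        return allocationId != currentAllocationId;
    }

    // The datafeed gate then becomes: the job must be OPENED *and* the
    // reported status must not be stale.
    boolean allowsDatafeedToRun(long currentAllocationId) {
        return state == JobState.OPENED && !isStale(currentAllocationId);
    }
}
```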
The test that is muted only runs in the fully upgraded cluster, so it can definitely be unmuted when #37227 is merged. The same underlying problem does sometimes happen in the mixed cluster, for example in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.6+bwc-tests/49/console. However, the test that fails in this case is not currently muted. I'll monitor similar failures once #37227 is merged to see if the mixed cluster failures completely go away. If not then I'll adjust which tests are run in the mixed cluster.
The same test is now failing in CI in a slightly different way:
I found out what is happening. Whether a persistent task goes stale when a node leaves and rejoins the cluster depends on whether there is a cluster state update while the cluster has a master node and the node is out of the cluster. If a node leaves and rejoins the cluster while the cluster has no master node then a persistent task running on it will not go stale. Maybe this was intentional, but it's annoying for ML in the case of a node running ML jobs being restarted.

In the rolling upgrade tests all 3 nodes are master eligible and minimum master nodes is 3, so while any node is restarted the cluster has no master. When the upgraded node rejoins the cluster it has a master again. At this point a flurry of cluster state updates occurs. Sometimes the ML persistent tasks that were running on the restarted node are reallocated to a different node and sometimes they stay on the restarted node. If they stay on the restarted node they're not considered stale.

This wouldn't happen in a cluster with minimum master nodes less than the number of nodes in the cluster. Then if a single ML node was restarted the remaining nodes would still be able to form a cluster with a master, and the ML persistent tasks would either get allocated to other nodes or become unallocated while the restarted node was out of the cluster. That would guarantee that they were stale.

The problem can be fixed by checking explicitly whether an autodetect process is running for a datafeed's job before the datafeed does any work. The extent to which this is a test fix or a production fix depends on whether anyone runs in production with minimum master nodes set to the number of nodes in the cluster. That seems a bit crazy, because then a single node failure cripples the cluster.
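Here is a hedged sketch of the extra guard proposed in the last paragraph; the registry and its methods are hypothetical, not Elasticsearch's API. The idea is that, regardless of whether the task looks stale, the datafeed only does work if an autodetect process actually exists for the job's current allocation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry of running autodetect processes, keyed by job ID.
public final class AutodetectProcessRegistry {

    // jobId -> allocation ID of the job task the running process belongs to
    private final Map<String, Long> runningProcesses = new ConcurrentHashMap<>();

    void processStarted(String jobId, long allocationId) {
        runningProcesses.put(jobId, allocationId);
    }

    void processStopped(String jobId) {
        runningProcesses.remove(jobId);
    }

    // Called by the datafeed before each unit of work: there must be a live
    // process for the job *and* it must belong to the job's current allocation.
    boolean hasRunningProcess(String jobId, long currentAllocationId) {
        Long allocation = runningProcesses.get(jobId);
        return allocation != null && allocation == currentAllocationId;
    }
}
```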
Some more failures today. Here is one: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/937/console
I reverted the commit that unmuted the test on 6.x as we had about 10 failures today.
This is a reinforcement of elastic#37227. It turns out that persistent tasks are not made stale if the node they were running on is restarted and the master node does not notice this. The main scenario where this happens is when minimum master nodes is the same as the number of nodes in the cluster, so the cluster cannot elect a master node when any node is restarted. When an ML node restarts we need the datafeeds for any jobs that were running on that node to not just wait until the jobs are allocated, but to wait for the autodetect process of the job to start up. In the case of reassignment of the job persistent task this was dealt with by the stale status test. But in the case where a node restarts but its persistent tasks are not reassigned we need a deeper test. Fixes elastic#36810
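To illustrate the "wait for the autodetect process of the job to start up" part of the fix, here is a small self-contained sketch; the helper and its names are hypothetical, not the actual datafeed code.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper: block until the autodetect process for a job is
// reported as running, or give up after a timeout.
public final class ProcessStartGate {

    static void awaitProcess(BooleanSupplier processRunning, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        // Poll rather than trust the persistent task state alone, because the
        // task may not be marked stale after a node restart (as described above).
        while (processRunning.getAsBoolean() == false) {
            if (System.currentTimeMillis() > deadline) {
                throw new IllegalStateException("autodetect process did not start in time");
            }
            Thread.sleep(100);
        }
    }
}
```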
This failed again in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.6+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=zulu8,nodes=virtual&&linux/47/console, which includes f141f3a, which I believe was intended to fix it. It appears to be happening much less now, so I'm not going to re-mute the test, but I'm not sure the problem is fully fixed.
1180687 is the commit before f141f3a, so the failing build did not contain the fix. (At the time the build ran, f141f3a was committed but had not got past an intake build, so was not being used in the non-intake builds.) https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+bwc-tests/281/consoleText is another build that failed over the weekend due to this problem, but again it used a "last successful commit" from before the fix. The only other build over the weekend where this test failed was https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/972/consoleText, but a whole load of other tests also failed in that build and the root cause looks like a problem setting up security that meant the test harness couldn't connect to the test cluster. Therefore I'll close this issue again.
This has failed at least twice since #36698 was merged to 6.x:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/693/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=java11,nodes=virtual&&linux/127/console
I get failures locally as well, but with a different assertion failure.
I'm going to mute this on 6.x.