daily highstate fails after 2019.2 upgrade #53152
Comments
I am also experiencing this problem. Does this affect Windows minions more than other ones? Are there any suggestions for the best workaround until this is fixed (such as using cmd.run with salt-call)?
I think it's reasonable to conjecture that this bug is the same as, or highly related to, #53135.
I'm not seeing a lot of similarity with #53135, but it's possible I'm missing something. My minions are all on CentOS 7; no Windows.
Anyway, I'm trying out cmd.run with "salt-call state.apply" as the argument in my schedule. I'll let folks know if that's an adequate workaround.
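For anyone following along, a schedule entry of that shape would look roughly like the following. This is a minimal sketch, not the poster's actual pillar; the job name and time are illustrative.

```yaml
schedule:
  daily_highstate_workaround:
    # Shell out to salt-call instead of calling state.highstate directly
    function: cmd.run
    args:
      - 'salt-call state.apply'
    when: '3:00am'
```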
Thanks!
@jbeaird I still get the same problem when I do it with salt-call. Either I downgrade or just schedule restarts; I'm going to schedule restarts now. This isn't ideal, but I guess it won't make that much difference to me. Note that the highstate seems to work right after a minion start (I do mine as part of an orchestration reaction).
Thanks, @H20-17. Are you just going to add cron jobs (or the Windows equivalent, if the minions are Windows) that run salt-call on each box?
@jbeaird I'm using pillar-based scheduling and an orchestration that runs highstate (on the minion) in reaction to the minion start event. I think this will be okay for my purposes, but there is a limit on how many orchestrations can run at once (10 is the default, I think), so that could become a problem; it won't be a big deal for my use case, though.
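As a rough picture of that setup (a sketch only; the reactor path, orchestration name, and pillar key below are illustrative, not taken from this thread): the master reacts to the minion start event by kicking off an orchestration that runs highstate on the newly started minion.

```yaml
# /etc/salt/master.d/reactor.conf (illustrative path)
reactor:
  - 'salt/minion/*/start':
    - /srv/reactor/highstate_on_start.sls

# /srv/reactor/highstate_on_start.sls (illustrative)
run_highstate_orch:
  runner.state.orchestrate:
    - args:
        mods: orch.highstate_minion
        pillar:
          target_minion: {{ data['id'] }}

# /srv/salt/orch/highstate_minion.sls (illustrative)
highstate_new_minion:
  salt.state:
    - tgt: {{ pillar['target_minion'] }}
    - highstate: True
```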
Thanks for the info.
One of the labels is 2019.2.2, but since this isn't closed, I figure it hasn't been fixed yet.
@jbeaird What were those errors, exactly? Trying to reproduce the bug, I got this error on repeat:
Thanks, @Akm0d. Sorry for not clarifying: the errors were state.apply failures, not errors per se. I should note that this has followed me into CentOS 7.7 and persists even with the latest 2019.2 versions.
@jbeaird This is my top priority right now and I'm trying to fix it. Do you have any logs related to the issue?
No indication of the problem yet. I don't think I can reproduce this any more. I'll comment again in the morning.
I am unable to reproduce the bug now. From my end it appears to have been fixed somewhere along the line. If I see it come up again I'll resume trying to triangulate it. Thanks.
Thanks @H20-17 for all the work. I need to upgrade my master to the latest update, which I will do today. I'll upgrade a bunch of minions, too, and let you know what happens. In my experience, running the highstate more often than every 8-12 hours keeps the bug from showing up, so I'm going to leave it at every 24 hours for a few days. Thanks, @Akm0d, for putting this at the top of your list.
Thanks. I did not know that the time interval made a difference. That's an interesting find.
I am now testing with a 24-hour interval (debug-level logging).
@H20-17, is that just a matter of running in the foreground on minions and master?
@jbeaird I'm running the minion in the foreground with the following command: salt-minion.bat --log-file-level=debug. I'm not doing any master-side debugging as of yet. I'm going to see what happens first.
Thanks. I'm running in the foreground on a few minions, too.
Alright. I'm hoping this is somewhat helpful, or at least might give us a direction to move in. Recap: all minions that serve one master, and that master, have been upgraded to the latest version. Yesterday I set three minions, plus the master, to run in the foreground. One of my states that runs at highstate makes sure that the salt-minion service is running, so on the three minions I was running in the foreground, the highstate ran and told the salt-minion service to start. Those boxes now look like this:
Drilling down a bit: on this particular host we have another job scheduled (not highstate) to run every morning, but it started this morning and has kept running every few minutes. Here's the pillar schedule file:
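(The actual file was not captured above. Purely as an illustration of the shape such an entry takes, a non-highstate morning job could be scheduled like this; the function, argument, and time below are hypothetical, not the job from this report:)

```yaml
schedule:
  morning_job:
    function: cmd.run               # hypothetical function and argument
    args:
      - 'c:\scripts\morning_task.bat'
    when: '6:00am'
```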
And here's the debug data for what has been running on that minion every few minutes since this morning:
I don't know (a) why that scheduled job keeps running every few minutes, (b) if it could be caused by having more than one minion process running, or (c) if this is at all related to the original problem for which this issue was opened. On some other minions, on which I am not running in the foreground, I can see a highstate run that repeated over and over again this morning, even though there were no failures in the run. The salt-minion logs on those minions are empty, so I don't have any data to report from them. It seems like it could be a similar problem, though, to the one for which I /do/ have debug data here, even though this is not for a highstate run. Hope this helps somehow. Let me know what to try next. Thanks much!
I'm going to try to reproduce the issue with what we've got. I'll let you know if I need more info. Thank you!
Was not able to test today. Will try to have a test result by Monday.
I'm pretty sure a bunch of my minions didn't run highstate today, but I'm having a hard time verifying that. I've added some whitespace to a file.managed file and downgraded some of the minions back to 2018.3.3-1 to see what happens tomorrow morning, i.e. whether the highstate runs on some of them and not on others.
Sorry guys, I have had no luck reproducing the issue with my test minion, giving it a daily scheduled highstate. When I had the problem I was highstating roughly 20 minions on any given day (at a specific time with a 120-second splay; I believe the splay should have been much bigger, which must have been an oversight). Every PC in my fleet gets highstated once a week, but the day of the week depends on which minion it is. I'm going to stop testing this now and go back to weekly highstates of my minions. If I observe anything I will let you guys know, and then I will do more testing involving more minions.
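For reference, a weekly highstate entry with a splay of that sort generally looks something like this (values illustrative, not the poster's actual pillar):

```yaml
schedule:
  weekly_highstate:
    function: state.highstate
    when: 'Wednesday 3:00am'   # day varies per minion in the setup described above
    splay: 120                 # random delay of up to 120 seconds
```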
The machines I rolled back to 2018.3.3-1 are back to what I consider "normal" behavior: highstate running daily and, when there's a state.apply failure, it only attempts to run once. On those downgraded minions, the file I added whitespace to has a timestamp of Saturday morning, which would have been the next highstate run. I also have returner output for that Saturday morning run. On the minions running 2019.2.3-1, the file was not changed and there was no highstate output via the returner, so no daily highstate run.
After rebooting one of the machines that hadn't run highstate since I upgraded the minion, the highstate ran at the expected time, about 15 hours after the reboot.
Catching up on this and I might have been able to reproduce it. Need to do a bit more testing.
Still digging into the fix for this, but the cause is a conflict between [...]
Thanks @garethgreenaway! Can you think of any simple workarounds we can use instead of splay?
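One interim idea, assuming the splay option is indeed part of the conflict (an assumption based on the comments above, not a confirmed workaround), would be to drop splay from the schedule entry and stagger the times per minion or per group instead, for example:

```yaml
schedule:
  daily_highstate:
    function: state.highstate
    when: '3:07am'   # pick a slightly different minute per minion/group rather than relying on splay
```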
@jbeaird If you're able to live patch your minions, this patch against a 2019.2.3 install will fix the issue:
Thanks again, @garethgreenaway!
closing as fixed in 3000.1 |
After upgrading masters and minions to 2019.2, I observed this behavior:
This happens on two different masters.
Here is a pillar/schedule.sls for one of the masters:
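(The file contents did not survive extraction. A daily highstate schedule pillar of the kind described in this issue usually looks something like the following; this is illustrative only, not the reporter's actual file:)

```yaml
schedule:
  highstate:
    function: state.highstate
    when: '5:00am'     # illustrative time
    splay: 120         # illustrative splay, matching the values discussed above
```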
Downgrading salt and salt-minion on the minions fixed the problem.
Here is one of the upgraded minions:
Here is the master:
Thanks.