
daily highstate fails after 2019.2 upgrade #53152

Closed
jbeaird opened this issue May 21, 2019 · 38 comments

jbeaird commented May 21, 2019

After upgrading masters and minions to 2019.2, I observed this behavior:

  • the daily scheduled highstate fails to run UNLESS a manual pillar run has happened recently (within the last 12 hours or so), and even that doesn't always make it run
  • when the highstate does run, if there are any failures, I receive 20 or so failure reports per minion via the returner, as opposed to just one with 2018.3.4; it appears that the minion retries state.apply over and over for about two hours, after which it stops

This happens on two different masters.

Here's the pillar/schedule.sls for one of the masters:

schedule:
  highstate:
    function: state.highstate
    when: 4:30am
    splay: 600
    maxrunning: 1
    returner: highstate

Downgrading salt and salt-minion on the minions fixed the problem.
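
For reference, on CentOS 7 the downgrade itself is roughly the following (the exact version string depends on what's in your repo, so treat this as a sketch):

# pin the last known-good release; version string is illustrative
yum downgrade salt-2018.3.4 salt-minion-2018.3.4
systemctl restart salt-minion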

Here is one of the upgraded minions:

Salt Version:
           Salt: 2019.2.0
 
Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 1.5
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.7.2
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: 0.31.0
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.5.6
   mysql-python: 1.2.5
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 2.7.5 (default, Apr  9 2019, 14:30:50)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 15.3.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.2.1
            ZMQ: 4.1.4
 
System Versions:
           dist: centos 7.6.1810 Core
         locale: UTF-8
        machine: x86_64
        release: 3.10.0-957.12.2.el7.x86_64
         system: Linux
        version: CentOS Linux 7.6.1810 Core

Here is the master:

Salt Version:
           Salt: 2019.2.0
 
Dependency Versions:
           cffi: 1.6.0
       cherrypy: Not Installed
       dateutil: 1.5
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.7.2
        libgit2: 0.26.3
        libnacl: Not Installed
       M2Crypto: 0.31.0
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.5.6
   mysql-python: 1.2.5
      pycparser: 2.14
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: 0.26.4
         Python: 2.7.5 (default, Apr  9 2019, 14:30:50)
   python-gnupg: Not Installed
         PyYAML: 3.11
          PyZMQ: 15.3.0
           RAET: Not Installed
          smmap: Not Installed
        timelib: 0.2.4
        Tornado: 4.2.1
            ZMQ: 4.1.4
 
System Versions:
           dist: centos 7.6.1810 Core
         locale: UTF-8
        machine: x86_64
        release: 3.10.0-957.12.2.el7.x86_64
         system: Linux
        version: CentOS Linux 7.6.1810 Core

Thanks.

twangboy added the Bug, Core, Pillar, Regression, Transport, and severity-medium labels on May 21, 2019
twangboy added this to the Approved milestone on May 21, 2019
twangboy added the v2019.2.2, severity-critical, and ZRELEASED - Neon labels on May 21, 2019
H20-17 commented Jun 12, 2019

I am also experiencing this problem. Does this affect Windows minions more than other ones? Are there any suggestions for a good workaround until this is fixed (such as using cmd.run with salt-call)?

H20-17 commented Jun 12, 2019

I think it's reasonable to conjecture that this bug is the same as, or highly related to, #53135.

jbeaird commented Jun 12, 2019

I'm not seeing a lot of similarity with #53135, but it's possible I'm missing something.

My minions are all on CentOS 7; no Windows.

H20-17 commented Jun 12, 2019

Anyway, I'm trying out cmd.run with "salt-call state.apply" as the argument in my schedule. I'll let folks know if that's an adequate workaround; a sketch of what I mean is below.
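
For reference, here's roughly what that looks like in the schedule pillar (a sketch; the job name and time are just examples):

schedule:
  highstate-via-shell:
    function: cmd.run
    args:
      - salt-call state.apply
    when: 4:30am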

jbeaird commented Jun 12, 2019

Thanks!

H20-17 commented Jun 14, 2019

@jbeaird I still get the same problem when I do it with salt-call. Either I downgrade or I just schedule restarts, and I'm going to schedule restarts now. This isn't ideal, but I guess it won't make that much difference to me.

Note that the highstate seems to work right after a minion start (I do mine as part of an orchestration reaction).

jbeaird commented Jun 14, 2019

Thanks @H20-17 . Are you just going to add cron jobs (or the Windows equivalent, if the minions are Windows) that run salt-call on each box?

H20-17 commented Jun 14, 2019

@jbeaird I'm using pillar-based scheduling, and an orchestration that runs highstate (on the minion) in reaction to the minion start event. I think this will be OK for my purposes, but only a limited number of orchestrations will run at once (10 is the default, I think), so it could be a problem in some setups; it won't be a big deal for my use case.
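
For anyone curious, the shape of that setup is roughly the following (a sketch only; the file paths, orchestration name, and SLS contents are illustrative, not copies of my actual files):

# /etc/salt/master.d/reactor.conf -- map the minion start event to a reactor SLS
reactor:
  - 'salt/minion/*/start':
    - /srv/reactor/highstate_on_start.sls

# /srv/reactor/highstate_on_start.sls -- kick off an orchestration for the
# minion that just started
run_highstate_orch:
  runner.state.orchestrate:
    - args:
      - mods: orch.highstate_on_start
      - pillar:
          minion_id: {{ data['id'] }}

# /srv/salt/orch/highstate_on_start.sls -- the orchestration itself: run
# highstate on the minion that fired the start event
run_highstate:
  salt.state:
    - tgt: {{ pillar['minion_id'] }}
    - highstate: True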

jbeaird commented Jun 14, 2019

Thanks for the info.

H20-17 commented Oct 30, 2019

One of the labels is 2019.2.2, but since this isn't closed I figure it hasn't been fixed yet.

sagetherage changed the title from "daily highstate fails after 2019.02 upgrade" to "daily highstate fails after 2019.2 upgrade" on Jan 17, 2020
Akm0d commented Jan 18, 2020

* when the highstate does run, if there are any failures, I receive 20 or so failure reports per minion via the returner, as opposed to just one with 2018.3.4; it appears that the minion retries state.apply over and over for about two hours, after which it stops

@jbeaird What were those errors exactly? Trying to reproduce the bug, I got this error on repeat:

[INFO    ] Beacon load configuration invalid, not running.
Configuration for load beacon must be a list.

garethgreenaway added the ZRelease-Sodium label and removed the v2019.2.2, ZRELEASED - Neon, and Transport labels on Jan 20, 2020
jbeaird commented Jan 21, 2020

Thanks, @Akm0d. Sorry for not clarifying--the errors were state.apply failures, not errors per se.

I should note that this has followed me into CentOS 7.7, and persists even with the latest 2019.2 versions.

Akm0d commented Jan 21, 2020

@jbeaird This is my top priority right now and I'm trying to fix it. Do you have any logs related to the issue?

H20-17 commented Jan 22, 2020

No indication of the problem yet. I don't think I can reproduce this any more. I'll comment again in the morning.

H20-17 commented Jan 22, 2020

I am unable to reproduce the bug now. From my end it appears to have been fixed somewhere along the line. If I see it come up again I'll resume trying to triangulate it. Thanks.

jbeaird commented Jan 22, 2020

Thanks @H20-17 for all the work.

I need to upgrade my master to the latest update, which I will do today.

I'll upgrade a bunch of minions, too, and let you know what happens.

In my experience, running the highstate more often than every 8-12 hours keeps the bug from showing up, so I'm going to leave it at every 24 hours for a few days.
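
For anyone wanting the more-frequent run as a stopgap, it would look roughly like this in the schedule pillar (a sketch; the hours keyword replaces when):

schedule:
  highstate:
    function: state.highstate
    hours: 8
    maxrunning: 1
    returner: highstate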

Thanks, @Akm0d for putting this at the top of your list.

H20-17 commented Jan 22, 2020

Thanks. I did not know that the time interval made a difference. That's an interesting find.

H20-17 commented Jan 22, 2020

I am now testing with a 24 hour interval (debug level logging).

jbeaird commented Jan 22, 2020

@H20-17, is that just a matter of running in the foreground on minions and master?

H20-17 commented Jan 22, 2020

@jbeaird I'm running the minion in the foreground with the following command: salt-minion.bat --log-file-level=debug. I'm not doing any master side debugging as of yet. I'm going to see what happens first.

jbeaird commented Jan 22, 2020

Thanks. I'm running in the foreground on a few minions, too.

jbeaird commented Jan 23, 2020

Alright. I'm hoping this is somewhat helpful or at least might give us a direction to move in.

Recap: all the minions that serve one master, and that master itself, have been upgraded to the latest version. Yesterday I set three minions, plus the master, to run in the foreground.

One of the states in my highstate makes sure the salt-minion service is running, so on the three minions I was running in the foreground, the highstate ran and told the salt-minion service to start. Those boxes now look like this:

[root@host ~]# ps auxwww | grep salt
root 2646 0.0 0.0 112712 964 pts/1 S+ 10:05 0:00 grep --color=auto salt
root 6889 0.0 0.5 314180 21768 pts/0 S+ Jan22 0:00 /usr/bin/python /usr/bin/salt-minion -l debug
root 6892 0.1 1.3 791584 51616 pts/0 Sl+ Jan22 1:16 /usr/bin/python /usr/bin/salt-minion -l debug
root 6894 0.0 0.5 395456 20256 pts/0 S+ Jan22 0:00 /usr/bin/python /usr/bin/salt-minion -l debug
root 25321 0.0 0.5 314188 21760 ? Ss 04:34 0:00 /usr/bin/python /usr/bin/salt-minion
root 25328 0.1 1.3 711484 53772 ? Sl 04:34 0:25 /usr/bin/python /usr/bin/salt-minion
root 25333 0.0 0.5 395456 20236 ? S 04:34 0:00 /usr/bin/python /usr/bin/salt-minion
root 31842 0.0 0.5 395456 20236 ? S Jan21 0:00 /usr/bin/python /usr/bin/salt-minion

Drilling down a bit: on this particular host we have another job scheduled (not highstate) to run every morning, but it started this morning and has kept running every few minutes since. Here's the pillar schedule file:

schedule:
  oh-my-zsh:
    function: state.apply
    args:
      - library.name.user
    # run between 4:10-4:20am
    # - pull in the updates to the oh-my-zsh repo
    # - do not use the highstate returner (we don't need email alerts about the
    #   active development we pull in daily from this repo)
    # - complete before scheduled nightly state.apply starts at 4:30am
    # - minimize chances that the scheduled nightly state.apply will have changes
    #   because of a fresh commit to the oh-my-zsh repo
    when: 4:10am
    # splay job start time over 600 seconds (10 minutes)
    splay: 600

And here's the debug data for what has been running on that minion every few minutes since this morning:

[DEBUG ] schedule: Job oh-my-zsh was scheduled with jid_include, adding to cache (jid_include defaults to True)
[DEBUG ] schedule: Job oh-my-zsh was scheduled with a max number of 1
[INFO ] Running scheduled job: oh-my-zsh (runtime missed by 0.594065 seconds)
[DEBUG ] Subprocess SignalHandlingMultiprocessingProcess-1:101-Schedule-oh-my-zsh added
[DEBUG ] schedule.handle_func: adding this job to the jobcache with data {u'fun_args': [], u'jid': u'20200123095749634377', u'schedule': 'oh-my-zsh', u'pid': 1456, u'fun': 'state.apply', u'id': u'host.wheaton.edu'}
[DEBUG ] LazyLoaded state.apply
[DEBUG ] LazyLoaded saltutil.is_running
[DEBUG ] LazyLoaded grains.get
[DEBUG ] LazyLoaded config.get
[DEBUG ] key: test, ret: _|-
[DEBUG ] Initializing new AsyncZeroMQReqChannel for (u'/etc/salt/pki/minion', u'host.wheaton.edu', u'tcp://172.20.1.132:4506', u'aes')
[DEBUG ] Initializing new AsyncAuth for (u'/etc/salt/pki/minion', u'host.wheaton.edu', u'tcp://172.20.1.132:4506')
[DEBUG ] Connecting the Minion to the Master URI (for the return server): tcp://172.20.1.132:4506
[DEBUG ] Trying to connect to: tcp://172.20.1.132:4506
[DEBUG ] Gathering pillar data for state run
[DEBUG ] Determining pillar cache
[DEBUG ] Initializing new AsyncZeroMQReqChannel for (u'/etc/salt/pki/minion', u'host.wheaton.edu', u'tcp://172.20.1.132:4506', u'aes')
[DEBUG ] Initializing new AsyncAuth for (u'/etc/salt/pki/minion', u'host.wheaton.edu', u'tcp://172.20.1.132:4506')
[DEBUG ] Connecting the Minion to the Master URI (for the return server): tcp://172.20.1.132:4506
[DEBUG ] Trying to connect to: tcp://172.20.1.132:4506
[DEBUG ] salt.crypt.get_rsa_key: Loading private key
[DEBUG ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG ] Closing AsyncZeroMQReqChannel instance
[DEBUG ] Finished gathering pillar data for state run
[INFO ] Loading fresh modules for state activity
[DEBUG ] LazyLoaded jinja.render
[DEBUG ] LazyLoaded yaml.render
[DEBUG ] In saltenv 'base', looking at rel_path 'library/name/user.sls' to resolve 'salt://library/name/user.sls'
[DEBUG ] In saltenv 'base', ** considering ** path '/var/cache/salt/minion/files/base/library/name/user.sls' to resolve 'salt://library/name/user.sls'
[DEBUG ] compile template: /var/cache/salt/minion/files/base/library/name/user.sls
[DEBUG ] Jinja search path: [u'/var/cache/salt/minion/files/base']
[DEBUG ] Initializing new AsyncZeroMQReqChannel for (u'/etc/salt/pki/minion', u'host.wheaton.edu', u'tcp://172.20.1.132:4506', u'aes')
[DEBUG ] Initializing new AsyncAuth for (u'/etc/salt/pki/minion', u'host.wheaton.edu', u'tcp://172.20.1.132:4506')
[DEBUG ] Connecting the Minion to the Master URI (for the return server): tcp://172.20.1.132:4506
[DEBUG ] Trying to connect to: tcp://172.20.1.132:4506
[PROFILE ] Time (in seconds) to render '/var/cache/salt/minion/files/base/library/name/user.sls' using 'jinja' renderer: 0.0152320861816
[DEBUG ] Rendered data from file: /var/cache/salt/minion/files/base/library/name/user.sls:
# Install and configure ZSH
install ZSH:
  pkg.installed:
    - pkgs:
      - zsh

install oh-my-zsh:
  git.latest:
    - name: https://github.com/robbyrussell/oh-my-zsh.git
    - target: /home/username/.oh-my-zsh
    - user: username
    - require:
      - cmd: create name's home dir
      - pkg: install ZSH

manage oh-my-zsh permissions:
  file.directory:
    - name: /home/username/.oh-my-zsh
    - user: username
    - group: domain users
    - dir_mode: 700
    - recurse:
      - user
      - group
    - require:
      - git: install oh-my-zsh

install antigen:
  file.managed:
    - name: /home/username/.antigen.zsh
    - user: username
    - group: domain users
    - source: salt://library/name/antigen.zsh
    - require:
      - git: install oh-my-zsh

set .zshrc:
  file.managed:
    - name: /home/username/.zshrc
    - user: username
    - group: domain users
    - source: salt://library/name/.zshrc
    - require:
      - file: install antigen

# Make sure there is my user
create name's home dir:
  cmd.run:
    - name: /usr/sbin/mkhomedir_helper username 0077
    - creates: /home/username

# make sure there is a user .ssh directory
ssh directory:
  file.directory:
    - name: /home/username/.ssh
    - user: username
    - group: domain users
    - dir_mode: 700
    - require:
      - cmd: create name's home dir

# ssh public key
make username's public key:
  file.managed:
    - name: /home/username/.ssh/authorized_keys
    - user: username
    - group: domain users
    - mode: 600
    - source: salt://library/name/authorized_keys
    - require:
      - file: ssh directory
[DEBUG ] Results of YAML rendering:
OrderedDict([(u'install ZSH', OrderedDict([(u'pkg.installed', [OrderedDict([(u'pkgs', [u'zsh'])])])])), (u'install oh-my-zsh', OrderedDict([(u'git.latest', [OrderedDict([(u'name', u'https://github.com/robbyrussell/oh-my-zsh.git')]), OrderedDict([(u'target', u'/home/username/.oh-my-zsh')]), OrderedDict([(u'user', u'username')]), OrderedDict([(u'require', [OrderedDict([(u'cmd', u"create name's home dir")]), OrderedDict([(u'pkg', u'install ZSH')])])])])])), (u'manage oh-my-zsh permissions', OrderedDict([(u'file.directory', [OrderedDict([(u'name', u'/home/username/.oh-my-zsh')]), OrderedDict([(u'user', u'username')]), OrderedDict([(u'group', u'domain users')]), OrderedDict([(u'dir_mode', 700)]), OrderedDict([(u'recurse', [u'user', u'group'])]), OrderedDict([(u'require', [OrderedDict([(u'git', u'install oh-my-zsh')])])])])])), (u'install antigen', OrderedDict([(u'file.managed', [OrderedDict([(u'name', u'/home/username/.antigen.zsh')]), OrderedDict([(u'user', u'username')]), OrderedDict([(u'group', u'domain users')]), OrderedDict([(u'source', u'salt://library/name/antigen.zsh')]), OrderedDict([(u'require', [OrderedDict([(u'git', u'install oh-my-zsh')])])])])])), (u'set .zshrc', OrderedDict([(u'file.managed', [OrderedDict([(u'name', u'/home/username/.zshrc')]), OrderedDict([(u'user', u'username')]), OrderedDict([(u'group', u'domain users')]), OrderedDict([(u'source', u'salt://library/name/.zshrc')]), OrderedDict([(u'require', [OrderedDict([(u'file', u'install antigen')])])])])])), (u"create name's home dir", OrderedDict([(u'cmd.run', [OrderedDict([(u'name', u'/usr/sbin/mkhomedir_helper username 0077')]), OrderedDict([(u'creates', u'/home/username')])])])), (u'ssh directory', OrderedDict([(u'file.directory', [OrderedDict([(u'name', u'/home/username/.ssh')]), OrderedDict([(u'user', u'username')]), OrderedDict([(u'group', u'domain users')]), OrderedDict([(u'dir_mode', 700)]), OrderedDict([(u'require', [OrderedDict([(u'cmd', u"create name's home dir")])])])])])), (u"make username's public key", OrderedDict([(u'file.managed', [OrderedDict([(u'name', u'/home/username/.ssh/authorized_keys')]), OrderedDict([(u'user', u'username')]), OrderedDict([(u'group', u'domain users')]), OrderedDict([(u'mode', 600)]), OrderedDict([(u'source', u'salt://library/name/authorized_keys')]), OrderedDict([(u'require', [OrderedDict([(u'file', u'ssh directory')])])])])]))])
[PROFILE ] Time (in seconds) to render '/var/cache/salt/minion/files/base/library/name/user.sls' using 'yaml' renderer: 0.0042450428009
[DEBUG ] LazyLoaded config.option
[DEBUG ] LazyLoaded pkg.install
[DEBUG ] LazyLoaded pkg.installed
[DEBUG ] schedule.handle_func: Adding splay of 328 seconds to next run.
[DEBUG ] key: ifttt.secret_key, ret: |-
[DEBUG ] key: ifttt:secret_key, ret: |-
[DEBUG ] key: pushbullet.api_key, ret: |-
[DEBUG ] key: pushbullet:api_key, ret: |-
[DEBUG ] LazyLoaded systemd.booted
[DEBUG ] You should upgrade pyOpenSSL to at least 0.14.1 to enable the use of X509 extensions in the tls module
[DEBUG ] key: victorops.api_key, ret: |-
[DEBUG ] key: victorops:api_key, ret: |-
[DEBUG ] DSC: Only available on Windows systems
[DEBUG ] Module PSGet: Only available on Windows systems
[DEBUG ] LazyLoaded platform.is_windows
[DEBUG ] Could not LazyLoad pkg.ex_mod_init: 'pkg.ex_mod_init' is not available.
[INFO ] Running state [install ZSH] at time 09:57:51.526328
[INFO ] Executing state pkg.installed for [install ZSH]
[DEBUG ] Could not LazyLoad pkg.resolve_capabilities: 'pkg.resolve_capabilities' is not available.
[INFO ] Executing command [u'rpm', u'-qa', u'--queryformat', u'%{NAME}|-%{EPOCH}|-%{VERSION}|-%{RELEASE}|-%{ARCH}|-(none)|-%{INSTALLTIME}'] in directory '/root'
[INFO ] All specified packages are already installed
[INFO ] Completed state [install ZSH] at time 09:57:53.265552 (duration_in_ms=1739.224)
[INFO ] Executing command [u'git', u'--version'] in directory '/root'
[DEBUG ] stdout: git version 1.8.3.1
[DEBUG ] LazyLoaded git.latest
[DEBUG ] LazyLoaded cmd.run
[INFO ] Running state [/usr/sbin/mkhomedir_helper username 0077] at time 09:57:53.285793
[INFO ] Executing state cmd.run for [/usr/sbin/mkhomedir_helper username 0077]
[INFO ] /home/username exists
[INFO ] Completed state [/usr/sbin/mkhomedir_helper username 0077] at time 09:57:53.287441 (duration_in_ms=1.648)
[INFO ] Running state [https://github.com/robbyrussell/oh-my-zsh.git] at time 09:57:53.287824
[INFO ] Executing state git.latest for [https://github.com/robbyrussell/oh-my-zsh.git]
[INFO ] Executing command [u'git', u'config', u'--global', u'--get-regexp', u'filter\.lfs\.'] in directory '/root'
[DEBUG ] retcode: 1
[INFO ] Checking remote revision for https://github.com/robbyrussell/oh-my-zsh.git
[INFO ] Executing command [u'git', u'ls-remote', u'https://github.com/robbyrussell/oh-my-zsh.git'] as user 'username' in directory '/home/username'
[DEBUG ] env command: (u'su', u'-s', '/bin/bash', u'-', u'username', u'-c', '/usr/bin/python')
[DEBUG ] stdout: 4e45e12dc355e3ba34e7e40ce4936fb222f0155c HEAD
4e45e12dc355e3ba34e7e40ce4936fb222f0155c refs/heads/master
...
I'm cutting a bunch of these out for brevity's sake...
...
974b111e08e9845cda5bea17212c523eeef45985 refs/pull/999/merge
[INFO ] Executing command [u'git', u'for-each-ref', u'--format', u'%(refname:short)', u'refs/heads/'] as user 'username' in directory '/home/username/.oh-my-zsh'
[DEBUG ] env command: (u'su', u'-s', '/bin/bash', u'-', u'username', u'-c', '/usr/bin/python')
[DEBUG ] stdout: master
[INFO ] Executing command [u'git', u'for-each-ref', u'--format', u'%(refname:short)', u'refs/tags/'] as user 'username' in directory '/home/username/.oh-my-zsh'
[DEBUG ] env command: (u'su', u'-s', '/bin/bash', u'-', u'username', u'-c', '/usr/bin/python')
[INFO ] Checking local revision for /home/username/.oh-my-zsh
[INFO ] Executing command [u'git', u'rev-parse', u'HEAD'] as user 'username' in directory '/home/username/.oh-my-zsh'
[DEBUG ] env command: (u'su', u'-s', '/bin/bash', u'-', u'username', u'-c', '/usr/bin/python')
[DEBUG ] stdout: 4e45e12dc355e3ba34e7e40ce4936fb222f0155c
[INFO ] Checking local branch for /home/username/.oh-my-zsh
[INFO ] Executing command [u'git', u'rev-parse', u'--abbrev-ref', u'HEAD'] as user 'username' in directory '/home/username/.oh-my-zsh'
[DEBUG ] env command: (u'su', u'-s', '/bin/bash', u'-', u'username', u'-c', '/usr/bin/python')
[DEBUG ] stdout: master
[INFO ] Executing command [u'git', u'remote', u'--verbose'] as user 'username' in directory '/home/username/.oh-my-zsh'
[DEBUG ] env command: (u'su', u'-s', '/bin/bash', u'-', u'username', u'-c', '/usr/bin/python')
[DEBUG ] stdout: origin https://github.com/robbyrussell/oh-my-zsh.git (fetch)
origin https://github.com/robbyrussell/oh-my-zsh.git (push)
[INFO ] Executing command [u'git', u'diff', u'HEAD'] as user 'username' in directory '/home/username/.oh-my-zsh'
[DEBUG ] env command: (u'su', u'-s', '/bin/bash', u'-', u'username', u'-c', '/usr/bin/python')
[INFO ] Executing command [u'git', u'merge-base', u'--is-ancestor', u'4e45e12dc355e3ba34e7e40ce4936fb222f0155c', u'4e45e12dc355e3ba34e7e40ce4936fb222f0155c'] as user 'username' in directory '/home/username/.oh-my-zsh'
[DEBUG ] env command: (u'su', u'-s', '/bin/bash', u'-', u'username', u'-c', '/usr/bin/python')
[INFO ] Executing command [u'git', u'rev-parse', u'--abbrev-ref', u'master@{upstream}'] as user 'username' in directory '/home/username/.oh-my-zsh'
[DEBUG ] env command: (u'su', u'-s', '/bin/bash', u'-', u'username', u'-c', '/usr/bin/python')
[DEBUG ] stdout: origin/master
[INFO ] Executing command [u'git', u'ls-remote', u'--tags', u'origin'] as user 'username' in directory '/home/username/.oh-my-zsh'
[DEBUG ] env command: (u'su', u'-s', '/bin/bash', u'-', u'username', u'-c', '/usr/bin/python')
[INFO ] Executing command [u'git', u'rev-parse', u'HEAD'] as user 'username' in directory '/home/username/.oh-my-zsh'
[DEBUG ] env command: (u'su', u'-s', '/bin/bash', u'-', u'username', u'-c', '/usr/bin/python')
[DEBUG ] stdout: 4e45e12dc355e3ba34e7e40ce4936fb222f0155c
[INFO ] Repository /home/username/.oh-my-zsh is up-to-date
[INFO ] Completed state [https://github.com/robbyrussell/oh-my-zsh.git] at time 09:57:55.771382 (duration_in_ms=2483.557)
[DEBUG ] LazyLoaded file.directory
[INFO ] Running state [/home/username/.oh-my-zsh] at time 09:57:55.778111
[INFO ] Executing state file.directory for [/home/username/.oh-my-zsh]
[INFO ] The directory /home/username/.oh-my-zsh is in the correct state
[INFO ] Completed state [/home/username/.oh-my-zsh] at time 09:57:57.057594 (duration_in_ms=1279.483)
[INFO ] Running state [/home/username/.antigen.zsh] at time 09:57:57.058069
[INFO ] Executing state file.managed for [/home/username/.antigen.zsh]
[DEBUG ] Initializing new AsyncZeroMQReqChannel for (u'/etc/salt/pki/minion', u'host.wheaton.edu', u'tcp://172.20.1.132:4506', u'aes')
[DEBUG ] Initializing new AsyncAuth for (u'/etc/salt/pki/minion', u'host.wheaton.edu', u'tcp://172.20.1.132:4506')
[DEBUG ] Connecting the Minion to the Master URI (for the return server): tcp://172.20.1.132:4506
[DEBUG ] Trying to connect to: tcp://172.20.1.132:4506
[DEBUG ] In saltenv 'base', looking at rel_path 'library/name/antigen.zsh' to resolve 'salt://library/name/antigen.zsh'
[DEBUG ] In saltenv 'base', ** considering ** path '/var/cache/salt/minion/files/base/library/name/antigen.zsh' to resolve 'salt://library/name/antigen.zsh'
[INFO ] File /home/username/.antigen.zsh is in the correct state
[INFO ] Completed state [/home/username/.antigen.zsh] at time 09:57:57.092618 (duration_in_ms=34.549)
[INFO ] Running state [/home/username/.zshrc] at time 09:57:57.093287
[INFO ] Executing state file.managed for [/home/username/.zshrc]
[DEBUG ] In saltenv 'base', looking at rel_path 'library/name/.zshrc' to resolve 'salt://library/name/.zshrc'
[DEBUG ] In saltenv 'base', ** considering ** path '/var/cache/salt/minion/files/base/library/name/.zshrc' to resolve 'salt://library/name/.zshrc'
[INFO ] File /home/username/.zshrc is in the correct state
[INFO ] Completed state [/home/username/.zshrc] at time 09:57:57.109653 (duration_in_ms=16.366)
[INFO ] Running state [/home/username/.ssh] at time 09:57:57.110128
[INFO ] Executing state file.directory for [/home/username/.ssh]
[INFO ] The directory /home/username/.ssh is in the correct state
[INFO ] Completed state [/home/username/.ssh] at time 09:57:57.111696 (duration_in_ms=1.567)
[INFO ] Running state [/home/username/.ssh/authorized_keys] at time 09:57:57.112288
[INFO ] Executing state file.managed for [/home/username/.ssh/authorized_keys]
[DEBUG ] In saltenv 'base', looking at rel_path 'library/name/authorized_keys' to resolve 'salt://library/name/authorized_keys'
[DEBUG ] In saltenv 'base', ** considering ** path '/var/cache/salt/minion/files/base/library/name/authorized_keys' to resolve 'salt://library/name/authorized_keys'
[INFO ] File /home/username/.ssh/authorized_keys is in the correct state
[INFO ] Completed state [/home/username/.ssh/authorized_keys] at time 09:57:57.129189 (duration_in_ms=16.9)
[DEBUG ] File /var/cache/salt/minion/accumulator/139640369767056 does not exist, no need to cleanup
[DEBUG ] LazyLoaded state.check_result
[DEBUG ] SaltEvent PUB socket URI: /var/run/salt/minion/minion_event_f84fe5becb_pub.ipc
[DEBUG ] SaltEvent PULL socket URI: /var/run/salt/minion/minion_event_f84fe5becb_pull.ipc
[DEBUG ] Sending event: tag = schedule_return; data = {u'fun_args': ['library.name.user'], u'jid': u'req', u'return': {u'file|-install antigen|-/home/username/.antigen.zsh_|-managed': {u'comment': u'File /home/username/.antigen.zsh is in the correct state', u'name': u'/home/username/.antigen.zsh', u'start_time': '09:57:57.058069', u'result': True, u'duration': 34.549, u'run_num': 4, u'sls': u'library.name.user', u'changes': {}, u'id': u'install antigen'}, u'file_|-manage oh-my-zsh permissions_|-/home/username/.oh-my-zsh_|-directory': {u'comment': u'The directory /home/username/.oh-my-zsh is in the correct state', u'name': u'/home/username/.oh-my-zsh', u'start_time': '09:57:55.778111', u'result': True, u'duration': 1279.483, u'run_num': 3, u'sls': u'library.name.user', u'changes': {}, u'id': u'manage oh-my-zsh permissions'}, u'file_|-ssh directory_|-/home/username/.ssh_|-directory': {u'comment': u'The directory /home/username/.ssh is in the correct state', u'name': u'/home/username/.ssh', u'start_time': '09:57:57.110129', u'result': True, u'duration': 1.567, u'run_num': 6, u'sls': u'library.name.user', u'changes': {}, u'id': u'ssh directory'}, u'git_|-install oh-my-zsh_|-https://github.com/robbyrussell/oh-my-zsh.git_|-latest': {u'comment': u'Repository /home/username/.oh-my-zsh is up-to-date', u'name': u'https://github.com/robbyrussell/oh-my-zsh.git', u'start_time': '09:57:53.287825', u'result': True, u'duration': 2483.557, u'run_num': 2, u'sls': u'library.name.user', u'changes': {}, u'id': u'install oh-my-zsh'}, u'pkg_|-install ZSH_|-install ZSH_|-installed': {u'comment': u'All specified packages are already installed', u'name': u'zsh', u'start_time': '09:57:51.526328', u'result': True, u'duration': 1739.224, u'run_num': 0, u'sls': u'library.name.user', u'changes': {}, u'id': u'install ZSH'}, u"file_|-make username's public key_|-/home/username/.ssh/authorized_keys_|-managed": {u'comment': u'File /home/username/.ssh/authorized_keys is in the correct state', u'name': u'/home/username/.ssh/authorized_keys', u'start_time': '09:57:57.112289', u'result': True, u'duration': 16.9, u'run_num': 7, u'sls': u'library.name.user', u'changes': {}, u'id': u"make username's public key"}, u'file_|-set .zshrc_|-/home/username/.zshrc_|-managed': {u'comment': u'File /home/username/.zshrc is in the correct state', u'name': u'/home/username/.zshrc', u'start_time': '09:57:57.093287', u'result': True, u'duration': 16.366, u'run_num': 5, u'sls': u'library.name.user', u'changes': {}, u'id': u'set .zshrc'}, u"cmd_|-create name's home dir_|-/usr/sbin/mkhomedir_helper username 0077_|-run": {u'comment': u'/home/username exists', u'id': u"create name's home dir", u'name': u'/usr/sbin/mkhomedir_helper username 0077', u'duration': 1.648, u'run_num': 1, u'start_time': '09:57:53.285793', u'sls': u'library.name.user', u'changes': {}, u'result': True}}, u'retcode': 0, u'success': True, u'schedule': 'oh-my-zsh', u'cmd': u'_return', u'pid': 1456, u'_stamp': '2020-01-23T15:57:57.135470', u'fun': 'state.apply', u'id': u'host.wheaton.edu'}
[DEBUG ] schedule.handle_func: Removing /var/cache/salt/minion/proc/20200123095749634377
[DEBUG ] Subprocess SignalHandlingMultiprocessingProcess-1:101-Schedule-oh-my-zsh cleaned up

I don't know (a) why that scheduled job keeps running every few minutes, (b) whether it could be caused by having more than one minion process running, or (c) whether this is related at all to the original problem this issue was opened for. On some other minions, which I am not running in the foreground, I can see a highstate run that repeated over and over again this morning, even though there were no failures in the run. The salt-minion logs on those minions are empty, so I don't have any data to report from them. It seems like it could be a similar problem, though, to the one I /do/ have debug data for here, even though this is not a highstate run.

Hope this helps somehow. Let me know what to try next. Thanks much!

Akm0d commented Jan 23, 2020

I'm going to try to reproduce the issue with what we've got. I'll let you know if I need more info, thank you!

H20-17 commented Jan 23, 2020

Was not able to test today. Will try to have a test result by Monday.

jbeaird commented Jan 24, 2020

I'm pretty sure a bunch of my minions didn't run highstate today but am having a hard time verifying that.

I've added some whitespace to a file.managed file and downgraded some of the minions back to 2018.3.3-1 to see what happens tomorrow morning, whether the highstate will run on some of them and not on others.

H20-17 commented Jan 25, 2020

Sorry guys, I have had no luck reproducing the issue with my test minion by giving it a daily scheduled highstate. When I had the problem, I was highstating approx 20 or so minions on any given day, at a specific time with a 120-second splay (I believe the splay should have been much bigger; that must have been an oversight). Every PC in my fleet gets highstated once a week, but the day of the week depends on which minion it is. I'm going to stop testing this now and go back to weekly highstates of my minions. If I observe anything, I will let you guys know, and then I will do more testing involving more minions.

jbeaird commented Jan 27, 2020

The machines I rolled back to 2018.3.3-1 are back to what I consider "normal" behavior--highstate running daily and, when there's a state.apply failure, it only attempts to run once.

On those downgraded minions, the file I added whitespace to has a timestamp of Saturday morning, which would have been the next highstate run. I also have returner output for that Saturday morning run. On the minions running 2019.2.3-1, the file was not changed and there was no highstate output via the returner, so no daily highstate ran.

jbeaird commented Jan 30, 2020

After rebooting one of the machines that hadn't run highstate since I upgraded the minion, the highstate ran at the expected time, about 15 hours after the reboot.

garethgreenaway commented
Catching up on this and I might have been able to reproduce it. Need to do a bit more testing.

garethgreenaway commented
Still digging into the fix for this, but the cause is a conflict between when and splay. If you remove splay (not ideal, obviously), I'm confident that you'll see your scheduled state runs fire when expected and only once per day.
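
Using the schedule pillar from the top of this issue as an example, the splay-free workaround would look like this (a sketch):

schedule:
  highstate:
    function: state.highstate
    when: 4:30am
    # splay: 600  <- removed: conflicts with `when` on 2019.2.x
    maxrunning: 1
    returner: highstate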

jbeaird commented Feb 12, 2020

Thanks @garethgreenaway !

Can you think of any simple workarounds we can use instead of splay?

garethgreenaway commented Feb 13, 2020

@jbeaird If you're able to live patch your minions, this patch against a 2019.2.3 install will fix the issue:

diff --git a/salt/utils/schedule.py b/salt/utils/schedule.py
index 3ec12f576d..b32c2690bc 100644
--- a/salt/utils/schedule.py
+++ b/salt/utils/schedule.py
@@ -1003,7 +1003,7 @@ class Schedule(object):
 
                     if when < now - loop_interval and \
                             not data.get('_run', False) and \
-                            not data.get('run', False) and \
+                            not run and \
                             not data['_splay']:
                         data['_next_fire_time'] = None
                         data['_continue'] = True
@@ -1059,7 +1059,7 @@ class Schedule(object):
 
                 if when < now - loop_interval and \
                         not data.get('_run', False) and \
-                        not data.get('run', False) and \
+                        not run and \
                         not data['_splay']:
                     data['_next_fire_time'] = None
                     data['_continue'] = True
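
If you can't rebuild packages, one way to apply it in place is with the patch utility (a sketch; the site-packages path assumes the stock CentOS 7 python2 RPM layout, and the patch filename is arbitrary):

# save the diff above as /tmp/schedule-when-splay.patch, then:
cd /usr/lib/python2.7/site-packages
patch -p1 < /tmp/schedule-when-splay.patch
systemctl restart salt-minion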

jbeaird commented Feb 20, 2020

Thanks again, @garethgreenaway !

sagetherage added the v3000.1 label and removed the ZRelease-Sodium label on Apr 16, 2020
sagetherage commented
Closing as fixed in 3000.1.
