
Sporadic KeyError: 'runner' in reactor #61416

Open

jbg-uwaterloo opened this issue Jan 4, 2022 · 7 comments
Labels: Bug broken, incorrect, or confusing behavior, Reactor, VMware

@jbg-uwaterloo

Description
Salt 3002 master on RHEL 7. We have a reactor that watches for failures and alerts via email. The reactor generally works: it has generated 7 valid results and 2 errors today, and on Jan 01 it produced 3 valid results and one error.

Python traceback below:

2022-01-04 08:05:29,651 [salt.utils.reactor:489 ][ERROR   ][2058] Reactor 'email-on-fail' failed to execute runner 'process_minion_data.email_errors'
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/salt/utils/reactor.py", line 468, in run
    ret = l_fun(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/salt/utils/reactor.py", line 496, in runner
    return self.pool.fire_async(self.client_cache["runner"].low, args=(fun, kwargs))
  File "/usr/lib/python3.6/site-packages/salt/utils/cache.py", line 70, in __getitem__
    return dict.__getitem__(self, key)
KeyError: 'runner'

Setup
Reactor config (email addresses sanitized)

email-on-fail:
  runner.process_minion_data.email_errors:
    - fromaddr: XXXXXXXXXXX
    - toaddrs: XXXXXXXXXXXXX
    - data_str: {{ data|yaml_dquote }}
    - smtp_server: 127.0.0.1

There is no general set-up of Salt. RHEL 7 on-prem VM.

Steps to Reproduce the behavior

Expected behavior
No Python errors reported in the salt master log.


Versions Report

Salt Version:
          Salt: 3002.7
 
Dependency Versions:
          cffi: Not Installed
      cherrypy: unknown
      dateutil: 2.4.2
     docker-py: Not Installed
         gitdb: 0.6.4
     gitpython: 1.0.1
        Jinja2: 2.11.1
       libgit2: Not Installed
      M2Crypto: 0.35.2
          Mako: Not Installed
       msgpack: 0.6.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: Not Installed
      pycrypto: 2.6.1
  pycryptodome: Not Installed
        pygit2: Not Installed
        Python: 3.6.8 (default, Aug 13 2020, 07:46:32)
  python-gnupg: Not Installed
        PyYAML: 3.13
         PyZMQ: 17.0.0
         smmap: 0.9.0
       timelib: Not Installed
       Tornado: 4.5.3
           ZMQ: 4.1.4
 
System Versions:
          dist: rhel 7.9 Maipo
        locale: UTF-8
       machine: x86_64
       release: 3.10.0-1160.49.1.el7.x86_64
        system: Linux
       version: Red Hat Enterprise Linux Server 7.9 Maipo
@jbg-uwaterloo jbg-uwaterloo added Bug broken, incorrect, or confusing behavior needs-triage labels Jan 4, 2022
@jbg-uwaterloo (Author)

Similar to #52961 (salt 3000)

@waynew (Contributor) commented Jan 7, 2022

Looking at this, I'm almost positive that the issue, for some reason, is that the reaction_type isn't in self.client_cache - there's code in salt/utils/reactor that removes the item from the cache if it's expired. In my limited testing, I couldn't get it to exhibit this behavior, even with a ridiculously small TTL (2s, vs default 60s). It would delete the runner from the cache & retry.
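
A minimal self-contained sketch of that race (an approximation for illustration only, not the actual salt.utils.cache source; the class and attribute names here are made up):

import time

class TTLDict(dict):
    """Illustrative stand-in for a TTL-expiring cache like salt.utils.cache.CacheDict."""

    def __init__(self, ttl):
        super().__init__()
        self._ttl = ttl
        self._stamps = {}

    def __setitem__(self, key, value):
        self._stamps[key] = time.time()
        super().__setitem__(key, value)

    def __getitem__(self, key):
        # Expired entries are dropped on access, so the plain dict lookup that
        # follows raises KeyError for a key that was present moments earlier.
        if key in self._stamps and time.time() - self._stamps[key] > self._ttl:
            del self._stamps[key]
            super().__delitem__(key)
        return super().__getitem__(key)

cache = TTLDict(ttl=2)
cache["runner"] = object()   # populate_client_cache() puts the client in the cache
time.sleep(5)                # the reaction takes longer than the TTL to reach the lookup
cache["runner"]              # KeyError: 'runner'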

It's entirely possible, though, that there's something else at play here with the interplay of more than one runner, the time it takes to actually run that runner, etc.

If someone has some spare cycles to try and track that down, that would be .... oh nope, just figured it out!

In salt/utils/reactor.py, run() calls self.populate_client_cache(low), then looks up the runner function and calls it with l_fun(*args, **kwargs) (as seen in the stack trace above).

With my 2s timer, I added this:

            # artificial delay, longer than the 2s TTL I set for testing
            import time; time.sleep(5)
            ret = l_fun(*args, **kwargs)

With that in place I see this error 100% of the time. So it looks like what's happening is that, between grabbing runner and actually making this call, the TTL expires.

One potential fix would be to retrieve the runner client from client_cache and pass that to the runner function instead. It looks like wheel would suffer the same problem. Another option would be to catch the exception and maybe repopulate the cache? I'm not currently familiar with why the cache/TTL is even happening, so there may be a better way to handle that. My gut says that if we hit salt/utils/reactor::ReactWrap.run and there's a live function at the start of the call, it should still be usable when it's finally called... but I'm not positive.
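
A rough sketch of that first option (a hypothetical, untested refactor; the client parameter is invented here for illustration), where run() captures the client object while it is known to be live and hands it to the wrapper instead of letting the wrapper re-index the TTL cache:

    # hypothetical change in salt/utils/reactor.py (illustration only)
    def run(self, low):
        ...
        self.populate_client_cache(low)
        client = self.client_cache[low["state"]]   # grab it while it is still live
        l_fun = getattr(self, low["state"])
        ...
        ret = l_fun(*args, client=client, **kwargs)

    def runner(self, fun, client=None, **kwargs):
        # fall back to the cache lookup only if no live client was passed in
        client = client if client is not None else self.client_cache["runner"]
        return self.pool.fire_async(client.low, args=(fun, kwargs))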

Anyway - thanks again for the report!

@waynew waynew added this to the Approved milestone Jan 7, 2022
@waynew waynew removed their assignment Jan 7, 2022
@gbunt commented Oct 12, 2022

We're seeing the same thing in v3002.2:

2022-10-11 13:25:08,469 [salt.template    :31  ][PROFILE ][1401253] Time (in seconds) to render '/srv/salt/reactor/sync_dns.sls' using 'yaml' renderer: 0.0008473396301269531
2022-10-11 13:25:08,471 [salt.utils.reactor:489 ][ERROR   ][1401253] Reactor 'sync_dns' failed to execute runner 'state.orchestrate'
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/salt/utils/reactor.py", line 442, in run
    react_fun = self.client_cache[low["state"]].functions.get(
  File "/usr/lib/python3.6/site-packages/salt/utils/cache.py", line 70, in __getitem__
    return dict.__getitem__(self, key)
KeyError: 'runner'

On average, I'd say 1% of the reactor calls run into this race condition. In our case, that's mission critical. Given there's no follow-up on this issue I can only assume it isn't fixed in v3005, though we will upgrade to the latest version soon.

FWIW, I've not been able to reproduce it any more after enabling retry for the reactor runner:

sync_dns:
  runner.state.orchestrate:
    - retry:
        attempts: 3
        interval: 1
    - args:
      - mods: sync_dns

Looking at the reactor code I would say that's purely coincidental, but we've not hit the exception any more after a few thousand test runs.

@vkotarov commented Feb 7, 2023

Started seeing this 2 days ago and now I have 900+ log messages, each throwing the following:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/salt/utils/reactor.py", line 439, in run
    ret = l_fun(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/salt/utils/reactor.py", line 467, in runner
    return self.pool.fire_async(self.client_cache["runner"].low, args=(fun, kwargs))
  File "/usr/lib/python3.6/site-packages/salt/utils/cache.py", line 64, in __getitem__
    return dict.__getitem__(self, key)
KeyError: 'runner'

I have several events configured to run orchestrate runners and this happens with all of them.
v3004.2 here.

@gmarzot commented Jul 11, 2023

Seen also on 3004.2:

[ERROR   ] Reactor 'handle_stats' failed to execute runner 'salt.cmd'
Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/salt/utils/reactor.py", line 439, in run
    ret = l_fun(*args, **kwargs)
  File "/usr/lib/python3.10/site-packages/salt/utils/reactor.py", line 467, in runner
    return self.pool.fire_async(self.client_cache["runner"].low, args=(fun, kwargs))
  File "/usr/lib/python3.10/site-packages/salt/utils/cache.py", line 64, in __getitem__
    return dict.__getitem__(self, key)
KeyError: 'runner'

happening all the time

I also tried the retry approach from above - no observable effect.

@doesitblend (Collaborator)

@garethgreenaway This appears similar to #52961. I have additional logs available if required to dig into this further.

@wasabi222 (Contributor)

I was able to fix this using @waynew's suggestion above. Here's a patch:

--- utils/reactor.py	2024-09-30 09:55:06
+++ utils/reactor.py	2024-09-30 09:54:47
@@ -461,6 +461,10 @@
         """
         Wrap RunnerClient for executing :ref:`runner modules <all-salt.runners>`
         """
+        if "runner" not in self.client_cache:
+            log.debug("reactor edge case: re-populating client_cache")
+            low = {"state": "runner"}
+            self.populate_client_cache(low)
         return self.pool.fire_async(self.client_cache["runner"].low, args=(fun, kwargs))

     def wheel(self, fun, **kwargs):

This should be expanded to cover the other clients (wheel, etc.) as well. I would submit a PR, but I still don't understand how Salt's testing stack works; maybe somebody can run with this.
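
For instance, the same guard could presumably be applied to the wheel wrapper too (untested sketch; it assumes wheel()'s body mirrors runner()'s, which matches the context shown in the diff above):

     def wheel(self, fun, **kwargs):
         # same defensive re-population as in runner(), in case the TTL cache
         # dropped the entry between populate_client_cache() and this point
         if "wheel" not in self.client_cache:
             log.debug("reactor edge case: re-populating client_cache")
             self.populate_client_cache({"state": "wheel"})
         return self.pool.fire_async(self.client_cache["wheel"].low, args=(fun, kwargs))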

@dmurphy18 dmurphy18 self-assigned this Sep 30, 2024