
[BUG] Nonce verification error with TCP transport on slower network connections #65114

Closed
terryd-imh opened this issue Sep 5, 2023 · 42 comments
Labels: Bug (broken, incorrect, or confusing behavior), Transport, VMware

@terryd-imh

Description

salt-call state.apply gives:

salt.exceptions.SaltClientError: Nonce verification error
[ERROR   ] An un-handled exception was caught by Salt's global exception handler:
SaltClientError: Nonce verification error
Traceback (most recent call last):
  File "/usr/bin/salt-call", line 11, in <module>
    sys.exit(salt_call())
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/scripts.py", line 444, in salt_call
    client.run()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/cli/call.py", line 50, in run
    caller.run()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/cli/caller.py", line 95, in run
    ret = self.call()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/cli/caller.py", line 202, in call
    ret["return"] = self.minion.executors[fname](
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/loader/lazy.py", line 149, in __call__
    return self.loader.run(run_func, *args, **kwargs)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/loader/lazy.py", line 1232, in run
    return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/loader/lazy.py", line 1247, in _run_as
    return _func_or_method(*args, **kwargs)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/executors/direct_call.py", line 10, in execute
    return func(*args, **kwargs)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/loader/lazy.py", line 149, in __call__
    return self.loader.run(run_func, *args, **kwargs)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/loader/lazy.py", line 1232, in run
    return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/loader/lazy.py", line 1247, in _run_as
    return _func_or_method(*args, **kwargs)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/modules/state.py", line 834, in apply_
    return highstate(**kwargs)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/modules/state.py", line 1183, in highstate
    ret = st_.call_highstate(
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/state.py", line 4756, in call_highstate
    high, errors = self.render_highstate(matches)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/state.py", line 4616, in render_highstate
    state, errors = self.render_state(
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/state.py", line 4402, in render_state
    nstate, err = self.render_state(
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/state.py", line 4402, in render_state
    nstate, err = self.render_state(
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/state.py", line 4250, in render_state
    state_data = self.client.get_state(sls, saltenv)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/fileclient.py", line 400, in get_state
    dest = self.cache_file(path, saltenv, cachedir=cachedir)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/fileclient.py", line 186, in cache_file
    return self.get_url(
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/fileclient.py", line 517, in get_url
    result = self.get_file(url, dest, makedirs, saltenv, cachedir=cachedir)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/fileclient.py", line 1166, in get_file
    hash_server, stat_server = self.hash_and_stat_file(path, saltenv)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/fileclient.py", line 1403, in hash_and_stat_file
    hash_result = self.hash_file(path, saltenv)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/fileclient.py", line 1396, in hash_file
    return self.__hash_and_stat_file(path, saltenv)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/fileclient.py", line 1388, in __hash_and_stat_file
    return self.channel.send(load)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/asynchronous.py", line 125, in wrap
    raise exc_info[1].with_traceback(exc_info[2])
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/asynchronous.py", line 131, in _target
    result = io_loop.run_sync(lambda: getattr(self.obj, key)(*args, **kwargs))
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 459, in run_sync
    return future_cell[0].result()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/channel/client.py", line 295, in send
    ret = yield self._crypted_transfer(load, timeout=timeout, raw=raw)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/channel/client.py", line 252, in _crypted_transfer
    ret = yield _do_transfer()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1070, in run
    yielded = self.gen.send(value)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/channel/client.py", line 242, in _do_transfer
    data = self.auth.crypticle.loads(data, raw, nonce=nonce)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/crypt.py", line 1533, in loads
    raise SaltClientError("Nonce verification error")
salt.exceptions.SaltClientError: Nonce verification error

This happens occasionally on lower-latency minions and almost always on high-latency minions geographically separated from the syndic.

We have over 700 states applied across over 30 formulas and over 400 minions per syndic, so there are a lot of connections, but the syndics' load is consistently low. The issue happens more frequently on minions with higher latency/jitter. If I add -l trace to see what state it was working on before hitting the exception, it's a different state each time, so while the number of states being applied might be a factor, I think network latency and jitter are the bigger one. (Or maybe it's the combination of that plus lots of connections?)

I edited crypt.Crypticle.loads() to collect more information, so at the top of that function, it looks like this:

        if nonce:
            ret_nonce = data[:32].decode()
            data = data[32:]
            payload = salt.payload.loads(data, raw=raw)
            print(f"match={ret_nonce == nonce} {nonce} {ret_nonce} {payload=}")
            if ret_nonce != nonce:
                pass
                #raise SaltClientError("Nonce verification error")

All payloads decode properly, so it isn't being mangled in transit, but these two lines hint at what's probably happening:

match=False 54afef3ef7a34d9995f75bee09824a12 0a599f19ee714a6a9adcd1a3a71159ea payload=''
match=False 9a92817586714c8a891ef566cf8110c4 54afef3ef7a34d9995f75bee09824a12 payload=''

Note that the nonce of one request matches the ret_nonce of a later request. Something in ext.tornado might be mixing up requests when they are received out of order, maybe?
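
To illustrate the pattern (this is a hypothetical sketch, not Salt's actual implementation): the client sends a fresh 32-character nonce with each request and expects the reply to echo it back in its first 32 bytes, as checked in Crypticle.loads() above. If two requests are in flight on the same channel and their replies come back in the wrong order, each nonce check fails even though every payload decrypts cleanly - which matches the debug output.

import os

def new_nonce():
    # 32 hex characters, the same length as the prefix checked in Crypticle.loads()
    return os.urandom(16).hex()

def check_reply(sent_nonce, reply):
    # the reply is expected to start with the nonce that accompanied the request
    ret_nonce = reply[:32].decode()
    payload = reply[32:]
    if ret_nonce != sent_nonce:
        raise ValueError("Nonce verification error")
    return payload

n1, n2 = new_nonce(), new_nonce()
# two overlapping requests whose replies come back swapped on a shared channel:
check_reply(n1, n2.encode() + b"payload-2")  # raises, even though the payload itself is intact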

Setup

Onedir setup pinned to minor version 3006.2, plus these modules pip-installed:

arrow==1.2.3
PyMySQL==1.0.*
pyroute2==0.7.*
netifaces==0.10.*
netaddr==0.8.*
dnspython==2.0.*
pynetbox==6.5.*

Layout: 1 master of masters -> 2 syndics -> minions, each reporting to only one (their geographically closest) syndic.

Masters/Syndics are on CentOS 7. Minions are a mixed bag of CentOS 7 / CloudLinux 7 / AlmaLinux 8 / Ubuntu 20 / Ubuntu 22.

Masters/Syndics are on OpenVZ. Minions are a mixture of VM types and physical machines - but the issue happens on physical minions, so I don't think that's related.

We use gitfs for pillar and states, and TCP for transport.

Steps to Reproduce the behavior

If I sign into a minion that doesn't have the issue, I can cause it to happen by simulating some network delay:

tc qdisc add dev eth0 root netem delay 100ms 20ms distribution normal
salt-call state.test

Expected behavior

Highstate applied without crashing with the nonce exception.

Versions Report

salt --versions-report (Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)

Versions on minions/master/syndics are identical. This is collected from an affected minion:

Salt Version:
          Salt: 3006.2

Python Version:
        Python: 3.10.12 (main, Aug  3 2023, 21:47:10) [GCC 11.2.0]

Dependency Versions:
          cffi: 1.14.6
      cherrypy: 18.6.1
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.2
       libgit2: 1.3.0
  looseversion: 1.0.2
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 22.0
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.9.8
        pygit2: 1.7.0
  python-gnupg: 0.4.8
        PyYAML: 6.0.1
         PyZMQ: 23.2.0
        relenv: 0.13.3
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: cloudlinux 7.9 Boris Yegorov
        locale: utf-8
       machine: x86_64
       release: 3.10.0-1160.el7.x86_64
        system: Linux
       version: CloudLinux 7.9 Boris Yegorov
@terryd-imh terryd-imh added Bug broken, incorrect, or confusing behavior needs-triage labels Sep 5, 2023

@dwoz dwoz removed the needs-triage label Sep 5, 2023
@dwoz
Contributor

dwoz commented Sep 5, 2023

@terryd-imh I think your assumption about requests being received out of order is likely accurate. How likely are you to be able to test this issue against the current HEAD of the master branch?

@terryd-imh
Author

Just now, on both a syndic and a hardware minion using that same syndic, I backed up /opt/saltstack, ran /opt/saltstack/salt/bin/python3 setup.py install on the master branch (5878e22), then restarted each salt daemon, and got the same result.

@anilsil anilsil added this to the Sulfur v3006.4 milestone Sep 12, 2023
@johhnry

johhnry commented Sep 15, 2023

Hi,

I also get this Nonce verification error on a local network when launching a state.apply on a Linux server. I don't see how it could be related to a slow TCP connection, because we have other machines on the same network...

It's quite blocking right now because I can't deploy anything on that machine. Is there a way to bypass that?

Thanks!

@Oloremo
Contributor

Oloremo commented Sep 15, 2023

Got the same issue: everything works just fine on 3006.1, and I have the issue on 3006.3 with no other changes.

0mq protocol, local network, no syndic

@dwoz
Contributor

dwoz commented Sep 19, 2023

Just now, on both a syndic and a hardware minion using that same syndic, I backed up /opt/saltstack, ran /opt/saltstack/salt/bin/python3 setup.py install on the master branch (5878e22), then restarted each salt daemon, and got the same result.

The minions were still running the old code?

@itscaro

itscaro commented Sep 20, 2023

@Oloremo thanks for your information.

I tried 3006.1, 3006.2, and 3006.3; indeed, the issue is only observed in 3006.3.

@Oloremo
Contributor

Oloremo commented Sep 20, 2023

@dwoz happy to help debug this. I think my configuration is much simpler compared to @terryd-imh's.

Just need some pointers.

@Oloremo
Contributor

Oloremo commented Sep 20, 2023

Our network is 5 ms RTT, as LAN as it could be, but we have high timeout settings.

I can reproduce it easily. The whole deployment is done via code.

@dwoz
Contributor

dwoz commented Sep 20, 2023

I was able to reproduce this using the TCP transport but have not yet reproduced it with ZeroMQ. That said, I'm fairly certain it is a similar problem that can be addressed in a similar way as I have done for TCP.

@terryd-imh The changes in #65247 should resolve this issue for the TCP transport. Can you verify?
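
(For anyone curious, the general shape of a fix for this class of problem - a toy illustration only, not the actual change in #65247 - is to give every in-flight request its own future keyed by a message id, so replies can arrive in any order and still reach the caller that sent the request.)

import itertools
from concurrent.futures import Future

class ReplyCorrelator:
    """Toy request/reply matcher: one future per outstanding request,
    keyed by message id, so out-of-order replies are never delivered
    to the wrong caller."""

    def __init__(self):
        self._ids = itertools.count()
        self._pending = {}

    def send(self, payload):
        msg_id = next(self._ids)
        fut = Future()
        self._pending[msg_id] = fut
        # a real transport would write (msg_id, payload) to the wire here
        return msg_id, fut

    def on_reply(self, msg_id, reply):
        fut = self._pending.pop(msg_id, None)
        if fut is not None:
            fut.set_result(reply)
        # replies for unknown ids (timed out or duplicated) are dropped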

@dwoz
Contributor

dwoz commented Sep 20, 2023

@Oloremo I've pushed a potential fix for the ZeroMQ transport in #65247. Are you able to verify this resolves the issue?

@Oloremo
Contributor

Oloremo commented Sep 21, 2023

@dwoz I can confirm it fixed the issue with zeromq on 3006.3

@Oloremo
Contributor

Oloremo commented Sep 21, 2023

@dwoz

UPD: Spoke too soon. Seems like now it's flapping.
And I still see this in the logs:

2023-09-21 15:40:55,438 [salt.loaded.int.returner.local_cache] [ERROR] [JID: 20230921154008033658] An extra return was detected from minion test-run-kproskurin-03, please verify the minion, this could be a replay attack


@dwoz
Copy link
Contributor

dwoz commented Sep 21, 2023

@Oloremo You still saw the Nonce verification error?

An extra return can happen anytime `return_retry_tries` is set to more than 1.

@Oloremo
Contributor

Oloremo commented Sep 22, 2023

@dwoz
Did a bunch of checks.

Main context:

  1. We have the global failhard param set to true
  2. return_retry_tries is not defined in our configs, so it should be the default value
  3. We run all commands with -b 50 and queue=true as safe defaults
  4. The state formula.salt_utils.common.sync_grains that we run just does grains.append and grains.present based on pillar data.
  5. The test cluster consists of 5 nodes running on AWS m5.large; master and minions are 3006.3 onedir + the patch from this discussion.

Experiments:

[centos@test-run-kproskurin-rm ~]$ salt  -b 50 \* state.apply formula.salt_utils.common.sync_grains queue=true

Executing run on ['test-run-kproskurin-01', 'test-run-kproskurin-02', 'test-run-kproskurin-03', 'test-run-kproskurin-rm', 'test-run-kproskurin-rms']

jid:
    20230921214108010348
retcode:
    1
test-run-kproskurin-rm:
    ERROR executing 'state.apply': Nonce verification error
[ERROR   ] Minion test-run-kproskurin-rm returned with non-zero exit code. Batch run stopped due to failhard
[centos@test-run-kproskurin-rm ~]$  sudo salt -b 50 \* state.apply formula.salt_utils.common.sync_grains queue=true

Executing run on ['test-run-kproskurin-01', 'test-run-kproskurin-02', 'test-run-kproskurin-03', 'test-run-kproskurin-rm', 'test-run-kproskurin-rms']

minion test-run-kproskurin-03 was already deleted from tracker, probably a duplicate key
minion test-run-kproskurin-03 was already deleted from tracker, probably a duplicate key
minion test-run-kproskurin-rm was already deleted from tracker, probably a duplicate key
minion test-run-kproskurin-rm was already deleted from tracker, probably a duplicate key
jid:
    20230921214228117019
retcode:
    1
test-run-kproskurin-rms:
    ERROR executing 'state.apply': Nonce verification error
[ERROR   ] Minion test-run-kproskurin-rms returned with non-zero exit code. Batch run stopped due to failhard
[centos@test-run-kproskurin-rm ~]$  sudo  salt -b 50 \* state.apply formula.salt_utils.common.sync_grains queue=true

Executing run on ['test-run-kproskurin-01', 'test-run-kproskurin-02', 'test-run-kproskurin-03', 'test-run-kproskurin-rm', 'test-run-kproskurin-rms']

jid:
    20230921230425013235
retcode:
    1
test-run-kproskurin-02:
    ERROR executing 'state.apply': Nonce verification error
[ERROR   ] Minion test-run-kproskurin-02 returned with non-zero exit code. Batch run stopped due to failhard

So 3 runs in a row we have consistent results.

Next - I'm removing the queue argument

[centos@test-run-kproskurin-rm ~]$  sudo salt -b 50 \* state.apply formula.salt_utils.common.sync_grains

Executing run on ['test-run-kproskurin-01', 'test-run-kproskurin-02', 'test-run-kproskurin-03', 'test-run-kproskurin-rm', 'test-run-kproskurin-rms']

minion test-run-kproskurin-02 was already deleted from tracker, probably a duplicate key
minion test-run-kproskurin-02 was already deleted from tracker, probably a duplicate key
jid:
    20230921230533174079
retcode:
    1
test-run-kproskurin-02:
    ERROR executing 'state.apply': Nonce verification error
[ERROR   ] Minion test-run-kproskurin-02 returned with non-zero exit code. Batch run stopped due to failhard

No changes. Now I'm removing the batching:

[centos@test-run-kproskurin-rm ~]$  sudo  salt  \* state.apply formula.salt_utils.common.sync_grains
test-run-kproskurin-02:
    ERROR executing 'state.apply': Nonce verification error
test-run-kproskurin-rm:
----------
          ID: formula.salt_utils.common.sync_grains_common
    Function: grains.present
        Name: roles
      Result: True
     Comment: Set grain roles to ['vmagent', 'salt-master', 'salt-minion', 'vault',  [LIST CUT]
     Started: 23:08:26.388956
    Duration: 14.813 ms
     Changes:
              ----------
              roles:
                  - vmagent
                  - salt-master
                  - salt-minion
                  - vault
                  [LIST CUT]
----------
          ID: formula.salt_utils.sync_grains_currentgen
    Function: grains.append
        Name: roles
      Result: True
     Comment: Value ['supervisord', 'java', 'vector', 'docker',  [LIST CUT]] was added to grain roles
     Started: 23:08:26.404043
    Duration: 433.315 ms
     Changes:
              ----------
              added:
                  - supervisord
                  - java
                  - vector
                  - docker
                   [LIST CUT]
  Name: rack_id - Function: grains.present - Result: Clean - Started: 23:08:26.839484 - Duration: 326.544 ms

Summary for test-run-kproskurin-rm
------------
Succeeded: 3 (changed=2)
Failed:    0
------------
Total states run:     3
Total run time: 774.672 ms
test-run-kproskurin-02:
    ERROR executing 'state.apply': Nonce verification error
test-run-kproskurin-03:
    ERROR executing 'state.apply': Nonce verification error
test-run-kproskurin-01:
    ERROR executing 'state.apply': Nonce verification error
test-run-kproskurin-02:
    ERROR executing 'state.apply': Nonce verification error
test-run-kproskurin-rms:
----------
          ID: formula.salt_utils.common.sync_grains_common
    Function: grains.present
        Name: roles
      Result: True
     Comment: Set grain roles to ['vmagent', 'salt-minion', 'vault',  [LIST CUT]]
     Started: 23:08:52.211474
    Duration: 6.907 ms
     Changes:
              ----------
              roles:
                  - vmagent
                  - salt-minion
                  - vault
                   [LIST CUT]
----------
          ID: formula.salt_utils.sync_grains_currentgen
    Function: grains.append
        Name: roles
      Result: True
     Comment: Value ['supervisord', 'java', 'vector', 'docker',  [LIST CUT]] was added to grain roles
     Started: 23:08:52.218617
    Duration: 387.106 ms
     Changes:
              ----------
              added:
                  - supervisord
                  - java
                  - vector
                  - docker
                   [LIST CUT]
----------
          ID: formula.salt.minions.set_racks
    Function: grains.present
        Name: rack_id
      Result: True
     Comment: Set grain rack_id to rack-01
     Started: 23:08:52.607148
    Duration: 334.67 ms
     Changes:
              ----------
              rack_id:
                  rack-01

Summary for test-run-kproskurin-rms
------------
Succeeded: 3 (changed=3)
Failed:    0
------------
Total states run:     3
Total run time: 728.683 ms


-------------------------------------------
Summary
-------------------------------------------
# of minions targeted: 5
# of minions returned: 5
# of minions that did not return: 0
# of minions with errors: 3
-------------------------------------------
ERROR: Minions returned with non-zero exit code

Huh? So it could be related to batch logic.

@jtraub91
Contributor

@terryd-imh I see you have pygit2 in your versions report. Are you using the gitfs fileserver backend? I am seeing this error in an environment using gitfs, and the error goes away after removing that configuration. This observation may point to a root cause, or at least it is an example of a situation in which the issue manifests.

@dwoz
Contributor

dwoz commented Sep 28, 2023

@Oloremo Can you test the new changes in #65247 when you have time please?

@Oloremo
Contributor

Oloremo commented Sep 28, 2023

@dwoz after applying the patch to both master and minion, minions can no longer connect to the master:

2023-09-28 13:29:17,291 [salt.cli.daemons] [INFO]  Starting up the Salt Minion
2023-09-28 13:29:17,292 [salt.utils.event] [INFO]  Starting pull socket on /opt/salt/run/minion/minion_event_79882f540b_pull.ipc
2023-09-28 13:29:17,763 [salt.minion] [INFO]  Creating minion process manager
2023-09-28 13:30:17,816 [salt.minion] [ERROR]  Minion unable to successfully connect to a Salt Master.

@dwoz
Contributor

dwoz commented Sep 28, 2023

@dwoz after applying the patch to both master and minion, minions can no longer connect to the master:

2023-09-28 13:29:17,291 [salt.cli.daemons] [INFO]  Starting up the Salt Minion
2023-09-28 13:29:17,292 [salt.utils.event] [INFO]  Starting pull socket on /opt/salt/run/minion/minion_event_79882f540b_pull.ipc
2023-09-28 13:29:17,763 [salt.minion] [INFO]  Creating minion process manager
2023-09-28 13:30:17,816 [salt.minion] [ERROR]  Minion unable to successfully connect to a Salt Master.

Something must be off. The test suite ran with only one failure. If the minions can no longer connect to the master I'd expect to see more failures.

@terryd-imh
Author

@terryd-imh I see you have pygit2 in your versions report. Are you using the gitfs fileserver backend? I am seeing this error in an environment using gitfs, and the error goes away after removing that configuration. This observation may point to a root cause, or at least it is an example of a situation in which the issue manifests.

Yes - we use gitfs, which makes testing changes much easier.

@terryd-imh The changes in #65247 should resolve this issue for the TCP transport. Can you verify?

Yes - on the test server I was using, I got the Nonce error on every run, and with those changes I ran it multiple times without getting it once.

@terryd-imh
Author

close->reopen: whoops

@Oloremo
Contributor

Oloremo commented Oct 13, 2023

And since everything is in code, I repeated the same from-scratch deployment but with 3006.1, and it works just fine.

@dwoz
Contributor

dwoz commented Oct 13, 2023

@dwoz Sorry for the late response. I recreated env from scratch and got the same issue.

@Oloremo The same Nonce verification issue? I'm not seeing that in your log snippet.

@Oloremo
Contributor

Oloremo commented Oct 13, 2023

@dwoz no, after the latest patch set the minions cannot connect to the master.

@dwoz
Contributor

dwoz commented Oct 14, 2023

@Oloremo

  • What does the minion's config look like?
  • How many minions are there, and are all of them unable to connect?

@nicholasmhughes
Collaborator

Can confirm that the changes in #65247 resolve the nonce error for a customer. However, we are still seeing a higher incidence of timeouts, as reported by @Oloremo and also in #65397. @dwoz Perhaps we should call this issue fixed and troubleshoot the remaining timeouts in the other issue?

@dwoz
Contributor

dwoz commented Oct 16, 2023

@nicholasmhughes I'm good with that assuming @Oloremo agrees.

@Oloremo
Contributor

Oloremo commented Oct 19, 2023

@dwoz
Sorry for the delays.

What does the minion's config look like?

$ egrep -v '^#|^$' minion
root_dir: /
default_include: minion.d/*.conf
master: 172.31.19.130
master_port: 4506
pidfile: /opt/salt/run/minion/salt-minion.pid
conf_file: /opt/salt/minion
pki_dir: /opt/salt/pki/minion
id: test-run-kproskurin-rm
cachedir: /opt/salt/cache-minion
extension_modules: /opt/salt/cache-minion/extmods
module_dirs:
 - /opt/salt/library/modules
verify_env: True
sock_dir: /opt/salt/run/minion
random_reauth_delay: 5
random_startup_delay: 0
auth_timeout: 60
auth_tries: 10
master_tries: -1
ping_interval: 1
acceptance_wait_time: 5
grains_deep_merge: True
grains_refresh_every: 5
grains_cache: False
minion_pillar_cache: False
grains_cache_expiration: 60
renderer: yaml_jinja
failhard: True
pillar_raise_on_missing: True
startup_states: 'sls'
sls_list:
  - formula.salt_utils.common.sync_grains
file_roots:
  base:
    - /opt/salt/states
hash_type: md5
pillar_roots:
  base:
    - /opt/salt/pillar
autosign_grains:
  - owner
mine_interval: 1
log_file: /opt/salt/logs/minion.log
key_logfile: /opt/salt/logs/minion-key.log
log_level: info
log_fmt_logfile: '%(asctime)s,%(msecs)03d [%(name)s] [%(levelname)s] %(jid)s %(message)s'
use_superseded:
  - module.run
disable_grains:
  - esxi
disable_modules:
  - vsphere

How many minions are there and are all of them un-able to connect?

5 minions. All of them are unable to connect.

@darkpixel
Contributor

Out of ~425 minions, around 150 of them (randomly) error out on every run regardless of it being state.highstate or a state.sls call.

Of those that error out, about half are ERROR executing 'state.sls': Nonce verification error, and the rest are Unable to manage file: File client timed out after 1698411537.

The files are coming from a CDN, and I don't get timeouts with previous versions of salt.

@nicholasmhughes
Collaborator

@darkpixel, have you tried patching with the fix from #65247?

@darkpixel
Contributor

Negative @nicholasmhughes.

Is salt ever going to get nightly builds? ;)

It would take a lot of effort on my part to package up Salt for Windows, Debian, RedHat, and FreeBSD and get it pushed out for testing.

@whytewolf
Collaborator

Negative @nicholasmhughes.

Is salt ever going to get nightly builds? ;)

It would take a lot of effort on my part to package up Salt for Windows, Debian, RedHat, and FreeBSD and get it pushed out for testing.

Nightlies are at https://repo.saltproject.io/salt-dev/ in the 3006.x and master directories. Use at your own risk; they are based on the last successful build, so there might not be one every night.

@darkpixel
Contributor

Damn! I must have missed that while browsing through the GitHub Actions.

Thanks @whytewolf

@phoerious

phoerious commented Nov 14, 2023

Same here with 3006.4. I noticed that occasionally I also get two or more responses to test.ping. Meanwhile, minions throw errors like these:

Nov 14 18:58:32 XXX salt-minion[2484481]: [ERROR   ] Uncaught exception in ZMQStream callback
Nov 14 18:58:32 XXX salt-minion[2484481]: Traceback (most recent call last):
Nov 14 18:58:32 XXX salt-minion[2484481]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/zmq/eventloop/zmqstream.py", line 556, in _run_callback
Nov 14 18:58:32 XXX salt-minion[2484481]:     callback(*args, **kwargs)
Nov 14 18:58:32 XXX salt-minion[2484481]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
Nov 14 18:58:32 XXX salt-minion[2484481]:     return fn(*args, **kwargs)
Nov 14 18:58:32 XXX salt-minion[2484481]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 628, in handle_reply
Nov 14 18:58:32 XXX salt-minion[2484481]:     future.set_result(data)
Nov 14 18:58:32 XXX salt-minion[2484481]: AttributeError: 'NoneType' object has no attribute 'set_result'
Nov 14 18:58:32 XXX salt-minion[2484481]: [ERROR   ] Uncaught exception in zmqstream callback
Nov 14 18:58:32 XXX salt-minion[2484481]: Traceback (most recent call last):
Nov 14 18:58:32 XXX salt-minion[2484481]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/zmq/eventloop/zmqstream.py", line 577, in _handle_events
Nov 14 18:58:32 XXX salt-minion[2484481]:     self._handle_recv()
Nov 14 18:58:32 XXX salt-minion[2484481]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/zmq/eventloop/zmqstream.py", line 606, in _handle_recv
Nov 14 18:58:32 XXX salt-minion[2484481]:     self._run_callback(callback, msg)
Nov 14 18:58:32 XXX salt-minion[2484481]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/zmq/eventloop/zmqstream.py", line 556, in _run_callback
Nov 14 18:58:32 XXX salt-minion[2484481]:     callback(*args, **kwargs)
Nov 14 18:58:32 XXX salt-minion[2484481]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
Nov 14 18:58:32 XXX salt-minion[2484481]:     return fn(*args, **kwargs)
Nov 14 18:58:32 XXX salt-minion[2484481]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 628, in handle_reply
Nov 14 18:58:32 XXX salt-minion[2484481]:     future.set_result(data)
Nov 14 18:58:32 XXX salt-minion[2484481]: AttributeError: 'NoneType' object has no attribute 'set_result'
Nov 14 18:58:32 XXX salt-minion[2484481]: [ERROR   ] Exception in callback (<zmq.Socket(zmq.REQ) at 0x7f0416721240>, <function wrap.<locals>.null_wrapper at 0x7f0416724af0>)
Nov 14 18:58:32 XXX salt-minion[2484481]: Traceback (most recent call last):
Nov 14 18:58:32 XXX salt-minion[2484481]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 890, in start
Nov 14 18:58:32 XXX salt-minion[2484481]:     handler_func(fd_obj, events)
Nov 14 18:58:32 XXX salt-minion[2484481]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
Nov 14 18:58:32 XXX salt-minion[2484481]:     return fn(*args, **kwargs)
Nov 14 18:58:32 XXX salt-minion[2484481]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/zmq/eventloop/zmqstream.py", line 577, in _handle_events
Nov 14 18:58:32 XXX salt-minion[2484481]:     self._handle_recv()
Nov 14 18:58:32 XXX salt-minion[2484481]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/zmq/eventloop/zmqstream.py", line 606, in _handle_recv
Nov 14 18:58:32 XXX salt-minion[2484481]:     self._run_callback(callback, msg)
Nov 14 18:58:32 XXX salt-minion[2484481]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/zmq/eventloop/zmqstream.py", line 556, in _run_callback
Nov 14 18:58:32 XXX salt-minion[2484481]:     callback(*args, **kwargs)
Nov 14 18:58:32 XXX salt-minion[2484481]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
Nov 14 18:58:32 XXX salt-minion[2484481]:     return fn(*args, **kwargs)
Nov 14 18:58:32 XXX salt-minion[2484481]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 628, in handle_reply
Nov 14 18:58:32 XXX salt-minion[2484481]:     future.set_result(data)
Nov 14 18:58:32 XXX salt-minion[2484481]: AttributeError: 'NoneType' object has no attribute 'set_result'
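
That AttributeError looks like a reply racing a timeout: if the transport keeps a single "current request" future and clears it to None when the request times out or is replaced, a reply that arrives afterwards dereferences None. A simplified, hypothetical illustration of that shape (not the actual salt/transport/zeromq.py logic):

from concurrent.futures import Future

class SingleSlotChannel:
    """Keeps only one in-flight future, so a late reply can find None."""

    def __init__(self):
        self.future = None

    def send(self):
        self.future = Future()
        return self.future

    def on_timeout(self):
        self.future = None  # give up on the in-flight request

    def handle_reply(self, data):
        self.future.set_result(data)  # AttributeError if the slot was already cleared

chan = SingleSlotChannel()
chan.send()
chan.on_timeout()                 # the request times out first
chan.handle_reply(b"late reply")  # 'NoneType' object has no attribute 'set_result'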

@afletch
Contributor

afletch commented Dec 4, 2023

Also seeing the same combination of Nonce verification error and zmq error that @phoerious mentions above. Out of several hundred minions, this appears to be impacting the majority since updating to 3006.4. I notice that the milestone was pushed to 3006.5; @twangboy, is this firm? Also, does this address the zmq errors too?

@Oloremo
Contributor

Oloremo commented Dec 4, 2023

A reminder that we were able to reproduce it with both 0mq and TCP (assuming changing the transport param on master and minion is enough to switch to TCP); it fails for us in both cases, and it's 100% reproducible in our CI.

I am happy to do any debugging to help with the issue.

@phoerious

I haven't seen the nonce verification error in a while, but the master log gets spammed with this:

An extra return was detected from minion MINIONNAME, please verify the minion, this could be a replay attack

Perhaps that's related.

@dwoz
Contributor

dwoz commented Dec 16, 2023

I believe this is resolved in 3006.5.

@dmurphy18
Contributor

Closing this since it was fixed by #65247 and mentioned in the Salt 3006.5 release notes; it must have been missed when closing.

Bajzathd added a commit to hortonworks/cloudbreak-images that referenced this issue Jan 19, 2024
Salt 3004+ introduced additional systemctl checks for its service module, which is not compatible with the systemctl we have on YARN images. As a workaround, these were replaced with equivalent cmd.run commands for YARN images.
Similar issue with the suggested workaround: saltstack/salt#62311 (comment)
Also had to bump Salt version to 3006.5: saltstack/salt#65114 (comment)