[BUG] Nonce verification error with TCP transport on slower network connections #65114
Comments
Hi there! Welcome to the Salt Community! Thank you for making your first contribution. We have a lengthy process for issues and PRs. Someone from the Core Team will follow up as soon as possible. In the meantime, here’s some information that may help as you continue your Salt journey.
There are lots of ways to get involved in our community. Every month, there are around a dozen opportunities to meet with other contributors and the Salt Core team and collaborate in real time. The best way to keep track is by subscribing to the Salt Community Events Calendar.
@terryd-imh I think your assumption about requests being received out of order is likely accurate. How likely are you to be able to test this issue against the current HEAD of the master branch?
Just now, on both a syndic and a hardware minion using that same syndic, I backed up /opt/saltstack, then
Hi, I also get this. It's quite blocking right now because I can't deploy anything on that machine. Is there a way to bypass it? Thanks!
Got the same issue: everything works just fine on 3006.1, and I hit this on 3006.3 with no other changes. 0mq protocol, local network, no syndic.
The minions were still running the old code? |
@Oloremo thanks for the information. I tried 3006.1, 3006.2, and 3006.3; indeed, the issue is only observed in 3006.3.
@dwoz happy to help debug this. I think my configuration is much simpler compared to @terryd-imh's. Just need some pointers.
Our network is 5 ms RTT, about as LAN as it gets, but we have high timeout settings. I can reproduce it easily. The whole deployment is done via code.
I was able to reproduce this using the TCP transport but have not yet reproduced it with ZeroMQ. That said, I'm fairly certain it is a similar problem that can be addressed the same way I have done for TCP. @terryd-imh The changes in #65247 should resolve this issue for the TCP transport. Can you verify?
@dwoz I can confirm it fixed the issue with ZeroMQ on 3006.3.
Update: spoke too soon. It seems like it's flapping now.
In the logs:
@Oloremo You still saw the Nonce verification error? An extra return can happen any time `return_retry_tries` is set to more than 1.
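To illustrate that remark, here is a small, purely illustrative Python sketch (the names are invented; it does not touch Salt's APIs) of why a retry count above 1 can produce an extra return: a slow acknowledgement is indistinguishable from a lost send, so the minion may deliver the same return twice.

```python
import queue

def publish_return(master_inbox, payload, tries, ack_timeout, ack_latency):
    """Deliver `payload`, retrying whenever the ack takes longer than
    `ack_timeout`. A slow ack looks the same as a lost send, so a retry
    after a successful-but-slow delivery results in a duplicate."""
    for _ in range(tries):
        master_inbox.put(payload)        # the send itself is delivered
        if ack_latency <= ack_timeout:   # ack came back in time: stop here
            return
    # All tries exhausted: the master now holds `tries` copies of this return.

inbox = queue.Queue()
publish_return(
    inbox,
    {"jid": "20231001", "return": True},
    tries=2,            # analogous to return_retry_tries > 1
    ack_timeout=1.0,
    ack_latency=3.0,    # high-latency link: acks arrive late
)
print(inbox.qsize())    # -> 2: the same return was delivered twice
```

In this toy model a fast link gets the ack back before the timeout and only one copy is sent, which is consistent with the reports in this thread that duplicates mostly show up on slow or jittery connections.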
@dwoz Main context:
Experiments:
So, three runs in a row, we have consistent results. Next, I'm removing the
No changes. Now I'm removing the batching:
Huh? So it could be related to the batch logic.
@terryd-imh I see you have
@dwoz after applying the patch to both master and minion, minions can no longer connect to the master:
Something must be off. The test suite ran with only one failure. If the minions can no longer connect to the master I'd expect to see more failures. |
Yes, we use gitfs, which makes testing changes much easier.
Yes - on the test server I was using, I got the Nonce error on every run, and with those changes I ran it multiple times without getting it once.
close->reopen: whoops
And since everything is in code, I repeated the same from-scratch deployment with 3006.1, and it works just fine.
@dwoz no, after the latest patch set, the minion cannot connect to the master.
@nicholasmhughes I'm good with that, assuming @Oloremo agrees.
@dwoz
5 minions. All of them are unable to connect.
Out of ~425 minions, around 150 of them (randomly) error out on every run regardless of it being … Of those that error out, about half are … The files are coming from a CDN, and I don't get timeouts with previous versions of Salt.
@darkpixel, have you tried patching with the fix from #65247?
Negative, @nicholasmhughes. Is Salt ever going to get nightly builds? ;) It would take a lot of effort on my part to package up Salt for Windows, Debian, RedHat, and FreeBSD and get it pushed out for testing.
Nightlies are at https://repo.saltproject.io/salt-dev/ in the 3006.x and master directories. Use at your own risk; they're based on the last successful build, so there might not be one every night.
Damn! I must have missed that browsing through the GitHub Actions. Thanks @whytewolf
Same here with 3006.4. I noticed that occasionally I also get two or more responses to
Also seeing the same combination of
A reminder that we were able to reproduce it with both 0mq and TCP (assuming changing …). I am happy to do any debugging to help with the issue.
I haven't seen the nonce verification error in a while, but the master log gets spammed with this:
Perhaps that's related. |
I believe this is resolved in |
Closing this since it was fixed by #65247 and mentioned in the Salt 3006.5 release notes; it must have been missed when closing.
Salt 3004+ introduced additional systemctl checks for its service module, which are not compatible with the systemctl we have on YARN images. As a workaround, these were replaced with equivalent cmd.run commands for YARN images. Similar issue with the suggested workaround: saltstack/salt#62311 (comment). Also had to bump the Salt version to 3006.5: saltstack/salt#65114 (comment)
Description
salt-call state.apply gives:
salt.exceptions.SaltClientError: Nonce verification error
This happens occasionally on lower-latency minions and almost always on high-latency minions geographically separated from the syndic.
We have over 700 states applied across over 30 formulas and over 400 minions per syndic, so there are a lot of connections, but the syndics' load is consistently low. The issue happens more frequently on minions with higher latency/jitter. If I add -l trace to see what state it was working on before hitting the exception, it's a different state each time, so while the number of states being applied might be a factor, I think network latency and jitter are the bigger one. (Or maybe the combination of that plus lots of connections?)
I edited crypt.Crypticle.loads() to collect more information, so at the top of that function, it looks like this:
All payloads decode properly, so it isn't being mangled in transit, but these two lines hint at what's probably happening:
Note that the nonce of one request matches the ret_nonce of a later request. Could something in ext.tornado be mixing up requests when replies are received out of order?
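To make the suspected failure mode concrete, here is a minimal, self-contained Python sketch. It is a toy model, not Salt's actual channel or Crypticle code, and the class and function names are invented for illustration; it only shows how a per-request nonce check turns an out-of-order (or cross-wired) reply into exactly this error:

```python
import uuid

class NonceError(Exception):
    """Raised when a reply's nonce does not match the request that consumed it."""

class ToyChannel:
    """Toy request channel: every request is tagged with a nonce and the
    reply is expected to echo it back as ret_nonce."""

    def __init__(self):
        self.pending_replies = []  # replies queued in arrival order

    def send(self, payload):
        nonce = uuid.uuid4().hex
        # The "server" echoes the nonce back in its reply (the ret_nonce).
        self.pending_replies.append({"ret_nonce": nonce, "data": payload})
        return nonce

    def recv(self, nonce):
        # Consumes whichever reply arrives next, which is only correct if
        # replies come back in the same order the requests went out.
        reply = self.pending_replies.pop(0)
        if reply["ret_nonce"] != nonce:
            raise NonceError("Nonce verification error")
        return reply["data"]

channel = ToyChannel()
nonce_a = channel.send({"cmd": "state.apply"})
nonce_b = channel.send({"cmd": "test.ping"})

# Simulate replies arriving out of order, e.g. on a slow, jittery link:
channel.pending_replies.reverse()

try:
    channel.recv(nonce_a)          # consumes B's reply instead of A's
except NonceError as exc:
    print(exc)                     # -> Nonce verification error
```

In this model the nonce generated for request A ends up compared against the ret_nonce belonging to request B's reply, which is the same pattern the log lines above show.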
Setup
Onedir setup pinned to the minor version 3006.2 + these modules pip installed:
layout: 1 master of masters -> 2 syndics -> minions, which report to only one (their geographically closest) syndic
Masters/Syndics are on CentOS 7. Minions are a mixed bag of CentOS 7 / CloudLinux 7 / AlmaLinux 8 / Ubuntu 20 / Ubuntu 22.
Masters/Syndics are on OpenVZ. Minions are a mixture of VM types and physical machines, but the issue happens on physical minions, so I don't think that's related.
We use gitfs for pillar and states, and TCP for transport.
Steps to Reproduce the behavior
If I sign into a minion that doesn't have the issue, I can cause it to happen by simulating some network delay:
Expected behavior
Highstate applied without crashing with the nonce exception.
Versions Report
salt --versions-report
(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.) Versions on minions/master/syndics are identical. This is collected from an affected minion: