Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] 3007.X scheduler_before_connect throws TCP Publish Client encountered an exception, no start event is seen #66932

Open
amalaguti opened this issue Sep 27, 2024 · 3 comments
Labels
Bug broken, incorrect, or confusing behavior needs-triage

Comments

@amalaguti
Copy link

Description
3007.1 Windows minions with option scheduler_before_connect: True set throws exception TCP Publish Client encountered an exception while connecting to 127.0.0.1:4510: TimeoutError() on start, and does not generate /start event on master event bus

[TRACE   ] IPCServer: Handling connection to address: ('127.0.0.1', 52006)
[DEBUG   ] The functions from module 'mine' are being loaded by dir() on the loaded module
[DEBUG   ] LazyLoaded mine.update
[DEBUG   ] schedule.handle_func: adding this job to the jobcache with data {'id': 'minion-win-1', 'fun': 'mine.update', 'fun_args': [], 'schedule': '__mine_interval', 'jid': '20240927221656953118', 'pid': 420}
[TRACE   ] ==== evaluating schedule now None =====
[TRACE   ] Client disconnected from IPC 127.0.0.1:4511
[WARNING ] TCP Publish Client encountered an exception while connecting to 127.0.0.1:4510: TimeoutError(), will reconnect in 1 seconds -   File "C:\Program Files\Salt Project\Salt\Lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,

  File "C:\Program Files\Salt Project\Salt\Lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)

  File "C:\Program Files\Salt Project\Salt\salt-minion.exe\__main__.py", line 7, in <module>
    sys.exit(salt_minion())

  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\scripts.py", line 185, in salt_minion
    minion.start()

  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\cli\daemons.py", line 344, in start
    self._real_start()

  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\cli\daemons.py", line 356, in _real_start
    self.minion.tune_in()

  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\minion.py", line 1193, in tune_in
    self._bind()

  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\minion.py", line 1064, in _bind
    self.event = salt.utils.event.get_event(

  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\utils\event.py", line 135, in get_event
    return SaltEvent(

  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\utils\event.py", line 265, in __init__
    self.connect_pub()

  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\utils\event.py", line 348, in connect_pub
    self.subscriber = salt.transport.ipc_publish_client(

  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\transport\base.py", line 210, in ipc_publish_client
    return publish_client(opts, io_loop, **kwargs)

  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\transport\base.py", line 152, in publish_client
    return salt.transport.tcp.PublishClient(

  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\transport\tcp.py", line 220, in __init__
    super().__init__(opts, io_loop, **kwargs)

  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\transport\base.py", line 398, in __init__
    super().__init__()

[TRACE   ] TCP PubServer sending payload: topic_list=None '/salt/minion/minion_schedule_delete_complete\n\n��completeèschedule��__mine_interval��enabledèfunction�mine.update�jid_includeêmaxrunning\x02�minutes<�name�__mine_interval�return_job£runìrun_on_startåsplay��_stamp�2024-09-27T22:16:56.937494'
[TRACE   ] TCP PubServer finished publishing payload
[DEBUG   ] Decrypting the current master AES key
[DEBUG   ] salt.crypt.get_rsa_key: Loading private key
[DEBUG   ] Loaded minion key: C:\ProgramData\Salt Project\Salt\conf\pki\minion\minion.pem

No /start event is shown, only auth

salt/auth	{
    "_stamp": "2024-09-27T22:22:51.680278",
    "act": "accept",
    "id": "minion-win-1",
    "pub": "-----BEGIN PUBLIC KEY-----REDACTED-----END PUBLIC KEY-----",
    "result": true
}
minion/refresh/minion-win-1	{
    "Minion data cache refresh": "minion-win-1",
    "_stamp": "2024-09-27T22:22:51.961685"
}

The minion seems responsive after start, but not having the /start event seems to be a big issue.

The scheduler_before_connect option is required in this use case and works fine in 3006.x versions.

Setup
3007.x Windos minion with scheduler_before_connect: True

Expected behavior
Minion should start without the exception, the scheduler_before_connect should work as expected and /start event should be seeing on the event bus

@amalaguti amalaguti added Bug broken, incorrect, or confusing behavior needs-triage labels Sep 27, 2024
@amalaguti
Copy link
Author

amalaguti commented Sep 30, 2024

I discovered a little difference in behavior, if the minion is configured in multimaster (active/active), when the minion is started as service, net start salt-minion, there is no /start event.

But when the minion is started in foreground, salt-minion -l debug, I can randomly get a /start event in one or the other master, but not in both.

With the scheduler_before_connect option removed, both masters get the /start event

@amalaguti
Copy link
Author

amalaguti commented Oct 30, 2024

The problem seems to be async/await functions not being called properly, something we've seen recently in another minion connection related issues.

@dwoz @twangboy
I would like to ask your review on this core changes due the complexities of this part of the code and general minion connection process, I'm not sure if this are how you want thigs to work for the minion connection setup.

The problem seems to be in the
# Main Minion Tune In
def tune_in(self, start=True):
HERE

Inside this function, the function self.setup_scheduler() is called which triggers the exception.

I simply changed the setup_scheduler() function to be async async def setup_scheduler(self, before_connect=False):. Just with this change the exception is not longer thrown.
I also changed the call to this function to use await: await self.setup_scheduler().

In addition, (I will mention this in a new issue), I've observed that this section of the code inside the #Main minion tune in def tune_in(self, start=True):
HERE
which is suppossed to setup_beacons and setup_scheduler when the options beacons_before_connect or scheduler_before_connect are true, does not seem to be ever triggered during the minion start up.

I added some logging messages and never seem to be called with start: True. There's another function async def _connect_minion(self, minion) from which minion.tune_in() is called with start=False.

if start:
            if self.opts.get("beacons_before_connect", False):
                self.setup_beacons(before_connect=True)
            if self.opts.get("scheduler_before_connect", False):
                self.setup_scheduler(before_connect=True)

I tried this with multimaster setup, two masters, the tune_in() function seems to be executed once per master connected.

@amalaguti
Copy link
Author

@barneysowood fyi

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug broken, incorrect, or confusing behavior needs-triage
Projects
None yet
Development

No branches or pull requests

1 participant