Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix wazuh-metrics CLI bug when child processes restart #2416

Merged
merged 4 commits into from
Jan 24, 2022

Conversation

vicferpoy
Copy link
Contributor

Related issue
closes #2401

Description

As reported, the wazuh-metrics CLI would not monitor child processes that could restart. Currently, this issue only happens with the Wazuh cluster and API, as they use multiprocessing. An example of this bug can be seen in #2401 .

To fix this, the CLI has been reworked to add a monitor healthcheck that is able to shutdown unused monitors and launch new ones when processes restart. They key for this to work is that we retrieve child PIDs using the following line:

return [parent_pid] + [child.pid for child in psutil.Process(parent_pid).children(recursive=True)]

This will always return a list of PIDs ordered from oldest (parent process) to most recent if there are any child processes. In addition, both the cluster and API processes always spawn their child in the same order, asserting that although the PIDs may change, the process name will remain the same in the CSV file.

Another key change is this block:

for i, pid in enumerate(process_pids):
# Attempt to create new monitor instances for the process
p_name = process if i == 0 else f'{process}_child_{i}'
monitor = Monitor(process_name=p_name, pid=pid, value_unit=options.data_unit,
time_step=options.sleep_time,
version=options.version, dst_dir=options.store_path)
monitor.start()
try:
# Replace old monitors for new ones
ACTIVE_MONITORS[process][i] = monitor
except IndexError:
ACTIVE_MONITORS[process].append(monitor)

As this CLI is used for the first time after outer healthchecks confirm that everything is working as intended, the first monitor list that we retrieve will have the exact number of expected processes. Thus, replacing each monitor with a new one after scanning the PIDs instead of simply removing and adding new monitor instances will leave failed monitors that will be checked again. This mechanism will keep failing until all the expected child processes are being monitored again.

Configuration options

After the rework, two new options have been added to the CLI:

  • Healthcheck time (-H, --healthcheck-time): Time in seconds between each health check. Default value 10 seconds.
  • Retries (-r, --retries): Number of reconnection retries before aborting the monitoring process. Default value 5 retries.

Logs example

  • Launching the CLI with the following parameters: wazuh-metrics -p wazuh-apid wazuh-clusterd
  • There is a script constantly adding agents to the environment using the API.
  • A service wazuh-manager restart was executed shortly after beginning the monitoring task.

image

wazuh-metrics log
2021/12/28 15:08:22 INFO: Started new session: /tmp/process_metrics/28-12-2021/1640704102
2021/12/28 15:08:22 INFO: Started monitoring process wazuh-apid (441500)
2021/12/28 15:08:22 INFO: Started monitoring process wazuh-apid_child_1 (441523)
2021/12/28 15:08:22 INFO: Started monitoring process wazuh-apid_child_2 (441526)
2021/12/28 15:08:22 INFO: Started monitoring process wazuh-clusterd (442354)
2021/12/28 15:08:22 INFO: Started monitoring process wazuh-clusterd_child_1 (442708)
2021/12/28 15:08:22 INFO: Started monitoring process wazuh-clusterd_child_2 (442711)
2021/12/28 15:08:34 WARNING: Lost PID for wazuh-clusterd_child_2
2021/12/28 15:08:34 WARNING: Lost PID for wazuh-clusterd
2021/12/28 15:08:34 WARNING: Lost PID for wazuh-clusterd_child_1
2021/12/28 15:08:36 WARNING: Lost PID for wazuh-apid_child_2
2021/12/28 15:08:36 WARNING: Lost PID for wazuh-apid_child_1
2021/12/28 15:08:37 WARNING: Lost PID for wazuh-apid
2021/12/28 15:08:42 WARNING: Monitoring of wazuh-apid failed. Attempting to create new monitor instances
2021/12/28 15:08:42 INFO: Started monitoring process wazuh-apid (496700)
2021/12/28 15:08:42 INFO: Started monitoring process wazuh-apid_child_1 (496723)
2021/12/28 15:08:42 INFO: Started monitoring process wazuh-apid_child_2 (496726)
2021/12/28 15:08:42 WARNING: Monitoring of wazuh-clusterd failed. Attempting to create new monitor instances
2021/12/28 15:08:42 WARNING: Could not create new monitor instances for wazuh-clusterd
2021/12/28 15:08:52 WARNING: Monitoring of wazuh-clusterd failed. Attempting to create new monitor instances
2021/12/28 15:08:52 INFO: Started monitoring process wazuh-clusterd (497470)
2021/12/28 15:08:52 INFO: Started monitoring process wazuh-clusterd_child_1 (499336)
2021/12/28 15:08:52 INFO: Started monitoring process wazuh-clusterd_child_2 (499347)
2021/12/28 15:09:05 INFO: Attempting to shutdown all monitor threads
2021/12/28 15:09:08 INFO: Process finished gracefully

Tests

  • Proven that tests pass when they have to pass.
  • Proven that tests fail when they have to fail.
  • Python codebase satisfies PEP-8 style style guide. pycodestyle --max-line-length=120 --show-source --show-pep8 file.py.
  • Python codebase is documented following the Google Style for Python docstrings.
  • The test is documented in wazuh-qa/docs.
  • provision_documentation.sh generate the docs without errors.

@vicferpoy vicferpoy self-assigned this Dec 28, 2021
@davidjiglesias davidjiglesias requested a review from snaow January 12, 2022 09:14
@snaow snaow merged commit 027eb9a into master Jan 24, 2022
@snaow snaow deleted the 2401-wazuh-metrics-reconnecting branch January 24, 2022 16:34
@snaow snaow mentioned this pull request Jan 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix wazuh-metrics for reconnecting daemons using multiprocessing
4 participants