Fix wazuh-metrics CLI bug when child processes restart #2416
Description
As reported, the wazuh-metrics CLI did not keep monitoring child processes that restarted. Currently, this issue only happens with the Wazuh cluster and API, as they use multiprocessing. An example of this bug can be seen in #2401.
To fix this, the CLI has been reworked to add a monitor health check that can shut down unused monitors and launch new ones when processes restart. The key for this to work is that we retrieve child PIDs using the following line:
wazuh-qa/deps/wazuh_testing/wazuh_testing/tools/performance/binary.py
Line 95 in e1e3621
This always returns a list of PIDs ordered from oldest (the parent process) to most recent when there are child processes. In addition, both the cluster and API processes always spawn their children in the same order, which ensures that, although the PIDs may change, the process name in the CSV file remains the same.
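To illustrate why a stable ordering keeps names stable, here is a minimal sketch. It is not the code in binary.py: the function name label_pids and the "_child_N" naming scheme are hypothetical, chosen only to show the idea that names derived from a PID's position in the list survive a restart even though the PIDs themselves change.

```python
from typing import Dict, List


def label_pids(process_name: str, pids: List[int]) -> Dict[int, str]:
    """Map PIDs (assumed ordered oldest first) to position-based names.

    Hypothetical naming scheme: the parent keeps the plain process name and
    each child gets a "_child_N" suffix based on its position in the list.
    """
    labels = {}
    for position, pid in enumerate(pids):
        # Position 0 is the parent; later positions are children in spawn order.
        suffix = "" if position == 0 else f"_child_{position}"
        labels[pid] = f"{process_name}{suffix}"
    return labels


# Example PIDs are made up: one list before a restart, one after.
before = label_pids("wazuh-clusterd", [1200, 1215, 1216])
after = label_pids("wazuh-clusterd", [3400, 3411, 3412])

# The PIDs differ, but the names derived from the ordering are identical.
assert list(before.values()) == list(after.values())
```

Because the names depend only on position, rows written to a CSV file before and after a restart could share the same name, provided the spawn order really is deterministic as described above.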
Another key change is this block:
wazuh-qa/deps/wazuh_testing/wazuh_testing/scripts/wazuh_metrics.py
Lines 80 to 92 in e1e3621
As this CLI is first used after outer health checks confirm that everything is working as intended, the first monitor list that we retrieve has exactly the expected number of processes. Thus, replacing each monitor in place with a new one after scanning the PIDs, instead of simply removing monitor instances and adding new ones, leaves the failed monitors in the list so that they are checked again. The health check keeps failing until all the expected child processes are being monitored again.
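The in-place replacement described above can be sketched as follows. This is an assumption-laden illustration, not the block in wazuh_metrics.py: the Monitor class and the refresh_monitors function are hypothetical, and a monitor with no process attached is modeled simply as pid=None.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Monitor:
    """Hypothetical stand-in for the CLI's per-process monitor."""
    name: str
    pid: Optional[int]  # None: no live process is attached to this slot yet


def refresh_monitors(monitors: List[Monitor], live_pids: List[int]) -> Tuple[List[Monitor], int]:
    """Rebuild the monitor list slot by slot, keeping its length fixed.

    live_pids is assumed to be ordered oldest first, so slot i maps to
    live_pids[i]. Returns the refreshed list and the number of slots that
    still have no process attached.
    """
    refreshed = []
    for slot, monitor in enumerate(monitors):
        expected_pid = live_pids[slot] if slot < len(live_pids) else None
        if monitor.pid == expected_pid:
            # Nothing changed for this slot (it may still be a failed slot).
            refreshed.append(monitor)
        else:
            # Replace in place: same name, new PID (or None if not spawned yet).
            refreshed.append(Monitor(monitor.name, expected_pid))
    pending = sum(1 for m in refreshed if m.pid is None)
    return refreshed, pending


# Made-up PIDs: three processes are monitored, then the service restarts.
monitors = [Monitor("wazuh-apid", 10), Monitor("wazuh-apid_child_1", 11), Monitor("wazuh-apid_child_2", 12)]
monitors, pending = refresh_monitors(monitors, [20, 21])      # one child is not back yet
monitors, pending = refresh_monitors(monitors, [20, 21, 22])  # all children are back
```

Keeping the list length fixed is the design choice that matters here: a slot whose process has not reappeared stays in the list as a failed monitor, so the next health check examines it again instead of silently forgetting it.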
Configuration options
After the rework, two new options have been added to the CLI:
-H, --healthcheck-time: Time in seconds between each health check. Default value: 10 seconds.
-r, --retries: Number of reconnection retries before aborting the monitoring process. Default value: 5 retries.
Logs example
The command wazuh-metrics -p wazuh-apid wazuh-clusterd was launched, and service wazuh-manager restart was executed shortly after beginning the monitoring task.
wazuh-metrics log
Tests
pycodestyle --max-line-length=120 --show-source --show-pep8 file.py
provision_documentation.sh generates the docs without errors.