Fix `wazuh-metrics` CLI bug when child processes restart #2416

vicferpoy · 2021-12-28T15:18:56Z

Related issue
closes #2401

Description

As reported, the wazuh-metrics CLI would not monitor child processes that could restart. Currently, this issue only happens with the Wazuh cluster and API, as they use multiprocessing. An example of this bug can be seen in #2401 .

To fix this, the CLI has been reworked to add a monitor healthcheck that is able to shutdown unused monitors and launch new ones when processes restart. They key for this to work is that we retrieve child PIDs using the following line:

wazuh-qa/deps/wazuh_testing/wazuh_testing/tools/performance/binary.py

Line 95 in e1e3621

    
           return [parent_pid] + [child.pid for child in psutil.Process(parent_pid).children(recursive=True)]

This will always return a list of PIDs ordered from oldest (parent process) to most recent if there are any child processes. In addition, both the cluster and API processes always spawn their child in the same order, asserting that although the PIDs may change, the process name will remain the same in the CSV file.

Another key change is this block:

wazuh-qa/deps/wazuh_testing/wazuh_testing/scripts/wazuh_metrics.py

Lines 80 to 92 in e1e3621

    
           for i, pid in enumerate(process_pids): 
        
               # Attempt to create new monitor instances for the process 
        
               p_name = process if i == 0 else f'{process}_child_{i}' 
        
               monitor = Monitor(process_name=p_name, pid=pid, value_unit=options.data_unit, 
        
                                 time_step=options.sleep_time, 
        
                                 version=options.version, dst_dir=options.store_path) 
        
               monitor.start() 
        
               try: 
        
                   # Replace old monitors for new ones 
        
                   ACTIVE_MONITORS[process][i] = monitor 
        
               except IndexError: 
        
                   ACTIVE_MONITORS[process].append(monitor)

As this CLI is used for the first time after outer healthchecks confirm that everything is working as intended, the first monitor list that we retrieve will have the exact number of expected processes. Thus, replacing each monitor with a new one after scanning the PIDs instead of simply removing and adding new monitor instances will leave failed monitors that will be checked again. This mechanism will keep failing until all the expected child processes are being monitored again.

Configuration options

After the rework, two new options have been added to the CLI:

Healthcheck time (-H, --healthcheck-time): Time in seconds between each health check. Default value 10 seconds.
Retries (-r, --retries): Number of reconnection retries before aborting the monitoring process. Default value 5 retries.

Logs example

Launching the CLI with the following parameters: wazuh-metrics -p wazuh-apid wazuh-clusterd
There is a script constantly adding agents to the environment using the API.
A service wazuh-manager restart was executed shortly after beginning the monitoring task.

wazuh-metrics log

2021/12/28 15:08:22 INFO: Started new session: /tmp/process_metrics/28-12-2021/1640704102
2021/12/28 15:08:22 INFO: Started monitoring process wazuh-apid (441500)
2021/12/28 15:08:22 INFO: Started monitoring process wazuh-apid_child_1 (441523)
2021/12/28 15:08:22 INFO: Started monitoring process wazuh-apid_child_2 (441526)
2021/12/28 15:08:22 INFO: Started monitoring process wazuh-clusterd (442354)
2021/12/28 15:08:22 INFO: Started monitoring process wazuh-clusterd_child_1 (442708)
2021/12/28 15:08:22 INFO: Started monitoring process wazuh-clusterd_child_2 (442711)
2021/12/28 15:08:34 WARNING: Lost PID for wazuh-clusterd_child_2
2021/12/28 15:08:34 WARNING: Lost PID for wazuh-clusterd
2021/12/28 15:08:34 WARNING: Lost PID for wazuh-clusterd_child_1
2021/12/28 15:08:36 WARNING: Lost PID for wazuh-apid_child_2
2021/12/28 15:08:36 WARNING: Lost PID for wazuh-apid_child_1
2021/12/28 15:08:37 WARNING: Lost PID for wazuh-apid
2021/12/28 15:08:42 WARNING: Monitoring of wazuh-apid failed. Attempting to create new monitor instances
2021/12/28 15:08:42 INFO: Started monitoring process wazuh-apid (496700)
2021/12/28 15:08:42 INFO: Started monitoring process wazuh-apid_child_1 (496723)
2021/12/28 15:08:42 INFO: Started monitoring process wazuh-apid_child_2 (496726)
2021/12/28 15:08:42 WARNING: Monitoring of wazuh-clusterd failed. Attempting to create new monitor instances
2021/12/28 15:08:42 WARNING: Could not create new monitor instances for wazuh-clusterd
2021/12/28 15:08:52 WARNING: Monitoring of wazuh-clusterd failed. Attempting to create new monitor instances
2021/12/28 15:08:52 INFO: Started monitoring process wazuh-clusterd (497470)
2021/12/28 15:08:52 INFO: Started monitoring process wazuh-clusterd_child_1 (499336)
2021/12/28 15:08:52 INFO: Started monitoring process wazuh-clusterd_child_2 (499347)
2021/12/28 15:09:05 INFO: Attempting to shutdown all monitor threads
2021/12/28 15:09:08 INFO: Process finished gracefully

Tests

Proven that tests pass when they have to pass.
Proven that tests fail when they have to fail.

Python codebase satisfies PEP-8 style style guide. pycodestyle --max-line-length=120 --show-source --show-pep8 file.py.
Python codebase is documented following the Google Style for Python docstrings.
The test is documented in wazuh-qa/docs.
provision_documentation.sh generate the docs without errors.

Víctor Fernández Poyatos added 4 commits December 28, 2021 11:00

Add health status to Monitor class

27750be

Implement reconnecting mechanism to wazuh-metrics CLI

e2f2a7a

Remove Monitor healthy attribute

28e0e9f

Refactor wazuh-metrics, new options and docstrings

e1e3621

vicferpoy requested a review from davidjiglesias December 28, 2021 15:18

vicferpoy self-assigned this Dec 28, 2021

davidjiglesias approved these changes Dec 29, 2021

View reviewed changes

Rebits added the subteam/qa-hurricane label Dec 29, 2021

davidjiglesias approved these changes Jan 12, 2022

View reviewed changes

davidjiglesias requested a review from snaow January 12, 2022 09:14

snaow approved these changes Jan 24, 2022

View reviewed changes

snaow merged commit 027eb9a into master Jan 24, 2022

snaow deleted the 2401-wazuh-metrics-reconnecting branch January 24, 2022 16:34

snaow mentioned this pull request Jan 27, 2022

QA Release - Rev 430031 #2500

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix `wazuh-metrics` CLI bug when child processes restart #2416

Fix `wazuh-metrics` CLI bug when child processes restart #2416

vicferpoy commented Dec 28, 2021

	for i, pid in enumerate(process_pids):
	# Attempt to create new monitor instances for the process
	p_name = process if i == 0 else f'{process}_child_{i}'
	monitor = Monitor(process_name=p_name, pid=pid, value_unit=options.data_unit,
	time_step=options.sleep_time,
	version=options.version, dst_dir=options.store_path)
	monitor.start()

	try:
	# Replace old monitors for new ones
	ACTIVE_MONITORS[process][i] = monitor
	except IndexError:
	ACTIVE_MONITORS[process].append(monitor)

Fix wazuh-metrics CLI bug when child processes restart #2416

Fix wazuh-metrics CLI bug when child processes restart #2416

Conversation

vicferpoy commented Dec 28, 2021

Description

Configuration options

Logs example

Tests

Fix `wazuh-metrics` CLI bug when child processes restart #2416

Fix `wazuh-metrics` CLI bug when child processes restart #2416