
Support dynamic sub-processes for metrics collection #708

Merged
merged 2 commits into sodafoundation:master from the optional-dynamic branch on Sep 28, 2021

Conversation

joseph-v
Collaborator

@joseph-v commented Sep 22, 2021

What this PR does / why we need it:
Support dynamic sub-processes for metrics collection

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #

Special notes for your reviewer:

Release note:

Dynamic subprocesses are an optimization that makes better use of node resources by spawning multiple Python processes on the same node for metrics collection.

The following configuration options are available for this feature, with their corresponding default values. These values can be changed in the [TELEMETRY] section of the delfin.conf file:

enable_dynamic_subprocess = False
process_cleanup_interval = 60
task_cleanup_delay = 10
group_change_detect_interval = 30
max_storages_in_child = 5
max_childs_in_node = 100000

This feature can be enabled by adding the lines below to the delfin.conf file:

[TELEMETRY]
enable_dynamic_subprocess = True

If the number of storages handled by a subprocess grows beyond the configured value (max_storages_in_child), a new subprocess is spawned by the metrics manager to handle the additional storages.
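
A minimal sketch of that spawning decision, using purely hypothetical names (Executor, assign_storage) rather than the actual metrics manager code:

# Hypothetical illustration only; names do not come from the delfin code base.
from dataclasses import dataclass, field
from typing import List

MAX_STORAGES_IN_CHILD = 5  # default from the [TELEMETRY] options above

@dataclass
class Executor:
    """Stands in for one local metrics-collection subprocess."""
    storages: List[str] = field(default_factory=list)

def assign_storage(executors: List[Executor], storage_id: str) -> Executor:
    """Place a storage on an executor, spawning a new one when all are full."""
    for executor in executors:
        if len(executor.storages) < MAX_STORAGES_IN_CHILD:
            executor.storages.append(storage_id)
            return executor
    # Every existing subprocess already handles max_storages_in_child storages,
    # so the metrics manager would start a new subprocess for this storage.
    new_executor = Executor(storages=[storage_id])
    executors.append(new_executor)
    return new_executor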

task_cleanup_delay is the minimum delay, in seconds, before the metrics manager stops a subprocess, so that the subprocess can still handle remove job / remove failed job messages.

process_cleanup_interval is the interval, in seconds, at which a cleanup function runs to remove unused subprocesses.

group_change_detect_interval is the interval, in seconds, at which the watcher checks for process-join and process-leave callbacks.

max_childs_in_node is the number of processes that can be created on a node before an exception is raised. A large buffer is needed because process removal may be delayed.
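
Putting the options together, a delfin.conf [TELEMETRY] section that enables the feature (shown here with the default values listed above) would look like:

[TELEMETRY]
enable_dynamic_subprocess = True
process_cleanup_interval = 60
task_cleanup_delay = 10
group_change_detect_interval = 30
max_storages_in_child = 5
max_childs_in_node = 100000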

Test cases for this feature are available at: https://docs.google.com/spreadsheets/d/1uy7B4nVSI_T9qM_Sc66A7nK_lr6RIEkC/edit#gid=1006830486

Tested the feature in a single-node environment.
Test report:
https://docs.google.com/spreadsheets/d/1X9igJZnjzx9viI6wmFpqnJN1njO6JGJe/edit#gid=525327673

@codecov

codecov bot commented Sep 22, 2021

Codecov Report

Merging #708 (dc82566) into master (26426d8) will decrease coverage by 0.29%.
The diff coverage is 45.33%.

@@            Coverage Diff             @@
##           master     #708      +/-   ##
==========================================
- Coverage   70.81%   70.51%   -0.30%     
==========================================
  Files         161      163       +2     
  Lines       15600    15871     +271     
  Branches     1934     1972      +38     
==========================================
+ Hits        11047    11192     +145     
- Misses       3908     4020     +112     
- Partials      645      659      +14     
Impacted Files Coverage Δ
delfin/common/config.py 95.00% <ø> (ø)
delfin/task_manager/metrics_rpcapi.py 50.00% <ø> (ø)
delfin/task_manager/scheduler/schedule_manager.py 66.66% <0.00%> (ø)
delfin/task_manager/subprocess_manager.py 0.00% <0.00%> (ø)
delfin/coordination.py 60.22% <43.58%> (-4.37%) ⬇️
delfin/service.py 26.82% <50.00%> (+0.77%) ⬆️
delfin/task_manager/metrics_manager.py 55.34% <51.75%> (+55.34%) ⬆️
delfin/task_manager/subprocess_rpcapi.py 53.33% <53.33%> (ø)
...ager/scheduler/schedulers/telemetry/job_handler.py 71.32% <100.00%> (+0.95%) ⬆️
delfin/drivers/fake_storage/__init__.py 94.79% <0.00%> (-0.28%) ⬇️
... and 4 more

@joseph-v force-pushed the optional-dynamic branch 5 times, most recently from 70ab846 to 45cecb5 on September 23, 2021 05:43
        LOG.info("GROUP {0} already exist".format(group))

    def delete_group(self, group):
        # Create the group
Member

wrong comment for the function!

Collaborator Author

Thanks, removed.


    def leave_group(self, group):
        try:
            # Join the group
Member

wrong comment!


    def get_members(self, group):
        try:
            # Join the group
Member

same here!

@joseph-v force-pushed the optional-dynamic branch 6 times, most recently from d994e10 to 81eafe9 on September 23, 2021 10:28
NajmudheenCT
NajmudheenCT previously approved these changes Sep 27, 2021
Member

@NajmudheenCT left a comment

LGTM

@sushanthakumar
Collaborator

LGTM

kumarashit
kumarashit previously approved these changes Sep 27, 2021
Collaborator

@kumarashit left a comment

LGTM

    cfg.IntOpt('max_storages_in_child',
               default=5,
               help='Max storages handled by one local executor process'),
    cfg.IntOpt('max_childs_in_node',
Collaborator

Is this default value ok? It means allowing 100000 process?

Collaborator Author

Currently, we do not restrict the number of processes created; a large number is used as the default before an exception is raised. Also, deleting a process once it has no storages left to handle takes about 90 seconds, so a large number provides a buffer if storages are created and deleted frequently.

Member

@ThisIsClark This can be customised based on the user's environment and deployment configuration. For example, it can be set to the number of cores available on a node.
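
For example, on an 8-core node the limit could be tightened like this (illustrative value only):

[TELEMETRY]
max_childs_in_node = 8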

Collaborator

If the number of storages is greater than the limit, what will happen?

Collaborator Author

We are not restricting storages; we allow storages to get registered.


    def create_group(self, group):
        try:
            self.coordinator.create_group(group.encode()).get()
Collaborator

Why call get() after calling create_group()?

Collaborator Author

create_group() is an asynchronous call; the following get() ensures that group creation has completed.
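
As a sketch of that pattern with the tooz coordination API (the backend URL and group name here are purely illustrative):

# Sketch of the asynchronous create_group + get() pattern in tooz; not the exact delfin code.
from tooz import coordination

coordinator = coordination.get_coordinator('redis://localhost:6379', b'metrics-manager')
coordinator.start()

request = coordinator.create_group(b'local_executor_group')  # returns an async result
try:
    request.get()  # block until the group has actually been created
except coordination.GroupAlreadyExist:
    # Corresponds to the LOG.info("GROUP ... already exist") branch shown earlier.
    pass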

        try:
            self.coordinator.delete_group(group.encode()).get()
        except coordination.GroupNotCreated:
            LOG.info("GROUP {0} Group not created".format(group))
Collaborator

The first letter of the first word should be upper case and the rest lower case.
Same for the other log messages.

Collaborator Author

Done

@@ -14,18 +14,30 @@
"""
periodical task manager for metric collection tasks**
"""
import datetime
Collaborator

Please optimize the import order

Collaborator Author

Done

            self.rpcapi.assign_failed_job_local(
                context, f_task['id'], executor_topic)

    def process_cleanup(self):
Collaborator

Apart from the periodic cleanup, do we have another mechanism to clean it up proactively?

Collaborator Author

When a storage is deleted and the process does not handle any other storages, we want to stop the process after a delay (so it can handle the remove-storage message). We send the remove-storage message, wait for it to be handled, and clean up later.
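
A rough sketch of that delayed-stop flow, with purely illustrative names and data structures (the real metrics manager tracks executors through the coordination groups discussed above):

# Illustrative only; names and structure are not taken from the PR.
import time

TASK_CLEANUP_DELAY = 10  # seconds, from the [TELEMETRY] defaults above

def process_cleanup(executors):
    """Stop subprocesses that have had no storages for at least TASK_CLEANUP_DELAY."""
    now = time.time()
    for executor in list(executors):
        if executor['storages']:
            continue  # still collecting metrics for at least one storage
        if now - executor['emptied_at'] >= TASK_CLEANUP_DELAY:
            # The remove-job / remove-failed-job messages have had time to be
            # handled, so the subprocess can be stopped safely.
            executor['process'].terminate()
            executors.remove(executor)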

Collaborator

@wisererik left a comment

LGTM

Member

@NajmudheenCT left a comment

LGTM

@NajmudheenCT NajmudheenCT merged commit ca4b819 into sodafoundation:master Sep 28, 2021