Support dynamic sub-processes for metrics collection #708
Conversation
55eca6d to 1b84996
Codecov Report
@@ Coverage Diff @@
## master #708 +/- ##
==========================================
- Coverage 70.81% 70.51% -0.30%
==========================================
Files 161 163 +2
Lines 15600 15871 +271
Branches 1934 1972 +38
==========================================
+ Hits 11047 11192 +145
- Misses 3908 4020 +112
- Partials 645 659 +14
70ab846 to 45cecb5
delfin/coordination.py
Outdated
LOG.info("GROUP {0} already exist".format(group)) | ||
|
||
def delete_group(self, group): | ||
# Create the group |
wrong comment for the function!
Thanks, removed.
delfin/coordination.py
Outdated
    def leave_group(self, group):
        try:
            # Join the group
wrong comment!
delfin/coordination.py
Outdated
    def get_members(self, group):
        try:
            # Join the group
same here!
d994e10 to 81eafe9
LGTM
9ade00d to 6baa608
LGTM
LGTM
    cfg.IntOpt('max_storages_in_child',
               default=5,
               help='Max storages handled by one local executor process'),
    cfg.IntOpt('max_childs_in_node',
Is this default value OK? It means allowing 100000 processes?
Currently, we do not restrict the number of processes created. We used a large number as the default before raising an exception. Also, deleting a process when it has no storages left to handle takes about 90 seconds, so a large number provides a buffer if storages are created and deleted frequently.
@ThisIsClark This can be customised based on the user's environment and their deployment configuration. For example, it can be set to the number of cores available in a node.
If the number of storages is greater than the limit, what will happen?
We are not restricting storages; we allow storages to get registered.
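As a rough illustration of how these two options could interact, here is a hypothetical sketch (get_or_create_child and spawn_child are made-up names, not delfin's actual API): storages are always accepted, a new subprocess is only spawned once every existing one already holds max_storages_in_child storages, and max_childs_in_node is the only hard ceiling.

def get_or_create_child(children, storage_id, spawn_child,
                        max_storages_in_child=5, max_childs_in_node=100000):
    # children maps an executor topic to the set of storage ids it handles.
    # Reuse an existing child executor that still has capacity.
    for topic, storages in children.items():
        if len(storages) < max_storages_in_child:
            storages.add(storage_id)
            return topic
    # All children are full: spawn a new one unless the node-wide ceiling is hit.
    if len(children) >= max_childs_in_node:
        raise RuntimeError("max_childs_in_node exceeded")
    topic = spawn_child()  # hypothetical hook that starts a new local executor
    children[topic] = {storage_id}
    return topic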
    def create_group(self, group):
        try:
            self.coordinator.create_group(group.encode()).get()
Why call get() after calling create_group()?
create_group() is an async call; the following get() ensures that the group creation has completed.
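For illustration only, a minimal standalone sketch of this pattern with tooz (the backend URL, member id and group name are assumptions, not taken from delfin):

from tooz import coordination

coordinator = coordination.get_coordinator('redis://localhost:6379', b'member-1')
coordinator.start(start_heart=True)

# create_group() only submits the request and returns an async result object.
request = coordinator.create_group(b'metrics-group')
try:
    # get() blocks until the backend confirms the group exists,
    # raising e.g. GroupAlreadyExist if it could not be created.
    request.get()
except coordination.GroupAlreadyExist:
    pass  # group already exists; safe to continue

coordinator.stop()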
delfin/coordination.py
Outdated
        try:
            self.coordinator.delete_group(group.encode()).get()
        except coordination.GroupNotCreated:
            LOG.info("GROUP {0} Group not created".format(group))
The first letter of the first word should be in upper case, and the rest in lower case.
Same as the other log messages.
Done
@@ -14,18 +14,30 @@
"""
periodical task manager for metric collection tasks**
"""
import datetime
Please optimize the import order
Done
            self.rpcapi.assign_failed_job_local(
                context, f_task['id'], executor_topic)

    def process_cleanup(self):
Besides the periodic cleanup, do we have another mechanism to clean it up proactively?
When a storage is deleted and the process does not handle any other storages, we want to stop the process after a delay (so that the process has time to handle the remove-storage message). We send the remove-storage message, wait for it to be handled, and clean up later.
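As a rough sketch of that idea (ChildInfo and the stop_child hook are hypothetical names, not delfin's actual classes): an idle subprocess is only stopped once a cleanup delay has elapsed, so it still has time to consume its pending remove-storage job.

import time

TASK_CLEANUP_DELAY = 90  # seconds; roughly the delay mentioned above

class ChildInfo:
    # Hypothetical bookkeeping for one local executor subprocess.
    def __init__(self):
        self.storages = set()
        self.empty_since = None  # when the child last became idle

def process_cleanup(children, stop_child):
    # Called periodically (every process_cleanup_interval seconds); stops
    # children that have been idle long enough to have handled their
    # remove-storage jobs.
    now = time.monotonic()
    for topic, child in list(children.items()):
        if child.storages:
            child.empty_since = None
        elif child.empty_since is None:
            child.empty_since = now  # just became idle; start the clock
        elif now - child.empty_since >= TASK_CLEANUP_DELAY:
            stop_child(topic)  # hypothetical hook to stop the subprocess
            del children[topic]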
6baa608 to dc82566
LGTM
LGTM
What this PR does / why we need it:
Support dynamic sub-processes for metrics collection
Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #
Special notes for your reviewer:
Release note:
Dynamic subprocesses are an optimization to make better use of node resources by spawning multiple Python processes in the same node for metrics collection.
The following configuration options are available for this feature, along with their corresponding default values. These values may be changed in the 'TELEMETRY' section of the delfin.conf file.
This feature can be enabled with lines such as the ones below in the delfin.conf file:
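A minimal sketch of such a [TELEMETRY] section, using the option names described below; apart from max_storages_in_child, whose default of 5 appears in the diff above, the values shown are illustrative assumptions rather than the PR's actual defaults.

[TELEMETRY]
# Max storages handled by one local executor process (default 5)
max_storages_in_child = 5
# Processes allowed in a node before an exception is raised (illustrative value)
max_childs_in_node = 100000
# Minimum delay, in seconds, before an idle subprocess is stopped (illustrative value)
task_cleanup_delay = 90
# Interval, in seconds, between runs of the subprocess cleanup function (illustrative value)
process_cleanup_interval = 60
# Interval, in seconds, at which group join/leave callbacks are checked (illustrative value)
group_change_detect_interval = 10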
- max_storages_in_child: if the number of storages handled by a subprocess grows beyond this configured value, a new subprocess is spawned by the metrics manager to handle the additional storages.
- task_cleanup_delay: the minimum delay in seconds before the metrics manager stops a subprocess, so that it can handle the remove job/remove failed job.
- process_cleanup_interval: the interval in seconds at which a cleanup function executes to remove unused subprocesses.
- group_change_detect_interval: the interval in seconds at which the watcher checks the process join and process leave callbacks.
- max_childs_in_node: the number of processes that can be created in a node before an exception is raised. A large buffer is needed, as process removal may be delayed.
Test cases for this feature are available at: https://docs.google.com/spreadsheets/d/1uy7B4nVSI_T9qM_Sc66A7nK_lr6RIEkC/edit#gid=1006830486
Tested the feature in a single-node environment.
Test report
https://docs.google.com/spreadsheets/d/1X9igJZnjzx9viI6wmFpqnJN1njO6JGJe/edit#gid=525327673