Improving Failed job handling and telemetry job removal #689
Conversation
Codecov Report
@@ Coverage Diff @@
## perf_coll_fw_enhance #689 +/- ##
========================================================
- Coverage 70.88% 70.76% -0.12%
========================================================
Files 160 161 +1
Lines 15251 15526 +275
Branches 1869 1925 +56
========================================================
+ Hits 10810 10987 +177
- Misses 3822 3895 +73
- Partials 619 644 +25
@@ -123,3 +137,11 @@ def recover_job(self):
        distributor = TaskDistributor(self.ctx)
        for task in all_tasks:
            distributor.distribute_new_job(task['id'])

    def recover_failed_job(self):
Where will this be called?
When the node boots up?
Updated, thanks.
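For readers following the thread: a minimal sketch of how such a recovery hook could be wired in at node boot-up, as discussed above. The `TaskManager` class name, the `start()` method, and the stub bodies are illustrative assumptions; only `recover_job()` and `recover_failed_job()` appear in this diff.

```python
class TaskManager(object):
    """Illustrative wrapper; the real class name is an assumption."""

    def __init__(self, ctx):
        self.ctx = ctx

    def recover_job(self):
        pass  # as in the diff above: redistribute all normal tasks

    def recover_failed_job(self):
        pass  # as in the diff above: retry the failed tasks

    def start(self):
        # On node boot-up, re-distribute the normal jobs first, then
        # retry the jobs that failed while the node was down.
        self.recover_job()
        self.recover_failed_job()
```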
        failed_tasks = db.failed_task_get_all(self.ctx)
        for failed_task in failed_tasks:
            # Get the parent task executor
            task = db.task_get(self.ctx, failed_task['task_id'])
Are you using the failed_task id to get the normal task, i.e. do the failed task and the normal task share the same id?
The task_id field in failed_task represents the parent task's id; we use it to get the parent task for the failed_task.
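Putting the two hunks together, a hedged sketch of what the complete method might look like. Only `failed_task_get_all`, `task_get`, and `distribute_new_job` are visible in this diff; the final redistribution step is an assumption.

```python
def recover_failed_job(self):
    # `db` and `TaskDistributor` are the module-level imports used
    # elsewhere in this file, as shown in the surrounding diff.
    failed_tasks = db.failed_task_get_all(self.ctx)
    distributor = TaskDistributor(self.ctx)
    for failed_task in failed_tasks:
        # The task_id field of a failed_task points at its parent
        # task (see the review thread above).
        task = db.task_get(self.ctx, failed_task['task_id'])
        # Assumed step: hand the parent task back to the distributor
        # so the failed collection is retried on a live node.
        distributor.distribute_new_job(task['id'])
```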
LGTM
LGTM
* Make job scheduler local to task process (#674)
* Notify distributor when a new task added (#678)
* Remove db-scan for new task creation (#680)
* Use consistent hash to manage the topic (#681)
* Remove the periodically call from task distributor (#686)
* Start one historic collection immediate when a job is rescheduled (#685)
* Remove failed task distributor (#687)
* Improving Failed job handling and telemetry job removal (#689)

Co-authored-by: ThisIsClark <liuyuchibubao@gmail.com>
Co-authored-by: Ashit Kumar <akopensrc@gmail.com>
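One of the commits above (#681) uses a consistent hash to manage the topic. A generic sketch of that idea follows; this is not the delfin implementation, and the class name, replica count, and md5-based ring are all assumptions.

```python
import bisect
import hashlib


class ConsistentHashRing(object):
    """Generic consistent-hash ring; illustrative only."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self._ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            self.add_node(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        # Each node gets `replicas` virtual points on the ring, so
        # keys spread evenly and only a small share of them move when
        # membership changes (the node-down / node-joins scenarios).
        for i in range(self.replicas):
            self._ring.append((self._hash('%s:%d' % (node, i)), node))
        self._ring.sort()

    def get_node(self, key):
        # Walk clockwise to the first virtual point at or after key.
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h,)) % len(self._ring)
        return self._ring[idx][1]


# Example: map a task id to the node that owns its topic.
ring = ConsistentHashRing(['node-1', 'node-2', 'node-3'])
print(ring.get_node('task-12345'))
```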
What this PR does / why we need it:
After removing the periodic scan of the task and failed-task tables, we need to handle the scenarios below:
Which issue this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close that issue when PR gets merged): fixes #NA

Test notes:
The following test cases were executed with this PR.
Expected results:
When a storage is deleted:
When a node is down:
When a node joins:
Release note: