Improving Failed job handling and telemetry job removal #689

Conversation

NajmudheenCT
Member

What this PR does / why we need it:
After removing the periodic scan of the task and failed_task tables, we need to handle the following scenarios:

  1. When a node restarts, fetch all jobs assigned to it and schedule them.
  2. When a leader switch happens, reschedule failed_tasks.
  3. When a node goes down or comes up, reschedule failed tasks.
  4. When a storage is deleted, remove its corresponding jobs from the scheduler (sketched below).
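To make scenario 4 concrete, here is a minimal sketch, assuming an APScheduler-style scheduler object and hypothetical db helper names (task_get_all, task_delete, failed_task_get_all, failed_task_delete); the actual wiring in this PR may differ:

```python
def remove_jobs_for_storage(ctx, storage_id, scheduler, db):
    """Drop scheduler jobs and table entries when a storage is deleted.

    The db helper names and the assumption that job ids equal task ids
    are illustrative, not necessarily this PR's exact code.
    """
    tasks = db.task_get_all(ctx, filters={'storage_id': storage_id})
    failed_tasks = db.failed_task_get_all(
        ctx, filters={'storage_id': storage_id})

    # Remove scheduler entries first so no new collection fires for a
    # storage that is being torn down.
    for entry in tasks + failed_tasks:
        if scheduler.get_job(entry['id']):
            scheduler.remove_job(entry['id'])

    for task in tasks:
        db.task_delete(ctx, task['id'])
    for failed_task in failed_tasks:
        db.failed_task_delete(ctx, failed_task['id'])
```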

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #NA

Test notes:
The following test cases were executed with this PR:

  • Make a telemetry collection fail:

    Expected results:

        1. An entry is created in the failed_task table.
        2. The scheduler receives a job to schedule the failed_task again.
        3. The scheduler runs the failed job.
        4. The success/failure result is updated in the failed_task table.

  • Make a failed_telemetry job fail continuously:

    Expected results:

        1. The retry count is updated for each failure.
        2. After reaching max_retry, the job is removed and the failed_task entry is also removed (see the retry sketch after this list).

  • When a storage is deleted:

        1. All task and failed_task entries for that storage are deleted.
        2. All scheduled jobs for the corresponding storage are deleted.

  • When a node is down:

        1. All tasks and failed_tasks assigned to that node are redistributed.
        2. A new executor is assigned for each task.
        3. The same executor as the parent task is assigned for the failed_task.
        4. The assigned executor starts collection for these tasks and failed_tasks.

  • When a node joins:

        1. All tasks and failed_tasks get executors assigned.
        2. The same executor as the parent task is assigned for the failed_task.
        3. The assigned executor starts collection for these tasks and failed_tasks.
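For the continuous-failure case above, the retry accounting can be sketched as follows; MAX_RETRY, the field names, and the db helpers are assumptions for illustration, while the real logic lives in the failed-collection handler:

```python
MAX_RETRY = 5  # assumed configuration value


def on_failed_attempt(ctx, failed_task, scheduler, db):
    """Update the retry count after a failed collection attempt and
    give up once MAX_RETRY is reached, as verified in the test above."""
    retry_count = failed_task['retry_count'] + 1
    if retry_count >= MAX_RETRY:
        # Max retries exhausted: remove the scheduled job and the
        # failed_task table entry.
        if scheduler.get_job(failed_task['job_id']):
            scheduler.remove_job(failed_task['job_id'])
        db.failed_task_delete(ctx, failed_task['id'])
    else:
        db.failed_task_update(ctx, failed_task['id'],
                              {'retry_count': retry_count})
```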

Release note:

@codecov

codecov bot commented Sep 7, 2021

Codecov Report

Merging #689 (098d8d1) into perf_coll_fw_enhance (38adb19) will decrease coverage by 0.11%.
The diff coverage is 31.48%.

@@                   Coverage Diff                    @@
##           perf_coll_fw_enhance     #689      +/-   ##
========================================================
- Coverage                 70.88%   70.76%   -0.12%     
========================================================
  Files                       160      161       +1     
  Lines                     15251    15526     +275     
  Branches                   1869     1925      +56     
========================================================
+ Hits                      10810    10987     +177     
- Misses                     3822     3895      +73     
- Partials                    619      644      +25     
Impacted Files Coverage Δ
delfin/api/v1/storages.py 76.02% <ø> (-0.49%) ⬇️
delfin/task_manager/manager.py 0.00% <ø> (ø)
delfin/task_manager/perf_job_controller.py 61.29% <0.00%> (-20.20%) ⬇️
delfin/task_manager/tasks/telemetry.py 71.42% <ø> (-3.58%) ⬇️
...in/leader_election/distributor/task_distributor.py 68.57% <12.50%> (-16.62%) ⬇️
delfin/task_manager/scheduler/schedule_manager.py 66.66% <28.57%> (-17.16%) ⬇️
...ager/scheduler/schedulers/telemetry/job_handler.py 70.37% <33.33%> (-1.89%) ⬇️
...telemetry/failed_performance_collection_handler.py 100.00% <100.00%> (ø)
...dulers/telemetry/performance_collection_handler.py 100.00% <100.00%> (ø)
... and 9 more

@@ -123,3 +137,11 @@ def recover_job(self):
        distributor = TaskDistributor(self.ctx)
        for task in all_tasks:
            distributor.distribute_new_job(task['id'])

    def recover_failed_job(self):
Collaborator
Where will this be called?

Collaborator

When the node boots up?

Member Author

Updated, thanks.

        failed_tasks = db.failed_task_get_all(self.ctx)
        for failed_task in failed_tasks:
            # Get the parent task executor
            task = db.task_get(self.ctx, failed_task['task_id'])
Collaborator

You use the failed_task's id to get the normal task; do the failed task and the normal task have the same id?

Member Author

The task_id field in failed_task represents the parent task's id; we use it to get the parent task for the failed_task.
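Putting the snippet and this explanation together, a plausible shape for recover_failed_job is sketched below; the assign_failed_job RPC helper name is an assumption, not necessarily the PR's exact call:

```python
def recover_failed_job(self):
    """Reschedule every failed_task, pinning it to the executor that
    owns its parent task so both run on the same node."""
    failed_tasks = db.failed_task_get_all(self.ctx)
    for failed_task in failed_tasks:
        # Get the parent task executor
        task = db.task_get(self.ctx, failed_task['task_id'])
        # Assumed RPC helper: hand the failed task to the parent
        # task's executor topic.
        self.task_rpcapi.assign_failed_job(
            self.ctx, failed_task['id'], task['executor'])
```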


@sushanthakumar
Collaborator

LGTM

@ThisIsClark (Collaborator) left a comment

LGTM

@ThisIsClark ThisIsClark merged commit 6bcb7b7 into sodafoundation:perf_coll_fw_enhance Sep 8, 2021
kumarashit added a commit that referenced this pull request Sep 14, 2021
* Make job scheduler local to task process (#674)

* Make job scheduler local to task process

* Notify distributor when a new task added (#678)

* Remove db-scan for new task creation (#680)

* Use consistent hash to manage the topic (#681)

* Remove the periodically call from task distributor (#686)

* Start one historic collection immediate when a job is rescheduled (#685)

* Start one historic collection immediate when a job is rescheduled

* Remove failed task distributor (#687)

* Improving Failed job handling and telemetry job removal (#689)

Co-authored-by: ThisIsClark <liuyuchibubao@gmail.com>
Co-authored-by: Ashit Kumar <akopensrc@gmail.com>