Improving Failed job handling and telemetry job removal #689

Conversation

NajmudheenCT
Member

What this PR does / why we need it:
After removing the periodic scan of the task and failed_task tables, we need to handle the following scenarios:

  1. When a node restarts, fetch all jobs assigned to it and schedule them.
  2. When a leader switch happens, reschedule failed_tasks.
  3. When a node goes down or comes up, reschedule failed tasks.
  4. When a storage is deleted, remove its corresponding jobs from the scheduler (sketched below).
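To make scenario 4 concrete, here is a minimal sketch, assuming an APScheduler-style scheduler object and hypothetical db helper names (task_get_all, task_delete, failed_task_get_all, failed_task_delete); the actual wiring in this PR may differ:

```python
def remove_jobs_for_storage(ctx, storage_id, scheduler, db):
    """Drop scheduler jobs and table entries when a storage is deleted.

    The db helper names and the assumption that job ids equal task ids
    are illustrative, not necessarily this PR's exact code.
    """
    tasks = db.task_get_all(ctx, filters={'storage_id': storage_id})
    failed_tasks = db.failed_task_get_all(
        ctx, filters={'storage_id': storage_id})

    # Remove scheduler entries first so no new collection fires for a
    # storage that is being torn down.
    for entry in tasks + failed_tasks:
        if scheduler.get_job(entry['id']):
            scheduler.remove_job(entry['id'])

    for task in tasks:
        db.task_delete(ctx, task['id'])
    for failed_task in failed_tasks:
        db.failed_task_delete(ctx, failed_task['id'])
```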

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #NA

Test notes:
The following test cases were executed with this PR:

  • Make a telemetry collection fail:

    Expected results:

        1. An entry is created in the failed_task table.
        2. The scheduler receives a job to schedule the failed_task again.
        3. The scheduler runs the failed job.
        4. The success/failure result is updated in the failed_task table.

  • Make a failed_telemetry job fail continuously:

    Expected results:

        1. The retry count is updated for each failure.
        2. After reaching max_retry, the job is removed and the failed_task entry is also removed (see the retry sketch after this list).

  • When a storage is deleted:

        1. All task and failed_task entries for that storage are deleted.
        2. All scheduled jobs for the corresponding storage are deleted.

  • When a node is down:

        1. All tasks and failed_tasks assigned to that node are redistributed.
        2. A new executor is assigned for each task.
        3. The same executor as the parent task is assigned for the failed_task.
        4. The assigned executor starts collection for these tasks and failed_tasks.

  • When a node joins:

        1. All tasks and failed_tasks get executors assigned.
        2. The same executor as the parent task is assigned for the failed_task.
        3. The assigned executor starts collection for these tasks and failed_tasks.
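For the continuous-failure case above, the retry accounting can be sketched as follows; MAX_RETRY, the field names, and the db helpers are assumptions for illustration, while the real logic lives in the failed-collection handler:

```python
MAX_RETRY = 5  # assumed configuration value


def on_failed_attempt(ctx, failed_task, scheduler, db):
    """Update the retry count after a failed collection attempt and
    give up once MAX_RETRY is reached, as verified in the test above."""
    retry_count = failed_task['retry_count'] + 1
    if retry_count >= MAX_RETRY:
        # Max retries exhausted: remove the scheduled job and the
        # failed_task table entry.
        if scheduler.get_job(failed_task['job_id']):
            scheduler.remove_job(failed_task['job_id'])
        db.failed_task_delete(ctx, failed_task['id'])
    else:
        db.failed_task_update(ctx, failed_task['id'],
                              {'retry_count': retry_count})
```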

Release note:

@codecov

codecov bot commented Sep 7, 2021

Codecov Report

Merging #689 (098d8d1) into perf_coll_fw_enhance (38adb19) will decrease coverage by 0.11%.
The diff coverage is 31.48%.

@@                   Coverage Diff                    @@
##           perf_coll_fw_enhance     #689      +/-   ##
========================================================
- Coverage                 70.88%   70.76%   -0.12%     
========================================================
  Files                       160      161       +1     
  Lines                     15251    15526     +275     
  Branches                   1869     1925      +56     
========================================================
+ Hits                      10810    10987     +177     
- Misses                     3822     3895      +73     
- Partials                    619      644      +25     
Impacted Files Coverage Δ
delfin/api/v1/storages.py 76.02% <ø> (-0.49%) ⬇️
delfin/task_manager/manager.py 0.00% <ø> (ø)
delfin/task_manager/perf_job_controller.py 61.29% <0.00%> (-20.20%) ⬇️
delfin/task_manager/tasks/telemetry.py 71.42% <ø> (-3.58%) ⬇️
...in/leader_election/distributor/task_distributor.py 68.57% <12.50%> (-16.62%) ⬇️
delfin/task_manager/scheduler/schedule_manager.py 66.66% <28.57%> (-17.16%) ⬇️
...ager/scheduler/schedulers/telemetry/job_handler.py 70.37% <33.33%> (-1.89%) ⬇️
...telemetry/failed_performance_collection_handler.py 100.00% <100.00%> (ø)
...dulers/telemetry/performance_collection_handler.py 100.00% <100.00%> (ø)
... and 9 more

@@ -123,3 +137,11 @@ def recover_job(self):
        distributor = TaskDistributor(self.ctx)
        for task in all_tasks:
            distributor.distribute_new_job(task['id'])

    def recover_failed_job(self):
Collaborator
Where will this be called?

Collaborator

When the node boots up?

Member Author

Updated, thanks.

        failed_tasks = db.failed_task_get_all(self.ctx)
        for failed_task in failed_tasks:
            # Get the parent task executor
            task = db.task_get(self.ctx, failed_task['task_id'])
Collaborator

You use the failed_task's id to get the normal task; do the failed task and the normal task have the same id?

Member Author

The task_id field in failed_task represents the parent task's id; we use it to get the parent task for the failed_task.
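Putting the snippet and this explanation together, a plausible shape for recover_failed_job is sketched below; the assign_failed_job RPC helper name is an assumption, not necessarily the PR's exact call:

```python
def recover_failed_job(self):
    """Reschedule every failed_task, pinning it to the executor that
    owns its parent task so both run on the same node."""
    failed_tasks = db.failed_task_get_all(self.ctx)
    for failed_task in failed_tasks:
        # Get the parent task executor
        task = db.task_get(self.ctx, failed_task['task_id'])
        # Assumed RPC helper: hand the failed task to the parent
        # task's executor topic.
        self.task_rpcapi.assign_failed_job(
            self.ctx, failed_task['id'], task['executor'])
```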


@sushanthakumar
Collaborator

LGTM

@ThisIsClark (Collaborator) left a comment

LGTM

@ThisIsClark ThisIsClark merged commit 6bcb7b7 into sodafoundation:perf_coll_fw_enhance Sep 8, 2021
kumarashit added a commit that referenced this pull request Sep 14, 2021
* Make job scheduler local to task process (#674)

* Make job scheduler local to task process

* Notify distributor when a new task added (#678)

* Remove db-scan for new task creation (#680)

* Use consistent hash to manage the topic (#681)

* Remove the periodically call from task distributor (#686)

* Start one historic collection immediate when a job is rescheduled (#685)

* Start one historic collection immediate when a job is rescheduled

* Remove failed task distributor (#687)

* Improving Failed job handling and telemetry job removal (#689)

Co-authored-by: ThisIsClark <liuyuchibubao@gmail.com>
Co-authored-by: Ashit Kumar <akopensrc@gmail.com>