
Remove failed task distributor #687

Conversation

ThisIsClark (Collaborator)

What this PR does / why we need it:
Remove failed task distributor

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #

Special notes for your reviewer:

Release note:

codecov bot commented on Sep 6, 2021

Codecov Report

Merging #687 (5a889c2) into perf_coll_fw_enhance (cd62e0b) will decrease coverage by 0.04%.
The diff coverage is 57.14%.

@@                   Coverage Diff                    @@
##           perf_coll_fw_enhance     #687      +/-   ##
========================================================
- Coverage                 70.96%   70.92%   -0.05%     
========================================================
  Files                       161      160       -1     
  Lines                     15290    15240      -50     
  Branches                   1872     1868       -4     
========================================================
- Hits                      10851    10809      -42     
+ Misses                     3821     3814       -7     
+ Partials                    618      617       -1     
Impacted Files Coverage Δ
...in/leader_election/distributor/task_distributor.py 85.18% <ø> (-0.53%) ⬇️
delfin/task_manager/scheduler/schedule_manager.py 83.82% <25.00%> (+12.05%) ⬆️
...dulers/telemetry/performance_collection_handler.py 100.00% <100.00%> (+3.70%) ⬆️
delfin/task_manager/metrics_rpcapi.py 50.00% <0.00%> (-14.71%) ⬇️
delfin/context.py 74.00% <0.00%> (-6.00%) ⬇️
delfin/rpc.py 73.68% <0.00%> (-2.64%) ⬇️
delfin/db/sqlalchemy/api.py 71.72% <0.00%> (-0.13%) ⬇️

@sushanthakumar (Collaborator)

LGTM

@@ -108,3 +108,6 @@ def _handle_task_failure(self, start_time, end_time):
                        FailedTask.retry_count.name: 0,
                        FailedTask.executor.name: self.executor}
         db.failed_task_create(self.ctx, failed_task)
+        self.metric_task_rpcapi.assign_failed_job(self.ctx,
Member

This message will help assign the failed job to the same executor. But with the removal of the failed-job distributor's scanning, there is no way to assign a failed job when a node goes down or restarts.
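
For context on the line under discussion: the new call hands the freshly created failed-task record straight to the owning executor's task manager over RPC, instead of leaving it for the (now removed) failed-task distributor to find in a periodic DB scan. Below is a minimal sketch of what such a client-side RPC method could look like with oslo.messaging; the class name, topic names, and payload shape are assumptions for illustration, not the actual code in delfin/task_manager/metrics_rpcapi.py.

```python
# Illustrative sketch only: shows how a client-side RPC API could cast an
# "assign_failed_job" message to the executor that owns the failed task.
# Class name, topic names, and payload shape are assumptions, not the actual
# delfin/task_manager/metrics_rpcapi.py implementation.
import oslo_messaging as messaging
from oslo_config import cfg


class MetricsTaskAPI(object):
    """Client-side RPC API used by the collection handler (sketch)."""

    RPC_API_VERSION = '1.0'

    def __init__(self):
        target = messaging.Target(topic='delfin-metrics-task',
                                  version=self.RPC_API_VERSION)
        transport = messaging.get_rpc_transport(cfg.CONF)
        self.client = messaging.RPCClient(transport, target)

    def assign_failed_job(self, context, failed_task, executor):
        # Cast to the executor's own topic so the failed job stays on the
        # node that originally ran the task (assumed routing scheme).
        cctxt = self.client.prepare(topic=executor, version='1.0')
        return cctxt.cast(context, 'assign_failed_job',
                          failed_task=failed_task)
```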

Collaborator Author

OK, so we should add another mechanism to recover the failed jobs that belong to this executor.
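
One possible shape for such a recovery mechanism, sketched under the assumption that an executor can look up its own failed-task records at startup and reschedule their retries locally; the db helper and retry callable named here are illustrative, not necessarily what the project implemented:

```python
# Hypothetical recovery sketch: on executor start/restart, re-own the failed
# tasks recorded for this node and reschedule their retries locally.
# db.failed_task_get_all() and retry_callable are assumed names, not
# necessarily the helpers delfin exposes.
from delfin import db


def recover_failed_jobs(ctx, executor, scheduler, retry_callable):
    """Re-register retry jobs for failed tasks that belong to this executor."""
    for failed_task in db.failed_task_get_all(ctx):
        if failed_task['executor'] != executor:
            continue
        # A periodic retry job fills the collection gap left by the restart;
        # the retry callable is expected to remove the job once the retry
        # budget is exhausted or the collection succeeds.
        scheduler.add_job(retry_callable,
                          'interval',
                          seconds=failed_task['interval'],
                          args=[ctx, failed_task['id']],
                          id=str(failed_task['id']))
```

Per the commit list further down, failed-job handling was subsequently revisited in #689 (Improving Failed job handling and telemetry job removal).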

@NajmudheenCT (Member) left a comment

LGTM

@NajmudheenCT merged commit 38adb19 into sodafoundation:perf_coll_fw_enhance on Sep 7, 2021
kumarashit added a commit that referenced this pull request Sep 14, 2021
* Make job scheduler local to task process (#674)

* Make job scheduler local to task process

* Notify distributor when a new task added (#678)

* Remove db-scan for new task creation (#680)

* Use consistent hash to manage the topic (#681)

* Remove the periodically call from task distributor (#686)

* Start one historic collection immediate when a job is rescheduled (#685)

* Start one historic collection immediate when a job is rescheduled

* Remove failed task distributor (#687)

* Improving Failed job handling and telemetry job removal (#689)

Co-authored-by: ThisIsClark <liuyuchibubao@gmail.com>
Co-authored-by: Ashit Kumar <akopensrc@gmail.com>