Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vine: check the availability of staged inputs #3994

Merged
merged 10 commits into from
Dec 2, 2024

Conversation

JinZhou5042
Copy link
Member

@JinZhou5042 JinZhou5042 commented Nov 26, 2024

Proposed Changes

Fix #3993

So the problem was that when a worker crashes when sending back permanent outputs, all children tasks fail because of input missing.

It turned out that the return value of retrieve_output and vine_manager_get_output_files are not correctly received and handled.

In the original version, result is initialized as vine_result_code_t result = VINE_SUCCESS; which is 0 in integer:

typedef enum {
	VINE_SUCCESS = 0,
	VINE_WORKER_FAILURE,
	VINE_APP_FAILURE,
	VINE_MGR_FAILURE,
	VINE_END_OF_LIST,
} vine_result_code_t;

And result &= xxx will never change its initial value, this results in the manager assuming that the outputs have always been successfully retrieved.

I was able to reproduce the stated error and with these changes that issue disappeared.

Merge Checklist

The following items must be completed before PRs can be merged.
Check these off to verify you have completed all steps.

  • make test Run local tests prior to pushing.
  • make format Format source code to comply with lint policies. Note that some lint errors can only be resolved manually (e.g., Python)
  • make lint Run lint on source code prior to pushing.
  • Manual Update: Update the manual to reflect user-visible changes.
  • Type Labels: Select a github label for the type: bugfix, enhancement, etc.
  • Product Labels: Select a github label for the product: TaskVine, Makeflow, etc.
  • PR RTM: Mark your PR as ready to merge.

@JinZhou5042 JinZhou5042 requested a review from dthain November 26, 2024 19:30
@JinZhou5042 JinZhou5042 self-assigned this Nov 26, 2024
@JinZhou5042 JinZhou5042 added bug For modifications that fix a flaw in the code. TaskVine labels Nov 26, 2024
Copy link
Member

@dthain dthain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I don't think this is the right solution.

Regular files should not be "lost" because they should exist at the manager once the original is done.

The problem is evident in the log describing the problem: an output transfer is interrupted, and the corresponding task is (incorrectly) declared as complete.

The correct solution is for that original task to go back into the queue.

@JinZhou5042
Copy link
Member Author

Thanks for the clarification, this sounds to be a more feasible way.

@JinZhou5042 JinZhou5042 marked this pull request as draft November 26, 2024 19:56
@JinZhou5042 JinZhou5042 marked this pull request as ready for review November 26, 2024 22:16
@JinZhou5042 JinZhou5042 requested a review from dthain November 26, 2024 22:16
@JinZhou5042
Copy link
Member Author

@dthain Thanks for helping me find this bug!

Copy link
Member

@dthain dthain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, that definitely looks like a bug.
Probably left over from some old code that used the &= pattern.
Good catch!

@dthain dthain requested a review from btovar November 27, 2024 16:52
@dthain
Copy link
Member

dthain commented Nov 27, 2024

@btovar can you double check this key bit of logic?

Copy link
Member

@btovar btovar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JinZhou5042 nice catch!

@btovar btovar merged commit c364db8 into cooperative-computing-lab:master Dec 2, 2024
10 checks passed
@JinZhou5042 JinZhou5042 deleted the input_missing branch December 2, 2024 14:46
btovar pushed a commit that referenced this pull request Dec 14, 2024
* vine: check the availability of staged inputs

* check for created files

* revert changes

* revert changes

* vine: change &= to =

* set t->output_received on success

* break when result is not success

* lint

* remove redund code

* always do vine_manager_get_output_files
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug For modifications that fix a flaw in the code. TaskVine
Projects
None yet
Development

Successfully merging this pull request may close these issues.

vine: non-temporary files lost permanently because of worker crashes
3 participants