Skip to content
This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

Handling cells with hardware failure #23

Closed
4 tasks done
fanyangCS opened this issue Jul 3, 2020 · 4 comments
Closed
4 tasks done

Handling cells with hardware failure #23

fanyangCS opened this issue Jul 3, 2020 · 4 comments
Assignees
Labels
enhancement New feature or request

Comments

@fanyangCS
Copy link

fanyangCS commented Jul 3, 2020

@fanyangCS fanyangCS changed the title Handling cell containing hardware failure Handling cells with hardware failure Jul 3, 2020
@fanyangCS fanyangCS added the enhancement New feature or request label Jul 3, 2020
@abuccts abuccts closed this as completed Jul 15, 2020
@hzhua
Copy link
Contributor

hzhua commented Aug 10, 2020

Here is the psudo-code of the above algorithm.

First, we define

- num_splittable[i]: the number of splittable cells at each level (i=1,...,n). "splittable" means we can split the cell without violating sharing safety guarantee.
- num_free_cell[i] as the number of free cells at level-i. 
- num_vc_unused[i] as the number of unused level-i cells in all VCs.
- HUC[i] as the number of level-i cells we can obtain by splitting a level-(i+1) cell

If level-n is the highest level, then num_splittable[i] can be calculated recursively as follows:

1.	when i = n, num_splittable[i] = num_free_cell[i] - num_vc_unused[i]
2.	when i < n, num_splittable[i] = num_splittable[i+1]*HUC[i] + num_free_cell[i] – num_vc_unused[i]

Thus, the sharing safety guarantee is equivalent to guarantee num_splittable[i] >= 0 for all level i at any time.

Algorithm:
C = BuddyAlloc(r_k) // the level-k cell from BuddyAlloc
If C is heathy:
  Return C
Else: //C is bad
  Add C back to free_list[k]
  While C is bad:
    Sort free_list[k] in desc by health score: percentage of heathy GPUs in the cell
    C = free_list[0]
    If C is bad:
      If there exist level k’ > k that num_splitable[k’] > 0.
        k’ = the minimum level s.t. num_splitable[k’] > 0
        Split a level-k’ cell until we get a level-k cell.			
    Else: //Splitting not possible
      Return (C, Failure)
      // Expose the bad cell with most free GPUs to the requesting VC
Return (C, Success)

We need to prove the above algorithm's correctness and completeness.

[Correctness] The algorithm will not violate sharing safety guarantee.

[Completeness] Using this algorithm, there is no such case that the allocation of a level-k cell request fails but it is possible to obtain a healthy level-k cell without violating sharing safety.

Proof:
The proof of correctness is straightforward, because each splitting at level-i will check if num_splittable[i]>=0.

The proof of compeleteness can be done by contradiction.

If the algorithm is not complete, it means there is a "Case" that:

  • When allocating a level-k cell fails, there exist a level-k’ cell (k’>=k) that num_splittable[k’] > 0 and it contains at least one level-k cell.

If k’ = k, it is impossible that num_splittable[k’] > 0 at level-k because "we will firstly allocate the healthier cell".

If k’ > k, the Case is impossible. Because the algorithm will split all cells that num_splittable[k’’] > 0 (for all k’’ >= k) if the algorithm fails. There will be no cells that num_splittable[k’] > 0, which contradicts with the Case.

@zhypku
Copy link
Contributor

zhypku commented Aug 10, 2020

Here is the psudo-code of the above algorithm.

First, we define

- num_splittable[i]: the number of splittable cells at each level (i=1,...,n). "splittable" means we can split the cell without violating sharing safety guarantee.
- num_free_cell[i] as the number of free cells at level-i. 
- num_vc_unused[i] as the number of unused level-i cells in all VCs.
- HUC[i] as the number of level-i cells we can obtain by splitting a level-(i+1) cell

If level-n is the highest level, then num_splittable[i] can be calculated recursively as follows:

1.	when i = n, num_splittable[i] = num_free_cell[i] - num_vc_unused[i]
2.	when i < n, num_splittable[i] = num_splittable[i+1]*HUC[i] + num_free_cell[i] – num_vc_unused[i]

Thus, the sharing safety guarantee is equivalent to guarantee num_splittable[i] >= 0 for all level i at any time.

Algorithm:
C = BuddyAlloc(r_k) // the level-k cell from BuddyAlloc
If C is heathy:
  Return C
Else: //C is bad
  Add C back to free_list[k]
  While C is bad:
    Sort free_list[k] in desc by health score: percentage of heathy GPUs in the cell
    C = free_list[0]
    If C is bad:
      If there exist level k’ > k that num_splitable[k’] > 0.
        k’ = the minimum level s.t. num_splitable[k’] > 0
        Split a level-k’ cell until we get a level-k cell.			
    Else: //Splitting not possible
      Return (C, Failure)
      // Expose the bad cell with most free GPUs to the requesting VC
Return (C, Success)

We need to prove the above algorithm's correctness and completeness.

[Correctness] The algorithm will not violate sharing safety guarantee.

[Completeness] Using this algorithm, there is no such case that the allocation of a level-k cell request fails but it is possible to obtain a healthy level-k cell without violating sharing safety.

Proof:
The proof of correctness is straightforward, because each splitting at level-i will check if num_splittable[i]>=0.

The proof of compeleteness can be done by contradiction.

If the algorithm is not complete, it means there is a "Case" that:

  • When allocating a level-k cell fails, there exist a level-k’ cell (k’>=k) that num_splittable[k’] > 0 and it contains at least one level-k cell.

If k’ = k, it is impossible that num_splittable[k’] > 0 at level-k because "we will firstly allocate the healthier cell".

If k’ > k, the Case is impossible. Because the algorithm will split all cells that num_splittable[k’’] > 0 (for all k’’ >= k) if the algorithm fails. There will be no cells that num_splittable[k’] > 0, which contradicts with the Case.

Thanks a lot, Zhenhua!
Actually I've been thinking recently that this algorithm is not necessarily for only fault-tolerance. It is a general algorithm to break buddy cell allocation's rule without breaking safety.
This is in fact an algorithm to reach the "necessary" side of cell allocation; buddy cell allocation is only sufficient, but not necessary.
While this algorithm touches the necessity by explicit safety checking.
Fault-tolerance is just an example of cases where we want to break the buddy allocation rule for a while. Maybe we can try to extend it to other cases (not in the implementation I mean, perhaps in the paper, if we can find some interesting use cases :) )
(Oh but for the paper it seems to conflict with the simplicity principle of HiveD though 😂 anyway...)

@hzhua
Copy link
Contributor

hzhua commented Aug 10, 2020

Yes. The safety-check provides the necessary condition of allocation. A possible use case is to avoid unnecessary preemption of low-priority jobs. More generally, if we need to support customized physical scheduling policy, we need to provide an interface of this condition in some form (e.g., as a list of splittable/allocable cells at all levels).

@fanyangCS
Copy link
Author

Agree that the safety-check provides a necessary condition of arbitrary cell allocation. Looks like we can generalize the algorithm and prove it necessary and sufficient, and looks like we can reuse the proof you guys did a long time ago.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants