Handling cells with hardware failure #23

fanyangCS · 2020-07-03T03:56:58Z

1. When allocating a level-k cell, we first allocate the "healthier" cells in the level-k free cell list (healthiness is defined as # of good GPUs / # of total GPUs). @zhypku
2. When all free level-k cells are bad cells, check whether we can get a new level-k cell by splitting a higher-level cell (by leveraging the initial cell assignment for each VC). @abuccts Split higher level cell when allocated bad cells #27
3. If we cannot get a new level-k cell from step 2, all "allocable" level-k cells are bad, so we pre-bind these bad level-k cells to all VCs that are assigned level-k cells. @zhypku Infinitely retry in intra-VC scheduler due to inconsistent view of bad cells. #25
4. If we get a set of new level-k cells (buddy cells) from step 2 but they are all still bad cells, repeat step 2. @abuccts Split higher level cell when allocated bad cells #27
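
As an illustration of step 1, here is a minimal Python sketch of picking a free cell by health ratio. The `Cell` class and the `pick_healthiest` helper are hypothetical names introduced only for this example; they are not taken from the HiveD code base.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cell:
    # Hypothetical minimal cell model, for illustration only.
    name: str
    good_gpus: int
    total_gpus: int

    @property
    def health(self) -> float:
        # Healthiness as defined in step 1: # of good GPUs / # of total GPUs.
        return self.good_gpus / self.total_gpus


def pick_healthiest(free_list: List[Cell]) -> Optional[Cell]:
    """Return the free cell with the highest health ratio, or None if the list is empty."""
    return max(free_list, key=lambda c: c.health, default=None)
```

A cell with a health ratio below 1.0 is a bad cell in the sense of steps 2 to 4, so a caller would still need to check the ratio of the returned cell before using it.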

hzhua · 2020-08-10T08:41:08Z

Here is the pseudo-code of the above algorithm.

First, we define:

- num_splittable[i]: the number of splittable cells at level-i (i=1,...,n). "Splittable" means we can split the cell without violating the sharing safety guarantee.
- num_free_cell[i]: the number of free cells at level-i.
- num_vc_unused[i]: the number of unused level-i cells in all VCs.
- HUC[i]: the number of level-i cells we can obtain by splitting a level-(i+1) cell.

If level-n is the highest level, then num_splittable[i] can be calculated recursively as follows:

1. When i = n, num_splittable[i] = num_free_cell[i] - num_vc_unused[i]
2. When i < n, num_splittable[i] = num_splittable[i+1] * HUC[i] + num_free_cell[i] - num_vc_unused[i]

Thus, the sharing safety guarantee is equivalent to guaranteeing num_splittable[i] >= 0 for every level i at all times.
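
The recursion above can be sketched in Python as follows. This is only an illustration: the function names and the convention of using lists indexed by level (with index 0 unused) are assumptions made for this example.

```python
from typing import List


def compute_num_splittable(
    num_free_cell: List[int],
    num_vc_unused: List[int],
    huc: List[int],
) -> List[int]:
    """Compute num_splittable[i] for levels 1..n.

    All lists are indexed by level; index 0 is unused (an assumption of this sketch).
    huc[i] is the number of level-i cells obtained by splitting one level-(i+1) cell.
    """
    n = len(num_free_cell) - 1
    num_splittable = [0] * (n + 1)
    # Base case: the highest level.
    num_splittable[n] = num_free_cell[n] - num_vc_unused[n]
    # Recursive case: walk down from level n-1 to level 1.
    for i in range(n - 1, 0, -1):
        num_splittable[i] = (
            num_splittable[i + 1] * huc[i] + num_free_cell[i] - num_vc_unused[i]
        )
    return num_splittable


def is_safe(num_splittable: List[int]) -> bool:
    """The sharing safety condition: num_splittable[i] >= 0 at every level."""
    return all(x >= 0 for x in num_splittable[1:])
```

For example, with three levels, `num_free_cell = [0, 0, 1, 2]`, `num_vc_unused = [0, 2, 2, 1]` and `huc = [0, 2, 2, 0]`, the values for levels 3, 2 and 1 are 2 - 1 = 1, 1*2 + 1 - 2 = 1 and 1*2 + 0 - 2 = 0, so the condition holds.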

Algorithm:
C = BuddyAlloc(r_k) // the level-k cell from BuddyAlloc
If C is healthy:
  Return (C, Success)
Else: // C is bad
  Add C back to free_list[k]
  While C is bad:
    Sort free_list[k] in descending order of health score (percentage of healthy GPUs in the cell)
    C = free_list[k][0]
    If C is bad:
      If there exists a level k’ > k such that num_splittable[k’] > 0:
        k’ = the minimum level s.t. num_splittable[k’] > 0
        Split a level-k’ cell until we get a level-k cell, and add the new level-k cells to free_list[k]
      Else: // Splitting not possible
        // Expose the bad cell with most free GPUs to the requesting VC
        Return (C, Failure)
  Return (C, Success)
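
A runnable Python sketch of the retry loop is given below, under several stated assumptions. BuddyAlloc is not modelled: the sketch starts from the point where the bad cell has already been put back into the free list. `num_splittable` is taken as an input and is decremented only at the level where a split starts; that bookkeeping, the `Cell` model (one boolean per GPU), and the choice to split the healthiest cell at each step are all assumptions of this example, not something the thread specifies.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Cell:
    # Hypothetical cell model: one flag per GPU, True means the GPU is good.
    level: int
    gpu_ok: Tuple[bool, ...]

    @property
    def health(self) -> float:
        return sum(self.gpu_ok) / len(self.gpu_ok)

    @property
    def healthy(self) -> bool:
        return all(self.gpu_ok)


def split(cell: Cell, parts: int) -> List[Cell]:
    """Split a cell into `parts` equal child cells one level lower."""
    size = len(cell.gpu_ok) // parts
    return [
        Cell(cell.level - 1, cell.gpu_ok[i * size:(i + 1) * size])
        for i in range(parts)
    ]


def allocate(
    k: int,
    free_list: Dict[int, List[Cell]],   # must contain a key for every level
    num_splittable: Dict[int, int],     # assumed to be maintained by the caller
    huc: Dict[int, int],                # huc[i]: level-i cells per level-(i+1) cell
) -> Tuple[Optional[Cell], bool]:
    """Return (cell, success) for a level-k request."""
    while True:
        free_list[k].sort(key=lambda c: c.health, reverse=True)
        if free_list[k] and free_list[k][0].healthy:
            return free_list[k].pop(0), True
        # Minimum level above k that is splittable and has a free cell to split.
        candidates = [
            lv for lv in sorted(free_list)
            if lv > k and num_splittable.get(lv, 0) > 0 and free_list[lv]
        ]
        if not candidates:
            # Splitting not possible: expose the best bad cell, if any.
            return (free_list[k][0] if free_list[k] else None), False
        lv = candidates[0]
        num_splittable[lv] -= 1
        cell = max(free_list[lv], key=lambda c: c.health)
        free_list[lv].remove(cell)
        # Keep splitting the healthiest child until we reach level k.
        while cell.level > k:
            children = sorted(
                split(cell, huc[cell.level - 1]),
                key=lambda c: c.health,
                reverse=True,
            )
            cell, rest = children[0], children[1:]
            free_list[cell.level].extend(rest)
        free_list[k].append(cell)
```

Each pass of the outer loop either returns or consumes one splittable cell, so the loop terminates once no level above k has both `num_splittable > 0` and a free cell.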

We need to prove the above algorithm's correctness and completeness.

[Correctness] The algorithm will not violate the sharing safety guarantee.

[Completeness] With this algorithm, there is no case in which a level-k cell request fails even though a healthy level-k cell could be obtained without violating sharing safety.

Proof:
The proof of correctness is straightforward, because every split at level-i checks that num_splittable[i] >= 0.

The proof of completeness is by contradiction.

If the algorithm is not complete, there is a "Case" in which:

Allocating a level-k cell fails, yet there exists a level-k’ cell (k’ >= k) such that num_splittable[k’] > 0 and the cell contains at least one level-k cell.

If k’ = k, it is impossible that num_splittable[k’] > 0 at level-k, because the algorithm always allocates the healthier cell first.

If k’ > k, the Case is impossible, because before failing the algorithm splits cells at every level k’’ (k’’ >= k) with num_splittable[k’’] > 0. No level with num_splittable[k’] > 0 remains, which contradicts the Case.

zhypku · 2020-08-10T13:58:26Z

Thanks a lot, Zhenhua!
Actually, I've been thinking recently that this algorithm is not necessarily only for fault tolerance. It is a general algorithm for breaking the buddy cell allocation rule without breaking safety.
In fact, this algorithm reaches the "necessary" side of cell allocation: buddy cell allocation is only sufficient, not necessary, whereas this algorithm reaches necessity through explicit safety checking.
Fault tolerance is just one example of a case where we want to break the buddy allocation rule for a while. Maybe we can try to extend it to other cases (not in the implementation, I mean; perhaps in the paper, if we can find some interesting use cases :) )
(Oh, but for the paper it seems to conflict with the simplicity principle of HiveD though 😂 anyway...)

hzhua · 2020-08-10T14:11:23Z

Yes. The safety check provides the necessary condition for allocation. A possible use case is to avoid unnecessary preemption of low-priority jobs. More generally, if we need to support customized physical scheduling policies, we need to provide an interface to this condition in some form (e.g., as a list of splittable/allocable cells at all levels).

fanyangCS · 2020-08-11T00:00:18Z

Agree that the safety-check provides a necessary condition of arbitrary cell allocation. Looks like we can generalize the algorithm and prove it necessary and sufficient, and looks like we can reuse the proof you guys did a long time ago.

fanyangCS changed the title ~~Handling cell containing hardware failure~~ Handling cells with hardware failure Jul 3, 2020

fanyangCS assigned abuccts Jul 3, 2020

fanyangCS added the enhancement New feature or request label Jul 3, 2020

abuccts closed this as completed Jul 15, 2020
