This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

[Hived]: Move Hived from OpenPAI to dedicated repo #4319

Merged: 2 commits merged on Mar 23, 2020
.github/workflows/lint.yml: 1 change (0 additions, 1 deletion)
@@ -41,6 +41,5 @@ jobs:
curl -L https://git.io/misspell | sudo bash -s -- -b /bin
- name: Check spelling
run: |
-rm -rf ./subprojects/GOPATH/src/github.com/microsoft/hivedscheduler/vendor/
rm -rf ./src/watchdog/GOPATH/src/github.com/microsoft/watchdog/vendor/
misspell -error .
docs/hivedscheduler/devops.md: 2 changes (1 addition, 1 deletion)
@@ -23,7 +23,7 @@ Update Scheduling Config, such as CRUD for virtual clusters and gpu types.
virtualClusters: <your_virtual_clusters_config>
```

-For how to config them, please check [Config HivedScheduler](../../subprojects/GOPATH/src/github.com/microsoft/hivedscheduler/doc/user-manual.md#Config)
+For how to config them, please check [Config HivedScheduler](https://github.com/microsoft/hivedscheduler/blob/master/doc/user-manual.md#config)

3. Push Config and Start Services
```bash
docs/system_architecture.md: 2 changes (1 addition, 1 deletion)
@@ -22,7 +22,7 @@ The failure rules can be updated on-the-fly by the cluster operaters. Wheneven n

OpenPAI provides comprehensive [monitoring tools](./grafana/README.md) to users and cluster admins for job and cluster monitoring. OpenPAI also monitors the status of key OpenPAI components in the cluster and is able to send [alerts](./alerting/README.md) (e.g., as in email) if pre-configed conditions have been triggered.

-OpenPAI is a modular platform, which is designed to enable various innovations. With the standard k8s scheduling API, OpenPAI introduces [HiveD](../subprojects/hivedscheduler/README.md), an optional but recommended scheduler designed for deep learning workloads in a multi-tenant GPU cluster. HiveD provides various advantages over standard k8s scheduler. For example, it introduces a notion of "virtual cluster", which allows a team of users to run workload in the virtual cluster as if they reserve a private, dedicated (smaller) GPU cluster.
+OpenPAI is a modular platform, which is designed to enable various innovations. With the standard k8s scheduling API, OpenPAI introduces [HiveD](https://github.com/microsoft/hivedscheduler), an optional but recommended scheduler designed for deep learning workloads in a multi-tenant GPU cluster. HiveD provides various advantages over standard k8s scheduler. For example, it introduces a notion of "virtual cluster", which allows a team of users to run workload in the virtual cluster as if they reserve a private, dedicated (smaller) GPU cluster.
HiveD's virtual cluster reserves GPU resource not only in terms of quota (i.e., number of GPU), but also in terms of **topology**. For example, with HiveD a virtual cluster can reserve a GPU node, or a rack of GPU nodes within the same InfiniBand domain, instead of a set of GPUs randomly scatters across the cluster. This is important to preserve the training speed for jobs within the virtual cluster.
With HiveD, OpenPAI also provides better topology-aware gang scheduling with no [resource starvation](https://en.wikipedia.org/wiki/Starvation_(computer_science)). HiveD also supports multi-priority jobs and job preemption.
