This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

alert on unhealthy gpu #2209

Merged
xudifsd merged 8 commits into master from dixu/unhealthy-gpu
Mar 4, 2019

Member

@xudifsd xudifsd commented Feb 26, 2019

Fixes #2192 (work in progress).

  • zombie process
  • gpu memory leak
  • gpu used by external process
  • gpu ecc error
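As a rough illustration of how these four cases could be told apart, here is a minimal Python sketch. The `GpuSnapshot` fields, the `classify` helper, the 20 MiB leak threshold, and the working definition of each case are all assumptions made for illustration; none of them is taken from this PR's code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class GpuSnapshot:
    """Hypothetical per-GPU sample; field names are illustrative only."""
    minor: int                       # GPU index on the node
    used_mem_mib: int                # memory currently allocated on the GPU
    pids: List[int] = field(default_factory=list)  # processes using the GPU
    ecc_double_bit_errors: int = 0   # uncorrectable ECC error count


def classify(gpu: GpuSnapshot,
             pid_to_container: Dict[int, str],
             zombie_containers: Set[str],
             leak_threshold_mib: int = 20) -> List[str]:
    """Return the unhealthy conditions seen on one GPU (empty if healthy).

    Assumed definitions:
      - ecc_error:        any uncorrectable ECC error was reported
      - memory_leak:      memory is in use but no process is attached
      - external_process: a process on the GPU belongs to no known container
      - zombie_process:   a process on the GPU belongs to a container
                          already known to be a zombie
    """
    problems: List[str] = []

    def report(name: str) -> None:
        # Report each condition at most once per GPU.
        if name not in problems:
            problems.append(name)

    if gpu.ecc_double_bit_errors > 0:
        report("ecc_error")
    if not gpu.pids and gpu.used_mem_mib > leak_threshold_mib:
        report("memory_leak")
    for pid in gpu.pids:
        container = pid_to_container.get(pid)
        if container is None:
            report("external_process")
        elif container in zombie_containers:
            report("zombie_process")
    return problems
```

Each returned label could then be exported as a metric for an alerting rule to watch. How the PR actually wires this up is not shown in this conversation.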

@xudifsd xudifsd requested a review from mzmssg February 26, 2019 07:08

coveralls commented Feb 26, 2019

Coverage Status

Coverage remained the same at 52.627% when pulling 815bb9c on dixu/unhealthy-gpu into b64ae5f on master.

@xudifsd xudifsd changed the title [WIP] alert on unhealthy gpu alert on unhealthy gpu Feb 28, 2019
Member Author

xudifsd commented Feb 28, 2019

@fanyangCS @mzmssg ready for review

for line in content.split("\n"):
    line = line.strip()
    if "pids" in line and "/docker/" in line:
        parts = line.split("/docker/")
Member

Could we also handle Kubernetes pods here? Then, if we switch to static pods, we won't need to change this code.

Member Author

How should we handle k8s pods?

Member

@mzmssg mzmssg Mar 1, 2019

@xudifsd
I couldn't find a clear definition of the k8s pod cgroup layout. We could add a TODO here as a reminder.
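The check discussed in this thread could be wrapped in a small helper, sketched below. It assumes cgroup v1 content as found in `/proc/<pid>/cgroup`, where a Docker-managed process has a line such as `12:pids:/docker/<container id>`. The function name `container_id_from_cgroup` is hypothetical, and the Kubernetes pod case is left as the TODO suggested above.

```python
from typing import Optional


def container_id_from_cgroup(content: str) -> Optional[str]:
    """Return the Docker container ID found in cgroup content, or None.

    `content` is expected to be the text of /proc/<pid>/cgroup (cgroup v1).
    """
    for line in content.split("\n"):
        line = line.strip()
        if "pids" in line and "/docker/" in line:
            parts = line.split("/docker/")
            # Expect exactly one "/docker/" separator followed by the ID.
            if len(parts) == 2 and parts[1]:
                return parts[1]
    # TODO: handle Kubernetes pod cgroup paths once their layout is confirmed.
    return None
```

A process whose cgroup content yields `None` is one that this sketch cannot map to a Docker container, which is the situation the reviewer is asking about for pods.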

@xudifsd xudifsd merged commit 50ebc74 into master Mar 4, 2019
@xudifsd xudifsd deleted the dixu/unhealthy-gpu branch March 4, 2019 09:56
sunqinzheng added a commit that referenced this pull request Mar 20, 2019
* add a dashboard in grafana to list all tasks in node (#2197)

* Fix format in issue templates (#2233)

Fix format in issue templates:
- remove trailing spaces
- change chinese colon into english

* Fix auto retries when out of memory. (#1108)

* Distinguish cgroup OOM from dmesg.

* Remove cgroup OOM detection

Make all OOM cause exiting by 5

* Exit 55 when OOM

* Refine homepage for new users (#2155)

Updated the first-level bullets to add more content for administrators and users who are new to OpenPAI or to computing platforms.

* Fix yarn container failed when docker container exited quickly. (#2256)

* REST server: remove expires in JWT payload of unit test (#2263)

* Deploy: add explicit config field in webportal  plugin (#2251)

* Deploy: add explicit config field in webportal  plugin

* Fix json.dumps

* t

* fix

* Update PLUGINS.md

* Update webportal.md

* alert on unhealthy gpu (#2209)

* Pylon: fix double start query in yarn redirect (#2258)

* Pylon: fix double start query in yarn redirect

* Hide debug info in docker-compose.yaml

* adapt user transfer script to new config (#2266)

* Webportal: add pai-version attribute to <pai-plugin> (#2245)

* Webportal: add pai-version attribute to <pai-plugin>

* Use preprocess to apply window.PAI_VERSION

* set version in layout.html

* Fix ib drivers bug (#2269)

* FIx ib installation script bug (#2271)

* [BUG] Fix hadoop ai build path (#2262)

* fix hadoop ai build bugs

* refine

* Web portal submit job: support init json from sessionStorage. (#2253)

* YARN and HDFS log persistence  (#2244)

* rm log persist

* change log dir to host

* persist nm log to host

* resolve conflict

* persist namenode log

* persist data node log

* add comments

* move log path to common pai storage

* use twisted in yarn-exporter (#2273)

* [Job Debugging] Basic Implement Of Job Debugging. (#2272)

* Refine document for new user to submit job (#2278)

1. add new guidance to submit job for beginners.
2. refine homepage to connect with new guidance.
3. reorganize content of troubleshooting for next refactoring.
4. fix links in md files.

* [Drivers] Fix the issue when installing IB drivers.  (#2275)

* fix can not report zombie process using gpu error (#2279)

* fix external process error

* add debug log

* fix short ID and long ID do not match

* use time based atomic ref to exchange info between threads

* add test case for AtomicRef

* fix bug in file remove (#2288)

* fix hadoop build error (#2296)

* export vc/node related metrics from yarn (#2289)

* 720

* open hdfs explorer in view container
enable tslint rule "ordered-imports"

* add tslint rule for indent

* add home button to hdfs explorer's navigation;
adjust octicon's color

* fix lint error

* [VS Code] Add job list (#2160)

* add job list view to pai extension

* [VS Code] joblist fix (#2185)

* eager load recent jobs when job submitted

* avoid eager getChildren, and let vscode treeview.reveal do it implicitly

* fix lint error
sunqinzheng added a commit that referenced this pull request Mar 28, 2019
qfyin pushed a commit that referenced this pull request Apr 1, 2019
* add installation guide for VS code extension (#2223)

* add installation guide for VS code extension

* [VS Code] view container (#2301)

* [VS Code] default to generate jsonc job config file  (#2368)

* 720

* open hdfs explorer in view container
enable tslint rule "ordered-imports"

* add tslint rule for indent

* add home button to hdfs explorer's navigation;
adjust octicon's color

* fix lint error

* [VS Code] Add job list (#2160)

* add job list view to pai extension

* [VS Code] joblist fix (#2185)

* eager load recent jobs when job submitted

* avoid eager getChildren, and let vscode treeview.reveal do it implicitly

* default to generate jsonc job config file

* [VS Code] Refine error messages; Fix Cluster Explorer's bug

* [VS Code] changelog and readme (#2429)

* 539

* 712

* 452

* [VS Code] v0.11 compatible issue (#2457)

* 536

* 600

* [VS Code] fix cluster explorer's right-click menu (#2463)
Development

Successfully merging this pull request may close these issues.

Need alarms for unhealthy GPU cases
3 participants