Jenkinsfile
Outdated
\\"command\\": \\"nodejs index 128\\",
\\"portList\\": []
}
]
}"
Hi @Gerhut,
Could you rebase on master and remove the changes to the Jenkinsfile?
It would be better to test your fix in your own environment.
Jenkinsfile
Outdated
\\"command\\": \\"nodejs index 128\\",
\\"portList\\": []
}
]
}"
Here too.
Sure, I'll revert the Jenkins test changes before this is merged.
@@ -234,7 +242,7 @@ ln -s /tmp/pai-root/log/$APP_ID/$CONTAINER_ID/DockerContainerDebug.log $LAUNCHER
docker pull {{{ jobData.image }}} \
|| { echo "Can not pull Docker image"; exit 1; }
docker run --name $docker_name \
--rm \
--detach \
So the Docker container won't be removed automatically.
Where do you clean up stopped containers?
Line 62 in exit_handler
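For readers following this thread: once `--rm` is dropped and the container runs with `--detach`, it stays on the host after it exits until something removes it explicitly. Below is a minimal sketch of that run, wait, remove pattern. It is not the code in this PR; `run_and_collect` and the `DOCKER` override variable are hypothetical names used only for illustration.

```shell
# Hypothetical hook: lets the docker binary be swapped out (e.g. for a stub).
DOCKER=${DOCKER:-docker}

# Start a detached container, block until it stops, remove it explicitly,
# and return the container's exit code to the caller.
run_and_collect() {
    name=$1
    shift
    "$DOCKER" run --name "$name" --detach "$@" >/dev/null || return 1
    # `docker wait` blocks until the container stops and prints its exit code.
    rc=$("$DOCKER" wait "$name")
    # Without --rm the stopped container has to be removed by hand.
    "$DOCKER" rm "$name" >/dev/null
    return "$rc"
}

# Example (requires a Docker daemon):
#   run_and_collect demo-job alpine sh -c 'exit 3'; echo "exit code: $?"
```

Because the container still exists between `wait` and `rm`, its state (for example `.State.OOMKilled`) can be inspected at that point, which is not possible with `--rm`.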
You can also refer to the k8s launcher:
Make all OOM cause exiting by 5
Please treat it as a higher priority.
printf "[DEBUG] Write exit code $rc to file /var/lib/hadoopdata/nm-local-dir/nmPrivate/$APP_ID/$CONTAINER_ID/$CONTAINER_ID.pid.exitcode.\n"
docker container rm $docker_name
@Gerhut
We can't guarantee that exit_handler will be executed; all the code here is best-effort, so the container might be left on the host.
You're right, but it is an abnormal state when exit_handler is not executed. Removing leftover containers should rely on the cleaner.
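As an illustration of what such a best-effort sweep might look like (the actual cleaner may work quite differently), here is a minimal sketch that removes exited containers whose names match a prefix. `remove_exited_containers`, the `DOCKER` override variable, and the prefix argument are hypothetical and not taken from this PR.

```shell
# Hypothetical hook: lets the docker binary be swapped out (e.g. for a stub).
DOCKER=${DOCKER:-docker}

# Remove every exited container whose name matches the given prefix.
remove_exited_containers() {
    prefix=$1
    # --all includes stopped containers; --quiet prints only container IDs.
    ids=$("$DOCKER" ps --all --quiet --filter status=exited --filter "name=$prefix")
    if [ -z "$ids" ]; then
        return 0
    fi
    # Word splitting on $ids is intentional: one container ID per word.
    "$DOCKER" rm $ids
}

# Example (requires a Docker daemon):
#   remove_exited_containers "my-job-"
```

Filtering on `status=exited` leaves running jobs alone, so a sweep like this only picks up containers that an interrupted exit_handler failed to remove.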
@Gerhut, hi, I'm working on a solution for the same issue. I think some unaccounted-for circumstances can occur, which we should handle there. Let's see.
I suggest considering this variant:
function get_docker_container_id()
{
  docker inspect --format '{{.Id}}' "$1"
}
function check_docker_oom_killed()
{
  # Prints "true" or "false"
  docker inspect --format '{{.State.OOMKilled}}' "$1"
}
function check_kernel_oom_killer()
{
  # $1 is the full container ID; succeeds if the kernel log mentions a kill in its cgroup
  dmesg | grep -i "kill" | grep "/docker/$1"
}
function exit_handler()
{
  rc=$?
  printf "[DEBUG] EXIT signal received in yarn container, performing clean up action...\n"
  docker stop --time=${PAI_GRACEFUL_EXIT_TIMEOUT:-35} "$docker_name" >/dev/null
  # Here: container (i.e. init process) or the user task has been killed
  # 137, 143 - unhandled KILL and TERM signals (128 + 9, 128 + 15)
  if [[ "$rc" = "137" || "$rc" = "143" || "$(check_docker_oom_killed "$docker_name")" = "true" ]]; then
    if [ "$(check_docker_oom_killed "$docker_name")" = "true" ] || \
       check_kernel_oom_killer "$(get_docker_container_id "$docker_name")" >/dev/null; then
      echo "[ERROR] One of the container processes has been killed by the OOM killer"
      rc=55 # Some undefined erroneous exit code
    fi
  fi
  docker rm -f "$docker_name" >/dev/null
  echo $rc > "/var/lib/hadoopdata/nm-local-dir/nmPrivate/$APP_ID/$CONTAINER_ID/$CONTAINER_ID.pid.exitcode"
  exit $rc
}
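The 137 and 143 checks in the variant above rely on the shell convention that a process terminated by signal N reports exit status 128 + N. A small sketch of that mapping, assuming a POSIX shell whose `kill -l` accepts a signal number; `signal_from_exit_code` is a hypothetical helper and not part of the proposal:

```shell
# Map an exit status to the name of the terminating signal, if any.
signal_from_exit_code() {
    rc=$1
    if [ "$rc" -gt 128 ] && [ "$rc" -le 192 ]; then
        # `kill -l N` prints the name of signal number N.
        kill -l "$((rc - 128))" 2>/dev/null || echo "unknown"
    else
        echo "none"
    fi
}

signal_from_exit_code 137   # KILL (128 + 9)
signal_from_exit_code 143   # TERM (128 + 15)
signal_from_exit_code 1     # none
```

An exit status of 137 on its own only says the process received SIGKILL; it does not say who sent it, which is why the variant also consults `.State.OOMKilled` and the kernel log before reporting an OOM.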
@zhiltsov-max, great suggestion! We have also been thinking about some of the points you raised previously (e.g., oom-score). We plan to do some prototyping after this sprint (March release), so this PR is likely a short-term solution. Meanwhile, we would love to hear your comments on #2195. Last but not least, we welcome your contributions.
* add a dashboard in grafana to list all tasks in node (#2197)
* Fix format in issue templates (#2233) Fix format in issue templates: - remove trailing spaces - change chinese colon into english
* Fix auto retries when out of memory. (#1108)
* Distinguish cgroup OOM from dmesg.
* Remove cgroup OOM detection Make all OOM cause exiting by 5
* Exit 55 when OOM
* Refine homepage for new users (#2155) Updated first level bullets, to add more content for administrators and users, who is first time touch OpenPAI, or computing platform.
* Fix yarn container failed when docker container exited quickly. (#2256)
* REST server: remove expires in JWT payload of unit test (#2263)
* Deploy: add explicit config field in webportal plugin (#2251)
* Deploy: add explicit config field in webportal plugin
* Fix json.dumps
* t
* fix
* Update PLUGINS.md
* Update webportal.md
* alert on unhealthy gpu (#2209)
* Pylon: fix double start query in yarn redirect (#2258)
* Pylon: fix double start query in yarn redirect
* Hide debug info in docker-compose.yaml
* adapt user transfer script to new config (#2266)
* Webportal: add pai-version attribute to <pai-plugin> (#2245)
* Webportal: add pai-version attribute to <pai-plugin>
* Use preprocess to apply window.PAI_VERSION
* set version in layout.html
* Fix ib drivers bug (#2269)
* FIx ib installation script bug (#2271)
* [BUG] Fix hadoop ai build path (#2262)
* fix hadoop ai build bugs
* refine
* Web portal submit job: support init json from sessionStorage. (#2253)
* YARN and HDFS log persistence (#2244)
* rm log persist
* change log dir to host
* persist nm log to host
* resolve conflict
* persist namenode log
* persist data node log
* add comments
* move log path to common pai storage
* use twisted in yarn-exporter (#2273)
* [Job Debugging] Basic Implement Of Job Debugging. (#2272)
* Refine document for new user to submit job (#2278) 1. add new guidance to submit job for beginners. 2. refine homepage to connect with new guidance. 3. reorganize content of troubleshooting for next refactoring. 4. fix links in md files.
* [Drivers] Fix the issue when installing IB drivers. (#2275)
* fix can not report zombie process using gpu error (#2279)
* fix external process error
* add debug log
* fix short ID and long ID do not match
* use time based atomic ref to exchange info between threads
* add test case for AtomicRef
* fix bug in file remove (#2288)
* fix hadoop build error (#2296)
* export vc/node related metrics from yarn (#2289)
* 720
* open hdfs explorer in view container enable tslint rule "ordered-imports"
* add tslint rule for indent
* add home button to hdfs explorer's navigation; adjust octicon's color
* fix lint error
* [VS Code] Add job list (#2160)
* add job list view to pai extension
* [VS Code] joblist fix (#2185)
* eager load recent jobs when job submitted
* avoid eager getChildren, and let vscode treeview.reveal do it implicitly
* fix lint error
* add installation guide for VS code extension (#2223)
* add installation guide for VS code extension
* [VS Code] view container (#2301)
* [VS Code] default to generate jsonc job config file (#2368)
* 720
* open hdfs explorer in view container enable tslint rule "ordered-imports"
* add tslint rule for indent
* add home button to hdfs explorer's navigation; adjust octicon's color
* fix lint error
* [VS Code] Add job list (#2160)
* add job list view to pai extension
* [VS Code] joblist fix (#2185)
* eager load recent jobs when job submitted
* avoid eager getChildren, and let vscode treeview.reveal do it implicitly
* default to generate jsonc job config file
* [VS Code] Refine error messages; Fix Cluster Explorer's bug
* [VS Code] changelog and readme (#2429)
* 539
* 712
* 452
* [VS Code] v0.11 compatible issue (#2457)
* 536
* 600
* [VS Code] fix cluster explorer's right-click menu (#2463)
Opened for CI.