This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Diawang/dockercleaner #2119

Merged
merged 33 commits into from
Feb 13, 2019

Conversation

Member

@wangdian wangdian commented Feb 1, 2019

Clean logic V0.1

  1. The cleaner checks usage of Docker's disk every 60 seconds (configurable). If usage is above 94% (configurable), it stops the container that uses the most disk space. A white list prevents system containers from being killed.
  2. It sends SIGUSR1 (10) to the container as the termination signal, and the container exits with code 1.
  3. The related job fails, and we can track the reason in the job logs.
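
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names (`usage_percent`, `pick_victim`, `kill_command`, `check_once`), the name-to-size mapping, and the prefix-based white list check are all assumptions. Only the 60-second interval, the 94% threshold, and signal 10 come from the description.

```python
import shutil

CHECK_INTERVAL_SECONDS = 60   # from the description; configurable
USAGE_THRESHOLD_PERCENT = 94  # from the description; configurable
KILL_SIGNAL = 10              # SIGUSR1 on common Linux architectures


def usage_percent(path):
    """Return the used share of the filesystem holding `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def pick_victim(container_sizes, white_list):
    """Pick the largest container whose name matches no white list prefix.

    `container_sizes` maps container name -> bytes used. This input is
    hypothetical; how the sizes are collected is out of scope here.
    Returns None when every container is protected.
    """
    candidates = {
        name: size
        for name, size in container_sizes.items()
        if not name.startswith(tuple(white_list))
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


def kill_command(container_name):
    """Build (but do not run) the docker CLI call that delivers the signal."""
    return ["docker", "kill", "--signal", str(KILL_SIGNAL), container_name]


def check_once(path, container_sizes, white_list):
    """One cleaner iteration: return the command to run, or None."""
    if usage_percent(path) <= USAGE_THRESHOLD_PERCENT:
        return None
    victim = pick_victim(container_sizes, white_list)
    return kill_command(victim) if victim else None
```

A real loop would call `check_once` every `CHECK_INTERVAL_SECONDS` and pass the returned command to something like `subprocess.run`. Keeping the decision separate from the side effect makes the selection logic easy to test without Docker.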

@coveralls

Coverage Status

Coverage decreased (-0.01%) to 52.904% when pulling 3fbc7e8 on diawang/dockercleaner into 746ccb5 on master.

@coveralls

coveralls commented Feb 1, 2019

Coverage Status

Coverage decreased (-0.1%) to 52.794% when pulling 1353fc9 on diawang/dockercleaner into 746ccb5 on master.

src/cleaner/cleaner_main.py (two outdated review threads, resolved)
@scarlett2018 scarlett2018 requested review from scarlett2018 and removed request for scarlett2018 February 2, 2019 01:48


# Clean logic v1: kill largest container
white_list = ["k8s_kube", "k8s_pylon", "k8s_zookeeper", "k8s_rest-server", "k8s_yarn", "k8s_hadoop", "k8s_job-exporter", "k8s_watchdog", "k8s_grafana", "k8s_node-exporter", "k8s_webportal", "k8s_prometheus", "k8s_nvidia-drivers", "k8s_etcd-container", "k8s_apiserver-container", "k8s_docker-cleaner", "kubelet"]
Member

@fanyangCS
It seems we should add dev-box here until dev-box management is complete.

Member Author

Please list the restrictions that the white list should follow. It is best to raise these questions in the design phase, not in PR review.



# Clean logic v1: kill largest container
white_list = ["k8s_kube", "k8s_pylon", "k8s_zookeeper", "k8s_rest-server", "k8s_yarn", "k8s_hadoop", "k8s_job-exporter", "k8s_watchdog", "k8s_grafana", "k8s_node-exporter", "k8s_webportal", "k8s_prometheus", "k8s_nvidia-drivers", "k8s_etcd-container", "k8s_apiserver-container", "k8s_docker-cleaner", "kubelet"]
Member

@mzmssg mzmssg Feb 2, 2019

I don't remember us having any container whose name starts with k8s_yarn.

Either way, I think you could simply treat anything prefixed with k8s_, plus kubelet, as our services.

Member

Sorry, I forgot the yarn exporter.

Member Author

As concluded in the meeting, we decided to list all our core services here rather than filter by the k8s prefix alone.
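
To make the trade-off in this thread concrete, here is a small sketch comparing the two filters. It assumes the cleaner matches container names by prefix, which this page does not confirm. `is_protected`, `unprotected`, and the sample container names are hypothetical, and the explicit list is shortened from the one in the diff.

```python
# Shortened copy of the explicit white list from the diff.
CORE_SERVICES = ["k8s_kube", "k8s_rest-server", "k8s_docker-cleaner", "kubelet"]

# The blanket alternative suggested in review: any k8s_ container, plus kubelet.
BLANKET_PREFIXES = ["k8s_", "kubelet"]


def is_protected(container_name, prefixes):
    """True if the container name starts with any of the given prefixes."""
    return container_name.startswith(tuple(prefixes))


def unprotected(container_names, prefixes):
    """Return the names the cleaner would be allowed to kill."""
    return [name for name in container_names if not is_protected(name, prefixes)]


if __name__ == "__main__":
    # Hypothetical container names, for illustration only.
    names = ["k8s_kube-proxy_node1", "k8s_dev-box_example", "user_job_container"]
    print("explicit list:", unprotected(names, CORE_SERVICES))
    print("blanket prefix:", unprotected(names, BLANKET_PREFIXES))
```

Under the explicit list, a container such as the hypothetical `k8s_dev-box_example` stays killable until its prefix is added. Under the blanket prefix, it would be protected automatically, along with anything else named `k8s_`.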

2. Add interval as var
3. Kill docker, send signal
function kill_handler()
{
printf "%s %s\n" \
"[INFO]" "Docker container killed due to disk pressure. If your job needs large disk space, please use HDFS or NFS to store your data."
Contributor

Container killed probably due to disk or memory pressure.

Member Author

After experimenting, I changed the signal to SIGUSR1 (10), and the container handles it as expected. So if signal 10 is trapped, it means the container was killed due to disk pressure.
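
The handler in this PR is a shell function. As a rough Python analogue (a swapped-in illustration, not the PR's code), a process can trap SIGUSR1, log the reason, and exit with code 1. The function names are hypothetical, and the message text is copied from the diff context above.

```python
import signal
import sys

DISK_PRESSURE_MESSAGE = (
    "[INFO] Docker container killed due to disk pressure. "
    "If your job needs large disk space, please use HDFS or NFS to store your data."
)


def kill_handler(signum, frame):
    """Log why the container is being stopped, then exit with code 1."""
    print(DISK_PRESSURE_MESSAGE, file=sys.stderr)
    sys.exit(1)


def install_kill_handler():
    """Register the handler for SIGUSR1; must be called from the main thread."""
    signal.signal(signal.SIGUSR1, kill_handler)
```

Because SIGUSR1 is reserved here for the cleaner, seeing this message in the job log distinguishes a disk-pressure kill from other terminations, which is the point made in the comment above.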


5 participants