This repository has been archived by the owner on Jun 6, 2024. It is now read-only.
add script to generate reports for OpenPai cluster #2507
Merged

Commits (22):
* aa84f5b add script to generate reports for OpenPai cluster (xudifsd)
* 2d3288a fix according to review (xudifsd)
* 9d4c544 adding cluster (xudifsd)
* c574f04 allow to generate report with specified time (xudifsd)
* ed20c03 use different label when checking different alerts (xudifsd)
* cbe19a0 add part 4 (xudifsd)
* 49d3c52 add database and remove cluster (xudifsd)
* 87dc126 use MB in vc_report (xudifsd)
* 4f852ec add doc (xudifsd)
* cc26347 fix typo (xudifsd)
* 9bc2b23 change format, add more columns to alert.csv (xudifsd)
* 0154580 remove vc usage report (xudifsd)
* 08d763f add max_mem_usage to raw_job.csv (xudifsd)
* 03e285a update doc (xudifsd)
* 86e91d6 implement a web server to expose reports (xudifsd)
* b5f3fe8 update doc (xudifsd)
* 095172b implement gpu report (xudifsd)
* 4b6db7b use now as finished time if job is still running (xudifsd)
* 5c3bc25 proxy prometheus (xudifsd)
* b2b7cca use framework finished_time to query all frameworks in given time frame (xudifsd)
* b5e066d change according to exit spec (xudifsd)
* c3f3818 fix according to review (xudifsd)
# How to set up the report script

Many OpenPai cluster admins are interested in how their cluster is used and how it performs, who used the most or least resources, and so on. Developers of the OpenPai system are interested in what causes job failures, and in how to design and implement a system that can prevent such failures and avoid wasting cluster resources.

But since not everyone is interested in this report, we do not maintain it as a service; we merely provide a script that admins who are interested in the report can execute. This document describes what the script reports, and how to maintain the script and query its results.
## What the script will report

The report consists of four reports: `job`, `alert`, `raw_job` and `gpu`.
### job

This report gives per-user job statistics, including the final status, job count and job resources. It has the following columns:
* user: username in the OpenPai cluster
* vc: VC name in the OpenPai cluster
* total job info: sum over all jobs
* successful job info: jobs that finished with exit code 0
* failed job info: jobs that finished with a non-zero exit code
* stopped job info: jobs stopped by the user
* running job info: jobs that are still running
* waiting job info: jobs that are still waiting
Each job info field is a group of the following subcolumns:
* count: job count in this category
* elapsed time: total job running time in this category
* cpu second: vcore-seconds used by jobs in this category
* memory second: memory (GB)-seconds used by jobs in this category
* gpu second: gpu-seconds used by jobs in this category. For example, a job that runs with 2 GPUs for 300 seconds contributes 600 gpu-seconds
### alert

This report tells you which alerts were triggered in your cluster. The script can generate this report even if you did not set up an alert manager. Note that the Prometheus service deletes data once it is old enough; by default it retains only 15 days of data, so you may want to extend the retention period if you want accurate numbers in a monthly report.
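How to extend retention depends on your Prometheus version and how OpenPai launches it; as a rough sketch, for a standalone Prometheus 2.x binary the retention flag looks like this (the flag name differs in older versions):

``` sh
# Sketch for Prometheus 2.x: raise retention from the 15d default to 45 days.
# Adjust this flag wherever your deployment launches Prometheus.
prometheus --config.file=prometheus.yml --storage.tsdb.retention.time=45d
```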
It has the following columns:
* alert name: alert name as defined in Prometheus
* host_ip: the node from which this alert was triggered
* source: the actual component having the problem (its meaning differs between alerts)
* start: start time of this alert
* duration: how long (in seconds) this alert lasted
* labels: the original labels sent along with the alert
### raw_job

This report contains detailed per-job info; `job.csv` can be viewed as an aggregated statistic of this report.

The report has the following columns:
* user: username in the OpenPai cluster
* vc: VC name in the OpenPai cluster
* job name: job name in the OpenPai cluster
* start time: when the job started
* finished time: when the job finished; if the job is still running, this will have the value `1970/01/01`
* waiting time: how long (in seconds) this job was in waiting status before running, including the waiting time of retries. If the job is still running, this will be 0
* running time: how long this job has been in running status, including retries
* retries: the retry count of this job
* status: the status of the job; one of `WAITING`, `RUNNING`, `SUCCEEDED`, `STOPPED`, `FAILED` and `UNKNOWN`
* exit code: the exit code of the job; if the job is still running, this will be `N/A`
* cpu allocated: how many vcores were allocated to the job, including those allocated to the app master
* memory allocated: how much memory (GB) was allocated to the job, including the memory allocated to the app master
* max memory usage: maximum memory (GB) usage of this job; it will be `N/A` if Pai has no record of memory usage, perhaps because the job's running time was too short or due to a system error
* gpu allocated: how many gpu cards were allocated to the job
### gpu

This report covers the utilization of all GPUs in the cluster.

The report has the following columns:
* host_ip: the node where this gpu is installed
* gpu_id: gpu minor number within the node
* avg: average utilization during the report time frame
## Prerequisite

You should prepare a node that has access to the OpenPai services; the script needs to access the hadoop-resource-manager, framework-launcher and Prometheus deployed by OpenPai. This node does not need much memory and does not need GPU cards; you only need to make sure it will not restart frequently. Usually the master node of the OpenPai cluster is a good choice.

After you choose a node, please make sure you have the following software installed:
* python3
* requests library
* flask library
If your node runs Ubuntu, you can install this software using the following commands:

``` sh
sudo apt-get install -y python3 python3-pip
pip3 install -r $PAI_DIR/src/tools/reports_requirements.txt
```
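To confirm the libraries were installed correctly, you can try importing them:

``` sh
# Both imports should succeed without errors after the steps above.
python3 -c "import requests, flask; print('dependencies OK')"
```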
## How to set up

The [script](../../src/tools/reports.py) has three actions: `refresh`, `report` and `serve`.
The `refresh` action collects data from the hadoop-resource-manager and framework-launcher and saves it in a sqlite3 DB for later processing. The script needs to save the data because the hadoop-resource-manager does not retain job info for very long; if we do not fetch it and save it somewhere, we will not be able to generate correct reports. We recommend admins run this action every 10 minutes via a CRON job.
The `report` action queries job statistics from the sqlite3 DB and generates the job/raw_job/gpu csv files; it also fetches data from Prometheus to generate the alert report. You can execute this action whenever you want the reports.
The `serve` action starts an http server so the outside world can query the reports through the web server instead of through files.
Both `serve` and `report` need `refresh` to be called periodically to fetch data from the underlying sources.
First, log into the node you chose and put the [script](../../src/tools/reports.py) somewhere; for example, I put it in the directory `/home/core/report`. Then edit the crontab using
``` sh
crontab -e
```
This will open an editor with some documentation; paste the following content at the end of the file:
``` crontab
*/10 * * * * python3 /home/core/report/reports.py refresh -y $yarn_url -p $prometheus_url -l $launcher_url -d /home/core/report/cluster.sqlite >> /home/core/report/cluster.log 2>&1
```
Please replace `$yarn_url`, `$prometheus_url` and `$launcher_url` with your cluster's values. They should look like `http://$master:8088`, `http://$master:9091` and `http://$master:9086` respectively, where `$master` is the IP/hostname of your OpenPai master; please also make sure the entry stays on one line. It is good practice to execute the command once manually before putting it into the crontab.
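For example, a one-off manual run might look like this (a sketch, using the same `$master`-style placeholders as above):

``` sh
# Manual dry run of the refresh action; substitute your real URLs.
python3 /home/core/report/reports.py refresh \
    -y http://$master:8088 \
    -p http://$master:9091 \
    -l http://$master:9086 \
    -d /home/core/report/cluster.sqlite
```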
After finishing, save and exit the editor. You can then execute
``` sh
crontab -l
```
to view your current crontab. It should show what you just edited.
All available arguments and their meanings can be viewed by executing the script with the `-h` argument.
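For example:

``` sh
# Print the usage and the full argument list.
python3 /home/core/report/reports.py -h
```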
The script automatically deletes old data; by default it retains 6 months of data. If that is too much for you, for example if you only want to retain 1 month of data, you can add `-r 31` to the command above to tell the script to delete data older than 31 days.
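With that flag appended, the crontab entry from above might look like this (still all on one line):

``` crontab
*/10 * * * * python3 /home/core/report/reports.py refresh -y $yarn_url -p $prometheus_url -l $launcher_url -d /home/core/report/cluster.sqlite -r 31 >> /home/core/report/cluster.log 2>&1
```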
You have two options to get the reports: the `report` or the `serve` action.
### `report`

Whenever you want a report, you can log into the node again and execute the following command:
``` sh
python3 /home/core/report/reports.py report -y $yarn_url -p $prometheus_url -l $launcher_url -d /home/core/report/cluster.sqlite
```
By default, the script generates a monthly report, meaning it queries data from one month ago until now and uses that data to generate the reports. You can change the time range with the `--since` and `--until` arguments; for example, if you want the reports to cover from one month ago until one week ago, add these arguments:
``` sh
--since `date --date='-1 month' +"%s"` --until `date --date='-1 week' +"%s"`
```
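Put together, a full invocation with this custom time range might look like:

``` sh
# Generate reports covering from one month ago up to one week ago.
python3 /home/core/report/reports.py report \
    -y $yarn_url -p $prometheus_url -l $launcher_url \
    -d /home/core/report/cluster.sqlite \
    --since `date --date='-1 month' +"%s"` \
    --until `date --date='-1 week' +"%s"`
```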
### `serve`

Some external tools can query an http server directly, so you can start a serve process and issue an http request whenever you want a report, without having to log into the node and execute a command.
To set up the serve process, execute the following command:
``` sh
nohup python3 /home/core/report/reports.py serve -y $yarn_url -p $prometheus_url -l $launcher_url -d /home/core/report/cluster.sqlite > serve.log 2> serve.err.log &
```
This starts a process in the background listening on the default port 10240; you can specify the `--port` argument to change the default port.
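As a quick sanity check that the server is up, you can fetch a report locally (assuming the default port):

``` sh
# Fetch the first few lines of the job report from the local server.
curl -s http://localhost:10240/job | head -n 5
```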
With the http server set up, you can now get the reports, named like the csv files, via:
```
http://$IP:10240/job
http://$IP:10240/raw_job
http://$IP:10240/alert
http://$IP:10240/gpu
```
These endpoints all accept a `span` argument with value `day`, `week` or `month`, which generates the report over that time span. The default span is week. The report covers jobs that finished during the span or are still running.
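For example, to download the alert report for the last day (the query-string form below is an assumption, since the exact request format is not spelled out above):

``` sh
# Save the last day's alert report to a local csv file.
curl -o alert.csv "http://$IP:10240/alert?span=day"
```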
---

Review conversation:

> Currently it's more like a service than a tool; why not make it an optional service, like alert-manager?

> Bin objects to this: if it is a service, we have to maintain its status, whereas if it is a tool, the users are responsible for that.

> I see, but it's not that user friendly, and it might mess up host envs. Is it necessary to tell users how to uninstall this tool?

> Yes, maybe we should change it to a service in the future.