This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

add script to generate reports for OpenPai cluster #2507

Merged
merged 22 commits into from
Jul 1, 2019

Conversation

xudifsd
Member

@xudifsd xudifsd commented Apr 4, 2019

Generate CSV reports for an OpenPai cluster.

The documentation is in the header of the script.

@xudifsd xudifsd requested a review from mzmssg April 4, 2019 09:47
@coveralls

coveralls commented Apr 4, 2019

Coverage increased (+0.8%) to 53.89% when pulling c3f3818 on dixu/vc-usage into 5a2d528 on master.

@mzmssg
Member

mzmssg commented Apr 4, 2019

Should we add a document?

@xudifsd
Member Author

xudifsd commented Apr 10, 2019

The intended usage will be

python3 reports.py refresh -y $YARN_URL -p $PROMETHEUS_URL -l $LAUNCHER_URL -d cluster.sqlite

and

python3 reports.py report -y $YARN_URL -p $PROMETHEUS_URL -l $LAUNCHER_URL -d cluster.sqlite

The refresh command should be run periodically, perhaps every 10 minutes.

I will write a doc on how to set up the cron job and on more advanced usage.
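As a rough illustration of the periodic refresh described above, here is a minimal Python sketch that assembles the command line and renders a crontab entry for it. Only the `refresh`/`report` subcommands and the `-y`, `-p`, `-l`, `-d` flags come from this comment; the helper names `build_reports_command` and `crontab_line` are hypothetical, and the 10-minute default is only the interval suggested here.

```python
import shlex


def build_reports_command(action, yarn_url, prometheus_url, launcher_url, db_path):
    """Return the argv list for one reports.py invocation (hypothetical helper)."""
    if action not in ("refresh", "report"):
        raise ValueError(f"unknown action: {action!r}")
    return [
        "python3", "reports.py", action,
        "-y", yarn_url,          # YARN URL
        "-p", prometheus_url,    # Prometheus URL
        "-l", launcher_url,      # Launcher URL
        "-d", db_path,           # SQLite database file
    ]


def crontab_line(argv, every_minutes=10):
    """Render a crontab entry that runs argv every `every_minutes` minutes."""
    if not 1 <= every_minutes <= 59:
        raise ValueError("every_minutes must be between 1 and 59")
    # shlex.join (Python 3.8+) quotes any argument that needs it.
    return f"*/{every_minutes} * * * * {shlex.join(argv)}"
```

In practice the entry would probably also need a `cd` into the directory that holds `reports.py`, or absolute paths, since cron does not run from that directory by default.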

@xudifsd
Member Author

xudifsd commented Apr 10, 2019

fixed #2127 and #2073

@xudifsd
Member Author

xudifsd commented Apr 10, 2019

may rely on #2449

@xudifsd xudifsd requested a review from scarlett2018 April 11, 2019 05:57
@scarlett2018
Member

@xudifsd I just made some inline changes to the first two paragraphs: https://github.com/Microsoft/pai/blob/07c464838e7db343045a96dcacdb2d5c10532ec9/docs/tools/how-to-setup-report-script.md. I will review the rest next Monday.

Member

@scarlett2018 scarlett2018 left a comment

Added some inline changes here for first 2 paragraphs: https://github.com/Microsoft/pai/pull/2507/files#diff-02f09b898fe248561eae9ff55b7e58d0.

Will review the rest next Monday.

Member

@scarlett2018 scarlett2018 left a comment

@xudifsd Feedback on the cluster_report_alert CSV:

  1. Can we format the start column as a time in the CSV? Likewise, the duration should be in d:h:m:s format for usability.

  2. There are several duplicated rows. What does that mean? Can we merge those duplicates?
    image

  3. I don't understand what a "cleaner-ds-2cfmk" instance is. Which node is it bound to? How can ops act after seeing this info? Can you help me understand more? The same questions apply to all the other alerts whose instance values are not IPs.
    image
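
To make the formatting request in point 1 concrete, here is a minimal Python sketch. It assumes the start value is stored as epoch seconds and the duration as a number of seconds; neither is confirmed in this thread. The function names and the default timestamp layout are hypothetical choices for illustration.

```python
from datetime import datetime, timezone


def format_start(epoch_seconds, layout="%Y-%m-%d %H:%M:%S"):
    """Render an epoch timestamp (assumed to be in seconds) as a UTC time string."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime(layout)


def format_duration(total_seconds):
    """Render a duration in seconds as d:h:m:s."""
    total_seconds = int(total_seconds)
    if total_seconds < 0:
        raise ValueError("duration must not be negative")
    days, rest = divmod(total_seconds, 86400)   # 86400 seconds per day
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{days}:{hours:02d}:{minutes:02d}:{seconds:02d}"
```

For example, a duration of 93784 seconds (1 day, 2 hours, 3 minutes, 4 seconds) would be rendered as `1:02:03:04`.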

@xudifsd
Member Author

xudifsd commented Apr 15, 2019

@scarlett2018

  1. You mean something like 2019/04/15-14:50:39?
  2. They are not duplicates, in the sense that not all labels are the same. For example, JobExporter has docker_collector and container_collector; if both hang, two alerts are triggered. They are on the same host, so they appear duplicated in the report if we only extract the IP. I did not find a better way to do this, because alerts have different labels and even different numbers of labels. Maybe we can keep only the alerts we wanted in the first place: PaiServicePodNotReady.
  3. This is the disk-cleaner pod. The original requirement from @fanyangCS was to get a rough sense of which PAI services are vulnerable to becoming not ready, so no IP is required.
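
The option in point 2 of keeping only PaiServicePodNotReady alerts could look roughly like the sketch below. This is not how `reports.py` works: the dict layout for an alert, the function name `keep_service_not_ready`, and the choice to collapse rows on the `(alertname, instance)` pair are assumptions made for illustration.

```python
def keep_service_not_ready(alerts, wanted="PaiServicePodNotReady"):
    """Keep one row per (alertname, instance) for the wanted alert only.

    `alerts` is assumed to be an iterable of dicts of alert labels,
    each with at least an "alertname" key (hypothetical layout).
    """
    seen = set()
    kept = []
    for alert in alerts:
        if alert.get("alertname") != wanted:
            continue  # drop every alert type we did not ask for
        key = (alert["alertname"], alert.get("instance"))
        if key in seen:
            continue  # same alert on the same instance was already reported
        seen.add(key)
        kept.append(alert)
    return kept
```

Filtering by alert name first sidesteps the problem that different alerts carry different label sets, at the cost of dropping every other alert type from the report.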

@scarlett2018 scarlett2018 requested a review from squirrelsc May 28, 2019 03:13

First, log into the node you choose, put the [script](../../src/tools/reports.py) somewhere, for example, I put it in directory `/home/core/report`, edit the crontab using

``` sh
Member

Currently it's more like a service than a tool. Why not make it an optional service, like alert-manager?

Member Author

Bin objects to this: if it is a service, we have to maintain its status, whereas if it is a tool, the users are responsible for that.

Member

I see, but it's not that user-friendly, and it might mess up host environments. Should we tell users how to uninstall this tool?

Member Author

Yes, maybe we should change it to a service in the future.

@mzmssg mzmssg self-requested a review June 25, 2019 05:30
@xudifsd xudifsd merged commit accf144 into master Jul 1, 2019
@xudifsd xudifsd deleted the dixu/vc-usage branch July 1, 2019 04:23