# design/model ci #9018

## Conversation

**Superjomn** (Contributor):

No description provided.


## Make Factor Tracking Extensible

There are some general factors including `train cost`, `validate cost` and `duration`, but more factors such as `memory usage`, `gpu_memory_usage` and so on should be added to the test.

**Contributor:**

+1 about tracking the memory usage.
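
To make this concrete, here is a minimal sketch of what a pluggable factor interface could look like. The shape (an `out_file` plus per-record tracking) is inferred from snippets quoted later in this thread; every name here is illustrative, not the design's final API.

```python
# Hypothetical sketch of an extensible factor interface; names are
# illustrative, not the finalized design.
class Factor(object):
    """Base class for one tracked factor (KPI), e.g. cost or duration."""

    def __init__(self, out_file):
        self.out_file = out_file
        self.records = []

    def add_record(self, value):
        # called by the model's test script during training/validation
        self.records.append(value)

    def persist(self):
        # one value per line, matching the plain-text format discussed below
        with open(self.out_file, 'w') as f:
            f.write('\n'.join(str(r) for r in self.records))


class GPUMemoryFactor(Factor):
    """Adding a new factor is then just a small subclass."""

    def __init__(self):
        super(GPUMemoryFactor, self).__init__('gpu_memory.txt')
```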


1. the incomplete coverage test cases.
2. poor performance update, currently, we have no performance tracker for each operator.
3. API changes, developers are likely to forget to update usages in some other repositories such as paddle/models.

**Contributor:**

There is another benefit:
We can point users to our benchmark models as "currently working examples".

**Superjomn** (author):

Agree, some classical models will be tested every time.

Other popular models might be tested every few days, and the ModelCI will generate a report about the precision and performance of the PaddlePaddle platform.

```python
valid_duration_factor = ValidDurationFactor()
train_memory_factor = TrainMemoryFactor()

tracking_factors = [train_duration, valid_duration_factor, train_meta_factor]
```

**Contributor:**

`train_duration` -> `train_duration_factor`

```python
# configuration of some model
import paddle
import meta
```

**Contributor:**

maybe a name better than `meta`? like `continuous_evaluation`

**Superjomn** (author):

good idea.

- sending email to `paddle-dev@baidu.com` including
  - error type
  - the details of the tracked abnormal factors
- update the error information and push to git

**Contributor:**

Can have more information, such as "current failed commit", "last success commit" and some history to show if it's flaky.
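
As an illustration of what such an enriched alert could carry, here is a sketch with hypothetical field names (none of these are fixed by the design doc):

```python
# Hypothetical alert payload; every field name here is illustrative.
alert = {
    'error_type': 'precision regression',
    'abnormal_factors': ['train_cost'],
    'current_failed_commit': 'abc1234',
    'last_success_commit': 'def5678',
    # a short pass/fail history helps judge whether the failure is flaky
    'history': ['pass', 'pass', 'fail', 'pass', 'fail'],
}

# this body could then be mailed to paddle-dev@baidu.com, e.g. via smtplib
body = '\n'.join('%s: %s' % (k, v) for k, v in sorted(alert.items()))
```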

## Persistence of log

The log of each execution should be stored somewhere; the simplest way is to use Git to maintain a versionable history.

**Contributor:**

Is it going to append or overwrite? I would prefer append, but I'm a bit worried about the git repo size.
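
For reference, the append-then-commit flow under discussion could be sketched as below (assuming the log lives in a dedicated repo checked out at `log_repo/`; the path and commit message are hypothetical):

```python
# Hypothetical append-and-commit flow for persisting one execution's log.
import subprocess

def persist_log(summary, repo_dir='log_repo'):
    # appending keeps the whole history readable in one file,
    # at the cost of unbounded repo growth
    with open(repo_dir + '/modelci.log', 'a') as f:
        f.write(summary + '\n')
    subprocess.check_call(['git', '-C', repo_dir, 'add', 'modelci.log'])
    subprocess.check_call(
        ['git', '-C', repo_dir, 'commit', '-m', 'modelci: record new run'])
```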

```
{success_or_not} {short summary of the cause}

execution duration: {overall_duration}
paddle code version: {commitid}
```

**Contributor:**

How about using a proto text format? In the future, we might want a dashboard to read the log and visualize it.

**Superjomn** (author) commented Mar 13, 2018:

Agree, proto can store a more complex data structure, but a plain text file with a specific file name seems enough to store the values of most KPIs.

For example, to store the data for train cost, a file called `./train.cost.txt` will be created with data as follows

```
1.2
0.99
0.4
```

each line a float number; this is not hard to load and visualize.

I wonder whether there is some KPI whose data structure is complex enough that plain text cannot easily store it.

Another question: I am not sure how to persist the history data. A GitHub repo is good for versioning, and plain text is human-friendly for analysis, but it cannot scale up.

A database scales well, but works like a black box, and would require the ModelCI itself to support analysis that accesses data from that black box.
@panyx0718

**panyx0718** (Contributor) commented Mar 14, 2018:

I think we can first save it to local disk. In the future, we might apply for some storage resource in the cloud.
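
For what it's worth, loading and visualizing the `./train.cost.txt` format described above takes only a few lines (the matplotlib part is just one possible way to plot it):

```python
# Load the one-float-per-line KPI file described above and plot it.
with open('./train.cost.txt') as f:
    costs = [float(line) for line in f if line.strip()]

import matplotlib.pyplot as plt
plt.plot(costs)
plt.xlabel('batch')
plt.ylabel('train cost')
plt.savefig('train_cost.png')
```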

To make the testing logic stable, the testable model should assure that

- fix the random seed to make result reproducible
- just run 10 or 20 batches, and each batch takes no more than 30 seconds to limit the overall execution time

**Contributor:**

I suspect 30 seconds is not enough. Loss could be spiky at the beginning. I would try 10~30 min and get the loss.

**Superjomn** (author):

I think we need eight models to cover the four application fields (image, audio, NLP, and CTR). Each model has at least two computation modes, such as CPU and GPU, so a full test means 8 × 2 = 16 runs, and takes from about 3 h (10 min/test) to 9 h on one free machine.

I wonder what the ideal duration for a full test is.
@panyx0718

**luotao1** (Contributor) commented Mar 14, 2018:

For model CI, should we enforce "Squash and merge" for the PR instead of "Merge pull request", which would reduce the number of commits for CI?

**abhinavarora** (Contributor) left a comment:

The design looks great overall. There are some language changes that I have suggested. Please correct them before merging.

@@ -0,0 +1,149 @@
# Model CI

The users occasionally found a negligible performance or precision issue between different Paddle versions. Though we have unit tests for each class and Travis-CI to ensures the precision of each operator, there is no any logic to ensure the model (a composition of several operators) works as reliable as the operators.

**Contributor:**

`found` -> `find`

**Superjomn** (author):

Thanks for pointing out the grammar issues.

A new repo has been opened for this project. I've fixed the grammar issues in PaddlePaddle/continuous_evaluation#1

@@ -0,0 +1,149 @@
# Model CI

The users occasionally found a negligible performance or precision issue between different Paddle versions. Though we have unit tests for each class and Travis-CI to ensures the precision of each operator, there is no any logic to ensure the model (a composition of several operators) works as reliable as the operators.

**Contributor:**

drop the *any* in `there is no any logic`


There are several conditions where an existing model will fail either in performance or precision:

1. the incomplete coverage test cases, such as lacking the test of precision.

**Contributor:**

This should be Incomplete coverage of test cases. For example, tests lacking precision check.


1. the incomplete coverage test cases, such as lacking the test of precision.
2. poor performance update, currently, we have no performance tracker for each operator.
3. API changes, developers are likely to forget to update usages in some other repositories such as paddle/models.

**Contributor:**

This sentence will be Developers generally forget to update API use in other repositories such as paddle/models.

```python
# write to file self.out_file
```

More factors can be integrated into the test framework, for example, a factor tracker which test the training duration can be added in the following way

**Contributor:**

tracker which test the -> tracker which tests the
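
For example, a training-duration tracker in that spirit might be sketched as follows; it leans on the hypothetical `Factor` base class sketched earlier in this thread, so treat the whole thing as illustrative rather than the design's final API:

```python
# Hypothetical duration tracker built on the illustrative Factor base class
# sketched earlier in this thread.
import time

class TrainDurationFactor(Factor):
    def __init__(self):
        super(TrainDurationFactor, self).__init__('train.duration.txt')
        self._start = None

    def begin(self):
        self._start = time.time()

    def end(self):
        # record the seconds elapsed for one timed training window
        self.add_record(time.time() - self._start)
```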


## Keep updating the baseline
The ModelCI will keep comparing the KPIs of the latest code with the last successful evaluated version,
if the current version has the KPIs better than baseline, update the baseline, otherwise ring an alarm.

**Contributor:**

has the KPIs better than -> has better KPIs than baseline
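
A minimal sketch of that compare-and-update loop for a single cost-like KPI (lower is better; other KPIs would flip the comparison, and everything here is illustrative):

```python
# Hypothetical baseline check for one KPI where lower is better.
def check_and_update_baseline(current, baseline):
    if current < baseline:
        return current, True    # better: promote to new baseline
    return baseline, False      # worse: keep baseline and ring an alarm

baseline, ok = check_and_update_baseline(current=0.98, baseline=1.02)
if not ok:
    print('ALARM: KPI regressed against the baseline')
```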


The models should be placed in `./models` directory, each has a sub-directory, and a `train.xsh` script to define how to run this model. After triggering the `train.xsh`, all the data of `tracking_factors` should be created.

For example, a normal model might have following logic

**Contributor:**

have the following logic

```
run_train_gpu
```
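
For concreteness, a hypothetical `train.xsh` under that layout might look like the sketch below (xonsh syntax; the model name and the flags passed to `train.py` are assumptions, not part of the design):

```python
# models/resnet50/train.xsh -- hypothetical sketch; all names are illustrative.
def run_train_cpu():
    # short, fixed-seed run so the result is reproducible and quick
    ![python train.py --device=CPU --batch_num=20 --seed=90]

def run_train_gpu():
    ![python train.py --device=GPU --batch_num=20 --seed=90]

run_train_cpu()
run_train_gpu()
# after these runs, all tracking-factor data files should exist
```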

To make the testing logic stable, the testable model should assure that

**Contributor:**

`assure` -> `ensure`
