design/model ci #9018
Conversation
contrib/modelci/README.md
## Make Factor Tracking Extensible

There are some general factors including `train cost`, `validate cost` and `duration`, but some more factors such as `memory usage`, `gpu_memory_usage` and so on should add in the test.
+1 about tracking the memory usage.
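For illustration, a minimal sketch of what a memory-usage tracker could look like if it follows the `out_file` pattern quoted later in this thread; the class name matches the `TrainMemoryFactor` referenced in the design, but the interface, method names, and file name here are assumptions:

```python
# Sketch only: the tracker interface is assumed, not fixed by the design doc.
import resource


class TrainMemoryFactor:
    """Tracks peak resident memory of the training process, one value per run."""

    def __init__(self, out_file="train.memory.txt"):
        self.out_file = out_file

    def record(self):
        # ru_maxrss is the peak resident set size (KB on Linux) of this process.
        peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        with open(self.out_file, "a") as f:
            f.write("%d\n" % peak_kb)
```

Tracking `gpu_memory_usage` would need a different probe (for example polling `nvidia-smi`), which is exactly why keeping the set of tracked factors extensible matters.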
contrib/modelci/README.md
1. the incomplete coverage test cases.
2. poor performance update, currently, we have no performance tracker for each operator.
3. API changes, developers are likely to forget to update usages in some other repositories such as paddle/models.
there is another benefit:
We can point users to our benchmark models as "currently working examples".
Agree, some classical models will be tested every time.
Some more popular models might be tested every few days, and the ModelCI will generate a report about the precision and performance of the PaddlePaddle platform.
contrib/modelci/README.md
valid_duration_factor = ValidDurationFactor()
train_memory_factor = TrainMemoryFactor()

tracking_factors = [ train_duration, valid_duration_factor, train_meta_factor ]
train_duration -> train_duration_factor
```python
# configuration of some model
import paddle
import meta
maybe a better name than 'meta'? like continuous_evaluation
good idea.
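If the module were renamed as suggested, the configuration header quoted above might read like this (a sketch only; the symbols exported by the module are assumptions):

```python
# configuration of some model (sketch with the suggested module name)
import paddle
import continuous_evaluation as ce

# the factors this model wants ModelCI to track (names assumed)
tracking_factors = [ce.TrainDurationFactor(), ce.ValidDurationFactor()]
```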
- sending email to `paddle-dev@baidu.com` including
  - error type
  - the details of the tracked abnormal factors
- update the error information and push to git
It could include more information, such as the "current failed commit", the "last success commit", and some history to show if it's flaky.
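For illustration, a hypothetical alert record carrying the extra fields suggested here; none of these field names or values come from the design doc:

```python
# Hypothetical alert payload (all names and values are placeholders).
alert = {
    "error_type": "precision regression",
    "abnormal_factors": {"train.cost": {"baseline": 0.99, "current": 1.40}},
    "current_failed_commit": "abc1234",
    "last_success_commit": "9f0e5d6",
    # a short pass/fail history makes it easier to tell a real regression from a flaky run
    "recent_history": ["pass", "pass", "fail", "pass", "fail"],
}
```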
## Persistence of log

The log of each execution should be stored somewhere,
the simplest way is to use Git to maintain a versionable history.
Is it going to append or overwrite? I would prefer append, but I'm a bit worried about the git repo size.
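A minimal sketch of the append-style persistence being discussed, assuming the logs live inside a Git work tree; the paths and commit message are made up for the example:

```python
import os
import subprocess


def append_and_commit(repo_dir, rel_path, line, message):
    """Append one log line and record it as a new commit, so history is never overwritten."""
    with open(os.path.join(repo_dir, rel_path), "a") as f:
        f.write(line.rstrip("\n") + "\n")
    subprocess.check_call(["git", "-C", repo_dir, "add", rel_path])
    subprocess.check_call(["git", "-C", repo_dir, "commit", "-m", message])


# e.g. append_and_commit("ce_logs", "train.cost.txt", "0.95", "record train cost for abc1234")
```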
{success_or_not} {short summary of the cause}

execution duration: {overall_duration}
paddle code version: {commitid}
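For illustration, filling that template from Python is straightforward; the field names below are taken from the quoted draft, while the example values are placeholders:

```python
def render_summary(success, cause, overall_duration, commit_id):
    """Produce one plain-text log entry in the format sketched above."""
    status = "SUCCESS" if success else "FAIL"
    return ("%s %s\n\n"
            "execution duration: %s\n"
            "paddle code version: %s\n") % (status, cause, overall_duration, commit_id)


print(render_summary(False, "train cost diverged from baseline", "1h32m", "abc1234"))
```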
How about using a proto text format? In the future, we might want a dashboard to read the log and visualize it.
Agree, proto can store a more complex data structure, but a plain text file with a specific file name seems enough to store the values of most KPIs.
For example, to store the data for train cost, a file called `./train.cost.txt` will be created with data as follows:
1.2
0.99
0.4
Each line is a float number, which is easy to load and visualize.
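As a sketch, reading such a per-KPI file back for comparison or plotting takes only a few lines (the file name follows the example above):

```python
# Load one float per line from the KPI file described above.
with open("./train.cost.txt") as f:
    costs = [float(line) for line in f if line.strip()]

print("latest cost: %.3f, best cost so far: %.3f" % (costs[-1], min(costs)))
```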
I wonder whether there is some KPI with a data structure so complex that plain text cannot hold it.
Another question: I am not sure how to persist the history data. A GitHub repo is good for versioning and a plain text file is human-friendly for analysis, but it cannot scale up.
A database scales well, but it works like a black box, which means the ModelCI would need to support some analysis that reads data back out of that black box.
@panyx0718
I think we can first save it to local disk. In the future, we might apply for some storage resource in the cloud.
contrib/modelci/README.md
To make the testing logic stable, the testable model should assure that

- fix the random seed to make result reproducible
- just run 10 or 20 batches, and each batch takes no more than 30 seconds to limit the overall execution time
I suspect 30 seconds is not enough. Loss could be spiky at the beginning. I would try 10~30 min and get the loss.
I think we need eight models to cover the four application fields (image, audio, NLP, and CTR). Each model has at least two computation modes such as CPU and GPU, so the time consumption for a full test is from 3 h (at 10 min/test) to 9 h on one free machine.
I wonder how long the ideal duration for a full test is.
@panyx0718
For model ci, should we enforce to
The design looks great overall. There are some language changes that I have suggested. Please correct them before merging.
@@ -0,0 +1,149 @@
# Model CI

The users occasionally found a negligible performance or precision issue between different Paddle versions. Though we have unit tests for each class and Travis-CI to ensures the precision of each operator, there is no any logic to ensure the model (a composition of several operators) works as reliable as the operators.
found -> find
Thanks for figuring out the grammar issues.
A new repo has been opened for this project. I've fixed the grammar issues in PaddlePaddle/continuous_evaluation#1
@@ -0,0 +1,149 @@
# Model CI

The users occasionally found a negligible performance or precision issue between different Paddle versions. Though we have unit tests for each class and Travis-CI to ensures the precision of each operator, there is no any logic to ensure the model (a composition of several operators) works as reliable as the operators.
drop the "any" in "there is no any logic"
There are several conditions where an existing model will fail either in performance or precision:

1. the incomplete coverage test cases, such as lacking the test of precision.
This should be "Incomplete coverage of test cases. For example, tests lacking precision check."
1. the incomplete coverage test cases, such as lacking the test of precision.
2. poor performance update, currently, we have no performance tracker for each operator.
3. API changes, developers are likely to forget to update usages in some other repositories such as paddle/models.
This sentence should be "Developers generally forget to update API use in other repositories such as paddle/models."
# write to file self.out_file
```

More factors can be integrated into the test framework, for example, a factor tracker which test the training duration can be added in the following way
tracker which test the -> tracker which tests the
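For illustration, a duration tracker of that kind might look like the sketch below; the class name and interface are assumed, mirroring the memory-tracker sketch earlier in this thread:

```python
# Sketch only: the tracker interface is assumed, not fixed by the design doc.
import time


class TrainDurationFactor:
    """Records wall-clock training duration, one value per run."""

    def __init__(self, out_file="train.duration.txt"):
        self.out_file = out_file
        self._start = None

    def start(self):
        self._start = time.time()

    def stop(self):
        duration = time.time() - self._start
        with open(self.out_file, "a") as f:
            f.write("%f\n" % duration)
```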
## Keep updating the baseline
The ModelCI will keep comparing the KPIs of the latest code with the last successful evaluated version,
if the current version has the KPIs better than baseline, update the baseline, otherwise ring an alarm.
has the KPIs better than -> has better KPIs than baseline
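A minimal sketch of that compare-then-update rule, assuming every KPI is lower-is-better (which may not hold for all factors) and that the baseline values are already available as plain numbers; all names and values below are assumptions:

```python
# Assumed example values; the real KPIs come from the tracking_factors output files.
baseline_kpis = {"train.cost": 0.99, "train.duration_sec": 600.0}
current_kpis = {"train.cost": 0.95, "train.duration_sec": 640.0}

# Lower is better for both KPIs in this sketch.
worse = [name for name, base in baseline_kpis.items() if current_kpis[name] > base]

if worse:
    print("ALARM: regression in", worse)      # ring an alarm, keep the old baseline
else:
    baseline_kpis = dict(current_kpis)        # every KPI held or improved: new baseline
    print("baseline updated")
```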
The models should be placed in `./models` directory, each has a sub-directory, and a `train.xsh` script to define how to run this model. After triggering the `train.xsh`, all the data of `tracking_factors` should be created.

For example, a normal model might have following logic
have the following logic
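A sketch of how ModelCI might walk the `./models` directory and trigger each model's `train.xsh`; invoking the script with `bash` is an assumption (the `.xsh` extension suggests xonsh), and everything beyond the layout described above is made up:

```python
import os
import subprocess

MODELS_DIR = "./models"

for name in sorted(os.listdir(MODELS_DIR)):
    model_dir = os.path.join(MODELS_DIR, name)
    if not os.path.isfile(os.path.join(model_dir, "train.xsh")):
        continue
    # After this returns, the model is expected to have written the data files
    # for all of its tracking_factors into its own sub-directory.
    subprocess.check_call(["bash", "train.xsh"], cwd=model_dir)
```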
run_train_gpu
```

To make the testing logic stable, the testable model should assure that
assure -> ensure
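A sketch of those two constraints in model-config form; the seed value, the batch cap, and the training-step callback are assumptions, and the framework's own seeding call (which depends on the Paddle version) is only noted in a comment:

```python
import random

import numpy as np

SEED = 90          # any fixed value works; fixing it makes the run reproducible
MAX_BATCHES = 20   # cap from the quoted guideline ("run 10 or 20 batches")

random.seed(SEED)
np.random.seed(SEED)
# The framework seed should be fixed as well; the exact Paddle call is version-dependent.


def bounded_train_loop(reader, train_one_batch):
    """Run at most MAX_BATCHES batches of a caller-supplied training step."""
    for batch_id, batch in enumerate(reader()):
        if batch_id >= MAX_BATCHES:
            break
        train_one_batch(batch)
```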