-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add logs and fix a bug #5074
add logs and fix a bug #5074
Conversation
go/master/client.go
Outdated
if err.Error() == ErrPassAfter.Error() { | ||
// to prevent too many logs | ||
if i%60 == 0 { | ||
log.Debug(fmt.Sprintf("getTask passID:%d error.", passID), log.Ctx{"error": err}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log.Debugf
or just log.Debug(a, b)
will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
go/master/client.go
Outdated
if i%60 == 0 { | ||
log.Debug("getTask of passID error.", | ||
log.Ctx{"error": err, "passID": passID}) | ||
i = 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why here is not i = 0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
确实。Done。
go/master/client.go
Outdated
@@ -132,21 +132,19 @@ func (c *Client) getRecords(passID int) { | |||
break | |||
} | |||
|
|||
i := 0 | |||
if err.Error() == ErrPassAfter.Error() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this line is still needed, or else the other unknown error may also cause sleep and wait.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// if err.Error() == ErrPassAfter.Error()
// wait util last pass finishes
// if other error such as network error
// wait to reconnect or task time out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ErrPassAfter
我理解应该是明确要等待。如果是其他不可恢复的错误,这里应该记录日志并且panic之类的,不然可能导致job hang住?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
可以从几个方面来考虑:
- 从接口来说,
pass_num
是一个全局量,如果出现错误就退出的话,client
会去尝试下一个pass_num
. https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/reader/creator.py#L113 - 从流程上来说,
task
应该得到这个pass_num
的数据,如果出错,而错误不是ErrPassBefore
ErrNoMoreAvailable
ErrAllTaskFailed
,他应该继续等待。如果不hang住,客户端的动作应该是?我觉得还是重试。
time.Sleep(time.Second * 3) | ||
continue | ||
|
||
if i%60 == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe not use i
and the magic value 60
, just log error every time it happens?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
那样日志会很多。满屏都是一个日志。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
原来如此,好的。
go/master/client_test.go
Outdated
@@ -117,6 +117,9 @@ func TestNextRecord(t *testing.T) { | |||
if e != nil { | |||
panic(e) | |||
} | |||
|
|||
c.StartGetRecords(100) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
只是为了做什么的,能否在文件里comment说明一下?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个当时是想通过调用c.StartGetRecords(100)
来观察日志打印了,而且没有打印太多
,但现在发现这个放到unittest中也不能捕获不符合要求的错误,只能观察,所以还是去掉吧。
@@ -107,7 +107,12 @@ def cloud_reader(paths, etcd_endpoints, timeout_sec=5, buf_size=64): | |||
import cPickle as pickle | |||
import paddle.v2.master as master | |||
c = master.client(etcd_endpoints, timeout_sec, buf_size) | |||
c.set_dataset(paths) | |||
|
|||
if isinstance(paths, basestring): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^_^
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@@ -34,5 +34,6 @@ def test_check_grad(self): | |||
self.check_grad(['X', 'C_prev'], ['C', 'H']) | |||
|
|||
|
|||
if __name__ == "__main__": | |||
unittest.main() | |||
# TODO(gongwb):fix CI error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change should not exist in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Fix https://github.com/PaddlePaddle/cloud/wiki/Demos'-common-issues#whole-job-is-waiting-but-doesnt-print-anything
Fix #5075
Fix #5078