E2E for TensorFlow Integration #381
Hey @thandayuthapani, TravisCI finished with status TravisBuddy Request Identifier: cab9b0e0-ad65-11e9-aa77-bf7f12915990
Hey @thandayuthapani, TravisCI finished with status TravisBuddy Request Identifier: 41c164d0-ad6b-11e9-aa77-bf7f12915990
@thandayuthapani Downloading every image and running a separate sample for each application that runs on Volcano would slow down our development process. Maybe we should set up a cron job for this kind of e2e test?
test/e2e/util.go
Outdated
@@ -645,7 +646,7 @@ func waitJobStateAborted(ctx *context, job *vkv1.Job) error {

 func waitJobPhaseExpect(ctx *context, job *vkv1.Job, state vkv1.JobPhase) error {
 	var additionalError error
-	err := wait.Poll(100*time.Millisecond, oneMinute, func() (bool, error) {
+	err := wait.Poll(100*time.Millisecond, twoMinute, func() (bool, error) {
This may increase the total test time. I would rather add a parameter to waitJobPhaseExpect.
Updated
I think this is good; we should have a TF job case. A cron job is not a good fit, because it cannot prevent a PR from breaking cases like the TF job. As for speed, the Docker images should be reusable.
@TommyLike To my knowledge, downloading images will not affect test time much, since the CI provides a high-speed connection of around 35 MB/s. Downloading the images might add about 10 seconds to the E2E run.
Hey @thandayuthapani, TravisCI finished with status TravisBuddy Request Identifier: c4fb3320-addf-11e9-aa77-bf7f12915990
@thandayuthapani We need to consider how long it takes to download the image, how long it takes to load the image into kind, and how long the e2e tests take to complete.
The 1.25 GB image that was posted is the old one; I have replaced it with a lighter one.
And time to load docker-image in my local setup is
The E2E test takes around 3 minutes to complete, but that is the simplest TensorFlow training example I could find.
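As a rough check on the download figures quoted in this thread, the arithmetic can be sketched as follows. It assumes the full 35 MB/s is sustained, treats 1 GB as 1000 MB, and ignores the time to load the image into kind, so it is only a lower bound; downloadSeconds is a hypothetical helper.

```go
package main

import "fmt"

// downloadSeconds estimates the transfer time for an image of sizeMB
// megabytes over a link that sustains rateMBps megabytes per second.
func downloadSeconds(sizeMB, rateMBps float64) float64 {
	return sizeMB / rateMBps
}

func main() {
	const rate = 35.0 // MB/s, the bandwidth figure quoted in the thread

	// The old 1.25 GB image (taken as 1250 MB, an assumption).
	fmt.Printf("1.25 GB image: %.1f s\n", downloadSeconds(1250, rate))

	// Image size that fits the quoted 10-second download estimate.
	fmt.Printf("10 s budget:   %.0f MB\n", 10*rate)
}
```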
@thandayuthapani Thanks for the update. That is why I am wondering whether we should move this into a separate job.
@TommyLike I have reduced the number of training steps, and the test now runs in CI in 78 seconds. If that is acceptable we can keep it; otherwise we can reduce the steps further.
/lgtm
@thandayuthapani One point: the image used should be in the volcanosh repo instead of your own.
Sure, I will move it to the volcanosh repo.
I have updated it.
WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"ps\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"};
python /var/tf_dist_mnist/dist_mnist.py
image: thanda/tf-operator-example:1.0
volcanosh/dist-mnist-tf-example:0.0.1
updated
WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"worker\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"};
python /var/tf_dist_mnist/dist_mnist.py
image: thanda/tf-operator-example:1.0
ditto
updated
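For readers unfamiliar with the TF_CONFIG lines in the two snippets above, the following is a minimal Go sketch of what the shell pipeline appears to do: read a newline-separated host list, append port 2222 to each host, and embed the lists in a JSON document together with the task type and index. It is an illustration, not part of this PR. The host names are made up, the PS_HOST construction is not shown in the thread and is assumed to mirror WORKER_HOST, and buildTFConfig and withPort are hypothetical helpers.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// tfTask and tfConfig mirror the JSON shape exported as TF_CONFIG above.
type tfTask struct {
	Type  string `json:"type"`
	Index int    `json:"index"`
}

type tfConfig struct {
	Cluster     map[string][]string `json:"cluster"`
	Task        tfTask              `json:"task"`
	Environment string              `json:"environment"`
}

// withPort splits a newline-separated host list and appends ":port" to
// each entry, skipping blank lines such as a trailing newline.
func withPort(hostFile string, port int) []string {
	var out []string
	for _, h := range strings.Split(hostFile, "\n") {
		if h = strings.TrimSpace(h); h != "" {
			out = append(out, fmt.Sprintf("%s:%d", h, port))
		}
	}
	return out
}

// buildTFConfig renders the TF_CONFIG JSON for one task. Port 2222 for the
// ps hosts is an assumption; the thread only shows it for workers.
func buildTFConfig(psHosts, workerHosts, taskType string, index int) (string, error) {
	cfg := tfConfig{
		Cluster: map[string][]string{
			"ps":     withPort(psHosts, 2222),
			"worker": withPort(workerHosts, 2222),
		},
		Task:        tfTask{Type: taskType, Index: index},
		Environment: "cloud",
	}
	b, err := json.Marshal(cfg)
	return string(b), err
}

func main() {
	// Hypothetical contents of the ps and worker host files.
	s, err := buildTFConfig("ps-0\n", "worker-0\nworker-1\n", "worker", 1)
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```

Building the value with a JSON encoder avoids the hand-escaped quotes and any trailing comma that a `tr "\n" ","` pipeline can leave behind when the host file ends with a newline.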
15aa674 to f4e98ba
Hey @thandayuthapani, TravisCI finished with status TravisBuddy Request Identifier: a05e1c20-aed5-11e9-a97b-d704c485e7e4
@k82cn Can you please retrigger this build?
/lgtm
Hey @thandayuthapani, TravisCI finished with status TravisBuddy Request Identifier: 29ceca70-afaa-11e9-83c1-65c58ed3b4ac
Hey @thandayuthapani, TravisCI finished with status TravisBuddy Request Identifier: 04909210-b04b-11e9-83c1-65c58ed3b4ac
@k82cn @TommyLike Please have a look
/lgtm
/approve
/assign @kevin-wangzefeng
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: k82cn, thandayuthapani, TommyLike
This PR involves E2E testing for TensorFlow integration with Volcano.