[discussion] Increase the availability of CI in katib #522

Closed
gaocegege opened this issue May 15, 2019 · 10 comments

Comments

@gaocegege
Member

/kind discussion

Describe the solution you'd like
[A clear and concise description of what you want to happen.]

CI is currently slowing down the development of katib. We need to find a way to solve the problem.

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

/cc @richardsliu @johnugeorge @hougangliu

@gaocegege
Member Author

Some suggestions:

  • By the way, do we need to build images in parallel? I don't think it will make a huge difference in build time. We can fix this later. (From @johnugeorge)
  • Should we build the code instead of building the images in the CI? (From @gaocegege)

@richardsliu
Contributor

List of issues we have seen in the past couple of days:

  • RESOURCE_EXHAUSTED: Quota exceeded for quota metric 'cloudbuild.googleapis.com/get_requests' and limit 'GetRequestsPerMinutePerUser' of service 'cloudbuild.googleapis.com' for consumer 'project_number:593963025935'.

This one seems to be resolved for the moment after requesting a quota increase.

  • Exception occurred: HTTPSConnectionPool(host='35.196.213.148', port=443): Max retries exceeded with url: /apis/argoproj.io/v1alpha1/namespaces/kubeflow-test-infra/workflows/kubeflow-katib-presubmit-e2e-v1alpha1-519-db486d7-0611-b811 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7ff439d6f950>: Failed to establish a new connection: [Errno 111] Connection refused',))

Transient error?


@hougangliu
Member

Maybe we can make katib's build-.sh smarter: only build an image when the files it depends on have changed. For example, if only pkg/db/ changed, only the katib-manager image should be built.
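
A minimal sketch of that idea, assuming a hand-maintained mapping from path prefixes to images. Only the pkg/db/ to katib-manager pairing comes from this comment; the other prefix, the image name `katib-ui`, and the helper `images_to_build` are hypothetical placeholders, not katib's actual layout or build script.

```python
from typing import Dict, Iterable, Set

# Hypothetical mapping from source path prefixes to the images built from them.
# Only "pkg/db/" -> "katib-manager" is taken from the discussion.
IMAGE_RULES: Dict[str, Set[str]] = {
    "pkg/db/": {"katib-manager"},
    "cmd/ui/": {"katib-ui"},  # hypothetical entry for illustration
}


def images_to_build(changed_files: Iterable[str],
                    rules: Dict[str, Set[str]] = IMAGE_RULES) -> Set[str]:
    """Return the images that need rebuilding for the given changed files.

    A changed file that matches no prefix triggers a full rebuild, so an
    incomplete rule table can never cause a stale image to be tested.
    """
    all_images: Set[str] = set().union(*rules.values()) if rules else set()
    selected: Set[str] = set()
    for path in changed_files:
        matched = False
        for prefix, images in rules.items():
            if path.startswith(prefix):
                selected |= images
                matched = True
        if not matched:
            # Unknown file: fall back to rebuilding everything.
            return all_images
    return selected
```

The list of changed files could come from something like `git diff --name-only` against the base branch. The fall-back to a full rebuild is a deliberately conservative choice for this sketch.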

@gaocegege
Member Author

gaocegege commented May 15, 2019

We do not need a GPU cluster, so we can drop this constraint:

--accelerator type=nvidia-tesla-k80,count=1 \

@gaocegege
Member Author

After #525, the build time is reduced to 10-20 minutes.

@andreyvelich
Member

What if we move to Go modules? In that case we would not need the vendor folder or dep ensure.

@gaocegege
Member Author

gaocegege commented May 16, 2019

Go modules also need to download packages from the remote. I prefer keeping vendor checked in whether we use dep or not.

@gaocegege
Member Author

There is one thing we can do: Jeremy suggested batching the requests to gcloud container builds.

Right now we send one build request per image, one after another. Jeremy said we can do it all in a single request.
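
A rough sketch of what batching could look like, assuming all images are described in one Cloud Build config with one docker build step per image. The helper name `batched_build_config`, the image tags, and the Dockerfile paths are hypothetical. The `waitFor: ["-"]` entry asks Cloud Build to start a step without waiting for earlier steps, so it is optional here.

```python
import json
from typing import Dict


def batched_build_config(images: Dict[str, str], parallel: bool = True) -> dict:
    """Build one Cloud Build config covering every image.

    `images` maps an image tag to the path of its Dockerfile
    (both hypothetical in this sketch).
    """
    steps = []
    for tag, dockerfile in sorted(images.items()):
        step = {
            "name": "gcr.io/cloud-builders/docker",
            "args": ["build", "-t", tag, "-f", dockerfile, "."],
        }
        if parallel:
            # "-" means: do not wait for any previous step.
            step["waitFor"] = ["-"]
        steps.append(step)
    # Images listed here are pushed once all steps succeed.
    return {"steps": steps, "images": sorted(images)}


if __name__ == "__main__":
    # Hypothetical tags and Dockerfile locations, for illustration only.
    config = batched_build_config({
        "gcr.io/PROJECT_ID/katib-manager": "cmd/manager/Dockerfile",
        "gcr.io/PROJECT_ID/katib-ui": "cmd/ui/Dockerfile",
    })
    print(json.dumps(config, indent=2))
```

The printed JSON could be saved as cloudbuild.json and sent as a single request with `gcloud builds submit --config cloudbuild.json .`, instead of one submit call per image.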

@gaocegege
Member Author

Found a note in https://cloud.google.com/sdk/gcloud/reference/builds/submit:

You can also run a build locally using the separate component: gcloud components install cloud-build-local.

These variants are also available:

$ gcloud alpha builds submit
$ gcloud beta builds submit

@gaocegege
Member Author

/close


4 participants