Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[errors] Add distinct error codes for docker not running #4914

Merged

Conversation

tejal29
Copy link
Member

@tejal29 tejal29 commented Oct 14, 2020

Relates to #4692

This PR, tries to reduce BUILD_UNKNOWN for docker builder type when docker is not running.

Description of changes:

  1. Add a regular expression to recognize error thrown by docker when docker socket on host is not found.
  2. The way BuildState is represent right now, there is just one or most recent error code stored in handler.state.BuildState.StatusCode. From the event log, you can see the actual build error happens in the first build. The second build is cancelled as first build is failed. The DevLoopEventFailed uses this build state to propagate the error.
    Ignoring cancellation errors, also make sure the build state has the error code with last legit build failure.

Before this PR, on master

  1. Stop your docker daemon.
  2. Try skaffold dev
    With minikube deploy context skaffold fails when creating runner.
➜  microservices git:(get_docker_json_output) ✗ skaffold dev             
creating runner: creating builder: getting docker client: getting minikube env: running [/Users/tejaldesai/Downloads/google-cloud-sdk2/bin/minikube docker-env --shell none -p minikube]
 - stdout: "\n\n"
 - stderr: "! Executing \"docker container inspect minikube --format={{.State.Status}}\" took an unusually long time: 7.36540945s\n* Restarting the docker service may improve performance.\nX Exiting due to GUEST_STATUS: state: unknown state \"minikube\": docker container inspect minikube --format=: exit status 1\nstdout:\n\n\nstderr:\nCannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?\n\n* \n* If the above advice does not help, please let us know: \n  - https://github.com/kubernetes/minikube/issues/new/choose\n"
 - cause: exit status 80

with GKE deploy context

➜  microservices git:(get_docker_json_output) ✗ skaffold dev     
Listing files to watch...
 - leeroy-web
 - leeroy-app
Generating tags...
 - leeroy-web -> leeroy-web:v1.15.0-16-g8992c54bd
 - leeroy-app -> leeroy-app:v1.15.0-16-g8992c54bd
Checking cache...
 - leeroy-web: Not found. Building
 - leeroy-app: Not found. Building
Building [leeroy-web]...
exiting dev mode because first build failed: couldn't build "leeroy-web": docker build: Cannot connect to the Docker daemon at tcp://127.0.0.1:32770. Is the docker daemon running?
➜  microservices git:(get_docker_json_output) ✗ 

With this PR
Minikube Deploy context

➜  microservices git:(get_docker_json_output) ✗ ../../out/skaffold dev             
Build Failed. Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Please check if docker is running.
➜  microservices git:(get_docker_json_output) ✗ 

GKE deploy context

➜  microservices git:(get_docker_json_output) ../../out/skaffold dev 
Listing files to watch...
 - leeroy-web
 - leeroy-app
Generating tags...
 - leeroy-web -> leeroy-web:v1.15.0-16-g8992c54bd
 - leeroy-app -> leeroy-app:v1.15.0-16-g8992c54bd
Checking cache...
 - leeroy-web: Not found. Building
 - leeroy-app: Not found. Building
Building [leeroy-web]...

Build Failed. Cannot connect to the Docker daemon at tcp://127.0.0.1:32770. Please check if docker is running.
➜  microservices git:(get_docker_json_output) 

Event API testing:

curl localhost:50052/v1/events              
{"result":{"timestamp":"2020-10-14T22:04:00.472529Z","event":{"metaEvent":{"entry":"Starting Skaffold: 
.....

{"result":{"timestamp":"2020-10-14T22:04:04.087004Z",
    "event":{"buildEvent":{"artifact":"leeroy-web",
                 "status":"In Progress"}},"entry":"Build started for artifact leeroy-web"}}


{"result":{"timestamp":"2020-10-14T22:04:08.037229Z",
    "event":{"buildEvent":{"artifact":"leeroy-web"," 
         status":"Failed",
        "err":"docker build: Cannot connect to the Docker daemon at tcp://127.0.0.1:32770. Is the docker daemon running?",
         "errCode":"BUILD_DOCKER_DAEMON_NOT_RUNNING",
             "actionableErr":{"errCode":"BUILD_DOCKER_DAEMON_NOT_RUNNING","message":"docker build: Cannot connect to the Docker daemon at tcp://127.0.0.1:32770. Is the docker daemon running?",
                "suggestions":[{"suggestionCode":"CHECK_DOCKER_RUNNING","action":"Please check if docker is running"}]}}},"entry":"Build failed for artifact leeroy-web"}}

{"result":{"timestamp":"2020-10-14T22:04:08.037291Z",
       "event":{"buildEvent":{"artifact":"leeroy-app",
         "status":"In Progress"}},"entry":"Build started for artifact leeroy-app"}}

{"result":{"timestamp":"2020-10-14T22:04:08.037789Z",
    "event":{"buildEvent":{"artifact":"leeroy-app",
      "status":"Failed",
      "err":"docker build: error during connect: Post \"https://127.0.0.1:32770/v1.24/build?buildargs=%7B%7D\u0026cachefrom=null\u0026cgroupparent=\u0026cpuperiod=0\u0026cpuquota=0\u0026cpusetcpus=\u0026cpusetmems=\u0026cpushares=0\u0026dockerfile=Dockerfile\u0026labels=null\u0026memory=0\u0026memswap=0\u0026networkmode=\u0026rm=0\u0026shmsize=0\u0026t=leeroy-app%3Av1.15.0-16-g8992c54bd\u0026target=\u0026ulimits=null\u0026version=\": context canceled","
      errCode":"BUILD_CANCELLED",
      "actionableErr":{"errCode":"BUILD_CANCELLED","message":"docker build: error during connect: Post \"https://127.0.0.1:32770/v1.24/build?buildargs=%7B%7D\u0026cachefrom=null\u0026cgroupparent=\u0026cpuperiod=0\u0026cpuquota=0\u0026cpusetcpus=\u0026cpusetmems=\u0026cpushares=0\u0026dockerfile=Dockerfile\u0026labels=null\u0026memory=0\u0026memswap=0\u0026networkmode=\u0026rm=0\u0026shmsize=0\u0026t=leeroy-app%3Av1.15.0-16-g8992c54bd\u0026target=\u0026ulimits=null\u0026version=\": context canceled"}}},"entry":"Build failed for artifact leeroy-app"}}

{"result":{"timestamp":"2020-10-14T22:04:08.042649Z",
      "event":{"devLoopEvent":{"status":"Failed",
      "err":{"errCode":"BUILD_DOCKER_DAEMON_NOT_RUNNING",
                  "message":"docker build: Cannot connect to the Docker daemon at tcp://127.0.0.1:32770. Is the docker daemon running?"}}},
             "entry":"Update failed with error code BUILD_DOCKER_DAEMON_NOT_RUNNING"}}

@tejal29 tejal29 requested a review from a team as a code owner October 14, 2020 20:48
@google-cla google-cla bot added the cla: yes label Oct 14, 2020
@codecov
Copy link

codecov bot commented Oct 14, 2020

Codecov Report

Merging #4914 into master will increase coverage by 0.20%.
The diff coverage is 91.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4914      +/-   ##
==========================================
+ Coverage   71.96%   72.17%   +0.20%     
==========================================
  Files         357      358       +1     
  Lines       12327    12397      +70     
==========================================
+ Hits         8871     8947      +76     
+ Misses       2799     2796       -3     
+ Partials      657      654       -3     
Impacted Files Coverage Δ
pkg/skaffold/build/cache/retrieve.go 68.60% <0.00%> (ø)
pkg/skaffold/docker/remote.go 48.97% <0.00%> (ø)
pkg/skaffold/errors/buildProblems.go 100.00% <ø> (ø)
pkg/skaffold/errors/err_map.go 96.66% <95.65%> (-3.34%) ⬇️
cmd/skaffold/app/cmd/commands.go 88.88% <100.00%> (+4.67%) ⬆️
pkg/skaffold/build/scheduler.go 94.33% <100.00%> (+0.10%) ⬆️
pkg/skaffold/docker/image.go 79.34% <100.00%> (ø)
pkg/skaffold/errors/errors.go 80.00% <100.00%> (+6.47%) ⬆️
pkg/skaffold/event/event.go 91.11% <100.00%> (+0.02%) ⬆️
pkg/skaffold/build/gcb/kaniko.go 77.77% <0.00%> (-4.58%) ⬇️
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c6260e8...7637189. Read the comment docs.

@tejal29
Copy link
Member Author

tejal29 commented Oct 14, 2020

Testing GetState in dev loop

  1. Run skaffold dev
  2. Pause Docker
  3. Make a change in build dependencies.
➜  microservices git:(get_docker_json_output) ✗ ../../out/skaffold dev --rpc-http-port=50052 --default-repo=gcr.io/tejal-test
Listing files to watch...
 - leeroy-web
 - leeroy-app
Generating tags...
 - leeroy-web -> gcr.io/tejal-test/leeroy-web:v1.15.0-16-g8992c54bd
 - leeroy-app -> gcr.io/tejal-test/leeroy-app:v1.15.0-16-g8992c54bd-dirty
Checking cache...
 - leeroy-web: Not found. Building
 - leeroy-app: Not found. Building
Building [leeroy-web]...
..
Building [leeroy-app]...
...
Starting deploy...
 - deployment.apps/leeroy-web created
 - service/leeroy-app created
 - deployment.apps/leeroy-app created
Deployments stabilized in 5.922908637s
Press Ctrl+C to exit

Watching for changes...
[leeroy-app] 2020/10/14 22:27:24 leeroy app server ready
[leeroy-web] 2020/10/14 22:27:25 leeroy web server ready
Generating tags...
 - leeroy-app -> gcr.io/tejal-test/leeroy-app:v1.15.0-16-g8992c54bd-dirty
Checking cache...
 - leeroy-app: Not found. Building
Building [leeroy-app]...

WARN[0059] Skipping deploy due to error: docker build: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? 
Watching for changes...
  1. Hit the get state API curl localhost:50052/v1/state
➜  github-dashing git:(docker-support) ✗ curl localhost:50052/v1/state               
{"buildState":{"artifacts":{"leeroy-app":"Failed","leeroy-web":"Not Started"},"autoTrigger":true,"statusCode":"BUILD_DOCKER_DAEMON_NOT_RUNNING"},"deployState":{"status":"Not Started","autoTrigger":true},"statusCheckState":{"status":"Not Started"},"fileSyncState":{"status":"Not Started","autoTrigger":true},"metadata":{"build":{"numberOfArtifacts":2,"builders":[{"type":"DOCKER","count":2}],"type":"LOCAL"},"deploy":{"deployers":[{"type":"KUBECTL","count":1}],"cluster":"GKE"}}}%                                                         ➜  github-dashing git:(docker-support) ✗ 
  1. Also check the event stream

{"result":{"timestamp":"2020-10-14T22:27:56.538037Z","event":{"buildEvent":{"artifact":"leeroy-app","status":"Failed","err":"docker build: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?","errCode":"BUILD_DOCKER_DAEMON_NOT_RUNNING","actionableErr":{"errCode":"BUILD_DOCKER_DAEMON_NOT_RUNNING","message":"docker build: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?","suggestions":[{"suggestionCode":"CHECK_DOCKER_RUNNING","action":"Please check if docker is running"}]}}},"entry":"Build failed for artifact leeroy-app"}}

{"result":{"timestamp":"2020-10-14T22:27:56.538225Z",
   "event":{"devLoopEvent":{"status":"Failed",
    "err":{"errCode":"BUILD_DOCKER_DAEMON_NOT_RUNNING",
    "message":"docker build: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"}}},"entry":"Update failed with error code BUILD_DOCKER_DAEMON_NOT_RUNNING"}}

Comment on lines 33 to +34
regexp *regexp.Regexp
description string
description func(error) string
Copy link
Collaborator

@gsquared94 gsquared94 Oct 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we're doing the same thing in two different ways: we match regexp against error to filter down to the problem but now we match error again to filter down to a description.

We can leave this as is and rather split DockerConnectionFailedPrefix into DockerConnectionFailed and DockerConnectionFailedWithUnknownHost for the case with known and unknown host address respectively.
That'll keep problem struct at lowest granularity and easily referenceable.

WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above approach still won't solve the problem. The reason why i added the description as a function was to extract the "address" where docker daemon should be running from the error message.
e.g. if the error is

exiting dev mode because first build failed: docker build: Cannot connect to the Docker daemon at tcp://127.0.0.1:32770. Is the docker daemon running?

I want to extract the "tcp://127.0.0.1:32770." and trim the error message as

Build failed. Cannot connect to the Docker daemon at tcp://127.0.0.1:32770. Is the docker daemon running?

When deploy context is not minikube, we could just return the error we receive as is.

When deploy context is minikube, the error we receive is

creating runner: creating builder: getting docker client: getting minikube env: running [/Users/tejaldesai/Downloads/google-cloud-sdk2/bin/minikube docker-env --shell none -p minikube]
 - stdout: "\n\n"
 - stderr: "! Executing \"docker container inspect minikube --format={{.State.Status}}\" took an unusually long time: 7.36540945s\n* Restarting the docker service may improve performance.\nX Exiting due to GUEST_STATUS: state: unknown state \"minikube\": docker container inspect minikube --format=: exit status 1\nstdout:\n\n\nstderr:\nCannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?\n\n* \n* If the above advice does not help, please let us know: \n  - https://github.com/kubernetes/minikube/issues/new/choose\n"
 - cause: exit status 80

which needs to be translated to

Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

Another way to do this, if we use parse the "stdout" when we get this error from minikube and raise the error as "Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"
maybe this is long term approach we want.

Or, else we use a static string for both cases.

Cannot connect to the Docker daemon. Is the docker daemon running?

I hope it's clear.

In general, i feel we need a function to massage/trim the error we received to get rid of all the text added due to errors.Wrap.

Or, we could add another field to struct problem

type problem struct {
	regexp      *regexp.Regexp
	description func(error) string
	errCode     proto.StatusCode
	suggestion  func(opts config.SkaffoldOptions) []*proto.Suggestion
        userMessage  func(error) string
}

proto/skaffold.proto Outdated Show resolved Hide resolved
@@ -310,7 +310,7 @@ func BuildCanceled(imageName string) {
func BuildFailed(imageName string, err error) {
aiErr := sErrors.ActionableErr(sErrors.Build, err)
handler.stateLock.Lock()
handler.state.BuildState.StatusCode = aiErr.ErrCode
handler.state.BuildState.StatusCode = ignoreCancelled(handler.state.BuildState.StatusCode, aiErr.ErrCode)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ignoreCancelled(handler.state.BuildState.StatusCode, aiErr.ErrCode)

This makes me think that we're updating the BuildState.StatusCode in the wrong place since it seems to denote the status of the build sequence of several artifacts as a whole whereas BuildFailed is called for every artifact build failure.

Can we split out:

// BuildFailed notifies that a build has failed.
func BuildFailed(imageName string, err error) {
	aiErr := sErrors.ActionableErr(sErrors.Build, err)
	handler.handleBuildEvent(&proto.BuildEvent{
		Artifact:      imageName,
		Status:        Failed,
		Err:           err.Error(),
		ErrCode:       aiErr.ErrCode,
		ActionableErr: aiErr})
}

// BuildSequenceFailed notifies that the build sequence has failed.
func BuildSequenceFailed(err error) {
	aiErr := sErrors.ActionableErr(sErrors.Build, err)
	handler.stateLock.Lock()
	handler.state.BuildState.StatusCode = aiErr.ErrCode
	handler.stateLock.Unlock()
}

and call BuildSequenceFailed in scheduler.go line 70

if err := g.Wait(); err != nil {
	event.BuildSequenceFailed(err)
	return nil, err
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. i think with the error group approach it would work.

Copy link
Collaborator

@gsquared94 gsquared94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL at the comments. Looks good otherwise.

@tejal29
Copy link
Member Author

tejal29 commented Oct 15, 2020

Testing Notes:

As per @gsquared94 suggestion, moving updating build status code in scheduler.go, verified that devloop end event has the error code BUILD_PUSH_ACCESS_DENIED


{"result":{"timestamp":"2020-10-15T18:23:24.696448Z","event":{"metaEvent":{"entry":"Starting Skaffold: \u0026{Version:v1.15.0-17-ga57104a62-dirty ConfigVersion:skaffold/v2beta9 GitVersion: GitCommit:a57104a625bd9bbe918546cb52c194a0ec423814 GitTreeState:dirty BuildDate:2020-10-15T10:25:16Z GoVersion:go1.14.3 Compiler:gc Platform:darwin/amd64}","metadata":{"build":{"numberOfArtifacts":2,"builders":[{"type":"DOCKER","count":2}],"type":"LOCAL"},"deploy":{"deployers":[{"type":"KUBECTL","count":1}],"cluster":"OTHER"}}}}}}
{"result":{"timestamp":"2020-10-15T18:23:24.696477Z","event":{"devLoopEvent":{"status":"In Progress"}},"entry":"Update initiated due to file change"}}
{"result":{"timestamp":"2020-10-15T18:23:27.273327Z","
   event":{"buildEvent":{"artifact":"leeroy-web","status":"In Progress"}},"entry":"Build started for artifact leeroy-web"}}

{"result":{"timestamp":"2020-10-15T18:23:55.946762Z","event":{
    "buildEvent":{"artifact":"leeroy-web","status":"Failed","err":"could not push image \"leeroy-web:v1.15.0-17-ga57104a62\": denied: requested access to the resource is denied",
   "errCode":"BUILD_PUSH_ACCESS_DENIED","actionableErr":{"errCode":"BUILD_PUSH_ACCESS_DENIED","message":"could not push image \"leeroy-web:v1.15.0-17-ga57104a62\": denied: requested access to the resource is denied","suggestions":[{"suggestionCode":"ADD_DEFAULT_REPO","action":"Trying running with `--default-repo` flag"}]}}},"entry":"Build failed for artifact leeroy-web"}}
{"result":{"timestamp":"2020-10-15T18:23:55.946806Z","event":{"buildEvent":{"artifact":"leeroy-app","status":"In Progress"}},"entry":"Build started for artifact leeroy-app"}}

{"result":{"timestamp":"2020-10-15T18:23:55.949108Z",
   "event":{"buildEvent":{"artifact":"leeroy-app","status":"Failed","err":"docker build: error during connect: Post \"http://%2Fvar%2Frun%2Fdocker.sock/v1.40/build?buildargs=%7B%7D\u0026cachefrom=null\u0026cgroupparent=\u0026cpuperiod=0\u0026cpuquota=0\u0026cpusetcpus=\u0026cpusetmems=\u0026cpushares=0\u0026dockerfile=Dockerfile\u0026labels=null\u0026memory=0\u0026memswap=0\u0026networkmode=\u0026rm=0\u0026shmsize=0\u0026t=leeroy-app%3Av1.15.0-17-ga57104a62\u0026target=\u0026ulimits=null\u0026version=\": context canceled",
   "errCode":"BUILD_CANCELLED","actionableErr":{"errCode":"BUILD_CANCELLED","message":"docker build: error during connect: Post \"http://%2Fvar%2Frun%2Fdocker.sock/v1.40/build?buildargs=%7B%7D\u0026cachefrom=null\u0026cgroupparent=\u0026cpuperiod=0\u0026cpuquota=0\u0026cpusetcpus=\u0026cpusetmems=\u0026cpushares=0\u0026dockerfile=Dockerfile\u0026labels=null\u0026memory=0\u0026memswap=0\u0026networkmode=\u0026rm=0\u0026shmsize=0\u0026t=leeroy-app%3Av1.15.0-17-ga57104a62\u0026target=\u0026ulimits=null\u0026version=\": context canceled"}}},"entry":"Build failed for artifact leeroy-app"}}

{"result":{"timestamp":"2020-10-15T18:23:55.951643Z","event":{
   "devLoopEvent":{"status":"Failed",
   "err":{"errCode":
     "BUILD_PUSH_ACCESS_DENIED","message":"could not push image \"leeroy-web:v1.15.0-17-ga57104a62\": denied: requested access to the resource is denied"}}},"entry":"Update failed with error code BUILD_PUSH_ACCESS_DENIED"}}
curl: (18) transfer closed with outstanding read data remaining

@tejal29
Copy link
Member Author

tejal29 commented Oct 15, 2020

TODO:
Add test to make sure the error code is for the first failed artifact

@tejal29
Copy link
Member Author

tejal29 commented Oct 15, 2020

PTAL at the comments. Looks good otherwise.

Thanks a lot @gsquared94 I addressed all your comment except for the problem.description. Let me know what you think!

Copy link
Collaborator

@gsquared94 gsquared94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@gsquared94 gsquared94 merged commit 5690b56 into GoogleContainerTools:master Oct 16, 2020
@gsquared94
Copy link
Collaborator

TODO:
Add test to make sure the error code is for the first failed artifact

I didn't see this TODO and merged on all green. Can you please add the tests in another PR?
Thanks!

@tejal29
Copy link
Member Author

tejal29 commented Oct 16, 2020

will do. :)

@tejal29
Copy link
Member Author

tejal29 commented Oct 16, 2020

TODO:
Add test to make sure the error code is for the first failed artifact

I didn't see this TODO and merged on all green. Can you please add the tests in another PR?
Thanks!

will do. I am thinking if it should be an integration test or just a unit test. Let me know if you have any thoughts or if you think you have already covered this case in scheduler_test.go

@gsquared94
Copy link
Collaborator

will do. I am thinking if it should be an integration test or just a unit test. Let me know if you have any thoughts or if you think you have already covered this case in scheduler_test.go

I only said it because you had a comment above, otherwise I don't have any thoughts about it 🙂.
I think it's covered in the last case in TestInOrderForArgs in scheduler_test.go but you can add a more explicit test if you want to.

@tejal29 tejal29 deleted the get_docker_json_output branch April 15, 2021 07:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants