Fabio running using Nomad system scheduler breaks Docker. #192

Closed
michaelmcguinness opened this issue Nov 24, 2016 · 9 comments

Comments

@michaelmcguinness

I realise how unlikely the title of this issue seems, but if there is an obvious error in my setup I can't spot it. I want to run Fabio as a Nomad-managed service using the Nomad system scheduler (type = "system"). When I do, any subsequent pull from our private Docker registry fails with the error:
failed to register layer: open /dev/mapper/docker-202:32-786433-35e363b33db58a87d6a55b19f3297715b9978052e70edec86f03b51af3e44455: no such file or directory
From that point on I am not able to recover Docker.

Some details about our setup:
Ubuntu 14.04
Kernel = 3.13.0-53-generic
Docker = 1.12.2
Nomad = 0.5.0
Fabio = 1.3.4

I have 3 servers with 2 clients. I am trying to run Fabio using the exec driver and the system scheduler. I am running Nomad as the root user, which I believe is required for the exec driver.

I do not see the issue if I run Fabio using the service scheduler.
I do not see the issue if I run a Docker container using the system scheduler.
I do not see the issue if I run another job (sleep binary) using the system scheduler.
I do not see the issue if I run Fabio using the system scheduler but using the raw_exec driver.

Docker is using the LVM storage option, but I see the same issue if I drop back to the devicemapper storage option.
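
For context, this is roughly how those two configurations are expressed on Ubuntu 14.04 via DOCKER_OPTS in /etc/default/docker; the thin-pool device path is illustrative, not our exact value, and the exact mechanism depends on how the daemon is launched:

# devicemapper backed by an LVM thin pool (the "LVM storage option" above)
DOCKER_OPTS="--storage-driver=devicemapper --storage-opt dm.thinpooldev=/dev/mapper/docker-thinpool"

# plain loopback-backed devicemapper (the fallback mentioned above)
DOCKER_OPTS="--storage-driver=devicemapper"

# confirm which driver the running daemon is actually using
$ docker info | grep -i 'storage driver'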

Below is a repeatable test case. After that are copies of the job specs used in the test case.

1. Switch to the nomad user
ubuntu@ip-10-75-70-27:~$ sudo su - nomad

2. Software versions

$ uname -a
Linux ip-10-75-70-27 3.13.0-53-generic #89-Ubuntu SMP Wed May 20 10:34:39 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
$ docker --version
Docker version 1.12.2, build bb80604
$ nomad version
Nomad v0.5.0
3. Nomad running as root with no running jobs
$ ps -ef | grep nomad
root     17416     1  0 12:57 ?        00:00:00 /usr/local/bin/nomad agent -config /etc/nomad.d/config.json -rejoin -node=nomad_client_poc1
consul   17540     1  0 12:57 ?        00:00:00 /usr/local/bin/consul agent -config-file /etc/consul.d/config.json -rejoin -node nomad_client_poc1
root     17617  2602  0 12:58 pts/1    00:00:00 sudo su - nomad
4. Demonstrate Docker pull
$ docker pull dockerregistry.adm.myprivatecloud.net/bti1003:latest
latest: Pulling from bti1003
5a132a7e7af1: Pull complete
fd2731e4c50c: Pull complete
28a2f68d1120: Pull complete
a3ed95caeb02: Pull complete
87f9029820c8: Pull complete
7582f6d126ab: Pull complete
Digest: sha256:6d7379af49cc17cc8a0055e06c4cb8374e5be73fe42ce2e8f1abca013c50a62a
Status: Downloaded newer image for dockerregistry.adm.myprivatecloud.net/bti1003:latest
5. Remove pulled image
$ docker rmi dockerregistry.adm.myprivatecloud.net/bti1003:latest
Untagged: dockerregistry.adm.myprivatecloud.net/bti1003:latest
Untagged: dockerregistry.adm.myprivatecloud.net/bti1003@sha256:6d7379af49cc17cc8a0055e06c4cb8374e5be73fe42ce2e8f1abca013c50a62a
Deleted: sha256:3fee2600d434e469b6d4ac0e468bd988ebc105536886d6624dc9566577fcafbe
Deleted: sha256:e5fcd939dd4a2a9b9543dea61ca90d2def7c92cd983108916895a39f239799b8
Deleted: sha256:57bd7c9432ae86d63f2342e442eebd0f4dfc340ca61c6a4c7d702b17a315865f
Deleted: sha256:0aaccda2aadfc70ab2248437568fd17f4e8860cf612cc4b7e154b97222dccf91
Deleted: sha256:9dcfe19e941956c63860afee1bec2e2318f6fbd336bc523094ed609a9c437a01
Deleted: sha256:6ff1ee6fc8a0358aeb92f947fb7125cd9e3d68c05be45f5375cb59b98c850b4d
Deleted: sha256:56abdd66ba312859b30b5629268c30d44a6bbef6e2f0ebe923655092855106e8
6. Run 'sleep' test job
$ ps -ef | grep nomad.*executor
root     17897 17416  6 13:00 ?        00:00:02 /usr/local/bin/nomad executor /var/nomad/alloc/87a48081-5d8e-b1b1-1538-a1e79a3f4152/sleep-task/sleep-task-executor.out
nomad    18299 17619  0 13:07 pts/1    00:00:00 grep nomad.*executor
7. Pull Docker image
$ docker pull dockerregistry.adm.myprivatecloud.net/bti1003:latest
latest: Pulling from bti1003
5a132a7e7af1: Pull complete
fd2731e4c50c: Pull complete
28a2f68d1120: Pull complete
a3ed95caeb02: Pull complete
87f9029820c8: Pull complete
7582f6d126ab: Pull complete
Digest: sha256:6d7379af49cc17cc8a0055e06c4cb8374e5be73fe42ce2e8f1abca013c50a62a
Status: Downloaded newer image for dockerregistry.adm.myprivatecloud.net/bti1003:latest
8. Remove pulled image
$ docker rmi dockerregistry.adm.myprivatecloud.net/bti1003:latest
Untagged: dockerregistry.adm.myprivatecloud.net/bti1003:latest
Untagged: dockerregistry.adm.myprivatecloud.net/bti1003@sha256:6d7379af49cc17cc8a0055e06c4cb8374e5be73fe42ce2e8f1abca013c50a62a
Deleted: sha256:3fee2600d434e469b6d4ac0e468bd988ebc105536886d6624dc9566577fcafbe
Deleted: sha256:e5fcd939dd4a2a9b9543dea61ca90d2def7c92cd983108916895a39f239799b8
Deleted: sha256:57bd7c9432ae86d63f2342e442eebd0f4dfc340ca61c6a4c7d702b17a315865f
Deleted: sha256:0aaccda2aadfc70ab2248437568fd17f4e8860cf612cc4b7e154b97222dccf91
Deleted: sha256:9dcfe19e941956c63860afee1bec2e2318f6fbd336bc523094ed609a9c437a01
Deleted: sha256:6ff1ee6fc8a0358aeb92f947fb7125cd9e3d68c05be45f5375cb59b98c850b4d
Deleted: sha256:56abdd66ba312859b30b5629268c30d44a6bbef6e2f0ebe923655092855106e8
9. Stop 'sleep' job
$ ps -ef | grep nomad.*executor
nomad    18239 17619  0 13:06 pts/1    00:00:00 grep nomad.*executor
10. Start Fabio job
$ ps -ef | grep nomad.*executor
root     18262 17416 33 13:07 ?        00:00:04 /usr/local/bin/nomad executor /var/nomad/alloc/5729f45b-185c-fa7b-7b05-866a774b8c73/fabio-task/fabio-task-executor.out
nomad    18299 17619  0 13:07 pts/1    00:00:00 grep nomad.*executor
11. Pull Docker image
$ docker pull dockerregistry.adm.myprivatecloud.net/bti1003:latest
latest: Pulling from bti1003
5a132a7e7af1: Extracting [==================================================>] 65.69 MB/65.69 MB
fd2731e4c50c: Download complete
28a2f68d1120: Download complete
a3ed95caeb02: Download complete
87f9029820c8: Download complete
7582f6d126ab: Download complete
failed to register layer: open /dev/mapper/docker-202:32-786433-35e363b33db58a87d6a55b19f3297715b9978052e70edec86f03b51af3e44455: no such file or directory
12. Fabio job dies (10 minutes later), from syslog:
Nov 23 13:17:56 ip-10-75-70-27 nomad[17416]: driver.exec: error destroying executor: 1 error(s) occurred:#012#012* 1 error(s) occurred:#012#012* failed to unmount shared alloc dir "/var/nomad/alloc/5729f45b-185c-fa7b-7b05-866a774b8c73/fabio-task/alloc": invalid argument
Nov 23 13:17:57 ip-10-75-70-27 nomad[17416]: client: failed to destroy context for alloc '5729f45b-185c-fa7b-7b05-866a774b8c73': 2 error(s) occurred:#012#012* 1 error(s) occurred:#012#012* failed to remove the secret dir "/var/nomad/alloc/5729f45b-185c-fa7b-7b05-866a774b8c73/fabio-task/secrets": unmount: invalid argument#012* remove /var/nomad/alloc/5729f45b-185c-fa7b-7b05-866a774b8c73/fabio-task: directory not empty

From the Docker log:

time="2016-11-23T13:07:59.287783575Z" level=error msg="Error trying v2 registry: failed to register layer: open /dev/mapper/docker-202:32-786433-35e363b33db58a87d6a55b19f3297715b9978052e70edec86f03b51af3e44455: no such file or directory"
time="2016-11-23T13:07:59.287830271Z" level=error msg="Attempting next endpoint for pull after error: failed to register layer: open /dev/mapper/docker-202:32-786433-35e363b33db58a87d6a55b19f3297715b9978052e70edec86f03b51af3e44455: no such file or directory"

Fabio Job Spec

job "fabio-job" {
  region = "eu"
  datacenters = ["vpc-poc"]
  type = "system"
  update {
    stagger = "5s"
    max_parallel = 1
  }

  group "fabio-group" {
     ephemeral_disk {
      size    = "500"
    }
    task "fabio-task" {
      driver = "exec"
      config {
        command = "fabio-1.3.4-go1.7.3-linux_amd64"
      }

      artifact {
        source = "https://github.com/eBay/fabio/releases/download/v1.3.4/fabio-1.3.4-go1.7.3-linux_amd64"
     }
     logs {
	max_files = 2
	max_file_size = 5
      }
      resources {
        cpu = 500
        memory = 64
        network {
          mbits = 1

          port "http" {
            static = 9999
          }
          port "ui" {
            static = 9998
          }
        }
      }
    }
  }
}

Sleep Job Spec

job "sleep-job" {
  region = "eu"
  datacenters = ["vpc-poc"]
  type = "system"
  update {
    stagger = "5s"
    max_parallel = 1
  }

  group "sleep-group" {
     ephemeral_disk {
      size    = "500"
    }
    task "sleep-task" {
      driver = "exec"
      config {
        command = "/bin/sleep"
        args = ["1000"]
      }

     logs {
	max_files = 2
	max_file_size = 5
      }
      resources {
        cpu = 500
        memory = 64
        network {
          mbits = 1

        }
      }
    }
  }
}
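
For reference, jobs like these are submitted and inspected with the standard Nomad CLI, along these lines (the file names are illustrative):

$ nomad run fabio.nomad      # system job: one allocation per eligible client
$ nomad status fabio-job     # check allocation placement and task state
$ nomad stop fabio-job       # stop the job after testing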
@magiconair
Contributor

magiconair commented Nov 24, 2016 via email

@magiconair
Contributor

@michaelmcguinness Did you make any progress with this? What do the fabio logs say?

@michaelmcguinness
Author

No. I'm running it with raw_exec to work around the issue. Would I be right in saying that you mean the logs exposed by

$ nomad logs -verbose -stderr <allocid>

If that is the case, there is nothing interesting there...

2016/12/17 15:52:58 [INFO] Version 1.3.4 starting
2016/12/17 15:52:58 [INFO] Go runtime is go1.7.3
2016/12/17 15:52:58 [INFO] Using routing strategy "rnd"
2016/12/17 15:52:58 [INFO] Using routing matching "prefix"
2016/12/17 15:52:58 [INFO] Setting GOGC=800
2016/12/17 15:52:58 [INFO] Setting GOMAXPROCS=1
2016/12/17 15:52:58 [INFO] Metrics disabled
2016/12/17 15:52:58 [INFO] consul: Connecting to "localhost:8500" in datacenter "vpc-poc"
2016/12/17 15:52:58 [INFO] Admin server listening on ":9998"
2016/12/17 15:52:58 [INFO] HTTP proxy listening on :9999
2016/12/17 15:52:58 [INFO] consul: Using dynamic routes
2016/12/17 15:52:58 [INFO] consul: Using tag prefix "urlprefix-"
2016/12/17 15:52:58 [INFO] consul: Watching KV path "/fabio/config"
2016/12/17 15:52:58 [INFO] consul: Health changed to #2443304
2016/12/17 15:52:58 [INFO] consul: Skipping service "_prometheus-node-exporter-http" since agent on node "dns_slave" is down: Agent not live or unreachable
2016/12/17 15:52:58 [INFO] consul: Skipping service "_prometheus-node-exporter-process" since agent on node "dns_slave" is down: Agent not live or unreachable
2016/12/17 15:52:58 [INFO] consul: Skipping service "_nomad-server-nomad-serf" since agent on node "nomad_server1" is down: Agent not live or unreachable
2016/12/17 15:52:58 [INFO] consul: Skipping service "_nomad-server-nomad-rpc" since agent on node "nomad_server1" is down: Agent not live or unreachable
2016/12/17 15:52:58 [INFO] consul: Skipping service "_nomad-server-nomad-serf" since agent on node "nomad_server2" is down: Agent not live or unreachable
2016/12/17 15:52:58 [INFO] consul: Skipping service "_nomad-server-nomad-rpc" since agent on node "nomad_server2" is down: Agent not live or unreachable
2016/12/17 15:52:58 [INFO] consul: Skipping service "_nomad-server-nomad-http" since agent on node "nomad_server2" is down: Agent not live or unreachable
2016/12/17 15:52:58 [INFO] consul: Skipping service "_nomad-server-nomad-serf" since agent on node "nomad_server3" is down: Agent not live or unreachable
2016/12/17 15:52:58 [INFO] consul: Skipping service "_prometheus-node-exporter-http" since agent on node "nomad_server3" is down: Agent not live or unreachable
2016/12/17 15:52:58 [INFO] consul: Skipping service "_prometheus-node-exporter-process" since agent on node "nomad_server3" is down: Agent not live or unreachable
2016/12/17 15:52:58 [INFO] consul: Manual config changed to #2409046
2016/12/17 15:52:58 [INFO] Updated config to
2016/12/17 15:52:58 [INFO] consul: Registered fabio with id "fabio-ip-10-75-70-74-9998"
2016/12/17 15:52:58 [INFO] consul: Registered fabio with address "10.75.70.74"
2016/12/17 15:52:58 [INFO] consul: Registered fabio with tags ""
2016/12/17 15:52:58 [INFO] consul: Registered fabio with health check to "http://[10.75.70.74]:9998/health"
2016/12/17 15:52:58 [INFO] consul: Health changed to #2443305

and then just repeated messages about Consul.

@magiconair
Contributor

If there is nothing in the fabio logs then that supports my suspicion that this is a Docker and/or Nomad issue. Maybe the way fabio interacts with Docker, or the way it shuts down, triggers this. However, since you're running this with a single listener, you could try to simulate it with a simple Go program that runs a web server, then a reverse proxy, and then a reverse proxy that makes long-polling HTTP requests.

Below is a simple reverse proxy for testing. Store it in ~/gopath/src/fabiotest/main.go and then build with go build in that directory. Make sure you have set export GOPATH=~/gopath. When running, you can test both endpoints with curl localhost:9998 and curl localhost:9999. Except for the long-polling outgoing connection to consul, this is in essence the core of fabio. :)

You can use the following Dockerfile:

FROM scratch
ADD fabiotest /fabiotest
EXPOSE 9998 9999
CMD ["/fabiotest"]

See how far you get with this.

package main

import (
	"flag"
	"fmt"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	var proxyAddr, uiAddr, proxyURL string
	flag.StringVar(&proxyAddr, "proxy", ":9999", "host:port of the proxy")
	flag.StringVar(&uiAddr, "ui", ":9998", "host:port of the ui")
	flag.StringVar(&proxyURL, "proxyURL", "https://www.google.com/", "proxy url")
	flag.Parse()

	log.Println("fabiotest starting")

	go func() {
		u, err := url.Parse(proxyURL)
		if err != nil {
			log.Fatal("proxyURL:", err)
		}
		log.Println("proxy listening on", proxyAddr, "proxying", u)
		rp := httputil.NewSingleHostReverseProxy(u)
		if err := http.ListenAndServe(proxyAddr, rp); err != nil {
			log.Fatal("proxy:", err)
		}
	}()

	go func() {
		log.Println("UI listening on", uiAddr)
		http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprintln(w, "UI is OK")
		})
		if err := http.ListenAndServe(uiAddr, nil); err != nil {
			log.Fatal("ui:", err)
		}
	}()

	log.Println("Press CTRL-C to stop")
	select {}
}
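
Roughly, building and smoke-testing it locally should look like this, assuming the ~/gopath/src/fabiotest layout above:

$ export GOPATH=~/gopath
$ cd ~/gopath/src/fabiotest && go build

# run it, then exercise both listeners from another shell
$ ./fabiotest &
$ curl localhost:9998    # UI handler, answers "UI is OK"
$ curl localhost:9999    # forwarded to the -proxyURL target (https://www.google.com/ by default)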

@michaelmcguinness
Author

Thanks, I'll give this a go and let you know the result. I also upgraded to Nomad 0.5.1 just in case, but there was no change.

@michaelmcguinness
Author

The issue remains with this test code, so as you suspected it is a more generic issue. I also tested it with the Consul binary and got the same result. Thank you for your time on this. I'll take it over to the Nomad issues page.

As a side note, what should I have been expecting from 'curl localhost:9999'? I am getting a 404.
Also, I noticed that you posted a Dockerfile. As the issue is specific to the exec driver, I was not sure what you wanted me to do with it.

@michaelmcguinness
Author

You may (or may not :)) be interested to know that this issue has something to do with the kernel version and the LVM storage driver implementation. I haven't quite figured it out, but switching to AUFS makes the issue go away.
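
For anyone else hitting this, the switch is roughly the following on Ubuntu 14.04 with Docker 1.12, assuming the daemon is configured via /etc/default/docker (note that images stored under the old driver are not migrated):

# the aufs kernel module ships in the extra kernel package
$ sudo apt-get install linux-image-extra-$(uname -r)
$ echo 'DOCKER_OPTS="--storage-driver=aufs"' | sudo tee -a /etc/default/docker
$ sudo service docker restart
$ docker info | grep -i 'storage driver'    # should now report aufs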

Thanks again for your attention and a great utility.

@magiconair
Contributor

I am interested, and I'm glad you've figured it out. If you have a reference issue for Nomad, feel free to link it.

Thanks and merry Christmas. Enjoy the holidays.

@michaelmcguinness
Author

I raised this with Kelsey Hightower, as it was his demo that made me look at it; not for a fix, just for some info about his environment. I'm not sure I am going to burn time raising it with Nomad, as we are way behind here on our kernel revision (one of the many things on my to-do list). It seems to me that LVM may not be the strategic storage option for Docker, so I think fixing forward by upgrading is the way to go.
