ambassador crashing on node with wrong DNS resolver address due to misconfigured kubelet #1289

Closed
372046933 opened this issue Jul 31, 2018 · 4 comments

Comments

@372046933

After running the following deployment script,
curl https://raw.githubusercontent.com/kubeflow/kubeflow/v0.2.2/scripts/deploy.sh | bash
Ambassador failed to start on one node.

 kubectl logs --namespace kubeflow ambassador-849fb9c8c5-kgrkb ambassador
./entrypoint.sh: set: line 65: can't access tty; job control turned off
2018-07-31 05:46:50 kubewatch 0.30.1 INFO: generating config with gencount 1 (4 changes)
2018-07-31 05:46:56 kubewatch 0.30.1 WARNING: Scout: could not post report: HTTPSConnectionPool(host='kubernaut.io', port=443): Max retries exceeded with url: /scout (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f7383625940>: Failed to establish a new connection: [Errno -3] Try again',))
2018-07-31 05:46:56 kubewatch 0.30.1 INFO: Scout reports {"latest_version": "0.30.1", "exception": "could not post report: HTTPSConnectionPool(host='kubernaut.io', port=443): Max retries exceeded with url: /scout (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f7383625940>: Failed to establish a new connection: [Errno -3] Try again',))", "cached": false, "timestamp": 1533016011.063859}
[2018-07-31 05:46:56.133][10][info][upstream] source/common/upstream/cluster_manager_impl.cc:132] cm init: all clusters initialized
[2018-07-31 05:46:56.133][10][info][config] source/server/configuration_impl.cc:55] loading 1 listener(s)
[2018-07-31 05:46:56.150][10][info][config] source/server/configuration_impl.cc:95] loading tracing configuration
[2018-07-31 05:46:56.150][10][info][config] source/server/configuration_impl.cc:122] loading stats sink configuration
AMBASSADOR: starting diagd
AMBASSADOR: starting Envoy
AMBASSADOR: waiting
PIDS: 11:diagd 12:envoy 13:kubewatch
[2018-07-31 05:46:56.556][14][info][main] source/server/server.cc:184] initializing epoch 0 (hot restart version=9.200.16384.127.options=capacity=16384, num_slots=8209 hash=228984379728933363)
[2018-07-31 05:46:57.574][14][info][config] source/server/configuration_impl.cc:55] loading 1 listener(s)
[2018-07-31 05:46:57.767][14][info][config] source/server/configuration_impl.cc:95] loading tracing configuration
[2018-07-31 05:46:57.767][14][info][config] source/server/configuration_impl.cc:122] loading stats sink configuration
[2018-07-31 05:46:57.769][14][info][main] source/server/server.cc:359] starting main dispatch loop
2018-07-31 05:47:04 diagd 0.30.1 WARNING: Scout: could not post report: HTTPSConnectionPool(host='kubernaut.io', port=443): Max retries exceeded with url: /scout (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0bee6d95f8>: Failed to establish a new connection: [Errno -3] Try again',))
2018-07-31 05:47:04 diagd 0.30.1 INFO: Scout reports {"latest_version": "0.30.1", "exception": "could not post report: HTTPSConnectionPool(host='kubernaut.io', port=443): Max retries exceeded with url: /scout (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0bee6d95f8>: Failed to establish a new connection: [Errno -3] Try again',))", "cached": false, "timestamp": 1533016019.808133}
2018-07-31 05:47:14 kubewatch 0.30.1 INFO: generating config with gencount 2 (4 changes)
2018-07-31 05:47:19 kubewatch 0.30.1 WARNING: Scout: could not post report: HTTPSConnectionPool(host='kubernaut.io', port=443): Max retries exceeded with url: /scout (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f6fbb8468d0>: Failed to establish a new connection: [Errno -3] Try again',))
2018-07-31 05:47:19 kubewatch 0.30.1 INFO: Scout reports {"latest_version": "0.30.1", "exception": "could not post report: HTTPSConnectionPool(host='kubernaut.io', port=443): Max retries exceeded with url: /scout (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f6fbb8468d0>: Failed to establish a new connection: [Errno -3] Try again',))", "cached": false, "timestamp": 1533016034.702365}
[2018-07-31 05:47:19.770][26][info][upstream] source/common/upstream/cluster_manager_impl.cc:132] cm init: all clusters initialized
[2018-07-31 05:47:19.771][26][info][config] source/server/configuration_impl.cc:55] loading 1 listener(s)
[2018-07-31 05:47:19.788][26][info][config] source/server/configuration_impl.cc:95] loading tracing configuration
[2018-07-31 05:47:19.788][26][info][config] source/server/configuration_impl.cc:122] loading stats sink configuration
unable to initialize hot restart: previous envoy process is still initializing
starting hot-restarter with target: /application/start-envoy.sh
forking and execing new child process at epoch 0
forked new child process with PID=14
got SIGHUP
forking and execing new child process at epoch 1
forked new child process with PID=27
got SIGCHLD
PID=27 exited with code=1
Due to abnormal exit, force killing all child processes and exiting
force killing PID=14
exiting due to lack of child processes
AMBASSADOR: envoy exited with status 1
Here's the envoy.json we were trying to run with:
{
  "listeners": [

    {
      "address": "tcp://0.0.0.0:80",

      "filters": [
        {
          "type": "read",
          "name": "http_connection_manager",
          "config": {"codec_type": "auto",
            "stat_prefix": "ingress_http",
            "access_log": [
              {
                "format": "ACCESS [%START_TIME%] \"%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%\" %RESPONSE_CODE% %RESPONSE_FLAGS% %BYTES_RECEIVED% %BYTES_SENT% %DURATION% %RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)% \"%REQ(X-FORWARDED-FOR)%\" \"%REQ(USER-AGENT)%\" \"%REQ(X-REQUEST-ID)%\" \"%REQ(:AUTHORITY)%\" \"%UPSTREAM_HOST%\"\n",
                "path": "/dev/fd/1"
              }
            ],
            "route_config": {
              "virtual_hosts": [
                {
                  "name": "backend",
                  "domains": ["*"],"routes": [

                    {
                      "timeout_ms": 3000,"prefix": "/ambassador/v0/check_ready","prefix_rewrite": "/ambassador/v0/check_ready",
                      "weighted_clusters": {
                          "clusters": [

                                 { "name": "cluster_127_0_0_1_8877", "weight": 100.0 }

                          ]
                      }

                    }
                    ,

                    {
                      "timeout_ms": 3000,"prefix": "/ambassador/v0/check_alive","prefix_rewrite": "/ambassador/v0/check_alive",
                      "weighted_clusters": {
                          "clusters": [

                                 { "name": "cluster_127_0_0_1_8877", "weight": 100.0 }

                          ]
                      }

                    }
                    ,

                    {
                      "timeout_ms": 3000,"prefix": "/ambassador/v0/","prefix_rewrite": "/ambassador/v0/",
                      "weighted_clusters": {
                          "clusters": [

                                 { "name": "cluster_127_0_0_1_8877", "weight": 100.0 }

                          ]
                      }

                    }
                    ,

                    {
                      "timeout_ms": 3000,"prefix": "/tfjobs/","prefix_rewrite": "/tfjobs/",
                      "weighted_clusters": {
                          "clusters": [

                                 { "name": "cluster_tf_job_dashboard_default", "weight": 100.0 }

                          ]
                      }

                    }
                    ,

                    {
                      "timeout_ms": 3000,"prefix": "/k8s/ui/","prefix_rewrite": "/",
                      "weighted_clusters": {
                          "clusters": [

                                 { "name": "cluster_kubernetes_dashboard_kube_system_otls", "weight": 100.0 }

                          ]
                      }

                    }
                    ,

                    {
                      "timeout_ms": 300000,"prefix": "/user/","prefix_rewrite": "/user/",
                      "weighted_clusters": {
                          "clusters": [

                                 { "name": "cluster_tf_hub_lb_default", "weight": 100.0 }

                          ]
                      }

                    }
                    ,

                    {
                      "timeout_ms": 300000,"prefix": "/hub/","prefix_rewrite": "/hub/",
                      "weighted_clusters": {
                          "clusters": [

                                 { "name": "cluster_tf_hub_lb_default", "weight": 100.0 }

                          ]
                      }

                    }
                    ,

                    {
                      "timeout_ms": 3000,"prefix": "/","prefix_rewrite": "/",
                      "weighted_clusters": {
                          "clusters": [

                                 { "name": "cluster_centraldashboard_default", "weight": 100.0 }

                          ]
                      }

                    }


                  ]
                }
              ]
            },
            "filters": [
              {
                "name": "cors",
                "config": {}
              },{"type": "decoder",
                "name": "router",
                "config": {}
              }
            ]
          }
        }
      ]
    }
  ],
  "admin": {
    "address": "tcp://127.0.0.1:8001",
    "access_log_path": "/tmp/admin_access_log"
  },
  "cluster_manager": {
    "clusters": [
      {
        "name": "cluster_127_0_0_1_8877",
        "connect_timeout_ms": 3000,
        "type": "strict_dns",
        "lb_type": "round_robin",
        "hosts": [
          {
            "url": "tcp://127.0.0.1:8877"
          }

        ]},
      {
        "name": "cluster_centraldashboard_default",
        "connect_timeout_ms": 3000,
        "type": "strict_dns",
        "lb_type": "round_robin",
        "hosts": [
          {
            "url": "tcp://centraldashboard.default:80"
          }

        ]},
      {
        "name": "cluster_kubernetes_dashboard_kube_system_otls",
        "connect_timeout_ms": 3000,
        "type": "strict_dns",
        "lb_type": "round_robin",
        "hosts": [
          {
            "url": "tcp://kubernetes-dashboard.kube-system:443"
          }

        ],
        "ssl_context": {

        }},
      {
        "name": "cluster_tf_hub_lb_default",
        "connect_timeout_ms": 3000,
        "type": "strict_dns",
        "lb_type": "round_robin",
        "hosts": [
          {
            "url": "tcp://tf-hub-lb.default:80"
          }

        ]},
      {
        "name": "cluster_tf_job_dashboard_default",
        "connect_timeout_ms": 3000,
        "type": "strict_dns",
        "lb_type": "round_robin",
        "hosts": [
          {
            "url": "tcp://tf-job-dashboard.default:80"
          }

        ]}

    ]
  },
  "statsd_udp_ip_address": "127.0.0.1:8125",
  "stats_flush_interval_ms": 1000
}AMBASSADOR: shutting down
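
For anyone debugging a similar crash: every cluster in the config above is of type strict_dns, so each upstream host has to be resolved through cluster DNS. The following is a minimal Python sketch, not part of Ambassador or Envoy, that lists the cluster hosts from such a config and reports which ones fail to resolve. It assumes the dumped config has been loaded into a dict and that the code runs somewhere that uses the same resolver as the failing pod; the function names are hypothetical.

```python
import socket
from urllib.parse import urlsplit


def cluster_hosts(config):
    """Return (cluster name, host, port) for each host URL in an Envoy v1-style config dict."""
    found = []
    for cluster in config.get("cluster_manager", {}).get("clusters", []):
        for host in cluster.get("hosts", []):
            parts = urlsplit(host["url"])  # e.g. tcp://tf-hub-lb.default:80
            found.append((cluster["name"], parts.hostname, parts.port))
    return found


def check_resolution(config):
    """Try to resolve every cluster host; return (cluster name, host, error) for each failure."""
    failures = []
    for name, host, port in cluster_hosts(config):
        try:
            socket.getaddrinfo(host, port)
        except socket.gaierror as exc:
            # Name resolution failed, for example because the pod's resolver is unreachable.
            failures.append((name, host, str(exc)))
    return failures
```

To use it, load the JSON with json.load and pass the resulting dict to check_resolution. If names such as centraldashboard.default fail on only one node, the resolver configuration of that node is a reasonable place to look next.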

@jlewi
Contributor

jlewi commented Jul 31, 2018

Does ambassador start on the other nodes? What happened when it restarted? Did it just crash loop?

Are you running on minikube? Do you have RBAC installed (see #734)?

/cc @kflynn

@jlewi
Contributor

jlewi commented Jul 31, 2018

Is kube-dns running? See #1134.

@372046933
Author

@jlewi Thanks for your kind reply. I checked the DNS service on every node by running nslookup kubernetes in a busybox pod, and found that the node where ambassador crashed had a wrong DNS resolver address. The root cause was the kubelet configuration on that node, which used an erroneous --cluster-dns value.
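
The per-node check described in this comment can be sketched roughly as follows in Python. It assumes you have already captured the contents of /etc/resolv.conf from a pod on each node (for example from a busybox pod) and that you know the address of the cluster DNS service. The node names, IP addresses, and function names below are hypothetical placeholders, not values from this cluster.

```python
def nameservers(resolv_conf_text):
    """Extract nameserver addresses from the text of a resolv.conf file."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def misconfigured_nodes(resolv_by_node, expected_dns_ip):
    """Return the nodes whose pods do not list the expected DNS address."""
    return sorted(
        node
        for node, text in resolv_by_node.items()
        if expected_dns_ip not in nameservers(text)
    )


# Hypothetical sample data: one node hands its pods the wrong resolver.
samples = {
    "node-a": "nameserver 10.96.0.10\nsearch default.svc.cluster.local svc.cluster.local\n",
    "node-b": "nameserver 10.0.0.10\nsearch default.svc.cluster.local svc.cluster.local\n",
}
print(misconfigured_nodes(samples, "10.96.0.10"))  # prints ['node-b']
```

With the default DNS policy, the kubelet passes its --cluster-dns value into each pod's resolv.conf, so a node flagged by a check like this is a candidate for a wrong kubelet flag.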

@jlewi
Contributor

jlewi commented Aug 2, 2018

Great, glad it's fixed.

@jlewi closed this as completed Aug 2, 2018
@jlewi changed the title from "Ambassador failed to start using the 0.2.2 deploy script" to "ambassador crashing on node with wrong DNS resolver address due to misconfigured kubelet" Aug 2, 2018
surajkota pushed a commit to surajkota/kubeflow that referenced this issue Jun 13, 2022
* Add owners for tf and pytorch operators

* Update test owners