FATAL error when metrics cannot be delivered #320

Closed
simonsparks opened this issue Jul 18, 2017 · 5 comments
@simonsparks

In our deployment scenario, Fabio is configured to deliver metrics to a remote StatsD collector.
When the infrastructure is provisioned, Fabio and other core services are started before the StatsD service, so there is a period during which metrics cannot be collected.

The problem we found is that, if Fabio can't find the StatsD endpoint on startup, it logs a fatal error and exits. Presumably this might also happen after startup if the StatsD service was temporarily unavailable. We haven't tested whether this occurs for other supported metrics implementations as well.

I think it would be preferable for Fabio to continue operating without delivering its metrics rather than exiting.

An example log extract of the observed behaviour:

2017/07/17 23:22:26 [INFO] Runtime config
{
    "Proxy": {
        "Strategy": "rnd",
        "Matcher": "prefix",
        "NoRouteStatus": 404,
        "MaxConn": 10000,
        "ShutdownWait": 0,
        "DialTimeout": 30000000000,
        "ResponseHeaderTimeout": 0,
        "KeepAliveTimeout": 0,
        "FlushInterval": 1000000000,
        "LocalIP": "10.180.10.133",
        "ClientIPHeader": "",
        "TLSHeader": "",
        "TLSHeaderValue": "",
        "GZIPContentTypes": null,
        "RequestID": ""
    },
    "Registry": {
        "Backend": "consul",
        "Static": {
            "Routes": ""
        },
        "File": {
            "Path": ""
        },
        "Consul": {
            "Addr": "localhost:8500",
            "Scheme": "http",
            "Token": "",
            "KVPath": "/fabio/config",
            "TagPrefix": "urlprefix-",
            "Register": true,
            "ServiceAddr": ":9998",
            "ServiceName": "fabio",
            "ServiceTags": null,
            "ServiceStatus": [
                "passing"
            ],
            "CheckInterval": 1000000000,
            "CheckTimeout": 3000000000,
            "CheckScheme": "http",
            "CheckTLSSkipVerify": false
        },
        "Timeout": 10000000000,
        "Retry": 500000000
    },
    "Listen": [
        {
            "Addr": ":9999",
            "Proto": "http",
            "ReadTimeout": 0,
            "WriteTimeout": 0,
            "CertSource": {
                "Name": "",
                "Type": "",
                "CertPath": "",
                "KeyPath": "",
                "ClientCAPath": "",
                "CAUpgradeCN": "",
                "Refresh": 0,
                "Header": null
            },
            "StrictMatch": false,
            "TLSMinVersion": 0,
            "TLSMaxVersion": 0,
            "TLSCiphers": null
        },
        {
            "Addr": ":443",
            "Proto": "https",
            "ReadTimeout": 0,
            "WriteTimeout": 0,
            "CertSource": {
                "Name": "public",
                "Type": "path",
                "CertPath": "/etc/fabio.d/certs/server",
                "KeyPath": "",
                "ClientCAPath": "/etc/fabio.d/certs/client",
                "CAUpgradeCN": "ApiGateway",
                "Refresh": 5000000000,
                "Header": null
            },
            "StrictMatch": false,
            "TLSMinVersion": 0,
            "TLSMaxVersion": 0,
            "TLSCiphers": null
        }
    ],
    "Log": {
        "AccessFormat": "common",
        "AccessTarget": "stdout",
        "RoutesFormat": "delta"
    },
    "Metrics": {
        "Target": "statsd",
        "Prefix": "{{clean .Exec}}_{{clean .Hostname}}",
        "Names": "{{clean .Service}}.{{clean .Host}}.{{clean .Path}}.{{clean .TargetURL.Host}}",
        "Interval": 30000000000,
        "GraphiteAddr": "",
        "StatsDAddr": "metrics-statsd.service.consul:9125",
        "Circonus": {
            "APIKey": "",
            "APIApp": "fabio",
            "APIURL": "",
            "CheckID": "",
            "BrokerID": ""
        }
    },
    "UI": {
        "Listen": {
            "Addr": ":9998",
            "Proto": "http",
            "ReadTimeout": 0,
            "WriteTimeout": 0,
            "CertSource": {
                "Name": "",
                "Type": "",
                "CertPath": "",
                "KeyPath": "",
                "ClientCAPath": "",
                "CAUpgradeCN": "",
                "Refresh": 0,
                "Header": null
            },
            "StrictMatch": false,
            "TLSMinVersion": 0,
            "TLSMaxVersion": 0,
            "TLSCiphers": null
        },
        "Color": "teal",
        "Title": "Load Balancer",
        "Access": "rw"
    },
    "Runtime": {
        "GOGC": 800,
        "GOMAXPROCS": 1
    },
    "ProfileMode": "",
    "ProfilePath": "/tmp"
}
2017/07/17 23:22:26 [INFO] Version 1.5.1 starting
2017/07/17 23:22:26 [INFO] Go runtime is go1.8.3
2017/07/17 23:22:26 [INFO] Sending metrics to StatsD on metrics-statsd.service.consul:9125 as "fabio_ip-10-180-10-133"
2017/07/17 23:22:26 [FATAL]  cannot connect to StatsD: lookup metrics-statsd.service.consul on 127.0.0.1:53: no such host
@simonsparks
Author

simonsparks commented Jul 18, 2017

Looks like this is due to the domain name resolution failing when the Registry is created, and the same pattern applies to other metrics implementations:

func gmStatsDRegistry(prefix, addr string, interval time.Duration) (Registry, error) {
	if addr == "" {
		return nil, errors.New(" statsd addr missing")
	}

	a, err := net.ResolveUDPAddr("udp", addr)
	if err != nil {
		return nil, fmt.Errorf(" cannot connect to StatsD: %s", err)
	}

	r := gm.NewRegistry()
	go statsd.StatsD(r, interval, prefix, a)
	return &gmRegistry{r}, nil
}

Because we use Consul as the DNS for our StatsD service, the address is unresolvable until the service has started and registered. Perhaps a configuration option to postpone or retry address resolution would be helpful.

magiconair added a commit that referenced this issue Jul 25, 2017
This patch retries configuring metrics during startup
to mitigate a race between fabio and metrics availability.

Fixes #320
@magiconair
Contributor

@simonsparks Indeed. Fabio waits for consul but not the metrics. I've hijacked the registry retry config parameters and made this more robust. Could you check whether that solves your issue?

@pvandervelde

We ran into this issue this week when fabio came up before the local consul instance did. We're running fabio 1.5.2 at the moment.

@magiconair
Contributor

@pvandervelde I've taken the patch, added two proper metrics config parameters for it, metrics.timeout and metrics.retry, and merged it to master. The default behavior is now to retry every 500ms for 10s, just like for the registry.
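
For illustration, the defaults described above might be written out in a fabio.properties file roughly as follows. The `metrics.timeout` and `metrics.retry` names and their values come from the comment above; the `metrics.target` and `metrics.statsd.addr` keys and the duration syntax are assumptions based on the runtime config in the original report, so check them against the documentation for your Fabio version.

```properties
# Assumed key names for selecting the StatsD backend
metrics.target      = statsd
metrics.statsd.addr = metrics-statsd.service.consul:9125

# Keep retrying metrics setup for up to 10s, once every 500ms
metrics.timeout     = 10s
metrics.retry       = 500ms
```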

@pvandervelde

Sweet. I'll grab the next release :) Thanks for fixing it!

magiconair added this to the 1.6.0 milestone Oct 10, 2017
magiconair modified the milestones: 1.6, 1.5.3 Nov 3, 2017