
SRV records changed to addr instead of node format #7360

Open
jameshartig opened this issue Feb 27, 2020 · 10 comments
Labels
theme/dns Using Consul as a DNS provider, DNS related issues

Comments

@jameshartig
Contributor

Overview of the Issue

Previously SRV records looked like:

;; ANSWER SECTION:
debug.packager-api.query.admiral. 1 IN  SRV     1 1 21748 ivy.node.staging-gce-us-east1.admiral.
debug.packager-api.query.admiral. 1 IN  SRV     1 1 16620 quinn.node.staging-gce-us-east1.admiral.

;; ADDITIONAL SECTION:
ivy.node.staging-gce-us-east1.admiral. 1 IN A   10.128.0.8
quinn.node.staging-gce-us-east1.admiral. 1 IN A 10.142.0.40

but with 1.7.1 they now look like:

;; ANSWER SECTION:
debug.packager-api.query.admiral. 1 IN  SRV     1 1 28945 0a800008.addr.staging-gce-us-east1.admiral.
debug.packager-api.query.admiral. 1 IN  SRV     1 1 19605 0a8e0028.addr.staging-gce-us-east1.admiral.

;; ADDITIONAL SECTION:
0a800008.addr.staging-gce-us-east1.admiral. 1 IN A 10.128.0.8
0a8e0028.addr.staging-gce-us-east1.admiral. 1 IN A 10.142.0.40

Reproduction Steps

We had previously relied on the output being in the format <node>.node.<dc>.admiral, but it now looks like it has changed to <random?>.addr.<dc>.admiral. I don't see anything in the CHANGELOG mentioning this; was it intentional or accidental? Can we choose the behavior?

Consul info for both Client and Server

Client info
agent:
        check_monitors = 0
        check_ttls = 71
        checks = 103
        services = 71
build:
        prerelease =
        revision = 2cf0a3c8
        version = 1.7.1
consul:
        acl = disabled
        known_servers = 3
        server = false
runtime:
        arch = amd64
        cpu_count = 4
        goroutines = 204
        max_procs = 4
        os = linux
        version = go1.13.8
serf_lan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 72
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 1197
        members = 5
        query_queue = 0
        query_time = 1
Server info
agent:
        check_monitors = 0
        check_ttls = 6
        checks = 6
        services = 6
build:
        prerelease =
        revision = 2cf0a3c8
        version = 1.7.1
consul:
        acl = disabled
        bootstrap = false
        known_datacenters = 7
        leader = false
        leader_addr = 10.142.15.197:8300
        server = true
raft:
        applied_index = 19366355
        commit_index = 19366355
        fsm_pending = 0
        last_contact = 36.245636ms
        last_log_index = 19366355
        last_log_term = 47
        last_snapshot_index = 19359054
        last_snapshot_term = 47
        latest_configuration = [{Suffrage:Voter ID:5492e882-3ad4-af25-9b37-4e23b2ebf1f5 Address:10.142.15.198:8300} {Suffrage:Voter ID:ab3569f5-6fa5-c69e-94d4-d662d7738261 Address:10.142.15.197:8300} {Suffrage:Voter ID:63f20cf7-dffa-8890-3086-db5382449cf8 Address:10.142.15.199:8300}]
        latest_configuration_index = 0
        num_peers = 2
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Follower
        term = 47
runtime:
        arch = amd64
        cpu_count = 1
        goroutines = 188
        max_procs = 1
        os = linux
        version = go1.13.8
serf_lan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 72
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 1197
        members = 5
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 1297
        members = 21
        query_queue = 0
        query_time = 1

Operating system and Environment details

CentOS 7 on GCE

Log Fragments

N/A

@danielehc danielehc added the theme/dns Using Consul as a DNS provider, DNS related issues label Apr 17, 2020
@scalp42
Contributor

scalp42 commented Apr 21, 2020

Agree, we're running into the same issue here as well:

root@nomad-compute-i-0943ad3695c5bd2b1 [dev-usw2-dev1] ~ # dig jobs-shawn-postgres.service.consul SRV @127.0.0.1 -p 8600

; <<>> DiG 9.11.3-1ubuntu1.11-Ubuntu <<>> jobs-shawn-postgres.service.consul SRV @127.0.0.1 -p 8600
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 43432
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 2

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;jobs-shawn-postgres.service.consul. IN	SRV

;; ANSWER SECTION:
jobs-shawn-postgres.service.consul. 0 IN SRV	1 1 23258 0a16004b.addr.dev-usw2-core1.consul.

;; ADDITIONAL SECTION:
0a16004b.addr.dev-usw2-core1.consul. 0 IN A	10.22.0.75

;; Query time: 1 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Tue Apr 21 01:12:51 UTC 2020
;; MSG SIZE  rcvd: 134

We leverage unbound and consul-template to cache records, but we can't figure out where 0a16004b is coming from.

This is definitely a breaking change.

@evandam

evandam commented Apr 21, 2020

I'm seeing the same and wondering where the id.addr.datacenter.consul is coming from. It's a breaking change for us, too.

@scalp42
Contributor

scalp42 commented Apr 21, 2020

As a workaround, for now we're pinning Consul 1.6.5, whose answer format still includes the node name:

nomad-compute-i-0cdc8320aa6b1b1aa.node.dev-usw2-core1.consul

@Y-Tian

Y-Tian commented Apr 22, 2020

Bumping this issue. Is there any way we can help speed up a fix?

@jameshartig
Contributor Author

I forgot to mention this in the OP: this was broken in #6792, and given that the commit message says:

Current implementation returns the node name instead of the service
address.
With this fix when querying for SRV record service address is return in
the SRV record.

This seems intentional.

@dansteen

I just ran into this as well when trying to upgrade from 1.6.1, and found that, as you say above, the behavior was changed between 1.6.9 and 1.7.0.

Unfortunately, it's been this way for a year now, and I don't think it's going to be reverted so I'll have to figure out a workaround.

@zabranskiy

zabranskiy commented Dec 21, 2021

Bumping this issue. Please fix it or suggest a workaround. We have the same problem upgrading Consul from 1.4 to 1.11, and it is blocking us.

@blake
Member

blake commented Jan 14, 2022

Hi folks, I dug into this a bit to understand the difference in SRV response behavior before and after Consul 1.7.0. I was only able to reproduce this behavior with a specific service registration configuration. Below are my findings.

I’m seeing the same and wondering where the id.addr.datacenter.consul is coming from. It’s a breaking change for us, too.

The undocumented .addr subdomain (docs are being worked on in #12056) was added in Consul 0.7.1 with PR #2446 to address the issue reported in #832 wherein SRV responses did not support returning a service address that was different from the node’s address. For example:

# agent-config.hcl

# Node's IP
bind_addr = "203.0.113.0"

services {
  name = "apache"

  # Service is configured with a different IP
  address = "192.0.2.10"
  port = 80
}

This was broken in #6792 and given the commit message says:

Current implementation returns the node name instead of the service
address.
With this fix when querying for SRV record service address is return in
the SRV record.

This seems intentional.

PR #6792 was meant to fix a bug where, if a service was registered with the address parameter set to an FQDN, Consul incorrectly returned a CNAME record in the SRV target field of the DNS answer section, which violates RFC 2782:

Target
The domain name of the target host. There MUST be one or more
address records for this name, the name MUST NOT be an alias (in
the sense of RFC 1034 or RFC 2181).

For example:

Behavior on Consul 1.6.10
$ dig @127.0.0.1 -p 8600 apache.service.consul -t srv
; <<>> DiG 9.16.24 <<>> @127.0.0.1 -p 8600 apache.service.consul -t srv
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 16007
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 3
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;apache.service.consul.         IN      SRV

;; ANSWER SECTION:
apache.service.consul.  0       IN      SRV     1 1 80 blake-C02YX6QSLVCG.node.dc1.consul.

;; ADDITIONAL SECTION:
blake-C02YX6QSLVCG.node.dc1.consul. 0 IN CNAME  apache.example.com.
blake-C02YX6QSLVCG.node.dc1.consul. 0 IN TXT    "consul-network-segment="

;; Query time: 11 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Thu Jan 13 22:04:44 PST 2022
;; MSG SIZE  rcvd: 172

After the fix:

Consul 1.7.0 and later
$ dig @127.0.0.1 -p 8600 apache.service.consul -t srv
; <<>> DiG 9.16.24 <<>> @127.0.0.1 -p 8600 apache.service.consul -t srv
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 18368
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 2
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;apache.service.consul.         IN      SRV

;; ANSWER SECTION:
apache.service.consul.  0       IN      SRV     1 1 80 apache.example.com.

;; ADDITIONAL SECTION:
blake-C02YX6QSLVCG.node.dc1.consul. 0 IN TXT    "consul-network-segment="

;; Query time: 8 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Thu Jan 13 22:05:01 PST 2022
;; MSG SIZE  rcvd: 152

That PR, however, introduced a change in behavior that is the source of this bug report.

Prior to Consul 1.7.0, if an address was configured on a service and that IP address matched the node's configured advertise address, Consul would return the node's hostname in the target field of the SRV response.

consul/agent/dns.go, lines 1522-1540 at commit 2e5c86c:

// Add the extra record
records, meta := d.formatNodeRecord(cfg, node.Node, addr, srvRec.Target, dns.TypeANY, ttl, edns, maxRecursionLevel, cfg.NodeMetaTXT)
if len(records) > 0 {
	// Use the node address if it doesn't differ from the service address
	if addr == node.Node.Address {
		resp.Extra = append(resp.Extra, records...)
	} else {
		// If it differs from the service address, give a special response in the
		// 'addr.consul' domain with the service IP encoded in it. We have to do
		// this because we can't put an IP in the target field of an SRV record.
		switch record := records[0].(type) {
		// IPv4
		case *dns.A:
			addr := hex.EncodeToString(record.A)
			// Take the last 8 chars (4 bytes) of the encoded address to avoid junk bytes
			srvRec.Target = fmt.Sprintf("%s.addr.%s.%s", addr[len(addr)-(net.IPv4len*2):], dc, d.domain)
			record.Hdr.Name = srvRec.Target
			resp.Extra = append(resp.Extra, record)
In Consul 1.7.0 and later, this was changed to always return the address configured on the service in the <hex IP>.addr format (lines 1838-1840).

consul/agent/dns.go, lines 1813-1840 at commit f3ac9da:

serviceAddr := d.agent.TranslateServiceAddress(dc, node.Service.Address, node.Service.TaggedAddresses, addrTranslate)
nodeAddr := d.agent.TranslateAddress(node.Node.Datacenter, node.Node.Address, node.Node.TaggedAddresses, addrTranslate)
if serviceAddr == "" && nodeAddr == "" {
	return nil, nil
}
nodeIPAddr := net.ParseIP(nodeAddr)
serviceIPAddr := net.ParseIP(serviceAddr)
// There is no service address and the node address is an IP
if serviceAddr == "" && nodeIPAddr != nil {
	if node.Node.Address != nodeAddr {
		// Do not CNAME node address in case of WAN address
		return d.makeRecordFromIP(dc, nodeIPAddr, node, req, ttl)
	}
	return d.makeRecordFromServiceNode(dc, node, nodeIPAddr, req, ttl)
}
// There is no service address and the node address is a FQDN (external service)
if serviceAddr == "" {
	return d.makeRecordFromFQDN(dc, nodeAddr, node, req, ttl, cfg, maxRecursionLevel)
}
// The service address is an IP
if serviceIPAddr != nil {
	return d.makeRecordFromIP(dc, serviceIPAddr, node, req, ttl)
}

Fix it please or suggest any workaround.

Based on my testing, this can be resolved by removing the address field from the service registration, allowing Consul to fall back to using the node's IP.

Address (string: "") - Specifies the address of the service. If not provided, the agent's address is used as the address for the service during DNS queries.

For example:

services {
  name = "apache"

  # Service is configured with an explicit Address
  address = "192.0.2.10"
  port = 80
}

becomes

services {
  name = "apache"
  port = 80
}

If for some reason you are seeing the .addr responses when you do not have an explicit address specified on the service registration, please provide more information about your configuration and ideally steps to reproduce the issue so that we can debug this further.

Thanks!

@blake
Member

blake commented Jan 15, 2022

Based on my testing, this can be resolved by removing the address field from the service registration, allowing Consul to fall back to using the node's IP.

I discovered that this isn't possible with Nomad because Nomad always populates the Address field when registering services to Consul.

https://github.com/hashicorp/nomad/blob/v1.2.3/command/agent/consul/service_client.go#L1133-L1147

@jameshartig
Contributor Author

@blake thanks for the clarification. I can confirm that removing the address from the registration fixes this.
