Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/srv, embed, etcdmain: Support multiple clusters in the same DNS domain #8690

Merged
merged 2 commits into from
Jan 25, 2018

Conversation

tavish-stripe
Copy link
Contributor

@tavish-stripe tavish-stripe commented Oct 12, 2017

This PR adds the ability to configure the subdomain used for DNS cluster discovery. Previously, the domain was hardcoded to _etcd-server._tcp.FOO and _etcd-server-ssl._tcp.FOO. This did not work with our DNS setup, so I've made this patch to allow configuring the subdomain portion of the SRV queries.

We've been running this in production for a few months.

@codecov-io
Copy link

codecov-io commented Oct 12, 2017

Codecov Report

Merging #8690 into master will decrease coverage by 0.03%.
The diff coverage is 19.04%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8690      +/-   ##
==========================================
- Coverage   75.87%   75.84%   -0.04%     
==========================================
  Files         363      363              
  Lines       30161    30168       +7     
==========================================
- Hits        22884    22880       -4     
- Misses       5670     5689      +19     
+ Partials     1607     1599       -8
Impacted Files Coverage Δ
embed/config.go 59.85% <0%> (-3.47%) ⬇️
etcdmain/config.go 81.5% <100%> (+0.1%) ⬆️
pkg/srv/srv.go 79.03% <66.66%> (+5.79%) ⬆️
etcdserver/api/v3rpc/lease.go 78.37% <0%> (-8.11%) ⬇️
clientv3/leasing/util.go 91.66% <0%> (-6.67%) ⬇️
clientv3/namespace/watch.go 93.93% <0%> (-6.07%) ⬇️
etcdctl/ctlv3/command/lease_command.go 65.34% <0%> (-5.95%) ⬇️
etcdserver/api/v3rpc/watch.go 80.68% <0%> (-4.73%) ⬇️
clientv3/concurrency/session.go 88.63% <0%> (-4.55%) ⬇️
clientv3/leasing/txn.go 88.09% <0%> (-3.18%) ⬇️
... and 15 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 37f577c...81c9f78. Read the comment docs.

@xiang90
Copy link
Contributor

xiang90 commented Dec 22, 2017

/cc @hexfusion can you take a look of this PR?

@hexfusion
Copy link
Contributor

@xiang90, 👍

@tavish-stripe
Copy link
Contributor Author

@xiang90 / @hexfusion Thanks for taking a look. Let me know how I can help. If this overall approach looks good I'm happy to rebase on master to resolve the conflict.

@@ -152,6 +152,8 @@ func newConfig() *config {

fs.StringVar(&cfg.Dproxy, "discovery-proxy", cfg.Dproxy, "HTTP proxy to use for traffic to discovery service.")
fs.StringVar(&cfg.DNSCluster, "discovery-srv", cfg.DNSCluster, "DNS domain used to bootstrap initial cluster.")
fs.StringVar(&cfg.DNSClusterServiceNameSSL, "discovery-srv-name-ssl", cfg.DNSClusterServiceNameSSL, "Service name to query when using DNS discovery for SSL")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we reduce the configurable flag down to one? The goal is to be able to have N etcd discovery under one domain, not to make the ssl suffix configurable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we already know if etcd is using TLS at this point right, how can we get this information from etcd? Can you name your etcd service _etcd-server-ssl then in the case of cluster 'a' cumulative SRV would be _etcd-server-ssl-a._tcp.service.consul. using --discovery-srv-name a?

@xiang90
Copy link
Contributor

xiang90 commented Dec 23, 2017

@tavish-stripe I am OK with the motivation.

@hexfusion
Copy link
Contributor

hexfusion commented Dec 23, 2017

@tavish-stripe thank you for your contribution! Could you do me a favor and create an issue for this PR and briefly outline the limitations you faced with your DNS setup and how this PR resolved it? You could copy and paste a lot of what you have above. Would be good to track the issue while we work through this. I hope to have sometime to review after the holiday weekend. Thanks again, Happy Holidays!

@hexfusion
Copy link
Contributor

hexfusion commented Jan 2, 2018

Previously, the domain was hardcoded to _etcd-server._tcp.FOO and _etcd-server-ssl._tcp.FOO. This did not work with our DNS setup

@tavish-stripe, just checking in. I am trying to better understand the issues you faced with your DNS setup. For example, why does this type of setup not work for you? It would help me a lot to get some context to why this change is necessary.

cluster 1

-–discovery-srv=foo.local

;SRV Records
_etcd-server-ssl._tcp.foo.local. 300 IN SRV 0 0 2380 node1.foo.local.
_etcd-server-ssl._tcp.foo.local. 300 IN SRV 0 0 2380 node2.foo.local.
_etcd-server-ssl._tcp.foo.local. 300 IN SRV 0 0 2380 node3.foo.local.

_etcd-client-ssl._tcp.foo.local. 300 IN SRV 0 0 2379 node1.foo.local.
_etcd-client-ssl._tcp.foo.local. 300 IN SRV 0 0 2379 node2.foo.local.
_etcd-client-ssl._tcp.foo.local. 300 IN SRV 0 0 2379 node3.foo.local.

cluster 2

-–discovery-srv=sub.foo.local

;SRV Records
_etcd-server-ssl._tcp.sub.foo.local. 300 IN SRV 0 0 2380 node4.sub.foo.local.
_etcd-server-ssl._tcp.sub.foo.local. 300 IN SRV 0 0 2380 node5.sub.foo.local.
_etcd-server-ssl._tcp.sub.foo.local. 300 IN SRV 0 0 2380 node6.sub.foo.local.

_etcd-client-ssl._tcp.sub.foo.local. 300 IN SRV 0 0 2379 node4.sub.foo.local.
_etcd-client-ssl._tcp.sub.foo.local. 300 IN SRV 0 0 2379 node5.sub.foo.local.
_etcd-client-ssl._tcp.sub.foo.local. 300 IN SRV 0 0 2379 node6.sub.foo.local.

@hexfusion
Copy link
Contributor

ping @tavish-stripe :)

@tavish-stripe
Copy link
Contributor Author

@hexfusion thanks for the ping!

With our internal service configuration setup everything is under the same foo.local domain and we can't tack on an extra subdomain on a per-cluster basis, so being able to configure the subdomain purpose is useful for us.

@hexfusion
Copy link
Contributor

@tavish-stripe thank you for the reply, so if I understand you correctly your saying that your DNS setup somehow is not configurable to allow sub domains for SRV records. So that all SRV records are defined as the root domain?

So as a work around you are passing the SRV name to etcd to circumvent this? I think that is a crafty solution :) but at that point we are hacking the DNS results which are supposed to be defining the discovery. So this would be a hybrid DNS discovery mechanism one that uses parts of the SRV records and manually overriding others. Please feel free to chime in if my assessment is wrong.

Before we go down this route I would like to propose an alternative discovery mechanism based on TXT records and see if it would be useful for others. I believe it could also solve your problem? See #9128

/cc @gyuho @xiang90

@tavish-stripe
Copy link
Contributor Author

@hexfusion sorry for the delay.

Since we're using Consul for service discovery internally, it's very easy for us to register each etcd cluster in our consul cluster as a separate service. This allows us to use Consul's automatically generated SRV records for each etcd service in the cluster. As a result, we get SRV services registered at _etcd-cluster-1._tcp.service.consul, _etcd-cluster-2._tcp.service.consul, etc. Consul does not, as far as I know, allow you to tack on subdomains per service.

Using SRV discovery (with our own provided service names) allows us to easily use Consul DNS to bootstrap our etcd clusters without having to configure DNS specially for each cluster. The TXT solution you suggested does not allow us to leverage Consul's built-in functionality.

but at that point we are hacking the DNS results which are supposed to be defining the discovery. So this would be a hybrid DNS discovery mechanism one that uses parts of the SRV records and manually overriding others.

It might be helpful to clarify our use case. The solution I'm proposing uses all of the SRV responses and does not override any of them. Here's an example query and response generated by consul. There are two clusters here advertising a client service each (etcd-client-a and etcd-client-b).

$ dig srv _etcd-client-a._tcp.service.consul

;; ANSWER SECTION:
_etcd-client-a._tcp.service.consul. 5 IN  SRV 1 1 2379 etcda-00000000000000000.node.qa-northwest.consul.
_etcd-client-a._tcp.service.consul. 5 IN  SRV 1 1 2379 etcda-11111111111111111.node.qa-northwest.consul.
_etcd-client-a._tcp.service.consul. 5 IN  SRV 1 1 2379 etcda-22222222222222222.node.qa-northwest.consul.

;; ADDITIONAL SECTION:
etcda-00000000000000000.node.qa-northwest.consul. 5 IN A 10.0.0.1
etcda-11111111111111111.node.qa-northwest.consul. 5 IN A 10.0.0.2
etcda-22222222222222222.node.qa-northwest.consul. 5 IN A 10.0.0.3


$ dig srv _etcd-client-b._tcp.service.consul

;; ANSWER SECTION:
_etcd-client-b._tcp.service.consul. 5 IN  SRV 1 1 2379 etcdb-00000000000000000.node.qa-northwest.consul.
_etcd-client-b._tcp.service.consul. 5 IN  SRV 1 1 2379 etcdb-11111111111111111.node.qa-northwest.consul.
_etcd-client-b._tcp.service.consul. 5 IN  SRV 1 1 2379 etcdb-22222222222222222.node.qa-northwest.consul.

;; ADDITIONAL SECTION:
etcdb-00000000000000000.node.qa-northwest.consul. 5 IN A 10.1.0.1
etcdb-11111111111111111.node.qa-northwest.consul. 5 IN A 10.1.0.2
etcdb-22222222222222222.node.qa-northwest.consul. 5 IN A 10.1.0.3

Does that help clarify the use case?

@hexfusion
Copy link
Contributor

@tavish-stripe this is very helpful will review shortly. Thanks!

@hexfusion
Copy link
Contributor

Since we're using Consul for service discovery internally, it's very easy for us to register each etcd cluster in our consul cluster as a separate service. This allows us to use Consul's automatically generated SRV records for each etcd service in the cluster. As a result, we get SRV services registered at _etcd-cluster-1._tcp.service.consul, _etcd-cluster-2._tcp.service.consul, etc. Consul does not, as far as I know, allow you to tack on subdomains per service.

@tavish-stripe

  • First lets start with consol, it appears that support for multiple domains is a long standing issue. Support multiple domains per datacenter hashicorp/consul#290. Currently the recommended solution by Armon is to have a separate consul per domain. But as this is a one to many relationship of consul/datacenter to etcd cluster, your usage is really the only option for now.

  • Second it does seems that using your proposed instance naming is supported by https://tools.ietf.org/html/rfc6763 which is good.

The problem from my experience is is that SRV discovery causes a lot of confusion for many folks. So to @xiang90's point we want to keep this to a single flag for simplicity. We would also need to add documentation and tests.

--discovery-srv-name

will comment on the code now a bit.

@@ -152,6 +152,8 @@ func newConfig() *config {

fs.StringVar(&cfg.Dproxy, "discovery-proxy", cfg.Dproxy, "HTTP proxy to use for traffic to discovery service.")
fs.StringVar(&cfg.DNSCluster, "discovery-srv", cfg.DNSCluster, "DNS domain used to bootstrap initial cluster.")
fs.StringVar(&cfg.DNSClusterServiceNameSSL, "discovery-srv-name-ssl", cfg.DNSClusterServiceNameSSL, "Service name to query when using DNS discovery for SSL")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we already know if etcd is using TLS at this point right, how can we get this information from etcd? Can you name your etcd service _etcd-server-ssl then in the case of cluster 'a' cumulative SRV would be _etcd-server-ssl-a._tcp.service.consul. using --discovery-srv-name a?

embed/config.go Outdated
var clusterStrs []string
var cerr error
var serviceName string
var scheme string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var (
    clusterStrs []string
    cerr error
    serviceName string
    scheme string
)?

@hexfusion
Copy link
Contributor

@tavish-stripe this is a good idea please rework your logic based on a single flag. Looking forward to getting this into core! Let us know if you have any questions.

@tavish-stripe
Copy link
Contributor Author

@hexfusion that seems like a reasonable solution. One alternative would be to provide the name and have -ssl get added in the TLS case. e.g. -discovery-srv-name=foobar would give _foobar._tcp.service.consul and _foobar-ssl._tcp.service.consul. This feels more flexible and readable to me, but either way solves the problem. What do you think? Happy to rework the patch either way.

@hexfusion
Copy link
Contributor

@tavish-stripe I feel the definition should be explicit but to your point flexibility is a consideration.

defer to @xiang90

@hexfusion
Copy link
Contributor

Another thing is that if we keep this _etcd-server-ssl-${name} this might make it less confusing as this is the previous standard and we are only appending name. Also a query of clusters might be easier to predefine/grep.

@tavish-stripe
Copy link
Contributor Author

Yeah, that seems like a reasonable argument. I'll wait for @xiang90 's opinion and then rework the patch.

@tavish-stripe tavish-stripe force-pushed the tavish/svc-flag branch 4 times, most recently from d388145 to f44123b Compare January 24, 2018 19:49
@hexfusion
Copy link
Contributor

@tavish-stripe since @xiang90 has not added a comment I will assume he has no strong feelings. I would like to keep the existing format etcd-server, etcd-server-ssl +-{name}. Thank you and I look forward to our next review.

@xiang90
Copy link
Contributor

xiang90 commented Jan 24, 2018

sorry for the delay. i do not have strong feeling. just want to keep the flag down to one, and make it as clear as possible. we probably need to document it, and provide an example in the configuration doc as well.

@tavish-stripe
Copy link
Contributor Author

@hexfusion @xiang90 I've reworked the patch to use the suffix. Please take a look and let me know if anything needs changing.

we probably need to document it, and provide an example in the configuration doc as well.

Do I need to include that as part of this PR in the Documentation/ folder, or is this something that should be handled as a separate PR?

@hexfusion
Copy link
Contributor

Do I need to include that as part of this PR in the Documentation/ folder, or is this something that should be handled as a separate PR?

@tavish-stripe please include in this PR as a single commit.

@xiang90
Copy link
Contributor

xiang90 commented Jan 24, 2018

@tavish-stripe
Copy link
Contributor Author

@hexfusion @xiang90 I made a first cut of documentation for the new flag. Please let me know if there's anything I should change.

@xiang90
Copy link
Contributor

xiang90 commented Jan 24, 2018

@tavish-stripe This PR looks good to me in general. I would rely on @hexfusion and @gyuho or @spzala to give a final LGTM before merging.

Thanks.

+ Suffix to the DNS srv name queried when bootstrapping using DNS.
+ default: ""
+ env variable: ETCD_DISCOVERY_SRV

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tavish-stripe ETCD_DISCOVERY_SRV_NAME ?

@hexfusion
Copy link
Contributor

@tavish-stripe one little knit, otherwise looks great!

Copy link
Contributor

@hexfusion hexfusion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Contributor

@gyuho gyuho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. Thanks @tavish-stripe @hexfusion

@hexfusion hexfusion merged commit 7dd9c30 into etcd-io:master Jan 25, 2018
@hexfusion
Copy link
Contributor

👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

5 participants