Consul >= 1.3.0 doesn't filter services by tags / returns an empty array #5179

Closed
sofarsog00d opened this issue Dec 29, 2018 · 29 comments
Labels
needs-investigation The issue described is detailed and complex. waiting-reply Waiting on response from Original Poster or another individual in the thread

Comments

@sofarsog00d

Overview of the Issue

Consul >= 1.3.0 doesn't filter services by tags / returns an empty array

Reproduction Steps

Steps to reproduce this issue, eg:

  1. Run curl -X GET 'http://localhost:8500/v1/health/service/service_name?tag=env1'
  2. Output:
    []
    It works with consul 1.2.4, but doesn't with 1.3.0 or higher.
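For comparing the filtered and unfiltered responses, the request in step 1 can be sketched in Python. This is a minimal sketch, not part of the original report: the helper names `health_service_url` and `fetch` are hypothetical, and the base URL and service name are taken from the curl command above. One `tag=` parameter is emitted per tag.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def health_service_url(base, service, tags=()):
    """Build a /v1/health/service/<name> URL with one tag= parameter per tag."""
    url = f"{base.rstrip('/')}/v1/health/service/{quote(service, safe='')}"
    query = urlencode([("tag", tag) for tag in tags])
    return f"{url}?{query}" if query else url


def fetch(url):
    """Fetch a URL and decode the JSON body (needs a reachable Consul agent)."""
    with urlopen(url, timeout=5) as resp:
        return json.load(resp)


# Hypothetical usage: only the URLs are built here, nothing is sent.
unfiltered = health_service_url("http://localhost:8500", "service_name")
filtered = health_service_url("http://localhost:8500", "service_name", ["env1"])
print(unfiltered)
print(filtered)
```

Calling `fetch(unfiltered)` and `fetch(filtered)` against the same agent would show whether the service is registered at all or only disappears once the tag filter is applied.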

Consul info for both Client and Server

Client info
agent:
	check_monitors = 2
	check_ttls = 0
	checks = 70
	services = 70
build:
	prerelease =
	revision = 0bddfa23
	version = 1.4.0
consul:
	acl = disabled
	known_servers = 3
	server = false
runtime:
	arch = amd64
	cpu_count = 8
	goroutines = 233
	max_procs = 8
	os = linux
	version = go1.11.1
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1322
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 24072
	members = 32
	query_queue = 0
	query_time = 53
Server info
agent:
	check_monitors = 2
	check_ttls = 0
	checks = 70
	services = 70
build:
	prerelease =
	revision = 0bddfa23
	version = 1.4.0
consul:
	acl = disabled
	known_servers = 3
	server = false
runtime:
	arch = amd64
	cpu_count = 8
	goroutines = 233
	max_procs = 8
	os = linux
	version = go1.11.1
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1322
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 24072
	members = 32
	query_queue = 0
	query_time = 53
@sofarsog00d
Author

Any help?)

@ShimmerGlass
Contributor

@sofarsog00d could you provide your service definition ?

@sofarsog00d
Author

@Aestek sure

root@server1:~# cat /etc/consul.d/env1-service_name.json
{
  "service": {
    "id": "env1-service_name",
    "name": "service_name",
    "tags": ["env1","5.0.2"],
    "address": "11.11.11.11",
    "port": 22222,
    "check": {
        "http": "http://11.11.11.11:22222",
        "interval": "10s",
        "timeout": "2s"
    }
  }
}

@pearkes
Contributor

pearkes commented Jan 3, 2019

@sofarsog00d can you post the results of the query without the tag parameter? Could be related to #4717.

@pearkes pearkes added needs-investigation The issue described is detailed and complex. waiting-reply Waiting on response from Original Poster or another individual in the thread labels Jan 3, 2019
@sofarsog00d
Author

sofarsog00d commented Jan 3, 2019

@pearkes

consul 1.2.4

root@node1:~# curl -X GET 'http://localhost:8500/v1/catalog/service/service?tag=env1'
[{"ID":"7h4973f4h-3233-f37h-5555-666683625f46","Node":"node1",
"Address":"11.11.11.111","Datacenter":"dc-1","TaggedAddresses":{"lan":"11.11.11.111",
"wan":"11.11.11.111"},"NodeMeta":{"consul-network-segment":""},
"ServiceKind":"","ServiceID":"env1-service","ServiceName":"service",
"ServiceTags":["env1","5.8.1"],"ServiceAddress":"11.11.11.111",
"ServiceWeights":{"Passing":0,"Warning":0},"ServiceMeta":null,"ServicePort":11222,
"ServiceEnableTagOverride":false,"ServiceProxyDestination":"",
"ServiceConnect":{"Native":false,"Proxy":null},"CreateIndex":163932811,"ModifyIndex":273911893}]

consul 1.4

root@node1:~# curl -X GET 'http://localhost:8500/v1/catalog/service/service?tag=env1'
[]

@pearkes
Contributor

pearkes commented Jan 3, 2019

@sofarsog00d can you post without the ?tag=env1 against 1.4?

@sofarsog00d
Author

@pearkes yes, it works without ?tag=env1, but then the output contains the service from all environments, e.g. service-env1, service-env2, service-env3, where each envX is a different server. So I need to filter the service by environment.

@pearkes
Contributor

pearkes commented Jan 3, 2019

@sofarsog00d right, I understand -- just trying to gather more information to aid in debugging and I think that would be useful.

@sofarsog00d
Author

@pearkes any updates?

@sofarsog00d
Author

any news?

@mkeeler
Member

mkeeler commented Jan 22, 2019

@sofarsog00d I just tried reproducing and so far tag filtering is working as expected. Could you please submit the curl output of the request without passing the ?tag=env1 filter? Seeing the raw information would be helpful in tracking this down.

@Elufimov

Elufimov commented Jan 23, 2019

I am facing this problem too. We are in the process of updating our consul cluster to the latest version from 0.7.0. I can reproduce it reliably with an agent (1.4.0) connected to our old cluster (0.7.0) and in a new test cluster (1.4.0). Let's take the test cluster as an example:
3 master nodes with a simple config:

{
    "bootstrap_expect": 3,
    "server": true,
    "datacenter": "caravan",
    "data_dir": "/var/lib/consul",
    "encrypt": "some key",
    "log_level": "ERR",
    "ui": true,
    "recursors":[
      "172.18.1.2",
      "172.18.1.3",
      "172.18.1.4"
    ],
    "start_join": [
      "${local.consul1_ip}",
      "${local.consul2_ip}",
      "${local.consul3_ip}"
    ]
}

Service:

curl -X "PUT" "http://first_master:8500/v1/agent/service/register?dc=caravan" \
     -H 'Content-Type: application/json; charset=utf-8' \
     -d $'{
  "ID": "Elastic localhost PROD",
  "Address": "localhost",
  "Name": "Elastic",
  "Tags": [
    "PROD"
  ]
}'
curl -X "GET" "http://first_master:8500/v1/health/service/Elastic" 
[
    {
        "Node": {
            "ID": "",
            "Node": "first_master",
            "Address": "some_address",
            "Datacenter": "",
            "TaggedAddresses": {
                "lan": "some_address",
                "wan": "some_address"
            },
            "Meta": null,
            "CreateIndex": 5,
            "ModifyIndex": 80927
        },
        "Service": {
            "ID": "Elastic localhost PROD",
            "Service": "Elastic",
            "Tags": [
                "PROD"
            ],
            "Address": "localhost",
            "Meta": null,
            "Port": 0,
            "Weights": null,
            "EnableTagOverride": false,
            "ProxyDestination": "",
            "Proxy": {},
            "Connect": {},
            "CreateIndex": 80916,
            "ModifyIndex": 80916
        },
        "Checks": [
            {
                "Node": "first_master",
                "CheckID": "serfHealth",
                "Name": "Serf Health Status",
                "Status": "passing",
                "Notes": "",
                "Output": "Agent alive and reachable",
                "ServiceID": "",
                "ServiceName": "",
                "ServiceTags": [],
                "Definition": {},
                "CreateIndex": 5,
                "ModifyIndex": 5
            }
        ]
    }
]
curl -X "GET" "http://first_master:8500/v1/health/service/Elastic?tag=PROD" 
[]
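Until the server-side filter behaves, one possible stopgap is to request the service without `?tag=` and filter the entries on the client. The sketch below is an assumption-laden illustration, not something proposed in the thread: `filter_by_tags` is a hypothetical helper, the second and third sample entries are invented, and it uses an exact, case-sensitive comparison that may not match the server's own tag matching.

```python
def filter_by_tags(entries, required_tags):
    """Keep health entries whose Service.Tags contain every required tag."""
    wanted = set(required_tags)
    return [
        entry
        for entry in entries
        # Tags may be missing or null in a response, so fall back to an empty list.
        if wanted <= set((entry.get("Service") or {}).get("Tags") or [])
    ]


# The first entry mirrors the response above; the other two are made up.
sample = [
    {"Service": {"ID": "Elastic localhost PROD", "Service": "Elastic", "Tags": ["PROD"]}},
    {"Service": {"ID": "Elastic localhost DEV", "Service": "Elastic", "Tags": ["DEV"]}},
    {"Service": {"ID": "Elastic untagged", "Service": "Elastic", "Tags": None}},
]

matches = filter_by_tags(sample, ["PROD"])
print([entry["Service"]["ID"] for entry in matches])  # ['Elastic localhost PROD']
```

This moves the filtering cost to the caller and returns more data over the wire, so it is only a workaround while the cluster is in a mixed state.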

@mkeeler
Member

mkeeler commented Jan 23, 2019

@Elufimov Can you execute the second query with curl -v? I tried reproducing with your exact service and a 3-server consul cluster (both versions 1.3.0 and 1.4.0) and was unable to. What I am hoping to determine is whether you are getting stale or cached data. The headers curl outputs should indicate whether either of these two scenarios is happening.

@Elufimov

Here it is

curl -v -X "GET" "http://first_master:8500/v1/health/service/Elastic?tag=PROD"
Note: Unnecessary use of -X or --request, GET is already inferred.
*   Trying 172.16.0.79...
* TCP_NODELAY set
* Connected to first_master (172.16.0.79) port 8500 (#0)
> GET /v1/health/service/Elastic?tag=PROD HTTP/1.1
> Host: first_master:8500
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< Vary: Accept-Encoding
< X-Consul-Effective-Consistency: leader
< X-Consul-Index: 85280
< X-Consul-Knownleader: true
< X-Consul-Lastcontact: 0
< Date: Wed, 23 Jan 2019 16:43:01 GMT
< Content-Length: 2
<
* Connection #0 to host first_master left intact
[]%

@mkeeler
Member

mkeeler commented Jan 23, 2019

Well that theory is definitely gone. Not stale or cached and the leader is serving the request.
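One way to read those headers programmatically is sketched below. The header names and values come from the curl -v output above; the function names and the rule in `looks_current` are assumptions made for illustration and are not an official Consul check.

```python
def describe_freshness(headers):
    """Summarize the Consul response headers that hint at stale results."""
    h = {name.lower(): value.strip() for name, value in headers.items()}
    last_contact = h.get("x-consul-lastcontact")
    return {
        "known_leader": h.get("x-consul-knownleader", "").lower() == "true",
        "last_contact_ms": int(last_contact) if last_contact is not None else None,
        "consistency": h.get("x-consul-effective-consistency"),
    }


def looks_current(info):
    """Assumed heuristic: a known leader answered with no contact lag."""
    return (
        info["known_leader"]
        and info["last_contact_ms"] == 0
        and info["consistency"] == "leader"
    )


# Header values copied from the curl -v output in this thread.
info = describe_freshness({
    "X-Consul-Effective-Consistency": "leader",
    "X-Consul-Index": "85280",
    "X-Consul-Knownleader": "true",
    "X-Consul-Lastcontact": "0",
})
print(info)
print(looks_current(info))  # True
```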

@Elufimov

It is very strange that we have different results. Can I provide additional data?

@mkeeler
Member

mkeeler commented Jan 23, 2019

Maybe but I will need to do some more digging to see what I should ask you to gather.

@Elufimov

I will be glad to help. This is definitely a blocker for us.

@Elufimov

Elufimov commented Jan 24, 2019

I re-created a cluster (1.4.0) from the same terraform template and now I can't reproduce the issue. Before the re-creation, the cluster log was full of these messages:
2019/01/24 04:50:18 [ERR] raft-net: Failed to decode incoming command: unknown rpc type 129

@Elufimov

Elufimov commented Jan 24, 2019

I tried connecting agents to our prod cluster (0.7.0):
1.2.4 agent worked fine
1.3.0 agent returned empty array
1.4.0 agent returned empty array
1.4.1 agent returned empty array
Something changed in 1.3.0

@jippi
Contributor

jippi commented Jan 24, 2019

Might be #4944? That went into 1.3.1 to fix an issue in 1.3.0; maybe it was not a full fix?

@Elufimov

Elufimov commented Jan 24, 2019

I managed to reproduce the issue after I created a 0.7.0 test cluster and joined a 1.4.0 agent to it. I hope this will help with reproduction. Or is this not a supported scenario?

@mkeeler
Member

mkeeler commented Jan 24, 2019

@Elufimov That scenario is definitely unusual but I think it should work. Does the problem always present itself with 1.3.0+ agents in a cluster with agents < 1.3.0?

I think I could see how that might cause an issue.

@Elufimov

Elufimov commented Jan 28, 2019

Does the problem always present itself with 1.3.0+ agents in a cluster with agents < 1.3.0?

It did in my dev env. If you need, I can reproduce the issue on whichever agent and cluster versions you want. However, right now I am focused on upgrading our cluster to the latest version, and so far I have had no problems with that process.

@mkeeler
Member

mkeeler commented Jan 28, 2019

I am going to try to repro with some servers on v1.2.4 and a client on 1.4.0. I think I might know why this is happening.

@mkeeler
Member

mkeeler commented Jan 28, 2019

@Elufimov I was able to reproduce.

The cause of the problem is a 1.3.0+ client communicating with pre-1.3.0 servers.

@mkeeler
Member

mkeeler commented Jan 28, 2019

The general recommendation is that the servers in your cluster be upgraded first. Some features in the client agents require the server-side updates to work properly, and multi-tag filtering (introduced in 1.3.0) is one of them. From 1.3.0 on, the client sends somewhat different filters to the servers, and pre-1.3.0 servers do not recognize them.

Sometimes new clients may happen to work with older servers, but it's not something we explicitly support and it's not something that should be relied upon.
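The version rule described here can be expressed as a small pre-flight check. This is a rough sketch under stated assumptions: the function names and the `MULTI_TAG_MIN` constant are hypothetical, the version strings are supplied by the caller (how they are collected is left open), and only the 1.3.0 threshold comes from this thread.

```python
import re

# Threshold taken from the thread: multi-tag filtering arrived in 1.3.0.
MULTI_TAG_MIN = (1, 3, 0)


def parse_version(text):
    """Turn '1.4.0' or 'v1.4.0' into (1, 4, 0); any trailing suffix is ignored."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version: {text!r}")
    return tuple(int(part) for part in match.groups())


def servers_blocking_tag_filter(client_version, server_versions):
    """Return the server versions too old for a 1.3.0+ client's tag filter."""
    if parse_version(client_version) < MULTI_TAG_MIN:
        return []  # an older client still sends the filter old servers understand
    return [v for v in server_versions if parse_version(v) < MULTI_TAG_MIN]


print(servers_blocking_tag_filter("1.4.0", ["0.7.0", "0.7.0", "0.7.0"]))  # ['0.7.0', '0.7.0', '0.7.0']
print(servers_blocking_tag_filter("1.2.4", ["0.7.0"]))                    # []
print(servers_blocking_tag_filter("1.4.0", ["1.4.0", "1.2.4"]))           # ['1.2.4']
```

A non-empty result matches the failing combinations reported in this thread, where the 1.3.0, 1.4.0 and 1.4.1 agents returned an empty array against 0.7.0 servers while the 1.2.4 agent worked.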

@mkeeler mkeeler closed this as completed Jan 28, 2019
@Elufimov

Thx for your quick help with this issue. Your versions policy is reasonable.

@sofarsog00d
Author

Thanks!
