request help: A node in the K8S environment ETCD cluster died, causing Apisix to fail #3673

GBXing · 2021-02-25T11:23:35Z

Issue description

APISIX and etcd are deployed on K8S, and the APISIX configuration file lists three etcd nodes by domain name. When one of the etcd nodes died, APISIX started returning errors, and the log shows that the dead node's domain name cannot be resolved. Is resty.etcd unable to determine whether the configured nodes are healthy? What is the solution?

Environment

apisix version (cmd: apisix version): 2.2
OS (cmd: uname -a): CentOS
OpenResty / Nginx version (cmd: nginx -V or openresty -V): 1.19.3.1
etcd version, if have (cmd: run curl http://127.0.0.1:9090/v1/server_info to get the info from server-info API): v3.4.0
apisix-dashboard version, if have:

moonming · 2021-02-25T12:28:19Z

@Yiyiyimu will Chaos Mesh cover this?

Yiyiyimu · 2021-02-25T12:34:30Z

will Chaos Mesh cover this?

I'll run a test tomorrow.

spacewander · 2021-02-25T13:03:27Z

Maybe we can enable this for APISIX: api7/lua-resty-etcd#109

Yiyiyimu · 2021-02-25T13:13:50Z

@GBXing
If it's urgent, you could try the master branch (apisix:dev Docker tag) after #3676 is merged.
If it's even more urgent, you could build a Docker image from local code and test with that.

GBXing · 2021-02-26T04:14:16Z

@Yiyiyimu
I incorporated resty.etcd 1.4.4 into the code and tested it. The native setup works with IP configuration, but the domain name configuration in the K8S environment still produces an error: etcd-1.etcd.apisix.svc.cluster.local could not be resolved (3: Host not found)

tokers · 2021-02-26T09:58:08Z

@GBXing The FQDN etcd-1.etcd.apisix.svc.cluster.local does not seem valid. What are the service name and the namespace?

Yiyiyimu · 2021-02-27T15:59:41Z

Hi @GBXing, I tried to reproduce the problem. My steps were:

Configure the etcd host with a domain name:

DNS_IP=$(kubectl get svc -n kube-system -l k8s-app=kube-dns -o 'jsonpath={..spec.clusterIP}')
echo "dns_resolver:
  - ${DNS_IP}
etcd:
  host:
    - \"http://etcd-cluster-client.default.svc.cluster.local:2379\" " > ./conf/config.yaml

Set up APISIX; everything works as expected.

Kill the leader or a follower etcd pod (both give the same result). The error log then shows:

# Multiple of
2021/02/27 15:30:19 [error] 49#49: *114289 [lua] config_etcd.lua:566: failed to fetch data from etcd: /usr/local/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:593: attempt to index field 'result' (a nil value)
stack traceback:
/usr/local/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:593: in function 'res_func'
/usr/local/apisix/apisix/core/config_etcd.lua:125: in function 'waitdir'
/usr/local/apisix/apisix/core/config_etcd.lua:305: in function 'sync_data'
/usr/local/apisix/apisix/core/config_etcd.lua:540: in function </usr/local/apisix/apisix/core/config_etcd.lua:530>
[C]: in function 'xpcall'
/usr/local/apisix/apisix/core/config_etcd.lua:530: in function </usr/local/apisix/apisix/core/config_etcd.lua:521>,  etcd key: /apisix/global_rules, context: ngx.timer

# Multiple of
2021/02/27 15:30:38 [error] 53#53: *113602 [lua] config_etcd.lua:544: failed to fetch data from etcd: connection refused, etcd key: /apisix/ssl, context: ngx.timer

With etcd-operator, etcd was unreachable for a few seconds and then returned to normal.

Is this error log the same as what you saw?

GBXing · 2021-02-27T17:51:10Z

@tokers The service name is etcd and the namespace is apisix. The other nodes can be accessed normally.

GBXing · 2021-02-27T17:53:00Z

@Yiyiyimu I may be using it the wrong way. Do I need to add any additional configuration in config.yaml to enable the etcd health check?

moonming · 2021-02-27T23:50:11Z

v3.lua:593: attempt to index field 'result' (a nil value)
Is this a bug in APISIX?

GBXing · 2021-02-28T10:59:22Z

@moonming Yes, this exception log is output after the etcd cluster's nodes are paused:

2021/02/28 10:21:59 [error] 54#54: *13 [lua] config_etcd.lua:566: failed to fetch data from etcd: /usr/local/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:649: attempt to index field 'result' (a nil value)
stack traceback:
	/usr/local/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:649: in function 'res_func'
	/usr/local/apisix/apisix/core/config_etcd.lua:125: in function 'waitdir'
	/usr/local/apisix/apisix/core/config_etcd.lua:305: in function 'sync_data'
	/usr/local/apisix/apisix/core/config_etcd.lua:540: in function </usr/local/apisix/apisix/core/config_etcd.lua:530>
	[C]: in function 'xpcall'
	/usr/local/apisix/apisix/core/config_etcd.lua:530: in function </usr/local/apisix/apisix/core/config_etcd.lua:521>, etcd key: /apisix/upstreams, context: ngx.timer
2021/02/28 10:21:59 [error] 54#54: *6 [lua] config_etcd.lua:566: failed to fetch data from etcd: /usr/local/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:649: attempt to index field 'result' (a nil value)
stack traceback:
	/usr/local/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:649: in function 'res_func'
	/usr/local/apisix/apisix/core/config_etcd.lua:125: in function 'waitdir'
	/usr/local/apisix/apisix/core/config_etcd.lua:305: in function 'sync_data'
	/usr/local/apisix/apisix/core/config_etcd.lua:540: in function </usr/local/apisix/apisix/core/config_etcd.lua:530>
	[C]: in function 'xpcall'
	/usr/local/apisix/apisix/core/config_etcd.lua:530: in function </usr/local/apisix/apisix/core/config_etcd.lua:521>, etcd key: /apisix/ssl, context: ngx.timer

GBXing · 2021-02-28T11:13:00Z

@Yiyiyimu I used Docker locally to start the etcd cluster and APISIX, then paused one of the etcd nodes, mirroring the K8S environment.

This is my exception log:

2021/02/28 10:21:59 [error] 54#54: *13 [lua] config_etcd.lua:566: failed to fetch data from etcd: /usr/local/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:649: attempt to index field 'result' (a nil value)
stack traceback:
	/usr/local/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:649: in function 'res_func'
	/usr/local/apisix/apisix/core/config_etcd.lua:125: in function 'waitdir'
	/usr/local/apisix/apisix/core/config_etcd.lua:305: in function 'sync_data'
	/usr/local/apisix/apisix/core/config_etcd.lua:540: in function </usr/local/apisix/apisix/core/config_etcd.lua:530>
	[C]: in function 'xpcall'
	/usr/local/apisix/apisix/core/config_etcd.lua:530: in function </usr/local/apisix/apisix/core/config_etcd.lua:521>,  etcd key: /apisix/upstreams, context: ngx.timer
2021/02/28 10:21:59 [error] 54#54: *6 [lua] config_etcd.lua:566: failed to fetch data from etcd: /usr/local/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:649: attempt to index field 'result' (a nil value)
stack traceback:
	/usr/local/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:649: in function 'res_func'
	/usr/local/apisix/apisix/core/config_etcd.lua:125: in function 'waitdir'
	/usr/local/apisix/apisix/core/config_etcd.lua:305: in function 'sync_data'
	/usr/local/apisix/apisix/core/config_etcd.lua:540: in function </usr/local/apisix/apisix/core/config_etcd.lua:530>
	[C]: in function 'xpcall'
	/usr/local/apisix/apisix/core/config_etcd.lua:530: in function </usr/local/apisix/apisix/core/config_etcd.lua:521>,  etcd key: /apisix/ssl, context: ngx.timer
2021/02/28 10:21:59 [error] 53#53: *36 [lua] config_etcd.lua:566: failed to fetch data from etcd: /usr/local/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:649: attempt to index field 'result' (a nil value)
stack traceback:
	/usr/local/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:649: in function 'res_func'
	/usr/local/apisix/apisix/core/config_etcd.lua:125: in function 'waitdir'
	/usr/local/apisix/apisix/core/config_etcd.lua:305: in function 'sync_data'
	/usr/local/apisix/apisix/core/config_etcd.lua:540: in function </usr/local/apisix/apisix/core/config_etcd.lua:530>
	[C]: in function 'xpcall'
	/usr/local/apisix/apisix/core/config_etcd.lua:530: in function </usr/local/apisix/apisix/core/config_etcd.lua:521>,  etcd key: /apisix/upstreams, context: ngx.timer
2021/02/28 10:21:59 [error] 53#53: *27 [lua] config_etcd.lua:566: failed to fetch data from etcd: /usr/local/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:649: attempt to index field 'result' (a nil value)
stack traceback:
	/usr/local/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:649: in function 'res_func'
	/usr/local/apisix/apisix/core/config_etcd.lua:125: in function 'waitdir'
	/usr/local/apisix/apisix/core/config_etcd.lua:305: in function 'sync_data'
	/usr/local/apisix/apisix/core/config_etcd.lua:540: in function </usr/local/apisix/apisix/core/config_etcd.lua:530>
	[C]: in function 'xpcall'
	/usr/local/apisix/apisix/core/config_etcd.lua:530: in function </usr/local/apisix/apisix/core/config_etcd.lua:521>,  etcd key: /apisix/global_rules, context: ngx.timer
2021/02/28 10:22:28 [error] 53#53: *28 [lua] config_etcd.lua:544: failed to fetch data from etcd: etcd-node2 could not be resolved (3: Host not found),  etcd key: /apisix/services, context: ngx.timer
2021/02/28 10:22:29 [error] 54#54: *12 [lua] config_etcd.lua:544: failed to fetch data from etcd: etcd-node2 could not be resolved (3: Host not found),  etcd key: /apisix/consumers, context: ngx.timer
2021/02/28 10:23:03 [error] 54#54: *4 [lua] config_etcd.lua:544: failed to fetch data from etcd: etcd-node2 could not be resolved (3: Host not found),  etcd key: /apisix/routes, context: ngx.timer
2021/02/28 10:23:04 [error] 53#53: *35 [lua] config_etcd.lua:544: failed to fetch data from etcd: etcd-node2 could not be resolved (3: Host not found),  etcd key: /apisix/consumers, context: ngx.timer
2021/02/28 10:23:18 [error] 54#54: *445 [lua] config_etcd.lua:544: failed to fetch data from etcd: etcd-node2 could not be resolved (3: Host not found),  etcd key: /apisix/ssl, context: ngx.timer

Calling the Admin API returns:

{
	"error_msg": "etcd-node2 could not be resolved (3: Host not found)"
}

By adding log output, I found that no health check for etcd was started and the health_check.init() method was never called.
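
For readers following along, the kind of endpoint health check under discussion can be sketched in shell as below. This is only an illustration of the idea, not how lua-resty-etcd implements it. The function names probe and first_healthy are hypothetical, and the sketch assumes curl is installed and that each etcd member serves its /health endpoint on the client URL.

```shell
# probe ENDPOINT
# Hypothetical helper: succeeds if the etcd member answers its /health endpoint
# within 2 seconds. Assumes curl is available.
probe() {
  curl --silent --fail --max-time 2 "$1/health" >/dev/null 2>&1
}

# first_healthy ENDPOINT...
# Prints the first endpoint that passes probe and returns 0.
# Returns 1 without printing anything if no endpoint is healthy.
first_healthy() {
  for endpoint in "$@"; do
    if probe "$endpoint"; then
      printf '%s\n' "$endpoint"
      return 0
    fi
  done
  return 1
}
```

With the host list from this thread, it could be called as first_healthy http://etcd-node1:2379 http://etcd-node2:2379 http://etcd-node3:2379. A paused or unresolvable node fails the probe and is skipped rather than aborting the whole request.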

tokers · 2021-03-01T01:25:46Z

@tokers The service name is etcd and the namespace is apisix. The other nodes can be accessed normally.

As per https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-hostname-and-subdomain-fields, the pod FQDN has an A/AAAA record only if the pod has the hostname and subdomain fields set and the subdomain matches the name of the headless service. I'm not sure whether you set these in the template fields of etcd's StatefulSet.
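
As a small sketch of the naming rule described in that page, a pod's DNS name has the form pod-name.service-name.namespace.svc.cluster-domain. The helper name pod_fqdn is hypothetical, and cluster.local is assumed as the default cluster domain.

```shell
# pod_fqdn POD SERVICE NAMESPACE [CLUSTER_DOMAIN]
# Hypothetical helper: builds the DNS name of a pod that sits behind a
# headless Service. The cluster domain defaults to cluster.local (assumption).
pod_fqdn() {
  printf '%s.%s.%s.svc.%s\n' "$1" "$2" "$3" "${4:-cluster.local}"
}
```

With the values reported in this thread, pod_fqdn etcd-1 etcd apisix yields etcd-1.etcd.apisix.svc.cluster.local. Running nslookup on that name from a pod inside the cluster should show whether the record exists.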

Yiyiyimu · 2021-03-01T07:12:07Z

v3.lua:593: attempt to index field 'result' (a nil value)
Is this a bug in APISIX?

Yes, I think so. We missed some test coverage, so no proper error message is produced for this case.

Yiyiyimu · 2021-03-01T07:13:40Z

@GBXing Got it, the error log is the same. Could you also show your config.yaml?

Yiyiyimu · 2021-03-01T07:14:34Z

I'm not sure whether you set these in etcd's Statefulset templates fields.

Hi @tokers, it seems @GBXing deployed etcd with Docker rather than K8S, so that might not be the solution.

tokers · 2021-03-01T07:22:56Z

I'm not sure whether you set these in etcd's Statefulset templates fields.

Hi @tokers, it seems @GBXing deployed etcd with Docker rather than K8S, so that might not be the solution.

I see, but that's the case I'm confused about 😂

GBXing · 2021-03-01T12:51:46Z

@Yiyiyimu I use the default-config.yaml from version 2.2. This is my config.yaml:

apisix:
  node_listen: 9080                # APISIX listening port
  enable_admin: true
  enable_admin_cors: true          # Admin API support CORS response headers.
  enable_debug: false
  enable_dev_mode: true           # Sets nginx worker_processes to 1 if set to true
  enable_reuseport: true           # Enable nginx SO_REUSEPORT switch if set to true.
  enable_ipv6: true
  config_center: etcd              # etcd: use etcd to store the config value
  allow_admin:
  admin_key:
    -
      name: "admin"
      key: edd1c9f034335f136f87ad84b625c8f1
      role: admin                 # admin: manage all configuration data
                                  # viewer: only can view configuration data
    -
      name: "viewer"
      key: 4054f7cf07e344346cd3f287985e76a2
      role: viewer

nginx_config:
  error_log: "logs/error.log"
  error_log_level: "warn"
  http:
    lua_shared_dicts:
      shared-datamap: 50m

etcd:
  host:
    - "http://etcd-node1:2379"
    - "http://etcd-node2:2379"
    - "http://etcd-node3:2379"
    # - "http://127.0.0.1:2379"
    # - "http://172.17.0.1:2379"
  prefix: "/apisix"
  timeout: 30
  tls:
    verify: true
  # resync_delay: 5
  # user: root
  # password: 5tHkHhYkjr6cQY

plugins:                          # plugin list (sorted in alphabetical order)
  - api-breaker
  - authz-keycloak
  - basic-auth
  - batch-requests
  - consumer-restriction
  - cors
  - echo
  # - error-log-logger
  # - example-plugin
  - fault-injection
  - grpc-transcode
  - hmac-auth
  - http-logger
  - ip-restriction
  - jwt-auth
  - kafka-logger
  - key-auth
  - limit-conn
  - limit-count
  - limit-req
  # - log-rotate
  # - node-status
  - openid-connect
  - prometheus
  - proxy-cache
  - proxy-mirror
  - proxy-rewrite
  - redirect
  - referer-restriction
  - request-id
  - request-validation
  - response-rewrite
  - serverless-post-function
  - serverless-pre-function
  # - skywalking
  - sls-logger
  - syslog
  - tcp-logger
  - udp-logger
  - uri-blocker
  - wolf-rbac
  - zipkin
  - server-info
  - traffic-split

plugin_attr:
  log-rotate:
    interval: 3600
    max_kept: 168
  skywalking:
    service_name: APISIX
    service_instance_name: "APISIX Instance Name"
    endpoint_addr: http://127.0.0.1:12800
  prometheus:
    export_uri: /apisix/prometheus/metrics
  server-info:
    report_interval: 60
    report_ttl: 3600

membphis · 2021-03-14T15:49:41Z

ping @Yiyiyimu

membphis · 2021-03-23T15:33:43Z

any news? @Yiyiyimu

nanamikon · 2021-05-06T01:41:19Z

Any news? We hit a similar problem: some nodes (not all) log this error message. My etcd version is 3.4.13.

stack traceback:
        ...p/huya-nginx-proxy//deps/share/lua/5.1/resty/etcd/v3.lua:652: in function 'res_func'
        /data/app/huya-nginx-proxy/apisix/core/config_etcd.lua:131: in function 'waitdir'
        /data/app/huya-nginx-proxy/apisix/core/config_etcd.lua:318: in function 'sync_data'
        /data/app/huya-nginx-proxy/apisix/core/config_etcd.lua:546: in function </data/app/huya-nginx-proxy/apisix/core/config_etcd.lua:536>
        [C]: in function 'xpcall'

However, the Admin API served by these nodes works fine, and the nodes never recover on their own.

Yiyiyimu · 2021-05-06T01:45:09Z

Any news? We hit a similar problem: some nodes (not all) log this error message. My etcd version is 3.4.13.

@nanamikon I will open a PR to solve it this week.

spacewander mentioned this issue Feb 25, 2021

If a node in the ETCD cluster dies, it cannot be accessed properly api7/lua-resty-etcd#121

Closed

tzssangglass mentioned this issue Feb 26, 2021

[discuss]: enable etcd health check #3692

Closed

Yiyiyimu mentioned this issue May 6, 2021

feat: enable etcd health-check #4191

Merged

spacewander closed this as completed in #4191 Jun 30, 2021
