request help: A node in the K8S environment ETCD cluster died, causing Apisix to fail #3673

Closed
GBXing opened this issue Feb 25, 2021 · 22 comments · Fixed by #4191

Comments

@GBXing

GBXing commented Feb 25, 2021

Issue description

APISIX and etcd are deployed on K8S, and three etcd nodes are configured by domain name in the APISIX configuration file. When one of the etcd nodes died, APISIX started to report errors, and the log indicates that the domain name of the etcd node cannot be resolved. Is resty.etcd unable to determine whether the configured nodes are healthy? What is the solution?
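
To make the question concrete, here is a minimal Python sketch of the behaviour being asked about: a client that is given several etcd endpoints and moves on to the next one when a request fails. It is not how resty.etcd is implemented. `request_with_failover` and the `send` callback are hypothetical names used only for illustration, and the failure is simulated rather than taken from a real cluster.

```python
from typing import Callable, Sequence


def request_with_failover(endpoints: Sequence[str],
                          send: Callable[[str], str]) -> str:
    """Try each endpoint in order and return the first successful reply.

    `send` is a hypothetical callback that performs one request against
    one endpoint and raises OSError on failures such as a DNS error or a
    refused connection.
    """
    errors = []
    for endpoint in endpoints:
        try:
            return send(endpoint)
        except OSError as exc:
            # Remember the failure and fall through to the next endpoint.
            errors.append(f"{endpoint}: {exc}")
    raise ConnectionError("all etcd endpoints failed: " + "; ".join(errors))


if __name__ == "__main__":
    def fake_send(endpoint: str) -> str:
        # Simulated outage: pretend that etcd-node2 cannot be resolved.
        if "etcd-node2" in endpoint:
            raise OSError("could not be resolved (3: Host not found)")
        return f"ok from {endpoint}"

    hosts = ["http://etcd-node2:2379", "http://etcd-node1:2379"]
    print(request_with_failover(hosts, fake_send))
```

With this kind of loop, one dead node only costs an extra attempt as long as at least one other endpoint answers.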

Environment

  • apisix version (cmd: apisix version): 2.2
  • OS (cmd: uname -a): CentOS
  • OpenResty / Nginx version (cmd: nginx -V or openresty -V): 1.19.3.1
  • etcd version, if have (cmd: run curl http://127.0.0.1:9090/v1/server_info to get the info from server-info API): v3.4.0
  • apisix-dashboard version, if have:
@moonming
Member

@Yiyiyimu will chaos mesh cover this?

@Yiyiyimu
Member

will chaos mesh cover this?

I'll test it tomorrow.

@spacewander
Member

Maybe we can enable this for APISIX: api7/lua-resty-etcd#109

@Yiyiyimu
Member

@GBXing
If it's urgent, you could try the master branch (Docker tag apisix:dev) once #3676 is merged.
If it's even more urgent, you could build a Docker image from local code and test with that.

@GBXing
Author

GBXing commented Feb 26, 2021

@Yiyiyimu
I tested with resty.etcd 1.4.4 incorporated into the code. The native setup works with IP configuration, but with domain name configuration in the K8S environment I still get an error: etcd-1.etcd.apisix.svc.cluster.local could not be resolved (3: Host not found)
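
A quick way to narrow down this kind of error is to check which of the configured hostnames resolve at all. The following Python sketch does that with the system resolver. The `etcd-0` and `etcd-2` hostnames are assumptions (only `etcd-1` appears in this thread), `unresolvable_hosts` is a hypothetical helper, and the result only reflects the DNS view of the machine or pod where the script runs, which may differ from the resolver APISIX uses.

```python
import socket
from urllib.parse import urlsplit


def unresolvable_hosts(endpoints, resolve=socket.getaddrinfo):
    """Return the hostnames from `endpoints` that fail DNS resolution.

    `resolve` defaults to socket.getaddrinfo and can be replaced in tests.
    """
    failed = []
    for endpoint in endpoints:
        parts = urlsplit(endpoint)
        host, port = parts.hostname, parts.port or 2379
        try:
            resolve(host, port)
        except socket.gaierror:
            # The name has no usable DNS record from this vantage point.
            failed.append(host)
    return failed


if __name__ == "__main__":
    # etcd-0 and etcd-2 are assumed names for the other two nodes.
    endpoints = [
        "http://etcd-0.etcd.apisix.svc.cluster.local:2379",
        "http://etcd-1.etcd.apisix.svc.cluster.local:2379",
        "http://etcd-2.etcd.apisix.svc.cluster.local:2379",
    ]
    print("unresolvable:", unresolvable_hosts(endpoints))
```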

@tokers
Contributor

tokers commented Feb 26, 2021

@GBXing The FQDN etcd-1.etcd.apisix.svc.cluster.local does not seem valid. What are the service name and the namespace?

@Yiyiyimu
Member

Yiyiyimu commented Feb 27, 2021

Hi @GBXing, I tried to reproduce the problem. My reproduction steps are:

  1. Configure etcd host with domain name:

    DNS_IP=$(kubectl get svc -n kube-system -l k8s-app=kube-dns -o 'jsonpath={..spec.clusterIP}')
    echo "dns_resolver:
      - ${DNS_IP}
    etcd:
      host:
        - \"http://etcd-cluster-client.default.svc.cluster.local:2379\" " > ./conf/config.yaml
    
  2. Set up APISIX. Everything works as expected.

  3. Kill the leader or a follower pod of etcd (both give me the same result). The error log then shows:

    # Multiple occurrences of
    2021/02/27 15:30:19 [error] 49#49: *114289 [lua] config_etcd.lua:566: failed to fetch data from etcd: /usr/local/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:593: attempt to index field 'result' (a nil value)
    stack traceback:
    /usr/local/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:593: in function 'res_func'
    /usr/local/apisix/apisix/core/config_etcd.lua:125: in function 'waitdir'
    /usr/local/apisix/apisix/core/config_etcd.lua:305: in function 'sync_data'
    /usr/local/apisix/apisix/core/config_etcd.lua:540: in function </usr/local/apisix/apisix/core/config_etcd.lua:530>
    [C]: in function 'xpcall'
    /usr/local/apisix/apisix/core/config_etcd.lua:530: in function </usr/local/apisix/apisix/core/config_etcd.lua:521>,  etcd key: /apisix/global_rules, context: ngx.timer
    
    # Multiple occurrences of
    2021/02/27 15:30:38 [error] 53#53: *113602 [lua] config_etcd.lua:544: failed to fetch data from etcd: connection refused, etcd key: /apisix/ssl, context: ngx.timer
    

    With etcd-operator, etcd became unreachable for a few seconds and then returned to normal.

Is this error log the same as the one you saw?

@GBXing
Author

GBXing commented Feb 27, 2021

@tokers The service name is etcd and the namespace is apisix. The other nodes can be accessed normally.

@GBXing
Author

GBXing commented Feb 27, 2021

@Yiyiyimu I may be using it the wrong way. Do I need to add any additional configuration in config.yaml for the etcd health check?

@moonming
Member

moonming commented Feb 27, 2021 via email

@GBXing
Author

GBXing commented Feb 28, 2021

@moonming Yes, the etcd cluster outputs the exception log after the nodes are paused
2021/02/28 10:21:59 [error] 54#54: *13 [lua] config_etcd.lua:566: failed to fetch data from etcd: /usr/local/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:649: attempt to index field 'result' (a nil value)
stack traceback:
	/usr/local/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:649: in function 'res_func'
	/usr/local/apisix/apisix/core/config_etcd.lua:125: in function 'waitdir'
	/usr/local/apisix/apisix/core/config_etcd.lua:305: in function 'sync_data'
	/usr/local/apisix/apisix/core/config_etcd.lua:540: in function </usr/local/apisix/apisix/core/config_etcd.lua:530>
	[C]: in function 'xpcall'
	/usr/local/apisix/apisix/core/config_etcd.lua:530: in function </usr/local/apisix/apisix/core/config_etcd.lua:521>, etcd key: /apisix/upstreams, context: ngx.timer
2021/02/28 10:21:59 [error] 54#54: *6 [lua] config_etcd.lua:566: failed to fetch data from etcd: /usr/local/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:649: attempt to index field 'result' (a nil value)
stack traceback:
	/usr/local/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:649: in function 'res_func'
	/usr/local/apisix/apisix/core/config_etcd.lua:125: in function 'waitdir'
	/usr/local/apisix/apisix/core/config_etcd.lua:305: in function 'sync_data'
	/usr/local/apisix/apisix/core/config_etcd.lua:540: in function </usr/local/apisix/apisix/core/config_etcd.lua:530>
	[C]: in function 'xpcall'
	/usr/local/apisix/apisix/core/config_etcd.lua:530: in function </usr/local/apisix/apisix/core/config_etcd.lua:521>, etcd key: /apisix/ssl, context: ngx.timer

@GBXing
Author

GBXing commented Feb 28, 2021

@Yiyiyimu I used Docker locally to start the etcd cluster and APISIX, then paused one of the etcd nodes, to mimic the K8S environment.

This is my exception log:

2021/02/28 10:21:59 [error] 54#54: *13 [lua] config_etcd.lua:566: failed to fetch data from etcd: /usr/local/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:649: attempt to index field 'result' (a nil value)
stack traceback:
	/usr/local/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:649: in function 'res_func'
	/usr/local/apisix/apisix/core/config_etcd.lua:125: in function 'waitdir'
	/usr/local/apisix/apisix/core/config_etcd.lua:305: in function 'sync_data'
	/usr/local/apisix/apisix/core/config_etcd.lua:540: in function </usr/local/apisix/apisix/core/config_etcd.lua:530>
	[C]: in function 'xpcall'
	/usr/local/apisix/apisix/core/config_etcd.lua:530: in function </usr/local/apisix/apisix/core/config_etcd.lua:521>,  etcd key: /apisix/upstreams, context: ngx.timer
2021/02/28 10:21:59 [error] 54#54: *6 [lua] config_etcd.lua:566: failed to fetch data from etcd: /usr/local/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:649: attempt to index field 'result' (a nil value)
stack traceback:
	/usr/local/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:649: in function 'res_func'
	/usr/local/apisix/apisix/core/config_etcd.lua:125: in function 'waitdir'
	/usr/local/apisix/apisix/core/config_etcd.lua:305: in function 'sync_data'
	/usr/local/apisix/apisix/core/config_etcd.lua:540: in function </usr/local/apisix/apisix/core/config_etcd.lua:530>
	[C]: in function 'xpcall'
	/usr/local/apisix/apisix/core/config_etcd.lua:530: in function </usr/local/apisix/apisix/core/config_etcd.lua:521>,  etcd key: /apisix/ssl, context: ngx.timer
2021/02/28 10:21:59 [error] 53#53: *36 [lua] config_etcd.lua:566: failed to fetch data from etcd: /usr/local/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:649: attempt to index field 'result' (a nil value)
stack traceback:
	/usr/local/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:649: in function 'res_func'
	/usr/local/apisix/apisix/core/config_etcd.lua:125: in function 'waitdir'
	/usr/local/apisix/apisix/core/config_etcd.lua:305: in function 'sync_data'
	/usr/local/apisix/apisix/core/config_etcd.lua:540: in function </usr/local/apisix/apisix/core/config_etcd.lua:530>
	[C]: in function 'xpcall'
	/usr/local/apisix/apisix/core/config_etcd.lua:530: in function </usr/local/apisix/apisix/core/config_etcd.lua:521>,  etcd key: /apisix/upstreams, context: ngx.timer
2021/02/28 10:21:59 [error] 53#53: *27 [lua] config_etcd.lua:566: failed to fetch data from etcd: /usr/local/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:649: attempt to index field 'result' (a nil value)
stack traceback:
	/usr/local/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:649: in function 'res_func'
	/usr/local/apisix/apisix/core/config_etcd.lua:125: in function 'waitdir'
	/usr/local/apisix/apisix/core/config_etcd.lua:305: in function 'sync_data'
	/usr/local/apisix/apisix/core/config_etcd.lua:540: in function </usr/local/apisix/apisix/core/config_etcd.lua:530>
	[C]: in function 'xpcall'
	/usr/local/apisix/apisix/core/config_etcd.lua:530: in function </usr/local/apisix/apisix/core/config_etcd.lua:521>,  etcd key: /apisix/global_rules, context: ngx.timer
2021/02/28 10:22:28 [error] 53#53: *28 [lua] config_etcd.lua:544: failed to fetch data from etcd: etcd-node2 could not be resolved (3: Host not found),  etcd key: /apisix/services, context: ngx.timer
2021/02/28 10:22:29 [error] 54#54: *12 [lua] config_etcd.lua:544: failed to fetch data from etcd: etcd-node2 could not be resolved (3: Host not found),  etcd key: /apisix/consumers, context: ngx.timer
2021/02/28 10:23:03 [error] 54#54: *4 [lua] config_etcd.lua:544: failed to fetch data from etcd: etcd-node2 could not be resolved (3: Host not found),  etcd key: /apisix/routes, context: ngx.timer
2021/02/28 10:23:04 [error] 53#53: *35 [lua] config_etcd.lua:544: failed to fetch data from etcd: etcd-node2 could not be resolved (3: Host not found),  etcd key: /apisix/consumers, context: ngx.timer
2021/02/28 10:23:18 [error] 54#54: *445 [lua] config_etcd.lua:544: failed to fetch data from etcd: etcd-node2 could not be resolved (3: Host not found),  etcd key: /apisix/ssl, context: ngx.timer

Calling the Admin API returns:

{
	"error_msg": "etcd-node2 could not be resolved (3: Host not found)"
}

By printing logs, I found that no health check for etcd was started and that the health_check.init() method was never called.
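
For readers unfamiliar with what such a health check would do, here is a minimal Python sketch of passive health tracking: an endpoint that fails a given number of times is taken out of rotation for a fixed period. This is an illustration under stated assumptions, not the lua-resty-etcd implementation. The class name `EndpointHealth` and the parameters `max_fails`, `fail_timeout`, and `clock` are hypothetical.

```python
import time


class EndpointHealth:
    """Passive health tracking: an endpoint that fails `max_fails` times
    is skipped for `fail_timeout` seconds. All names are illustrative."""

    def __init__(self, endpoints, max_fails=1, fail_timeout=10.0,
                 clock=time.monotonic):
        self.endpoints = list(endpoints)
        self.max_fails = max_fails
        self.fail_timeout = fail_timeout
        self.clock = clock
        self.fails = {e: 0 for e in self.endpoints}
        self.down_until = {e: 0.0 for e in self.endpoints}

    def report_failure(self, endpoint):
        self.fails[endpoint] += 1
        if self.fails[endpoint] >= self.max_fails:
            # Take the endpoint out of rotation for a while.
            self.down_until[endpoint] = self.clock() + self.fail_timeout
            self.fails[endpoint] = 0

    def report_success(self, endpoint):
        self.fails[endpoint] = 0

    def healthy(self):
        """Return the endpoints that are currently eligible for requests."""
        now = self.clock()
        return [e for e in self.endpoints if self.down_until[e] <= now]


if __name__ == "__main__":
    health = EndpointHealth(
        ["http://etcd-node1:2379", "http://etcd-node2:2379"], max_fails=1)
    health.report_failure("http://etcd-node2:2379")
    print(health.healthy())
```

A caller would pick its next endpoint from `healthy()` and report the outcome of each request, so a paused node stops being retried on every sync.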

@tokers
Contributor

tokers commented Mar 1, 2021

@tokers service name is etcd,namespace is apisix, other nodes can be accessed normally

As per https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-hostname-and-subdomain-fields, the FQDN has an A/AAAA record only if the pod has the hostname and subdomain fields set and the subdomain has the same value as the name of the headless service. I'm not sure whether you set these in the template fields of etcd's StatefulSet.
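
The naming rule from that page can be written down as a one-line helper. This is a small Python sketch. `pod_fqdn` is a hypothetical function name, and `cluster.local` is assumed as the cluster domain because it is the one that appears in the error message in this thread.

```python
def pod_fqdn(hostname, subdomain, namespace, cluster_domain="cluster.local"):
    """Build the DNS name Kubernetes gives a pod that sets both
    `hostname` and `subdomain` (the subdomain must match a headless
    Service in the same namespace)."""
    return f"{hostname}.{subdomain}.{namespace}.svc.{cluster_domain}"


if __name__ == "__main__":
    # Values taken from this thread: service "etcd", namespace "apisix".
    print(pod_fqdn("etcd-1", "etcd", "apisix"))
    # prints "etcd-1.etcd.apisix.svc.cluster.local"
```

The name has the expected shape for a pod called etcd-1 behind a headless service named etcd in the apisix namespace. Whether the record actually exists still depends on the pod and service settings described above.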

@Yiyiyimu
Member

Yiyiyimu commented Mar 1, 2021

v3.lua:593: attempt to index field 'result' (a nil value)
is this a bug of apisix?

Yes, I think so. We missed some test coverage here, so no proper error message is prepared for this case.
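
The "attempt to index field 'result' (a nil value)" error is the classic symptom of indexing into a response body without checking its shape first. As a language-neutral illustration, here is a Python sketch of the defensive version. It assumes the watch response is JSON shaped like `{"result": {"events": [...]}}`. `extract_watch_events` is a hypothetical helper and not the fix that was made in lua-resty-etcd or APISIX.

```python
import json


def extract_watch_events(chunk: str):
    """Return (events, error) for one watch response chunk.

    Assumes the JSON shape {"result": {"events": [...]}}; a body without
    "result" is reported as an error instead of being indexed blindly.
    """
    try:
        body = json.loads(chunk)
    except json.JSONDecodeError as exc:
        return None, f"failed to decode watch response: {exc}"

    result = body.get("result") if isinstance(body, dict) else None
    if not isinstance(result, dict):
        # Keep the message short but include the start of the body.
        return None, "watch response has no 'result' field: " + chunk[:200]
    return result.get("events", []), None


if __name__ == "__main__":
    print(extract_watch_events('{"result": {"events": []}}'))
    print(extract_watch_events('{"error": {"message": "unavailable"}}'))
```

Returning an explicit error here gives the caller something meaningful to log and retry on, instead of a stack traceback.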

@Yiyiyimu
Member

Yiyiyimu commented Mar 1, 2021

@GBXing Got it, the error log is the same. Could you also show your config.yaml?

@Yiyiyimu
Member

Yiyiyimu commented Mar 1, 2021

I'm not sure whether you set these in etcd's Statefulset templates fields.

Hi @tokers, it seems @GBXing deployed etcd with Docker rather than K8S, so that might not be the solution.

@tokers
Contributor

tokers commented Mar 1, 2021

I'm not sure whether you set these in etcd's Statefulset templates fields.

Hi @tokers it seems @GBXing deploy etcd with docker but not k8s, so that might not be the solution

I see, but that is the case I'm confused about 😂

@GBXing
Author

GBXing commented Mar 1, 2021

@Yiyiyimu I use the default-config.yaml from version 2.2. This is my config.yaml:

apisix:
  node_listen: 9080                # APISIX listening port
  enable_admin: true
  enable_admin_cors: true          # Admin API support CORS response headers.
  enable_debug: false
  enable_dev_mode: true           # Sets nginx worker_processes to 1 if set to true
  enable_reuseport: true           # Enable nginx SO_REUSEPORT switch if set to true.
  enable_ipv6: true
  config_center: etcd              # etcd: use etcd to store the config value
  allow_admin:
  admin_key:
    -
      name: "admin"
      key: edd1c9f034335f136f87ad84b625c8f1
      role: admin                 # admin: manage all configuration data
                                  # viewer: only can view configuration data
    -
      name: "viewer"
      key: 4054f7cf07e344346cd3f287985e76a2
      role: viewer

nginx_config:
  error_log: "logs/error.log"
  error_log_level: "warn"
  http:
    lua_shared_dicts:
      shared-datamap: 50m

etcd:
  host:
    - "http://etcd-node1:2379"
    - "http://etcd-node2:2379"
    - "http://etcd-node3:2379"
    # - "http://127.0.0.1:2379"
    # - "http://172.17.0.1:2379"
  prefix: "/apisix"
  timeout: 30
  tls:
    verify: true
  # resync_delay: 5
  # user: root
  # password: 5tHkHhYkjr6cQY



plugins:                          # plugin list (sorted in alphabetical order)
  - api-breaker
  - authz-keycloak
  - basic-auth
  - batch-requests
  - consumer-restriction
  - cors
  - echo
  # - error-log-logger
  # - example-plugin
  - fault-injection
  - grpc-transcode
  - hmac-auth
  - http-logger
  - ip-restriction
  - jwt-auth
  - kafka-logger
  - key-auth
  - limit-conn
  - limit-count
  - limit-req
  # - log-rotate
  # - node-status
  - openid-connect
  - prometheus
  - proxy-cache
  - proxy-mirror
  - proxy-rewrite
  - redirect
  - referer-restriction
  - request-id
  - request-validation
  - response-rewrite
  - serverless-post-function
  - serverless-pre-function
  # - skywalking
  - sls-logger
  - syslog
  - tcp-logger
  - udp-logger
  - uri-blocker
  - wolf-rbac
  - zipkin
  - server-info
  - traffic-split

plugin_attr:
  log-rotate:
    interval: 3600
    max_kept: 168
  skywalking:
    service_name: APISIX
    service_instance_name: "APISIX Instance Name"
    endpoint_addr: http://127.0.0.1:12800
  prometheus:
    export_uri: /apisix/prometheus/metrics
  server-info:
    report_interval: 60
    report_ttl: 3600

@membphis
Member

ping @Yiyiyimu

@membphis
Member

any news? @Yiyiyimu

@nanamikon
Contributor

Any news? We hit a similar problem: some of our nodes (not all) log this error message. My etcd version is 3.4.13.

stack traceback:
        ...p/huya-nginx-proxy//deps/share/lua/5.1/resty/etcd/v3.lua:652: in function 'res_func'
        /data/app/huya-nginx-proxy/apisix/core/config_etcd.lua:131: in function 'waitdir'
        /data/app/huya-nginx-proxy/apisix/core/config_etcd.lua:318: in function 'sync_data'
        /data/app/huya-nginx-proxy/apisix/core/config_etcd.lua:546: in function </data/app/huya-nginx-proxy/apisix/core/config_etcd.lua:536>
        [C]: in function 'xpcall'

However, the Admin API served by these nodes still works, and the nodes never recover.

@Yiyiyimu
Member

Yiyiyimu commented May 6, 2021

Any news? We hit a similar problem: some of our nodes (not all) log this error message. My etcd version is 3.4.13.

@nanamikon I will add a PR to solve it this week.
