feat: chaos test on route could still works when etcd is down #3404

Yiyiyimu · 2021-01-24T16:33:38Z

What this PR does / why we need it:

Related to #2757

Use Chaos Mesh to simulate the situation where all etcd nodes are killed, and verify that routes still work.

TODO

add a test for how APISIX behaves when etcd restarts

Pre-submission checklist:

Did you explain what problem does this PR solve? Or what new features have been added?
Have you added corresponding test cases?
Have you modified the corresponding document?
Is this PR backward compatible? If it is not backward compatible, please discuss on the mailing list first

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

moonming · 2021-01-25T02:34:43Z

t/chaos-test/kill-etcd_test.go

+	setRoute(e, http.StatusOK)
+	getRoute(e, http.StatusOK)


We should also add checks, for example on the nginx error log and on the Prometheus metrics.

Thanks! Added

Yiyiyimu · 2021-01-27T13:01:59Z

t/chaos/kill-etcd_test.go

+	// run in background
+	go func() {
+		for i := 1; ; i++ {
+			go getWithoutTest(g, "/hello", map[string]string{"Host": "foo.com"})


HELP NEEDED!!

Here I try to request the route at a constant frequency, so that I can check whether the behavior stays the same before and after etcd is shut down.

For some reason, accessing the route with the bare net/http package always returns "404 Route not found", even though the route has been created and accessing the same URL with httpexpect works (line 207). As a result the next check fails: getIngressBandwidthPerSecond returns 0 because no request succeeds (commit) (see CI error).

The reason I don't want to use httpexpect here is that it prints a lot of log output for every request sent, which makes the test log a mess (commit). So a bare HTTP GET would be better.

Hi @juzhiyuan, could you offer some help on this?

Since httpexpect prints a log entry for every request sent, I reduced the request frequency from 10 requests/s to 1 request/s to cut down on log output. The reason for lowering the frequency rather than shortening the duration is that, when the frequency is higher than 1 request/s, some requests return "503 Service Unavailable", although I don't think that is reasonable, since the frequency is still very low compared to real usage.

The question is raised in apache/apisix-dashboard#1388

Yiyiyimu · 2021-01-27T13:13:27Z

Currently, when etcd is killed, APISIX keeps printing "config_etcd.lua:530: failed to fetch data from etcd: connection refused" at a rate of 2 log entries per second per core. This can be fixed once api7/lua-resty-etcd#111 (return error code when no healthy etcd available) is resolved.

tokers · 2021-01-29T02:04:13Z

@Yiyiyimu It seems that in this case you kill all etcd members. What about killing just the leader (the cluster will recover over time), killing a follower (no issue), or killing a majority of the members (the cluster will not recover)?

Yiyiyimu · 2021-01-29T02:30:22Z

> @Yiyiyimu It seems that in this case you kill all etcd members. What about killing just the leader (the cluster will recover over time), killing a follower (no issue), or killing a majority of the members (the cluster will not recover)?

Hi @tokers, I think those cases are more about testing etcd itself than about testing APISIX. Besides, I don't think we should test etcd in this way, since that amounts to verifying how Raft and etcd work.

If we test how APISIX behaves when, say, the leader is killed, then in rare cases (because the sync is so fast) we might find that one new route is missing. But such a test case would be very unstable, so it may not be a reasonable one to add.

Update:
There seems to be an existing issue about this problem; I'll test it later. Ref

Yiyiyimu added 8 commits December 10, 2020 12:37

k8s setup

106cac3

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

remove big files

79e7fd2

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

test chaos actions

fe932e0

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

merge master

19fab70

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

finish kill etcd test

09d739c

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

remove unrelated change

d2ac0f3

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

fix license

b01ddd4

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

fix yaml check

51659d3

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

spacewander reviewed Jan 25, 2021

tokers reviewed Jan 25, 2021

.github/workflows/chaos.yml (outdated)

t/chaos-test/kill-etcd.yaml (outdated)

t/chaos-test/kill-etcd_test.go (outdated)

moonming reviewed Jan 25, 2021

juzhiyuan reviewed Jan 25, 2021

t/chaos-test/kill-etcd_test.go (outdated)

idbeta reviewed Jan 25, 2021

.github/workflows/chaos.yml (outdated)

gxthrj reviewed Jan 25, 2021

.github/workflows/chaos.yml (outdated)

Yiyiyimu added 6 commits January 25, 2021 23:26

fix according to reviews

2624410

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

fix typo

2a9f549

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

Merge branch 'master' of https://github.com/apache/incubator-apisix into chaos

6f52f6a

fix: change minikube to arm64

8d687ec

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

fix minikube setup due to kubernetes/minikube#9995

c4d02b5

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

merge master

b1c90ef

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

tokers reviewed Jan 27, 2021

t/chaos/kill-etcd_test.go (outdated)

fix test

43e09d9

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

Yiyiyimu force-pushed the chaos branch from 0519a24 to 43e09d9 Compare January 27, 2021 03:35

add test bandwidth per second not change much, but get in net/http not working

d0ca10d

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

Yiyiyimu commented Jan 27, 2021

use getRoute, but it would print too much unrelated logs

f2eb477

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

spacewander approved these changes Jan 28, 2021

t/chaos/kill-etcd_test.go (outdated)

remove uncommented code

2f12753

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

Yiyiyimu mentioned this pull request Jan 28, 2021

bug: when etcd down, apisix would keep printing error log #3444

Closed

This was referenced Jan 28, 2021

request help: bare net/http pkg would always return "404 Route not found" apache/apisix-dashboard#1388

Closed

request help: visit route faster than 1 time/s would return "503 Service Unavailable" #3447

Closed

Yiyiyimu added 4 commits January 28, 2021 18:39

fix test: route haven't set yet at first get

89c91ba

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

fix test: debug to see bandwidth at each time

6acc0f6

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

fix typo

9e48e27

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

fix: after etcd got killed, it would take longer time to get the metrics, so need to calculate the duration

11dcca1

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

Yiyiyimu added the chaos chaos scenario to do label Jan 28, 2021

Yiyiyimu mentioned this pull request Jan 28, 2021

Billboard: all chaos test to do ( welcome new ideas!😆) #3449

Closed

10 tasks

tokers reviewed Jan 29, 2021

t/chaos/kill-etcd_test.go (outdated)

membphis reviewed Jan 29, 2021

t/chaos/go.sum

Yiyiyimu added 2 commits January 29, 2021 12:14

fix grammar

d97d594

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

fix typo

125d996

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>

membphis approved these changes Jan 29, 2021

membphis merged commit a38cbf7 into apache:master Jan 29, 2021
