Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: chaos test on route could still works when etcd is down #3404

Merged
merged 24 commits into from
Jan 29, 2021

Conversation

Yiyiyimu
Copy link
Member

What this PR does / why we need it:

Related to #2757

Use chaos mesh to generate the situation, when etcd is all killed, route could still works.

TODO

  • add test on when etcd restart, how's it going

Pre-submission checklist:

  • Did you explain what problem does this PR solve? Or what new features have been added?
  • Have you added corresponding test cases?
  • Have you modified the corresponding document?
  • Is this PR backward compatible? If it is not backward compatible, please discuss on the mailing list first

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
.github/workflows/chaos.yml Outdated Show resolved Hide resolved
.github/workflows/chaos.yml Outdated Show resolved Hide resolved
kubernetes/service.yaml Outdated Show resolved Hide resolved
t/chaos-test/go.mod Outdated Show resolved Hide resolved
t/chaos-test/kill-etcd_test.go Outdated Show resolved Hide resolved
.github/workflows/chaos.yml Outdated Show resolved Hide resolved
t/chaos-test/kill-etcd.yaml Outdated Show resolved Hide resolved
t/chaos-test/kill-etcd_test.go Outdated Show resolved Hide resolved
Comment on lines 101 to 102
setRoute(e, http.StatusOK)
getRoute(e, http.StatusOK)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add check such as nginx error log and Prometheus

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Added

.github/workflows/chaos.yml Outdated Show resolved Hide resolved
.github/workflows/chaos.yml Outdated Show resolved Hide resolved
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
t/chaos/kill-etcd_test.go Outdated Show resolved Hide resolved
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
…t working

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
// run in background
go func() {
for i := 1; ; i++ {
go getWithoutTest(g, "/hello", map[string]string{"Host": "foo.com"})
Copy link
Member Author

@Yiyiyimu Yiyiyimu Jan 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HELP NEEDED!!

Here I try to visit the route in the same frequency, so I could check if it keep the same before and after etcd shutdown.

For some reason, using bare net/http pkg to access the route would always return "404 Route not found", although the route is created, and accessing the same URL with httpexpect could work(in line 207). So the next check would fail, getIngressBandwidthPerSecond would return 0 since no request succeeds (commit) (see ci error).

While the reason I don't want to use httpexpect this place, is due to it would print lots of log each time a request sent and it would make test log into a mess (commit). So a bare HTTP get would be better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @juzhiyuan could you offer some help on this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since using httpexpect would print log each time a request sent, I turn the route visit frequency from 10 times/s to 1 times/s to reduce the log printing. The reason not to reduce visit duration is that, when frequency is larger than 1 time/s, some visit would return "503 Service Unavailable", although I don't think that's reasonable since the frequency is still very low compared to the real usage.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question is raised in apache/apisix-dashboard#1388

@Yiyiyimu
Copy link
Member Author

Currently when etcd is killed, apisix would keep print out config_etcd.lua:530: failed to fetch data from etcd: connection refused, and the speed is 2 log per second per core. It could be fixed after api7/lua-resty-etcd#111 (return error code when no healthy etcd available) got fixed.

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
t/chaos/kill-etcd_test.go Outdated Show resolved Hide resolved
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
…ics, so need to calculate the duration

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
@Yiyiyimu Yiyiyimu added the chaos chaos scenario to do label Jan 28, 2021
t/chaos/kill-etcd_test.go Outdated Show resolved Hide resolved
@tokers
Copy link
Contributor

tokers commented Jan 29, 2021

@Yiyiyimu It seems in this case you kill all etcd members. What about just killing the leader (will recover overtime), killing the follower (no issue), kill majority members (cluster will not recover)?

@Yiyiyimu
Copy link
Member Author

Yiyiyimu commented Jan 29, 2021

@Yiyiyimu It seems in this case you kill all etcd members. What about just killing the leader (will recover overtime), killing the follower (no issue), kill majority members (cluster will not recover)?

Hi @tokers, I think in that case, that's more about test etcd itself, but not apisix. Besides, I don't think we should test etcd in this way, since that's what we make sure how raft and etcd works.

If we test how apisix performs, like when we kill the leader, in a rare case (because the sync is so fast), we could find maybe one new route is missing. But the test case is so unstable, so it's maybe not so reasonable


Update:
There seems issue about this problem, I'll have a test later. Ref

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
@membphis membphis merged commit a38cbf7 into apache:master Jan 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
chaos chaos scenario to do
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants