Read index retry #12780

Merged Mar 23, 2021 (2 commits)
Conversation

@wpedrak (Contributor) commented Mar 16, 2021

This is the second approach (the first being #12762) to solving #12680.

This PR is composed of 2 commits: the first is a refactor of the linearizable-read (l-read) loop, and the second is the implementation of the retry mechanism itself.

Drawbacks of this change (I would like to seek your opinion on them):

  • the retry time is hardcoded as 500ms
  • the current 7s timeout before returning an error can stretch up to 7.5s, depending on execution flow (a sketch of this shape follows below)
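
A rough, self-contained Go sketch of the shape described above (not the actual diff; `waitForReadIndex`, `resend`, and `done` are illustrative stand-ins rather than identifiers from this PR):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Illustrative values only: the PR hardcodes the retry interval as 500ms and
// reuses the existing overall request timeout (ReqTimeout, 7s by default).
const (
	readIndexRetryTime = 500 * time.Millisecond
	reqTimeout         = 7 * time.Second
)

var errTimedOut = errors.New("timed out waiting for read index response")

// waitForReadIndex sketches the retry shape: resend re-issues the ReadIndex
// request and done is closed when a usable read state arrives. errorTimer
// bounds the whole wait, while each pass through the loop re-sends the request
// after readIndexRetryTime without a response. Because a retry window can
// already be in flight when the 7s deadline expires, the observed latency of a
// failing request can grow to roughly 7.5s.
func waitForReadIndex(resend func(), done <-chan struct{}) error {
	errorTimer := time.NewTimer(reqTimeout)
	defer errorTimer.Stop()

	resend() // first attempt
	for {
		select {
		case <-done:
			return nil
		case <-errorTimer.C:
			return errTimedOut
		case <-time.After(readIndexRetryTime): // retry window elapsed
			resend()
		}
	}
}

func main() {
	done := make(chan struct{})
	go func() {
		time.Sleep(1200 * time.Millisecond) // simulate a slow response
		close(done)
	}()
	err := waitForReadIndex(func() { fmt.Println("sending ReadIndex request") }, done)
	fmt.Println("result:", err)
}
```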

server/etcdserver/v3_server.go
lg := s.Logger()
errorTimer := time.NewTimer(s.Cfg.ReqTimeout())
Contributor
I don't understand it...

Would the following approach work?

  • retryTimer instead of errorTimer.
  • both selects merged into a single select.

Contributor Author

First, I decided to use 2 selects instead of one because of this article. In short: it guards against picking a case at random when multiple channels are unblocked. It seemed reasonable when writing the code, however now I can't give any good reason to keep it this way, so I'll move it to a single select.

The second thing is "retryTimer instead of errorTimer". I'm not sure if I understand you correctly, but if you are suggesting having retryTimer (the one that measures 500ms) outside of the for loop and putting errorTimer (the one that measures 7s) in <-time.After(...), then it would not work, as we need a single errorTimer across potentially multiple retryTimers.

Contributor

Ad 1: I see. But I think we don't really care which branch is taken if multiple channels happen to become ready at exactly the same time, so I would go for merging.

Ad 2: You are right. Potentially you could use a top-level 'ticker' (time.NewTicker) to get a periodic notification to refresh the request. The benefit is that it will automatically cancel the goroutine.
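
A hypothetical sketch of the Ticker variant being suggested, reusing the constants and stand-ins from the earlier sketch (not code from this PR):

```go
// waitForReadIndexTicker replaces the per-iteration time.After call with one
// time.Ticker created outside the loop; ticker.Stop() releases it when the
// function returns, so no per-iteration timer is left behind.
func waitForReadIndexTicker(resend func(), done <-chan struct{}) error {
	errorTimer := time.NewTimer(reqTimeout)
	defer errorTimer.Stop()
	retryTicker := time.NewTicker(readIndexRetryTime)
	defer retryTicker.Stop()

	resend() // first attempt
	for {
		select {
		case <-done:
			return nil
		case <-errorTimer.C:
			return errTimedOut
		case <-retryTicker.C: // periodic "time to re-send" signal
			resend()
		}
	}
}
```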

Contributor Author

Ad 2:
I find a Timer more intuitive than a Ticker here, as "ping me after 500ms" describes the concept of a timeout better than "ping me every 500ms". However, I don't understand the part below:

The benefit is that it will automatically cancel the goroutine.

Could you elaborate?

Contributor

<-time.After(readIndexRetryTime)

under the covers starts a goroutine that sleeps for some time and then populates the channel.

If the select exits for another reason, that goroutine still exists for up to 500ms to populate a channel that no one is waiting for.

Contributor Author

Good point. I've changed it to a single timer initialised outside of the loop, and I use retryTimer.Reset(readIndexRetryTime) to refresh it.
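
A hypothetical sketch of that revision, again built on the stand-ins from the earlier sketch rather than the literal diff:

```go
// waitForReadIndexSingleTimer uses one retryTimer created outside the loop and
// re-armed with Reset, instead of a fresh timer from time.After on every
// iteration.
func waitForReadIndexSingleTimer(resend func(), done <-chan struct{}) error {
	errorTimer := time.NewTimer(reqTimeout)
	defer errorTimer.Stop()
	retryTimer := time.NewTimer(readIndexRetryTime)
	defer retryTimer.Stop()

	resend() // first attempt
	for {
		select {
		case <-done:
			return nil
		case <-errorTimer.C:
			return errTimedOut
		case <-retryTimer.C:
			// The timer has fired and its channel has been drained by this
			// case, so Reset safely starts the next retry window.
			resend()
			retryTimer.Reset(readIndexRetryTime)
		}
	}
}
```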

Contributor

@gyuho left a comment

Nice improvements for read index reliability!

Any chance to try this out for #12680?

@ptabor (Contributor) commented Mar 20, 2021

> Nice improvements for read index reliability!
>
> Any chance to try this out for #12680?

@Cjen1 evaluated the solution in #12680 (comment). The bottom 3 charts show a reduction of the delay to ~1s with the etcd-retry solution.

The etcd-postpone solution was ~700ms in 2 cases and ~1300ms in one.
