Skip to content
This repository has been archived by the owner on Apr 27, 2021. It is now read-only.

share ewma stats among workers #220

Merged
merged 2 commits into from
Jul 3, 2019
Merged

share ewma stats among workers #220

merged 2 commits into from
Jul 3, 2019

Conversation

ElvinEfendi
Copy link

@ElvinEfendi ElvinEfendi commented Jul 2, 2019

What this PR does / why we need it:

This is revert of kubernetes#3295. Additionally the PR also adds an e2e test for EWMA and a guard to error when resty lock can not be instantiated.

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #

Special notes for your reviewer:

@ElvinEfendi ElvinEfendi force-pushed the ewma-improvements-1 branch 2 times, most recently from d8d875e to 41607c5 Compare July 2, 2019 17:16
-- with an older value in the current worker. Theoretically it's possible that
-- we are stuck with the same old EWMA value but for this to happen all the workers
-- must be sending request to the same upstream in a way that they always read old value
-- and override one another's EWMA value with the old one.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could have some other unexpected side effects, but I can't tell exactly what ATM.
Also not sure how harmful that could be.

I think, if we go with this we need to emit some statsd to have a better view of what is happening.

In parallel we can dig into the solution we were talking about on Slack, asynchronously update EWMA from worker 0 based on information send by workers using shared dict [lr]push.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you have some very strong opinion that this is safe, please move forward with this in a limited environment.

@ElvinEfendi ElvinEfendi force-pushed the ewma-improvements-1 branch 2 times, most recently from cd9dccd to 4981283 Compare July 3, 2019 04:51
@ElvinEfendi ElvinEfendi force-pushed the ewma-improvements-1 branch from 4981283 to c96a721 Compare July 3, 2019 04:52
@ElvinEfendi ElvinEfendi requested review from wayt and csfrancis July 3, 2019 04:53
self.ewma_last_touched_at = {}

ngx.shared.balancer_ewma:flush_all()
ngx.shared.balancer_ewma_last_touched_at:flush_all()
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a subsequent PR I'll change this to not flush all data and ensure slow start for new upstreams. I'll also set a specific ngx.var to the value of EWMA corresponding to currently picked upstream/endpoint.

@ElvinEfendi ElvinEfendi force-pushed the ewma-improvements-1 branch from c96a721 to ce382f6 Compare July 3, 2019 13:48
f.UpdateNginxConfigMapData("load-balance", "ewma")
})

It("does not fail requests", func() {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to assert on request distribution but it's too flaky.

end
local ewma = ngx.shared.balancer_ewma:get(upstream) or 0
if lock_err ~= nil then
return ewma, lock_err
Copy link
Author

@ElvinEfendi ElvinEfendi Jul 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to note that lock_err gets ignored. In our DC implementation we also ignore this, but the difference is we emit statsd metric when an error happens.

I assume it's ignored because this happens a lot.

@ElvinEfendi ElvinEfendi merged commit 6ba947c into master Jul 3, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants