
Passive health checks #2888

Merged: 1 commit merged into master on Feb 23, 2024

Conversation

RomanZavodskikh (Member) commented Jan 26, 2024

There is a huge ongoing effort to enable Passive Health Checks in Skipper: check backends for errors and avoid sending requests to the backends that tend to fail them.

The initial idea I have at the moment is the following (a rough sketch follows the list):

  • leverage the recently developed endpointregistry component (Introduce endpointregistry component #2535) to collect more data about endpoints (total number of requests sent to the endpoint, number of requests that timed out because of a backend outage)
  • if the ratio of timed-out requests is too high for a particular endpoint, send fewer requests to that endpoint; the decrease in RPS is configurable
  • once in a while (once per statsResetPeriod) update the stats to track changes in the endpoints' states
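A rough sketch of how such per-endpoint stats could translate into a reduced request share; the type names, the failure-ratio formula, and the cap are illustrative assumptions here, not the merged implementation:

```go
package main

import "fmt"

// endpointStats is an illustrative per-endpoint counter pair; in the PR the
// real counters live in the endpointregistry component.
type endpointStats struct {
	total  int64 // requests sent to the endpoint in the current period
	failed int64 // requests that timed out / errored in the current period
}

// dropProbability returns how often new requests should skip this endpoint.
// minRequests and maxDropProbability mirror the later
// period=...,min-requests=...,max-drop-probability=... flag, but the exact
// formula is only a guess.
func dropProbability(s endpointStats, minRequests int64, maxDropProbability float64) float64 {
	if s.total < minRequests {
		return 0 // not enough data yet, treat the endpoint as healthy
	}
	p := float64(s.failed) / float64(s.total)
	if p > maxDropProbability {
		p = maxDropProbability // never drop all traffic, so the endpoint can recover
	}
	return p
}

func main() {
	fmt.Println(dropProbability(endpointStats{total: 100, failed: 90}, 10, 0.99)) // 0.9
}
```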

RomanZavodskikh added the do-not-merge and architectural (all changes in the hot path, big changes in the control plane, control flow changes in filters) labels on Jan 26, 2024
RomanZavodskikh added the wip (work in progress) label on Jan 30, 2024
szuecs (Member) commented Feb 6, 2024

Let me comment on the idea.

The initial idea I have at the moment is the following:

* leverage the endpointregistry component ([Introduce endpointregistry component #2535](https://github.com/zalando/skipper/pull/2535)) developed recently to collect more data about endpoints (total amount of reqs sent to the endpoint, amount of reqs that were timed out because of backend outage)

Yes, you need requests total and errors total over a chosen time period.
Timeouts are only one error case; another is connection reset by peer, and there are others. I think we should count errors from the skipper proxy point of view (rsp, err := tr.RoundTrip(); everything where err is not nil should count).

* if the ratio of requests that timed out is too high for a particular endpoint, then filter it out, the same way we filter out fading-in endpoints (https://github.com/zalando/skipper/blob/v0.19.32/proxy/proxy.go#L475)

* once in a while (once per `statsResetPeriod`) halve both the total and the timed-out request counts, so every single request sent now becomes less and less important in the future (weight 1 at the moment, 1/2 after `statsResetPeriod`, 1/4 after `2*statsResetPeriod`, etc.)

For the stats collection I would rather have two counters for each time period and swap them.
For example, say you have an observation period d and counters totals and errors:


package main

import "time"

// let's assume these are atomic (e.g. sync/atomic values) and package-global for simplicity of this example
var totals [2]int
var errors [2]int
var cur int // current index to access the arrays

var quit = make(chan struct{}) // closed on shutdown

// measure is called for each request-response observation
func measure(ok bool) {
	totals[cur]++
	if !ok {
		errors[cur]++
	}
}

// compute should store the result for fast lookup in the hot path
func compute(total, failed int) {
	// e.g. derive and cache a drop probability for the endpoint
}

func background() {
	d := time.Second
	ticker := time.NewTicker(d)

	for {
		select {
		case <-quit:
			ticker.Stop()
			return
		case <-ticker.C:
		}
		ptr := cur
		cur = (cur + 1) % 2 // swap to a new observation window and keep stable buckets for our compute
		compute(totals[ptr], errors[ptr])
	}
}

RomanZavodskikh force-pushed the passiveHealthChecks branch 3 times, most recently from 43d2d23 to c9abb70 on February 9, 2024
RomanZavodskikh marked this pull request as ready for review on February 9, 2024
RomanZavodskikh added and then removed the wip (work in progress) label on Feb 9, 2024
RomanZavodskikh marked this pull request as draft on February 19, 2024
RomanZavodskikh force-pushed the passiveHealthChecks branch 2 times, most recently from 519bddf to fe7d257 on February 21, 2024
szuecs (Member) commented Feb 22, 2024

I am testing this on my computer:

# failing backend
./bin/skipper -inline-routes='fail: Traffic(0.9) -> latency("600ms") -> status(502) -> <shunt>; r: * -> status(200) -> repeatContent("foo",1) -> <shunt>' -address :9001

# good backends
./bin/skipper -inline-routes='r: * -> status(200) -> repeatContent("foo",1) -> <shunt>' -address :9002 
./bin/skipper -inline-routes='r: * -> status(200) -> repeatContent("foo",1) -> <shunt>' -address :9003 
./bin/skipper -inline-routes='r: * -> status(200) -> repeatContent("foo",1) -> <shunt>' -address :9004 

# proxy
./bin/skipper -inline-routes='* -> backendTimeout("100ms") -> <roundRobin , "http://127.0.0.1:9001", "http://127.0.0.1:9002", "http://127.0.0.1:9003", "http://127.0.0.1:9004">' -passive-health-check='period=1s,min-requests=10,max-drop-probability=0.99'

# client
echo "GET http://127.0.0.1:9090" | vegeta attack -rate 100/s -duration 1m| vegeta report
# without PHC
Requests      [total, rate, throughput]  6000, 100.02, 77.77
Duration      [total, attack, wait]      1m0.062468929s, 59.990385304s, 72.083625ms
Latencies     [mean, 50, 95, 99, max]    23.311264ms, 1.211641ms, 101.237344ms, 101.579846ms, 103.797075ms
Bytes In      [total, mean]              25935, 4.32
Bytes Out     [total, mean]              0, 0.00
Success       [ratio]                    77.85%
Status Codes  [code:count]               200:4671  504:1329

## change only probability

# config: period=1s,min-requests=10,max-drop-probability=0.99999
Requests      [total, rate, throughput]  6000, 100.02, 87.20
Duration      [total, attack, wait]      59.990814723s, 59.989948893s, 865.83µs
Latencies     [mean, 50, 95, 99, max]    13.965997ms, 1.158122ms, 101.110253ms, 101.531425ms, 102.462834ms
Bytes In      [total, mean]              17535, 2.92
Bytes Out     [total, mean]              0, 0.00
Success       [ratio]                    87.18%
Status Codes  [code:count]               200:5231  504:769

# config: period=1s,min-requests=10,max-drop-probability=0.99
Requests      [total, rate, throughput]  6000, 100.02, 87.34
Duration      [total, attack, wait]      1m0.061878174s, 59.990611916s, 71.266258ms
Latencies     [mean, 50, 95, 99, max]    13.717603ms, 1.160598ms, 101.112726ms, 101.487819ms, 102.521526ms
Bytes In      [total, mean]              17310, 2.88
Bytes Out     [total, mean]              0, 0.00
Success       [ratio]                    87.43%
Status Codes  [code:count]               200:5246  504:754

# config: period=1s,min-requests=10,max-drop-probability=0.9
Requests      [total, rate, throughput]  6000, 100.02, 87.19
Duration      [total, attack, wait]      1m0.062495893s, 59.990023414s, 72.472479ms
Latencies     [mean, 50, 95, 99, max]    13.87566ms, 1.171148ms, 101.101737ms, 101.467003ms, 102.234015ms
Bytes In      [total, mean]              17445, 2.91
Bytes Out     [total, mean]              0, 0.00
Success       [ratio]                    87.28%
Status Codes  [code:count]               200:5237  504:763

# config: period=1s,min-requests=10,max-drop-probability=0.5
Requests      [total, rate, throughput]  6000, 100.02, 86.81
Duration      [total, attack, wait]      1m0.071568876s, 59.990726491s, 80.842385ms
Latencies     [mean, 50, 95, 99, max]    14.255548ms, 1.176741ms, 101.179163ms, 101.578261ms, 102.167219ms
Bytes In      [total, mean]              17775, 2.96
Bytes Out     [total, mean]              0, 0.00
Success       [ratio]                    86.92%
Status Codes  [code:count]               200:5215  504:785

## change only period

# config: period=1s,min-requests=10,max-drop-probability=0.99
Requests      [total, rate, throughput]  6000, 100.02, 87.34
Duration      [total, attack, wait]      1m0.061878174s, 59.990611916s, 71.266258ms
Latencies     [mean, 50, 95, 99, max]    13.717603ms, 1.160598ms, 101.112726ms, 101.487819ms, 102.521526ms
Bytes In      [total, mean]              17310, 2.88
Bytes Out     [total, mean]              0, 0.00
Success       [ratio]                    87.43%
Status Codes  [code:count]               200:5246  504:754

# config: period=2s,min-requests=10,max-drop-probability=0.99
Requests      [total, rate, throughput]  6000, 100.02, 88.68
Duration      [total, attack, wait]      59.99172223s, 59.990554612s, 1.167618ms
Latencies     [mean, 50, 95, 99, max]    12.494545ms, 1.167441ms, 101.087748ms, 101.453525ms, 102.300948ms
Bytes In      [total, mean]              16200, 2.70
Bytes Out     [total, mean]              0, 0.00
Success       [ratio]                    88.67%
Status Codes  [code:count]               200:5320  504:680

# config: period=5s,min-requests=10,max-drop-probability=0.99
Requests      [total, rate, throughput]  6000, 100.02, 89.41
Duration      [total, attack, wait]      59.9917374s, 59.990436431s, 1.300969ms
Latencies     [mean, 50, 95, 99, max]    11.762907ms, 1.163042ms, 101.100827ms, 101.513375ms, 102.320193ms
Bytes In      [total, mean]              15540, 2.59
Bytes Out     [total, mean]              0, 0.00
Success       [ratio]                    89.40%
Status Codes  [code:count]               200:5364  504:636

# config: period=10s,min-requests=10,max-drop-probability=0.99
Requests      [total, rate, throughput]  6000, 100.02, 93.86
Duration      [total, attack, wait]      59.991338643s, 59.990097256s, 1.241387ms
Latencies     [mean, 50, 95, 99, max]    7.306304ms, 1.137679ms, 100.903262ms, 101.396707ms, 102.076756ms
Bytes In      [total, mean]              11535, 1.92
Bytes Out     [total, mean]              0, 0.00
Success       [ratio]                    93.85%
Status Codes  [code:count]               200:5631  504:369

## change min-requests

# config: period=1s,min-requests=10,max-drop-probability=0.99
Requests      [total, rate, throughput]  6000, 100.02, 87.34
Duration      [total, attack, wait]      1m0.061878174s, 59.990611916s, 71.266258ms
Latencies     [mean, 50, 95, 99, max]    13.717603ms, 1.160598ms, 101.112726ms, 101.487819ms, 102.521526ms
Bytes In      [total, mean]              17310, 2.88
Bytes Out     [total, mean]              0, 0.00
Success       [ratio]                    87.43%
Status Codes  [code:count]               200:5246  504:754

# config: period=1s,min-requests=5,max-drop-probability=0.99
Requests      [total, rate, throughput]  6000, 100.02, 88.63
Duration      [total, attack, wait]      59.992140248s, 59.990822578s, 1.31767ms
Latencies     [mean, 50, 95, 99, max]    12.539775ms, 1.16641ms, 101.078034ms, 101.471462ms, 102.111202ms
Bytes In      [total, mean]              16245, 2.71
Bytes Out     [total, mean]              0, 0.00
Success       [ratio]                    88.62%
Status Codes  [code:count]               200:5317  504:683

# config: period=1s,min-requests=25,max-drop-probability=0.99
Requests      [total, rate, throughput]  6000, 100.02, 80.14
Duration      [total, attack, wait]      1m0.071885967s, 59.990530836s, 81.355131ms
Latencies     [mean, 50, 95, 99, max]    20.934683ms, 1.193526ms, 101.288121ms, 101.618813ms, 106.245114ms
Bytes In      [total, mean]              23790, 3.96
Bytes Out     [total, mean]              0, 0.00
Success       [ratio]                    80.23%
Status Codes  [code:count]               200:4814  504:1186

## fine tune 

# period=5s,min-requests=5,max-drop-probability=0.99
Requests      [total, rate, throughput]  6000, 100.02, 94.53
Duration      [total, attack, wait]      59.990708329s, 59.989737923s, 970.406µs
Latencies     [mean, 50, 95, 99, max]    6.638696ms, 1.139882ms, 91.635898ms, 101.341263ms, 102.028961ms
Bytes In      [total, mean]              10935, 1.82
Bytes Out     [total, mean]              0, 0.00
Success       [ratio]                    94.52%
Status Codes  [code:count]               200:5671  504:329

# period=10s,min-requests=5,max-drop-probability=0.99
Requests      [total, rate, throughput]  6000, 100.02, 95.66
Duration      [total, attack, wait]      59.991345397s, 59.990136973s, 1.208424ms
Latencies     [mean, 50, 95, 99, max]    5.514355ms, 1.145402ms, 1.995272ms, 101.40748ms, 102.648189ms
Bytes In      [total, mean]              9915, 1.65
Bytes Out     [total, mean]              0, 0.00
Success       [ratio]                    95.65%
Status Codes  [code:count]               200:5739  504:261

AlexanderYastrebov (Member)

👍

RomanZavodskikh (Member, Author)

Signed-off-by: Roman Zavodskikh <roman.zavodskikh@zalando.de>
szuecs (Member) commented Feb 23, 2024

👍

1 similar comment
RomanZavodskikh (Member, Author)

👍

RomanZavodskikh (Member, Author) commented Feb 23, 2024

Load tested the latest image from this PR with the following options on a k8s cluster, using 1 healthy and 1 completely unhealthy endpoint.

# no options
Requests      [total, rate, throughput]         18005, 100.00, 49.74
Duration      [total, attack, wait]             3m0s, 3m0s, 12.291ms
Latencies     [min, mean, 50, 90, 95, 99, max]  2.331ms, 33.466ms, 39.489ms, 63.35ms, 63.878ms, 67.248ms, 306.544ms
Bytes In      [total, mean]                     207467, 11.52
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           49.75%
Status Codes  [code:count]                      200:8957  504:9048  

# -passive-health-check=period=1s,min-requests=10,max-drop-probability=0.99999
Requests      [total, rate, throughput]         20001, 100.01, 77.58
Duration      [total, attack, wait]             3m20s, 3m20s, 61.625ms
Latencies     [min, mean, 50, 90, 95, 99, max]  1.338ms, 16.912ms, 3.436ms, 63.252ms, 63.822ms, 65.332ms, 277.025ms
Bytes In      [total, mean]                     180336, 9.02
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           77.60%
Status Codes  [code:count]                      200:15520  504:4481  

# -passive-health-check=period=5s,min-requests=5,max-drop-probability=0.99
Requests      [total, rate, throughput]         18007, 100.01, 78.71
Duration      [total, attack, wait]             3m0s, 3m0s, 3.501ms
Latencies     [min, mean, 50, 90, 95, 99, max]  1.76ms, 16.075ms, 3.222ms, 62.711ms, 63.102ms, 64.696ms, 290.57ms
Bytes In      [total, mean]                     160564, 8.92
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           78.70%
Status Codes  [code:count]                      200:14172  504:3835  

# -passive-health-check=period=30s,min-requests=5,max-drop-probability=0.99
Requests      [total, rate, throughput]         18007, 100.01, 99.34
Duration      [total, attack, wait]             3m0s, 3m0s, 2.803ms
Latencies     [min, mean, 50, 90, 95, 99, max]  2.167ms, 3.733ms, 3.06ms, 4.018ms, 4.541ms, 14.933ms, 218.219ms
Bytes In      [total, mean]                     127129, 7.06
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           99.33%
Status Codes  [code:count]                      200:17887  504:120  

Seems fair enough to merge.

RomanZavodskikh merged commit accfde5 into master on Feb 23, 2024
14 checks passed
RomanZavodskikh deleted the passiveHealthChecks branch on February 23, 2024
Comment on lines +23 to +27
if p < e.Metrics.HealthCheckDropProbability() {
	/* drop */
} else {
	filtered = append(filtered, e)
}
Member: why not

if p >= e.Metrics.HealthCheckDropProbability() {
     filtered = append(filtered, e)
}

Member: I think we want to be clear about the decision: if p < e.Metrics.HealthCheckDropProbability() holds true, we want to skip the endpoint.
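For context, a self-contained sketch of a filtering loop around that snippet; the endpoint type and the source of p are simplified stand-ins for skipper's routing types, not the PR's actual code:

```go
package main

import (
	"fmt"
	"math/rand"
)

// endpoint is a simplified stand-in for skipper's routing endpoint type.
type endpoint struct {
	host            string
	dropProbability float64 // what HealthCheckDropProbability() would report
}

// filterHealthyEndpoints drops endpoints probabilistically: a fresh p is drawn
// per endpoint and the endpoint is skipped when p < dropProbability, keeping
// the "we decided to drop" branch explicit as discussed above.
func filterHealthyEndpoints(endpoints []endpoint, rnd *rand.Rand) []endpoint {
	filtered := make([]endpoint, 0, len(endpoints))
	for _, e := range endpoints {
		p := rnd.Float64()
		if p < e.dropProbability {
			/* drop */
		} else {
			filtered = append(filtered, e)
		}
	}
	return filtered
}

func main() {
	rnd := rand.New(rand.NewSource(1))
	eps := []endpoint{
		{host: "127.0.0.1:9001", dropProbability: 0.9}, // failing backend
		{host: "127.0.0.1:9002", dropProbability: 0.0}, // healthy backend
	}
	fmt.Println(filterHealthyEndpoints(eps, rnd))
}
```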

@@ -145,6 +145,69 @@ type OpenTracingParams struct {
ExcludeTags []string
}

type PassiveHealthCheck struct {
// The period of time after which the endpointregistry begins to calculate endpoints statistics
MustafaSaber (Member) commented Feb 23, 2024:

I think for godocs you need to start the comment with the field name, i.e. // Period .... Same for the other comments.

MaxDropProbability float64
}
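A godoc-friendly variant of those comments could look like the following; only MaxDropProbability and the Period comment appear in the hunk above, so the Period and MinRequests fields are assumed from the period/min-requests flag keys:

```go
package config

import "time"

// PassiveHealthCheck holds the passive health check options parsed
// from the -passive-health-check flag.
type PassiveHealthCheck struct {
	// Period is the period of time after which the endpointregistry
	// begins to calculate endpoint statistics.
	Period time.Duration

	// MinRequests is the minimum number of requests per period an
	// endpoint must receive before it can be considered unhealthy.
	MinRequests int64

	// MaxDropProbability is the upper bound for the probability of
	// dropping a request to an unhealthy endpoint.
	MaxDropProbability float64
}
```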

func InitPassiveHealthChecker(o map[string]string) (bool, *PassiveHealthCheck, error) {
Member: maybe rename o to opts?
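A rough sketch of what the option parsing behind that signature might look like; the keys are taken from the -passive-health-check examples in this thread, while the validation and defaults are assumptions, not the merged code:

```go
package config

import (
	"fmt"
	"strconv"
	"time"
)

// passiveHealthCheckSketch is a trimmed local copy of the PassiveHealthCheck
// fields, only so this sketch compiles on its own.
type passiveHealthCheckSketch struct {
	Period             time.Duration
	MinRequests        int64
	MaxDropProbability float64
}

// initPassiveHealthCheckerSketch mirrors the shape of InitPassiveHealthChecker:
// it takes the already-split key=value options (e.g. from
// -passive-health-check='period=1s,min-requests=10,max-drop-probability=0.99')
// and fills the struct, reporting whether the feature is enabled.
func initPassiveHealthCheckerSketch(opts map[string]string) (bool, *passiveHealthCheckSketch, error) {
	if len(opts) == 0 {
		// no options given: feature stays disabled
		return false, &passiveHealthCheckSketch{}, nil
	}
	result := &passiveHealthCheckSketch{}
	for key, value := range opts {
		switch key {
		case "period":
			d, err := time.ParseDuration(value)
			if err != nil || d <= 0 {
				return false, nil, fmt.Errorf("invalid period: %q", value)
			}
			result.Period = d
		case "min-requests":
			n, err := strconv.ParseInt(value, 10, 64)
			if err != nil || n <= 0 {
				return false, nil, fmt.Errorf("invalid min-requests: %q", value)
			}
			result.MinRequests = n
		case "max-drop-probability":
			p, err := strconv.ParseFloat(value, 64)
			if err != nil || p < 0 || p > 1 {
				return false, nil, fmt.Errorf("invalid max-drop-probability: %q", value)
			}
			result.MaxDropProbability = p
		default:
			return false, nil, fmt.Errorf("unknown passive health check option: %q", key)
		}
	}
	return true, result, nil
}
```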

AlexanderYastrebov added a commit that referenced this pull request May 7, 2024
YAML flag parses (preferably flow-style) YAML value from the command line
or unmarshals object yaml value from the yaml config file.

Example value:
```
bin/skipper -foo-flag='{foo: hello, bar: [world, "1"], baz: 2, qux: {baz: 3}}'
```

and equivalent branch in config yaml:
```yaml
foo-flag:
  foo: hello
  bar:
    - world
    - "1"
  baz: 2
  qux:
    baz: 3
```

This will be useful for #2104

It is also a better alternative to manual parsing of micro-syntax like
e.g. implemented in #2888

Signed-off-by: Alexander Yastrebov <alexander.yastrebov@zalando.de>
AlexanderYastrebov added a commit that referenced this pull request May 7, 2024
AlexanderYastrebov added a commit that referenced this pull request May 8, 2024
JanardhanSharma pushed a commit to JanardhanSharma/skipper that referenced this pull request Jul 19, 2024