v5: Health Check & LeastLoad Strategy #589

qjebbs · 2021-01-07T09:02:39Z

The following config explains everything:

api:
  services:
  - RoutingService   # required for 'v2ray api bi' to work
routing:
  balancers:
  - tag: balancer
    selector:
    - all.balancer
    fallbackTag: block    # fall back to block if the balancer fails to pick one
    strategy:
      type: LeastLoad   # select the least-loaded nodes, i.e. those with a lower RTT standard deviation
      settings:
        healthCheck:
          interval: 120 # check each outbound every 2 minutes on average (checks are randomly timed), default 60s
          sampling: 5   # use the 5 most recent RTTs for selection, default 5
                        # the settings above evaluate stability over the last 10 minutes
          destination: http://www.google.com/gen_204
          connectivity: http://connectivitycheck.platform.hicloud.com/generate_204 # check connectivity if the ping fails
          timeout: 5    # health ping timeout 5s
        costs:                          # optional. cost values are used to penalize RTT standard deviations
        - match: all.balancer.node1     # set cost=16 for a specific node
          value: 16
        - match: keyword                # exclude nodes by keyword
          value: 99999
        - match: cost5                  # set cost=5 if the node name matches cost5 (extracts 5 from cost5)
        - regexp: true
          match: \.weight\d+(\.\d+)?    # matches and sets the cost from .weight<num>; users can set weights via
                                        # outbound tags, like all.node1.weight0.5, all.node2.weight5
        tolerance: 0.1  # max acceptable failure rate
        maxRTT: 1000    # max acceptable RTT, filters out high-latency nodes. default 0
        expected: 3     # make sure 3 nodes are selected (bandwidth priority)
        baselines:      # also select nodes near the above 3 according to "RTT standard deviation" baselines
        - 30
        - 40
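
The `.weight<num>` rule in the config above can be illustrated with a small Go sketch. It only shows one way a cost could be read from an outbound tag; `costFromTag` and `numberPattern` are hypothetical names introduced here, not the PR's actual code.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// weightPattern is the regexp used in the config above.
var weightPattern = regexp.MustCompile(`\.weight\d+(\.\d+)?`)

// numberPattern (an assumption of this sketch) pulls the numeric part out of the match.
var numberPattern = regexp.MustCompile(`\d+(\.\d+)?`)

// costFromTag returns the cost embedded in an outbound tag, if there is one.
func costFromTag(tag string) (float64, bool) {
	m := weightPattern.FindString(tag)
	if m == "" {
		return 0, false
	}
	v, err := strconv.ParseFloat(numberPattern.FindString(m), 64)
	if err != nil {
		return 0, false
	}
	return v, true
}

func main() {
	for _, tag := range []string{"all.node1.weight0.5", "all.node2.weight5", "all.node3"} {
		cost, ok := costFromTag(tag)
		fmt.Println(tag, cost, ok)
	}
}
```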

More about baselines & expected:

// selectLeastLoad selects nodes according to Baselines and Expected Count.
//
// The strategy always improves network response speed, no matter which mode below is configured.
// But the modes can still have different priorities.
//
// 1. Bandwidth priority: no Baselines + Expected Count > 0: selects `Expected Count` nodes
// (one if Expected Count <= 0).
//
// 2. Bandwidth priority advanced: Baselines + Expected Count > 0.
// Selects `Expected Count` nodes, and also those near them according to the baselines.
// In other words, it tries the Baselines in turn until one of them satisfies
// the Expected Count; if no Baseline does, Expected Count is applied.
//
// 3. Speed priority: Baselines + `Expected Count <= 0`.
// Goes through all baselines until one selects some nodes; if none does, selects nothing. Used in
// combination with 'balancer.fallbackTag', this means: select qualified nodes or use the fallback.
func (s *LeastLoadStrategy) selectLeastLoad(nodes []*node) []*node {

To inspect how the LeastLoad strategy behaves while tweaking settings:

v2ray api provides tools to manipulate V2Ray via its API.

Usage:

	v2ray api <command> [arguments]

The commands are:

        ...
        bc            balancer health check
        bi            balancer information
        bo            balancer select override
        ...

Use "v2ray help api <command>" for more information about a command.

> v2ray api bi

Balancer: selector
  - Strategy:
    random
  - Selects:
        Tag
    1   all.xxxxxxxxxx
    2   all.xxxxxxxxxx
    3   all.xxxxxxxxxx
Balancer: wrappers
  - Strategy:
    leastload, expected: 3, baselines: 30ms 40ms, max rtt: 1s, tolerance: 0.1
    health ping, interval: 2m0s, sampling: 5, timeout: 5s, destination: http://www.google.com/gen_204
  - Selects:
            RTT STD+C     RTT STD.      RTT Avg.      Hit   Cost  Tag
    1   OK  72.881354ms   72.881354ms   667.268022ms  10/10 1.00  all.balancer.xxxxxxx
    2   OK  80.283841ms   80.283841ms   773.369174ms  10/10 1.00  all.balancer.xxxxxxx
    3   OK  97.549384ms   97.549384ms   705.809272ms  9/10  1.00  all.balancer.xxxxxxx
  - Others:
            RTT STD+C     RTT STD.      RTT Avg.      Hit   Cost  Tag
    4   OK  103.248756ms  73.007896ms   951.2193ms    10/10 2.00  all.balancer.xxxxxxx
    5   OK  121.640366ms  43.006364ms   711.839948ms  10/10 8.00  all.balancer.xxxxxxx
    6   >   321.662975ms  321.662975ms  1.178536215s  10/10 1.00  all.balancer.xxxxxxx
    7   >   364.949006ms  364.949006ms  1.138114298s  10/10 1.00  all.balancer.xxxxxxx
    8   x   -             -             -             0/10  1.00  all.balancer.xxxxxxx
    9   x   -             -             -             0/10  1.00  all.balancer.xxxxxxx
    10  x   -             -             -             0/10  1.00  all.balancer.xxxxxxx

PS: The default strategy is Random (the original strategy) and its behavior is unchanged; users don't need to set it explicitly.

database64128 · 2021-01-07T09:23:54Z

Can we have the interval in seconds? Currently it can be confusing since the timeout field right below it is in seconds.

An example of a load balancer that does the job well is https://github.com/shadowsocks/shadowsocks-rust/blob/master/crates/shadowsocks-service/src/local/loadbalancing/server_stat.rs. It pulls TCP and UDP separately every 6 seconds, and calculates the score based on different weights applied to RTT, failure rate, and latency stdev from the last 10 minutes.

qjebbs · 2021-01-07T09:39:43Z

> Can we have the interval in seconds? Currently it can be confusing since the timeout field right below it is in seconds.
>
> An example of a load balancer that does the job well is https://github.com/shadowsocks/shadowsocks-rust/blob/master/crates/shadowsocks-service/src/local/loadbalancing/server_stat.rs. It pulls TCP and UDP separately every 6 seconds, and calculates the score based on different weights applied to RTT, failure rate, and latency stdev from the last 10 minutes.

Thanks for the information.

~~One check could take 5, 10 seconds, and even more, small seconds interval is meaningless. I take the unit as 'minutes' on purpose.~~

I wrote this code without reading any other implementations (and I cannot read Rust). There can of course be better implementations. I hope someone else can continue the work, since it works well enough for me now:

- Improve the health check logic (with RTT_WEIGHT, FAIL_WEIGHT, and more?)
- Continue to write better strategies

qjebbs · 2021-01-07T09:54:24Z

@database64128 I wonder why they measure only by pulling TCP and UDP, rather than making a real request through the outbound handler. (Did I misunderstand you?)

Suppose the outbound is set up as a chain like frontend->IPLC->outlet. Measuring at the frontend is unreliable, since the bottleneck can be the IPLC or the outlet, or the later nodes may even have failed.

PS: I once considered measuring nodes with speed tests, which could be more reliable than ping, but that is much heavier. Many clients, like Shadowrocket, do a similar thing (HTTP 204 ping).

database64128 · 2021-01-07T11:26:45Z

@qjebbs shadowsocks-rust sends real requests for latency measurements. My point was that, in addition to the TCP latency test with HTTP 204 pings as implemented in your PR, ss-rust also implements a UDP test by sending DNS requests. TCP and UDP scores are calculated separately, and are likely to end up selecting different servers for TCP and UDP. This helps a lot in situations where certain servers perform well with TCP traffic but poorly with UDP traffic.

qjebbs · 2021-01-11T03:44:47Z

Could this PR introduce a characteristic that packet inspection can pick up on the client side (many requests with extremely small payloads in every interval)? I choose to keep it for personal use.

database64128 · 2021-01-11T03:58:08Z

I'd love to see this PR merged. The feature is quite useful and extensible. And users can always choose to use it or not.

database64128 · 2021-01-11T04:01:43Z

> Could this PR introduce a characteristic that packet inspection can pick up on the client side (many requests with extremely small payloads in every interval)?

It's probably not uncommon for application protocols to have keep-alive messages that are sent at regular intervals.

qjebbs · 2021-01-11T04:08:31Z

@database64128 I have no expertise in this field. If the v2fly team discusses it and chooses to merge it, I'd be glad to see that happen. 😁

kslr · 2021-01-11T04:39:56Z

Work in this area is meaningful; I support a merge.

In the future, we won't need a regular network test; we can get the information from the route logs.

qjebbs · 2021-01-11T04:52:58Z

> Work in this area is meaningful; I support a merge.
>
> In the future, we won't need a regular network test; we can get the information from the route logs.

Yes, it's a good plan, and we could even collect speed info for the balancing strategy. The disadvantage is that we would first send a request to a failed node, and only then decide not to use it next time.

I'm not familiar with v2ray's underlying logic, like Link and Stat. I hope someone will continue on this part.

kslr · 2021-01-11T06:06:41Z

We can support multiple types of balancers, and keep the timed measurements.

qjebbs · 2021-01-11T07:28:34Z

@kslr Yes, the PR supports multiple balancers with different strategies and different strategy settings. Currently we have Random and LeastLoad, and both are based on the health checker.

LazyZhu · 2021-01-27T22:12:14Z

@qjebbs
Are multiple balancer groups not supported? When several groups are configured, only the first one in the config file is tested.

    "balancers": [
        {
            "tag": "as-out",
            "selector": [
                "server1a",
                "server1b",
                "server1c"
            ],
            "strategy": {
                "type": "LeastLoad",
                "settings": {
                    "healthCheck": {
                        "destination": "http://detectportal.firefox.com/success.txt",
                        "timeout": 5,
                        "interval": 60,
                        "sampling": 10
                    },
                    "maxRTT": 500,
                    "expected": 3,
                    "baselines": [
                        20,
                        30,
                        50,
                        100,
                        200
                    ]
                }
            }
        },
        {
            "tag": "na-out",
            "selector": [
                "server2a",
                "server2b",
                "server2c"
            ],
            "strategy": {
                "type": "LeastLoad",
                "settings": {
                    "healthCheck": {
                        "destination": "http://detectportal.firefox.com/success.txt",
                        "timeout": 5,
                        "interval": 60,
                        "sampling": 10
                    },
                    "maxRTT": 1200,
                    "expected": 0,
                    "baselines": []
                }
            }
        },
        {
            "tag": "eu-out",
            "selector": [
                "server3a",
                "server3b",
                "server3c"
            ],
            "strategy": {
                "type": "LeastLoad",
                "settings": {
                    "healthCheck": {
                        "destination": "http://detectportal.firefox.com/success.txt",
                        "timeout": 5,
                        "interval": 60,
                        "sampling": 10
                    },
                    "maxRTT": 1500,
                    "expected": 3,
                    "baselines": [
                        50,
                        100,
                        150,
                        200,
                        300
                    ]
                }
            }
        }
    ]

qjebbs · 2021-01-28T15:07:55Z

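@LazyZhu There was a major bug; it should be fixed in the latest version.

@kslr I added cost weighting, failure tolerance, manual balancer override, and network connectivity checks. Everything I wanted to do is done; whatever is left I either can't do or no longer want to do 😂

What's not done is pausing the checks during a manual override: it causes a data race, which can be solved, but not in a simple and elegant way.

~~Please pay particular attention to the racing situation when reviewing; I have already fallen into several pits.~~

In the end I switched back to sync.Mutex, so it should be mostly fine now.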

PS: TestServiceSubscribeRoutingStats fails randomly. Is that because it hard-codes the port, and the port is sometimes unavailable?

testCases := []*RoutingContext{
	{InboundTag: "in", OutboundTag: "out"},
	{TargetIPs: [][]byte{{1, 2, 3, 4}}, TargetPort: 8080, OutboundTag: "out"},
	{TargetDomain: "example.com", TargetPort: 443, OutboundTag: "out"},
	{SourcePort: 9999, TargetPort: 9999, OutboundTag: "out"},
	{Network: net.Network_UDP, OutboundGroupTags: []string{"outergroup", "innergroup"}, OutboundTag: "out"},
	{Protocol: "bittorrent", OutboundTag: "blocked"},
	{User: "example@v2fly.org", OutboundTag: "out"},
	{SourceIPs: [][]byte{{127, 0, 0, 1}}, Attributes: map[string]string{"attr": "value"}, OutboundTag: "out"},
}

qjebbs · 2021-01-29T06:57:48Z

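OK, that's a wrap.

Looking back, the core strategy is only about 300 lines of code (I thought it would be simple: check, then pick), yet the supporting code around it runs to 3000 lines, which wore me out.
Now everything from ping, sampling, statistics, weighting, selection, and statistics output to manual control is in place, so a new strategy can most likely be written in a few hundred lines in the future, including but not limited to:

- Evaluating UDP and TCP together, by weight

- Evaluating RTT and failure rate together, by weight

- IP hash, to keep the exit from changing frequently

- ...

I'll leave that to whoever is interested.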

kslr · 2021-01-29T09:07:30Z

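Great! Only the documentation is left to write.

P.S. Very likely. However, the author of that feature has abandoned it. If nobody takes it over in the near future, I may remove it and implement it in a simpler way (that is, swap in a library).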

kslr · 2021-01-30T00:31:44Z

🎆

ghost · 2021-02-26T04:13:41Z

@kslr @qjebbs

Is this balancer battery-friendly?
If I set the interval to 2147483648, will it still perform a health check when the failure rate of the current outbound is high?

iusearch · 2022-08-19T15:47:08Z

This was merged into the v5 branch, but the current v5 release only includes the LeastLoad addition; none of those api commands are there. What happened?

AkinoKaede mentioned this pull request Jan 9, 2021

R&D request: adaptive server failover routing [to address unstable proxy servers] XTLS/Xray-core#145

Closed

qjebbs closed this Jan 11, 2021

database64128 reopened this Jan 11, 2021

qjebbs added 15 commits January 12, 2021 11:10

generate .pb.go

112ca27

health checker conf

b86a575

check logic

1ca57bd

implement ping

33f39fd

fix check interval

a081a7b

improve check results

7ea4295

health check on add outbounds

9ffb51b

fix tests

c49cd65

fix ping handler

e7d2628

fix min rtt < 0

555247a

random alive

503e002

fix check all on add outbounds

5fdbd9b

least load strategy

edbb7c0

conf codes optimize

44bc05d

improve least load strategy

f60d7ac

LazyZhu mentioned this pull request Jan 27, 2021

add balance optimal strategy XTLS/Xray-core#168

Closed

api bo to override balancer selecting

777f51d

qjebbs added 11 commits January 28, 2021 11:26

fix health ping statistics & fix test

abfdeaa

check connectivity if ping fail

a40ba6c

add tolerance setting & more detailed bi output

55825b4

fix connectivity check

182e138

optimize bi output

649a32e

should not put results when network is down

cf4acc7

fixes @_@

8e98502

mux optimize

6943b8b

remove pause option of selecting overriding

73bdb36

> it causes data racing

update bo desc

5de62fd

fix potential racing

a9cc48e

qjebbs added 2 commits January 29, 2021 12:44

simplify locking

209f41f

* switch sync.Mutex to avoid potential racing * add more tests * code optimize

code optimize

6274d74

fix connectivity check when url not set

540ae10

kslr merged commit 2c5a714 into v2fly:v5 Jan 30, 2021

RPRX mentioned this pull request Feb 12, 2021

Suggestion: add fake DNS support XTLS/Xray-core#245

Closed

AkinoKaede mentioned this pull request Mar 1, 2021

[Feature Request] balancer enhancement XTLS/Xray-core#323

Closed

This was referenced Oct 12, 2022

leastload and leastping Balancers SagerNet/sing-box#151

Closed

leastload and leastping Balancers SagerNet/sing-box#163

Closed

mkmark mentioned this pull request Jan 28, 2024

What does leastLoad mean? Could you provide a config example? #2350

Closed

yuhan6665 mentioned this pull request Feb 8, 2024

Least load balancer XTLS/Xray-core#2999

Merged
