v5: Health Check & LeastLoad Strategy #589

Merged
merged 72 commits into from
Jan 30, 2021

qjebbs
Contributor

@qjebbs qjebbs commented Jan 7, 2021

The following config explains everything:

api:
  services:
  - RoutingService   # required for 'v2ray api bi' to work
routing:
  balancers:
  - tag: balancer
    selector:
    - all.balancer
    fallbackTag: block    # fallback to block if the balancer failed to pick one
    strategy:
      type: LeastLoad   # select the least-loaded nodes, i.e. those with a lower RTT standard deviation
      settings:
        healthCheck:
          interval: 120 # check each outbound every 2 minutes on average (check times are randomized), default 60s
          sampling: 5   # use the 5 most recent RTTs for selection, default 5
                        # the settings above evaluate stability over the last 10 minutes
          destination: http://www.google.com/gen_204
          connectivity: http://connectivitycheck.platform.hicloud.com/generate_204 # check connectivity if a ping fails
          timeout: 5    # health ping timeout, 5s
        costs:                          # optional. cost values are used to penalize RTT standard deviations
        - match: all.balancer.node1     # set cost=16 for a specific node
          value: 16
        - match: keyword                # exclude nodes by keyword
          value: 99999
        - match: cost5                  # set cost=5 if the node name matches cost5 (extracts 5 from cost5)
        - regexp: true
          match: \.weight\d+(\.\d+)?    # matches .weight<num> and sets the value from it; users can set the weight in
                                        # outbound tags, like all.node1.weight0.5, all.node2.weight5
        tolerance: 0.1  # max acceptable failure rate
        maxRTT: 1000    # max acceptable RTT; filters out high-latency nodes. default 0
        expected: 3     # make sure 3 nodes are selected (Bandwidth priority)
        baselines:      # also select nodes near the above 3 according to "RTT standard deviation" baselines
        - 30
        - 40
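
As a rough illustration of the `.weight<num>` rule in the config above, the number can be pulled out of an outbound tag with a regular expression. This is only a sketch: `valueFromTag` is a hypothetical helper, and the capture group around the number is added here for the example; it is not part of the pattern shown in the config.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// weightRe mirrors the `\.weight\d+(\.\d+)?` pattern from the config, with one
// capture group around the number (an assumption made for this sketch).
var weightRe = regexp.MustCompile(`\.weight(\d+(?:\.\d+)?)`)

// valueFromTag is a hypothetical helper: it returns the number embedded in an
// outbound tag, and false when the tag carries none.
func valueFromTag(tag string) (float64, bool) {
	m := weightRe.FindStringSubmatch(tag)
	if m == nil {
		return 0, false
	}
	v, err := strconv.ParseFloat(m[1], 64)
	if err != nil {
		return 0, false
	}
	return v, true
}

func main() {
	for _, tag := range []string{"all.node1.weight0.5", "all.node2.weight5", "all.node3"} {
		v, ok := valueFromTag(tag)
		fmt.Println(tag, v, ok)
	}
}
```

How the extracted number is then turned into a cost is not shown in this thread, so the sketch stops at extraction.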

More about baselines & expected:

// selectLeastLoad selects nodes according to Baselines and Expected Count.
//
// The strategy always improves network response speed, no matter which mode below is configured.
// But the modes have different priorities.
//
// 1. Bandwidth priority: no Baselines + Expected Count > 0: selects `Expected Count` nodes
// (one if Expected Count <= 0).
//
// 2. Bandwidth priority advanced: Baselines + Expected Count > 0.
// Selects `Expected Count` nodes, and also those near them according to the baselines.
// In other words, it tries each Baseline in turn until one of them satisfies
// the Expected Count; if no Baseline does, the Expected Count is applied.
//
// 3. Speed priority: Baselines + `Expected Count <= 0`.
// Goes through all baselines until one selects some nodes; if none does, selects nothing. Used in
// combination with 'balancer.fallbackTag', this means: select qualified nodes or use the fallback.
func (s *LeastLoadStrategy) selectLeastLoad(nodes []*node) []*node {
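
The three modes in the comment above can be sketched as a standalone function. This is a reading of the comment, not the PR's actual implementation: the `node` type, the `selectByBaselines` name, and the use of plain millisecond floats are all assumptions made for the example, and baselines are assumed to be sorted ascending.

```go
package main

import (
	"fmt"
	"sort"
)

// node is a hypothetical stand-in for the PR's node type.
type node struct {
	Tag string
	Dev float64 // cost-weighted RTT standard deviation, in milliseconds
}

// selectByBaselines sketches the three modes described in the comment.
// baselines are in milliseconds and assumed to be sorted ascending.
func selectByBaselines(nodes []node, baselines []float64, expected int) []node {
	if len(nodes) == 0 {
		return nil
	}
	sorted := append([]node(nil), nodes...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Dev < sorted[j].Dev })
	if expected > len(sorted) {
		expected = len(sorted)
	}
	// Mode 1: no baselines, take the best `expected` nodes (one if expected <= 0).
	if len(baselines) == 0 {
		if expected <= 0 {
			expected = 1
		}
		return sorted[:expected]
	}
	for _, b := range baselines {
		// n is the number of nodes whose deviation is within this baseline.
		n := sort.Search(len(sorted), func(i int) bool { return sorted[i].Dev > b })
		if expected > 0 && n >= expected {
			return sorted[:n] // Mode 2: this baseline satisfies the expected count
		}
		if expected <= 0 && n > 0 {
			return sorted[:n] // Mode 3: first baseline that selects anything
		}
	}
	if expected > 0 {
		return sorted[:expected] // Mode 2: no baseline matched, apply expected count
	}
	return nil // Mode 3: nothing qualified, the caller may use fallbackTag
}

func main() {
	nodes := []node{{"a", 72.9}, {"b", 80.3}, {"c", 97.5}, {"d", 103.2}, {"e", 121.6}}
	for _, n := range selectByBaselines(nodes, []float64{30, 40}, 3) {
		fmt.Println(n.Tag, n.Dev)
	}
}
```

With deviations in the 70 to 120 ms range and baselines of 30 and 40, no baseline is satisfied, so the sketch falls back to the three best nodes.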

To inspect how the LeastLoad strategy behaves so that you can tweak its settings:

v2ray api provides tools to manipulate V2Ray via its API.

Usage:

	v2ray api <command> [arguments]

The commands are:

        ...
        bc            balancer health check
        bi            balancer information
        bo            balancer select override
        ...

Use "v2ray help api <command>" for more information about a command.
> v2ray api bi

Balancer: selector
  - Strategy:
    random
  - Selects:
        Tag
    1   all.xxxxxxxxxx
    2   all.xxxxxxxxxx
    3   all.xxxxxxxxxx
Balancer: wrappers
  - Strategy:
    leastload, expected: 3, baselines: 30ms 40ms, max rtt: 1s, tolerance: 0.1
    health ping, interval: 2m0s, sampling: 5, timeout: 5s, destination: http://www.google.com/gen_204
  - Selects:
            RTT STD+C     RTT STD.      RTT Avg.      Hit   Cost  Tag
    1   OK  72.881354ms   72.881354ms   667.268022ms  10/10 1.00  all.balancer.xxxxxxx
    2   OK  80.283841ms   80.283841ms   773.369174ms  10/10 1.00  all.balancer.xxxxxxx
    3   OK  97.549384ms   97.549384ms   705.809272ms  9/10  1.00  all.balancer.xxxxxxx
  - Others:
            RTT STD+C     RTT STD.      RTT Avg.      Hit   Cost  Tag
    4   OK  103.248756ms  73.007896ms   951.2193ms    10/10 2.00  all.balancer.xxxxxxx
    5   OK  121.640366ms  43.006364ms   711.839948ms  10/10 8.00  all.balancer.xxxxxxx
    6   >   321.662975ms  321.662975ms  1.178536215s  10/10 1.00  all.balancer.xxxxxxx
    7   >   364.949006ms  364.949006ms  1.138114298s  10/10 1.00  all.balancer.xxxxxxx
    8   x   -             -             -             0/10  1.00  all.balancer.xxxxxxx
    9   x   -             -             -             0/10  1.00  all.balancer.xxxxxxx
    10  x   -             -             -             0/10  1.00  all.balancer.xxxxxxx
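
The `RTT STD+C` column in the sample output is consistent with the plain standard deviation multiplied by the square root of the cost (rows 4 and 5: 73.007896ms with cost 2, and 43.006364ms with cost 8). That relationship is inferred from the numbers only and is not confirmed by the code quoted in this thread; `weightedDeviation` is a hypothetical name for the sketch below.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// weightedDeviation scales an RTT standard deviation by sqrt(cost).
// Assumption: inferred from the sample rows, not taken from the PR's source.
func weightedDeviation(std time.Duration, cost float64) time.Duration {
	return time.Duration(float64(std) * math.Sqrt(cost))
}

func main() {
	// Inputs copied from rows 4 and 5 of the sample output.
	fmt.Println(weightedDeviation(73007896*time.Nanosecond, 2))
	fmt.Println(weightedDeviation(43006364*time.Nanosecond, 8))
}
```

The printed values should land close to the `RTT STD+C` figures of rows 4 and 5, which is why a lower-deviation node with a high cost can still rank below the selected ones.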

PS: The default strategy is Random (the original strategy). Its behavior is unchanged, so users don't need to set it explicitly.

@database64128
Contributor

database64128 commented Jan 7, 2021

Can we have the interval in seconds? Currently it can be confusing since the timeout field right below it is in seconds.

An example of a load balancer that does the job well is https://github.com/shadowsocks/shadowsocks-rust/blob/master/crates/shadowsocks-service/src/local/loadbalancing/server_stat.rs. It polls TCP and UDP separately every 6 seconds and calculates the score based on different weights applied to RTT, failure rate, and latency stdev from the last 10 minutes.
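
To make the idea of a weighted score concrete, here is a minimal sketch that combines RTT, failure rate, and latency stdev into one number. The weights, the normalization ceilings, and all names are made up for illustration; they are not the values or the formula used by shadowsocks-rust.

```go
package main

import "fmt"

// sample holds per-server measurements; all fields are hypothetical.
type sample struct {
	AvgRTTMs float64 // average round-trip time in milliseconds
	StdDevMs float64 // latency standard deviation in milliseconds
	FailRate float64 // fraction of failed checks, 0..1
}

// Illustration values only, not those used by shadowsocks-rust.
const (
	rttWeight   = 1.0
	failWeight  = 3.0
	stdevWeight = 1.0
	maxRTTMs    = 1000.0 // assumed ceiling used to normalize RTT
	maxStdDevMs = 500.0  // assumed ceiling used to normalize stdev
)

// clamp01 limits v to the range [0, 1].
func clamp01(v float64) float64 {
	if v < 0 {
		return 0
	}
	if v > 1 {
		return 1
	}
	return v
}

// score returns a value in [0, 1]; lower is better.
func score(s sample) float64 {
	total := rttWeight + failWeight + stdevWeight
	weighted := rttWeight*clamp01(s.AvgRTTMs/maxRTTMs) +
		failWeight*clamp01(s.FailRate) +
		stdevWeight*clamp01(s.StdDevMs/maxStdDevMs)
	return weighted / total
}

func main() {
	stable := sample{AvgRTTMs: 200, StdDevMs: 20, FailRate: 0}
	flaky := sample{AvgRTTMs: 120, StdDevMs: 150, FailRate: 0.3}
	fmt.Println(score(stable) < score(flaky))
}
```

Keeping separate `sample` values for TCP and UDP would allow a different server to win for each, which is the behavior described above.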

@qjebbs
Contributor Author

qjebbs commented Jan 7, 2021

Can we have the interval in seconds? Currently it can be confusing since the timeout field right below it is in seconds.

An example of a load balancer that does the job well is https://github.com/shadowsocks/shadowsocks-rust/blob/master/crates/shadowsocks-service/src/local/loadbalancing/server_stat.rs. It polls TCP and UDP separately every 6 seconds and calculates the score based on different weights applied to RTT, failure rate, and latency stdev from the last 10 minutes.

Thanks for the information.

A single check can take 5 or 10 seconds, or even longer, so an interval of only a few seconds is meaningless. I chose minutes as the unit on purpose.

I wrote this code without reading any other implementations (and I cannot read Rust). There can of course be better implementations. I hope someone else can continue the work, since it works well enough for me now:

  • Improve the health check logic (with RTT_WEIGHT, FAIL_WEIGHT, and more?)
  • Write more and better strategies

@qjebbs
Contributor Author

qjebbs commented Jan 7, 2021

@database64128 I wonder why they measure only by polling TCP and UDP, rather than making a real request through the outbound handler. (Did I misunderstand you?)

Suppose the outbound is set up as a chain, like frontend->IPLC->outlet. Measuring at the frontend is unreliable, since the bottleneck can be the IPLC or the outlet, or a later node may even have failed.

PS: I once considered measuring nodes with speed tests, which could be more reliable than ping, but that is much heavier. Many clients, such as Shadowrocket, do something similar (HTTP 204 ping).
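
For readers unfamiliar with the HTTP 204 ping mentioned here, a minimal sketch follows. It times one GET with the standard `net/http` client and accepts only a 204 response. In the PR the request goes through the outbound handler; this example uses a plain client against a local test server instead, and `ping204` and `pingLocal` are hypothetical names.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// ping204 issues one GET and returns the elapsed time if the server answers 204.
func ping204(client *http.Client, url string) (time.Duration, error) {
	start := time.Now()
	resp, err := client.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusNoContent {
		return 0, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return time.Since(start), nil
}

// pingLocal measures against a local stand-in for a generate_204 endpoint.
func pingLocal() (time.Duration, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusNoContent)
	}))
	defer srv.Close()
	// The 5s timeout matches the `timeout: 5` value in the config example.
	client := &http.Client{Timeout: 5 * time.Second}
	return ping204(client, srv.URL)
}

func main() {
	rtt, err := pingLocal()
	fmt.Println(rtt, err)
}
```

Because the whole request travels the same path as real traffic, a measurement like this reflects every hop of a chained outbound, not just the first one.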

@database64128
Contributor

@qjebbs shadowsocks-rust sends real requests for latency measurements. My point was that, in addition to the TCP latency test with HTTP 204 pings as implemented in your PR, ss-rust also implements a UDP test by sending DNS requests. TCP and UDP scores are calculated separately, so different servers are likely to be selected for TCP and UDP. This helps a lot in situations where certain servers perform well with TCP traffic but poorly with UDP traffic.

@qjebbs
Contributor Author

qjebbs commented Jan 11, 2021

Could this PR introduce a traffic characteristic that packet inspection can pick up on the client side (many requests with extremely small payloads at each interval)? I have chosen to keep it for personal use.

@qjebbs qjebbs closed this Jan 11, 2021
@database64128
Contributor

I'd love to see this PR merged. The feature is quite useful and extensible. And users can always choose to use it or not.

@database64128
Contributor

Could this PR introduce a traffic characteristic that packet inspection can pick up on the client side (many requests with extremely small payloads at each interval)?

It's probably not uncommon for application protocols to have keep-alive messages that are sent at regular intervals.

@qjebbs
Contributor Author

qjebbs commented Jan 11, 2021

@database64128 I have no expertise in this field. If the v2fly team discusses it and chooses to merge it, I would be glad to see that happen. 😁

@kslr
Contributor

kslr commented Jan 11, 2021

Work in this area is meaningful, and I support a merge.

In the future, we won't need a regular network test; we can get the information from the route logs.

@qjebbs
Contributor Author

qjebbs commented Jan 11, 2021

Work in this area is meaningful, and I support a merge.

In the future, we won't need a regular network test; we can get the information from the route logs.

Yes, it's a good plan, and we could even collect speed information for the balancing strategy. The disadvantage is that we would first have to send a request to a failed node before deciding not to use it next time.

I'm not familiar with V2Ray's underlying logic, such as Link and Stat. I hope someone will continue with this part.

@kslr
Contributor

kslr commented Jan 11, 2021

We can support multiple types of balancers and keep the timed measurements.

@qjebbs
Contributor Author

qjebbs commented Jan 11, 2021

@kslr Yes, the PR supports multiple balancers with different strategies and different strategy settings. Currently we have Random and LeastLoad, but both are based on the health checker.

@database64128 database64128 reopened this Jan 11, 2021
@LazyZhu

LazyZhu commented Jan 27, 2021

@qjebbs
Don't balancers support multiple groups? When several groups are configured, only the first one in the config file is tested.

    "balancers": [
        {
            "tag": "as-out",
            "selector": [
                "server1a",
                "server1b",
                "server1c"
            ],
            "strategy": {
                "type": "LeastLoad",
                "settings": {
                    "healthCheck": {
                        "destination": "http://detectportal.firefox.com/success.txt",
                        "timeout": 5,
                        "interval": 60,
                        "sampling": 10
                    },
                    "maxRTT": 500,
                    "expected": 3,
                    "baselines": [
                        20,
                        30,
                        50,
                        100,
                        200
                    ]
                }
            }
        },
        {
            "tag": "na-out",
            "selector": [
                "server2a",
                "server2b",
                "server2c"
            ],
            "strategy": {
                "type": "LeastLoad",
                "settings": {
                    "healthCheck": {
                        "destination": "http://detectportal.firefox.com/success.txt",
                        "timeout": 5,
                        "interval": 60,
                        "sampling": 10
                    },
                    "maxRTT": 1200,
                    "expected": 0,
                    "baselines": []
                }
            }
        },
        {
            "tag": "eu-out",
            "selector": [
                "server3a",
                "server3b",
                "server3c"
            ],
            "strategy": {
                "type": "LeastLoad",
                "settings": {
                    "healthCheck": {
                        "destination": "http://detectportal.firefox.com/success.txt",
                        "timeout": 5,
                        "interval": 60,
                        "sampling": 10
                    },
                    "maxRTT": 1500,
                    "expected": 3,
                    "baselines": [
                        50,
                        100,
                        150,
                        200,
                        300
                    ]
                }
            }
        }
    ]

@qjebbs
Contributor Author

qjebbs commented Jan 28, 2021

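@LazyZhu There was a major bug; the latest version should fix it.

@kslr I've added cost weighting, failure tolerance, manual balancer override, and a network connectivity check. Everything I wanted to do is done; what's left is either something I can't do or something I no longer want to do 😂

What's left undone is pausing the checks during a manual override: it causes a data race. It can be solved, but there is no simple, elegant way to do it.

Please pay particular attention to racing when you review; I've already fallen into several pits there.

In the end I switched back to sync.Mutex, so it should mostly be fine now.

PS: TestServiceSubscribeRoutingStats fails randomly. Is that because it hardcodes the port, and the port is sometimes unavailable?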
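
As an illustration of the sync.Mutex approach, the sketch below guards a map of recent RTT samples so that concurrent health checks and readers cannot race. The `rttStore` type and its methods are hypothetical and much simpler than the PR's code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// rttStore keeps recent RTT samples per outbound tag behind one mutex.
type rttStore struct {
	mu   sync.Mutex
	rtts map[string][]time.Duration
}

func newRTTStore() *rttStore {
	return &rttStore{rtts: make(map[string][]time.Duration)}
}

// Put appends one sample and keeps at most `sampling` of the most recent ones.
func (s *rttStore) Put(tag string, rtt time.Duration, sampling int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	list := append(s.rtts[tag], rtt)
	if len(list) > sampling {
		list = list[len(list)-sampling:]
	}
	s.rtts[tag] = list
}

// Snapshot returns a copy, so callers never share the guarded slice.
func (s *rttStore) Snapshot(tag string) []time.Duration {
	s.mu.Lock()
	defer s.mu.Unlock()
	return append([]time.Duration(nil), s.rtts[tag]...)
}

// concurrentPuts writes n samples from n goroutines and reports how many were kept.
func concurrentPuts(n, sampling int) int {
	s := newRTTStore()
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s.Put("node", time.Duration(i)*time.Millisecond, sampling)
		}(i)
	}
	wg.Wait()
	return len(s.Snapshot("node"))
}

func main() {
	fmt.Println(concurrentPuts(100, 5))
}
```

Running a program like this with `go run -race` is one way to check that the shared state is actually protected.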

testCases := []*RoutingContext{
	{InboundTag: "in", OutboundTag: "out"},
	{TargetIPs: [][]byte{{1, 2, 3, 4}}, TargetPort: 8080, OutboundTag: "out"},
	{TargetDomain: "example.com", TargetPort: 443, OutboundTag: "out"},
	{SourcePort: 9999, TargetPort: 9999, OutboundTag: "out"},
	{Network: net.Network_UDP, OutboundGroupTags: []string{"outergroup", "innergroup"}, OutboundTag: "out"},
	{Protocol: "bittorrent", OutboundTag: "blocked"},
	{User: "example@v2fly.org", OutboundTag: "out"},
	{SourceIPs: [][]byte{{127, 0, 0, 1}}, Attributes: map[string]string{"attr": "value"}, OutboundTag: "out"},
}
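
Whether a hardcoded port is the actual cause of the random failure is not confirmed in this thread. If it is, one common remedy in Go tests is to listen on port 0 and let the OS choose a free port, as in this sketch (`listenOnFreePort` is a hypothetical helper):

```go
package main

import (
	"fmt"
	"net"
)

// listenOnFreePort asks the OS for any free TCP port on loopback by binding
// to port 0, and returns the listener together with the chosen port.
func listenOnFreePort() (net.Listener, int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, 0, err
	}
	return l, l.Addr().(*net.TCPAddr).Port, nil
}

func main() {
	l, port, err := listenOnFreePort()
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer l.Close()
	fmt.Println("listening on port", port)
}
```

Handing the open listener to the server under test, instead of closing it and reusing the port number, avoids a window in which another process could grab the port.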

* switch to sync.Mutex to avoid potential races
* add more tests
* optimize code
@qjebbs
Contributor Author

qjebbs commented Jan 29, 2021

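All right, that's a wrap.

Looking back, the core strategy is only about 300 lines of code (I thought it would be simple: check, then pick), but the supporting code runs to 3,000 lines, and I'm sick of it.
Everything is now in place: ping, sampling, statistics, weighting, selection, statistics output, and manual control. A new strategy can most likely be written in a few hundred lines, including but not limited to:

  • Evaluating UDP and TCP together by weight
  • Evaluating RTT and failure rate together by weight
  • IP hash to keep the exit from changing frequently
  • ...

I'll leave that to whoever comes along.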
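
The IP hash idea from the list above could look roughly like the following. This is a sketch of the general technique, not anything from the PR: hashing the source IP (rather than, say, the destination) and the `pickByIPHash` name are assumptions.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickByIPHash maps an IP string to one of the candidate outbound tags.
// The same IP always yields the same tag while the candidate list is unchanged.
func pickByIPHash(sourceIP string, tags []string) string {
	if len(tags) == 0 {
		return ""
	}
	h := fnv.New32a()
	h.Write([]byte(sourceIP)) // Write on a hash.Hash never returns an error
	return tags[h.Sum32()%uint32(len(tags))]
}

func main() {
	tags := []string{"all.balancer.node1", "all.balancer.node2", "all.balancer.node3"}
	fmt.Println(pickByIPHash("192.168.1.10", tags))
	fmt.Println(pickByIPHash("192.168.1.10", tags)) // same tag as the line above
}
```

A plain modulo reshuffles most clients whenever the candidate list changes; consistent hashing would reduce that, at the cost of more code.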

@kslr
Contributor

kslr commented Jan 29, 2021

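Great! All that's left is writing the documentation.

P.S. Quite possibly. But the author of that feature has already abandoned it. If nobody takes it over in the near future, I may remove it and implement it in a simpler way (that is, find a library and swap it in).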

@kslr kslr merged commit 2c5a714 into v2fly:v5 Jan 30, 2021
@kslr
Contributor

kslr commented Jan 30, 2021

🎆

@ghost

ghost commented Feb 26, 2021

@kslr @qjebbs

Is this balancer battery-friendly?
If I set the interval to 2147483648, will it still perform a health check when the failure rate of the current outbound is high?

@iusearch
Contributor

This was merged into the v5 branch, but the current v5 release only includes the LeastLoad addition; none of those api commands are there. What happened?
