Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add degradeCutHalfOpen and fix exceptionCount bug #525

Closed
wants to merge 33 commits into from
Closed

add degradeCutHalfOpen and fix exceptionCount bug #525

wants to merge 33 commits into from

Conversation

CalvinKirs
Copy link

@CalvinKirs CalvinKirs commented Feb 27, 2019

. #154

Describe what this PR does / why we need it

A new half-open state before the fuse is closed will pass a little flow, and if it is normal, it will be fully open. Avoid service without recovering large amounts of traffic
fix bug to DEGRADE_GRADE_EXCEPTION_COUNT。
降级这里由之前的两种状态变为三种状态,降级关闭,降级打开,降级半开,当降级关闭后后进入一个半开状态,这里会放五个请求过去,如果这五个请求都正常,那么则会由降级半开状态变为降级关闭,有任何一个请求失败,直接进入降级全开状态(我认为业务方应该捕获正常的业务异常,如果5个请求中有一个失败那一定是系统级别的异常,所以此处进入降级是没有问题的)。另外之前异常数与异常比率这里在处于同一个时间窗口时再次判断是会出现问题的,即无论系统是否正常也一定会再次降级。这里异常数目会做一个判断(异常比例没必要,因为我放进去的五个请求成功的话会使得比例倾斜),如果处于同一个时间窗口,会进行<=操作进行判断,如果不是,则维持原有操作(<)。说明:如果处于同一个时间窗口的话没必要进行降级半开状态,处于不同时将窗口的话如果系统未恢复正常,那么就会有一个很尴尬的情况,他会等待他的异常数或者异常数目进入他的一个阈值才会进行降级。比如我的异常数目是1000,那么这里的话熔断关闭后,这个时候可能系统还没有恢复正常,那么会直接放进来1000个请求。

Merge state

Does this pull request fix one issue?

Describe how you did it

Describe how to verify it

ut

Special notes for reviews

@codecov-io
Copy link

codecov-io commented Feb 27, 2019

Codecov Report

Merging #525 into master will increase coverage by 0.1%.
The diff coverage is 37.31%.

Impacted file tree graph

@@             Coverage Diff             @@
##             master     #525     +/-   ##
===========================================
+ Coverage     38.02%   38.12%   +0.1%     
- Complexity     1114     1123      +9     
===========================================
  Files           259      259             
  Lines          8171     8306    +135     
  Branches       1113     1130     +17     
===========================================
+ Hits           3107     3167     +60     
- Misses         4661     4726     +65     
- Partials        403      413     +10
Impacted Files Coverage Δ Complexity Δ
...sp/sentinel/slots/statistic/data/MetricBucket.java 77.27% <0%> (-22.73%) 19 <0> (ø)
...a/csp/sentinel/slots/statistic/base/LongAdder.java 31.91% <0%> (ø) 10 <0> (ø) ⬇️
...a/com/alibaba/csp/sentinel/node/StatisticNode.java 66.27% <30.76%> (-15.69%) 28 <0> (ø)
.../csp/sentinel/slots/block/degrade/DegradeRule.java 52.06% <46.55%> (-3.49%) 20 <12> (+3)
...p/sentinel/slots/statistic/metric/ArrayMetric.java 68.93% <5.26%> (-10.01%) 29 <1> (+1)
...k/flow/controller/WarmUpRateLimiterController.java 67.64% <70%> (ø) 5 <0> (ø) ⬇️
...s/block/flow/controller/RateLimiterController.java 79.31% <70%> (ø) 7 <0> (ø) ⬇️
...a/csp/sentinel/slots/statistic/base/LeapArray.java 65.82% <0%> (-3.8%) 24% <0%> (-1%)
.../com/alibaba/csp/sentinel/eagleeye/StatLogger.java 66.66% <0%> (-2.57%) 11% <0%> (-1%)
...ibaba/csp/sentinel/eagleeye/StatLogController.java 65.93% <0%> (-2.2%) 5% <0%> (-1%)
... and 10 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5d3ccb0...a7dd38c. Read the comment docs.

@sczyh30 sczyh30 added the to-review To review label Feb 27, 2019
@CalvinKirs CalvinKirs changed the title add degradeCutHalfOpen add degradeCutHalfOpen and fix exceptionCount bug Mar 4, 2019
@sczyh30
Copy link
Member

sczyh30 commented Mar 4, 2019

Hi, thanks for contributing. Please illustrate your design and how-to-do in the PR description.

@CalvinKirs
Copy link
Author

CalvinKirs commented Mar 4, 2019

嗨,谢谢你的贡献。请说明您的设计和PR描述中的操作方法。

降级这里由之前的两种状态变为三种状态,降级关闭,降级打开,降级半开,当降级关闭后后进入一个半开状态,这里会放五个请求过去,如果这五个请求都正常,那么则会由降级半开状态变为降级关闭,有任何一个请求失败,直接进入降级全开状态(我认为业务方应该捕获正常的业务异常,如果5个请求中有一个失败那一定是系统级别的异常,所以此处进入降级是没有问题的)。另外之前异常数与异常比率这里在处于同一个时间窗口时再次判断是会出现问题的,即无论系统是否正常也一定会再次降级。这里异常数目会做一个判断(异常比例没必要,因为我放进去的五个请求成功的话会使得比例倾斜),如果处于同一个时间窗口,会进行<=操作进行判断,如果不是,则维持原有操作(<)。说明:如果处于同一个时间窗口的话没必要进行降级半开状态,处于不同时将窗口的话如果系统未恢复正常,那么就会有一个很尴尬的情况,他会等待他的异常数或者异常数目进入他的一个阈值才会进行降级。比如我的异常数目是1000,那么这里的话熔断关闭后,这个时候可能系统还没有恢复正常,那么会直接放进来1000个请求。

@sczyh30
Copy link
Member

sczyh30 commented Mar 4, 2019

对于 HALF-OPEN 而言,根据模式的不同,对应的策略也有不同:

  • 对于基于异常判断(异常比率/异常数)的模式,HALF-OPEN 的一般策略是,只尝试一个请求,如果该请求正常则置为 CLOSE 状态,否则重新打回 OPEN 状态;
  • 对于基于 RT 判断的模式,HALF-OPEN 状态同样只尝试一个请求,如果该请求 RT 正常则置为 CLOSE 状态,否则重新打回 OPEN 状态;这里需要考虑以下几点:(1) 若尝试的这个请求失败了,则是不会计算 RT 的(因此 RT 不会超),但请求确实是失败的,这时候要不要打回 OPEN 状态(2)HALF-OPEN 状态判断时不能用 average 值来作判断,而是要用最近一次请求的 RT。

另外考虑兼容性的问题,降级规则可能需要加个 enableHalfOpen 字段来控制是否加入半开启模式。

References:


For HALF-OPEN state, the corresponding action varies according to the circuit breaking strategy:

  • For exception-based mode (ratio or count), the general strategy of HALF-OPEN is to try only one request, and if the request is normal, it is set to CLOSE state, otherwise it will return to OPEN state;
  • For RT-based mode, the HALF-OPEN state should also attempt only one request. If RT of the request is normal, it is set to the CLOSE state, otherwise it is back to the OPEN state; here are the following points to consider: (1) If the request fails, the RT will not be calculated (so the RT will not be exceeded), but the request indeed failed. We should consider whether to set back the OPEN state in such scenario. (2) The HALF-OPEN state checking cannot be judged by the average RT. Instead, use the exact RT of the most recent request.

Also considering the compatibility issue, the circuit breaking rule may need an additional enableHalfOpen field to control whether to enable circuit-breaking mode.

@sczyh30
Copy link
Member

sczyh30 commented Mar 4, 2019

另外之前异常数与异常比率这里在处于同一个时间窗口时再次判断是会出现问题的,即无论系统是否正常也一定会再次降级。这里异常数目会做一个判断(异常比例没必要,因为我放进去的五个请求成功的话会使得比例倾斜),如果处于同一个时间窗口,会进行<=操作进行判断,如果不是,则维持原有操作(<)。

这里仍然会存在 滑动窗口长度 > 降级时间 的问题。比如分钟级异常数模式,分钟级滑动窗口的格子数为 60(每个格子时间长度为 1s),则降级时间比较小且异常分布在最近的格子比较多时,仍有可能出现熔断结束后异常数 < 阈值时就会被降级的问题。在这种统计模式下,一般需要设置降级时间 > 统计窗口长度。

# Conflicts:
#	sentinel-core/src/main/java/com/alibaba/csp/sentinel/node/StatisticNode.java
#	sentinel-core/src/main/java/com/alibaba/csp/sentinel/slots/statistic/data/MetricBucket.java
#	sentinel-core/src/main/java/com/alibaba/csp/sentinel/slots/statistic/metric/Metric.java
@CalvinKirs
Copy link
Author

CalvinKirs commented Mar 5, 2019

另外之前异常数与异常比率这里在处于同一个时间窗口时再次判断是会出现问题的,即无论系统是否正常也一定会再次降级。这里异常数目会做一个判断(异常比例没必要,因为我放进去的五个请求成功的话会使得比例倾斜),如果处于同一个时间窗口,会进行<=操作进行判断,如果不是,则维持原有操作(<)。

这里仍然会存在滑动窗口长度>降级时间的问题。比如分钟级异常数模式,分钟级滑动窗口的格子数为60(每个格子时间长度为1s),则降级时间比较小且异常分布在最近的格子比较多时,仍有可能出现熔断结束后异常数<在此种统计模式下,一般需要设置降级时间>统计窗口长度。

目前是采取这样的做法的,基于rt以及异常比例(s,ms级别的),会把当前窗口的进行清0。因为有一点需要考虑,rt这里的话准降级状态个人认为是必须的,成功状态下肯定没问题,五个请求都OK然后系统继续正常(即未降级),但是如果失败了的话这里不进行清除操作的话有可能存在这样一个问题,系统本来就是处于不正常情况,这五个请求耗时很严重。(如果未做dubbo超时限制),那么熔断后这五个请求会将整体的平均值拉的超级高而高出平均值。(极端情况)。异常比例这里的话这里因为时间短所以也是可以接受的,而且只会清除当前秒级别的异常比例。
异常数目这里是会在降级前进行-1操作。(会判断是否是同一个时间窗口),如果不是同一个时间窗口,那么就会不做处理。

@CalvinKirs
Copy link
Author

CalvinKirs commented Mar 5, 2019

降级半开的开关也加上了,默认是关闭的。业务方可以自行选择是否开启

/**
* minus exception
*/
void minusException();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里的设计可能还要再考量一下。Sentinel 底层的统计数据结构是限流、降级等等共用的,并且监控日志也会从实时统计数据生成,因此需要保证正确性。直接进行 decrease 和 rest 操作会使得实时统计数据不准确。

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

您好,目前来看的话限流这里是用总的pass数来进行计算的。(这样是最合理的而且目前也确实是这样做的)。
基于rt这里也已经完善,是拿最近一次请求的rt与count进行比较的。另外这里的话清除的话只会清除当前时间窗口的一个异常rt总和,最小为1个rt(经过多个时间窗口,每次刚进入时间窗口进进入准降级状态),最大为5个rt(持续一个窗口的一个准降级-降级的rt总和)。
异常数目这里还是原有的逻辑,降级关闭之后放进一个请求,依赖于这个请求的结果来进行是否开关。降级打开之前会对分钟秒级别的纬度减去各自所属的时间窗口的值。如果是新的时间窗口则不做操作。
我这里觉得降级关闭之后应该是一个新的开始,所以-1个异常数这里是可以接受的。不这么做的话异常数目这里位于同一个时间窗口降级关闭之后始终是要再次打开的。因为count已经达到了阈值,必须去减

@CalvinKirs CalvinKirs closed this Mar 7, 2019
@sczyh30 sczyh30 removed the to-review To review label Mar 7, 2019
CST11021 pushed a commit to CST11021/Sentinel that referenced this pull request Nov 3, 2021
CST11021 pushed a commit to CST11021/Sentinel that referenced this pull request Nov 3, 2021
CST11021 pushed a commit to CST11021/Sentinel that referenced this pull request Nov 3, 2021
[ISSUE alibaba#525] Support the message track. Merge to branch msg_track
CST11021 pushed a commit to CST11021/Sentinel that referenced this pull request Nov 3, 2021
…ave supported trace_topic name value Configurable by users.
CST11021 pushed a commit to CST11021/Sentinel that referenced this pull request Nov 3, 2021
CST11021 pushed a commit to CST11021/Sentinel that referenced this pull request Nov 3, 2021
[ISSUE alibaba#525] Support the message track,add the function which supports trace topic name value configurable by users.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants