Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix tcp connections leaks (at least most of them) #3001

Closed
wants to merge 6 commits into from

Conversation

deorth-kku
Copy link
Contributor

I've been noticing some memory/goroutine leaks for quiet some time. Lately I found out that tcp connections was also leaking, I think this might be the reason.
Here is what i find:

  1. tls.Conn.Close() can block indefinitely. See crypto/tls: set timeout so that Close does not block indefinitely golang/go#31224. I copied the idea from golang.org/x/net/http2, forcefully close tcp connection if tls.Conn.Close() does not return after a timeout. However, instead of a hardcoded 250ms timeout, I make it configurable in tlsSettings/realitySettings. For now, the default value is 0, which means it will be no timeout.
  2. We should use tls.Conn.HandshakeContext(ctx) instead of tls.Conn.Handshake(). tls.Conn.Handshake() can also block indefinitely if the server does not properly respond.
  3. I also find somewhere use a websocket.Dialer.Dial(), and replace it with websocket.Dialer.DialContext(). I saw websocket use a large chunk of my memory in pprof and I think this might be the problem.
    image

@Fangliding
Copy link
Member

Fangliding commented Feb 6, 2024

This is consistent with what I have observed
When clients establishe a large number of TLS connections in an unreliable network, it can lead to serious memory problem

@yuhan6665
Copy link
Member

感谢大佬!您应该是解决了 ray 长期以来的一个内存泄漏问题
请问您觉得是否可以不加新配置项 而是使用 PolicyObject 里面的 handshake 或者 connIdle ?

@deorth-kku
Copy link
Contributor Author

感谢大佬!您应该是解决了 ray 长期以来的一个内存泄漏问题 请问您觉得是否可以不加新配置项 而是使用 PolicyObject 里面的 handshake 或者 connIdle ?

我感觉不行。之所以我用的float32是因为考虑到x/net/http2使用的值是250ms,使用一个单位秒的配置项的话显然我们会需要小数。而这里浮点型损失的精度不那么重要。考虑到这一点的话,handshake的默认值4秒和connIdle的默认值300秒都显得太大。另外这两个值原来都是针对tcp层的,如果同时还会影响到tls层行为的话可能不是那么合适。所以我觉得还是新增一个配置项比较好。

另外,我又看了一些配置的代码,发现还有个infra/conf/cfgcommon/duration.Duration类型,可能用在这里做CloseTimeout更合适。但是目前只有Observatory的ProbeInterval在使用。可能是因为balencer的leastPing strategy目前是坏的,Observatory目前不在文档里面,而其它需要转换为Duration类型的配置项,目前使用的都是uint32的秒数。我看了一下您在 #2999 的改动,似乎是在infra/conf下新建了一个一模一样的Duration,但又还没移除cfgcommon/duration包。如果我现在改的话,只能使用cfgcommon/duration的Duration。我是不是应该等到 #2999 合并了之后再改为使用infra/conf.Duration?

@yuhan6665
Copy link
Member

yuhan6665 commented Feb 7, 2024

如果几秒太长的话 我觉得目前 hard code 250ms 更好一点(用户无关选项)
what do you think @RPRX ?

@hossinasaadi
Copy link
Contributor

hossinasaadi commented Feb 17, 2024

@yuhan6665 as far i tested on iOS and macOS devices, with same config and same actions.
We have good memory optimizes (~ 2-7 MB based on devices) on this fork.

@yuhan6665
Copy link
Member

我看了一下 先擅自把中间加选项的 commit 去掉合并了 a0f0304 再次感谢!

@yuhan6665 yuhan6665 closed this Feb 18, 2024
@deorth-kku deorth-kku deleted the fix-leak branch February 18, 2024 23:19
@RPRX
Copy link
Member

RPRX commented Feb 19, 2024

@yuhan6665 REALITY 那部分需要恢复原样,调用方的 close 或 cancel ctx 如果能传导到 REALITY,Spider 功能就会失效

@RPRX
Copy link
Member

RPRX commented Feb 19, 2024

解释一下,如果 REALITY 客户端的请求被重定向了,Spider 的逻辑是先返回(时间可以自定义)再装作与目标网站交互
然而一旦返回,调用方可能就会开始 cancel ctx 了,若传导到 REALITY 的 connection 就会造成连接被提前关闭

其它部分不涉及这种事情,不过之前比如 H2 和 gRPC 那里也是特意做了隔离,防止第一个子连接把整个连接给关了
总之其它部分我没看,等小白鼠们来测试,如果引入了新问题就再修,感谢 @hossinasaadi 的贡献!

@kyze8439690
Copy link

@deorth-kku 方便透露一下这个内存占用分析图是手动生成的吗?想学习一下

@yuhan6665
Copy link
Member

pprof

@Fangliding
Copy link
Member

pprof

@kyze8439690

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants