Too many open files (regression?) #6237
Could you cat …? Note: you may also want to consider sticking an nginx reverse proxy out in front. That should reduce the number of inbound connections.
Sorry for my words. There is a reverse proxy and a Layer 7 load balancer in front of that for HTTP; I would never leave something unstable like that exposed directly. The highwater and lowwater were previously increased (with Docker and 0.4.19); since I moved to 0.4.20 I have used the default values, only enabling the writable gateway and changing the GC and … settings. Both of these nodes are slow or not responsive (but come back after a while). Another node needed a forced restart as IPFS made it unresponsive, so I was not able to get the data from it.
eu-stockholm1 definitely has too many peers (although some of those are obviously using non-fd-consuming transports).
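For anyone trying to correlate the peer count with actual descriptor usage, a quick check along these lines (a sketch assuming a single `ipfs` process and `lsof` available on the host) shows how many peers and FDs the daemon is holding and what limit it is actually running under:

```sh
# Number of currently connected peers (each TCP/WS peer holds at least one socket FD)
ipfs swarm peers | wc -l

# Total file descriptors currently held by the running daemon
lsof -nP -p "$(pgrep -x ipfs)" | wc -l

# The open-files limit the daemon process is actually running under
grep 'open files' /proc/"$(pgrep -x ipfs)"/limits
```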
@Stebalien I'm also running into this issue as of 0.4.20. We had roughly 800 peers connected to the node when I encountered the issue, with the following limits:
It's probably worth noting that I have never encountered this problem before 0.4.20 (although this could be entirely coincidental, as quite a few bug reports for this already exist against earlier versions). However, it does make me wonder if the updates you mentioned in the relay section of the 0.4.20 release notes had anything to do with this:
The next time someone runs into this, please run …
Still running into this daily, so I can help as much as I can :) Et voilà: https://siderus.io/ipfs/QmeHWnYK5VpQvfFWV4wPgujB3PwAnNS5TR8iUWfDkghqnd (Those are other affected nodes; if you want the same data as before, let me know.)
I'm also observing these errors, particularly when accepting new connections. I think this causes the affected listeners to shut down. https://ipfs.io/ipfs/QmX4A8DzvxeR1uC7xoTCTQPbynsHAD64gdsSCm1VPc8brC I am going to try running with a much larger FD limit and see if it hits it too.
@koalalorenzo what are your connection manager limits? Both of you, does downgrading help?
My connection manager config is the default one (never touched it since I generated the config file).
I know that @obo20 downgraded to fix the issue. I will downgrade a percentage of the machines so that I can help debug this, if needed. Let me know what I can do to help!
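For reference, the limits being asked about live under `Swarm.ConnMgr` in the config. A minimal sketch of inspecting and lowering them with the `ipfs config` CLI (the 300/600 watermarks below are illustrative values, not a recommendation):

```sh
# Show the current connection manager settings
ipfs config Swarm.ConnMgr

# Lower the watermarks so the daemon trims down to fewer open connections
ipfs config --json Swarm.ConnMgr.LowWater 300
ipfs config --json Swarm.ConnMgr.HighWater 600
ipfs config Swarm.ConnMgr.GracePeriod 20s

# Restart the daemon afterwards for the new limits to take effect
```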
The FD limit was 4096 before; I set a new limit of 16k. Right now, after 24h or so, we have:
I will report back on whether the figure increases over the next few hours. The ConnManager is at:
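A small monitoring loop like the following (a sketch assuming `lsof` and a single `ipfs` process) is enough to record whether the figure keeps climbing over the next hours:

```sh
# Append a timestamped FD count for the ipfs daemon every 5 minutes
while true; do
  printf '%s %s\n' "$(date -u +%FT%TZ)" \
    "$(lsof -nP -p "$(pgrep -x ipfs)" | wc -l)" >> ipfs-fd-count.log
  sleep 300
done
```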
@hsanjuan Can you provide the specs of the machines that you're using? I'd also be curious to know how many "root" objects you have on those machines.
@obo20 12 cores, 64GB RAM, root objects ranging from 3000 to 6000.
I'd like to know this as well. While resources are always important, I'm curious how much of a role they play here. From inspecting my nodes after they hit this issue, they haven't seemed maxed out from a resource standpoint at all.
I have not tried with "normal" machines (or not for long enough). I can say that the file descriptors are stable at around 5000-6000 after a couple of days on the cluster. About 60% of them are badger files and the rest are swarm connections.
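A rough way to reproduce that badger-vs-connections breakdown (a sketch assuming the repo uses the Badger datastore, whose data files end in `.sst`/`.vlog` and live under a `badgerds` directory):

```sh
# Classify the daemon's open FDs: Badger datastore files vs. network sockets vs. everything else
PID="$(pgrep -x ipfs)"
lsof -nP -p "$PID" | awk '
  /badgerds|\.sst|\.vlog/ { badger++;  next }
  /TCP|UDP/               { sockets++; next }
                          { other++ }
  END { printf "badger: %d  sockets: %d  other: %d\n", badger, sockets, other }'
```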
@hsanjuan Thx! Sadly I can't ask Orion users to use 64GB of RAM. I will wait for a fix before upgrading Orion to the new version. On the siderus.io gateway I have found a workaround, but the problem persists in the logs. Any ETA on the fix? I will build master and see how it performs.
@koalalorenzo The 64GB of RAM is because those machines handle 2TB+ of repository size, but aside from some spikes half of the RAM is unused. The workaround here is to either increase the FD limit from the default (see the post above), reduce the ConnManager limits so there are fewer connections, or play with the badger configuration so it opens fewer files (I haven't tried this, but I think the ipfs config allows tuning the badger options). I'm not sure at this point whether the increased FD usage is related to just BadgerDS, just connections, or both (maybe someone can verify against 0.4.19).
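On the "increase the FD limit" side, a sketch of the two usual knobs; as far as I know go-ipfs also honours an `IPFS_FD_MAX` environment variable, but treat that as an assumption and check the docs for your version:

```sh
# Raise the soft open-files limit for the shell that launches the daemon
ulimit -n 16384

# go-ipfs can also try to raise its own limit to this value at startup (assumption: see docs)
export IPFS_FD_MAX=16384

ipfs daemon
```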
Fix still in progress, unfortunately. See the discussion on libp2p/go-libp2p-circuit#76. |
@dokterbob could you try building #6361? I've been running this for a while on one of our bootstrappers and it appears to be pretty stable.
May fix ipfs/kubo#6237. Basically:
1. We hang while closing a stream (because `Close` waits).
2. This blocks the connection manager because it assumes that close _doesn't_ wait.
This may also fix a stream leak.
* Write coalescing in yamux and mplex.
* Correctly tag relay streams.
* Reset relay streams instead of closing them. Fixes #6237.
* Trim connections in a background task instead of on-connect.
Reopening; waiting on confirmation. The latest master should now have all the required fixes.
Will build & check it tomorrow! Thx for the update.
@koalalorenzo and @obo20, please use #6368 and not master till I get some reviews on the libp2p patches and merge them into master. TL;DR: There was a bug in libp2p since time immemorial. We fixed it this release cycle (#6368) but then introduced a feature that triggers this bug. The branch I linked to above removes this feature.
The RC3 (#6237) contains the fix. Any updates? |
3 hours of running on 30% of the fleet. Still monitoring, nothing happened so far. I will wait for 24h to be sure :) |
@Stebalien Running on a few of our nodes right now. Things seem to be working well so far. |
Awesome! Let me know if anything changes. We'll probably ship a final release Monday. |
I've been running it for the past day, as have some friends who set it up on their nodes. I average around 850-900 peers, and they average about the same. RAM usage is also much lower (roughly 5x less).
Ok. Considering this closed, let the release train roll... |
Also, a big thank you to everyone in this thread for helping debug this! |
Looking good here too! Thank you @Stebalien |
@dokterbob please upgrade to the RC3, not the RC1. The RC1 doesn't include the fix for this bug. |
Version information:
Type: Bug(?)
maybe bug, maybe misconfiguration after upgrading
Description:
I have been hitting this issue on a set of different machines of different types, with different networks and different configurations. All the nodes provide a public writable gateway (https://siderus.io/). All of them reported this:
I found this issue after upgrading. The upgrade was executed automatically (no human involvement) and all of the machines are similar, except that network, CPU and memory differ based on the average traffic. Since this version we stopped using Docker and started using our own Debian packages, but previous tests were successful (12 hours of a/b testing). The issue appeared recently and it seems that it might be caused by a set of requests from users combined with a misconfiguration (`ulimit`?).
Maybe related: #4102, #3792 (?)
Is this a bug or should I change the limits? How can I ensure that IPFS respects that limit?
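Since these daemons are started from Debian packages rather than Docker, one way to make sure the process really runs with a higher limit is a systemd drop-in; the `ipfs.service` unit name below is an assumption, so adjust it to whatever the package installs:

```sh
# Override the FD limit for the ipfs unit and restart it (unit name is assumed)
sudo mkdir -p /etc/systemd/system/ipfs.service.d
sudo tee /etc/systemd/system/ipfs.service.d/fd-limit.conf <<'EOF'
[Service]
LimitNOFILE=16384
EOF
sudo systemctl daemon-reload
sudo systemctl restart ipfs

# Verify the effective limit of the running process
grep 'open files' /proc/"$(pgrep -x ipfs)"/limits
```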