[Bug]: Server hangs as php-fpm reach pm.max_children limit #39063

Closed
5 of 8 tasks
ThibautPlg opened this issue Jun 29, 2023 · 18 comments
Labels: 0. Needs triage, 26-feedback, bug, performance 🚀

Comments

@ThibautPlg
Contributor

Bug description

Hello,
I administer multiple Nextcloud 25 instances, and I'm slowly upgrading them to Nextcloud 26. However, after some days (sometimes only some hours), each and every instance upgraded to Nextcloud 26 crashes because the php-fpm pm.max_children limit is reached.
I then need to restart php-fpm and everything goes normal until the next crash.

Additional context:

  • The servers are only hosting one Nextcloud server, with php 8.1
  • php-fpm max_children and other configuration values have been customized to match the available RAM of the host (between 32 and 64 children allowed)
  • The php-fpm max children have never been a problem as far as I recall (some instances are older than NC20). The problem also occurs on newly installed testing instances.
  • The ram consumption is average, we're not maxed
  • I've tried with php8.2, same results

systemctl status php-fpm.service output when server is down:

● php-fpm.service - The PHP FastCGI Process Manager
   Loaded: loaded (/usr/lib/systemd/system/php-fpm.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2023-06-28 14:30:57 CEST; 24h ago
  Process: 3345857 ExecReload=/bin/kill -USR2 $MAINPID (code=exited, status=0/SUCCESS)
 Main PID: 462753 (php-fpm)
   Status: "Processes active: 64, idle: 2, Requests: 127550, slow: 0, Traffic: 0req/sec"
    Tasks: 67 (limit: 47993)
   Memory: 1.7G
   CGroup: /system.slice/php-fpm.service
           ├─ 462753 php-fpm: master process (/etc/php-fpm.conf)
           ├─ 613484 php-fpm: pool nextcloud
           ├─ 613485 php-fpm: pool nextcloud
           ├─ 613486 php-fpm: pool nextcloud
           ├─ 613487 php-fpm: pool nextcloud
           ├─ 613488 php-fpm: pool nextcloud
           ├─ 613489 php-fpm: pool nextcloud
(the list goes on)

I feel like some processes are idle (hung?) and are never stopped.
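The Status line in the output above can be checked mechanically. The following is a minimal sketch, not part of the original report: it assumes the systemd Status text keeps the "Processes active: N" wording shown above, and both function names are hypothetical helpers.

```shell
# Extract the "Processes active" count from a php-fpm systemd Status line
# read on stdin. Prints nothing if the wording is not found.
fpm_active_from_status() {
    sed -n 's/.*Processes active: \([0-9][0-9]*\).*/\1/p'
}

# Succeeds when the active worker count has reached the configured limit.
# $1 = active count, $2 = pm.max_children
fpm_pool_saturated() {
    [ "$1" -ge "$2" ]
}
```

A possible use is `systemctl status php-fpm.service | fpm_active_from_status`, compared against the pool's pm.max_children. With the output above (64 active) and a limit of 64, `fpm_pool_saturated 64 64` succeeds.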

Am I the only one facing this issue? Why does it occur only with NC26? What has changed regarding php processes?

Best regards,

Steps to reproduce

  1. Upgrade to NC26 or install a fresh server
  2. Wait a little bit
  3. Observe php-fpm being overwhelmed

Expected behavior

Same as before, Nextcloud (php-fpm?) should remove children.

Installation method

None

Nextcloud Server version

26

Operating system

RHEL/CentOS

PHP engine version

PHP 8.1

Web server

Nginx

Database engine version

None

Is this bug present after an update or on a fresh install?

None

Are you using the Nextcloud Server Encryption module?

Encryption is Disabled

What user-backends are you using?

  • Default user-backend (database)
  • LDAP/ Active Directory
  • SSO - SAML
  • Other

Configuration report

No response

List of activated Apps

No response

Nextcloud Signing status

No response

Nextcloud Logs

No response

Additional info

No response

@ThibautPlg added the "0. Needs triage" and "bug" labels Jun 29, 2023
@joshtrichards
Member

Hi @ThibautPlg:

There are a lot of possibilities, but you didn't provide a complete Issue form. :-)

The most important items to check off the top of my head that might provide clues:

  • your php-fpm logs
  • your nextcloud.log
  • your php-fpm status page (particularly in full mode)
  • your php-fpm pool configuration
  • your nginx configuration (also worth comparing against the one in the NC manual since it's periodically updated in-between NC versions)
  • which NC apps are active
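For the status page item in the list above, here is a minimal sketch of how the full-mode output could be filtered for long-running requests. It assumes the pool sets pm.status_path and that the plain-text output uses the field names pid, state, request duration (in microseconds) and request URI. The function name fpm_long_requests and the /status path in the usage note are hypothetical.

```shell
# Read php-fpm "status?full" plain-text output on stdin and print
# "pid seconds uri" for every non-idle worker whose current request
# has been running longer than $1 seconds.
fpm_long_requests() {
    awk -v limit="$1" '
        /^pid:/              { pid = $2 }
        /^state:/            { idle = ($2 == "Idle") }
        # request duration is reported in microseconds
        /^request duration:/ { dur = $3 / 1000000 }
        /^request URI:/      { if (!idle && dur > limit) printf "%s %d %s\n", pid, dur, $3 }
    '
}
```

Assuming the status page is reachable locally, something like `curl -s 'http://127.0.0.1/status?full' | fpm_long_requests 300` would list workers busy for more than five minutes.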

Are all of these instances essentially similarly configured/built? While anything is possible, chances are this is some sort of local environment interaction.

I would suggest posting this in the Nextcloud Help Forum first. NC26 has been out awhile and I haven't seen rampant reports of new php-fpm related issues personally.

@bjo81

bjo81 commented Aug 14, 2023

@ThibautPlg Does this issue disappear when you disable previews? We have an instance where php-fpm is stuck with requests like GET /core/preview?fileId=1151626&c=25fa3ba9d97519e2dd8f7ef2595bdf02&x=500&y=500&forceIcon=0&a=1 HTTP/2.0. Even the preview generator app gets stuck when generating a preview of a PDF like "Nextcloud Flyer.pdf", so it seems the whole preview generation is stuck. All php-fpm processes hang at semop(1, [{0, -1, SEM_UNDO}],
The issue first appeared with 26.0.0, so it could be related to the new preview generation code. With < 26.0.0, 15 php-fpm workers were fine, but now even 270 are not enough, as they always hang and never get killed.
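To check how many workers are blocked in semop without attaching strace to each one, the current syscall can be read from /proc. This is a rough sketch under stated assumptions: x86_64 Linux (where semop is syscall number 65), and enough privileges to read /proc/PID/syscall (typically root). The function name count_semop_blocked is hypothetical.

```shell
# Count how many of the given PIDs are currently inside semop.
# $1 = proc root (normally /proc); remaining arguments = PIDs to inspect.
# Assumes x86_64 Linux, where semop is syscall number 65.
count_semop_blocked() {
    root=$1
    shift
    n=0
    for pid in "$@"; do
        # Skip PIDs that vanished or that we are not allowed to inspect.
        [ -r "$root/$pid/syscall" ] || continue
        # The first field of /proc/PID/syscall is the syscall number.
        read -r nr _ < "$root/$pid/syscall" || continue
        if [ "$nr" = "65" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}
```

A possible invocation is `count_semop_blocked /proc $(pgrep php-fpm)`. A count close to pm.max_children would be consistent with the hang described here.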

@ThibautPlg
Contributor Author

Hi
Sorry for not answering @joshtrichards, I had a lot on my plate lately and this subject kind of went into the background.
[Screenshot: 2023-08-21_10-57]
This is an example of a regular work day for my users. As you can see, the number of fpm processes suddenly spikes until I manually reload them, and then everything goes right until the next time the server has a mood change.
I haven't noticed anything linked to the previews, requests are quite random and the slow.log contains entries for all kinds of scripts and nothing rings a bell for me.

In the end we "fixed" the problem by adding the following line to our php-fpm config: request_terminate_timeout = 5m.
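To confirm which value a pool file actually sets for a directive such as request_terminate_timeout, a small lookup helper can be used. This is a sketch only: fpm_directive is a hypothetical name, the pool file path in the usage note is an assumption, and the helper does not follow include directives.

```shell
# Print the last uncommented value of a directive in a php-fpm pool file.
# $1 = directive name, $2 = path to the pool configuration file.
# Lines starting with ";" are comments and never match the key.
fpm_directive() {
    awk -F= -v key="$1" '
        { k = $1; gsub(/[ \t]/, "", k) }
        k == key { v = $2; gsub(/^[ \t]+|[ \t]+$/, "", v); val = v }
        END { if (val != "") print val }
    ' "$2"
}
```

For example, `fpm_directive request_terminate_timeout /etc/php-fpm.d/nextcloud.conf` would print 5m after the change described above, assuming the pool is defined in that file.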

@marc4s

marc4s commented Aug 29, 2023

On my instance this started with 25.0.9 or 25.0.10. Today I upgraded to 26.0.5; I will check over the next days whether the issue persists...

Previews were disabled the whole time.

@diego-treitos

I am experiencing this issue too, and I also noticed it after the upgrade to 25.x.x (not sure which version). I administer several NC servers and all of them show the same behavior. While request_terminate_timeout = 5m is a working workaround, I think it is only a patch, and I guess it might have an impact on performance? Anyway, this looks like a bug in NC, as it started happening after an upgrade.

@diego-treitos

diego-treitos commented Oct 17, 2023

Actually, setting request_terminate_timeout = 5m creates a problem when syncing big files. If you check the documentation for uploading big files (https://docs.nextcloud.com/server/20/admin_manual/configuration_files/big_file_upload_configuration.html) you see that it recommends raising timeouts, even up to 1 hour. This means that if your requests are terminated after 5 minutes, your big files won't sync.
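The mismatch can be made explicit by converting both values to seconds. This is a minimal sketch; it relies on php-fpm time values accepting the suffixes s, m, h and d (seconds when no suffix is given), and fpm_time_to_seconds is a hypothetical helper name.

```shell
# Convert a php-fpm time value (for example 5m, 1h, 3600) to seconds.
fpm_time_to_seconds() {
    case $1 in
        *d) echo $(( ${1%d} * 86400 )) ;;
        *h) echo $(( ${1%h} * 3600 )) ;;
        *m) echo $(( ${1%m} * 60 )) ;;
        *s) echo $(( ${1%s} )) ;;
        *)  echo $(( $1 )) ;;
    esac
}

# Succeeds when the terminate timeout ($1) is at least as long as the
# longest request the setup is expected to allow ($2).
timeout_covers() {
    [ "$(fpm_time_to_seconds "$1")" -ge "$(fpm_time_to_seconds "$2")" ]
}
```

Here `timeout_covers 5m 1h` fails (300 seconds against 3600), which is the situation described above, while `timeout_covers 1h 1h` succeeds.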

I think the problem might be related to the Nextcloud server either not closing database connections or not reusing them in later requests, because I've observed that database connections increase at the same pace as workers.

This issue is causing a lot of trouble on all my instances. I am surprised that it is not getting more attention.

@ThibautPlg
Contributor Author

The request_terminate_timeout is indeed only a workaround. We've also noticed a high increase in mariadb connections prior to a total overflow of php-fpm processes.
Quite hard to debug. Thanks for your comment though, glad (kind of) to not be the only one affected by this behavior.
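One way to watch whether database connections rise together with workers is to sample the MariaDB Threads_connected status variable alongside the php-fpm process count. A minimal sketch follows; it assumes the mysql command-line client is installed and able to log in, and threads_connected is a hypothetical helper name.

```shell
# Parse the tab-separated row printed by:
#   mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'"
# and print only the connection count.
threads_connected() {
    awk '$1 == "Threads_connected" { print $2 }'
}
```

Logging this value next to the output of `pgrep -c php-fpm` at regular intervals would show whether both grow in step before the pool fills up.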

@ThibautPlg reopened this Oct 17, 2023
@robert-scheck
Contributor

robert-scheck commented Nov 8, 2023

I see exactly the same result as @ThibautPlg reported, however on a slightly different system:

  • CentOS 7 (fully up-to-date)
  • PHP 8.1 from Remi's Safe RPM repository
  • Nextcloud 27.1.2

However, it is mod_php instead of PHP-FPM, but all Apache webserver processes are stuck at semop() as well when this issue occurs. Using strace, I unfortunately can only gather this:

semop(12, [{0, -1, SEM_UNDO}], 1

On this system the pm.max_children limit isn't hit (because there is no PHP-FPM); instead, either the maximum number of Apache webserver processes (httpd) or the maximum number of MariaDB connections at mysqld is reached, depending on where you set the higher limit.

@szaimen
Contributor

szaimen commented Nov 8, 2023

#41263

@szaimen closed this as completed Nov 23, 2023
@Githopp192

Githopp192 commented Dec 8, 2023

I had a similar issue a week ago. I increased pm.max_children, and then some process ate my whole memory (the php-fpm pm setting was "ondemand").

I played with the php-fpm settings "static" and "dynamic", but both ate too much memory.

So I switched back to the setting "ondemand".

The calculation further down helped me find the right values (with "ondemand" you would only need pm.max_children).
Additionally I set:

pm.process_idle_timeout = 10
pm.max_requests = 500

(See the comments in the php-fpm pool configuration:

; ondemand - no children are created at startup. Children will be forked when
; new requests will connect. The following parameter are used:
; pm.max_children - the maximum number of children that
; can be alive at the same time.
; pm.process_idle_timeout - The number of seconds after which
; an idle process will be killed.
)

So far so good: no memory issues with php-fpm at all.

# Available RAM in MiB
AvailableRAM=$(awk '/MemAvailable/ {printf "%d", $2/1024}' /proc/meminfo)
# Average resident size of a php-fpm process in MiB
AverageFPM=$(ps --no-headers -o 'rss,cmd' -C php-fpm | awk '{ sum+=$1 } END { printf "%d\n", sum/NR/1024 }')
# Number of workers that fit into the available RAM
FPMS=$((AvailableRAM/AverageFPM))
PMaxSS=$((FPMS*2/3))
PMinSS=$((PMaxSS/2))
PStartS=$(((PMaxSS+PMinSS)/2))
echo "-------------------------"
echo "AvailableRAM:$AvailableRAM"
echo "AverageFPM:$AverageFPM"
echo "pm.max_children:$FPMS"
echo "pm.start_servers:$PStartS"
echo "pm.min_spare_servers:$PMinSS"
echo "pm.max_spare_servers:$PMaxSS"
echo "-------------------------"

Calculation PHP-FPM-Tweaks:

AvailableRAM:6457
AverageFPM:120
pm.max_children:53
pm.start_servers:26
pm.min_spare_servers:17
pm.max_spare_servers:35
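
The sample output can be reproduced from its two inputs. The sketch below repeats the integer arithmetic of the script above as a function that takes the available RAM and the average worker size (both in MiB) as arguments. The name fpm_sizing is hypothetical, and the zero check is an addition that the original script does not have.

```shell
# Derive pool sizes from available RAM ($1, MiB) and the average php-fpm
# worker size ($2, MiB), using shell integer division throughout.
fpm_sizing() {
    avail=$1
    avg=$2
    # Added guard: avoid a division by zero when no worker was measured.
    [ "$avg" -gt 0 ] || { echo "average worker size must be > 0" >&2; return 1; }
    max_children=$((avail / avg))
    max_spare=$((max_children * 2 / 3))
    min_spare=$((max_spare / 2))
    start=$(((max_spare + min_spare) / 2))
    echo "pm.max_children:$max_children"
    echo "pm.start_servers:$start"
    echo "pm.min_spare_servers:$min_spare"
    echo "pm.max_spare_servers:$max_spare"
}
```

`fpm_sizing 6457 120` prints the four values listed above: 6457/120 gives 53, two thirds of that is 35, half of 35 is 17, and the mean of 35 and 17 is 26. As noted in the comment, only pm.max_children matters when pm is set to ondemand.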

@diego-treitos

For me this was fixed in 27.1.4. If you are experiencing this problem, please be sure to have Nextcloud upgraded to at least that version before reporting the problem.

@Githopp192

I'm on 27.1.4

@metafarion

metafarion commented Feb 14, 2024

I'm still observing this on one of my instances running 27.1.6. I've been incrementally increasing pm.max_children, 12, 20, 30, 40, 50. Each time as soon as WARNING: [pool www] server reached pm.max_children setting (40), consider raising it shows up in the log, the site cannot load anything further. Looking in top at this time reveals no active threads at all and a load average of near zero. It really just seizes up and can't do any more work until php-fpm is restarted.

Using PHP-FPM 8.2.7 and Apache 2.4.57

@diego-treitos

I agree; I have started observing similar behavior again. This time the processes disappear after a few minutes, but it still creates the problem of having many processes doing nothing and the service becoming unavailable.

@metafarion

I don't want to tempt fate here, but I MAY have resolved my instance by adjusting a different php.ini parameter. I'm kicking myself now for not specifically noting which one, but I was in a hurry at the time. I can say that I wouldn't have known to do it except for a suggestion that showed up in the Nextcloud Administration Settings > Overview panel under Security & setup warnings ONLY after the pm.max_children warning appeared in the system php log, but before the Nextcloud web interface became unresponsive. It was a pretty brief window, but still catchable if you set up some kind of trigger to watch the log file.

It could also be total coincidence :-P
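
A trigger of the kind mentioned above could be sketched as a filter over the php-fpm log. This is an illustration only: it assumes the warning keeps the wording quoted earlier in this thread, max_children_hits is a hypothetical helper name, and the log path in the usage note is an assumption.

```shell
# Read php-fpm log lines on stdin and print "pool limit" for every
# "server reached pm.max_children setting" warning.
max_children_hits() {
    sed -n 's/.*\[pool \([^]]*\)\] server reached pm\.max_children setting (\([0-9]*\)).*/\1 \2/p'
}
```

Something like `tail -F /var/log/php-fpm/error.log | max_children_hits` would then print a line such as "www 40" each time the limit is hit, which could drive a notification or a check of the setup warnings page.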

@Githopp192

As I've written: I switched back to the setting "ondemand", and all problems were solved.

@metafarion

Alright, so my earlier victory was ultimately short-lived, and probably coincidental. The thing that ACTUALLY seems to have fixed this for me was installing php-smbclient. My NC instance is entirely a mounted SMB external share, and without php-smbclient, any file transfer that took too long or was larger than 512MB would hang and lock up a child process until there were no more available and the server would stop processing requests.
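
Whether the extension is present can be checked against the module list that PHP prints. A minimal sketch, assuming the extension registers under the module name smbclient and that the CLI and FPM load the same set of extensions (they may use different ini files); has_php_module is a hypothetical helper name.

```shell
# Succeeds when module $1 appears as a whole line (case-insensitive)
# in a `php -m` listing read from stdin.
has_php_module() {
    grep -qix "$1"
}
```

For example, `php -m | has_php_module smbclient && echo loaded` reports whether the CLI sees the module.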

@MrRinkana

> I'm still observing this on one of my instances running 27.1.6. I've been incrementally increasing pm.max_children, 12, 20, 30, 40, 50. Each time as soon as WARNING: [pool www] server reached pm.max_children setting (40), consider raising it shows up in the log, the site cannot load anything further. Looking in top at this time reveals no active threads at all and a load average of near zero. It really just seizes up and can't do any more work until php-fpm is restarted.
>
> Using PHP-FPM 8.2.7 and Apache 2.4.57

Just for reference, similar symptoms can appear if you use keepalive between Apache and php-fpm. I'm not sure why, but keepalive is probably unnecessary either way if you run php-fpm on the same machine, especially when using Unix sockets.

@maxhoesel maxhoesel mentioned this issue Sep 17, 2024