Wordpress(?) over-committing memory #348

Closed
DevJohnC opened this issue Jan 18, 2019 · 13 comments

Comments

@DevJohnC

We have Peachpie running a WordPress network which keeps getting killed by the Linux out-of-memory (OOM) killer.

[31346.098815] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[31346.315298] [20856]     0 20856  2098347   433262    1184       9        0          1000 dotnet
[31346.338398] Out of memory: Kill process 20856 (dotnet) score 1845 or sacrifice child
[31346.345205] Killed process 20856 (dotnet) total-vm:8393388kB, anon-rss:1733048kB, file-rss:0kB

The dotnet process (which is a Peachpie project) is reserving a lot more memory than it's actually using.
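
As a rough check on the figures in the log above, here is a minimal Python sketch that converts the `total_vm` and `rss` columns of the task dump (reported in pages) into kB. The 4 KiB page size is an assumption, although it is consistent with the `total-vm` and `anon-rss` values on the "Killed process" line. The helper name `pages_to_kb` is made up for illustration.

```python
# Assumption: 4 KiB pages, the usual size on x86-64 Linux.
PAGE_SIZE_KB = 4


def pages_to_kb(pages: int, page_size_kb: int = PAGE_SIZE_KB) -> int:
    """Convert a page count from the OOM killer task dump into kB."""
    return pages * page_size_kb


# Values copied from the dotnet row of the task dump.
total_vm_pages = 2098347
rss_pages = 433262

total_vm_kb = pages_to_kb(total_vm_pages)  # reserved virtual address space
rss_kb = pages_to_kb(rss_pages)            # memory actually resident in RAM
ratio = total_vm_kb / rss_kb

print(f"total-vm: {total_vm_kb} kB")
print(f"rss:      {rss_kb} kB")
print(f"reserved / resident: {ratio:.2f}x")
```

Under that assumption the process had reserved a bit under five times the memory it actually had resident.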

Are there tuning options to mitigate this situation?

@jakubmisek
Member

jakubmisek commented Jan 19, 2019

Can this be related to dotnet/aspnetcore#3409 or dotnet/aspnetcore#1976?

@DevJohnC
Author

Disabling the server-gc seems to have bought us a modicum of stability but it doesn't last. We also turned off the wp cron.

We still experience the same issue: after running for around half an hour with a steady flow of traffic, the dotnet process spikes to 100% CPU usage and keeps allocating memory until the OOM killer finally kills the process and allows the node to restabilize.

@DevJohnC
Author

To add to this issue, I watched it happen in real time today to gather more details and see if anything rings a bell for you.

  • We're on a pretty vanilla WordPress install: a few basic themes, configured as a network, with no plugins besides some filters to set comment options and filter admin menus
  • Traffic levels are not high but are consistent; the app is normally serving a web request at any given moment
  • Traffic logs show nothing suspicious and a replay of web requests leading up to the issue doesn't reproduce it
  • The issue occurs at oddly specific timing, around 30-40 minutes of uptime on each occurrence
  • Server death looks like this:
    • CPU usage spikes to 100% user-mode usage
    • This gradually becomes more and more kernel-mode cpu usage, staying at 100% utilization overall
    • During this time memory usage (which usually sits at just under 500MB) starts spiking by GBs at a time, quickly allocating then freeing GBs of RAM
    • This continues until the server becomes unresponsive, likely due to the CPU giving more and more time to kernel-mode operations
    • During this time the database is not showing a high data throughput, meaning the GBs of memory assigned and deleted aren't database records
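
One way to capture these spikes as they happen is to poll the process's `/proc/<pid>/status` file, which on Linux reports `VmSize` (virtual size) and `VmRSS` (resident size) in kB. The following is a minimal sketch, not something used in this thread. The function names are hypothetical, and the sample text is made up, reusing the figures from the kernel log in the first comment.

```python
import re
from pathlib import Path

# Matches the two memory fields of interest in /proc/<pid>/status.
FIELD_RE = re.compile(r"^(VmSize|VmRSS):\s+(\d+)\s+kB")


def parse_status(text: str) -> dict:
    """Extract VmSize and VmRSS (in kB) from the text of /proc/<pid>/status."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def read_memory(pid: int) -> dict:
    """Read the current memory figures for a live process (Linux only)."""
    return parse_status(Path(f"/proc/{pid}/status").read_text())


# Illustrative sample only; not captured from the affected server.
SAMPLE = "Name:\tdotnet\nVmSize:\t 8393388 kB\nVmRSS:\t 1733048 kB\n"
sample_fields = parse_status(SAMPLE)
print(sample_fields)
```

Calling `read_memory(pid)` in a loop with a short sleep and logging the result would give a timeline to line up against the CPU spike.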

Given the consistent timing and how the CPU becomes consumed in kernel mode, it feels like a deadlock issue, or something related? Peachpie seems to just stall waiting for whatever kernel operation is happening to complete. Perhaps some sort of internal cache deadlocks when requests come in continuously?

@jakubmisek
Member

I still suspect a Linux-specific .NET Core issue. (We're running tens of WordPress websites on .NET Core on Win10 x64 and Azure, and the servers have been stable for months so far, using 400-600 MB of RAM.) We use the default setup: https://github.com/iolevel/wpdotnet-sdk/blob/master/app/Program.cs

Anyway, it is possible there is an infinite loop in the PeachPie code. In that case it would be great if you could debug the process, or attach a debugger when this happens. Is that possible on Linux?

@DevJohnC
Author

Damn, I was seriously hoping you'd have a good idea of what is wrong right away, oh well :/

I admit I'm ignorant about attaching to a running process in production. I'm currently moving the blog network to a dedicated VM that's isolated from the rest of our setup to facilitate that.

What information should I dump from the process, and what are the Linux equivalents of the Windows tools?

@jakubmisek
Member

It seems like it might be caused by a plugin that we didn't test yet ..

Anyway, Visual Studio can capture a minidump. Alternatively, if you get a chance to see the stack trace where the OOM happens, that might help too.

@DevJohnC
Author

I ended up attaching lldb with the libsosplugin.so plugin for linux on a production server. I tried to get a dump with procdump but it didn't want to load into lldb.

Anyway, we now have a full 8 or so hours of uptime after disabling WordPress' option to automatically convert smilies to images.

I managed to dump a couple of stack traces while the application was dying and found convert_smilies (https://core.trac.wordpress.org/browser/tags/5.0.3/src/wp-includes/formatting.php#L2836) to be running during both incidents. This function does a lot of string manipulation that is likely inefficient, resulting in a lot of data copying and a CPU-bound workload.

Reading through that function I think it's very likely to be a combination of

  • a poorly designed method that isn't anywhere near efficient
  • Peachpie possibly having significant overhead when used in this manner
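
To illustrate the kind of work being described, here is a simplified Python sketch of a split-and-replace pattern loosely modeled on what convert_smilies appears to do: split the content on HTML tags, run a replacement over each text chunk, and rebuild the output by concatenation. It is not a port of the WordPress code and says nothing about how Peachpie executes it. The function name, the regex, and the smiley table are assumptions made for illustration.

```python
import re

# Splits text into alternating text chunks and HTML tags (tags are captured).
TAG_SPLIT = re.compile(r"(<[^>]*>)")


def convert_smilies_sketch(text: str, smilies: dict) -> str:
    """Replace smiley codes with <img> tags, leaving HTML tags untouched."""
    if not smilies:
        return text
    # Longest codes first so ":-)" is tried before a shorter overlapping code.
    codes = sorted(smilies, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(code) for code in codes))

    def to_img(match):
        code = match.group(0)
        return f'<img alt="{code}" src="{smilies[code]}">'

    output = ""
    for index, chunk in enumerate(TAG_SPLIT.split(text)):
        if index % 2 == 1:
            output += chunk  # odd positions are captured tags: copy as-is
        else:
            output += pattern.sub(to_img, chunk)  # each text chunk is rescanned
    return output


result = convert_smilies_sketch("<p>hi :)</p>", {":)": "smile.png"})
print(result)
```

Every tag boundary produces another chunk, another regex pass, and another string concatenation, so content with many tags or very large bodies multiplies the copying. That fits the suspicion that the cost depends on what is being rendered.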

I'm going to keep the issue open while I re-enable the WP cron and monitor for more of this behavior but it seems the culprit is found.

@jakubmisek
Member

@DevJohnC both are possible. However, I was not able to replicate the issue (on Win x64). The formatting.php file is definitely an ugly piece of code, but it does not seem to cause any issues. When profiling performance, it is not even reported as a significant function.

[image: profiling results, with no formatting.php entry]

@jakubmisek
Member

Is it possible for you to run the website locally?
Do you have a list of your enabled WP plugins?

@DevJohnC
Author

I'm currently on my laptop on satellite internet so I can't do a whole lot of debugging until I'm back at my workstation.

However, we've had 100% uptime since disabling the WordPress option for convert_smilies. The stacktraces I found it in were all executing as part of the RSS2 feed.

It's entirely plausible that convert_smilies crashing our servers was dependent on content: maybe something with lots of HTML tags, or non-Latin-language posts, or posts with large content bodies, or goodness knows what else.

I'll try and narrow it down when I can.

@jakubmisek
Member

Thank you, will try more tests as well. BTW, the newly released Peachpie 0.9.30 reduces memory utilization by 40%.

@jakubmisek
Member

So far we cannot reproduce the memory issue (but we are running Windows servers).

We are constantly requesting the RSS2 feed and nothing weird has happened yet.

@jakubmisek
Member

Closing for now. If you have any more details, please comment :) Thank you!


2 participants