-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Investigate memory leak #13
Comments
Closing for now. Haven't noticed a leak in the monitoring for at least 6 months, even on the busiest bots for SWMT channels, en.wikipedia.org, and wikidata.org. |
Meant to re-open this And since about November or December, it seems to cause at least once a week that the bots come to a halt until we manually restart them. |
To get started locally, clone this repo and build the code (either via Visual Studio, or from the command-line with msbuild and mono), then tell the bot via IRC to monitor Wikipedia or Wikidata (the most active wikis and thus triggers the leak the quickest). By default the bot joins
|
Hey all, just seeing if I can be of any assistance here. First checking constraints: is Mono the only available runtime, or is something more recent like .NET Core an option? I ask because there's a lot of string parsing involved here and that's where a lot of allocations can be removed and performance improved in .NET Core, if running it is an option. I'm also curious about security implications here, there are some areas that allow SQL injection to the local database which would be good to fix. If .NET Core above is an option, is using additional libraries like Dapper an option? If so, we can clean those up in short order. I'm taking a look through now to see if I readily see any leak locations, but a memory dump would be the most useful if you're able to take one. If you are, I can show you how to run a script on it to analyze what the big leak items are, or if there's no sensitive data and it's okay to upload somewhere, I'd happily do such analysis. |
@NickCraver Hey, thanks for taking a look. Much appreciated. This is running on a VM in Wikimedia Cloud with Debian 9 Stretch and the Mono 5.16 that comes with that (via Wikimedia's APT mirror). I work at Wikimedia Foundation on the LAMP stack for Wikipedia, but CVNBot and its operation is something I do in my volunteer capacity, and I'm very much a beginner here. I've inherited this and done my best to keep it running over the years to account for changes in the source feed and MediaWiki APIs it calls to, but beyond that I'm very much a beginner with C# and .NET/Mono in general. I don't have much of a preference here, other than to work within the contraints of FLOSS (as required for Wikimedia Cloud), and ideally only dependencies that are in Debian stable (for operational simplicity/flexibility). Other than that, pretty much fine with anything and everything. It does not surprise me that there are SQL vectors here. I've patched a few of those before this was moved to Git, but it's not a high area of concern given only trusted users are given access to "bot commands" and those same users are also the main consumer of the data so it would be a "footgun" situation. Having said that, obviously fixing them would be awesome, but less worrisome than the memory leak at this point (unless it's attackable by non-voiced users of the output feed, or through the input source). Regarding memory dump, sure thing. I do need help with how to capture it though. The only sensitive data is the bot's NickServ password, but I can simply rotate that to make things easier. |
@Krinkle Got it - thanks! I'm honestly not sure what the process is to properly dump this on Debian, I'll ask for backup there - happy to analyze it though. From looking through the code, I'd currently be looking at unbounded project list growth. It seems to vary if a project is ever removed, and there are some races down that path with multiple threads. It wouldn't be completely trivial work, but again .NET Core with That's long-term though, let's see if we can get some help memory dumping and confirm our culprit. |
Hey all, with Mono 5.16 you can switch to target net472 and start using async/await and all the latest goodness. |
The project loader is rare used, only once when an instance is first set up, and then it stays like that. A restart every now and then after an upgrade, but for the most part processes start, live for weeks and get restarted without ever seeing a The Projects.xml file for CVNBot18/wikidata:<projects>
<project><projectName>wikidata.wikipedia</projectName><interwikiLink /><rooturl>https://www.wikidata.org/</rooturl><speciallog>Special:.+?/(.+)</speciallog><namespaces><?xml version="1.0"?><api batchcomplete=""><query><namespaces><ns _idx="-2" id="-2" case="first-letter" canonical="Media" xml:space="preserve">Media</ns><ns _idx="-1" id="-1" case="first-letter" canonical="Special" xml:space="preserve">Special</ns><ns _idx="0" id="0" case="first-letter" content="" defaultcontentmodel="wikibase-item" xml:space="preserve" /><ns _idx="1" id="1" case="first-letter" subpages="" canonical="Talk" xml:space="preserve">Talk</ns><ns _idx="2" id="2" case="first-letter" subpages="" canonical="User" xml:space="preserve">User</ns><ns _idx="3" id="3" case="first-letter" subpages="" canonical="User talk" xml:space="preserve">User talk</ns><ns _idx="4" id="4" case="first-letter" subpages="" canonical="Project" xml:space="preserve">Wikidata</ns><ns _idx="5" id="5" case="first-letter" subpages="" canonical="Project talk" xml:space="preserve">Wikidata talk</ns><ns _idx="6" id="6" case="first-letter" canonical="File" xml:space="preserve">File</ns><ns _idx="7" id="7" case="first-letter" subpages="" canonical="File talk" xml:space="preserve">File talk</ns><ns _idx="8" id="8" case="first-letter" subpages="" canonical="MediaWiki" xml:space="preserve">MediaWiki</ns><ns _idx="9" id="9" case="first-letter" subpages="" canonical="MediaWiki talk" xml:space="preserve">MediaWiki talk</ns><ns _idx="10" id="10" case="first-letter" subpages="" canonical="Template" xml:space="preserve">Template</ns><ns _idx="11" id="11" case="first-letter" subpages="" canonical="Template talk" xml:space="preserve">Template talk</ns><ns _idx="12" id="12" case="first-letter" subpages="" canonical="Help" xml:space="preserve">Help</ns><ns _idx="13" id="13" case="first-letter" subpages="" canonical="Help talk" xml:space="preserve">Help talk</ns><ns _idx="14" id="14" case="first-letter" canonical="Category" xml:space="preserve">Category</ns><ns _idx="15" id="15" case="first-letter" subpages="" canonical="Category talk" xml:space="preserve">Category talk</ns><ns _idx="120" id="120" case="first-letter" canonical="Property" defaultcontentmodel="wikibase-property" xml:space="preserve">Property</ns><ns _idx="121" id="121" case="first-letter" subpages="" canonical="Property talk" xml:space="preserve">Property talk</ns><ns _idx="122" id="122" case="first-letter" canonical="Query" xml:space="preserve">Query</ns><ns _idx="123" id="123" case="first-letter" canonical="Query talk" xml:space="preserve">Query talk</ns><ns _idx="828" id="828" case="first-letter" subpages="" canonical="Module" xml:space="preserve">Module</ns><ns _idx="829" id="829" case="first-letter" subpages="" canonical="Module talk" xml:space="preserve">Module talk</ns><ns _idx="1198" id="1198" case="first-letter" subpages="" canonical="Translations" xml:space="preserve">Translations</ns><ns _idx="1199" id="1199" case="first-letter" subpages="" canonical="Translations talk" xml:space="preserve">Translations talk</ns><ns _idx="2300" id="2300" case="first-letter" canonical="Gadget" xml:space="preserve">Gadget</ns><ns _idx="2301" id="2301" case="first-letter" canonical="Gadget talk" xml:space="preserve">Gadget talk</ns><ns _idx="2302" id="2302" case="case-sensitive" canonical="Gadget definition" defaultcontentmodel="GadgetDefinition" xml:space="preserve">Gadget definition</ns><ns _idx="2303" id="2303" case="case-sensitive" canonical="Gadget definition talk" xml:space="preserve">Gadget definition talk</ns><ns _idx="2600" id="2600" case="first-letter" canonical="Topic" defaultcontentmodel="flow-board" xml:space="preserve">Topic</ns></namespaces></query></api></namespaces><restoreRegex>^restored "\[\[(?<item1>.+?)\]\]"(?:: (?<comment>.*?))?$</restoreRegex><deleteRegex>^deleted "\[\[(?<item1>.+?)\]\]"(?:: (?<comment>.*?))?$</deleteRegex><protectRegex>^protected "\[\[(?<item1>.+?)\]\]"(?:: (?<comment>.*?))?$</protectRegex><unprotectRegex>^removed protection from "\[\[(?<item1>.+?)\]\]"(?:: (?<comment>.*?))?$</unprotectRegex><modifyprotectRegex>^changed protection level for "\[\[(?<item1>.+?)\]\]"(?:: (?<comment>.*?))?$</modifyprotectRegex><uploadRegex>^uploaded "\[\[(?<item1>.+?)\]\]"(?:: (?<comment>.*?))?$</uploadRegex><moveRegex>^moved \[\[(?<item1>.+?)\]\] to \[\[(?<item2>.+?)\]\](?:: (?<comment>.*?))?$</moveRegex><moveredirRegex>^moved \[\[(?<item1>.+?)\]\] to \[\[(?<item2>.+?)\]\] over redirect(?:: (?<comment>.*?))?$</moveredirRegex><blockRegex>^blocked \[\[(?<item1>.+?)\]\] with an expiration time of (?<item2>.+?) \((?<item3>.+?)\)(?:: (?<comment>.*?))?$</blockRegex><unblockRegex>^unblocked (?<item1>.+?)(?:: (?<comment>.*?))?$</unblockRegex><reblockRegex>^changed block settings for \[\[(?<item1>.+?)\]\] with an expiration time of (?<item2>.+?) (?<item3>.+?)(?:: (?<comment>.*?))?$</reblockRegex><autosummBlank>^Blanked the page(?:: (?<comment>.*?))?$</autosummBlank><autosummReplace>^Replaced content with "(?<item1>.+?)"(?:: (?<comment>.*?))?$</autosummReplace></project>
</projects> |
Could it be a leak in the underlying IRC library? (Meebey.SmartIrc4net.dll) |
@NickCraver The VMs currently run Mono 5.16 from mono-project.com's apt source (link to server provision config). I've added Microsoft's apt source as well on a similar instance and installed Is this an SDK you'd like to make us of within the CVNBot code, which Mono or MSBuild would pick up in the current workflow, or does this provide a binary I would use instead of Mono to run the CVNBot.exe process? As for .NET version, I'd be fine with moving to a newer version – provided it can still be built with VS2019 on Mac, and build/run on Debian Linux. |
@KirillOsenkov possible, that code hasn't been updated in a while either. |
To run with .NET Core you would just type First thing it should be safe to change this line to If you switch to net472 and you see build errors, we can help investigate those. When you build the .sln I'd use |
Here's the source for the SmartIRC library: It's on NuGet at: The .dll version currently used is 0.4.5.0. |
I'm seeing a few instances pop in and out of I've edited my previous comment to use |
One easy thing to do would be to update SmartIrc4net to latest in the hopes that the leak was there and it was fixed. |
For example, could it be this? |
cc @meebey have you had any known memory leaks in SmartIRC? |
Saw this check in on the bot, its already up to date with v 1.1 |
The code seems to compile without any warnings (both with msbuild and in VS2019), and run without issue from what I can tell. I do note though, that there is also the following in the exe-config file: <supportedRuntime version="v4.0" sku=".NETFramework,Version=v4.5" /> I'm not sure how the two relate, or whether I made a mistake by having |
|
With .NET Core you can generate an exe so that it can be invoked directly. It can have the runtime self-contained so it is not even necessary to install the .NET Core runtime or SDK. |
Depending on how replicable this is, you've also got access to JetBrains ReSharper, and more importantly/relevantly dotTrace/dotMemory via our subscription |
.. but support for .NET 4.7.2 was only added in Mono 5.18, whereas the servers still run Mono 5.16. However I see now that the mono-project apt source from which I installed Mono 5.16 are continuously updated even for older Debian distros. I didn't realise that they did that (I assumed they'd pin the version there for each Debian release). These apt sources noiw provide Mono 6.8 for Debian 9 Stretch, so I can do a rolling update for that first and then roll out the .NET 4.7.2 update after that. |
Support for .NET 4.7.2 was added in Mono 5.18 per <https://www.mono-project.com/docs/about-mono/releases/5.18.0/>. The main CVN servers in Wikimedia Cloud run on Mono 5.16 from mono-project's apt sources, but those same sources have been updated since we provisioned the server and now provide latest Mono 6.8 even for Debian 9 Stretch. This change is blocked on upgrading the servers first. Ref #13.
A memory dump from the process when it's having a lot of leaked memory might help in speeding up figuring out what's going on. |
Ref wikimedia/labs-countervandalism-CVNBot#13 The same apt source now provides Mono 6.8, I just ran `apt-get update` and `apt-get install mono-complete`. > Unpacking mono-complete `6.8.0.123-0xamarin2+debian9b1` > over `5.16.0.179-0xamarin1+debian9b1`.
Support for .NET 4.7.2 was added in Mono 5.18+. For issue #13 we want to try moving from .NET 4.5 to 4.7.2. I've upgraded the servers to Mono 6.8 and confirmed that CVNBot v3.1 still compiles, builds and runs fine. Ref #13. Ref wikimedia/labs-countervandalism-cvn-infrastructure@b33a9739e7.
Support for .NET 4.7.2 was added in Mono 5.18 per <https://www.mono-project.com/docs/about-mono/releases/5.18.0/>. The servers were upgraded from Mono 5.16 to 6.8 earlier today. Ref #13.
Itll be interesting to see if the updates to Mono 6.8 & going to .net 4.7 will alleviate the issue. |
This bug that is closed in Mono 6.8 mono/mono#8689 looks interesting. |
Updating the servers was complicated by an issue, but eventually completed. The
After manually uninstalling all traces of mono and msbuild, I tried again but this time install msbuild as its own package first before installing the rest of mono-complete. That worked, as described in mono/mono#15709. I first redeployed CVNBot v3.1 (.NET 4.5) on Mono 6.8 by itself, which seemed to work fine. I left the fleet running long enough to make sure no obvious issues would emerge, but not looking at its memory footprint. I've now deployed CVNBot v4.0 (.NET 4.7.2), let's see how that fares on Mono 6.8 for a few hours in terms of memory :) |
if needed, We can use docker for isolate environment to using other solution ( avoid constraint for debian stretch package ) |
@KirillOsenkov if you are running version 1.1, then there are no known leaks. SmartIr4net is a mature library and is used by quite a few projects for long-running bots (e.g. LIOC), so I would be surprised to find a leak in it. But yes, in the past it had a few. Any luck with a memory profiler to locate the leak? |
I'm not sure how to grab a memory dump on Linux and Mono. If this ran on .NET framework or .NET Core on Windows it would be very easy to investigate. |
I've managed to get it building and running on Windows. Just needed to add these PackageReferences: It's running now, I'll let it run overnight and see if I can reproduce a leak. If it leaks, it'll now be very easy to investigate. |
Repro instructions for Windows:
Running:
|
@KirillOsenkov this thread mentions the bot uses SQLite which caused a small dejavu. I ran into memory leaks with SQLite in https://github.com/meebey/smuxi in the past, check this commit meebey/smuxi@4e5f4c5#diff-256230ab72fa1386ce4aad3c8928e5ea |
I updated my PR with a few more IDisposable objects wrapped in using. @meebey I think you are on to something here. @Krinkle there is dbcon in ListManager.cs that is a global and used throughout the file which could be the culprit, because its not wrapped in a using and possibly left open. If that is refactored out from a global to just open as needed, I think it can alleviate at least that problem. |
@kashifkhan The sqlite database is used continously by the bot in the execution of its primary function (which is reading events from RCReader, applying information from the database, and then deciding whether and how to output it in the feed channel). I have not confirmed it for many years, but suspect that closing/re-opening the database could cause it to no longer be able to keep up with the RC feeds in real-time. I don't believe there are cases where the bot is working correctly and not need of the database connection being open. However, it would certainly be good to make sure that when things reconnects/restart that we don't open duplicate connections etc. Also, if there is potential for the open connection to leak memory if it is open for too long, perhaps we should periodically recycle it? |
So I ran CVNBot overnight on Windows, and I didn't notice any memory increase at all. It did however have a socket exception and a disconnect and I'm not sure what that was about. I restarted and it's running again now, keeps steady at 40MB. |
@Krinkle yeah that is a good question for the overall group if they can answer with regards to Sqlite. It usual rule of thumb is to keep a connection open for the unit of work that its supposed to do and then close it. In a chatty app like this one, where its doing stuff constantly a persistent connection might serve you better. Im not sure how Sqlite would react to connections opening/closing constantly vs a long running persistent connection Luckily the code in ListManager is structured as individual units already and would be quite easy to do a quick refactor to test, depending on current priorities. Would love to hear from others in the group though on thoughts |
@kashifkhan I don't know if SqlLite connections are also pooled like the ones ADO.NET opens for e.g. an MS SQL DB, but because of this connection pooling, open & closing connections is fast (because they don't actually open & close, you're just getting an open connection from the pool). |
@Krinkle am I reading graphs at https://tools.wmflabs.org/nagf/?project=cvn#h_cvn-app8_memory wrong or did bumping Mono to 6.8 fix leaking? |
@DavidKarlas It would seem so, yes. I'm not yet ready to close this ticket however as there seem to be one or two bugs I'm observing that seem to be new and may be masking the memory isssue:
It's still miles better than before, though. This has been a huge improvement! |
I’ve seen a random SocketException where the bot has disconnected from the channel, not sure what that was about but possibly related. I left the bot running under debugger and then saw it in the morning. |
The bots are increasingly building up memory usage when running. Over a weeks time of running half a dozen bots, it accumulates to several gigabytes of memory being allocated.
The text was updated successfully, but these errors were encountered: