
Bandwhich appears to have a serious memory leak #284

Closed
popey opened this issue Sep 9, 2023 · 39 comments
Labels
bug Something isn't working

Comments

popey commented Sep 9, 2023

I built and ran bandwhich on a few machines. After running for a while, it dies. I see this in my logs:

[Sat Sep  9 14:11:34 2023] [ 910813]     0 910813 25425140  6316768 76255232  3166464           200 bandwhich
[Sat Sep  9 14:11:34 2023] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=zerotier-one.service,mems_allowed=0,global_oom,task_memcg=/system.slice/snap.bandwhich.bandwhich-92de3b50-fb9a-430c-9645-eac2cde77fe0.scope,task=bandwhich,pid=910813,uid=0
[Sat Sep  9 14:11:34 2023] Out of memory: Killed process 910813 (bandwhich) total-vm:101700560kB, anon-rss:25267072kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:74468kB oom_score_adj:200
[Sat Sep  9 14:11:37 2023] oom_reaper: reaped process 910813 (bandwhich), now anon-rss:88kB, file-rss:0kB, shmem-rss:0kB

While I've seen this on small remote servers, I have also seen it on my workstation, which has 32GB RAM and 22GB swap.

popey (Author) commented Sep 9, 2023

[screenshot]

Caught a picture from bottom.

Note how the purple line shoots up (that's bandwhich eating RAM), then the yellow line climbs as it eats all the swap, and then the process gets killed by the OOM killer...

cyqsimon (Collaborator) commented Sep 11, 2023

Is this problem happening on main?

Bandwhich itself does not use any unsafe code, so the memory leak was likely caused by a dependency. We've had many dependency bumps since the last release, so if the release version is indeed what you're seeing problems on, this may have already been solved.

popey (Author) commented Sep 11, 2023

I was using whatever was in main when I filed the issue, yes. Happy to test again after the changes that landed today, but that all looks like CD pipeline stuff?

I found it quite easily reproducible on multiple machines. Just leave it running for a while, and eventually it blows up. It might take minutes or hours, perhaps depending on traffic volume.

I built it with Rust 1.70.0, if that makes any difference?

cyqsimon (Collaborator) commented:

> but that all looks like CD pipeline stuff?

Yes, that's true. You don't need to test again =).

> I found it quite easily reproducible on multiple machines. Just leave it running for a while, and eventually it blows up. It might take minutes or hours, perhaps depending on traffic volume.

Curious. I'm no expert in debugging memory issues, but I'll give it a try.

> I built it with Rust 1.70.0, if that makes any difference?

I doubt it.

@cyqsimon cyqsimon added bug Something isn't working help wanted Extra attention is needed labels Sep 11, 2023
sigmaSd (Collaborator) commented Sep 15, 2023

I can't reproduce on my Linux machine. Can you run it in debug mode with flamegraph (https://github.com/flamegraph-rs/flamegraph)? It can give some hints.

cyqsimon (Collaborator) commented:

Cannot reproduce on my box (Linux 6.5.3-arch1-1 x86_64) either. I kept bandwhich running for one hour while I worked, and the memory usage stayed constant around 9.7M.

@cyqsimon cyqsimon added needs reproduction and removed help wanted Extra attention is needed labels Sep 20, 2023
cyqsimon (Collaborator) commented:

Out of curiosity, was all of your testing done on the same distro? @popey

As unlikely as it is, maybe there's something wrong with the allocator or even the kernel?

gms8994 commented Sep 21, 2023

I tested this on a box specifically trying to reproduce it, and could: `memory allocation of 6442450944 bytes failed`.

Sep 21 06:00:39 kernel: [31561.482912] [   1291]  1000  1291   907564   394590  3289088        0             0 bandwhich
Sep 21 06:00:39 kernel: [31561.483047] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-1.scope,task=bandwhich,pid=1291,uid=1000
Sep 21 06:00:39 kernel: [31561.483086] Out of memory: Killed process 1291 (bandwhich) total-vm:3630256kB, anon-rss:1578360kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:3212kB oom_score_adj:0

This is on Ubuntu 22.04.3 LTS with v0.21.0 from the releases tab. I started at around 21:15 on Sep 20, so it ran for ~9 hours before being killed. I'll try and get a flamegraph version today.

terminaldweller commented:

Same issue here with v0.21.0 on Void.
Currently I can't figure out how to run flamegraph due to this:
flamegraph-rs/flamegraph#181
I can post a flamegraph when I manage to get one.

terminaldweller commented:

here's a flamegraph for when bandwhich ends up taking up too much memory:

[flamegraph image]

sigmaSd (Collaborator) commented Oct 7, 2023

I can't seem to zoom in on the individual SVG elements. Anyhow, the big view shows mostly read/iterate/openat, so maybe there is a leak when reading procfs; maybe it needs a large number of connections.

sigmaSd (Collaborator) commented Oct 7, 2023

I tried https://github.com/KDE/heaptrack and it seems very simple and works really well: it shows a graph and the exact functions, and which ones leaked.

Note: `heaptrack bandwhich` didn't work for me; I had to start bandwhich first, then use `heaptrack -p <pid of bandwhich>`.

sigmaSd (Collaborator) commented Oct 7, 2023

It shows 0 leaks for me

sigmaSd (Collaborator) commented Oct 7, 2023

I kept the program running longer; I do have some small leaks.

heaptrack is making me suspect https://github.com/imsnif/bandwhich/blob/main/src/display/ui.rs#L184. It seems like we keep updating the state, which keeps updating the utilization map (https://github.com/imsnif/bandwhich/blob/main/src/network/utilization.rs#L27) without ever removing from it.

Maybe when there is a large number of connections, this causes the leak to show.

sigmaSd (Collaborator) commented Oct 7, 2023

Ah, I see: to make the leaks show up, you need to attach heaptrack to bandwhich, and then close heaptrack, not bandwhich.

It seems RAII is working correctly and everything gets cleaned up at exit, but if we disconnect early, heaptrack shows us the places that accumulate memory over the lifetime of the program, like the state updating above.

sigmaSd (Collaborator) commented Oct 7, 2023

We do have this function, which is indeed called: https://github.com/imsnif/bandwhich/blob/main/src/network/utilization.rs#L24

But it uses HashMap::clear, whose documentation says it clears the entries but keeps the allocated memory; maybe that's what we're looking for.
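For illustration, a minimal standalone sketch of the behavior the docs describe (the map contents here are made up):

```rust
use std::collections::HashMap;

fn main() {
    // Fill a map with throwaway entries to force a large allocation.
    let mut map: HashMap<u64, String> = HashMap::new();
    for i in 0..100_000 {
        map.insert(i, format!("connection-{i}"));
    }
    let cap_before = map.capacity();

    // `clear` drops every entry but keeps the backing allocation for reuse.
    map.clear();
    assert!(map.is_empty());
    assert!(map.capacity() >= cap_before);

    // Replacing the map with a fresh one releases the old allocation instead.
    map = HashMap::new();
    assert_eq!(map.capacity(), 0);
    println!("capacity retained by clear(): {cap_before}");
}
```

So a long-lived map that periodically balloons and is only ever `clear()`ed will hold on to its high-water-mark allocation indefinitely.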

sigmaSd (Collaborator) commented Oct 7, 2023

> Ah, I see: to make the leaks show up, you need to attach heaptrack to bandwhich, and then close heaptrack, not bandwhich.

Note that this has a lot of false positives, though.

cyqsimon (Collaborator) commented Oct 8, 2023

So if I understand correctly, what you are saying @sigmaSd is that this is not so much a leak as it is simply keeping a value long past its usefulness?

cyqsimon (Collaborator) commented Oct 8, 2023

@popey @gms8994 @terminaldweller Can you please reproduce the severe memory usage issue again while tracing using heaptrack, and upload the output file? This will help us verify @sigmaSd's hypothesis.

sudo heaptrack bandwhich

Thanks in advance.

sigmaSd (Collaborator) commented Oct 8, 2023

> So if I understand correctly, what you are saying @sigmaSd is that this is not so much a leak as it is simply keeping a value long past its usefulness?

Yes, but it's more like keeping the memory past its usefulness.

Also, maybe someone who can reproduce the issue can try changing https://github.com/imsnif/bandwhich/blob/main/src/network/utilization.rs#L24 to `self.connections = HashMap::new();`. If that's indeed where the biggest issue is, forcing the hashmap to release its memory should have a noticeable effect.
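A sketch of what that change could look like; the struct and type names here are illustrative stand-ins for bandwhich's actual types, not its real API:

```rust
use std::collections::HashMap;

// Illustrative stand-ins: the real key/value types live in bandwhich's
// network module and are more involved.
type Connection = u32;
type ConnectionInfo = u64;

struct Utilization {
    connections: HashMap<Connection, ConnectionInfo>,
}

impl Utilization {
    // `self.connections.clear()` would keep the backing allocation alive;
    // replacing the map lets the old allocation be freed instead.
    fn reset(&mut self) {
        self.connections = HashMap::new();
    }
}

fn main() {
    let mut util = Utilization { connections: HashMap::new() };
    for i in 0..50_000 {
        util.connections.insert(i, u64::from(i));
    }
    util.reset();
    // Empty *and* deallocated: a fresh map reports zero capacity.
    assert_eq!(util.connections.capacity(), 0);
    println!("capacity after reset: {}", util.connections.capacity());
}
```

A middle ground would be calling `shrink_to_fit` after `clear`, which keeps the method's semantics but still returns memory to the allocator.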

sigmaSd (Collaborator) commented Oct 8, 2023

Though this alone doesn't explain why it would fail to allocate memory.

popey (Author) commented Oct 8, 2023

@cyqsimon sure thing. I've left bandwhich running under heaptrack on my desktop. It has 64GB RAM, and this OOM usually happens unexpectedly after some hours, so I will report back if/when it happens.

sigmaSd (Collaborator) commented Oct 8, 2023

You don't need to wait for the OOM; just stop heaptrack after you see unusually high memory usage (100 MB is probably enough).

popey (Author) commented Oct 8, 2023

Oh, too late. By the time I saw this reply it had already run for a while.
It was too big for GitHub's file upload, so I put it here: https://popey.com/~alan/bandwhich-heaptrack.zip

cyqsimon (Collaborator) commented Oct 8, 2023

Incorrect diagnosis. See my newer comment.

Good good. From the backtrace I was able to track the allocations back to a single line.

inode_to_procname.insert(inode, procname.clone());

It seems like your hypothesis is correct @imsnif, albeit with a different hashmap. I'll take a look at how this stored data is used and how best to implement some sort of cleanup.


Edit:

Perhaps I shouldn't rule out the possibility that it's caused by file handles not being closed. I'll make sure that isn't the case first.

cyqsimon (Collaborator) commented Oct 8, 2023

I don't think so.

self.utilization_data.push_back(UtilizationData {
    connections_to_procs,
    network_utilization,
});
if self.utilization_data.len() > RECALL_LENGTH {
    self.utilization_data.pop_front();
}

Any items beyond RECALL_LENGTH (5) in utilization_data get popped, which drops the hashmap. Not to mention that if this were the issue, I think we would be able to reproduce it very consistently.
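As a sanity check, the eviction pattern above does bound memory; a minimal standalone simulation (RECALL_LENGTH's value and the payload type are stand-ins):

```rust
use std::collections::VecDeque;

const RECALL_LENGTH: usize = 5;

fn main() {
    let mut utilization_data: VecDeque<Vec<u8>> = VecDeque::new();
    for _ in 0..10_000 {
        // Each iteration stands in for one UtilizationData snapshot.
        utilization_data.push_back(vec![0u8; 1024]);
        // Evict the oldest snapshot past the recall window; dropping it
        // frees its allocation immediately.
        if utilization_data.len() > RECALL_LENGTH {
            utilization_data.pop_front();
        }
    }
    assert_eq!(utilization_data.len(), RECALL_LENGTH);
    println!("len stays bounded at {}", utilization_data.len());
}
```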


Re-examining the backtrace more carefully, I'm afraid the fear in my previous comment was justified:

[screenshot: bottom-up backtrace]

This is the only instance FDsIter was used:

let Ok(fds) = process.fd() else { continue };

And if we take a look at Process::fd:

[screenshot: Dir::read usage in Process::fd]

Voila. And just to take it one step further, Dir does indeed contain a Vec<u8>, which does not necessarily confirm, but at least does not disagree with, this assessment.


All this is to say that this might be an "upstream's upstream" bug, which is going to be a PITA to deal with. There's one thing we can try immediately, though: procfs v0.15.1 depends on an old version of rustix, v0.35.15. I can try to bump it to the newest version via a temporary patch, and we can let @popey try again with the patched build.

cyqsimon (Collaborator) commented Oct 8, 2023

Okay, can all please try the bump-procfs-rustix branch?

You can either pull the branch and build locally or use the CI build.

If this still does not fix the issue, I will go ahead and submit bug reports to the procfs and rustix repositories. And in the absolute worst-case scenario, we may have actually found a bug in the glibc allocator 🥶.

cyqsimon (Collaborator) commented Oct 8, 2023

Actually, speaking of glibc, can all also try the MUSL build from release v0.21.0? It's circumstantial evidence at best but it might give us some clues.

popey (Author) commented Oct 8, 2023

Left the MUSL build running for 2.5 hours, and no OOM crash yet. Will leave it overnight.

popey (Author) commented Oct 9, 2023

It died at 3am.

[Mon Oct  9 03:08:32 2023] [1060191]     0 1060191 25175202  7098543 76488704  2427744           200 bandwhich
[Mon Oct  9 03:08:32 2023] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-d2adcb5f-28a9-449f-93b3-db3ea12d073b.scope,task=bandwhich,pid=1060191,uid=0
[Mon Oct  9 03:08:32 2023] Out of memory: Killed process 1060191 (bandwhich) total-vm:100700808kB, anon-rss:28393916kB, file-rss:256kB, shmem-rss:0kB, UID:0 pgtables:74696kB oom_score_adj:200
[Mon Oct  9 03:08:35 2023] oom_reaper: reaped process 1060191 (bandwhich), now anon-rss:0kB, file-rss:256kB, shmem-rss:0kB

sigmaSd (Collaborator) commented Oct 9, 2023

@popey please try @cyqsimon's changes in #284. Also, you don't need to wait for it to OOM; if it reaches, say, 500 MB, we can consider that the bug is still there.

popey (Author) commented Oct 9, 2023

OK, trying that build didn't work either. I don't tend to sit and watch it; I leave it running and look at the terminal when I remember. I just looked, and it's using 70GB after 25 minutes.

python3 bin/ps_mem.py | grep bandwhich                                                                                                              nuc: Mon Oct  9 13:47:48 2023

 23.4 GiB +  46.8 GiB =  70.2 GiB       bandwhich

cyqsimon (Collaborator) commented:

@sigmaSd What distro are you on? Trying to establish some datapoints for an upstream bug report.

konnorandrews commented:

Hi all, I went hunting for the bug after cyqsimon brought it up in the community Discord server. It looks like I have found the issue, and it is in rustix's linux_raw backend.

As cyqsimon already pointed out the Dir struct for the linux_raw backend contains a Vec<u8>: https://github.com/bytecodealliance/rustix/blob/main/src/backend/linux_raw/fs/dir.rs#L21.

From the heaptrace the allocation is coming from the Dir::read() method here: https://github.com/bytecodealliance/rustix/blob/main/src/backend/linux_raw/fs/dir.rs#L55

At https://github.com/bytecodealliance/rustix/blob/main/src/backend/linux_raw/fs/dir.rs#L78, the code in read() checks whether it has run out of items and needs to read more; if so, it calls Dir::read_more(). Immediately upon entering Dir::read_more(), it calls resize() on self.buf (https://github.com/bytecodealliance/rustix/blob/main/src/backend/linux_raw/fs/dir.rs#L142). However, it uses this new size:

self.buf.resize(self.buf.capacity() + 32 * size_of::<linux_dirent64>(), 0);

This causes self.buf to grow without bound every time read_more() is called, because of the use of self.buf.capacity(). I believe the issue can be resolved by simply changing self.buf.capacity() to self.buf.len(), since directly afterwards self.buf.resize(nread, 0); is called to truncate the Vec to the actual number of bytes read (note this does not deallocate anything).

In normal operation, self.buf.capacity() should never need to be more than a few multiples of 32 * size_of::<linux_dirent64>(), if not exactly that.
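A standalone simulation of that growth pattern (ENTRY is a made-up stand-in for size_of::<linux_dirent64>() and the "read" amounts are invented; this mimics only the buffer management, not the real getdents calls):

```rust
// ENTRY stands in for size_of::<linux_dirent64>(); the real value differs.
const ENTRY: usize = 280;

// Mimic rustix's Dir::read_more buffer handling for `calls` iterations.
// `grow_from_capacity` selects the buggy (capacity-based) or fixed
// (len-based) growth base. Returns the final buffer capacity.
fn simulate(calls: usize, grow_from_capacity: bool) -> usize {
    let mut buf: Vec<u8> = Vec::new();
    for _ in 0..calls {
        let base = if grow_from_capacity { buf.capacity() } else { buf.len() };
        buf.resize(base + 32 * ENTRY, 0);
        // Pretend the kernel returned a few entries, then truncate to
        // `nread` as rustix does: this shrinks len but not capacity.
        let nread = 3 * ENTRY;
        buf.resize(nread, 0);
    }
    buf.capacity()
}

fn main() {
    let buggy = simulate(10, true);
    let fixed = simulate(10, false);
    // Capacity-based growth compounds on every call, while len-based
    // growth stabilizes after the first couple of allocations.
    assert!(buggy > 4 * fixed);
    println!("after 10 simulated reads: buggy = {buggy} B, fixed = {fixed} B");
}
```

Each iteration the buggy variant asks for at least `capacity() + 32 * ENTRY`, so the capacity ratchets upward on every directory read, which is consistent with a process that slowly balloons until the OOM killer steps in.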

sigmaSd (Collaborator) commented Oct 10, 2023

I can't reproduce on Arch Linux. @cyqsimon, did you try it on one of the affected distros? This would all be easier if one of us could reproduce the issue.

cyqsimon (Collaborator) commented:

Hi all, quick update.

Thanks to the collaborative effort of @sigmaSd, @konnorandrews, and @sunfishcode (rustix maintainer), we were able to diagnose the issue and reproduce the fault condition in rustix directly. They will likely patch it in the next release.

Since we currently depend on procfs 0.15.1, which depends on an older, incompatible version of rustix, we will not be able to update our indirect dependency on rustix immediately. I will notify @eminence (procfs maintainer) as soon as the rustix patch is available.

popey (Author) commented Oct 11, 2023

Nice work everyone!

cyqsimon added a commit that referenced this issue Oct 16, 2023
cyqsimon (Collaborator) commented:

Good news: rustix has backported this fix to all minor versions back to 0.35, which is the first minor version affected. This means we are not blocked on an update from procfs (though I have notified them nonetheless).

Since I am pretty confident that we've fixed the exact problem, I will close this issue as completed. Please reopen and/or comment if you are still seeing it on main. Thanks.

TheRealLorenz added a commit to TheRealLorenz/dotlink that referenced this issue Oct 20, 2023
Includes dependency update to fix security issue: imsnif/bandwhich#284 (comment)
a-kenji added a commit to a-kenji/nix-uri that referenced this issue Oct 29, 2024