
Bandwhich appears to have a serious memory leak #284

Closed
popey opened this issue Sep 9, 2023 · 39 comments
Labels
bug Something isn't working

Comments

popey commented Sep 9, 2023

I built and ran bandwhich on a few machines. After running for a while, it dies. I see this in my logs:

[Sat Sep  9 14:11:34 2023] [ 910813]     0 910813 25425140  6316768 76255232  3166464           200 bandwhich
[Sat Sep  9 14:11:34 2023] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=zerotier-one.service,mems_allowed=0,global_oom,task_memcg=/system.slice/snap.bandwhich.bandwhich-92de3b50-fb9a-430c-9645-eac2cde77fe0.scope,task=bandwhich,pid=910813,uid=0
[Sat Sep  9 14:11:34 2023] Out of memory: Killed process 910813 (bandwhich) total-vm:101700560kB, anon-rss:25267072kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:74468kB oom_score_adj:200
[Sat Sep  9 14:11:37 2023] oom_reaper: reaped process 910813 (bandwhich), now anon-rss:88kB, file-rss:0kB, shmem-rss:0kB

While I've seen this on small remote servers, I have also seen it on my workstation, which has 32GB RAM and 22GB swap.

popey (Author) commented Sep 9, 2023

[screenshot]

Caught a picture from bottom.

Note how the purple line shoots up (that's bandwhich eating RAM), then the yellow line climbs as it eats all the swap, and then the process gets killed by the OOM killer...

cyqsimon (Collaborator) commented Sep 11, 2023

Is this problem happening on main?

Bandwhich itself does not use any unsafe code, so the memory leak was likely caused by a dependency. We've had many dependency bumps since the last release, so if the release version is indeed what you're seeing problems on, this may have already been solved.

popey (Author) commented Sep 11, 2023

I was using whatever was in main when I filed the issue, yes. Happy to test again after the changes that landed today, but that all looks like CD pipeline stuff?

I found it quite easily reproducible on multiple machines. Just leave it running for a while, and eventually it blows up. It might take minutes or hours, perhaps depending on traffic volume.

I built it with Rust 1.70.0, if that makes any difference?

cyqsimon (Collaborator) commented:

> but that all looks like CD pipeline stuff?

Yes, that's true. You don't need to test again =).

> I found it quite easily reproducible on multiple machines. Just leave it running for a while, and eventually it blows up. It might take minutes or hours, perhaps depending on traffic volume.

Curious. I'm no expert in debugging memory issues, but I'll give it a try.

> I built it with Rust 1.70.0, if that makes any difference?

I doubt it.

@cyqsimon cyqsimon added bug Something isn't working help wanted Extra attention is needed labels Sep 11, 2023
sigmaSd (Collaborator) commented Sep 15, 2023

I can't reproduce on my Linux machine. Can you run it in debug mode with flamegraph (https://github.com/flamegraph-rs/flamegraph)? It can give some hints.

cyqsimon (Collaborator) commented:

Cannot reproduce on my box (Linux 6.5.3-arch1-1 x86_64) either. I kept bandwhich running for one hour while I worked, and the memory usage stayed constant around 9.7M.

@cyqsimon cyqsimon added needs reproduction and removed help wanted Extra attention is needed labels Sep 20, 2023
cyqsimon (Collaborator) commented:

Out of curiosity, was all of your testing done on the same distro? @popey

As unlikely as it is, maybe there's something wrong with the allocator or even the kernel?

gms8994 commented Sep 21, 2023

I tested this on a box specifically trying to reproduce it, and could: `memory allocation of 6442450944 bytes failed`.

Sep 21 06:00:39 kernel: [31561.482912] [   1291]  1000  1291   907564   394590  3289088        0             0 bandwhich
Sep 21 06:00:39 kernel: [31561.483047] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-1.scope,task=bandwhich,pid=1291,uid=1000
Sep 21 06:00:39 kernel: [31561.483086] Out of memory: Killed process 1291 (bandwhich) total-vm:3630256kB, anon-rss:1578360kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:3212kB oom_score_adj:0

This is on Ubuntu 22.04.3 LTS with v0.21.0 from the releases tab. I started at around 21:15 on Sep 20, so it ran for ~9 hours before being killed. I'll try and get a flamegraph version today.

terminaldweller commented:

Same issue here with v0.21.0 on Void.
Currently I can't figure out how to run flamegraph due to this:
flamegraph-rs/flamegraph#181
I can post a flamegraph when I manage to get one.

terminaldweller commented:

here's a flamegraph for when bandwhich ends up taking up too much memory:

[flamegraph image]

sigmaSd (Collaborator) commented Oct 7, 2023

I can't seem to zoom in on the individual SVG elements. Anyhow, the big view shows mostly read/iterate/openat, so maybe there is a leak when reading procfs; maybe it needs a large number of connections.

sigmaSd (Collaborator) commented Oct 7, 2023

I tried https://github.com/KDE/heaptrack and it seems very simple and works really well: it shows a graph and the exact functions, and which ones leaked.

Note: `heaptrack bandwhich` didn't work for me; I had to start bandwhich first, then use `heaptrack -p <pid of bandwhich>`.

sigmaSd (Collaborator) commented Oct 7, 2023

It shows 0 leaks for me

sigmaSd (Collaborator) commented Oct 7, 2023

I kept the program running longer; I do have some small leaks.

heaptrack is making me suspect https://github.com/imsnif/bandwhich/blob/main/src/display/ui.rs#L184. It seems like we keep updating the state, which keeps updating the utilization map (https://github.com/imsnif/bandwhich/blob/main/src/network/utilization.rs#L27) without ever removing from it.

Maybe when there is a large number of connections, this causes the leak to show.

sigmaSd (Collaborator) commented Oct 7, 2023

Ah, I see: to make the leaks show up, you need to attach heaptrack to bandwhich, and then close heaptrack, not bandwhich.

It seems RAII is working correctly and everything gets cleaned up at exit, but if we disconnect early, heaptrack shows us the places that accumulate memory over the lifetime of the program, like the state updating above.

sigmaSd (Collaborator) commented Oct 7, 2023

We do have this function, which is indeed called: https://github.com/imsnif/bandwhich/blob/main/src/network/utilization.rs#L24

But it uses HashMap::clear, whose documentation says it clears the entries but keeps the allocated memory; maybe that's what we're looking for.
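For illustration, a minimal standalone sketch of the behavior the docs describe (the map contents here are made up):

```rust
use std::collections::HashMap;

fn main() {
    // Fill a map with throwaway entries to force a large allocation.
    let mut map: HashMap<u64, String> = HashMap::new();
    for i in 0..100_000 {
        map.insert(i, format!("connection-{i}"));
    }
    let cap_before = map.capacity();

    // `clear` drops every entry but keeps the backing allocation for reuse.
    map.clear();
    assert!(map.is_empty());
    assert!(map.capacity() >= cap_before);

    // Replacing the map with a fresh one releases the old allocation instead.
    map = HashMap::new();
    assert_eq!(map.capacity(), 0);
    println!("capacity retained by clear(): {cap_before}");
}
```

So a long-lived map that periodically balloons and is only ever `clear()`ed will hold on to its high-water-mark allocation indefinitely.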

sigmaSd (Collaborator) commented Oct 7, 2023

> Ah, I see: to make the leaks show up, you need to attach heaptrack to bandwhich, and then close heaptrack, not bandwhich.

Note that this has a lot of false positives, though.

cyqsimon (Collaborator) commented Oct 8, 2023

So if I understand correctly, what you are saying @sigmaSd is that this is not so much a leak as it is simply keeping a value long past its usefulness?

cyqsimon (Collaborator) commented Oct 8, 2023

@popey @gms8994 @terminaldweller Can you please reproduce the severe memory usage issue again while tracing using heaptrack, and upload the output file? This will help us verify @sigmaSd's hypothesis.

sudo heaptrack bandwhich

Thanks in advance.

sigmaSd (Collaborator) commented Oct 8, 2023

> So if I understand correctly, what you are saying @sigmaSd is that this is not so much a leak as it is simply keeping a value long past its usefulness?

Yes, but it's more like keeping the memory past its usefulness.

Also, maybe someone who can reproduce the issue can try changing https://github.com/imsnif/bandwhich/blob/main/src/network/utilization.rs#L24 to `self.connections = HashMap::new();`. If that's indeed where the biggest issue is, forcing the hashmap to release its memory should have a noticeable effect.
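A sketch of what that change could look like; the struct and type names here are illustrative stand-ins for bandwhich's actual types, not its real API:

```rust
use std::collections::HashMap;

// Illustrative stand-ins: the real key/value types live in bandwhich's
// network module and are more involved.
type Connection = u32;
type ConnectionInfo = u64;

struct Utilization {
    connections: HashMap<Connection, ConnectionInfo>,
}

impl Utilization {
    // `self.connections.clear()` would keep the backing allocation alive;
    // replacing the map lets the old allocation be freed instead.
    fn reset(&mut self) {
        self.connections = HashMap::new();
    }
}

fn main() {
    let mut util = Utilization { connections: HashMap::new() };
    for i in 0..50_000 {
        util.connections.insert(i, u64::from(i));
    }
    util.reset();
    // Empty *and* deallocated: a fresh map reports zero capacity.
    assert_eq!(util.connections.capacity(), 0);
    println!("capacity after reset: {}", util.connections.capacity());
}
```

A middle ground would be calling `shrink_to_fit` after `clear`, which keeps the method's semantics but still returns memory to the allocator.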

sigmaSd (Collaborator) commented Oct 8, 2023

Though this alone doesn't explain why it would fail to allocate memory.

popey (Author) commented Oct 8, 2023

@cyqsimon sure thing. I've left bandwhich running under heaptrack on my desktop. It has 64GB RAM, and this OOM usually happens unexpectedly after some hours, so I will report back if/when it happens.

sigmaSd (Collaborator) commented Oct 8, 2023

You don't need to wait for the OOM; just stop heaptrack after you see unusually high memory usage (100 MB is probably enough).

popey (Author) commented Oct 8, 2023

Oh, too late. By the time I saw this reply it had already run for a while.
It was too big for GitHub's file upload, so I put it here: https://popey.com/~alan/bandwhich-heaptrack.zip

cyqsimon (Collaborator) commented Oct 8, 2023

Incorrect diagnosis. See my newer comment.

Good good. From the backtrace I was able to track the allocations back to a single line.

inode_to_procname.insert(inode, procname.clone());

It seems like your hypothesis is correct @imsnif, albeit with a different hashmap. I'll take a look at how this stored data is used and how best to implement some sort of cleanup.


Edit:

Perhaps I shouldn't rule out the possibility that it's caused by file handles not being closed. I'll make sure that isn't the case first.

cyqsimon (Collaborator) commented Oct 8, 2023

I don't think so.

self.utilization_data.push_back(UtilizationData {
    connections_to_procs,
    network_utilization,
});
if self.utilization_data.len() > RECALL_LENGTH {
    self.utilization_data.pop_front();
}

Any items beyond RECALL_LENGTH (5) in utilization_data get popped, which drops the hashmap. Not to mention that if this were the issue, I think we would be able to reproduce it very consistently.
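As a sanity check, the eviction pattern above does bound memory; a minimal standalone simulation (RECALL_LENGTH's value and the payload type are stand-ins):

```rust
use std::collections::VecDeque;

const RECALL_LENGTH: usize = 5;

fn main() {
    let mut utilization_data: VecDeque<Vec<u8>> = VecDeque::new();
    for _ in 0..10_000 {
        // Each iteration stands in for one UtilizationData snapshot.
        utilization_data.push_back(vec![0u8; 1024]);
        // Evict the oldest snapshot past the recall window; dropping it
        // frees its allocation immediately.
        if utilization_data.len() > RECALL_LENGTH {
            utilization_data.pop_front();
        }
    }
    assert_eq!(utilization_data.len(), RECALL_LENGTH);
    println!("len stays bounded at {}", utilization_data.len());
}
```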


Re-examining the backtrace more carefully, I'm afraid the fear in my previous comment was justified:

[screenshot: bottom-up backtrace]

This is the only instance FDsIter was used:

let Ok(fds) = process.fd() else { continue };

And if we take a look at Process::fd:

[screenshot: Dir::read usage in Process::fd]

Voila. And just to take it one step further, Dir does indeed contain a Vec<u8>, which does not necessarily confirm, but at least does not disagree with, this assessment.


All this is to say that this might be an "upstream's upstream" bug, which is going to be a PITA to deal with. There's one thing we can try immediately, though: procfs v0.15.1 depends on an old version of rustix, v0.35.15. I can try to bump it to the newest version via a temporary patch, and we can let @popey try again with the patched build.

cyqsimon (Collaborator) commented Oct 8, 2023

Okay, can all please try the bump-procfs-rustix branch?

You can either pull the branch and build locally or use the CI build.

If this still does not fix the issue, I will go ahead and submit bug reports to the procfs and rustix repositories. And in the absolute worst-case scenario, we may have actually found a bug in the glibc allocator 🥶.

cyqsimon (Collaborator) commented Oct 8, 2023

Actually, speaking of glibc, can all also try the MUSL build from release v0.21.0? It's circumstantial evidence at best but it might give us some clues.

popey (Author) commented Oct 8, 2023

Left the MUSL build running for 2.5 hours, and no OOM crash yet. Will leave it overnight.

popey (Author) commented Oct 9, 2023

It died at 3am.

[Mon Oct  9 03:08:32 2023] [1060191]     0 1060191 25175202  7098543 76488704  2427744           200 bandwhich
[Mon Oct  9 03:08:32 2023] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-d2adcb5f-28a9-449f-93b3-db3ea12d073b.scope,task=bandwhich,pid=1060191,uid=0
[Mon Oct  9 03:08:32 2023] Out of memory: Killed process 1060191 (bandwhich) total-vm:100700808kB, anon-rss:28393916kB, file-rss:256kB, shmem-rss:0kB, UID:0 pgtables:74696kB oom_score_adj:200
[Mon Oct  9 03:08:35 2023] oom_reaper: reaped process 1060191 (bandwhich), now anon-rss:0kB, file-rss:256kB, shmem-rss:0kB

sigmaSd (Collaborator) commented Oct 9, 2023

@popey please try @cyqsimon's changes in #284. Also, you don't need to wait for it to OOM; if it reaches, say, 500 MB, we can consider that the bug is still there.

popey (Author) commented Oct 9, 2023

OK, trying that build didn't work either. I don't tend to sit and watch it; I leave it running and look at the terminal when I remember. I just looked, and it's using 70GB after 25 minutes.

python3 bin/ps_mem.py | grep bandwhich                                                                                                              nuc: Mon Oct  9 13:47:48 2023

 23.4 GiB +  46.8 GiB =  70.2 GiB       bandwhich

cyqsimon (Collaborator) commented:

@sigmaSd What distro are you on? Trying to establish some datapoints for an upstream bug report.

konnorandrews commented:

Hi all, I went hunting for the bug after cyqsimon brought it up in the community Discord server. It looks like I have found the issue, and it is in rustix's linux_raw backend.

As cyqsimon already pointed out the Dir struct for the linux_raw backend contains a Vec<u8>: https://github.com/bytecodealliance/rustix/blob/main/src/backend/linux_raw/fs/dir.rs#L21.

From the heaptrace the allocation is coming from the Dir::read() method here: https://github.com/bytecodealliance/rustix/blob/main/src/backend/linux_raw/fs/dir.rs#L55

At https://github.com/bytecodealliance/rustix/blob/main/src/backend/linux_raw/fs/dir.rs#L78, the code in read() checks whether it has run out of items and needs to read more; if so, it calls Dir::read_more(). Immediately upon entering Dir::read_more(), it calls resize() on self.buf (https://github.com/bytecodealliance/rustix/blob/main/src/backend/linux_raw/fs/dir.rs#L142). However, it uses this new size:

self.buf.resize(self.buf.capacity() + 32 * size_of::<linux_dirent64>(), 0);

This causes self.buf to grow without bound every time read_more() is called, because of the use of self.buf.capacity(). I believe the issue can be resolved by simply changing self.buf.capacity() to self.buf.len(), since directly afterwards self.buf.resize(nread, 0); is called to truncate the Vec to the actual number of bytes read (note this does not deallocate anything).

In normal operation, self.buf.capacity() should never need to be more than a few multiples of 32 * size_of::<linux_dirent64>(), if not exactly that.
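A standalone simulation of that growth pattern (ENTRY is a made-up stand-in for size_of::<linux_dirent64>() and the "read" amounts are invented; this mimics only the buffer management, not the real getdents calls):

```rust
// ENTRY stands in for size_of::<linux_dirent64>(); the real value differs.
const ENTRY: usize = 280;

// Mimic rustix's Dir::read_more buffer handling for `calls` iterations.
// `grow_from_capacity` selects the buggy (capacity-based) or fixed
// (len-based) growth base. Returns the final buffer capacity.
fn simulate(calls: usize, grow_from_capacity: bool) -> usize {
    let mut buf: Vec<u8> = Vec::new();
    for _ in 0..calls {
        let base = if grow_from_capacity { buf.capacity() } else { buf.len() };
        buf.resize(base + 32 * ENTRY, 0);
        // Pretend the kernel returned a few entries, then truncate to
        // `nread` as rustix does: this shrinks len but not capacity.
        let nread = 3 * ENTRY;
        buf.resize(nread, 0);
    }
    buf.capacity()
}

fn main() {
    let buggy = simulate(10, true);
    let fixed = simulate(10, false);
    // Capacity-based growth compounds on every call, while len-based
    // growth stabilizes after the first couple of allocations.
    assert!(buggy > 4 * fixed);
    println!("after 10 simulated reads: buggy = {buggy} B, fixed = {fixed} B");
}
```

Each iteration the buggy variant asks for at least `capacity() + 32 * ENTRY`, so the capacity ratchets upward on every directory read, which is consistent with a process that slowly balloons until the OOM killer steps in.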

sigmaSd (Collaborator) commented Oct 10, 2023

I can't reproduce on Arch Linux. @cyqsimon, did you try it on one of the affected distros? This would all be easier if one of us could reproduce the issue.

cyqsimon (Collaborator) commented:

Hi all, quick update.

Thanks to the collaborative effort of @sigmaSd, @konnorandrews, and @sunfishcode (rustix maintainer), we were able to diagnose the issue and reproduce the fault condition in rustix directly. They will likely patch it in the next release.

Since we currently depend on procfs 0.15.1, which depends on an older, incompatible version of rustix, we will not be able to update our indirect dependency on rustix immediately. I will notify @eminence (procfs maintainer) as soon as the rustix patch is available.

popey (Author) commented Oct 11, 2023

Nice work everyone!

cyqsimon added a commit that referenced this issue Oct 16, 2023
cyqsimon (Collaborator) commented:

Good news: rustix has backported this fix to all minor versions back to 0.35, which is the first minor version affected. This means we are not blocked on an update from procfs (though I have notified them nonetheless).

Since I am pretty confident that we've fixed the exact problem, I will close this issue as completed. Please reopen and/or comment if you are still seeing it on main. Thanks.

TheRealLorenz added a commit to TheRealLorenz/dotlink that referenced this issue Oct 20, 2023
Includes dependency update to fix security issue: imsnif/bandwhich#284 (comment)
a-kenji added a commit to a-kenji/nix-uri that referenced this issue Oct 29, 2024