procfs::process::all_processes() can blow "open files" ulimit on a system with a large number of processes #188

Closed
obaudys opened this issue Jul 22, 2022 · 7 comments

obaudys commented Jul 22, 2022

Looks like the changes in pull request #171 make the procfs::process::Process struct hold an open file descriptor for the corresponding /proc/<pid> directory, and stats are then lazily loaded with methods.

Unfortunately, keeping a lot of open file descriptors around can hit the "max open files" ulimit, which is 1024 by default on most Linux installs. This happened to me today while playing around with nushell: the ps command did not work because it had hit the open files limit.

The code in question is this snippet, where procfs::process::all_processes() is called and then additional Process instances are created from the results, which really explodes the number of open file descriptors.

Unfortunately I think the change in #171 - i.e. wrapping an open file descriptor in a struct and holding it open until such time as it is needed - is a bad idea. Why not just hold the name of the /proc/<pid> directory as a String? Convention dictates that a file is opened, data is read, and the file is closed in rapid succession. This is effectively what the methods in impl Process do, except for the open file descriptor for the directory!
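
To illustrate the convention I mean, here is a minimal sketch using std::fs directly (read_stat is a made-up helper, not anything in procfs):

use std::fs;

// Open, read, close in rapid succession: the file handle behind
// read_to_string lives only for the duration of the call, so no FD
// outlives it.
fn read_stat(pid: i32) -> std::io::Result<String> {
    fs::read_to_string(format!("/proc/{}/stat", pid))
}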

To reproduce the issue, I offer this butchered version of examples/ps.rs:

#![allow(clippy::print_literal)]

use procfs::process::Process;

fn main() {
    let mestat = Process::myself().unwrap().stat().unwrap();
    let tps = procfs::ticks_per_second().unwrap();

    println!("{: >10} {: <8} {: >8} {}", "PID", "TTY", "TIME", "CMD");

    let tty = format!("pty/{}", mestat.tty_nr().1);
    let mut persist = Vec::new();
    match procfs::process::all_processes() {
        Ok(all_procs) => {
            for p in all_procs {
                match p {
                    Ok(prc) => {
                        if let Ok(stat) = prc.stat() {
                            // total_time is in seconds
                            let total_time = (stat.utime + stat.stime) as f32 / (tps as f32);
                            println!("{: >10} {: <8} {: >8} {}", stat.pid, tty, total_time, stat.comm);
                        }
                        let newproc = match Process::new(prc.pid()) {
                            Ok(p) => p,
                            Err(e) => panic!("{}",e) // cannot create new Process!
                        };
                        // don't let processes go out of scope, to trigger "Too many open files"
                        persist.push(newproc);
                    },
                    Err(e) => {
                        println!("{}", e);
                    }
                }
            }
        },
        Err(e) => println!("{}",e)
    }
}

Run this with a suitably low ulimit, e.g.:

ulimit -n 256
cargo run --example bad-ps

Note: I'm creating new Process instances and pushing them into a Vec to trigger a panic for illustrative purposes. If I instead push the original prc objects into the Vec, the ProcessesIter iterator simply finishes early. Either way, the intent is to show that if the Process values returned by all_processes() are not dropped, the open files limit will be reached.
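
For contrast, the pattern that stays under the limit is to map over the iterator, pull out plain data, and let each Process drop as you go. A sketch (the (pid, comm) fields are just an example):

use procfs::process::all_processes;

// Extract plain data; each Process (and its FD) is dropped as the
// iterator advances, so only a handful of FDs are ever open at once.
fn main() -> Result<(), procfs::ProcError> {
    let snapshot: Vec<(i32, String)> = all_processes()?
        .filter_map(|p| p.ok())
        .filter_map(|prc| prc.stat().ok().map(|s| (s.pid, s.comm)))
        .collect();
    // `snapshot` holds no file descriptors, so it can grow to thousands
    // of entries without approaching the ulimit.
    println!("{} processes", snapshot.len());
    Ok(())
}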


obaudys commented Jul 22, 2022

Ugh, I can see this issue was already foreseen in #125, in this comment in particular.

> How does returning an iterator help? If the caller immediately collects everything into a Vec, we would still have an open FD for every running process, right?

> Yes. The idea is that for implementing something like top, programs would map over the iterator and pull out the information they need into a different structure. On Linux the default max fds open is 1024, and it's definitely possible to have more than 1024 processes open, so we probably want to avoid pushing up against that limit.

Does this mean the consumer is responsible for making sure the Process objects are short lived? In that case I'll raise a task in nushell. Some documentation highlighting these responsibilities would be nice.

eminence (owner) commented:

Hi, thanks for raising the issue. Yes, the consumer is responsible for making sure the Process objects are short-lived, if that makes sense for the use-case.

The use-case you described (with nushell's ps command) definitely seems like the type of use case where it makes sense to collect all of the necessary information and then drop the Process. However, now that I've read the code in question, it's a little more subtle: Nushell is getting a list of all processes, waiting 100ms, and then "refreshing" that list to get updated stats (I think to compute IO deltas, but I don't see that happening anywhere). The natural implementation would be to keep the Process objects around so you can query them again, but that obviously doesn't work well here.

(I also note that their code looks to be written from before #171 was merged, so they're not taking advantage of the benefits of #171 and are getting only the downsides)

There are other use-cases where it makes sense to keep a Process around for a while, perhaps because you want to continually monitor it. The advantage of the current implementation is that there is never a risk that calling process.stat() will return data for the wrong process, which could happen before #171 was merged: if a PID was recycled between reads, a fresh path-based open of /proc/<pid>/stat could silently pick up the new process, whereas the held directory FD stays pinned to the original one.

So my first inclination is for nushell to adjust their code to account for this. If you open an issue with nushell, I'd be grateful if you could CC me; I'm interested in what the nushell team thinks.
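
For instance, nushell could drop the Process objects after the first sample and re-open by PID for the second. A sketch (the utime+stime delta just stands in for whatever stats they actually diff):

use std::{thread, time::Duration};
use procfs::process::{all_processes, Process};

fn main() -> Result<(), procfs::ProcError> {
    // First sample: collect plain numbers, dropping every Process (and
    // its FD) as we go.
    let first: Vec<(i32, u64)> = all_processes()?
        .filter_map(|p| p.ok())
        .filter_map(|p| p.stat().ok().map(|s| (s.pid, s.utime + s.stime)))
        .collect();

    thread::sleep(Duration::from_millis(100));

    // Second sample: re-open each PID; the process may have exited in
    // the meantime, in which case we just skip it.
    for (pid, t0) in first {
        if let Ok(stat) = Process::new(pid).and_then(|p| p.stat()) {
            let delta = (stat.utime + stat.stime).saturating_sub(t0);
            println!("pid {}: {} ticks over the interval", pid, delta);
        }
    }
    Ok(())
}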


obaudys commented Jul 23, 2022

I see the purpose of the current Process objects having their stats evaluated lazily: we can get updated process stats each time we call a method on the struct. But this is a big footgun due to file descriptor usage.

Wouldn't it be better to have essentially two APIs: an eagerly evaluated Process that scrapes all stats at instantiation, then closes the FD and never updates; and the current Process, which refreshes stats via method calls? I'll go further and say the eager version should be the default API. Both the eager, static version and the lazy, updatable version could implement the same trait with the stat(), io(), etc. methods.
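
Something like this hypothetical design (none of these names exist in procfs; StaticProcess and ProcessInfo are made up, and it assumes Stat implements Clone):

use procfs::ProcResult;
use procfs::process::{Process, Stat};

// Hypothetical shared trait; io(), status(), etc. would follow the
// same shape.
trait ProcessInfo {
    fn stat(&self) -> ProcResult<Stat>;
}

// Eager variant: everything is read at construction, the FD is closed
// immediately, and the data never updates.
struct StaticProcess {
    stat: Stat,
}

impl StaticProcess {
    fn new(pid: i32) -> ProcResult<Self> {
        let prc = Process::new(pid)?; // opens /proc/<pid>
        let stat = prc.stat()?;
        Ok(StaticProcess { stat }) // `prc` dropped here, FD closed
    }
}

impl ProcessInfo for StaticProcess {
    fn stat(&self) -> ProcResult<Stat> {
        Ok(self.stat.clone())
    }
}

// Lazy variant: the existing Process, which re-reads on every call and
// keeps its directory FD open for its whole lifetime.
impl ProcessInfo for Process {
    fn stat(&self) -> ProcResult<Stat> {
        Process::stat(self) // resolves to the inherent method
    }
}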


eminence commented Jul 23, 2022

> we can get updated process stats each time we call a method on the struct. But this is a big footgun due to file descriptor usage.

Can you say more about this, and where you see the potential footgun?

(Edit: not sure whether the footgun is stat() itself or the FD held within the Process)


obaudys commented Jul 23, 2022

Maybe footgun is too harsh a term, but having an FD inside the Process, as a private field with no documentation of this implementation detail (yet), leads to unexpected behavior. A reasonable person would expect to be able to create thousands of Process instances and push them into a Vec without running out of file descriptors.

eminence commented:

Ah, yes, that is quite right. The documentation is clearly insufficient here, and I'll work on that this weekend.

You're also right that we could provide some additional API that grabs all available info about a process and then closes the FD. I'm not inclined to provide one at the moment because it's hard to know what data to load and I think it's too inefficient to load everything. I'd rather have consumers of this API write their own wrappers that fit their need.

(For example, the previous API always loaded the /proc/pid/stat structure, but this was wasteful for consumers that never needed this info)
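
As a concrete example, such a consumer-side wrapper might look like this (ProcSample is a hypothetical type, not part of procfs; it captures only what this particular consumer needs):

use procfs::ProcResult;
use procfs::process::{Io, Process, Stat};

struct ProcSample {
    stat: Stat,
    io: Option<Io>, // io() can fail for processes we can't inspect
}

impl ProcSample {
    fn capture(pid: i32) -> ProcResult<Self> {
        let prc = Process::new(pid)?;
        let stat = prc.stat()?;
        let io = prc.io().ok();
        Ok(ProcSample { stat, io }) // `prc` dropped here, FD closed
    }
}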


obaudys commented Jul 27, 2022

I'd say the addition of the documentation, highlighting the fact that there are open file descriptors, is a satisfactory resolution to this issue.

@obaudys obaudys closed this as completed Jul 27, 2022