fd can get stuck when ran onto the whole FS from the root #288

Closed
Porkepix opened this issue Apr 23, 2018 · 40 comments · Fixed by #325

@Porkepix

Porkepix commented Apr 23, 2018

The command I ran is just fd foobar /. It gets stuck and never ends. Sadly, I was unable to work out what is causing it.
I can investigate further if you have ideas about what else I could check.

Below is the partition setup:

# lsblk
NAME           MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda              8:0    0 223,6G  0 disk  
├─sda1           8:1    0   512M  0 part  /boot/efi
├─sda2           8:2    0   200M  0 part  
│ └─cryptboot  254:3    0   198M  0 crypt /boot
└─sda3           8:3    0 222,9G  0 part  
  └─lvm        254:0    0 222,9G  0 crypt 
    ├─vg0-swap 254:1    0     8G  0 lvm   [SWAP]
    └─vg0-root 254:2    0 214,9G  0 lvm   /

EDIT: The issue reproduces every time on this Arch Linux setup. I couldn't reproduce it on another Arch Linux machine with an unencrypted system.

@sharkdp
Owner

sharkdp commented Apr 23, 2018

I can reproduce this on my PC. This seems to be a permission problem with files in /proc and /sys.

Could you try to run

fd -E /proc -E /sys foobar /

where -E (--exclude) excludes these two folders?

(Note: you might have to exclude some other mount points as well)

@Porkepix
Author

Indeed this does work. Strange that my personal server doesn't have the same issue though.

@Porkepix
Author

Oh, and the issue was present even when running fd as root. So I'm not sure about the permission idea.

@sharkdp
Owner

sharkdp commented Apr 23, 2018

Oh, and the issue was present even when running fd as root. So I'm not sure about the permission idea.

I see, interesting. Something else must be going on inside /proc and/or /sys. If anybody has an idea on how to fix this, I'd be glad for any hints.

@Porkepix
Author

Excluding only /proc is enough, so the problem comes from something inside /proc.
With 343 items inside /proc, that is a lot of things to test, though.

@Porkepix
Author

Even more interesting.
fd -E "/proc/[0-9]*" foobar / succeeds.
fd -E "/proc/[0-9]*" foobar /proc fails.

@sharkdp
Owner

sharkdp commented Apr 23, 2018

Wait, this also seems to change from time to time. Right now, fd foobar / works perfectly fine on my machine(?!?).

@sharkdp
Owner

sharkdp commented Apr 23, 2018

I suspect this could be caused by infinite recursion within /proc (see here or here).

Quote:

Avoiding symbolic links may not be sufficient for avoiding "infinite" recursion in /proc. On the bright side, "infinite" is bounded by PATH_MAX...

Unfortunately, I cannot reproduce the problem at the moment. Could you try

fd --max-depth 50 foobar /

?

@sharkdp
Owner

sharkdp commented May 3, 2018

I'm going to close this (for now), as there is no further feedback. Feel free to comment here if this should be re-opened.

The workaround for now is to explicitly exclude /proc via

fd -E /proc ...

@sharkdp sharkdp closed this as completed May 3, 2018
@Porkepix
Author

Porkepix commented May 3, 2018

Sorry, I didn't know you were waiting for input: the request was added with an edit, so I wasn't notified, and the original comment was not asking for input. Please post a new comment in such a situation.

I tested what you asked for: it didn't solve the issue, so I guess this isn't infinite recursion.
And while it hangs like that, the CPU is running every single core at 100%.

I guess we could then reopen it?

EDIT: Even though I opened the issue, I can't reopen it myself.

@sharkdp sharkdp reopened this May 3, 2018
@Porkepix
Author

Porkepix commented May 3, 2018

By the way, this was on Linux computers.
I'm currently running it on an old Mac. I expect it to take a long time (old HDD in bad shape, no SSD), so I'll let it run, but I suspect that when I'm back this evening the computer will be frozen and out of memory: it has been running for about 10 minutes, CPU usage is only around 10%, but memory usage increases constantly. It started low and is already at 140 MB.

EDIT: CPU usage dropped to 1.5-2%, but it still continues.

EDIT2: This might not be the same bug on the Mac. 30 minutes later, it found a file with "foobar" in the name (!) at the 9th level of depth. So I'll see this evening whether it finished correctly. The HDD is rather slow, so fd is probably limited by disk I/O rather than by CPU, unlike on the SSD-based computers.
However, is it normal for memory usage to increase constantly over time?

@Porkepix
Author

Porkepix commented May 3, 2018

So, not reproduced on the mac, it just took quite some time.

@sharkdp
Owner

sharkdp commented May 3, 2018

Thank you for investigating.

To summarize (correct me if I'm wrong):

  • fd can get stuck in an infinite (or very long?) loop when searching /.
  • This seems to be caused by something weird going on in /proc.
  • The problem is not 100% reproducible (I'm pretty sure I saw this on my Linux machine, but I can not reproduce it now - nor on three other Linux machines).
  • The problem can be prevented by excluding /proc or /proc/[0-9]*.

@Porkepix
Author

Porkepix commented May 4, 2018

Well, that's pretty much it, except that I noticed a weird case in comment #288 (comment).
It looks like excluding /proc/[0-9]* and searching in / makes it work, but excluding the same thing and searching in /proc makes it fail… which makes no sense to me. I was trying to find out whether there was a culprit file and stopped when I discovered this, because it didn't make any sense.

The exact same computer I ran the tests on yesterday morning now succeeds on fd foobar /… I'll try to test the other one that had the issue when I get to work.

@Porkepix
Author

Porkepix commented May 4, 2018

Yeah, this seems really random. The laptop, which didn't hang 2 hours ago, now does it again, and the desktop doesn't anymore…
On the laptop it spawns 10 threads, and currently 4 of them are each running a core at 100%. Last week I saw all of my threads at 100%.

Do you have any idea how to debug that, i.e. how to see where it hangs?

And this is kind of off topic, but per comment #288 (comment), is it normal that memory usage increased during the whole search on the Mac?

@sharkdp
Owner

sharkdp commented May 4, 2018

Do you have any idea how to debug that, i.e. how to see where it hangs?

Maybe fd . /?

And this is kind of off topic, but per comment #288 (comment), is it normal that memory usage increased during the whole search on the Mac?

I'm not sure, I will look into it.

@Porkepix
Author

Porkepix commented May 4, 2018

Okay, so running fd . / on the desktop computer reliably fails at /proc/11364/task/11364/net.
That process is a Firefox one: clement 11364 0.0 0.0 0 0 ? Z 10:22 0:00 [firefox] <defunct>

The laptop's case, however, is more complicated. It seems to stop randomly at an arbitrary file. I can share some of them here if you think that's useful, but I doubt it. Maybe running with a single thread would help. I don't know why one computer is affected while the other isn't, though.

@Porkepix
Author

Porkepix commented May 4, 2018

Okay, actually I just RTFM'd a bit and used -j 1.

Got stuck on /proc/19848/net

ps aux gives this:
clement 19848 0.0 0.0 0 0 ? Z 08:46 0:00 [pingsender] <defunct>

@Porkepix
Author

Porkepix commented May 4, 2018

Oh, and surprisingly, using -j 1 still shows two processes. Both use 100% of the CPU.

And one last thing that I have no explanation for: sometimes when the bug triggers, it "breaks" the terminal I use, Terminology, i.e. the display is still refreshed, but I can't type anymore, even in newly opened windows. This only happened once or twice, and it was fixed when I pkill'd the process from elsewhere.

EDIT: When it happens, Ctrl+C doesn't do the trick at first, but it seems to work if you wait a long time and then retry.

@sharkdp
Owner

sharkdp commented May 10, 2018

Thank you very much for your investigation!

It looks like you are on to something with the <defunct> processes! I can now reliably reproduce this by creating a zombie process on purpose:

  1. Run fd foobar /proc => everything is fine
  2. Copy the code from https://stackoverflow.com/a/25228579/704831 into a file called zombie.c
  3. Compile it: gcc -o zombie zombie.c.
  4. Create the zombie process: ./zombie.
  5. Run ps -ef | grep defunct in a new terminal. It should show [zombie] <defunct>.
  6. Run fd foobar /proc while zombie is still running. It will hang.
  7. Stop zombie and run fd foobar /proc again => everything is fine.

Experiment 2:

  1. Run fd foobar / => everything ok
  2. Create the zombie process: ./zombie.
  3. Get the PID of the defunct process via ps -ef | grep defunct
  4. Call fd foobar / -E /proc/<PID> => everything ok
  5. Call fd foobar / => it hangs

@sharkdp
Owner

sharkdp commented May 10, 2018

(see the linked ticket in ripgrep for some further debugging)

@sharkdp
Owner

sharkdp commented May 10, 2018

This might be a bug in the ReadDir iterator in Rust's standard library. The issue has been reported here: rust-lang/rust#50619

@sharkdp
Owner

sharkdp commented Jul 2, 2018

My pull request, which fixes a bug in Rust's standard library, has been merged (rust-lang/rust#50630). Now we have to wait for the next Rust release in order to fix this bug in fd.

@sharkdp
Owner

sharkdp commented Aug 3, 2018

Actually, we have to wait for Rust 1.29. I forgot about the beta stage. This bug is fixed when compiling fd with the current rustc 1.29.0-beta.1.

sharkdp added a commit that referenced this issue Sep 17, 2018
This upgrades the minimum required version of Rust to 1.29 in order to
fix #288.

See also:
- Rust compiler bug ticket: rust-lang/rust#50619
- Rust compiler PR with the fix: rust-lang/rust#50630

closes #288
sharkdp added a commit that referenced this issue Sep 18, 2018
This upgrades the minimum required version of Rust to 1.29 in order to
fix #288.

See also:
- Rust compiler bug ticket: rust-lang/rust#50619
- Rust compiler PR with the fix: rust-lang/rust#50630

closes #288
@sharkdp
Owner

sharkdp commented Sep 18, 2018

This is finally fixed! ✨

@sharkdp
Owner

sharkdp commented Oct 27, 2018

Fix released in fd-7.2.0.

@monkeyt00l

This issue still seems to exist in version 8.6.0.

Running "cd /proc && fd teststring" makes fd freeze and use up all CPU cores (I cancelled the operation after a minute).

@tavianator
Collaborator

tavianator commented Dec 11, 2022

Interesting, I can't reproduce that locally. fd -L hangs, but that's somewhat expected since /proc/self/root is a symlink to /. Do you possibly have an alias like fd=fd -L?

Otherwise, does it still reproduce with fd -j1? Can you paste some of the output of cd /proc && strace -f fd teststring once it hangs?

@monkeyt00l

Running 'fd -j1' lists files until it stops at 24677/map_files/.
24677 was the process id of firefox (/usr/lib/firefox/firefox -contentproc)

I suspect this may have to do with the sandboxing features in firefox, since it also uses seccomp and blocks ptrace.
Other sandboxed apps behave similarly and for any process that uses seccomp, I usually need root privileges to do something like 'lsof -p '

However, this seems hard to reproduce reliably, since I could not reproduce it on a clean test install.

The strace log shows a bunch of lines that repeat endlessly: https://gist.github.com/monkeyt00l/30a2bdcd3544db3fbc896acc934bbc30

The pids in the log also seem to belong to firefox content processes

@tmccombs
Collaborator

Hmm, from looking at the strace output I wonder if this is related to #1186. It seems like in both cases fd is getting stuck in an infinite loop due to an error (in this case, a permission error).

@tavianator
Collaborator

I hope so! EACCES is much easier to debug than EIO :)

@tavianator
Collaborator

I can reproduce this now:

tavianator@graphene$ (sleep 1& (sleep 2 && fd . /proc/${!}/net --show-errors)& exec /bin/sleep 3)
[fd error]: Invalid argument (os error 22)
[fd error]: Invalid argument (os error 22)
[fd error]: Invalid argument (os error 22)
[fd error]: Invalid argument (os error 22)
[fd error]: Invalid argument (os error 22)
[fd error]: Invalid argument (os error 22)
[fd error]: Invalid argument (os error 22)
[fd error]: Invalid argument (os error 22)
[fd error]: Invalid argument (os error 22)
[fd error]: Invalid argument (os error 22)
...

That command creates a zombie process (the sleep 1) by replacing the shell with a command that won't wait() for its children (the exec /bin/sleep 3). In the meantime, we wait for the zombie to die and then run fd in its /proc/<PID>/net directory. For a zombie process, the open() will succeed but readdir() will fail with EINVAL. This is key to triggering the error.

Those with a long memory might remember the bug rust-lang/rust#50619, which @sharkdp filed and then fixed as a result of this bug.

Unfortunately, some silly programmer named @tavianator reintroduced the bug in rust-lang/rust#92778. Or to be a little more charitable, the original fix only applied to some platforms, of which Linux used to be one. But now Linux uses a different ReadDir implementation that is better in many ways but regressed this bug. Oops!

I guess I'll fix it in Rust, unless someone beats me to it.
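
To make the failure mode concrete: if a directory stream reports the same error on every call, any caller that logs the error and keeps iterating never terminates. The sketch below illustrates the general idea of ending the stream after the first error. It is a simplified illustration under stated assumptions, not the standard library's actual ReadDir code; AlwaysFailing and EndAfterError are hypothetical names, and the error code 22 (EINVAL on Linux) is taken from the output above.

```rust
use std::io;

/// Hypothetical stand-in for a directory stream whose underlying read
/// keeps failing, like readdir() on a zombie's /proc/<PID>/net.
struct AlwaysFailing;

impl Iterator for AlwaysFailing {
    type Item = io::Result<String>;

    fn next(&mut self) -> Option<Self::Item> {
        // 22 is EINVAL on Linux ("Invalid argument").
        Some(Err(io::Error::from_raw_os_error(22)))
    }
}

/// Wrapper that passes the first error through once and then ends the stream.
struct EndAfterError<I> {
    inner: I,
    end_of_stream: bool,
}

impl<I> EndAfterError<I> {
    fn new(inner: I) -> Self {
        EndAfterError { inner, end_of_stream: false }
    }
}

impl<I, T> Iterator for EndAfterError<I>
where
    I: Iterator<Item = io::Result<T>>,
{
    type Item = io::Result<T>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.end_of_stream {
            return None;
        }
        let item = self.inner.next();
        // Anything other than a successful entry (an error, or the natural
        // end of the inner stream) marks the stream as finished.
        if !matches!(item, Some(Ok(_))) {
            self.end_of_stream = true;
        }
        item
    }
}

fn main() {
    // Unwrapped, the stream never ends; take() keeps the demo bounded.
    let errors = AlwaysFailing.take(1000).filter(|r| r.is_err()).count();
    println!("errors in the first 1000 items, unwrapped: {}", errors);

    // Wrapped, the caller sees the error exactly once.
    let wrapped: Vec<_> = EndAfterError::new(AlwaysFailing).collect();
    println!("items from the wrapped stream: {}", wrapped.len());
}
```

The design choice here is to keep the error visible to the caller (it is still yielded once) while guaranteeing that iteration terminates.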

@tavianator
Collaborator

Here's the fix: rust-lang/rust#105638

@sharkdp
Owner

sharkdp commented Dec 15, 2022

So if I understand correctly, your PR landed in Rust 1.60. Which is precisely our MSRV right now 😄. So there's currently no way to fix this bug by compiling with an older version of rustc, unless we backport fd to 1.27 <= MSRV < 1.60. Which might not be a big deal maybe. And then we set the MaximumSRV to 1.59 for a while?

@tmccombs
Collaborator

clap 4.0 has an MSRV of 1.60, so we'd probably have to downgrade clap to 3.x again if we did that.

@tavianator
Collaborator

Alternatively we can work around it in ignore by adding

if result.is_err() {
    break;
}

here: https://github.com/BurntSushi/ripgrep/blob/515f120b5c2c7984c8dfa8bafeda42916457b0ba/crates/ignore/src/walk.rs#L1497-L1507
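
The early-exit check suggested above can be sketched in isolation. This is a hedged illustration of the caller-side idea only, not the actual code in the ignore crate: drain_entries is a hypothetical function, and the entry stream is simulated with two good entries followed by an endlessly repeated EINVAL (os error 22).

```rust
use std::io;

/// Drain a stream of directory entries, stopping at the first error.
/// Returns the names read so far and the number of errors reported.
fn drain_entries<I>(entries: I) -> (Vec<String>, usize)
where
    I: Iterator<Item = io::Result<String>>,
{
    let mut names = Vec::new();
    let mut errors = 0;
    for result in entries {
        let failed = result.is_err();
        match result {
            Ok(name) => names.push(name),
            Err(_) => errors += 1,
        }
        // The workaround: do not ask a failed stream for more entries.
        if failed {
            break;
        }
    }
    (names, errors)
}

fn main() {
    // Hypothetical stream: two good entries, then the same error forever.
    let good: Vec<io::Result<String>> = vec![Ok("a".to_string()), Ok("b".to_string())];
    let stuck = std::iter::repeat_with(|| -> io::Result<String> {
        Err(io::Error::from_raw_os_error(22))
    });
    let (names, errors) = drain_entries(good.into_iter().chain(stuck));
    println!("{:?}, {} error(s)", names, errors);
}
```

Without the break, the loop in drain_entries would spin forever on the repeated error, which matches the endless stream of "[fd error]" lines shown earlier.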

@tavianator
Collaborator

Did that: BurntSushi/ripgrep#2378

@johnalanwoods

I'm now seeing this issue on macOS, with the latest release fd 10.1.0 - should I open a new ticket?

@sharkdp
Owner

sharkdp commented Aug 6, 2024

I'm now seeing this issue on macOS, with the latest release fd 10.1.0 - should I open a new ticket?

That'd be great.

@johnalanwoods

Apologies, I resolved this on my side; it was a heavily underperforming disk.
