Better support for hardlink detection #534

Open. Wants to merge 1 commit into base: master.

Conversation

KaibutsuX

The previous approach for Linux scanned from the root of the given file's path. For example, /home/user/files/video_file1.mp4 triggered a scan of / for all files.

With a 30-second timeout, that scan would generally never complete. And when it did complete, it had scanned the entire filesystem for every single possible duplicate.

Now the scan will occur multiple times but only at the base of each included directory:
IncludeDir: /home/user1/files
IncludeDir: /home/user2/videos/
Scans => /home/user1/files/ and /home/user2/videos/

Ideally, I think the hard-link approach for Linux should simply collect all inode values (st_ino) at the start of file enumeration and use those for comparison, without even invoking ffmpeg or any other heuristics, since sharing an inode number (on the same filesystem) is the definition of hard links on Linux. However, that approach requires modifying the FileEntry struct, which involves protobuf attributes, and I'm not really familiar with that.
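A minimal sketch of the inode idea above, written in Python purely for illustration (the project itself is not Python, and `find_hardlink_groups` is a hypothetical name, not something in the codebase). It assumes that two regular files are hard links of each other exactly when they share the same `(st_dev, st_ino)` pair, and it only walks the included directories:

```python
import os
import stat
from collections import defaultdict


def find_hardlink_groups(include_dirs):
    """Return groups of paths under include_dirs that share one inode.

    Hypothetical helper for illustration only. Files are keyed by
    (st_dev, st_ino) because inode numbers are only unique per filesystem.
    """
    by_inode = defaultdict(set)
    for base in include_dirs:
        for dirpath, _dirnames, filenames in os.walk(base):
            for name in filenames:
                path = os.path.abspath(os.path.join(dirpath, name))
                try:
                    st = os.lstat(path)  # lstat: do not follow symlinks
                except OSError:
                    continue  # file vanished or is unreadable; skip it
                if not stat.S_ISREG(st.st_mode):
                    continue  # ignore symlinks, sockets, devices, ...
                by_inode[(st.st_dev, st.st_ino)].add(path)
    # Only inodes reached through more than one path are hard-link groups.
    return [sorted(paths) for paths in by_inode.values() if len(paths) > 1]
```

Because only the included directories are walked, links that live outside them are never reported, which matches the scan behaviour described above. Using a set per inode also keeps overlapping include directories from producing duplicate entries.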

@0x90d
Owner

0x90d commented Sep 23, 2024

I'm not too familiar with Linux, but doesn't your approach fail to detect a hard link if the link is outside of the scan directories? If the problem is that the scan takes longer than 30 seconds on a system with too many files, then I think it would be better to make the timeout customizable.

@KaibutsuX
Author

Yes, but my interpretation of "Scan directories" is that I don't want the application looking at files outside of the scan directories at all. So if I have 2 hard links in the scan directories that are themselves links to 5 identical files in non-scanned directories, I don't want to know about the 5. All I care about is the 2 identical ones in the directories I've requested to be scanned.

I agree a custom timeout could potentially help for smaller filesystems. However, I think Linux filesystems work too fundamentally differently from Windows. The Path.GetPathRoot function gets the absolute root of the scanned file. On Windows, C:\myfiles\videos\video1.mp4 would yield C:\. That might make sense, so that you don't scan D:\ and E:\, but on Linux there is only one root: /. So a file at /home/user1/file.mp4 returns a root of /, which means looking across the entire filesystem (which also includes mounted filesystems, CDs, Blu-rays, Samba mounts, NFS mount points, other users' directories, and thumb drives), all mounted under the single root point /.
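To make the root difference concrete, here is a small Python analogue (an illustration only; it is not the project's code, and `path_root` is a hypothetical helper). It uses `pathlib`'s `anchor` attribute, which plays a similar role to Path.GetPathRoot for the two example paths above:

```python
from pathlib import PurePosixPath, PureWindowsPath


def path_root(path, windows=False):
    """Return the root portion of a path under Windows or POSIX rules.

    Hypothetical helper; pure paths are used so no filesystem access occurs.
    """
    cls = PureWindowsPath if windows else PurePosixPath
    return cls(path).anchor


# Windows: the root is the drive, so a scan stays on one volume.
print(path_root(r"C:\myfiles\videos\video1.mp4", windows=True))  # C:\
# Linux: the root is always "/", so a scan covers every mounted filesystem.
print(path_root("/home/user1/file.mp4"))  # /
```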

However, I think my other PR addresses this more simply.
