Better support for hardlink detection #534

KaibutsuX · 2024-09-16T14:46:31Z

The previous approach on Linux scanned from the root of the given file's path: /home/user/files/video_file1.mp4 => scanned / for all files.

With a 30-second timeout, that scan would generally never complete before timing out. And if it did complete, it was scanning the entire filesystem for every single possible duplicate.

Now the scan will occur multiple times but only at the base of each included directory:
IncludeDir: /home/user1/files
IncludeDir: /home/user2/videos/
Scans => /home/user1/files/ and /home/user2/videos/

Ideally, I think the hard link approach for Linux should simply collect all inode values (st_ino) at the start of file enumeration and use those for comparison, without even invoking ffmpeg or any other heuristics, since on Linux two paths are hard links of each other exactly when they share an inode number on the same filesystem. That approach requires modifying the FileEntry struct, though, which involves protobuf attributes, and I'm not really familiar with that.
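To make the inode idea concrete, here is a minimal sketch in Python rather than in the project's own code. It walks only the included directories and groups regular files by their (st_dev, st_ino) pair. The function name `group_hard_links` and the directory layout are hypothetical, and nothing here touches FileEntry or protobuf; it only illustrates the comparison itself.

```python
import os
import stat
from collections import defaultdict


def group_hard_links(include_dirs):
    """Return groups of paths under include_dirs that are hard links of each other.

    Hypothetical helper for illustration only. Two paths are treated as hard
    links when they share both the device number and the inode number.
    """
    by_inode = defaultdict(set)
    for base in include_dirs:
        # os.walk does not follow symlinked directories by default.
        for dirpath, _dirnames, filenames in os.walk(base):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    st = os.lstat(path)  # lstat: do not follow symlinks
                except OSError:
                    continue  # file vanished or is unreadable; skip it
                if not stat.S_ISREG(st.st_mode):
                    continue  # only regular files can be duplicate candidates
                if st.st_nlink > 1:
                    # A set absorbs repeats when include dirs overlap.
                    by_inode[(st.st_dev, st.st_ino)].add(os.path.abspath(path))
    # Keep only inodes seen under more than one scanned path.
    return [sorted(paths) for paths in by_inode.values() if len(paths) > 1]
```

Because only the included directories are walked, links that live outside them are never reported, and no timeout is needed for a whole-filesystem search.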

0x90d · 2024-09-23T18:34:48Z

I'm not too familiar with Linux, but doesn't your approach fail to detect a hard link if the link is outside of the scan directories? If the problem is that it takes longer than 30 seconds on a system with too many files, then I think it would be better to make the timeout customizable.

KaibutsuX · 2024-09-23T18:45:06Z

Yes, but my interpretation of the "Scan directories" is that I don't even want the application looking at files outside of the scan directories. So if I have 2 hard links in the scan directories which are themselves links of 5 identical files in non-scanned directories, I don't want to know about the 5, all I care about is the 2 identical ones in the directories I've requested to be scanned.

I agree a custom timeout could potentially help for smaller filesystems, but I think Linux filesystems work too fundamentally differently from Windows. The Path.GetPathRoot function gets the absolute root of the scanned file. On Windows, C:\myfiles\videos\video1.mp4 would return C:\. That might make sense so that you don't scan D:\ and E:\, but on Linux there is only one root: /. So a file at /home/user1/file.mp4 returns a root of /, which means looking across the entire filesystem (which also includes mounted filesystems, CDs, Blu-rays, Samba mounts, NFS mount points, other users' directories and thumb drives), all mounted under the single root point of /.
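The difference in roots can be shown with a small Python analogue. This is not the .NET call itself; `path_root` is a hypothetical helper built on pathlib's pure path classes, which parse Windows-style and POSIX-style paths without touching the disk.

```python
from pathlib import PurePosixPath, PureWindowsPath


def path_root(path, windows=False):
    """Return the root (drive plus root separator) of a path string.

    Hypothetical helper: a Python stand-in for the idea behind
    Path.GetPathRoot, not the .NET implementation.
    """
    cls = PureWindowsPath if windows else PurePosixPath
    return cls(path).anchor


print(path_root(r"C:\myfiles\videos\video1.mp4", windows=True))  # C:\
print(path_root("/home/user1/file.mp4"))                         # /
```

On Windows the root narrows the search to one drive, while on Linux every absolute path shares the same root, so scanning from it covers everything mounted on the system.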

However, I think my other PR addresses this more simply.
