Add File Deduplication #6
I was told to copy this here:

Hey, I have been using your program since you posted it on Reddit. I hope it's OK to request a feature here. The only issue I have had, once you helped me figure out how to use nHA, is the sheer amount of duplicates. I don't program, so I have no idea how hard it would be to have the program figure out which images are dupes before downloading them, but I have a feeling that would be a huge headache to implement.

I have an idea for what I think would be an easier workaround on your end. If you could add a flag (I don't know the correct term) to the .env file, we could choose not to have the images put into a CBZ file, and instead have each comic put into its own subfolder along with the CBZ's XML file. This would make it easy to use a third-party duplicate file finder to scan the subfolders and delete the dupes. Sure, you would still have to download the dupes, but it then becomes trivial to find and delete them. The XML files in the dupe folders wouldn't get deleted, but once the pics are gone it is as simple as sorting the folders by size and deleting the tiny ones. Then it would just be a matter of figuring out a way to batch-create CBZ files out of those subfolders after the de-duping.

This is just a thought. I know nothing about programming, so this might be something that isn't easy to do for your program, or hell, you might not feel like adding this kind of feature. Either way, I want to thank you for releasing it for free for everyone else. You didn't have to, but you made a lot of people's lives a lot easier in doing so.

I also thought of another solution, but I'm not sure if it is possible, due to not knowing how thumbnails are handled. But if there is a program that can take the CBZs' thumbnails and compare them against each other, then letting you choose the CBZ with the largest file size to keep and deleting the rest seems possible, in my ignorant, uneducated opinion.
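The batch step described above is simple to script. Here is a minimal Python sketch, assuming the de-duplicated subfolders live under a hypothetical ./library directory, that zips each one (ComicInfo.xml included) back into a .cbz:

```python
import zipfile
from pathlib import Path

# Hypothetical location of the per-comic subfolders left after de-duping.
library = Path("./library")

for folder in sorted(library.iterdir()):
    if not folder.is_dir():
        continue
    # A CBZ is just a zip archive of the page images plus ComicInfo.xml.
    with zipfile.ZipFile(folder.parent / (folder.name + ".cbz"), "w") as cbz:
        for file in sorted(folder.rglob("*")):
            if file.is_file():
                cbz.write(file, arcname=file.relative_to(folder))
```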
Thank you for the ideas, I really appreciate it. :)

Currently, all images are temporarily downloaded before being zipped into the CBZ. Speaking for that setting would be that someone else had already requested it, for compatibility reasons with another program, and the ease of implementation. Speaking against it would be the added complexity in the settings.

The problem I see with the hash generation approach is that I don't think it is reliable enough. What if two unrelated hentai have the same cover because they're from the same magazine, or maybe they screwed up and both happen to have a blank first page? And how do I implement a redirect from the deleted works to the kept work? I would probably need to implement changing the kept work after it has already been stored in the library, and that sounds like a headache.

These are just some thoughts I am having here before I decide in which direction I'm going to go. Let me know what you think.
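To make the redirect concern concrete: one conceivable shape for it, and not anything the project actually implements, is a persistent map from deleted gallery IDs to the kept one. A minimal sketch with made-up IDs and a hypothetical sidecar file:

```python
import json
from pathlib import Path

redirects_file = Path("redirects.json")  # hypothetical sidecar file

def resolve(gallery_id: str) -> str:
    """Follow redirects from deleted duplicates to the kept work (assumes no cycles)."""
    redirects = json.loads(redirects_file.read_text()) if redirects_file.exists() else {}
    while gallery_id in redirects:
        gallery_id = redirects[gallery_id]
    return gallery_id

# Example: gallery "11111" was deleted as a dupe of "22222".
redirects_file.write_text(json.dumps({"11111": "22222"}))
print(resolve("11111"))  # -> "22222"
```

Even then, entries in such a map would need rewriting whenever the kept work itself is later replaced, which is exactly the kind of post-hoc library mutation the maintainer wants to avoid.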
When creating the CBZ, do you zip with compression or without? If you zip without, you could hash the entire CBZ and discard it if it matches another. This would slow things down a bit, I think, but it would keep duplication down if someone was just downloading the entire site.
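For illustration, hashing whole archives and discarding exact matches takes only a few lines; a sketch, with the library path assumed:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large archives don't have to fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB at a time
            h.update(chunk)
    return h.hexdigest()

seen: dict[str, Path] = {}
for cbz in sorted(Path("./library").glob("*.cbz")):  # assumed location
    digest = sha256_of(cbz)
    if digest in seen:
        print(f"{cbz.name} is a byte-for-byte dupe of {seen[digest].name}")
        cbz.unlink()  # delete the duplicate archive
    else:
        seen[digest] = cbz
```

One caveat worth knowing: zip entry headers embed file modification timestamps, so even two uncompressed archives of byte-identical images can hash differently unless those timestamps are normalized when the archives are written.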
This would reliably work for complete duplicates, but not for series like Sweet Guy, which is uploaded countless times with a varying number of chapters. And then there's still the redirection issue from deleted work to kept work to solve. I would be hesitant to change any library entries after they have been moved into the library; that's why, if I decide to do this, the process should be well thought out.

That said, I really appreciate your ideas and active participation @billsargent. Thank you. :)
I was looking at the specs for CBZ, and here's the breakdown: CBZ should always be plain zip; CB7 is 7-Zip, CBA is ACE, CBR is RAR, and CBT is tar. Your code is creating zip files using deflate. If you could disable deflate and just use no compression, then theoretically they should have the same hash. Yeah, as you said, some have different numbers of chapters, but overall I think it would save space. The ones with varying numbers of chapters could then be left up to the user to fix themselves. See this for how to create zips without compression. I think you are using ZipWriter...
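The no-compression option being pointed at is the "stored" method. In Python's zipfile it looks like the following; the Rust zip crate's ZipWriter, which the comment guesses the project uses, exposes the same choice as CompressionMethod::Stored, though that mapping is my assumption, not something confirmed in the thread:

```python
import zipfile
from pathlib import Path

folder = Path("./12345")  # hypothetical per-comic folder

# ZIP_STORED writes each entry uncompressed, so archives built from
# identical files (added in the same order) carry identical payloads.
with zipfile.ZipFile("12345.cbz", "w", compression=zipfile.ZIP_STORED) as cbz:
    for file in sorted(folder.rglob("*")):
        if file.is_file():
            cbz.write(file, arcname=file.relative_to(folder))
```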
oh I can't post links...? |
I think I am repeating myself, but I still don't see a way to properly redirect someone from a deleted hentai to the corresponding kept hentai. I think this whole deduplication topic is a can of worms I don't want to touch myself, because I don't have the feeling I could offer a solution that lives up to my reliability standards. I'm currently seriously thinking about adding a setting that skips the CBZ creation and keeps each comic's images in their own subfolder instead.
If you could place the ComicInfo.xml in there as well, it would also help with creating the CBZs locally, so the metadata is preserved. This would take the burden off you, and others who have shell or Python scripting capabilities can do the deduplication themselves.
Good point. Then how about naming the setting accordingly?
I agree. That sounds perfect. |
Deduplication has been decided to be out of scope of this project for now; the setting discussed above, which keeps each comic's files available for external de-duplication, covers this use case instead.
https://www.reddit.com/r/DataHoarder/comments/1fg5yzy/comment/ln2efs3/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button