-
Is there a plugin that can deduplicate nearly identical chunks? My dataset contains thousands of books, some of which were scanned and OCRed several times. Such books produce chunks that are partly almost identical; the only likely differences are the heads, the tails, and OCR errors. Is there a way to scan a Qdrant collection and find such similarities?
-
Each dataset requires individual calibration, so I don't think there is an out-of-the-box solution for this. However, you can try running a similarity search against the whole dataset, duplicates included, to generate a list of candidates for further deduplication.
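
To make the candidate-generation idea a bit more concrete, here is a minimal sketch in plain Python. It is not a Qdrant plugin and does not call the Qdrant API: the brute-force `candidate_pairs` function only stands in for a per-point similarity search with a score threshold, and `is_near_duplicate` is a hypothetical second pass on the chunk text that tolerates small OCR differences. All function names and thresholds here are assumptions for illustration.

```python
import math
from difflib import SequenceMatcher
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def candidate_pairs(vectors, threshold):
    """Return (id, id) pairs whose cosine similarity is at least `threshold`.

    `vectors` maps a chunk id to its embedding. This is an O(n^2) stand-in
    for asking the vector index for each point's nearest neighbours.
    """
    return [
        (i, j)
        for i, j in combinations(sorted(vectors), 2)
        if cosine(vectors[i], vectors[j]) >= threshold
    ]


def is_near_duplicate(text_a, text_b, min_ratio):
    """Confirm a candidate pair by comparing the raw chunk text.

    SequenceMatcher.ratio() is 1.0 for identical strings and drops as the
    texts diverge, so a few OCR errors still score high.
    """
    ratio = SequenceMatcher(None, text_a, text_b, autojunk=False).ratio()
    return ratio >= min_ratio
```

Both `threshold` and `min_ratio` would need to be tuned per dataset, which is the calibration mentioned above. Chunks that differ a lot in their heads and tails will score lower on the text check, so the ratio may have to be set lower or the comparison limited to the overlapping part.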