MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW
Updated Jun 4, 2024 - Python
Quickly search, compare, and analyze genomic and metagenomic data sets.
Locality Sensitive Hashing using MinHash in Python/Cython to detect near-duplicate text documents
Detect and visualize text reuse
High-performance MinHash implementation in Rust with Python bindings for efficient similarity estimation and deduplication of large datasets
Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.
Python 2.7 code and learning notes for Spark 2.1.1
A database for signatures of public genomic sources
Chiral version of the MinHashed Atom-Pair Fingerprint
Python library for detecting near-duplicate texts in a corpus at scale using Locality Sensitive Hashing, as described in chapter three of Mining of Massive Datasets.
Cross-architecture binary comparison database
Software to identify plasmid sequence data from metagenomes using logistic regression and MinHash
Find similar text files quickly
Fast Jaccard similarity search for abstract sets (documents, products, users, etc.) using MinHashing and Locality Sensitive Hashing
Aurora karton for similarity matching.
Attempt to use MinHash to find duplicates in an Elasticsearch index
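Most of the projects listed above rely on the same core idea: a MinHash signature lets you estimate the Jaccard similarity of two sets without comparing them element by element. The sketch below illustrates that idea using only the Python standard library. It is not taken from any of the listed repositories; the function names (`shingles`, `minhash_signature`, `estimate_jaccard`, `exact_jaccard`), the word-shingle size, and the choice of salted BLAKE2b as the hash family are all assumptions made for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles in text (illustrative tokenization)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def minhash_signature(items, num_perm=128):
    """Build a MinHash signature: one minimum per seeded hash function.

    Assumes `items` is a non-empty set of strings. Each seed is used as a
    BLAKE2b salt, which stands in for an independent random permutation.
    """
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    doc_a = shingles("the quick brown fox jumps over the lazy dog")
    doc_b = shingles("the quick brown fox jumped over the lazy dog")
    sig_a, sig_b = minhash_signature(doc_a), minhash_signature(doc_b)
    print("exact:   ", exact_jaccard(doc_a, doc_b))
    print("estimate:", estimate_jaccard(sig_a, sig_b))
```

A larger `num_perm` should tighten the estimate at the cost of a longer signature. The LSH-based projects above go one step further and band these signatures into hash buckets so that candidate pairs can be found without comparing every pair of signatures.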