LSH : Locality sensitive hashing

CS F469 IR Assignment - 2

Problem Statement:

We have to implement Locality Sensitive Hashing (LSH) to find duplicate or similar DNA sequences within the corpus. The steps involved are shingling, minhashing, and locality sensitive hashing. The main idea is to hash similar documents into the same buckets, so that documents sharing a bucket have a high probability of being similar or duplicates.
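The three steps can be sketched roughly as follows. This is a minimal illustration and not the code in LSH_program.py: the function names, the shingle length k=5, the 20 hash functions, and the 5 bands of 4 rows are all assumptions chosen for the example, and sequences are assumed to be at least k characters long.

```python
import random
import zlib
from collections import defaultdict

PRIME = 2_147_483_647  # 2**31 - 1, modulus for the illustrative hash family


def shingles(seq, k=5):
    """Return the set of all length-k substrings (k-mers) of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions of the form (a*x + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(shingle_set, params):
    """For each hash function, keep the minimum hash over all shingles."""
    # crc32 gives a deterministic integer id per shingle (unlike built-in hash()).
    ids = [zlib.crc32(s.encode()) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in ids) for a, b in params]


def lsh_buckets(signatures, bands, rows):
    """Split each signature into bands; equal bands land in the same bucket."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)
    return buckets


def candidate_pairs(buckets):
    """Every pair of documents that shares at least one bucket is a candidate."""
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs


if __name__ == "__main__":
    # Toy corpus (made up for this example, not taken from the dataset).
    corpus = {
        "seq1": "ATGCGTACGTTAGCATGC",
        "seq2": "ATGCGTACGTTAGCATGC",  # exact duplicate of seq1
        "seq3": "GGGGCCCCAAAATTTTGG",
    }
    params = make_hash_params(20)
    sigs = {name: minhash_signature(shingles(seq), params)
            for name, seq in corpus.items()}
    print(candidate_pairs(lsh_buckets(sigs, bands=5, rows=4)))
```

Exact duplicates always produce identical signatures, so they share every bucket. For merely similar sequences, more bands with fewer rows each makes a collision more likely, at the cost of more false candidates.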

About the project

Dataset used - Kaggle-human-data

See the Design Architecture file. It covers the concepts used, along with the time taken by each implementation step.

Project By:


How to run the code

  1. Clone the repository : https://github.com/KritiJethlia/LSH.git

  2. cd LSH

  3. Run file:

           python3 LSH_program.py
    
  4. Type your query in the terminal and wait until it returns the similar DNA sequences :)


Dependencies/modules used

  • time
  • collections
  • pandas
  • pickle
  • numpy
  • random
  • operator
  • sys
  • copy