Implementing Locality Sensitive Hashing for DNA Sequences.
DNA sequence dataset from Kaggle is used. Classes of the DNA sequences are seperated from the data and only the sequence is used.
Size of shingles is taken as input. 5-10 recommended.
Random permutations of pseudo indices are used to generate signatures for sequences (documents). Number of permutations is taken as input.
Number of bands is taken as input and sequences in the same band are hashed into buckets where two sequences from the same band have high probability of going into the same bucket if they are similar.
The GUI is built with Tkinter in python.
Please enter number of bands, number of permutations and size of shingles before giving corpus directory input to start LSH.