Divides a given set of names into groups using their publicly accessible digital footprints.
Libraries Used
- googlesearch
- BeautifulSoup
- requests
- NLTK
- numpy
- Lsa
- sklearn
1. Obtaining Data for the Given Names
- Get the top 10 links related to each given name using the googlesearch library.
- Get the text of each linked page using the BeautifulSoup library.
- Pre-process all the text (tokenize, remove stopwords, apply stemming) using the NLTK library.
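The text extraction and pre-processing steps can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names `html_to_text` and `preprocess`, the inline HTML sample, and the tiny `STOP_WORDS` set are all assumptions made so the example runs offline. NLTK's own stopwords corpus (which needs a one-time download) would be the fuller choice, and in the real pipeline the HTML would come from pages fetched with requests.

```python
import re

from bs4 import BeautifulSoup
from nltk.stem import PorterStemmer

# Hypothetical, deliberately tiny stop-word set so the sketch needs no downloads.
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "of", "in", "at", "to", "on", "for", "with"}


def html_to_text(html):
    """Strip markup and return the visible text of an HTML page."""
    soup = BeautifulSoup(html, "html.parser")
    # Script and style bodies are not visible text, so drop them first.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ")


def preprocess(text):
    """Lowercase, tokenize on letters, remove stop words, then stem."""
    stemmer = PorterStemmer()
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(token) for token in tokens if token not in STOP_WORDS]


# Made-up page content standing in for a fetched search result.
sample_html = (
    "<html><body><h1>Jane Doe</h1>"
    "<p>Jane is teaching courses at the university.</p>"
    "<script>var x = 1;</script></body></html>"
)
tokens = preprocess(html_to_text(sample_html))
print(tokens)
```

A regular expression is used for tokenizing here only to avoid NLTK's tokenizer data download; `nltk.word_tokenize` could be swapped in once that data is installed.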
2. Forming a Tf-Idf Matrix for Each Name
- Build a Tf-Idf matrix for each name from the pre-processed text.
3. Create LSI Corpus and Cluster the Names
- Create an LSI corpus from the Tf-Idf matrices and cluster the names using the K-means algorithm.
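As a rough sketch of this step, sklearn's `TfidfVectorizer` can turn pre-processed text into a Tf-Idf matrix. The three strings below are invented, already-stemmed stand-ins for the text collected for one name; how the project actually splits text into documents is not specified here, so treat that layout as an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical pre-processed (stop words removed, stemmed) pages for one name.
docs = [
    "jane doe teach physic univers",
    "jane doe publish physic paper",
    "jane doe run marathon",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

print(tfidf.shape)                    # (number of documents, vocabulary size)
print(sorted(vectorizer.vocabulary_))  # the terms that index the columns
```

Each row is L2-normalized by default, so rows can be compared directly with a dot product.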
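One way to picture this step is shown below. It is only a sketch under stated assumptions: sklearn's `TruncatedSVD` is swapped in for the LSI step (truncated SVD applied to a Tf-Idf matrix is the usual formulation of LSI/LSA), each invented string stands for the combined pre-processed text of one name, and the choice of 2 components and 2 clusters is arbitrary for this toy data.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical combined, pre-processed text for four names.
docs = [
    "physic professor univers research paper",
    "physic research lab univers lecture",
    "marathon runner race train medal",
    "race train marathon coach runner",
]

tfidf = TfidfVectorizer().fit_transform(docs)

# Project the Tf-Idf rows into a low-dimensional latent (LSI) space.
lsi = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# Group the projected rows; random_state fixes the initialization.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(lsi)

print(labels)  # rows with the same label fall in the same group
```

On real data the number of LSI components and clusters would need tuning, for example by inspecting cluster quality for several values.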
DEMO