A python library to train and store a word2vec model trained on wiki data. Model includes most common bigrams.
- Python 3.x with Anaconda
- Gensim
- Wikipedia .xml.bz2 file
Wikipedia file can be downloaded from https://dumps.wikimedia.org/backup-index.html File needs to be of type .xml.bz2 (for storage)
For instance, file can be downloaded with
wget 'https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2'
In the trainModel folder, runAll.sh is the script that will launch all relevant python files. Only input needed is the name of the wikipedia file
# INPUT
xmlFile="wikiIni.xml.bz2"
# OUTPUT
wikiTextDir=$(python genWikiText.py $xmlFile)
bigramDir=$(python findBigrams.py $wikiTextDir)
python trainWikiText.py $wikiTextDir $bigramDir
- genWikiText.py uses WikiCorpus from gensim to extract text from wikipedia file. All articles are joined and written to text file.
- findBigrams.py goes through text file and finds the most common bigrams in the text. Bigram model is saved in vector format.
- trainWikiText.py trains wikiCorpus on Word2Vec after passing model through bigrams.
End result is a Word2Vec model, saved in a vector format to be loaded and used in another instance.
Here we use the model to determine the similarity of a sentence to a particular theme using the Word2Vec model.
This is done by using the runAll.py file in the usingModel folder. The inputs are the wiki file, the bigram model, a text file of themes to examine and a folder that contains csvs of data.
Main Inputs:
- A trained Word2Vec model
- A trained phrase colocation model (trained to find bigrams)
- A .txt file that has a list of all the themes that need to be investigated
- A folder with CSVs, where each CSV represents a key and a sample of text to analyse
The interests (or themes) are defined in a seperate text file and the similarity of each word in the second column of the csv is used.
# INPUTS
modelFile='[/path/to/wikiModel].vector'
bigramFile='[/path/to/bigramModel].model'
themesFile='[/path/to/list/of/themes/].txt'
folderName='[folder_containts_csv_data]'
# COMMAND
python findTextSim.py $modelFile $bigramFile $themesFile $folderName
Outputs is a folder for each csv that contains affinity of sample text for every theme.