Code is written in Python (2.7) and Lua (LuaJIT) with Torch.
Using the pre-trained word2vec vectors with gensim requires downloading them; see https://radimrehurek.com/gensim/models/word2vec.html
Co-occurrence matrix and other data files can be downloaded from https://www.dropbox.com/s/wqduqde7pv8cr76/ELDEN_Corpus.tar.gz?dl=0
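If you want to inspect the downloaded data before running the pipeline, the files use standard formats: NumPy `.npy` for the co-occurrence matrix and Python pickle for the vocabulary (the real file names, base_co.npy and vocab.pickle, appear in step C). A toy round-trip with demo file names of our own:

```python
import pickle
import numpy as np

# Build and save a tiny co-occurrence matrix and vocabulary in the
# same formats as the distributed data files (demo names, not the
# real base_co.npy / vocab.pickle from the tarball).
co = np.zeros((3, 3))
co[0, 1] = 5.0  # e.g. count of ("Paris", "France") co-occurring
np.save("demo_co.npy", co)
with open("demo_vocab.pickle", "wb") as f:
    pickle.dump({"Paris": 0, "France": 1, "Louvre": 2}, f)

# Load them back the way the preprocessing scripts would.
co_matrix = np.load("demo_co.npy")
with open("demo_vocab.pickle", "rb") as f:
    vocab = pickle.load(f)
```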
This package contains the four implementation steps (folders A to D), followed by Evaluation (folder E). We suggest running the system in this order.
A. Corpus :
- Wikipedia (cleaned as specified in the paper)
- Web Corpus = trainingEntities.py, processMultipleEntities.py, WebScraping.py
B. Dataset :
- TAC2010 = TACforNED
- CoNLL = https://github.com/masha-p/PPRforNED
Please cite the respective papers when using these datasets.
C. Preprocess:
- Create entity co-location index.
python2.7 pmi_index.py base_co.npy/None vocab.pickle output_file file_scraped_from_web
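The index is used to compute pointwise mutual information (PMI) between entities from their co-occurrence counts. A minimal sketch of the statistic itself (not the actual pmi_index.py implementation):

```python
from math import log

def pmi(pair_count, count_x, count_y, total):
    """Pointwise mutual information from raw co-occurrence counts:
    PMI(x, y) = log( p(x, y) / (p(x) * p(y)) ).
    `total` is the number of observation windows/documents."""
    p_xy = pair_count / float(total)
    p_x = count_x / float(total)
    p_y = count_y / float(total)
    return log(p_xy / (p_x * p_y))

# Toy numbers: x and y each appear 10 times in 100 windows and
# co-occur 5 times -> PMI = log(0.05 / 0.01) = log(5).
score = pmi(pair_count=5, count_x=10, count_y=10, total=100)
```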
- Start PMI Server.
python pmi_service.py
- Train entity embeddings.
th main.lua <<word2vec.lua>>
- Start Embedding Distance Servers.
th EDServer.lua
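The Embedding Distance servers return a similarity score between entity embeddings. Assuming cosine similarity (a common choice for word2vec-style embeddings; the actual metric lives in EDServer.lua), the computation looks like:

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors:
    cos(u, v) = (u . v) / (|u| * |v|).
    Assumption: the ED servers expose a score of this kind."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```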
D. Entity Linker:
- Create train and test dataset
python createTrainData.py
- Run Entity Linker
python classify.py
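At its core, linking picks the candidate entity most coherent with the rest of the document, using the relatedness scores served in step C. A schematic sketch of that idea (the candidate names and relatedness table below are hypothetical, and this is not the feature set of classify.py):

```python
def link_mention(candidates, context_entities, relatedness):
    """Pick the candidate entity most coherent with the entities
    already linked in the document, scoring coherence as summed
    pairwise relatedness (e.g. PMI or embedding distance)."""
    def coherence(candidate):
        return sum(relatedness(candidate, e) for e in context_entities)
    return max(candidates, key=coherence)

# Hypothetical relatedness scores for the mention "Paris" in a
# document that already links "France":
scores = {("Paris_(France)", "France"): 2.3,
          ("Paris,_Texas", "France"): 0.1}
pick = link_mention(
    ["Paris_(France)", "Paris,_Texas"],
    ["France"],
    lambda a, b: scores.get((a, b), 0.0),
)
```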
E. Evaluation :
- Head entities versus tail entities statistics
python TailEntities.py
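Splitting entities into "head" (frequently mentioned) versus "tail" (rarely mentioned) reduces, in the simplest view, to thresholding mention counts. A minimal sketch, with an illustrative threshold (the actual cut-off is defined in TailEntities.py):

```python
from collections import Counter

def split_head_tail(mentions, threshold):
    """Split entities into a frequently-mentioned 'head' set and a
    rarely-mentioned 'tail' set by raw mention count. The threshold
    value is illustrative, not ELDEN's actual cut-off."""
    counts = Counter(mentions)
    head = {e for e, c in counts.items() if c >= threshold}
    tail = {e for e, c in counts.items() if c < threshold}
    return head, tail

head, tail = split_head_tail(
    ["Einstein", "Einstein", "Einstein", "Yamaguchi"], threshold=2)
```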
Kindly cite the paper if you use this software.