This library generates semantic embeddings of entities from text that describes them. It can also quantize and compress the obtained models.
The training code is written in Python and it requires Numpy, Scipy, Numexpr, Theano, and it also relies on gensim, which is included as a git submodule. The code for model compression and entity scoring is instead written in Java.
This code was used in the experiments of the paper:
Roi Blanco, Giuseppe Ottaviano, Edgar Meij, Fast and Space-efficient Entity Linking in Queries, ACM WSDM 2015.
The Python code does not require building but it is necessary to download the
git submodules. If you have cloned the repository without --recursive
, you
will need to perform the following commands:
$ git submodule init
$ git submodule update
The Java code can be instead built with:
$ mvn package
To generate entity vectors it is necessary to generate word embeddings on a
large enough corpus. Both word2vec or gensim can be used for the task. The
training code assumes that the words in the corpus have been lower-cased. In the
following we assume that word2vec was used and the result is in
data/word_model.bin
. Entities and descriptions should be in a TSV file
organized as follows (say data/descriptions.tsv
).
entity1\tdescription text 1\n
entity2\tdescription text 2\n
...
For example, the description can be the first paragraph of the entity's Wikipedia page.
The LR entity vectors can be generated as follows:
$ ./entity_vectors.py train lr data/word_model.bin data/descriptions.tsv \
data/entity.lr.model
The Centroid entity vectors can be computed likewise by passing centroid
instead of lr
.
To evaluate the generated entity vectors it is possible to score a set of entities matching given substrings against a given context with the following command:
$ ./entity_vectors eval data/word_model.bin data/entity.lr.model \
data/entity.centroid.model
For example, by entering the string +brad +pitt matches
, all the entities
containing the substrings brad
and pitt
will be found, and scored against
the context brad pitt matches
. A few examples:
> +brad +pitt matches
Brad_Pitt_%28boxer%29 -1.103 | Brad_Pitt_filmography 0.516
University_of_Pittsburgh_at_Bradford -1.627 | Brad_Pitt_%28boxer%29 0.482
Brad_Pitt -1.645 | List_of_awards_and_nominations_received_by_Brad_Pitt 0.370
> +hollywood lyrics
Hollywood_%28Madonna_song%29 -0.014 | Broadway_to_Hollywood 0.594
The_Hollywood_Palace -0.016 | Hollywood_Pacific_Theatre 0.584
Hollywood_Hotel_%28film%29 -0.019 | Hollywood_Speaks 0.573
Left column is scored with the LR model, right column with the Centroid model. It is easy to see that in these examples LR gives significantly better scores.
The word and entity models can be quantized and compressed. Quantization is done
through the script model_quantization.py
. An example Java implementation is
included that uses Golomb coding for compression, and implements the scoring
algorithms using the compressed models.
To quantize the word vectors run
$ ./model_quantization.py quant data/word_model.bin data/word
This will generate both a .txt
file with the quantized coefficients and a
gensim file with the dequantized model. The latter is supposed to be used to
train the entity vectors as before: since a transformation is applied to the
word vectors before quantizing them, an entity model trained on word_model.bin
cannot be used with the quantized model.
When the new entity model is generated, it is possible to quantize it as well:
$ ./model_quantization.py quant_entities data/entity.lr.model data/entity.lr
The .txt
files can then be passed to the Word2VecCompress
program that
generates the compressed binary models:
$ mvn exec:java -Dexec.mainClass="it.cnr.isti.hpc.Word2VecCompress" -Dexec.args="data/word.e0.100.txt data/word.e0.100.bin"
$ mvn exec:java -Dexec.mainClass="it.cnr.isti.hpc.Word2VecCompress" -Dexec.args="data/entity.lr.e0.100.txt data/entity.lr.e0.100.bin"
The resulting files can then be used with the EntityScorer
class.
- Giuseppe Ottaviano giuott@gmail.com