RandomIndexing
Random Indexing (RI) is a word co-occurrence based approach to statistical semantics. RI uses statistical approximations of the full word co-occurrence data to achieve dimensionality reduction. This results in a much quicker running time and fewer required dimensions.
In most co-occurrence models, a word-by-word matrix is constructed, where each value denotes how many times the column's word occurred in the context of the row's word. RI instead represents co-occurrence through index vectors. Each word is assigned a high-dimensional, random vector known as its index vector. These index vectors are very sparse - typically 7 ± 2 non-zero entries in a vector of length 2048 - which ensures that the chance of any two arbitrary index vectors having an overlapping meaning (i.e. a non-zero cosine similarity) is very low. A word's semantics are then computed by keeping a running sum of the index vectors of the words it co-occurs with.
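To make the core idea concrete, the following minimal sketch builds sparse ternary index vectors and accumulates them into semantic vectors. It is an illustration only, not the S-Space implementation; the class name and the exact sparsity (8 non-zero entries of ±1 in a 2048-dimensional vector) are chosen to match the description above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Minimal sketch of basic Random Indexing: each word gets a sparse random
// index vector, and a word's semantic vector is the running sum of the
// index vectors of the words that co-occur with it.
public class RandomIndexingSketch {

    static final int LENGTH = 2048;   // dimensionality of all vectors
    static final int NON_ZERO = 8;    // roughly 7 +/- 2 non-zero entries
    static final Random RAND = new Random();

    // Lazily created sparse index vectors, one per word type.
    static final Map<String, int[]> indexVectors = new HashMap<>();

    // Semantic vectors accumulated as the corpus is processed.
    static final Map<String, int[]> semanticVectors = new HashMap<>();

    // Returns the word's index vector, creating a sparse ternary vector
    // (a few +1/-1 entries at random positions) on first use.
    static int[] indexVectorFor(String word) {
        return indexVectors.computeIfAbsent(word, w -> {
            int[] v = new int[LENGTH];
            for (int i = 0; i < NON_ZERO; i++)
                v[RAND.nextInt(LENGTH)] = RAND.nextBoolean() ? 1 : -1;
            return v;
        });
    }

    // Adds the index vectors of the words within the window around
    // position i into the semantic vector of tokens[i].
    static void process(String[] tokens, int i, int window) {
        int[] semantics =
            semanticVectors.computeIfAbsent(tokens[i], w -> new int[LENGTH]);
        int start = Math.max(0, i - window);
        int end = Math.min(tokens.length - 1, i + window);
        for (int j = start; j <= end; j++) {
            if (j == i)
                continue;
            int[] index = indexVectorFor(tokens[j]);
            for (int k = 0; k < LENGTH; k++)
                semantics[k] += index[k];
        }
    }

    public static void main(String[] args) {
        String[] tokens =
            "the quick brown fox jumps over the lazy dog".split(" ");
        for (int i = 0; i < tokens.length; i++)
            process(tokens, i, 2);
        // Print the first few dimensions of the semantic vector for "fox".
        System.out.println(Arrays.toString(
            Arrays.copyOf(semanticVectors.get("fox"), 16)));
    }
}
```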
Sahlgren et al. (2008) introduced a variation on RI in which the semantics also capture word order by using a permutation function. For each occurrence of a word, rather than directly summing the index vectors of the co-occurring words, the permutation function is first used to transform each co-occurring word's index vector based on its relative position. For example, consider the sentence "the quick brown fox jumps over the lazy dog." With a window size of 2, the semantic vector for "fox" is updated with Π<sup>-2</sup>(quick<sub>index</sub>) + Π<sup>-1</sup>(brown<sub>index</sub>) + Π<sup>1</sup>(jumps<sub>index</sub>) + Π<sup>2</sup>(over<sub>index</sub>), where Π<sup>k</sup> denotes the k<sup>th</sup> permutation of the specified index vector.
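One way to realize the permutation Π is a circular shift of the index vector by the neighbor's relative position, as in the sketch below. The rotation is a common illustrative choice and not necessarily the exact permutation function used by the S-Space implementation; the class and method names are placeholders.

```java
import java.util.Arrays;

// Sketch of order-aware Random Indexing in the style of Sahlgren et al.
// (2008): the permutation PI^k is realized here as a circular shift of the
// index vector by k positions, so that left and right neighbors (and
// neighbors at different distances) contribute distinguishable patterns.
public class PermutedIndexingSketch {

    // Rotates the vector right by k positions (left for negative k).
    static int[] permute(int[] indexVector, int k) {
        int n = indexVector.length;
        int shift = ((k % n) + n) % n;   // normalize k into [0, n)
        int[] rotated = new int[n];
        for (int i = 0; i < n; i++)
            rotated[(i + shift) % n] = indexVector[i];
        return rotated;
    }

    // Adds PI^(j-i)(index vector of tokens[j]) into the semantic vector of
    // tokens[i] for every neighbor j within the window.
    static void accumulate(int[] semantics, String[] tokens, int i, int window,
                           java.util.function.Function<String, int[]> indexVectorFor) {
        int start = Math.max(0, i - window);
        int end = Math.min(tokens.length - 1, i + window);
        for (int j = start; j <= end; j++) {
            if (j == i)
                continue;
            int[] permuted = permute(indexVectorFor.apply(tokens[j]), j - i);
            for (int k = 0; k < semantics.length; k++)
                semantics[k] += permuted[k];
        }
    }

    public static void main(String[] args) {
        int[] v = {1, 0, 0, -1, 0, 0, 0, 0};
        System.out.println(Arrays.toString(permute(v, 1)));    // shifted right by one
        System.out.println(Arrays.toString(permute(v, -1)));   // shifted left by one
    }
}
```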
For more information on Random Indexing, see the following papers:
- M. Sahlgren, "Vector-based semantic analysis: Representing word meanings based on random labels," in Proceedings of the ESSLLI 2001 Workshop on Semantic Knowledge Acquisition and Categorisation, Helsinki, Finland, 2001.
- M. Sahlgren, "An introduction to random indexing," in Proceedings of the Methods and Applicatons of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, 2005.
- M. Sahlgren, A. Holst, and P. Kanerva, "Permutations as a means to encode order in word space," in Proceedings of the 30th Annual Meeting of the Cognitive Science Society (CogSci’08), 2008.
The core Random Indexing algorithm is implemented in one file, [RandomIndexing.java](http://fozziethebeat.github.com/S-Space/apidocs/edu/ucla/sspace/ri/RandomIndexing.html). We also support a command-line executable class, [RandomIndexingMain.java](http://fozziethebeat.github.com/S-Space/apidocs/edu/ucla/sspace/ri/RandomIndexing.html), that uses the RandomIndexing class to build a semantic space.
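The RandomIndexing class can also be driven programmatically rather than through RandomIndexingMain. The sketch below assumes the SemanticSpace-style methods (processDocument, processSpace, getVector) listed in the linked apidocs; check the javadoc of your S-Space version for the exact signatures.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.Properties;

import edu.ucla.sspace.ri.RandomIndexing;

// Sketch of building a small semantic space directly with the
// RandomIndexing class; each BufferedReader passed to processDocument is
// treated as one document.
public class BuildSpaceSketch {
    public static void main(String[] args) throws Exception {
        RandomIndexing ri = new RandomIndexing();

        ri.processDocument(new BufferedReader(
            new StringReader("the quick brown fox jumps over the lazy dog")));
        ri.processDocument(new BufferedReader(
            new StringReader("the lazy dog sleeps all day")));

        // Finish any post-processing and inspect a word's semantic vector.
        ri.processSpace(new Properties());
        System.out.println("fox -> " + ri.getVector("fox"));
    }
}
```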
The current software processes a corpus as a series of "documents," which lets the user segment the corpus along paragraph or sentence boundaries. Segmentation is not required, but it has the effect of removing co-occurrence counts for words that appear across boundaries, e.g. words in neighboring sentences are not counted when each sentence is its own document. No specific segmentation choice is regarded as the best, and users are encouraged to try different ones.
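For instance, a corpus can be segmented into sentence-based documents by writing one sentence per line and passing the result to --docFile. The sketch below uses the standard BreakIterator class for this; the file names are placeholders.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Turns a raw text file into a file with one sentence per line so that
// each sentence becomes its own "document" when passed to --docFile.
public class SentenceSegmenter {
    public static void main(String[] args) throws IOException {
        String raw = new String(
            Files.readAllBytes(Paths.get("raw_corpus.txt")),
            StandardCharsets.UTF_8);

        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(raw);

        List<String> sentences = new ArrayList<>();
        for (int start = it.first(), end = it.next();
                 end != BreakIterator.DONE;
                 start = end, end = it.next()) {
            // Collapse internal line breaks so each sentence stays on one line.
            String sentence = raw.substring(start, end).replaceAll("\\s+", " ").trim();
            if (!sentence.isEmpty())
                sentences.add(sentence);
        }
        Files.write(Paths.get("my_corpus.txt"), sentences, StandardCharsets.UTF_8);
    }
}
```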
Random Indexing requires Java 6.
RI can be invoked either by using the RandomIndexingMain class or by using the random-indexing.jar file provided in the downloads. Both methods are equivalent and support the following options:
Required (at least one of):

- `-d`, `--docFile`: a file where each line is a document
- `-f`, `--fileList`: a list of document files

Algorithm Options:

- `-l`, `--vectorLength`: the length of the semantic vectors
- `-n`, `--permutationFunction`: the permutation function to use
- `-p`, `--usePermutations`: whether to permute index vectors based on word order
- `-s`, `--windowSize`: how many words to consider in each direction
- `-F`, `--tokenFilter=FILE[include|exclude][,FILE...]`: specifies a list of one or more files to use for filtering the documents. An optional flag may be added to each file to specify how the words in the filter file should be used: `include` if only the words in the filter file should be retained in the document, or `exclude` if only the words not in the filter file should be retained. The default is `include`. An example configuration might look like: `--tokenFilter=include=english-dictionary.txt,exclude=stop-list.txt`
- `-L`, `--loadVectors=FILE`: loads a word-to-IndexVector mapping before processing. This enables multiple runs of RandomIndexing to share the same initial index vector mapping, which is useful for comparing how different parameters affect the semantic vectors from a single corpus, or for comparing the semantics across multiple corpora while keeping the same index vector mapping (see the example invocation after this list).
- `-S`, `--saveVectors=FILE`: saves the final word-to-IndexVector mapping after processing. This step is necessary for using the `--loadVectors` option.

Program Options:

- `-o`, `--outputFormat={text|binary|sparse_text|sparse_binary}`: specifies the output format to use when generating the semantic space (.sspace) file. See [FileFormats] for format details.
- `-t`, `--threads`: the number of threads to use
- `-v`, `--verbose`: prints verbose output
- `-w`, `--overwrite`: specifies whether to overwrite the existing output
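For example, the `--saveVectors` and `--loadVectors` options can be combined so that two corpora are processed with the same index vectors; the corpus names, vector file name, and output directories below are placeholders:

java -jar random-indexing.jar --docFile=corpus-a.txt --saveVectors=index-vectors.dat output-a/

java -jar random-indexing.jar --docFile=corpus-b.txt --loadVectors=index-vectors.dat output-b/

The second run reuses the index vectors saved by the first, so the semantic vectors in the two resulting .sspace files are directly comparable.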
Each run produces a semantic space file, `random-indexing.sspace`, in the provided output directory. If `overwrite` is set to `false`, a new file is created in the output directory without overwriting any existing semantic space files.
Example:
java -server -Xmx2g -jar random-indexing.jar --docFile=my_corpus.txt --threads=4 --verbose output-directory/
This runs the RandomIndexingMain class using the server JVM with a 2GB heap. The corpus is specified as `my_corpus.txt` and is processed with each line being a different document. The resulting `random-indexing.sspace` file will be placed in `output-directory/`. See [FileFormats] for exact details on the output file formatting.
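The resulting file can also be read back in Java. The sketch below assumes the SemanticSpaceIO.load and Similarity.cosineSimilarity helpers from the S-Space apidocs; the word pair is arbitrary and the path matches the example above. Check the javadoc of your version for the exact signatures.

```java
import edu.ucla.sspace.common.SemanticSpace;
import edu.ucla.sspace.common.SemanticSpaceIO;
import edu.ucla.sspace.common.Similarity;

// Sketch: load the generated .sspace file and compare two word vectors.
public class LoadSpaceSketch {
    public static void main(String[] args) throws Exception {
        SemanticSpace sspace =
            SemanticSpaceIO.load("output-directory/random-indexing.sspace");
        double sim = Similarity.cosineSimilarity(
            sspace.getVector("fox"), sspace.getVector("dog"));
        System.out.println("cosine(fox, dog) = " + sim);
    }
}
```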