Guenter Neumann edited this page Nov 6, 2013 · 1 revision

The edit distance EDA casts textual entailment as the problem of mapping the whole content of H into the content of T. Mappings are performed as sequences of editing operations (i.e. insertion, deletion and substitution of text portions) needed to transform T into H, where each edit operation has a cost associated with it. Different cost assignments yield different distance functions. These are the most commonly used:

• Levenshtein or edit distance (Levenshtein 1965): allows insertions, deletions and substitutions. In the simplified definition, all the operations cost 1. This can be rephrased as the minimal number of insertions, deletions and substitutions to make two strings equal. The distance is symmetric.

• Hamming distance (Sankoff and Kruskal 1983): allows only substitutions, which cost 1. The distance is symmetric.

• Episode distance (Das et al. 1997): allows only insertions, which cost 1. This distance is not symmetric.
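The three distances above can all be seen as one weighted edit distance with different operation costs. The following is a minimal sketch of that idea, not the EOP implementation: the function name `weighted_edit_distance`, the constant `BLOCKED` and the example sentences are illustrative assumptions. It works over token sequences and computes the cost of transforming T into H.

```python
from typing import Sequence


def weighted_edit_distance(t: Sequence[str], h: Sequence[str],
                           insert: float = 1.0, delete: float = 1.0,
                           substitute: float = 1.0) -> float:
    """Minimal total cost of transforming t into h (hypothetical helper)."""
    # prev[j]: cost of transforming the first i-1 items of t into the first j items of h.
    prev = [j * insert for j in range(len(h) + 1)]
    for i in range(1, len(t) + 1):
        curr = [i * delete]  # turning i items of t into an empty h needs i deletions
        for j in range(1, len(h) + 1):
            same = t[i - 1] == h[j - 1]
            keep_or_substitute = prev[j - 1] + (0.0 if same else substitute)
            curr.append(min(keep_or_substitute,
                            prev[j] + delete,       # drop t[i-1]
                            curr[j - 1] + insert))  # add h[j-1]
        prev = curr
    return prev[-1]


BLOCKED = 1000.0  # a cost high enough to effectively forbid an operation

T = "the cat sat on the mat".split()
H = "the cat lay on a mat".split()

# Levenshtein: all three operations cost 1.
levenshtein = weighted_edit_distance(T, H)
# Hamming: only substitutions are affordable.
hamming = weighted_edit_distance(T, H, insert=BLOCKED, delete=BLOCKED)
# Episode: only insertions are affordable, so the direction matters.
short, long_ = "cat sat".split(), "the cat sat down".split()
episode = weighted_edit_distance(short, long_, delete=BLOCKED, substitute=BLOCKED)
episode_reverse = weighted_edit_distance(long_, short, delete=BLOCKED, substitute=BLOCKED)
```

Going from the short sequence to the long one only needs two insertions, whereas the reverse direction is forced to use the blocked deletions. This is why the episode distance is not symmetric.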

Two different EDAs are available with the current release of the software: EditDistanceEDA and EditDistancePSOEDA.

• EditDistanceEDA uses the weights of the edit distance operations as specified in its configuration file.

• EditDistancePSOEDA calculates the weights automatically by using Particle Swarm Optimization (PSO) (Kennedy, J.; Eberhart, R. (1995). Particle Swarm Optimization. Proceedings of IEEE International Conference on Neural Networks IV. pp. 1942–1948). PSO works with a population (swarm) of candidate solutions (particles). The particles are moved around the search space according to a few simple formulae. Each particle's movement is guided by its own best known position in the search space as well as by the best known position of the entire swarm. When improved positions are discovered, they in turn guide the movements of the swarm. The process is repeated until a stopping criterion is met. It is hoped, but not guaranteed, that a satisfactory solution will eventually be discovered.
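The PSO loop described above can be sketched as follows. This is an illustrative sketch only, not the code used by EditDistancePSOEDA: the function `pso_minimize`, the inertia, cognitive and social constants, and the toy objective are all assumptions. In the EDA the objective would presumably be an error measure computed on the training pairs; here a simple quadratic stands in for it.

```python
import random
from typing import Callable, List, Sequence, Tuple


def pso_minimize(objective: Callable[[Sequence[float]], float],
                 bounds: Sequence[Tuple[float, float]],
                 swarm_size: int = 20,
                 max_iteration: int = 100,
                 error_tolerance: float = 1e-6,
                 seed: int = 0) -> Tuple[List[float], float]:
    """Return (best position, best score) found by a basic global-best PSO."""
    rng = random.Random(seed)
    inertia, cognitive, social = 0.7, 1.5, 1.5  # illustrative constants

    # Start each particle at a random position inside the allowed ranges.
    positions = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(swarm_size)]
    velocities = [[0.0] * len(bounds) for _ in range(swarm_size)]
    best_positions = [p[:] for p in positions]
    best_scores = [objective(p) for p in positions]
    g = min(range(swarm_size), key=lambda i: best_scores[i])
    swarm_best, swarm_score = best_positions[g][:], best_scores[g]

    for _ in range(max_iteration):
        if swarm_score <= error_tolerance:
            break  # good enough: stop early
        for i in range(swarm_size):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # Pull towards the particle's own best and the swarm's best.
                velocities[i][d] = (inertia * velocities[i][d]
                                    + cognitive * r1 * (best_positions[i][d] - positions[i][d])
                                    + social * r2 * (swarm_best[d] - positions[i][d]))
                # Move, then clamp to the allowed range.
                positions[i][d] = min(hi, max(lo, positions[i][d] + velocities[i][d]))
            score = objective(positions[i])
            if score < best_scores[i]:
                best_positions[i], best_scores[i] = positions[i][:], score
                if score < swarm_score:
                    swarm_best, swarm_score = positions[i][:], score
    return swarm_best, swarm_score


def toy_error(weights: Sequence[float]) -> float:
    """Stand-in objective with its minimum at (0.4, 0.7, 1.2); purely hypothetical."""
    target = (0.4, 0.7, 1.2)
    return sum((w - t) ** 2 for w, t in zip(weights, target))


# One value range per edit operation (delete, insert, substitute).
best_weights, best_score = pso_minimize(toy_error, bounds=[(0.0, 2.0)] * 3)
```

The arguments `swarm_size`, `max_iteration` and `error_tolerance`, and the per-operation value ranges, loosely mirror the swarmSize, maxIteration, errorTolerance and *ValuesRange properties of the configuration file.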

Both EditDistanceEDA and EditDistancePSOEDA use the calculation made by the distance components to predict entailment/non-entailment relations among T-H pairs. The available components that can be used with Edit Distance are:

• FixedWeightTokenEditDistance: a token-based version of the Levenshtein distance algorithm, with edit operations defined over sequences of tokens of T and H.

• FixedWeightLemmaEditDistance: a token-based version of the Levenshtein distance algorithm, with edit operations defined over sequences of lemmas of tokens of T and H.
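This page does not spell out how a distance value is turned into an entailment decision. One common approach, assumed here purely for illustration, is to learn a threshold on the distance from the training pairs and to predict entailment whenever the distance between T and H does not exceed it. The names `learn_threshold` and `predict` and the example distances below are hypothetical.

```python
from typing import List, Tuple


def learn_threshold(examples: List[Tuple[float, bool]]) -> float:
    """Pick the distance threshold that maximises accuracy on the training pairs.

    Each example is (distance between T and H, True if T entails H).
    """
    def accuracy(threshold: float) -> float:
        correct = sum((distance <= threshold) == label for distance, label in examples)
        return correct / len(examples)

    # Only the observed distances need to be tried as candidate thresholds.
    candidates = sorted({distance for distance, _ in examples})
    return max(candidates, key=accuracy)


def predict(distance: float, threshold: float) -> bool:
    """Predict entailment when T can be transformed into H cheaply enough."""
    return distance <= threshold


# Hypothetical (distance, label) pairs as a distance component might produce them.
training = [(0.10, True), (0.20, True), (0.35, True),
            (0.40, False), (0.60, False), (0.90, False)]
threshold = learn_threshold(training)
```

Optimizing f1 instead of accuracy would only change the scoring function used inside `learn_threshold`.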

Running EditDistanceEDA and EditDistancePSOEDA requires the data sets to be tokenized and their tokens to be annotated with part-of-speech tags. When FixedWeightLemmaEditDistance is used, the lemmas of the tokens are required too. In addition, since the edit distance components can use external resources such as WordNet and Wikipedia in their calculations, these resources may need to be installed. The remainder of this document describes the possible configurations for EditDistanceEDA and EditDistancePSOEDA.

Configuration File

We provide 3 configuration files that are distributed with the eop-resources package. The files are ready to be used. The only thing that one needs to set is the path of the models that are provided with the eop-resources package. These are the available models:

• EditDistanceEDA_DE.xml (model for German language)

• EditDistanceEDA_EN.xml (model for English language)

• EditDistanceEDA_IT.xml (model for Italian language)

Each of the files (a file for each of the 3 different supported languages: English, German and Italian) contains different instances of the algorithm that can be tested. The structure and values in these configuration files are explained in the table below.

Issues using the configuration files available as part of EOP resources v1.0.2

The edit distance configuration files contained in the resources package v1.0.2 (i.e. eop-resources-1.0.2.tar.gz) are not compatible with the code of release v1.0.2 and, as a result, cannot be used. As a temporary solution we made them available from hlt-services4.fbk.eu:8080/artifactory/tmp/edits_resources.tgz. This file contains both the configuration files and the models that users have to use with release v1.0.2. It is sufficient to replace the files contained in the resources package v1.0.2 with them: the configuration files have to be put in configuration-files in eop-resources-1.0.2/, whereas the model files go in the model directory of eop-resources-1.0.2/.

Common settings

Section Property Value Requirement
PlatformConfiguration activatedEDA The common setting for selecting the EDA. The default value here is eu.excitementproject.eop.core.EditDistanceEDA. N/A
PlatformConfiguration language For the moment, EditDistanceEDA as well as EditDistancePSOEDA support English (EN), German (DE), and Italian (IT). In principle, the EDA is language-independent. N/A
PlatformConfiguration activatedLAP The linguistic analysis pipeline needed to produce input for the EDA. N/A
eu.excitementproject.eop.core.[EditDistanceEDA|EditDistancePSOEDA] modelFile The name of the model learnt by the EDA during the training phase. The EDA uses the model to annotate the test data. Model names follow an informative convention: they include the specified model file name (e.g. EditDistanceEDA_IT_Model), the name of the component used by the EDA and the selected instance. For example, if EditDistanceEDA_IT_Model is the name of the model as specified in the configuration file and FixedWeightTokenEditDistance is the component to be used with its basic instance, then the saved model will be called EditDistanceEDA_IT_Model_FixedWeightTokenEditDistance_basic. For training, the model file should NOT exist.
eu.excitementproject.eop.core.[EditDistanceEDA|EditDistancePSOEDA] trainDir The directory containing the training data, as produced by the LAP (in xmi format). The directory should exist.
eu.excitementproject.eop.core.[EditDistanceEDA|EditDistancePSOEDA] testDir The directory containing the test data, as produced by the LAP (in xmi format). The directory should exist.
eu.excitementproject.eop.core.[EditDistanceEDA|EditDistancePSOEDA] measure The measure to be optimized: accuracy or f1. N/A
eu.excitementproject.eop.core.[EditDistanceEDA|EditDistancePSOEDA] components The component used by the EDA for distance computations. The components may themselves require additional parameters, which are specified in sections specific to each of them. These sections are identified by the name of the component given as the value of this XML tag. These components are available:
  1. FixedWeightTokenEditDistance
  2. FixedWeightLemmaEditDistance
N/A
eu.excitementproject.eop.core.EditDistanceEDA weights Real-valued weights for each string edit operation, used by the distance computation component. Levenshtein distance is obtained by setting the weights of the deletion, insertion and substitution operations to 1. Hamming distance is obtained by setting substitution to 1 and deletion and insertion to a much higher value (e.g. 1000). Episode distance is obtained by setting insertion to 1 and deletion and substitution to a much higher value (e.g. 1000). N/A
eu.excitementproject.eop.core.EditDistancePSOEDA maxIteration The maximum number of PSO iterations. N/A
eu.excitementproject.eop.core.EditDistancePSOEDA errorTolerance PSO stops when the error tolerance is satisfied. N/A
eu.excitementproject.eop.core.EditDistancePSOEDA maxIterationWithoutChanges PSO stops after this many iterations without any change in the results. N/A
eu.excitementproject.eop.core.EditDistancePSOEDA swarmSize The swarm size (i.e. the number of particles). N/A
eu.excitementproject.eop.core.EditDistancePSOEDA processors The number of processors to be used to run PSO. N/A
eu.excitementproject.eop.core.EditDistancePSOEDA deleteValuesRange The range of values from which PSO selects the weight of the delete operation. N/A
eu.excitementproject.eop.core.EditDistancePSOEDA insertValuesRange The range of values from which PSO selects the weight of the insert operation. N/A
eu.excitementproject.eop.core.EditDistancePSOEDA substituteValuesRange The range of values from which PSO selects the weight of the substitute operation. N/A
eu.excitementproject.eop.core.component.distance.FixedWeightTokenEditDistance instances The component computes the distance between two strings using the tokens and fixed weights for the string edit operations. The instance specifies the value of a subsection, which contains the parameters needed to use this component. The instance to be used can be one of these 4:
  1. basic
  2. wordnet
  3. wikipedia
  4. wordnet,wikipedia
basic means that only the tokens will be used to compute the edit distance between T and H. wordnet means that FixedWeightTokenEditDistance will use WordNet as an external resource to compute the edit distance between T and H. Use wikipedia if you want to compute the edit distance between T and H by considering the rules extracted from Wikipedia. wordnet,wikipedia can be used to combine the information provided by the two resources. However, with FixedWeightTokenEditDistance the rules in WordNet and Wikipedia are looked up using the tokens and not the lemmas, as is done by FixedWeightLemmaEditDistance. As a result, the contribution of these resources may be limited. To be able to use this component, the LAP should provide token and part-of-speech annotations (currently only TreeTagger provides this for all three languages, and TextPro for Italian).
N/A
eu.excitementproject.eop.core.component.distance.FixedWeightLemmaEditDistance instances The component computes the distance between two strings using the lemmas and fixed weights for the string edit operations. The instance specifies the value of a subsection, which contains the parameters needed to use this component. The instance to be used can be one of these 4:
  1. basic
  2. wordnet
  3. wikipedia
  4. wordnet,wikipedia
basic means that only the lemmas will be used to compute the edit distance between T and H. wordnet means that FixedWeightLemmaEditDistance will use WordNet as an external resource to compute the edit distance between T and H. Use wikipedia if you want to compute the edit distance between T and H by considering the rules extracted from Wikipedia. wordnet,wikipedia can be used to combine the information provided by the two resources. To be able to use this component, the LAP should provide token, lemma and part-of-speech annotations (currently only TreeTagger provides this for all three languages, and TextPro for Italian).
N/A
basic / wordnet / wikipedia stopWordRemoval Can be true or false, and tells the distance computation component whether or not to filter stop words.
wordnet path The path to the particular WordNet resource used. The English WordNet is freely distributed and is included in the release. The Italian WordNet is also free but must be obtained on request from FBK. Details are provided in the Doc for the Italian knowledge resources. GermaNet is proprietary. Details about the resource and how to obtain it are provided in the Doc for the German knowledge resources. WordNet has to be installed.
wikipedia path The path to the particular Wikipedia resource used. Italian and English Wikipedia can be used. Wikipedia has to be installed.
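To show how the sections and properties in the table relate to each other, the sketch below reads a small configuration and follows the chain from the activated EDA to its component and instance. The XML layout (`configuration`, `section`, `subsection` and `property` elements with a `name` attribute) is an assumption made for illustration and may differ from the files shipped with eop-resources. The helper `read_sections` is hypothetical; only the section names, property names and values come from the table.

```python
import xml.etree.ElementTree as ET

# Assumed layout; consult the distributed EditDistanceEDA_*.xml files for the real one.
CONFIG = """
<configuration>
  <section name="PlatformConfiguration">
    <property name="activatedEDA">eu.excitementproject.eop.core.EditDistanceEDA</property>
    <property name="language">EN</property>
  </section>
  <section name="eu.excitementproject.eop.core.EditDistanceEDA">
    <property name="measure">accuracy</property>
    <property name="components">FixedWeightTokenEditDistance</property>
  </section>
  <section name="eu.excitementproject.eop.core.component.distance.FixedWeightTokenEditDistance">
    <property name="instances">basic</property>
    <subsection name="basic">
      <property name="stopWordRemoval">true</property>
    </subsection>
  </section>
</configuration>
"""


def read_sections(xml_text: str) -> dict:
    """Map each section name to its direct properties and named subsections."""
    root = ET.fromstring(xml_text.strip())
    sections = {}
    for section in root.findall("section"):
        props = {p.get("name"): (p.text or "").strip() for p in section.findall("property")}
        subs = {
            sub.get("name"): {p.get("name"): (p.text or "").strip()
                              for p in sub.findall("property")}
            for sub in section.findall("subsection")
        }
        sections[section.get("name")] = {"properties": props, "subsections": subs}
    return sections


sections = read_sections(CONFIG)
eda = sections["PlatformConfiguration"]["properties"]["activatedEDA"]
component = sections[eda]["properties"]["components"]
# The component's own section is identified by the component name.
component_section = sections["eu.excitementproject.eop.core.component.distance." + component]
instance = component_section["properties"]["instances"]
stop_word_removal = component_section["subsections"][instance]["stopWordRemoval"] == "true"
```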