Edit Distance EDA
The edit distance EDA casts textual entailment as the problem of mapping the whole content of the hypothesis (H) into the content of the text (T). Mappings are performed as sequences of editing operations (i.e. insertion, deletion and substitution of text portions) needed to transform T into H, where each edit operation has an associated cost. Different costs yield different distance functions. These are the most commonly used:
• Levenshtein or edit distance (Levenshtein 1965): allows insertions, deletions and substitutions. In the simplified definition, all the operations cost 1. This can be rephrased as the minimal number of insertions, deletions and substitutions to make two strings equal. The distance is symmetric.
• Hamming distance (Sankoff and Kruskal 1983): allows only substitutions, which cost 1. The distance is symmetric.
• Episode distance (Das et al. 1997): allows only insertions which cost 1. This distance is not symmetric.
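The three distances above can be seen as one dynamic-programming routine with different operation costs. The following is a minimal sketch, not the platform's implementation: the function name `weighted_edit_distance`, its parameters, and the value `BIG = 1000.0` used to effectively forbid an operation are illustrative assumptions.

```python
from typing import Sequence


def weighted_edit_distance(source: Sequence, target: Sequence,
                           ins_cost: float = 1.0,
                           del_cost: float = 1.0,
                           sub_cost: float = 1.0) -> float:
    """Minimal total cost of the edits that turn `source` into `target`."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] holds the cost of transforming source[:i] into target[:j].
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * del_cost          # delete every element of source[:i]
    for j in range(1, cols):
        dist[0][j] = j * ins_cost          # insert every element of target[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + del_cost,                          # deletion
                dist[i][j - 1] + ins_cost,                          # insertion
                dist[i - 1][j - 1] + (0.0 if same else sub_cost),   # match or substitution
            )
    return dist[-1][-1]


BIG = 1000.0  # assumed "very expensive" cost that effectively disables an operation

# Levenshtein: all three operations cost 1.
levenshtein = weighted_edit_distance("kitten", "sitting")

# Hamming: only substitutions are cheap.
hamming = weighted_edit_distance("karolin", "kathrin", ins_cost=BIG, del_cost=BIG)

# Episode: only insertions are cheap, so the result depends on the direction.
short = "the cat sat".split()
long_ = "the cat sat on the mat".split()
episode_short_to_long = weighted_edit_distance(short, long_, del_cost=BIG, sub_cost=BIG)
episode_long_to_short = weighted_edit_distance(long_, short, del_cost=BIG, sub_cost=BIG)
```

Because the routine only compares elements for equality, it works on characters as well as on lists of tokens. The two episode values differ, which illustrates why that distance is not symmetric.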
Two different EDAs are available with the current release of the software: EditDistanceEDA and EditDistancePSOEDA.
• EditDistanceEDA uses the weights of the edit distance operations as specified in its configuration file.
• EditDistancePSOEDA calculates the weights automatically by using Particle Swarm Optimization (PSO): Kennedy, J.; Eberhart, R. (1995). Particle Swarm Optimization. Proceedings of IEEE International Conference on Neural Networks IV. pp. 1942–1948. PSO maintains a population (swarm) of candidate solutions (particles). The particles are moved around in the search space according to a few formulae. Their movements are guided by each particle's own best known position in the search space as well as by the entire swarm's best known position. When improved positions are discovered, they come to guide the movements of the swarm. The process is repeated until a stopping criterion is met. It is hoped, but not guaranteed, that a satisfactory solution will eventually be discovered.
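The PSO loop described above can be sketched as follows. This is an illustration under stated assumptions, not the code used by EditDistancePSOEDA: it uses a common inertia-weight variant of PSO, the coefficients (`inertia`, `cognitive`, `social`) are conventional textbook choices, and the toy objective stands in for the real one, which would score a candidate set of edit weights on the training data. The parameter names only loosely mirror the configuration properties described later.

```python
import random


def pso_minimize(objective, bounds, swarm_size=20, max_iterations=100,
                 error_tolerance=1e-9, inertia=0.7, cognitive=1.5, social=1.5,
                 seed=0):
    """Minimize `objective` over the box given by `bounds` = [(low, high), ...]."""
    rng = random.Random(seed)
    dim = len(bounds)
    positions = [[rng.uniform(lo, hi) for lo, hi in bounds]
                 for _ in range(swarm_size)]
    velocities = [[0.0] * dim for _ in range(swarm_size)]
    # Each particle remembers its own best position and score.
    best_positions = [p[:] for p in positions]
    best_scores = [objective(p) for p in positions]
    # The swarm remembers the best position seen by any particle.
    g = min(range(swarm_size), key=best_scores.__getitem__)
    global_best, global_score = best_positions[g][:], best_scores[g]

    for _ in range(max_iterations):
        if global_score <= error_tolerance:
            break  # stopping criterion: the error is small enough
        for i in range(swarm_size):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                velocities[i][d] = (
                    inertia * velocities[i][d]
                    + cognitive * r1 * (best_positions[i][d] - positions[i][d])
                    + social * r2 * (global_best[d] - positions[i][d])
                )
                # Move the particle and keep it inside the allowed range.
                positions[i][d] = min(hi, max(lo, positions[i][d] + velocities[i][d]))
            score = objective(positions[i])
            if score < best_scores[i]:
                best_positions[i], best_scores[i] = positions[i][:], score
                if score < global_score:
                    global_best, global_score = positions[i][:], score
    return global_best, global_score


# Hypothetical target: pretend the ideal (delete, insert, substitute) weights are known.
TARGET = (1.0, 1.0, 2.0)


def toy_error(weights):
    return sum((w - t) ** 2 for w, t in zip(weights, TARGET))


value_ranges = [(0.0, 10.0)] * 3  # one range per edit operation
weights, error = pso_minimize(toy_error, value_ranges)
print(weights, error)
```

In a real setting the objective would be expensive (one training and evaluation run per particle per iteration), which is why the swarm size and the iteration limits matter in practice.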
Both EditDistanceEDA and EditDistancePSOEDA use the calculation made by the distance components to predict entailment/non-entailment relations among T-H pairs. The available components that can be used with Edit Distance are:
• FixedWeightTokenEditDistance: a token-based version of the Levenshtein distance algorithm, with edit operations defined over sequences of tokens of T and H.
• FixedWeightLemmaEditDistance: a token-based version of the Levenshtein distance algorithm, with edit operations defined over sequences of lemmas of tokens of T and H.
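This page does not spell out how a distance value is turned into an entailment decision. One simple possibility, assumed here purely for illustration, is to learn a threshold on the distance from the training pairs and to predict entailment whenever the distance does not exceed it. The function `best_threshold` and the toy data below are hypothetical.

```python
def best_threshold(distances, labels):
    """Return (threshold, accuracy) for the rule: entailment iff distance <= threshold."""
    best = (None, -1.0)
    # Only the observed distance values need to be tried as candidate thresholds.
    for candidate in sorted(set(distances)):
        correct = sum((d <= candidate) == label
                      for d, label in zip(distances, labels))
        accuracy = correct / len(labels)
        if accuracy > best[1]:          # keep the first (smallest) best threshold
            best = (candidate, accuracy)
    return best


# Toy training data: one distance per T-H pair, True meaning "entailment".
distances = [0.10, 0.25, 0.30, 0.55, 0.70, 0.90]
labels = [True, True, True, False, True, False]

threshold, accuracy = best_threshold(distances, labels)
print(threshold, accuracy)
```

The sketch optimizes accuracy; optimizing F1 instead would only change how each candidate threshold is scored.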
Running EditDistanceEDA and EditDistancePSOEDA requires the data sets to be tokenized and their tokens annotated with part-of-speech tags. When FixedWeightLemmaEditDistance is used, the lemmas of the tokens are required too. In addition, because the edit distance components can use external resources such as WordNet and Wikipedia in their calculation, these resources may need to be installed. The remainder of this document describes the possible configurations for EditDistanceEDA and EditDistancePSOEDA.
We provide 3 configuration files that are distributed with the eop-resources package. The files are ready to be used. The only thing that needs to be set is the path of the models that are provided with the eop-resources package. These are the available configuration files:
• EditDistanceEDA_DE.xml (configuration for German)
• EditDistanceEDA_EN.xml (configuration for English)
• EditDistanceEDA_IT.xml (configuration for Italian)
Each of the files (a file for each of the 3 different supported languages: English, German and Italian) contains different instances of the algorithm that can be tested. The structure and values in these configuration files are explained in the table below.
The edit distance configuration files contained in the resources package v1.0.2 (i.e. eop-resources-1.0.2.tar.gz) are not compatible with the code of release v1.0.2 and therefore cannot be used. As a temporary solution we made them available from hlt-services4.fbk.eu:8080/artifactory/tmp/edits_resources.tgz. This file contains both the configuration files and the models that users have to use with release v1.0.2. It is sufficient to replace the files contained in the resources package v1.0.2 with them: the configuration files go in configuration-files in eop-resources-1.0.2/, and the model files go in the model directory of eop-resources-1.0.2/.
Section | Property | Value | Requirement |
---|---|---|---|
PlatformConfiguration | activatedEDA | The common setting for selecting the EDA. The default value here is eu.excitementproject.eop.core.EditDistanceEDA. | N/A |
PlatformConfiguration | language | For the moment, EditDistanceEDA as well as EditDistancePSOEDA support English (EN), German (DE), and Italian (IT). In principle, the EDA is language-independent. | N/A |
PlatformConfiguration | activatedLAP | The linguistic analysis pipeline needed to produce input for the EDA. | N/A |
eu.excitementproject.eop.core.<br /> [EditDistanceEDA|EditDistancePSOEDA] | modelFile | The name of the model learnt by the EDA during the training phase. The EDA uses the model to annotate the test data. By convention, models get informative names: the specified model file name (e.g. EditDistanceEDA_IT_Model) followed by the name of the component used by the EDA and the selected instance. For example, if EditDistanceEDA_IT_Model is the model name specified in the configuration file and FixedWeightTokenEditDistance is the component, used with its basic instance, the saved model is called EditDistanceEDA_IT_Model_FixedWeightTokenEditDistance_basic. | For training, the model file should NOT exist. |
eu.excitementproject.eop.core.<br /> [EditDistanceEDA|EditDistancePSOEDA] | trainDir | The directory containing the training data, as produced by the LAP (in xmi format). | The directory should exist. |
eu.excitementproject.eop.core.<br /> [EditDistanceEDA|EditDistancePSOEDA] | testDir | The directory containing the test data, as produced by the LAP (in xmi format). | The directory should exist. |
eu.excitementproject.eop.core.<br /> [EditDistanceEDA|EditDistancePSOEDA] | measure | The measure to be optimized: accuracy or f1. | N/A |
eu.excitementproject.eop.core.<br /> [EditDistanceEDA|EditDistancePSOEDA] | components | The component used by the EDA for distance computations. A component may itself require additional parameters, which are specified in a section of its own. That section is identified by the name of the component given as the value of this XML tag. These components are available: FixedWeightTokenEditDistance and FixedWeightLemmaEditDistance. | N/A |
eu.excitementproject.eop.core.<br /> EditDistanceEDA | weights | These are real-valued weights for each string edit operation, used by the distance computation component. Levenshtein distance is obtained by setting the weights of deletion, insertion and substitution to 1; Hamming distance, by setting substitution to 1 and deletion and insertion to a much higher value (e.g. 1000); Episode distance, by setting insertion to 1 and deletion and substitution to a much higher value (e.g. 1000). | N/A |
eu.excitementproject.eop.core.<br /> EditDistancePSOEDA | maxIteration | The maximum number of PSO iterations. | N/A |
eu.excitementproject.eop.core.<br /> EditDistancePSOEDA | errorTolerance | PSO stops when the error tolerance is satisfied. | N/A |
eu.excitementproject.eop.core.<br /> EditDistancePSOEDA | maxIterationWithoutChanges | PSO stops after this many consecutive iterations without any change in the results. | N/A |
eu.excitementproject.eop.core.<br /> EditDistancePSOEDA | swarmSize | The swarm size (i.e. the number of particles). | N/A |
eu.excitementproject.eop.core.<br /> EditDistancePSOEDA | processors | The number of processors to be used to run PSO. | N/A |
eu.excitementproject.eop.core.<br /> EditDistancePSOEDA | deleteValuesRange | The range of values of the delete edit distance operation that PSO has to select a value from. | N/A |
eu.excitementproject.eop.core.<br /> EditDistancePSOEDA | insertValuesRange | The range of values of the insert edit distance operation that PSO has to select a value from. | N/A |
eu.excitementproject.eop.core.<br /> EditDistancePSOEDA | substituteValuesRange | The range of values of the substitute edit distance operation that PSO has to select a value from. | N/A |
eu.excitementproject.eop.core.<br /> component.distance.<br /> FixedWeightTokenEditDistance | instances | The component computes the distance between two strings using the tokens and fixed weights for the string edit operations. The instance names a subsection that contains the parameters needed to use this component. The instance to be used can be one of 4 options, among them basic, wordnet and wikipedia. | N/A |
eu.excitementproject.eop.core.<br /> component.distance.<br /> FixedWeightLemmaEditDistance | instances | The component computes the distance between two strings using the lemmas and fixed weights for the string edit operations. The instance names a subsection that contains the parameters needed to use this component. The instance to be used can be one of 4 options, among them basic, wordnet and wikipedia. | N/A |
basic / wordnet / wikipedia | stopWordRemoval | Can be true or false; tells the distance computation component whether to filter out stop words. | |
wordnet | path | The path to the particular WordNet resource used. The English WordNet is freely distributed and is included in the release. The Italian WordNet is also free but must be requested from FBK. Details are provided in the Doc for the Italian knowledge resources. GermaNet is proprietary. Details about the resource and how to obtain it are provided in the Doc for the German knowledge resources. | WordNet has to be installed. |
wikipedia | path | The path to the particular Wikipedia resource used. Italian and English Wikipedia can be used. | Wikipedia has to be installed. |
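
To make the relation between sections, properties and instances in the table more concrete, the sketch below follows the chain from the selected EDA to its component and then to the component's instance. The XML layout (`section`, `subsection` and `property` elements with a `name` attribute) is an assumption made for this example and may differ from the distributed configuration files; `load_sections` is a hypothetical helper, not part of the platform.

```python
import xml.etree.ElementTree as ET

# Assumed layout, for illustration only; check the distributed files for the real one.
CONFIG = """
<configuration>
  <section name="PlatformConfiguration">
    <property name="activatedEDA">eu.excitementproject.eop.core.EditDistanceEDA</property>
    <property name="language">EN</property>
  </section>
  <section name="eu.excitementproject.eop.core.EditDistanceEDA">
    <property name="measure">accuracy</property>
    <property name="components">eu.excitementproject.eop.core.component.distance.FixedWeightTokenEditDistance</property>
  </section>
  <section name="eu.excitementproject.eop.core.component.distance.FixedWeightTokenEditDistance">
    <property name="instances">basic</property>
    <subsection name="basic">
      <property name="stopWordRemoval">false</property>
    </subsection>
  </section>
</configuration>
"""


def read_properties(element):
    """Collect the direct <property> children of an element as a dict."""
    return {p.get("name"): (p.text or "").strip()
            for p in element.findall("property")}


def load_sections(xml_text):
    """Map 'section' and 'section/subsection' names to their properties."""
    root = ET.fromstring(xml_text.strip())
    sections = {}
    for section in root.findall("section"):
        name = section.get("name")
        sections[name] = read_properties(section)
        for sub in section.findall("subsection"):
            sections[f"{name}/{sub.get('name')}"] = read_properties(sub)
    return sections


sections = load_sections(CONFIG)
eda = sections["PlatformConfiguration"]["activatedEDA"]   # which EDA to run
component = sections[eda]["components"]                    # its distance component
instance = sections[component]["instances"]                # the component's instance
remove_stop_words = sections[f"{component}/{instance}"]["stopWordRemoval"] == "true"
print(eda, component, instance, remove_stop_words)
```

Each lookup uses the value of one property as the name of the next section, which is the naming scheme the table describes for components and instances.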