EditDistance
EditDistance is a fairly basic implementation of the system described in Kouylekov and Magnini (2005), Recognizing textual entailment with tree edit distance algorithms. It casts textual entailment as the problem of mapping the whole content of H into the content of T. Mappings are performed as sequences of editing operations (i.e. insertion, deletion, substitution of text portions) needed to transform T into H, where each edit operation has an associated cost. Different cost assignments yield different distance functions. These are the most commonly used:
• Levenshtein or edit distance (Levenshtein 1965): allows insertions, deletions and substitutions. In the simplified definition, all the operations cost 1. This can be rephrased as the minimal number of insertions, deletions and substitutions to make two strings equal. The distance is symmetric.
• Hamming distance (Sankoff and Kruskal 1983): allows only substitutions, which cost 1. The distance is symmetric.
• Episode distance (Das et al. 1997): allows only insertions, which cost 1. This distance is not symmetric.
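The three distances above can be viewed as one weighted edit distance with different operation costs. The following is a minimal sketch of that idea, not the platform's own code: the function name `weighted_edit_distance` and its parameters are hypothetical, and the "much higher" cost of 1000 used to switch an operation off is only an illustrative value.

```python
from typing import Sequence


def weighted_edit_distance(
    t: Sequence[str],
    h: Sequence[str],
    delete: float = 1.0,
    insert: float = 1.0,
    substitute: float = 1.0,
    match: float = 0.0,
) -> float:
    """Minimal total cost of the edit operations that transform t into h."""
    # dist[i][j] = cost of transforming the first i items of t
    # into the first j items of h.
    dist = [[0.0] * (len(h) + 1) for _ in range(len(t) + 1)]
    for i in range(1, len(t) + 1):
        dist[i][0] = dist[i - 1][0] + delete
    for j in range(1, len(h) + 1):
        dist[0][j] = dist[0][j - 1] + insert
    for i in range(1, len(t) + 1):
        for j in range(1, len(h) + 1):
            same = t[i - 1] == h[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + delete,      # drop an item of t
                dist[i][j - 1] + insert,      # add an item of h
                dist[i - 1][j - 1] + (match if same else substitute),
            )
    return dist[len(t)][len(h)]


t = "the cat sat on the mat".split()
h = "the cat lay on a mat".split()

# All operations cost 1: Levenshtein distance over tokens.
levenshtein = weighted_edit_distance(t, h)
# Only substitutions are affordable: Hamming-like behaviour.
hamming_like = weighted_edit_distance(t, h, delete=1000, insert=1000)
# Only insertions are affordable: episode-like behaviour.
episode_like = weighted_edit_distance(
    "a c".split(), "a b c".split(), delete=1000, substitute=1000
)
```

Because Python strings are sequences of one-character strings, the same function also works at character level, e.g. `weighted_edit_distance("kitten", "sitting")`.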
Given a certain configuration, the Edit Distance EDA can be trained over a specific data set in order to optimize its performance. In the training phase this class produces a distance model for the data set, which includes a distance threshold that best separates the positive and negative examples in the training data. During the test phase it applies the calculated threshold, so that pairs resulting in a distance below the threshold are classified as ENTAILMENT, while pairs above the threshold are classified as NONENTAILMENT.
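The training and test phases described above can be sketched as a search for the threshold that best separates the training distances. This is only an illustration under stated assumptions: the helper names (`accuracy`, `train_threshold`, `classify`), the choice of candidate thresholds (midpoints between observed distances) and the handling of a distance exactly equal to the threshold are hypothetical and not taken from the EDA itself, and the sample distances are made up.

```python
from typing import List, Sequence, Tuple


def accuracy(distances: Sequence[float], labels: Sequence[bool], threshold: float) -> float:
    """Fraction of pairs classified correctly; True means ENTAILMENT."""
    hits = sum((d < threshold) == label for d, label in zip(distances, labels))
    return hits / len(distances)


def train_threshold(distances: Sequence[float], labels: Sequence[bool]) -> Tuple[float, float]:
    """Return the candidate threshold with the highest training accuracy."""
    values = sorted(set(distances))
    # Candidates: one below all distances, the midpoints, one above all.
    candidates: List[float] = [values[0] - 1.0]
    candidates += [(a + b) / 2 for a, b in zip(values, values[1:])]
    candidates.append(values[-1] + 1.0)
    best = max(candidates, key=lambda th: accuracy(distances, labels, th))
    return best, accuracy(distances, labels, best)


def classify(distance: float, threshold: float) -> str:
    """Apply a learned threshold to the distance of a new T/H pair."""
    return "ENTAILMENT" if distance < threshold else "NONENTAILMENT"


# Made-up normalized distances with their gold labels.
distances = [0.10, 0.25, 0.30, 0.55, 0.60, 0.80]
labels = [True, True, True, False, True, False]
threshold, training_accuracy = train_threshold(distances, labels)
decision = classify(0.20, threshold)
```

Optimizing F1 instead of accuracy would only require swapping the scoring function passed to `max`.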
As described in other sections of this documentation, EDAs are algorithms designed to take a decision (i.e. entailment/non-entailment), whereas components are algorithms that calculate measures used by EDAs to predict entailment/non-entailment relations. The rest of this section presents the available Edit Distance EDAs, the distance components that can be used with them, and the configuration files needed to run those EDAs.
Two different types of Edit Distance EDA are available with the current release of the software:
EditDistanceEDA uses the distance components with the weights of the edit distance operations as specified in its configuration file.
EditDistancePSOEDA calculates the weights automatically on a specified data set by using Particle Swarm Optimization (PSO): Kennedy, J.; Eberhart, R. (1995). Particle Swarm Optimization. Proceedings of IEEE International Conference on Neural Networks IV. pp. 1942–1948. PSO works with a population (swarm) of candidate solutions (particles). These particles are moved around in the search space according to a few formulae. The movement of each particle is guided by its own best known position in the search space as well as by the entire swarm's best known position. When improved positions are discovered, they come to guide the movements of the swarm. The process is repeated until a stopping criterion is met. It is hoped, but not guaranteed, that a satisfactory solution will eventually be discovered. The calculated weights are then passed to the distance components to perform their calculation.
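The PSO loop described above can be sketched in a few lines. This is a generic inertia-weight variant written for illustration, not the optimizer used by EditDistancePSOEDA: the function `pso_minimize`, the coefficient values, the fixed seed and the `toy_error` objective are all assumptions. In the EDA the objective would be derived from the performance obtained on the training data, and the only stopping criterion implemented here is the iteration cap.

```python
import random
from typing import Callable, List, Sequence, Tuple


def pso_minimize(
    objective: Callable[[List[float]], float],
    bounds: Sequence[Tuple[float, float]],
    swarm_size: int = 20,
    max_iteration: int = 20,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Search for a position inside `bounds` with a low objective value."""
    rng = random.Random(seed)
    inertia, cognitive, social = 0.7, 1.5, 1.5  # illustrative coefficients
    positions = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(swarm_size)]
    velocities = [[0.0] * len(bounds) for _ in range(swarm_size)]
    own_best = [p[:] for p in positions]
    own_score = [objective(p) for p in positions]
    leader = min(range(swarm_size), key=lambda i: own_score[i])
    swarm_best, swarm_score = own_best[leader][:], own_score[leader]

    for _ in range(max_iteration):
        for i in range(swarm_size):
            for d, (lo, hi) in enumerate(bounds):
                # Each particle is pulled towards its own best position
                # and towards the best position found by the whole swarm.
                pull_own = cognitive * rng.random() * (own_best[i][d] - positions[i][d])
                pull_swarm = social * rng.random() * (swarm_best[d] - positions[i][d])
                velocities[i][d] = inertia * velocities[i][d] + pull_own + pull_swarm
                # Keep the particle inside the allowed range of values.
                positions[i][d] = min(hi, max(lo, positions[i][d] + velocities[i][d]))
            score = objective(positions[i])
            if score < own_score[i]:
                own_best[i], own_score[i] = positions[i][:], score
            if score < swarm_score:
                swarm_best, swarm_score = positions[i][:], score
    return swarm_best, swarm_score


def toy_error(weights: List[float]) -> float:
    """Stand-in objective: squared distance from an arbitrary target."""
    target = (0.5, 2.5, 2.5)
    return sum((w - t) ** 2 for w, t in zip(weights, target))


# One range per edit operation: delete, insert, substitute.
ranges = [(0.0, 5.0)] * 3
best_weights, best_error = pso_minimize(toy_error, ranges)
```

The three ranges play the role of the value ranges from which the weights of the delete, insert and substitute operations are selected.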
These distance components can be used with the above Edit Distance EDAs to calculate a distance between T/H pairs.
FixedWeightTokenEditDistance is a token-based distance algorithm, with edit operations defined over sequences of tokens of T and H.
FixedWeightLemmaEditDistance is a token-based distance algorithm, with edit operations defined over sequences of lemmas of tokens of T and H.
Both components can be configured with different weights for the edit operations, so that they can produce a number of distance functions such as those reported above (i.e. Levenshtein, Hamming, Episode).
Configuration files are used to create and use a particular instance of an EDA. With them it is possible, for example, to specify which resources the EDA has to use (e.g. WordNet), the data set to be annotated and so on. With the platform we provide 3 configuration files for EditDistanceEDA and another 3 for EditDistancePSOEDA.
EditDistanceEDA files:
• EditDistanceEDA_DE.xml (configuration file for German language)
• EditDistanceEDA_EN.xml (configuration file for English language)
• EditDistanceEDA_IT.xml (configuration file for Italian language)
EditDistancePSOEDA files:
• EditDistancePSOEDA_DE.xml (configuration file for German language)
• EditDistancePSOEDA_EN.xml (configuration file for English language)
• EditDistancePSOEDA_IT.xml (configuration file for Italian language)
All these files are distributed as part of the eop-resources package and can be seen as the starting point for working with the Edit Distance EDAs. The structure and values of these configuration files are explained in the tables below, whereas Appendix A and Appendix B report an example configuration file for EditDistanceEDA and EditDistancePSOEDA respectively.
Configuration files can contain different sections. In the specific case of Edit Distance there is a global section (i.e. PlatformConfiguration) shared among all components and EDAs, a section (i.e. EditDistanceEDA) containing the parameters specific to the Edit Distance EDA, a section reserved for the edit distance component used by EditDistanceEDA (i.e. FixedWeightEditDistance) and a section containing the model learned during the training phase of the EDA (i.e. model).
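To make the section layout concrete, the sketch below reads a shortened configuration into a dictionary of sections and properties using Python's standard `xml.etree.ElementTree` module. It only illustrates the file structure and is not how the platform itself loads configurations: the helper `read_sections` is hypothetical, the embedded XML is an abbreviated excerpt of the English EditDistanceEDA example, and properties nested in subsections are deliberately ignored.

```python
import xml.etree.ElementTree as ET
from typing import Dict

CONFIG = """<configuration>
  <section name="PlatformConfiguration">
    <property name="activatedEDA">eu.excitementproject.eop.core.EditDistanceEDA</property>
    <property name="language">EN</property>
  </section>
  <section name="eu.excitementproject.eop.core.EditDistanceEDA">
    <property name="measure">accuracy</property>
    <property name="components">eu.excitementproject.eop.core.component.distance.FixedWeightTokenEditDistance</property>
  </section>
  <section name="model">
    <property name="threshold">0.5741758241758221</property>
  </section>
</configuration>"""


def read_sections(xml_text: str) -> Dict[str, Dict[str, str]]:
    """Map each section name to its directly contained properties."""
    root = ET.fromstring(xml_text)
    return {
        section.get("name"): {
            prop.get("name"): (prop.text or "").strip()
            for prop in section.findall("property")  # direct children only
        }
        for section in root.findall("section")
    }


sections = read_sections(CONFIG)
eda_class = sections["PlatformConfiguration"]["activatedEDA"]
# The component named by the EDA section identifies the section
# that would hold that component's own parameters.
component = sections[eda_class]["components"]
threshold = float(sections["model"]["threshold"])
```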
This section refers to the global information of the platform. activatedEDA and activatedLAP information are basically used by the Runner class (it is in the util package) to call both the needed LAP and the EDA itself. When the EDA is used without the Runner class, this information can be useful to see which pipeline has been used or can be used for preprocessing the data set. language is instead used by the EDA to set its configuration.
Property | Value |
---|---|
activatedEDA | The common setting for selecting the EDA. The default value here is eu.excitementproject.eop.core.EditDistanceEDA |
language | For the moment, EditDistanceEDA as well as EditDistancePSOEDA support English (EN), German (DE), and Italian (IT). In principle, the EDA is language-independent. |
activatedLAP | The linguistic analysis pipeline needed to produce input for the EDA. |
This section contains the parameters of EditDistanceEDA. They include the directory containing the data set for training the EDA (i.e. trainDir) and the directory of the data set to be annotated (i.e. testDir); currently the information about the test data set is not used by the EDA. The parameter measure specifies the measure (e.g. accuracy) to be optimized during the training phase of the EDA. The configuration file of EditDistancePSOEDA can contain additional parameters specific to the PSO algorithm (e.g. maxIteration) used for optimizing the weights.
Property | Value |
---|---|
trainDir | The directory containing the training data, as produced by the LAP (in xmi format). |
testDir | The directory containing the test data, as produced by the LAP (in xmi format). |
measure | The measure to be optimized: accuracy or f1. |
components | The component used by the EditDistanceEDA for distance computations. A component may itself require additional parameters, which are specified in a section specific to it. That section is identified by the name of the component provided as the value of this property. These components are available: FixedWeightTokenEditDistance and FixedWeightLemmaEditDistance. |
weights | Real-valued weights for each edit operation, used by the distance computation component. Levenshtein distance is obtained by setting the weights of deletion, insertion and substitution to 1. Hamming distance is obtained by setting substitution to 1 and deletion and insertion to a much higher value (e.g. 1000). Episode distance is obtained by setting insertion to 1 and deletion and substitution to a much higher value (e.g. 1000). |
maxIteration | The max number of iterations of PSO. |
errorTolerance | PSO stops when the error tolerance is satisfied. |
maxIterationWithoutChanges | PSO stops after this number of consecutive iterations without any change in the results. |
swarmSize | The swarm size (i.e. the number of particles). |
processors | The number of processors to be used to run PSO. |
deleteValuesRange | The range of values of the delete edit distance operation that PSO has to select a value from. |
insertValuesRange | The range of values of the insert edit distance operation that PSO has to select a value from. |
substituteValuesRange | The range of values of the substitute edit distance operation that PSO has to select a value from. |
This section is about the parameters of the edit distance components that can be used by EditDistanceEDA and EditDistancePSOEDA to perform their annotation.
Property | Value |
---|---|
instances | Specifies the subsection containing the component's parameters to be used. Its value can be one of these 4: basic, wordnet, wikipedia, or wordnet,wikipedia. basic means that only the tokens will be used to compute the edit distance between T and H. wordnet means that the component will use WordNet as an external resource to compute the edit distance between T and H. Use wikipedia to compute the edit distance between T and H by considering the rules extracted from Wikipedia. wordnet,wikipedia combines the information provided by the two resources. Note that with FixedWeightTokenEditDistance the rules in WordNet and Wikipedia are looked up by token and not by lemma as in the case of FixedWeightLemmaEditDistance; as a result, the contribution of these resources may be limited. To be able to use this component, the LAP should provide token and part-of-speech annotations (currently only TreeTagger provides this for all three languages, and TextPro for Italian). |
stopWordRemoval | Can be POS, LIST, POS,LIST or false, and tells the distance computation component whether to filter stop words. With POS, stop words are selected on the basis of their part of speech. With LIST, stop words are taken from a file. |
pathStopWordFile | The path to the file containing one stop word per line. It is used only if stopWordRemoval includes LIST. |
ignoreCase | Can be true or false, and tells the distance computation component whether to ignore case when matching words. |
normalizationType | Can be default or long. With default, the distance between T and H is normalized by the number of possible operations to transform T into H. With long, the distance is normalized by the number of words of T plus the number of words of H. |
wordnet | The path to the particular WordNet resource used. |
wikipedia | The path to the particular Wikipedia resource used. Italian and English Wikipedia can be used. |
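The effect of ignoreCase, of LIST-based stop word removal and of the long normalization can be sketched as follows. This is an illustration under assumptions rather than the component's actual code: the helper names `preprocess` and `normalize_long` are hypothetical, the stop word set stands in for the content of a stop word file, and whether words are counted before or after stop word removal is a choice made here for the example only.

```python
from typing import Iterable, List, Sequence


def preprocess(
    tokens: Sequence[str],
    ignore_case: bool = True,
    stop_words: Iterable[str] = (),
) -> List[str]:
    """Optionally lowercase the tokens, then drop those found in stop_words."""
    stop = set(stop_words)
    if ignore_case:
        tokens = [tok.lower() for tok in tokens]
        stop = {word.lower() for word in stop}
    return [tok for tok in tokens if tok not in stop]


def normalize_long(distance: float, t_tokens: Sequence[str], h_tokens: Sequence[str]) -> float:
    """Divide the distance by the number of words of T plus words of H."""
    total = len(t_tokens) + len(h_tokens)
    return distance / total if total else 0.0


# Stand-in for the content of a stop word file (one word per line).
stop_list = ["the", "on", "a"]
t_tokens = preprocess("The cat sat on the mat".split(), stop_words=stop_list)
h_tokens = preprocess("A cat sat".split(), stop_words=stop_list)

# With unit costs, transforming T into H needs one deletion ("mat").
raw_distance = 1.0
normalized = normalize_long(raw_distance, t_tokens, h_tokens)
```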
This section refers to the model. The values of the parameters in this section are learned automatically by the EDA during the training phase. They must not be changed by users.
<?xml version="1.0" encoding="UTF-8"?><!--
Language: English
EDA: EditDistanceEDA
Description: Given a certain configuration (i.e. the configuration file), the edit distance EDA can be
trained over a specific data set (i.e. trainDir) in order to optimize its performance (i.e. accuracy or
F1 measure). In the training phase the EDA produces a distance model for the data set, which includes a
distance threshold that best separates the positive and negative examples in the training data. The calculated
threshold is then saved in the configuration file itself. During the test phase the configuration file
is read and the reported threshold used so that T-H pairs resulting in a distance below the threshold are
classified as ENTAILMENT, while pairs above the threshold are classified as NONENTAILMENT.
EditDistanceEDA uses the weights of the 3 different edit distance operations (i.e. delete, insert, substitute)
reported in the configuration file to calculate the distance between T and H. To calculate this distance
EditDistance can use either the FixedWeightTokenEditDistance or FixedWeightLemmaEditDistance component.
FixedWeightTokenEditDistance calculates the distance between T and H by using the tokens whereas FixedWeightLemmaEditDistance
uses the lemmas of the tokens (in this case a pipeline producing lemmas has to be used for preprocessing the data set).
In addition FixedWeightTokenEditDistance and FixedWeightLemmaEditDistance can exploit external resources like
WordNet and Wikipedia.
From the configuration file it is possible to select different instances of EditDistanceEDA; this is done
by selecting the component to be used (i.e. FixedWeightTokenEditDistance or FixedWeightLemmaEditDistance) and
then one of the available configurations of the selected component (e.g. basic, wordnet, wikipedia).
Basically the configuration file represents a single experiment where information about the used EDA, its
parameters, the data set used to calculate the threshold and the threshold itself are all available. Sharing
a configuration file means allowing other users to replicate the same experiment under the same condition.
--><configuration>
<!-- Platform configuration section; the information in this section is used by the EOPRunner class
to both preprocess the data set and run the EDA -->
<section name="PlatformConfiguration">
<!-- The EDA to be used: EditDistanceEDA -->
<property name="activatedEDA">eu.excitementproject.eop.core.EditDistanceEDA</property>
<!-- The language: [EN] -->
<property name="language">EN</property>
<!-- The linguistic annotation pipeline to preprocess the data to be annotated: [OpenNLPTaggerEN|TreeTaggerEN] -->
<!-- Unlike OpenNLPTagger, TreeTagger can produce lemmas, so it is the pipeline to be
used when FixedWeightLemmaEditDistance is selected. Be sure to have TreeTagger installed before using it -->
<!-- <property name="activatedLAP">eu.excitementproject.eop.lap.dkpro.TreeTaggerEN</property> -->
<property name="activatedLAP">eu.excitementproject.eop.lap.dkpro.OpenNLPTaggerEN</property>
</section>
<!-- FixedWeightTokenEditDistance uses the token to calculate the distance between each pair T-H -->
<section name="eu.excitementproject.eop.core.component.distance.FixedWeightTokenEditDistance">
<!-- Do not consider the stop words: [POS|LIST|POS,LIST|false] -->
<!-- POS eliminate only some part of speech, LIST eliminate only the words listed in a file -->
<property name="stopWordRemoval">POS</property>
<!-- Do not consider the case: [true|false] -->
<property name="ignoreCase">true</property>
<!-- Path to the stop word list -->
<property name="pathStopWordFile">../core/src/main/resources/external-data/edit/stopwords_EN.txt</property>
<!-- Normalization type for the distance: [default|long] -->
<property name="normalizationType">default</property>
<!-- The configuration to be used by the component: [basic|wordnet|wikipedia|wordnet,wikipedia] -->
<property name="instances">basic</property>
<!-- This configuration does not use any external resources -->
<subsection name="basic"/>
<!-- This configuration uses WordNet as an external resource -->
<subsection name="wordnet">
<!-- path of the WordNet files -->
<property name="path">/opt/share/eop-resources/ontologies/EnglishWordNet-dict/</property>
</subsection>
<!-- This configuration uses Wikipedia as an external resource -->
<subsection name="wikipedia">
<!-- connection to the Wikipedia data base -->
<property name="dbconnection">jdbc:mysql://nathrezim:3306/wikikb</property>
<property name="dbuser">root</property>
<property name="dbpasswd">nat_2k12</property>
</subsection>
</section>
<!-- FixedWeightLemmaEditDistance uses the lemma to calculate the distance between each pair T-H -->
<section name="eu.excitementproject.eop.core.component.distance.FixedWeightLemmaEditDistance">
<!-- Do not consider the stop words: [POS|LIST|POS,LIST|false] -->
<!-- POS eliminate only some part of speech, LIST eliminate only the words listed in a file -->
<property name="stopWordRemoval">POS</property>
<!-- Do not consider the case: [true|false] -->
<property name="ignoreCase">true</property>
<!-- Path to the stop word list -->
<property name="pathStopWordFile">../core/src/main/resources/external-data/edit/stopwords_EN.txt</property>
<!-- Normalization type for the distance: [default|long] -->
<property name="normalizationType">default</property>
<!-- The configuration to be used by the component: [basic|wordnet|wikipedia|wordnet,wikipedia] -->
<property name="instances">basic</property>
<!-- This configuration does not use any external resources -->
<subsection name="basic"/>
<!-- This configuration uses WordNet as an external resource -->
<subsection name="wordnet">
<!-- path of the WordNet files -->
<property name="path">/opt/share/eop-resources/ontologies/EnglishWordNet-dict/</property>
</subsection>
<!-- This configuration uses Wikipedia as an external resource -->
<subsection name="wikipedia">
<!-- connection to the Wikipedia data base -->
<property name="dbconnection">jdbc:mysql://nathrezim:3306/wikikb</property>
<property name="dbuser">root</property>
<property name="dbpasswd">nat_2k12</property>
</subsection>
</section>
<!-- EditDistanceEDA uses the weights in the configuration file to calculate the entailment -->
<section name="eu.excitementproject.eop.core.EditDistanceEDA">
<!-- weights of the edit distance operations -->
<property name="match">0.0</property>
<property name="delete">0.0</property>
<property name="insert">1.0</property>
<property name="substitute">1.0</property>
<!-- <property name="trainDir">/tmp/</property> -->
<property name="trainDir">/tmp/ENG/dev/</property>
<!-- <property name="testDir">/tmp/</property> -->
<property name="testDir">/tmp/ENG/test</property>
<!-- measure to be optimized: [accuracy|f1] -->
<property name="measure">accuracy</property>
<!-- component to be used by EDA: [FixedWeightTokenEditDistance|FixedWeightLemmaEditDistance]
FixedWeightLemmaEditDistance can be used only when the preprocessing pipeline provides lemmas-->
<property name="components">eu.excitementproject.eop.core.component.distance.FixedWeightTokenEditDistance</property>
</section>
<!-- The information in this section is calculated automatically by the EDA during the training phase and
represents the learnt model. -->
<section name="model">
<!-- threshold -->
<property name="threshold">0.5741758241758221</property>
<!-- the accuracy obtained on the training data set -->
<property name="trainingAccuracy">0.6575</property>
</section>
</configuration>
<?xml version="1.0" encoding="UTF-8"?><!--
Language: English
EDA: EditDistancePSOEDA
Description: Given a certain configuration (i.e. the configuration file), the edit distance EDA can be
trained over a specific data set (i.e. trainDir) in order to optimize its performance (i.e. accuracy or
F1 measure). In the training phase the EDA produces a distance model for the data set, which includes a
distance threshold that best separates the positive and negative examples in the training data. The calculated
threshold is then saved in the configuration file itself. Differently to EditDistanceEDA this EDA optimizes
also the weights of the 3 edit distance operations (i.e. delete, insert, substitute) on the data used.
During the test phase the configuration file is read and the reported threshold with the calculated weights of the
edit distance operations used so that T-H pairs resulting in a distance below the threshold are classified as
ENTAILMENT, while pairs above the threshold are classified as NONENTAILMENT.
To calculate the distance between T-H pairs, EditDistancePSOEDA can use either the FixedWeightTokenEditDistance
or FixedWeightLemmaEditDistance component. FixedWeightTokenEditDistance calculates the distance between T and H
by using the tokens whereas FixedWeightLemmaEditDistance uses the lemma of the tokens (in this case a pipeline
producing the lemma has to be used for preprocessing the data set). In addition FixedWeightTokenEditDistance
and FixedWeightLemmaEditDistance can exploit external resources like WordNet and Wikipedia.
From the configuration file it is possible to select different instances of EditDistancePSOEDA; this is done
by selecting the component to be used (i.e. FixedWeightTokenEditDistance or FixedWeightLemmaEditDistance) and
then one of the available configurations of the selected component (e.g. basic, wordnet, wikipedia).
Basically the configuration file represents a single experiment where information about the used EDA, its
parameters, the data set used to calculate the threshold and the threshold itself are all available. Sharing
a configuration file means allowing other users to replicate the same experiment under the same condition.
--><configuration>
<!-- Platform configuration section; the information in this section is used by the EOPRunner class
to both preprocess the data set and run the EDA -->
<section name="PlatformConfiguration">
<!-- The EDA to be used: EditDistancePSOEDA -->
<property name="activatedEDA">eu.excitementproject.eop.core.EditDistancePSOEDA</property>
<!-- The language: [EN] -->
<property name="language">EN</property>
<!-- The linguistic annotation pipeline to preprocess the data to be annotated: [OpenNLPTaggerEN|TreeTaggerEN] -->
<!-- Unlike OpenNLPTagger, TreeTagger can produce lemmas, so it is the pipeline to be
used when FixedWeightLemmaEditDistance is selected. Be sure to have TreeTagger installed before using it -->
<!-- <property name="activatedLAP">eu.excitementproject.eop.lap.dkpro.TreeTaggerEN</property> -->
<property name="activatedLAP">eu.excitementproject.eop.lap.dkpro.OpenNLPTaggerEN</property>
</section>
<!-- FixedWeightTokenEditDistance uses the token to calculate the distance between each pair T-H -->
<section name="eu.excitementproject.eop.core.component.distance.FixedWeightTokenEditDistance">
<!-- Do not consider the stop words: [POS|LIST|POS,LIST|false] -->
<!-- POS eliminate only some part of speech, LIST eliminate only the words listed in a file -->
<property name="stopWordRemoval">POS</property>
<!-- Do not consider the case: [true|false] -->
<property name="ignoreCase">true</property>
<!-- Path to the stop word list -->
<property name="pathStopWordFile">../core/src/main/resources/external-data/edit/stopwords_EN.txt</property>
<!-- Normalization type for the distance: [default|long] -->
<property name="normalizationType">default</property>
<!-- The configuration to be used by the component: [basic|wordnet|wikipedia|wordnet,wikipedia] -->
<property name="instances">basic</property>
<!-- This configuration does not use any external resources -->
<subsection name="basic"/>
<!-- This configuration uses WordNet as an external resource -->
<subsection name="wordnet">
<!-- path of the WordNet files -->
<property name="path">/opt/share/eop-resources/ontologies/EnglishWordNet-dict/</property>
</subsection>
<!-- This configuration uses Wikipedia as an external resource -->
<subsection name="wikipedia">
<!-- connection to the Wikipedia data base -->
<property name="dbconnection">jdbc:mysql://nathrezim:3306/wikikb</property>
<property name="dbuser">root</property>
<property name="dbpasswd">nat_2k12</property>
</subsection>
</section>
<!-- FixedWeightLemmaEditDistance uses the lemma to calculate the distance between each pair T-H -->
<section name="eu.excitementproject.eop.core.component.distance.FixedWeightLemmaEditDistance">
<!-- Do not consider the stop words: [POS|LIST|POS,LIST|false] -->
<!-- POS eliminate only some part of speech, LIST eliminate only the words listed in a file -->
<property name="stopWordRemoval">POS</property>
<!-- Do not consider the case: [true|false] -->
<property name="ignoreCase">true</property>
<!-- Path to the stop word list -->
<property name="pathStopWordFile">../core/src/main/resources/external-data/edit/stopwords_EN.txt</property>
<!-- Normalization type for the distance: [default|long] -->
<property name="normalizationType">default</property>
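<!-- The configuration to be used by the component: [basic|wordnet|wikipedia|wordnet,wikipedia] -->
<property name="instances">basic</property>
<!-- This configuration does not use any external resources -->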
<subsection name="basic"/>
<!-- This configuration uses WordNet as an external resource -->
<subsection name="wordnet">
<!-- path of the WordNet files -->
<property name="path">/opt/share/eop-resources/ontologies/EnglishWordNet-dict/</property>
</subsection>
<!-- This configuration uses Wikipedia as an external resource -->
<subsection name="wikipedia">
<!-- connection to the Wikipedia data base -->
<property name="dbconnection">jdbc:mysql://nathrezim:3306/wikikb</property>
<property name="dbuser">root</property>
<property name="dbpasswd">nat_2k12</property>
</subsection>
</section>
<!-- EditDistancePSOEDA optimizes the weights of the edit distance operations automatically by using
particle swarm optimization (PSO) -->
<section name="eu.excitementproject.eop.core.EditDistanceEDA">
<!-- Particle Swarm Optimization (PSO) Section -->
<!-- The following 3 parameters determine when PSO stops -->
<!-- max number of iterations -->
<property name="maxIteration">20</property>
<!-- minimum error criteria -->
<property name="errorTolerance">0.1</property>
<!-- max number of iterations without any changes in accuracy -->
<property name="maxIterationWithoutChanges">5</property>
<!-- swarm size -->
<property name="swarmSize">20</property>
<!-- processors used by PSO -->
<!-- for the time being multi-threading is not supported and this parameter
must not be changed -->
<property name="processors">1</property>
<!-- range of values of the edit distance operations where PSO has to select a value from -->
<property name="deleteValuesRange">0,5</property>
<property name="insertValuesRange">0,5</property>
<property name="substituteValuesRange">0,5</property>
<!-- <property name="trainDir">/tmp/</property> -->
<property name="trainDir">/tmp/ENG/dev/</property>
<!-- <property name="testDir">/tmp/</property> -->
<property name="testDir">/tmp/ENG/test</property>
<!-- measure to be optimized: accuracy, f1 -->
<property name="measure">accuracy</property>
<!-- component to be used by EDA: [FixedWeightTokenEditDistance|FixedWeightLemmaEditDistance]
FixedWeightLemmaEditDistance can be used only when the preprocessing pipeline provides lemmas -->
<property name="components">eu.excitementproject.eop.core.component.distance.FixedWeightTokenEditDistance</property>
</section>
<!-- The information in this section is calculated automatically by the EDA during the training phase and
represents the learnt model. -->
<section name="model">
<!-- threshold -->
<property name="threshold">0.6079350735597624</property>
<!-- the accuracy obtained on the training data set -->
<property name="trainingAccuracy">0.635</property>
<!-- weights of the edit distance operations -->
<property name="match">0.0</property>
<property name="delete">0.4946647050742291</property>
<property name="insert">2.6648230238837645</property>
<property name="substitute">2.4827393631805306</property>
</section>
</configuration>