Quick Start
These quick start guides provide steps that you can follow to become more confident with EOP. Complete the steps in the order in which they appear in each section. No advanced installation options are discussed here, only the basics that will work for most users who want to get started.
- Downloading and Installing EOP [d:1, t:2]
- Annotating a single text/hypothesis pair by using a pre-trained model [d:1, t:1]
- Annotating multiple text/hypothesis pairs by using a pre-trained model [d:1, t:2]
- Creating, testing and evaluating a new model [d:1, t:3]
To the right of each use case you can see the difficulty of the task (e.g. d:1) and the lead time needed to complete it (e.g. t:2). Each measure is reported as a number from 1 to 5:
- d:1 (easy), d:5 (difficult)
- t:1 (quick), t:5 (time-consuming)
These values were assigned by a group of software testers with medium computer skills who were asked to carry out the proposed tasks.
Goal: downloading and installing EOP by using the .tar.gz (gzip) archive file of its distribution.
Prerequisite: the hardware and software requirements for using EOP via the Command Line Interface are met.
Main steps:
- Downloading the EOP code
- Downloading the files of the resources (i.e. configuration files, models and lexical resources)
- Installing EOP
EOP provides different distributions for users. The running example uses the source code distributed as a .tar.gz (gzip) archive file, which has to be downloaded first.
- Download the Excitement Open Platform archive file from the following link and save it for example into your home directory:
https://github.com/hltfbk/Excitement-Open-Platform/archive/v{version}.tar.gz
where version refers to the EOP release version you want to install (e.g. 1.1.0).
Resources like WordNet as well as the configuration files of the platform and the pre-trained models are distributed in a separate archive file that has to be downloaded before using it:
- Download the archive file of the resources and save it into the same directory where the Excitement Open Platform archive file is (e.g. your home directory):
http://hlt-services4.fbk.eu:8080/artifactory/repo/eu/excitementproject/eop-resources/eop-resources-{version}.tar.gz
where version refers to the EOP release version you want to install (e.g. 1.1.0).
EOP is installed automatically with a shell script. For a manual installation of the platform, follow the steps reported in the Step by Step tutorial.
- Download the shell script from this link and save it into the same directory where both the EOP code and the file of the resources are (e.g. your home directory):
http://hlt-services4.fbk.eu:8080/artifactory/simple/private-internal/eu/excitementproject/eop-resources/script/install.sh
- From the directory where the script has been saved, run the script to install EOP:
> ./install.sh {version}
where version is the EOP version to install; for example, to install EOP v1.1.0 run: ./install.sh 1.1.0
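The three download locations above all depend on the same release number. As a minimal sketch (the function name `eop_download_urls` and the dictionary keys are illustrative assumptions, not part of EOP), the placeholders can be filled in like this:

```python
# URL templates copied from the steps above; {version} is the EOP release number.
CODE_URL = "https://github.com/hltfbk/Excitement-Open-Platform/archive/v{version}.tar.gz"
RESOURCES_URL = (
    "http://hlt-services4.fbk.eu:8080/artifactory/repo/eu/excitementproject/"
    "eop-resources/eop-resources-{version}.tar.gz"
)
INSTALL_SCRIPT_URL = (
    "http://hlt-services4.fbk.eu:8080/artifactory/simple/private-internal/eu/"
    "excitementproject/eop-resources/script/install.sh"
)


def eop_download_urls(version):
    """Return the three URLs to fetch for one EOP release (hypothetical helper)."""
    return {
        "code": CODE_URL.format(version=version),
        "resources": RESOURCES_URL.format(version=version),
        # The install script URL does not contain the version number.
        "install_script": INSTALL_SCRIPT_URL,
    }


urls = eop_download_urls("1.1.0")
for name, url in urls.items():
    print(name, url)
```

All three files should end up in the same directory before install.sh is run.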
Goal: Given two text fragments, one named Text and the other named Hypothesis, the Entailment task consists of recognizing whether the hypothesis can be inferred from the text. This section annotates a single text/hypothesis pair (i.e. predicts whether the hypothesis can be inferred from the text) by using one of the pre-trained models of Edit Distance EDA (for details about the EOP architecture and the available modules, visit the Components Description page).
Prerequisite: EOP must already be installed.
Main steps:
- pre-processing the text/hypothesis pair with the required linguistic analysis pipeline
- annotating the pair with Edit Distance EDA
The EOPRunner class is a utility class provided with EOP that is able to call both the linguistic analysis pipeline to pre-process the data to be annotated and the selected entailment decision algorithm (EDA). It is the simplest way to run EOP.
Go into the EOP-{version} directory, i.e.
> cd ~/Excitement-Open-Platform-{version}/target/EOP-{version}/
and call the EOPRunner class with the needed parameters as reported below, i.e.
> java -Djava.ext.dirs=../EOP-{version}/ eu.excitementproject.eop.util.runner.EOPRunner \
-config ./eop-resources-{version}/configuration-files/EditDistanceEDA_EN.xml \
-test -text "Hubble is a telescope" \
-hypothesis "Hubble is an instrument" \
-output ./eop-resources-{version}/results/
where:
- EditDistanceEDA_EN.xml is the configuration file containing the linguistic analysis pipeline, the EDA and the pre-trained model that have to be used to annotate the data.
- test means that the selected EDA has to perform its annotation by using a pre-trained model.
- text is the text.
- hypothesis is the hypothesis.
- output is the directory where the result file (results.xml) containing the prediction has to be stored.
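When the call has to be issued from a script rather than typed by hand, the same parameters can be assembled as an argument list. This is only a sketch: the function name `build_single_pair_command` is an assumption made for illustration, while the class name, flags and paths are the ones listed above.

```python
import shlex


def build_single_pair_command(version, text, hypothesis):
    """Assemble the EOPRunner call for one text/hypothesis pair as an argument list."""
    resources = f"./eop-resources-{version}"
    return [
        "java",
        f"-Djava.ext.dirs=../EOP-{version}/",
        "eu.excitementproject.eop.util.runner.EOPRunner",
        "-config", f"{resources}/configuration-files/EditDistanceEDA_EN.xml",
        "-test",
        "-text", text,
        "-hypothesis", hypothesis,
        "-output", f"{resources}/results/",
    ]


command = build_single_pair_command(
    "1.1.0", "Hubble is a telescope", "Hubble is an instrument"
)
# shlex.join quotes the two sentences so the line can be pasted into a shell.
print(shlex.join(command))
```

The list can be passed to `subprocess.run(command, cwd=...)`, with `cwd` set to the EOP-{version} directory, since the paths are relative to it.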
The prediction is saved in the results.xml file in the eop-resources-{version}/results/ directory and you can take a look at it using the following command:
> cat eop-resources-{version}/results/results.xml
Below we report an example of the results.xml file:
<entailment-corpus lang="null">
  <pair id="1" entailment="Entailment" benchmark="N/A" confidence="0.24084249084248882" task="EOP test">
    <t>Hubble is a telescope</t>
    <h>Hubble is an instrument</h>
  </pair>
</entailment-corpus>
The prediction made by the EDA (i.e. entailment="Entailment") means that, according to Edit Distance EDA, there is an Entailment relation between the text "Hubble is a telescope" and the hypothesis "Hubble is an instrument". The confidence value (i.e. 0.24084249084248882) indicates how sure the EDA is about its decision.
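To read the prediction programmatically instead of inspecting the file by eye, the results.xml structure shown above can be parsed with a standard XML library. The sketch below uses Python's `xml.etree.ElementTree`; the function name `read_predictions` and the keys of the returned dictionaries are illustrative choices, and the sample is the file content from the example.

```python
import xml.etree.ElementTree as ET

# Content of the example results.xml shown above.
SAMPLE = """<entailment-corpus lang="null">
<pair id="1" entailment="Entailment" benchmark="N/A" confidence="0.24084249084248882" task="EOP test">
<t>Hubble is a telescope</t>
<h>Hubble is an instrument</h>
</pair>
</entailment-corpus>"""


def read_predictions(xml_text):
    """Return one dictionary per <pair> element with its decision and confidence."""
    root = ET.fromstring(xml_text)
    predictions = []
    for pair in root.iter("pair"):
        predictions.append({
            "id": pair.get("id"),
            "text": pair.findtext("t"),
            "hypothesis": pair.findtext("h"),
            "decision": pair.get("entailment"),
            "confidence": float(pair.get("confidence")),
        })
    return predictions


predictions = read_predictions(SAMPLE)
for p in predictions:
    print(p["id"], p["decision"], p["confidence"])
```

For a file on disk, the content can be read first, for example with `Path("eop-resources-{version}/results/results.xml").read_text()` after replacing the version placeholder.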
Goal: Given two text fragments, one named Text and the other named Hypothesis, the Entailment task consists of recognizing whether the hypothesis can be inferred from the text. This section annotates the English RTE-3 data set, which contains multiple text/hypothesis pairs (i.e. for each pair in the data set we want to know whether the hypothesis can be inferred from the text), by using one of the pre-trained models of Edit Distance EDA.
Prerequisite: EOP must already be installed.
Main steps:
- pre-processing the data set of text/hypothesis pairs with the required linguistic analysis pipeline
- annotating the data set with Edit Distance EDA
To do this task we will use the EOPRunner class. Go into the EOP-{version} directory, i.e.
> cd ~/Excitement-Open-Platform-{version}/target/EOP-{version}/
EOPRunner calls the specified LAP (linguistic analysis pipeline) to pre-process the data set and puts the produced files into the directory specified by a dedicated parameter (e.g. testDir) in the EDA's configuration file. Before running EOPRunner, check that this directory (e.g. /tmp/EN/test/) exists and, if it does not, create it (e.g. mkdir -p /tmp/EN/test/). After that you can call the EOPRunner class with the needed parameters as reported below:
> java -Djava.ext.dirs=../EOP-{version}/ eu.excitementproject.eop.util.runner.EOPRunner \
-config ./eop-resources-{version}/configuration-files/EditDistanceEDA_EN.xml \
-test -testFile ./eop-resources-{version}/data-set/English_test.xml \
-output ./eop-resources-{version}/results/
where:
- EditDistanceEDA_EN.xml is the configuration file containing the linguistic analysis pipeline, the EDA and the pre-trained model that have to be used to annotate the data.
- test means that the selected EDA has to perform its annotation by using a pre-trained model.
- testFile is the data set of the text/hypothesis pairs that has to be annotated.
- output is the directory where the result file (EditDistanceEDA_EN.xml_results.txt) containing the predictions has to be stored.
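The directory check that precedes the EOPRunner call (the `mkdir -p` step) can also be done from a script. A minimal sketch, assuming the directory path is already known from the configuration file; the helper name `ensure_lap_output_dir` is hypothetical and the demonstration uses a temporary directory rather than /tmp/EN/test/:

```python
import tempfile
from pathlib import Path


def ensure_lap_output_dir(path):
    """Create the directory (and any missing parents) if needed and return it."""
    directory = Path(path)
    # Same effect as `mkdir -p`: no error if the directory already exists.
    directory.mkdir(parents=True, exist_ok=True)
    return directory


with tempfile.TemporaryDirectory() as scratch:
    created = ensure_lap_output_dir(Path(scratch) / "EN" / "test")
    existed_after_first_call = created.is_dir()
    # Calling it again on an existing directory is harmless.
    existed_after_second_call = ensure_lap_output_dir(created).is_dir()
```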
The annotations are saved in the EditDistanceEDA_EN.xml_results.txt file in the eop-resources-{version}/results/ directory and you can take a look at it with the following command:
> cat eop-resources-{version}/results/EditDistanceEDA_EN.xml_results.txt
Here is an example of such a file:
747 NONENTAILMENT NonEntailment 0.2258241758241779
795 ENTAILMENT Entailment 0.5741758241758221
60 ENTAILMENT Entailment 0.24084249084248882
546 NONENTAILMENT NonEntailment 0.15309690309690516
.....................
.....................
509 ENTAILMENT Entailment 0.07417582417582214
The first column reports the ID of the T/H pair and the second the annotation given in the gold standard. The third column contains the prediction made by the EDA, whereas the last is the confidence of the prediction (i.e. how sure the EDA is about its decision).
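Given this four-column layout, a quick sanity check is to compare the gold and predicted columns line by line. The sketch below assumes the columns are separated by whitespace and that gold and predicted labels differ only in letter case, as in the example; the function names are illustrative and the sample holds the five complete lines shown above.

```python
SAMPLE_LINES = """747 NONENTAILMENT NonEntailment 0.2258241758241779
795 ENTAILMENT Entailment 0.5741758241758221
60 ENTAILMENT Entailment 0.24084249084248882
546 NONENTAILMENT NonEntailment 0.15309690309690516
509 ENTAILMENT Entailment 0.07417582417582214"""


def parse_results(text):
    """Return (pair_id, gold, predicted, confidence) tuples, one per complete line."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or elided lines
        pair_id, gold, predicted, confidence = fields
        rows.append((pair_id, gold, predicted, float(confidence)))
    return rows


def accuracy(rows):
    """Fraction of pairs whose prediction matches the gold label (case-insensitive)."""
    if not rows:
        raise ValueError("no result rows to score")
    correct = sum(1 for _, gold, predicted, _ in rows if gold.lower() == predicted.lower())
    return correct / len(rows)


rows = parse_results(SAMPLE_LINES)
print(len(rows), accuracy(rows))
```

On the five sample lines every prediction agrees with the gold label; on the full result file the value will generally be lower.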
Goal: This section uses TIE EDA to create a new model from the English RTE-3 development data set and to test it on the English RTE-3 test data set. More specifically, we will use the pairs in the training data set to build a model that predicts, for each pair in the test data set, whether the hypothesis H can be inferred from the text T.
Prerequisite: EOP must already be installed.
Main steps:
- Training TIE on the training data set
  - pre-processing the English RTE-3 development data set with the required linguistic analysis pipeline
  - learning the new model with TIE EDA
- Testing the learned model on the test data set
  - pre-processing the English RTE-3 test data set with the required linguistic analysis pipeline
  - testing the new model on the RTE-3 test data set
This task also uses the EOPRunner class, as described below.
Go into the EOP-{version} directory, i.e.
> cd ~/Excitement-Open-Platform-{version}/target/EOP-{version}/
EOPRunner calls the specified LAP (linguistic analysis pipeline) to pre-process the data set and puts the produced files into the directory specified by a dedicated parameter (e.g. trainDir) in the EDA's configuration file. Before running EOPRunner, check that this directory (e.g. /tmp/EN/dev/) exists and, if it does not, create it (e.g. mkdir -p /tmp/EN/dev/). After that you can call the EOPRunner class with the needed parameters as reported below:
> java -Djava.ext.dirs=../EOP-{version}/ eu.excitementproject.eop.util.runner.EOPRunner \
-config ./eop-resources-{version}/configuration-files/MaxEntClassificationEDA_Base+OpenNLP_EN.xml \
-train -trainFile ./eop-resources-{version}/data-set/English_dev.xml
where:
- MaxEntClassificationEDA_Base+OpenNLP_EN.xml is the configuration file containing the linguistic analysis pipeline, the EDA and the model that is created by training the EDA on the training data set.
- train means that the selected EDA has to be trained on the specified training data set.
- trainFile is the data set of the text/hypothesis pairs that has to be used to train the EDA.
At the end of this phase the new model MaxEntClassificationEDAModel_Base+OpenNLP_EN should be available in the eop-resources-{version}/model/ directory. With TIE, when a user writes a model file with the same name as an existing one, e.g. myModelFile, the old myModelFile is overwritten. To be on the safe side, the existing myModelFile is first copied to myModelFile_old in the same directory before it is overwritten.
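The backup rule described above keeps only one previous copy of a model. The following sketch mimics that rule for illustration; it is not TIE's own code, and the helper name `back_up_model` is hypothetical. It can also serve as a starting point if you want to keep a copy under a different name before retraining.

```python
import shutil
import tempfile
from pathlib import Path


def back_up_model(model_path):
    """Copy an existing model file to <name>_old; return the backup path, or None."""
    model = Path(model_path)
    if not model.is_file():
        return None  # nothing to back up on the first training run
    backup = model.with_name(model.name + "_old")
    shutil.copy2(model, backup)  # replaces any earlier <name>_old
    return backup


with tempfile.TemporaryDirectory() as scratch:
    model = Path(scratch) / "myModelFile"
    model.write_text("old model")
    backup = back_up_model(model)
    backup_name = backup.name
    model.write_text("new model")  # the retrained model replaces the old one
    kept_old = backup.read_text()
    current = model.read_text()
```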
In this phase the model learned in the previous phase is used to annotate the test data set. EOPRunner calls the specified LAP (linguistic analysis pipeline) to pre-process the data set and puts the produced files into the directory specified by a dedicated parameter (e.g. testDir) in the EDA's configuration file. Before running EOPRunner, check that this directory (e.g. /tmp/EN/test/) exists and, if it does not, create it (e.g. mkdir -p /tmp/EN/test/). After that you can call the EOPRunner class with the needed parameters as reported below:
> java -Djava.ext.dirs=../EOP-{version}/ eu.excitementproject.eop.util.runner.EOPRunner \
-config ./eop-resources-{version}/configuration-files/MaxEntClassificationEDA_Base+OpenNLP_EN.xml \
-test -testFile ./eop-resources-{version}/data-set/English_test.xml \
-output ./eop-resources-{version}/results/
where:
- MaxEntClassificationEDA_Base+OpenNLP_EN.xml is the configuration file containing the linguistic analysis pipeline, the EDA and the pre-trained model that have to be used to annotate the data.
- test means that the selected EDA has to perform its annotation by using a pre-trained model.
- testFile is the data set of the text/hypothesis pairs that has to be annotated.
- output is the directory where the result file (MaxEntClassificationEDA_Base+OpenNLP_EN.xml_results.txt) containing the predictions has to be stored.
The annotation produced above can be evaluated in terms of Accuracy, Precision, Recall and F1 measure by using the scorer available with EOP. The evaluation can be done either during the testing phase (it is sufficient to add the parameter -score to the command line of the previous example) or after the testing phase, e.g.
> java -Djava.ext.dirs=../EOP-{version}/ eu.excitementproject.eop.util.runner.EOPRunner \
-score -results ./eop-resources-{version}/results/MaxEntClassificationEDA_Base+OpenNLP_EN.xml_results.txt
where:
- results is the file containing the produced annotations, in this case: ./eop-resources-{version}/results/MaxEntClassificationEDA_Base+OpenNLP_EN.xml_results.txt
It produces the file:
./eop-resources-{version}/results/MaxEntClassificationEDA_Base+OpenNLP_EN.xml_results.txt_report.xml
containing the calculated results, e.g.
<?xml version="1.0" encoding="UTF-8"?>
<Result EDA_Configuration="MaxEntClassificationEDA_Base+OpenNLP_EN.xml_results.txt">
  <Total_Pairs>800</Total_Pairs>
  <Accuracy>0.615</Accuracy>
  <Positive_Pairs Number="410">
    <Precision>0.60991377</Precision>
    <Recall>0.6902439</Recall>
    <F_Measure>0.6475972</F_Measure>
    <Classified_As_Positive>283</Classified_As_Positive>
    <Classified_As_Negative>127</Classified_As_Negative>
  </Positive_Pairs>
  <Negative_Pairs Number="390">
    <Precision>0.6220238</Precision>
    <Recall>0.53589743</Recall>
    <F_Measure>0.57575756</F_Measure>
    <Classified_As_Positive>181</Classified_As_Positive>
    <Classified_As_Negative>209</Classified_As_Negative>
  </Negative_Pairs>
</Result>
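The figures in the report follow from the four Classified_As counts. The sketch below recomputes them with the standard definitions of accuracy, precision, recall and F1. It assumes that Classified_As_Positive under Positive_Pairs counts the correctly recognised entailment pairs (and likewise for the other three counts); this reading is an assumption, although it reproduces the reported values. The function names are illustrative and the sample keeps only the counts from the report above.

```python
import xml.etree.ElementTree as ET

# Counts taken from the example report above (other elements omitted).
REPORT = """<Result>
<Positive_Pairs Number="410">
<Classified_As_Positive>283</Classified_As_Positive>
<Classified_As_Negative>127</Classified_As_Negative>
</Positive_Pairs>
<Negative_Pairs Number="390">
<Classified_As_Positive>181</Classified_As_Positive>
<Classified_As_Negative>209</Classified_As_Negative>
</Negative_Pairs>
</Result>"""


def class_metrics(hits, false_alarms, misses):
    """Precision, recall and F1 for one class from its three counts."""
    precision = hits / (hits + false_alarms)
    recall = hits / (hits + misses)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def summarize_report(xml_text):
    root = ET.fromstring(xml_text)
    positive = root.find("Positive_Pairs")
    negative = root.find("Negative_Pairs")
    tp = int(positive.findtext("Classified_As_Positive"))  # entailment, predicted entailment
    fn = int(positive.findtext("Classified_As_Negative"))  # entailment, predicted non-entailment
    fp = int(negative.findtext("Classified_As_Positive"))  # non-entailment, predicted entailment
    tn = int(negative.findtext("Classified_As_Negative"))  # non-entailment, predicted non-entailment
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "positive": class_metrics(tp, fp, fn),
        # For the negative class the roles of the two error types are swapped.
        "negative": class_metrics(tn, fn, fp),
    }


summary = summarize_report(REPORT)
print(summary)
```

For example, the accuracy is (283 + 209) / 800 = 0.615, which matches the Accuracy element of the report.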