rzanoli edited this page Feb 14, 2014 · 38 revisions

This quick start guide walks you through a few basic steps to help you become familiar with EOP. Complete the steps in the order in which they are presented. Advanced installation options are not discussed here; only the basics that will work for most users who want to get started.

Contents:

  1. Downloading and Installing EOP [d:2, t:2]
  2. Annotating a single text/hypothesis pair by using a pre-trained model [d:1, t:1]
  3. Annotating multiple text/hypothesis pairs by using a pre-trained model [d:1, t:2]
  4. Creating, testing and evaluating a new model [d:1, t:3]

For each use case, the level of difficulty (e.g. d:1) and the time needed to complete it (e.g. t:2) are shown to its right. Each measure ranges from 1 to 5:

  • d:1 (easy), d:5 (difficult)
  • t:1 (not time consuming), t:5 (time consuming)

These values were assigned by a group of software testers with medium computer skills who were asked to carry out the proposed tasks.


1. Downloading and Installing EOP

Goal: downloading and installing EOP by using the .tar.gz (gzip) archive file of its distribution.

Prerequisite: the hardware and software requirements for using EOP via the Command Line Interface are met.

Main steps:

  1. downloading the .tar.gz (gzip) archive file
  2. building the code
  3. downloading and installing the files of the resources (i.e. configuration files, models and lexical resources) needed to use the platform.

Downloading the .tar.gz (gzip) archive file

EOP is available in different distributions. This guide uses the source distribution, a .tar.gz (gzip) archive file that has to be downloaded and unpacked before use.

  • Download the Excitement Open Platform archive file from:
    https://github.com/hltfbk/Excitement-Open-Platform/archive/v{version}.tar.gz

where {version} is the EOP release version you want to install (e.g. 1.1.0).

  • Copy the archive file from the directory where it has been saved into the directory where you want to use it, e.g. your home directory:

    > cp Excitement-Open-Platform-{version}.tar.gz  ~/
    
  • Go into your home directory and extract/unpack it, i.e.

    > cd ~/
    > tar -xvzf Excitement-Open-Platform-{version}.tar.gz
    

It will create the directory Excitement-Open-Platform-{version} containing the source code.
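
The download and unpack steps above can also be scripted. The following is only a sketch: the function names eop_archive_url and eop_fetch are made up for this example and are not part of EOP, and it assumes that curl and tar are installed. The archive is saved explicitly under the file name used in the commands above.

```shell
# Hypothetical helpers for illustration; not part of EOP.

# Print the download URL for a given EOP release, e.g. 1.1.0.
eop_archive_url() {
    printf 'https://github.com/hltfbk/Excitement-Open-Platform/archive/v%s.tar.gz\n' "$1"
}

# Download the archive for release $1 into directory $2 (default: $HOME)
# and unpack it there. Assumes curl and tar are available.
eop_fetch() {
    version=$1
    dest=${2:-$HOME}
    archive="$dest/Excitement-Open-Platform-$version.tar.gz"
    curl -fL -o "$archive" "$(eop_archive_url "$version")" &&
        tar -xzf "$archive" -C "$dest"
}
```

Calling eop_fetch 1.1.0 would download and unpack release 1.1.0 into your home directory.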

Building the EOP code

Maven is used to compile the source code and to "assemble" the produced files, directories and needed dependencies.

From your home directory go into the Excitement-Open-Platform-{version} directory, i.e.

> cd Excitement-Open-Platform-{version}

Then make sure that JAVA_HOME is set correctly and build the EOP code with Maven:

> mvn package assembly:assembly

The command creates a directory called target in Excitement-Open-Platform-{version} containing a zip file (i.e. eop-{version}-bin.zip) of the generated binary code.
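
Since the build depends on JAVA_HOME, it can be worth checking the variable before starting Maven. The function below is a minimal sketch of such a check; the name check_java_home is an assumption of this example and is not provided by EOP or Maven. It only tests that JAVA_HOME is set and contains an executable bin/java, not that the Java version is suitable.

```shell
# Hypothetical pre-build check for illustration; not part of EOP or Maven.
# Succeeds only if JAVA_HOME is set and contains an executable bin/java.
check_java_home() {
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$JAVA_HOME/bin/java" ]; then
        echo "no executable found at $JAVA_HOME/bin/java" >&2
        return 1
    fi
    echo "JAVA_HOME looks usable: $JAVA_HOME"
}
```

It could be combined with the build as check_java_home && mvn package assembly:assembly, so that Maven only starts when the check succeeds.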

Go into the target directory, i.e.

> cd target

and from this directory unzip the zip file created by the build (eop-{version}-bin.zip), i.e.

> unzip eop-{version}-bin.zip

This creates a new directory (i.e. EOP-{version}) containing the binary files (i.e. jar files) needed to run EOP.

Downloading and installing the files of the resources

Resources like WordNet and Wikipedia as well as the configuration files of the platform and the pre-trained models are distributed in a separate archive file that has to be downloaded and unpacked before using it:

  • Follow this link to download the archive file of the resources:
    http://hlt-services4.fbk.eu:8080/artifactory/repo/eu/excitementproject/eop-resources/eop-resources-{version}.tar.gz

  • Copy the archive file into the EOP-{version} directory created in the previous step, e.g.

    > cp eop-resources-{version}.tar.gz   ~/Excitement-Open-Platform-{version}/target/EOP-{version}/
    
  • From the EOP-{version} directory where the archive file has been saved, extract/unpack it, i.e.

    > cd  ~/Excitement-Open-Platform-{version}/target/EOP-{version}/
    > tar -xvzf eop-resources-{version}.tar.gz
    

It will create the directory eop-resources-{version} containing all the needed files.


2. Annotating a single text/hypothesis pair by using a pre-trained model

Goal: Given two text fragments, one named Text and the other named Hypothesis, the Entailment task consists in recognizing whether the hypothesis can be inferred from the text. The purpose of this section is to annotate a single text/hypothesis pair (i.e. to predict whether the hypothesis can be inferred from the text) by using one of the pre-trained models of Edit Distance EDA (for details about the EOP architecture and the available modules, see the Components Description page).

Prerequisite: EOP has to be already installed.

Main steps:

  • pre-processing the text/hypothesis pair by the needed linguistic pipeline
  • annotating the pair with Edit Distance EDA

The EOPRunner class is a utility class provided with EOP that is able to call both the linguistic analysis pipeline to pre-process the data to be annotated and the selected entailment decision algorithm (EDA). It is the simplest way to run EOP.

Go into the EOP-{version} directory, i.e.

> cd  ~/Excitement-Open-Platform-{version}/target/EOP-{version}/

and call the EOPRunner class with the needed parameters as reported below, i.e.

> java -Djava.ext.dirs=../EOP-{version}/ eu.excitementproject.eop.util.runner.EOPRunner \
       -config ./eop-resources-{version}/configuration-files/EditDistanceEDA_EN.xml \
       -test -text "Hubble is a telescope" \
       -hypothesis "Hubble is an instrument" \
       -output ./eop-resources-{version}/results/

where:

  • EditDistanceEDA_EN.xml is the configuration file containing the linguistic analysis pipeline, the EDA and the pre-trained model that have to be used to annotate the data.
  • test means that the selected EDA has to perform its annotation by using a pre-trained model.
  • text is the text.
  • hypothesis is the hypothesis.
  • output is the directory where the result file (results.xml) containing the prediction has to be stored.

The prediction is saved in the results.xml file in the eop-resources-{version}/results/ directory and you can take a look at it using the following command:

> cat eop-resources-{version}/results/results.xml

Below we report an example of the results.xml file:

<entailment-corpus lang="null">
  <pair id="1" entailment="Entailment" benchmark="N/A" confidence="0.24084249084248882" task="EOP test">
    <t>Hubble is a telescope</t>
    <h>Hubble is an instrument</h>
  </pair>
</entailment-corpus>

The prediction made by the EDA (i.e. entailment="Entailment") means that, according to Edit Distance EDA, the text "Hubble is a telescope" entails the hypothesis "Hubble is an instrument".
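
If you only need the decision and its confidence, they can be pulled out of the result file with standard tools. The following sketch uses sed; the function name eop_decisions is hypothetical, and the pattern assumes the layout shown in the example above (each pair tag on a single line, with the id, entailment and confidence attributes in that order). A proper XML parser would be more robust if the layout differs.

```shell
# Hypothetical helper for illustration; not part of EOP.
# Print "<pair id> <decision> <confidence>" for every <pair> element in an
# EOPRunner result file. Assumes the attribute order of the example above
# (id, entailment, ..., confidence) with each <pair> tag on one line.
eop_decisions() {
    sed -n 's/.*<pair id="\([^"]*\)" entailment="\([^"]*\)".* confidence="\([^"]*\)".*/\1 \2 \3/p' "$1"
}
```

Applied to the example file above, eop_decisions eop-resources-{version}/results/results.xml would print one line containing the pair ID 1, the decision Entailment and the confidence value.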


3. Annotating multiple text/hypothesis pairs by using a pre-trained model

Goal: Given two text fragments, one named text and the other named hypothesis, the Entailment task consists in recognizing whether the hypothesis can be inferred from the text. The purpose of this section is to annotate the English RTE-3 data set, which contains multiple text/hypothesis pairs (i.e. for each pair in the data set we want to know whether the hypothesis can be inferred from the text), by using one of the pre-trained models of Edit Distance EDA.

Prerequisite: EOP has to be already installed.

Main steps:

  • pre-processing the data set of text/hypothesis pairs by the needed linguistic pipeline
  • annotating the data set with Edit Distance EDA

To do this task we will use the EOPRunner class. Go into the EOP-{version} directory, i.e.

> cd  ~/Excitement-Open-Platform-{version}/target/EOP-{version}/

EOPRunner calls the specified LAP to pre-process the data set and puts the produced files into the directory specified in the EDA's configuration file by the corresponding parameter (e.g. testDir). Before running EOPRunner, check that this directory (e.g. /tmp/EN/test/) exists and create it if necessary (e.g. mkdir -p /tmp/EN/test/). After that you can call the EOPRunner class with the needed parameters as reported below:

> java -Djava.ext.dirs=../EOP-{version}/ eu.excitementproject.eop.util.runner.EOPRunner \
       -config ./eop-resources-{version}/configuration-files/EditDistanceEDA_EN.xml \
       -test -testFile ./eop-resources-{version}/data-set/English_test.xml \
       -output ./eop-resources-{version}/results/

where:

  • EditDistanceEDA_EN.xml is the configuration file containing the linguistic analysis pipeline, the EDA and the pre-trained model that have to be used to annotate the data.
  • test means that the selected EDA has to perform its annotation by using a pre-trained model.
  • testFile is the data set of the text/hypothesis pairs that has to be annotated.
  • output is the directory where the result file (EditDistanceEDA_EN.xml_results.txt) containing the predictions has to be stored.

The annotations are saved in the EditDistanceEDA_EN.xml_results.txt file in the eop-resources-{version}/results/ directory and you can take a look at it with the following command:

> cat eop-resources-{version}/results/EditDistanceEDA_EN.xml_results.txt

Here is an example of such a file:

747     NONENTAILMENT   NonEntailment   0.2258241758241779
795     ENTAILMENT      Entailment      0.5741758241758221
60      ENTAILMENT      Entailment      0.24084249084248882
546     NONENTAILMENT   NonEntailment   0.15309690309690516
.....................
.....................
509     ENTAILMENT      Entailment      0.07417582417582214

The first column reports the ID of the T/H pair and the second the annotation given in the gold standard. The third column contains the prediction made by the EDA, and the last is the confidence of the prediction (i.e. how sure the EDA is about its decision).
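
Because the gold annotation and the prediction sit side by side in this file, a quick agreement count can be obtained with awk. This is only a sketch for a first sanity check (EOP ships its own scorer, described in section 4); the function name eop_agreement is hypothetical, and the comparison assumes that the two columns differ only in letter case, as in the example above.

```shell
# Hypothetical helper for illustration; not part of EOP.
# Count how often the prediction (column 3) matches the gold annotation
# (column 2), ignoring case, and print "correct/total".
# Lines with fewer than four whitespace-separated fields are skipped.
eop_agreement() {
    awk 'NF >= 4 {
             total++
             if (toupper($2) == toupper($3)) correct++
         }
         END { printf "%d/%d\n", correct, total }' "$1"
}
```

It would be called as eop_agreement eop-resources-{version}/results/EditDistanceEDA_EN.xml_results.txt and prints the number of matching rows followed by the number of rows read.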


4. Creating, testing and evaluating a new model

Goal: The purpose of this section is to use TIE EDA to create a new model on the English RTE-3 development data set and to test it on the English RTE-3 test data set. More specifically, we will use the pairs in the training data set to build a model that can predict, for each pair in the test data set, whether the hypothesis H can be inferred from the text T.

Prerequisite: EOP has to be already installed.

Main steps:

  • Training TIE on the training data set
      • pre-processing the English RTE-3 development data set by the needed linguistic pipeline
      • learning the new model with TIE EDA
  • Testing the learned model on the test data set
      • pre-processing the English RTE-3 test data set by the needed linguistic pipeline
      • testing the new model on the RTE-3 test data set

We will again use the EOPRunner class for this task, as described below.

Training TIE on the training data set

Go into the EOP-{version} directory, i.e.

> cd  ~/Excitement-Open-Platform-{version}/target/EOP-{version}/

EOPRunner calls the specified LAP to pre-process the data set and puts the produced files into the directory specified in the EDA's configuration file by the corresponding parameter (e.g. trainDir). Before running EOPRunner, check that this directory (e.g. /tmp/EN/dev/) exists and create it if necessary (e.g. mkdir -p /tmp/EN/dev/). After that you can call the EOPRunner class with the needed parameters as reported below:

> java -Djava.ext.dirs=../EOP-{version}/ eu.excitementproject.eop.util.runner.EOPRunner \
       -config ./eop-resources-{version}/configuration-files/MaxEntClassificationEDA_Base+OpenNLP_EN.xml \
       -train -trainFile ./eop-resources-{version}/data-set/English_dev.xml

where:

  • MaxEntClassificationEDA_Base+OpenNLP_EN.xml is the configuration containing the linguistic analysis pipeline, the EDA and the model that has to be created training the EDA on the training data set.
  • train means that the selected EDA has to be trained on the specified training data set.
  • trainFile is the data set of the text/hypothesis pairs that has to be used to train the EDA.

At the end of this phase the new model MaxEntClassificationEDAModel_Base+OpenNLP_EN should be available in the eop-resources-{version}/model/ directory. With TIE, when a model is saved under the same name as an existing model file, e.g. myModelFile, the old myModelFile is overwritten. To be on the safe side, the existing myModelFile is first copied to myModelFile_old in the same directory before it is overwritten.
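
The backup described above uses a single name (myModelFile_old), so it presumably holds only the most recent previous model. If you plan to retrain several times and want to keep every generation, you could copy the model aside yourself before each training run. The function below is a minimal sketch of that idea; keep_model_copy and its numbering scheme are assumptions of this example and have nothing to do with how TIE itself creates its backup.

```shell
# Hypothetical helper for illustration; not part of EOP or TIE.
# Copy an existing model file to "<model>.<n>", where <n> is the first
# unused number starting at 1, and print the name of the copy.
# Does nothing (and succeeds) if the model file does not exist yet.
keep_model_copy() {
    model=$1
    [ -f "$model" ] || return 0
    n=1
    while [ -e "$model.$n" ]; do
        n=$((n + 1))
    done
    cp -p "$model" "$model.$n" && echo "$model.$n"
}
```

For the model of this example it would be called, before retraining, as keep_model_copy ./eop-resources-{version}/model/MaxEntClassificationEDAModel_Base+OpenNLP_EN.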

Testing the learned model on the test data set

In this phase the model learned in the previous phase is used to annotate the test data set. EOPRunner calls the specified LAP to pre-process the data set and puts the produced files into the directory specified in the EDA's configuration file by the corresponding parameter (e.g. testDir). Before running EOPRunner, check that this directory (e.g. /tmp/EN/test/) exists and create it if necessary (e.g. mkdir -p /tmp/EN/test/). After that you can call the EOPRunner class with the needed parameters as reported below:

> java -Djava.ext.dirs=../EOP-{version}/ eu.excitementproject.eop.util.runner.EOPRunner \
       -config ./eop-resources-{version}/configuration-files/MaxEntClassificationEDA_Base+OpenNLP_EN.xml \
       -test -testFile ./eop-resources-{version}/data-set/English_test.xml \
       -output ./eop-resources-{version}/results/

where:

  • MaxEntClassificationEDA_Base+OpenNLP_EN.xml is the configuration containing the linguistic analysis pipeline, the EDA and the pre-trained model that have to be used to annotate the data.
  • test means that the selected EDA has to make its annotation by using a pre-trained model.
  • testFile is the data set of the text/hypothesis pairs that has to be annotated.
  • output is the directory where the result file (MaxEntClassificationEDA_Base+OpenNLP_EN.xml_results.txt) containing the predictions has to be stored.

Evaluating the created model

The annotations produced above can be evaluated in terms of Accuracy, Precision, Recall and F1 measure by using the scorer available with EOP. The evaluation can be done either during the testing phase (it is sufficient to add the parameter -score to the command line of the previous example) or after the testing phase, e.g.

> java -Djava.ext.dirs=../EOP-{version}/ eu.excitementproject.eop.util.runner.EOPRunner \
       -score -results ./eop-resources-{version}/results/MaxEntClassificationEDA_Base+OpenNLP_EN.xml_results.txt

where:

  • results is the file containing the produced annotations, in this case: ./eop-resources-{version}/results/MaxEntClassificationEDA_Base+OpenNLP_EN.xml_results.txt

It produces the file:
./eop-resources-{version}/results/MaxEntClassificationEDA_Base+OpenNLP_EN.xml_results.txt_report.xml

containing the calculated results, e.g.

<?xml version="1.0" encoding="UTF-8"?>
<Result EDA_Configuration="MaxEntClassificationEDA_Base+OpenNLP_EN.xml_results.txt">
<Total_Pairs>800</Total_Pairs>
<Accuracy>0.615</Accuracy>
<Positive_Pairs Number="410">
    <Precision>0.60991377</Precision>
    <Recall>0.6902439</Recall>
    <F_Measure>0.6475972</F_Measure>
    <Classified_As_Positive>283</Classified_As_Positive>
    <Classified_As_Negative>127</Classified_As_Negative>
</Positive_Pairs>
<Negative_Pairs Number="390">
    <Precision>0.6220238</Precision>
    <Recall>0.53589743</Recall>
    <F_Measure>0.57575756</F_Measure>
    <Classified_As_Positive>181</Classified_As_Positive>
    <Classified_As_Negative>209</Classified_As_Negative>
</Negative_Pairs>
</Result>
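
The figures in such a report can be cross-checked from the four counts it contains. The sketch below assumes the following reading of the report, which is consistent with the numbers above: of the 410 positive pairs, 283 were classified as positive (true positives) and 127 as negative (false negatives); of the 390 negative pairs, 181 were classified as positive (false positives) and 209 as negative (true negatives). The function name eop_metrics is hypothetical and the code is not the EOP scorer; it applies the usual definitions of accuracy, precision, recall and F1 for the positive class and does not guard against zero counts.

```shell
# Hypothetical helper for illustration; not part of EOP.
# Arguments: true positives, false negatives, false positives, true negatives.
# Prints accuracy plus precision, recall and F1 for the positive class.
eop_metrics() {
    LC_ALL=C awk -v tp="$1" -v fn="$2" -v fp="$3" -v tn="$4" 'BEGIN {
        precision = tp / (tp + fp)
        recall    = tp / (tp + fn)
        f1        = 2 * precision * recall / (precision + recall)
        accuracy  = (tp + tn) / (tp + fn + fp + tn)
        printf "accuracy=%.4f precision=%.4f recall=%.4f f1=%.4f\n",
               accuracy, precision, recall, f1
    }'
}
```

Called as eop_metrics 283 127 181 209, it prints accuracy=0.6150 precision=0.6099 recall=0.6902 f1=0.6476, which agrees with the Accuracy value and the Positive_Pairs block of the report. Swapping the roles of the two classes (eop_metrics 209 181 127 283) gives the values of the Negative_Pairs block.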
