Skip to content

TREMA-UNH/trec-car-tools

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TREC Car Tools

Travis badge

Development tools for participants of the TREC Complex Answer Retrieval track.

Data release support for v1.5 and v2.0. and v2.6

Note that in order to allow to compile your project for two trec-car format versions, the maven artifact Id was changed to treccar-tools-v2 with version 2.0, and the package path changed to treccar_v2

Current support for

  • Python 3.6
  • Java 1.8

If you are using Anaconda, install the cbor library for Python 3.6:

conda install -c laura-dietz cbor=1.0.0 

How to use the Python bindings for trec-car-tools?

  1. Get the data from http://trec-car.cs.unh.edu
  2. Clone this repository
  3. python setup.py install

Look out for test.py for an example on how to access the data.

How to use the java 1.8 (or higher) bindings for trec-car-tools through maven?

add to your project's pom.xml file (or similarly gradel or sbt):

    <repositories>
        <repository>
            <id>jitpack.io</id>
            <url>https://jitpack.io</url>
        </repository>
    </repositories>

add the trec-car-tools dependency:

        <dependency>     
	    <groupId>com.github.TREMA-UNH</groupId>
	    <artifactId>trec-car-tools-java</artifactId>
	    <version>20</version>
        </dependency>

compile your project with mvn compile

Tool support

This package provides support for the following activities.

  • read_data: Reading the provided paragraph collection, outline collections, and training articles
  • format_runs: writing submission files

Reading Data

If you use python or java, please use trec-car-tools, no need to understand the following. We provide bindings for haskell upon request. If you are programming under a different language, you can use any CBOR library and decode the grammar below.

CBOR is similar to JSON, but it is a binary format that compresses better and avoids text file encoding issues.

Articles, outlines, paragraphs are all described with CBOR following this grammar. Wikipedia-internal hyperlinks are preserved through ParaLinks.

     Page         -> $pageName $pageId [PageSkeleton] PageType PageMetadata
     PageType     -> ArticlePage | CategoryPage | RedirectPage ParaLink | DisambiguationPage
     PageMetadata -> RedirectNames DisambiguationNames DisambiguationIds CategoryNames CategoryIds InlinkIds InlinkAnchors WikiDataQid SiteId PageTags
     RedirectNames       -> [$pageName] 
     DisambiguationNames -> [$pageName] 
     DisambiguationIds   -> [$pageId] 
     CategoryNames       -> [$pageName] 
     CategoryIds         -> [$pageId] 
     InlinkIds           -> [$pageId] 
     InlinkAnchors       -> [$anchorText] 
     WikiDataQid         -> [$qid] 
     SiteId              -> [$siteId] 
     PageTags            -> [$pageTags] 
     
     PageSkeleton -> Section | Para | Image | ListItem | Infobox
     Section      -> $sectionHeading [PageSkeleton]
     Para         -> Paragraph
     Paragraph    -> $paragraphId, [ParaBody]
     ListItem     -> $nestingLevel, Paragraph
     Image        -> $imageURL [PageSkeleton]
     ParaBody     -> ParaText | ParaLink
     ParaText     -> $text
     ParaLink     -> $targetPage $targetPageId $linkSection $anchorText
     Infobox      -> $infoboxName [($key, [PageSkeleton])]

You can use any CBOR serialization library. Below a convenience library for reading the data into Python (3.5)

  • ./read_data/trec_car_read_data.py Python 3.5 convenience library for reading the input data (in CBOR format). -- If you use anaconda, please install the cbor library with conda install -c auto cbor=1.0 -- Otherwise install it with pypi install cbor

Ranking Results

Given an outline, your task is to produce one ranking for each section $section (representing an information need in traditional IR evaluations).

Each ranked element is an (entity,passage) pair, meaning that this passage is relevant for the section, because it features a relevant entity. "Relevant" means that the entity or passage must/should/could be listed in this section.

The section is represented by the path of headings in the outline $pageTitle/$heading1/$heading1.1/.../$section in URL encoding.

The entity is represented by the DBpedia entity id (derived from the Wikipedia URL). Optionally, the entity can be omitted.

The passage is represented by the passage id given in the passage corpus (an MD5 hash of the content). Optionally, the passage can be omitted.

The results are provided in a format that is similar to the "trec_results file format" of trec_eval. More info on how to use trec_eval and source.

Example of ranking format

     Green\_sea\_turtle\Habitat  Pelagic\_zone  12345          0     27409 myTeam 
     $qid                        $entity        $passageId     rank  sim   run_id 

Integration with other tools

It is recommended to use the format_runs package to write run files. Here an example:

    with open('runfile', mode='w', encoding='UTF-8') as f:
        writer = configure_csv_writer(f)
        for page in pages:
            for section_path in page.flat_headings_list():
                ranking = [RankingEntry(page.page_name, section_path, p.para_id, r, s, paragraph_content=p) for p,s,r in ranking]
                format_run(writer, ranking, exp_name='test')

        f.close()

This ensures that the output is correctly formatted to work with trec_eval and the provided qrels file.

Run trec_eval version 9.0.4 as usual:

  trec_eval -q release.qrel runfile > run.eval

The output is compatible with the eval plotting package minir-plots. For example run

  python column.py --out column-plot.pdf --metric map run.eval
  python column_difficulty.py --out column-difficulty-plot.pdf --metric map run.eval run2.eval

Moreover, you can compute success statistics such as hurts/helps or a paired-t-test as follows.

  python hurtshelps.py --metric map run.eval run2.eval
  python paired-ttest.py --metric map run.eval run2.eval

Creative Commons License
TREC-CAR Dataset by Laura Dietz, Ben Gamari is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
Based on a work at www.wikipedia.org.