TREC Car Tools

Development tools for participants of the TREC Complex Answer Retrieval track.

Data release support for v1.5, v2.0, and v2.6.

Note: so that a project can be compiled against two trec-car format versions at once, the Maven artifactId was changed to treccar-tools-v2 as of version 2.0, and the package path was changed to treccar_v2.

Currently supported:

Python 3.6
Java 1.8

If you are using Anaconda, install the cbor library for Python 3.6:

conda install -c laura-dietz cbor=1.0.0

How to use the Python bindings for trec-car-tools?

Get the data from http://trec-car.cs.unh.edu
Clone this repository
python setup.py install

See test.py for an example of how to access the data.

How to use the Java 1.8 (or higher) bindings for trec-car-tools through Maven?

Add the JitPack repository to your project's pom.xml file (or the Gradle or sbt equivalent):

    <repositories>
        <repository>
            <id>jitpack.io</id>
            <url>https://jitpack.io</url>
        </repository>
    </repositories>

Add the trec-car-tools dependency:

    <dependency>
        <groupId>com.github.TREMA-UNH</groupId>
        <artifactId>trec-car-tools-java</artifactId>
        <version>20</version>
    </dependency>

Compile your project with mvn compile.

Tool support

This package provides support for the following activities.

read_data: reading the provided paragraph collection, outline collections, and training articles
format_runs: writing submission files

Reading Data

If you use Python or Java, please use trec-car-tools; you do not need to understand the following. We provide bindings for Haskell upon request. If you are programming in a different language, you can use any CBOR library and decode the data according to the grammar below.

CBOR is similar to JSON, but it is a binary format that compresses better and avoids text file encoding issues.
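
To make the comparison with JSON concrete, here is a minimal sketch of a CBOR encoder for a small subset of types (non-negative integers, text strings, and arrays). The function names cbor_head and cbor_encode and the sample record are made up for illustration; this is not the encoder used to produce the data release, and real projects should use an existing CBOR library.

```python
import json
import struct


def cbor_head(major: int, value: int) -> bytes:
    """Encode a CBOR head: a 3-bit major type plus an unsigned argument."""
    if value < 24:
        return bytes([(major << 5) | value])          # argument fits in the initial byte
    if value < 2 ** 8:
        return bytes([(major << 5) | 24]) + struct.pack(">B", value)
    if value < 2 ** 16:
        return bytes([(major << 5) | 25]) + struct.pack(">H", value)
    if value < 2 ** 32:
        return bytes([(major << 5) | 26]) + struct.pack(">I", value)
    return bytes([(major << 5) | 27]) + struct.pack(">Q", value)


def cbor_encode(obj) -> bytes:
    """Encode non-negative ints, strings, and lists (a small CBOR subset)."""
    if isinstance(obj, int) and not isinstance(obj, bool) and obj >= 0:
        return cbor_head(0, obj)                      # major type 0: unsigned integer
    if isinstance(obj, str):
        data = obj.encode("utf-8")
        return cbor_head(3, len(data)) + data         # major type 3: text string
    if isinstance(obj, list):
        items = b"".join(cbor_encode(x) for x in obj)
        return cbor_head(4, len(obj)) + items         # major type 4: array
    raise TypeError(f"unsupported type: {type(obj).__name__}")


# Hypothetical record, only to compare encoded sizes.
record = ["Green sea turtle", 12345, ["Habitat", "Diet"]]
as_cbor = cbor_encode(record)
as_json = json.dumps(record).encode("utf-8")
print(len(as_cbor), len(as_json))
```

Strings are stored as length-prefixed UTF-8 bytes, so no quoting or escaping is needed, which is where the text-encoding issues of JSON are avoided.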

Articles, outlines, and paragraphs are all described in CBOR following this grammar. Wikipedia-internal hyperlinks are preserved through ParaLinks.

     Page         -> $pageName $pageId [PageSkeleton] PageType PageMetadata
     PageType     -> ArticlePage | CategoryPage | RedirectPage ParaLink | DisambiguationPage
     PageMetadata -> RedirectNames DisambiguationNames DisambiguationIds CategoryNames CategoryIds InlinkIds InlinkAnchors WikiDataQid SiteId PageTags
     RedirectNames       -> [$pageName] 
     DisambiguationNames -> [$pageName] 
     DisambiguationIds   -> [$pageId] 
     CategoryNames       -> [$pageName] 
     CategoryIds         -> [$pageId] 
     InlinkIds           -> [$pageId] 
     InlinkAnchors       -> [$anchorText] 
     WikiDataQid         -> [$qid] 
     SiteId              -> [$siteId] 
     PageTags            -> [$pageTags] 
     
     PageSkeleton -> Section | Para | Image | ListItem | Infobox
     Section      -> $sectionHeading [PageSkeleton]
     Para         -> Paragraph
     Paragraph    -> $paragraphId, [ParaBody]
     ListItem     -> $nestingLevel, Paragraph
     Image        -> $imageURL [PageSkeleton]
     ParaBody     -> ParaText | ParaLink
     ParaText     -> $text
     ParaLink     -> $targetPage $targetPageId $linkSection $anchorText
     Infobox      -> $infoboxName [($key, [PageSkeleton])]
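
As a rough illustration of how part of this grammar could map onto in-memory types, the sketch below models Section, Paragraph, ParaText, and ParaLink as Python dataclasses. All class, field, and function names here are assumptions chosen for this example; they are not the classes provided by trec-car-tools, and Image, ListItem, and Infobox are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class ParaText:
    text: str


@dataclass
class ParaLink:
    target_page: str
    target_page_id: str
    link_section: str   # assumed empty when the link targets a whole page
    anchor_text: str


@dataclass
class Paragraph:
    paragraph_id: str
    bodies: List[Union[ParaText, ParaLink]] = field(default_factory=list)

    def text(self) -> str:
        """Concatenate plain text and link anchor text in document order."""
        return "".join(
            b.text if isinstance(b, ParaText) else b.anchor_text
            for b in self.bodies
        )


@dataclass
class Section:
    heading: str
    children: List[Union["Section", Paragraph]] = field(default_factory=list)


def heading_paths(section, prefix=()):
    """Yield every heading path (a tuple of headings) in a section tree."""
    path = prefix + (section.heading,)
    yield path
    for child in section.children:
        if isinstance(child, Section):
            yield from heading_paths(child, path)


# Hypothetical content, only to exercise the types.
para = Paragraph("p1", [
    ParaText("Turtles live in the "),
    ParaLink("Pelagic zone", "Pelagic%20zone", "", "open ocean"),
    ParaText("."),
])
tree = Section("Habitat", [para, Section("Nesting")])
print(para.text())
print(list(heading_paths(tree)))
```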

You can use any CBOR serialization library. Below is a convenience library for reading the data into Python (3.5).

./read_data/trec_car_read_data.py: Python 3.5 convenience library for reading the input data (in CBOR format). If you use Anaconda, please install the cbor library with conda install -c auto cbor=1.0; otherwise install it with pip install cbor.

Ranking Results

Given an outline, your task is to produce one ranking for each section $section (representing an information need in traditional IR evaluations).

Each ranked element is an (entity,passage) pair, meaning that this passage is relevant for the section, because it features a relevant entity. "Relevant" means that the entity or passage must/should/could be listed in this section.

The section is represented by the path of headings in the outline $pageTitle/$heading1/$heading1.1/.../$section in URL encoding.
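
A minimal sketch of building such a path, assuming that each component is percent-encoded separately and the components are then joined with "/". The helper name section_query_id is made up for this example, and the identifiers in the official outlines and qrels may differ in detail, so prefer the ids shipped with the data when they are available.

```python
from urllib.parse import quote


def section_query_id(page_title, headings):
    """Join the page title and heading path with '/', percent-encoding each part."""
    parts = [page_title] + list(headings)
    # safe="" also encodes a '/' inside a heading, so it cannot be
    # confused with the path separator.
    return "/".join(quote(part, safe="") for part in parts)


print(section_query_id("Green sea turtle", ["Habitat", "Nesting sites"]))
```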

The entity is represented by the DBpedia entity id (derived from the Wikipedia URL). Optionally, the entity can be omitted.

The passage is represented by the passage id given in the passage corpus (an MD5 hash of the content). Optionally, the passage can be omitted.

The results are provided in a format that is similar to the "trec_results file format" of trec_eval. More info on how to use trec_eval and source.

Example of ranking format

     Green_sea_turtle/Habitat  Pelagic_zone  12345       0     27409  myTeam
     $qid                      $entity       $passageId  rank  sim    run_id
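
For reading such a file back, the following sketch splits one ranking line into typed fields. The names RunLine and parse_run_line are hypothetical, and the sketch assumes all six columns are present; how an omitted entity or passage is written is not specified here, so that case is not handled.

```python
from typing import NamedTuple


class RunLine(NamedTuple):
    qid: str
    entity: str
    passage_id: str
    rank: int
    score: float
    run_id: str


def parse_run_line(line: str) -> RunLine:
    """Split one whitespace-separated ranking line into typed fields."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}: {line!r}")
    qid, entity, passage_id, rank, score, run_id = fields
    return RunLine(qid, entity, passage_id, int(rank), float(score), run_id)


entry = parse_run_line("Green_sea_turtle/Habitat Pelagic_zone 12345 0 27409 myTeam")
print(entry.qid, entry.rank, entry.score)
```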

Integration with other tools

It is recommended to use the format_runs package to write run files. Here is an example:

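    # Assumes the Python bindings are installed and importable as trec_car.
    from trec_car.format_runs import RankingEntry, configure_csv_writer, format_run

    # pages: outline pages read from the data release.
    # scored_paragraphs: your (paragraph, score, rank) triples for the current section.
    with open('runfile', mode='w', encoding='UTF-8') as f:
        writer = configure_csv_writer(f)
        for page in pages:
            for section_path in page.flat_headings_list():
                ranking = [RankingEntry(page.page_name, section_path, p.para_id, r, s, paragraph_content=p)
                           for p, s, r in scored_paragraphs]
                format_run(writer, ranking, exp_name='test')
    # No explicit f.close() is needed: the with block closes the file.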

This ensures that the output is correctly formatted to work with trec_eval and the provided qrels file.

Run trec_eval version 9.0.4 as usual:

  trec_eval -q release.qrel runfile > run.eval
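
If you want to post-process run.eval yourself, the sketch below reads the three-column output of trec_eval -q (measure, query id, value) into a nested dictionary. The function name parse_eval and the sample lines are made up for illustration.

```python
def parse_eval(lines):
    """Return {metric: {query_id: value}} from trec_eval -q output lines."""
    scores = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        metric, qid, value = fields
        try:
            number = float(value)
        except ValueError:
            continue  # skip non-numeric summary fields such as runid
        scores.setdefault(metric, {})[qid] = number
    return scores


# Hypothetical sample output; real files are read with open('run.eval').
sample = """\
map                   \tGreen_sea_turtle/Habitat\t0.2500
map                   \tGreen_sea_turtle/Diet\t0.5000
runid                 \tall\ttest
map                   \tall\t0.3750
"""
per_query = parse_eval(sample.splitlines())
print(per_query["map"])
```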

The trec_eval output is compatible with the eval plotting package minir-plots. For example, run:

  python column.py --out column-plot.pdf --metric map run.eval
  python column_difficulty.py --out column-difficulty-plot.pdf --metric map run.eval run2.eval

Moreover, you can compute success statistics such as hurts/helps or a paired t-test as follows:

  python hurtshelps.py --metric map run.eval run2.eval
  python paired-ttest.py --metric map run.eval run2.eval
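
To show what these statistics measure, here is a small standard-library sketch that counts hurts/helps and computes the paired t statistic from two per-query score dictionaries. It is not the minir-plots implementation; the function names and the scores are invented for the example, and it stops at the t statistic without computing a p-value.

```python
import math
from statistics import mean, stdev


def hurts_helps(base, other):
    """Count queries where `other` scores lower (hurts) or higher (helps) than `base`."""
    shared = sorted(set(base) & set(other))
    hurts = sum(1 for q in shared if other[q] < base[q])
    helps = sum(1 for q in shared if other[q] > base[q])
    return hurts, helps


def paired_t_statistic(base, other):
    """Compute t = mean(d) / (stdev(d) / sqrt(n)) over per-query differences d."""
    shared = sorted(set(base) & set(other))
    diffs = [other[q] - base[q] for q in shared]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two shared queries")
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(n))


# Hypothetical per-query MAP values for two runs.
run1 = {"q1": 0.20, "q2": 0.50, "q3": 0.40, "q4": 0.10}
run2 = {"q1": 0.30, "q2": 0.45, "q3": 0.60, "q4": 0.10}
print(hurts_helps(run1, run2))
print(paired_t_statistic(run1, run2))
```

Only queries present in both runs are compared, and ties count as neither a hurt nor a help.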

TREC-CAR Dataset by Laura Dietz, Ben Gamari is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
Based on a work at www.wikipedia.org.
