Development tools for participants of the TREC Complex Answer Retrieval track.
Data release support for v1.5, v2.0, and v2.6.
Note that, to let you compile your project against two trec-car format versions, the Maven artifact ID was changed to treccar-tools-v2 with version 2.0, and the package path was changed to treccar_v2.
Current support for
- Python 3.6
- Java 1.8
If you are using Anaconda, install the cbor library for Python 3.6:
conda install -c laura-dietz cbor=1.0.0
- Get the data from http://trec-car.cs.unh.edu
- Clone this repository
python setup.py install
See test.py for an example of how to access the data.
Add the following to your project's pom.xml file (or the equivalent for Gradle or sbt):
<repositories>
<repository>
<id>jitpack.io</id>
<url>https://jitpack.io</url>
</repository>
</repositories>
Add the trec-car-tools dependency:
<dependency>
<groupId>com.github.TREMA-UNH</groupId>
<artifactId>trec-car-tools-java</artifactId>
<version>20</version>
</dependency>
Compile your project with mvn compile.
This package provides support for the following activities.
- read_data: reading the provided paragraph collection, outline collections, and training articles
- format_runs: writing submission files
If you use Python or Java, please use trec-car-tools; there is no need to understand the following. We provide bindings for Haskell upon request. If you are programming in a different language, you can use any CBOR library and decode the grammar below.
CBOR is similar to JSON, but it is a binary format that compresses better and avoids text file encoding issues.
Articles, outlines, and paragraphs are all described with CBOR following this grammar. Wikipedia-internal hyperlinks are preserved through ParaLinks.
Page -> $pageName $pageId [PageSkeleton] PageType PageMetadata
PageType -> ArticlePage | CategoryPage | RedirectPage ParaLink | DisambiguationPage
PageMetadata -> RedirectNames DisambiguationNames DisambiguationIds CategoryNames CategoryIds InlinkIds InlinkAnchors WikiDataQid SiteId PageTags
RedirectNames -> [$pageName]
DisambiguationNames -> [$pageName]
DisambiguationIds -> [$pageId]
CategoryNames -> [$pageName]
CategoryIds -> [$pageId]
InlinkIds -> [$pageId]
InlinkAnchors -> [$anchorText]
WikiDataQid -> [$qid]
SiteId -> [$siteId]
PageTags -> [$pageTags]
PageSkeleton -> Section | Para | Image | ListItem | Infobox
Section -> $sectionHeading [PageSkeleton]
Para -> Paragraph
Paragraph -> $paragraphId, [ParaBody]
ListItem -> $nestingLevel, Paragraph
Image -> $imageURL [PageSkeleton]
ParaBody -> ParaText | ParaLink
ParaText -> $text
ParaLink -> $targetPage $targetPageId $linkSection $anchorText
Infobox -> $infoboxName [($key, [PageSkeleton])]
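To make the grammar more concrete, here is a minimal Python sketch of how a subset of it could be modeled in memory. It is not the trec-car-tools API: the class names, field names, and both helper functions are hypothetical, chosen only to mirror the grammar symbols, and Image, ListItem, and Infobox are left out for brevity. It does not decode CBOR; it only shows how the nested structure can be traversed once decoded.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class ParaText:              # ParaText -> $text
    text: str


@dataclass
class ParaLink:              # ParaLink -> $targetPage $targetPageId $linkSection $anchorText
    target_page: str
    target_page_id: str
    link_section: str
    anchor_text: str


@dataclass
class Paragraph:             # Paragraph -> $paragraphId, [ParaBody]
    paragraph_id: str
    bodies: List[Union[ParaText, ParaLink]] = field(default_factory=list)


@dataclass
class Section:               # Section -> $sectionHeading [PageSkeleton]
    heading: str
    children: List[Union["Section", Paragraph]] = field(default_factory=list)


def paragraph_text(para):
    """Join plain text and link anchor text in document order."""
    return "".join(
        b.text if isinstance(b, ParaText) else b.anchor_text for b in para.bodies
    )


def section_paths(skeleton, prefix=()):
    """Yield the heading path (a tuple) of every section, depth first."""
    for node in skeleton:
        if isinstance(node, Section):
            path = prefix + (node.heading,)
            yield path
            yield from section_paths(node.children, path)


# Made-up example content; the ids are placeholders, not real corpus ids.
para = Paragraph("p1", [
    ParaText("Green sea turtles live in the "),
    ParaLink("Pelagic zone", "placeholder-page-id", "", "pelagic zone"),
    ParaText("."),
])
skeleton = [Section("Habitat", [para, Section("Nesting", [])])]

print(paragraph_text(para))
print(list(section_paths(skeleton)))
```

Keeping ParaLink as a separate type from ParaText means the link target stays available for entity-based retrieval, while the anchor text still contributes to the paragraph's readable content.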
You can use any CBOR serialization library. Below is a convenience library for reading the data into Python (3.5):
./read_data/trec_car_read_data.py
Python 3.5 convenience library for reading the input data (in CBOR format).
- If you use Anaconda, please install the cbor library with conda install -c auto cbor=1.0
- Otherwise, install it with pip install cbor
Given an outline, your task is to produce one ranking for each section $section (representing an information need in traditional IR evaluations).
Each ranked element is an (entity,passage) pair, meaning that this passage is relevant for the section, because it features a relevant entity. "Relevant" means that the entity or passage must/should/could be listed in this section.
The section is represented by the path of headings in the outline, $pageTitle/$heading1/$heading1.1/.../$section, in URL encoding.
The entity is represented by the DBpedia entity id (derived from the Wikipedia URL). Optionally, the entity can be omitted.
The passage is represented by the passage id given in the passage corpus (an MD5 hash of the content). Optionally, the passage can be omitted.
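As an illustration of the section representation described above, the following sketch builds a heading path and percent-encodes it with the standard library. The function name section_query_id is hypothetical, and encoding each heading separately before joining with "/" is an assumption; the exact encoding produced by the official tools may differ, so prefer the identifiers shipped with the outlines and qrels where available.

```python
from urllib.parse import quote


def section_query_id(page_title, headings):
    """Join the page title and the heading path with '/', percent-encoding each part.

    safe="" also encodes any '/' inside a heading, so the separator stays unambiguous.
    """
    parts = [page_title, *headings]
    return "/".join(quote(part, safe="") for part in parts)


print(section_query_id("Green sea turtle", ["Habitat"]))
print(section_query_id("AC/DC", ["History", "Early years"]))
```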
The results are provided in a format that is similar to the "trec_results file format" of trec_eval. More info on how to use trec_eval and source.
Example of ranking format
Green_sea_turtle/Habitat Pelagic_zone 12345 0 27409 myTeam
$qid $entity $passageId rank sim run_id
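For checking a run file by hand, the six columns can be read back with a few lines of Python. This is a sketch only: RunLine and parse_run_line are hypothetical names, and the code assumes all six whitespace-separated fields are present, so lines with an omitted entity or passage would need extra handling that is not shown here.

```python
from typing import NamedTuple


class RunLine(NamedTuple):
    qid: str
    entity: str
    passage_id: str
    rank: int
    sim: float
    run_id: str


def parse_run_line(line):
    """Split one ranking line into its six columns and convert rank and sim to numbers."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError("expected 6 columns, got %d: %r" % (len(fields), line))
    qid, entity, passage_id, rank, sim, run_id = fields
    return RunLine(qid, entity, passage_id, int(rank), float(sim), run_id)


entry = parse_run_line("Green_sea_turtle/Habitat Pelagic_zone 12345 0 27409 myTeam")
print(entry.qid, entry.rank, entry.sim)
```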
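It is recommended to use the format_runs package to write run files. Here is an example:
# configure_csv_writer, RankingEntry, and format_run are provided by the format_runs package.
with open('runfile', mode='w', encoding='UTF-8') as f:
    writer = configure_csv_writer(f)
    for page in pages:
        for section_path in page.flat_headings_list():
            # `ranking` holds your own (paragraph, score, rank) tuples for this section.
            entries = [RankingEntry(page.page_name, section_path, p.para_id, r, s, paragraph_content=p) for p, s, r in ranking]
            format_run(writer, entries, exp_name='test')
# No explicit f.close() is needed: the with block closes the file.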
This ensures that the output is correctly formatted to work with trec_eval and the provided qrels file.
Run trec_eval version 9.0.4 as usual:
trec_eval -q release.qrel runfile > run.eval
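If you want to inspect the per-query numbers in run.eval yourself, a small parser is enough. The sketch below relies on trec_eval printing one measure per line as three whitespace-separated columns (measure name, query id, value), with the id "all" for the summary row. The function name parse_eval and the sample numbers are made up for illustration.

```python
def parse_eval(text, metric):
    """Return {query_id: value} for one metric, skipping the 'all' summary row."""
    scores = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # ignore blank or unexpected lines
        name, qid, value = fields
        if name == metric and qid != "all":
            scores[qid] = float(value)
    return scores


# Made-up sample in the shape of `trec_eval -q` output.
sample = """\
map                   \tq1\t0.2500
P_10                  \tq1\t0.1000
map                   \tq2\t0.5000
runid                 \tall\tmyTeam
map                   \tall\t0.3750
"""

print(parse_eval(sample, "map"))
```

In practice you would pass the contents of run.eval, for example open('run.eval').read(), instead of the inline sample.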
The output is compatible with the eval plotting package minir-plots. For example, run:
python column.py --out column-plot.pdf --metric map run.eval
python column_difficulty.py --out column-difficulty-plot.pdf --metric map run.eval run2.eval
Moreover, you can compute success statistics such as hurts/helps or a paired t-test as follows:
python hurtshelps.py --metric map run.eval run2.eval
python paired-ttest.py --metric map run.eval run2.eval
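For readers who want to see what such statistics involve, here is a self-contained sketch that counts helps and hurts and computes a paired t statistic from two per-query score dictionaries. It is not the code of hurtshelps.py or paired-ttest.py, whose exact definitions may differ; the function names and scores are hypothetical, and only the t statistic (not the p-value) is computed.

```python
import math
from statistics import mean, stdev


def helps_hurts(base, other):
    """Count queries where `other` scores above (helps) or below (hurts) `base`."""
    shared = base.keys() & other.keys()
    helps = sum(1 for q in shared if other[q] > base[q])
    hurts = sum(1 for q in shared if other[q] < base[q])
    return helps, hurts


def paired_t_statistic(base, other):
    """t = mean(d) / (stdev(d) / sqrt(n)) over per-query differences d = other - base."""
    shared = sorted(base.keys() & other.keys())
    diffs = [other[q] - base[q] for q in shared]
    if len(diffs) < 2:
        raise ValueError("need at least two shared queries")
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(len(diffs)))


# Made-up per-query MAP values for two runs.
run1 = {"q1": 0.2, "q2": 0.5, "q3": 0.4}
run2 = {"q1": 0.3, "q2": 0.4, "q3": 0.7}

print(helps_hurts(run1, run2))
print(paired_t_statistic(run1, run2))
```

Only queries present in both runs are compared, so a query missing from one eval file is silently ignored rather than counted as a zero.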
TREC-CAR Dataset by Laura Dietz, Ben Gamari is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
Based on a work at www.wikipedia.org.