StonyBrookNLP/PeKo

PeKo: A Large Scale Precondition Knowledge Dataset

Overview

PeKo (Precondition Knowledge) is a large-scale, crowdsourced dataset of event precondition knowledge, introduced in our paper "Modeling Preconditions in Text with a Crowd-sourced Dataset" at EMNLP Findings 2020.

A preprint is available on arXiv (arXiv:2010.02429).

Crowdsourcing Precondition Knowledge

Crowdsourcing Task

Data Preparation

We extracted events and their temporal relations from news articles using CAEVO (Chambers et al., 2014), a temporal relation extraction system. We ran CAEVO on a random sample of 6,837 articles in the New York Times Annotated Corpus (Sandhaus, 2008). On average, CAEVO extracted around 63 events per article, which yielded a total of 3,906 possible relation candidates per document. We filtered these to retain only pairs of events that have a BEFORE or AFTER temporal relation between them. We call the temporally preceding event in a pair the candidate precondition, and the temporally subsequent event the target event.
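The filtering step above can be sketched as follows. This is a minimal illustration, not the code used to build PeKo: the `(event_a, relation, event_b)` tuple layout, the `CandidatePair` class, and the `filter_temporal_pairs` function are hypothetical and do not reflect CAEVO's actual output format. The last lines also check that 3,906 is consistent with counting ordered pairs among 63 events, which is our reading of the figure rather than something the text states.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class CandidatePair:
    """Hypothetical container for one filtered event pair."""
    precondition: str  # temporally preceding event (candidate precondition)
    target: str        # temporally subsequent event (target event)


def filter_temporal_pairs(
    relations: Iterable[Tuple[str, str, str]]
) -> List[CandidatePair]:
    """Keep only BEFORE/AFTER relations and orient each pair by time.

    Each input item is assumed to be (event_a, relation, event_b).
    """
    pairs = []
    for event_a, relation, event_b in relations:
        if relation == "BEFORE":    # event_a precedes event_b
            pairs.append(CandidatePair(precondition=event_a, target=event_b))
        elif relation == "AFTER":   # event_a follows event_b
            pairs.append(CandidatePair(precondition=event_b, target=event_a))
        # every other relation label is dropped
    return pairs


# Made-up relations for illustration only.
example = [
    ("arrested", "BEFORE", "charged"),
    ("resigned", "AFTER", "accused"),
    ("said", "SIMULTANEOUS", "met"),
]
pairs = filter_temporal_pairs(example)

# 63 events give 63 * 62 ordered pairs, matching the 3,906 candidates above.
events_per_article = 63
ordered_pairs = events_per_article * (events_per_article - 1)
```

Orienting the pair at filter time means downstream code never has to look at the relation label again: the first field is always the earlier event.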

Crowdsourcing Task

The annotators were presented with a text snippet in which two event mentions were highlighted, as shown below. To prune out event extraction errors from CAEVO, the annotators were first asked whether the highlighted text denoted valid events. If both triggers were deemed valid, the annotators then evaluated whether the candidate precondition event was an actual precondition for the target event. Specifically, they checked whether the candidate event was necessary for the target event to happen.

HIT example

As a result of crowdsourcing, we obtained 10,806 preconditions out of 28,948 instances in total.

Tasks

We now propose two tasks that test for the ability to recognize and generate preconditions in textual contexts. Here we describe evaluations to benchmark the performance of current models on these tasks and to better understand the challenges involved.

PeKo Task 1: Precondition Identification

Given a text snippet with a target and candidate event pair, the task is to classify whether the candidate event is a precondition for the target in the context described by the text snippet. This is a standard sentence-level classification task.
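Since this is a binary classification task, predictions can be scored with the usual precision, recall, F1, and accuracy. The sketch below assumes label 1 (precondition) is the positive class; this section does not state which metrics the result table reports, so treat the choice of metrics as an assumption. The `binary_metrics` function and the toy labels are illustrative only.

```python
from typing import Dict, Sequence


def binary_metrics(gold: Sequence[int], pred: Sequence[int]) -> Dict[str, float]:
    """Precision, recall, F1 (positive class = 1) and accuracy."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    correct = sum(1 for g, p in zip(gold, pred) if g == p)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = correct / len(gold) if len(gold) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Toy labels: 1 = precondition, 0 = non-precondition.
gold = [1, 0, 1, 0, 0]
pred = [1, 0, 0, 1, 0]
scores = binary_metrics(gold, pred)
```

Reporting F1 alongside accuracy is useful when the classes are imbalanced, because a classifier that always predicts the majority class can reach a high accuracy while scoring zero F1 on the positive class.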

Result Table

PeKo Task 2: Precondition Generation Task

Here we introduce Precondition Generation as a more general challenge that a dataset like PeKo now enables: given a target event t, generate an event p that is a precondition for t. We benchmark performance on evaluation instances drawn from both PeKo and an out-of-domain dataset, ATOMIC.

Generation Result Table

Download

The dataset can be downloaded from here

Citation

Please use the following bibtex entry:

@article{kwon2020modeling,
title={Modeling Preconditions in Text with a Crowd-sourced Dataset},
author={Kwon, Heeyoung and Koupaee, Mahnaz and Singh, Pratyush and Sawhney, Gargi and Shukla, Anmol and Kallur, Keerthi Kumar and Chambers, Nathanael and Balasubramanian, Niranjan},
journal={arXiv preprint arXiv:2010.02429},
year={2020}
}

Dataset Information

data
 ├── peko_all.jsonl             # PeKo dataset
 ├── peko_gen_train.txt         # PeKo generation instances
 ├── peko_gen_dev.txt
 ├── peko_gen_test.txt
 ├── temp_gen_train.txt         # Generation instances for temporal model
 ├── temp_gen_dev.txt
 ├── LM_gen_train.txt           # Generation instances for plain language model
 ├── LM_gen_dev.txt
 └── atomic_samples.txt         # ATOMIC samples for generation task
  • peko_all.jsonl: PeKo dataset; each line contains a single JSON document with the following fields.

    • sent_id: sentence ID
    • source: a candidate precondition event
    • target: a target event
    • label: 1 for precondition, 0 for non-precondition
    • n_yes: the number of votes for precondition
    • n_vote: the number of annotators
    • sent: sentence(s); tokens are separated by spaces
  • {peko/temp/LM}_gen_*.txt

    Tab-separated text files. The first column contains the full text, which is used as the generation target, and the second column contains the same instance with the precondition masked out.

  • atomic_samples.txt

    The file contains generation seeds from the ATOMIC dataset.
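The file formats described above can be read with a few lines of Python. This is a minimal sketch, not an official loader: the function names `read_peko_jsonl`, `read_generation_tsv`, and `label_counts` are hypothetical, UTF-8 encoding is assumed, and only the field names listed above are relied on. The exact types of the source and target values are not specified here, so the code does not inspect them.

```python
import json
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def read_peko_jsonl(path: str) -> Iterator[dict]:
    """Yield one record per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:  # encoding is an assumption
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def read_generation_tsv(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (full_text, masked_text) pairs from a two-column TSV file.

    Column 1 is the full text (generation target); column 2 is the
    precondition-masked-out instance. Raises ValueError on a line
    without a tab.
    """
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line:
                continue
            full_text, masked_text = line.split("\t", 1)
            yield full_text, masked_text


def label_counts(records: Iterable[dict]) -> Dict[int, int]:
    """Count precondition (1) and non-precondition (0) labels."""
    counts = Counter(int(record["label"]) for record in records)
    return {0: counts.get(0, 0), 1: counts.get(1, 0)}
```

A typical use would be `label_counts(read_peko_jsonl("data/peko_all.jsonl"))`, assuming the data directory sits in the working directory. Comparing the result with the instance counts reported in the crowdsourcing section is a quick sanity check on a fresh download, although this section does not confirm that the file contains exactly those instances.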

Contributors

  • Heeyoung Kwon (Stony Brook University)
  • Mahnaz Koupaee (Stony Brook University)
  • Pratyush Singh (Stony Brook University)
  • Gargi Sawhney (Stony Brook University)
  • Anmol Shukla (Stony Brook University)
  • Keerthi Kumar Kallur (Stony Brook University)
  • Nate Chambers (US Naval Academy)
  • Niranjan Balasubramanian (Stony Brook University)
