TNE is an NLU task that focuses on relations between noun phrases (NPs) which can be mediated via prepositions. The dataset contains 5,497 documents, annotated exhaustively with all possible links between the NPs in each document.
For more details, check out our paper, "Text-based NP Enrichment", and the website.
- Key Links
- TNE Dataset: Download
- Paper: "Text-based NP Enrichment"
- Models Code: https://github.com/yanaiela/TNE/tree/main/tne/modeling
- Leaderboard:
- TNE: Leaderboard
- TNE-OOD: Leaderboard
- Evaluator Code: https://github.com/allenai/tne-evaluator
- Website: https://yanaiela.github.io/TNE
```python
from datasets import load_dataset

dataset = load_dataset("tne")
```
The dataset is spread across four files, one for each split: train, dev, test, and ood. Each file is in jsonl format, where each line is a dictionary describing a single document. A document consists of:

- `id`: a unique identifier of the document, beginning with `r` and followed by a number
- `text`: the text of the document. The title and subtitles (if they exist) are separated by two new lines; paragraphs are separated by a single new line.
- `tokens`: a list of strings, containing the tokenized tokens
- `nps`: a list of dictionaries, one per NP (note that this field was changed in v1.1 from a dictionary keyed by NP id to a list of dictionaries, to match the huggingface `datasets` library). Each dictionary contains:
  - `text`: the text of the NP
  - `start_index`: an integer indicating the starting index of the NP in the text
  - `end_index`: an integer indicating the ending index of the NP in the text
  - `start_token`: an integer indicating the first token of the NP out of the tokenized tokens
  - `end_token`: an integer indicating the last token of the NP out of the tokenized tokens
  - `id`: the NP identifier
- `np_relations`: the relation labels of the document. It is a list of dictionaries, where each dictionary contains:
  - `anchor`: the id of the anchor NP
  - `complement`: the id of the complement NP
  - `preposition`: the preposition that links the anchor to the complement
  - `complement_coref_cluster_id`: the id of the coreference cluster the complement is part of
- `coref`: the coreference labels. It is a list of dictionaries, where each dictionary contains:
  - `id`: the id of the coreference cluster
  - `members`: the ids of the NPs that are members of the cluster
  - `np_type`: the type of the cluster. It can be either:
    - `standard`: a regular coreference cluster
    - `time/date/measurement`: a time / date / measurement NP. These will be singletons.
    - `idiomatic`: an idiomatic expression
- `metadata`: metadata of the document. It contains the following:
  - `annotators`: a dictionary with anonymized annotator ids:
    - `coref_worker`: the coreference worker id
    - `consolidator_worker`: the consolidator worker id
    - `np-relations_worker`: the NP relations worker id
  - `url`: the url the document was taken from (not always present)
  - `source`: the original file name the document was taken from
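As a concrete illustration of this format, here is a minimal sketch that builds a toy document (invented for illustration, not taken from the dataset), recovers NP surface forms from the token spans, and prints the relation triples. Per the schema above, `end_token` points at the last token of the NP, so the slice below treats it as inclusive:

```python
import json

# A toy document following the TNE schema (invented for illustration;
# real documents come from the train/dev/test/ood jsonl files).
doc_line = json.dumps({
    "id": "r1",
    "text": "The CEO of the company resigned.",
    "tokens": ["The", "CEO", "of", "the", "company", "resigned", "."],
    "nps": [
        {"text": "The CEO of the company", "start_index": 0, "end_index": 22,
         "start_token": 0, "end_token": 4, "id": "np0"},
        {"text": "the company", "start_index": 11, "end_index": 22,
         "start_token": 3, "end_token": 4, "id": "np1"},
    ],
    "np_relations": [
        {"anchor": "np0", "complement": "np1", "preposition": "of",
         "complement_coref_cluster_id": "c0"},
    ],
})

doc = json.loads(doc_line)

# Since v1.1, `nps` is a list of dictionaries; index it by NP id for lookups.
nps = {np["id"]: np for np in doc["nps"]}

# Recover each NP's surface form from its token span (end_token is inclusive).
for np in doc["nps"]:
    span = " ".join(doc["tokens"][np["start_token"]: np["end_token"] + 1])
    print(np["id"], "->", span)

# Print the (anchor, preposition, complement) triples.
for rel in doc["np_relations"]:
    print(nps[rel["anchor"]]["text"], "|", rel["preposition"], "|",
          nps[rel["complement"]]["text"])
```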
Install dependencies:

```shell
conda create -n tne python==3.7 anaconda
conda activate tne
pip install -r requirements.txt
```
We train the models using allennlp. To run the coupled-large model, run:

```shell
allennlp train tne/modeling/configs/coupled_large.jsonnet \
    --include-package tne \
    -s models/coupled_spanbert_large
```
After training a model (or using the trained one), you can get the predictions file using:

```shell
allennlp predict models/coupled_spanbert_large/model.tar.gz data/test.jsonl \
    --output-file coupled_large_predictions.jsonl \
    --include-package tne \
    --use-dataset-reader \
    --predictor tne_predictor
```
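The output is a jsonl file with one JSON object per line. A minimal sketch for loading it back into Python (the exact prediction schema is determined by the `tne_predictor`, so no specific fields are assumed here):

```python
import json

def read_jsonl(path):
    """Read a .jsonl file into a list of dictionaries, one per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Example (the file name matches the --output-file flag above):
# predictions = read_jsonl("coupled_large_predictions.jsonl")
```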
We release the best model we achieved: coupled-large, which can be downloaded here.
If there's interest in other models from the paper, please let me know via email or open an issue,
and I will upload them as well.
```bibtex
@article{tne,
    author = {Elazar, Yanai and Basmov, Victoria and Goldberg, Yoav and Tsarfaty, Reut},
    title = "{Text-based NP Enrichment}",
    journal = {Transactions of the Association for Computational Linguistics},
    volume = {10},
    pages = {764--784},
    year = {2022},
    month = {07},
    issn = {2307-387X},
    doi = {10.1162/tacl_a_00488},
    url = {https://doi.org/10.1162/tacl\_a\_00488},
    eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00488/2037151/tacl\_a\_00488.pdf},
}
```
To submit your model's predictions to the leaderboard, you need to create an answer file. You can find details on the submission process here, and follow the evaluation code and tests here.
- 04/08/2022: TNE is officially published at TACL
- 12/07/2022: We presented the paper at NAACL22
- 03/05/2022: The TNE dataset is on huggingface's `datasets` library
- 12/04/2022: Released v1.1 of the dataset. Changed the `nps` field from a dictionary of dictionaries to a list of dictionaries, to match huggingface's `datasets` library.
- 27/09/2021: TNE was released: paper + dataset + exploration + demo
I found it easier to use the allennlp framework, but I might consider using the hf infrastructure as well in the future. Feel free to upload the dataset there, or suggest an implementation using the hf codebase.
It happens! Please open an issue and I'll do my best to address it.
I uploaded the best model we trained from the paper. If there's interest, I can upload the others as well. Open an issue or email me.
We decided to keep the labels hidden to avoid overfitting on this dataset. However, once you have a good model, you can upload your predictions to the leaderboard (and the ood leaderboard) and find out your score!