Contents of this data release:
.
├── data/
├── ent2count-sorted.txt
├── etype2count-sorted.txt
├── fex-dev-fnames.txt
├── fex-test-fnames.txt
├── ner-dev-fnames.txt
├── ner-test-fnames.txt
├── ner-train-fnames.txt
├── README.md
├── sentcount2papercount-sorted.txt
├── sentlen2sentcount-sorted.txt
├── sfex-dev-fnames.txt
├── sfex-test-fnames.txt
├── sfex-train-fnames.txt
└── tok2count-sorted.txt
Description of contents:
Data Description:
data/
: Raw annotations created with the BRAT annotation tool, in the BRAT standoff
format. Every annotated paper has a .ann and a .txt file. The .ann file contains
the entity and relation annotations. The .txt file contains three lines: the DOI
of the paper, the title of the paper, and the synthesis-procedure text of the
paper. This directory contains annotations for 230 papers.
fex-{dev/test}-fnames.txt
: The files in the "data/" subdirectory that make up
the dev and test splits for an unsupervised frame extraction (fex) task. Since
models for this task are trained primarily on unlabelled data, we release only
development and test data.
ner-{dev/test/train}-fnames.txt
: The files in the "data/" subdirectory
corresponding to the dev, test and train files for the named entity extraction
(ner) task.
sfex-{dev/test/train}-fnames.txt
: The files in the "data/" subdirectory
corresponding to the dev, test and train files for the supervised frame extraction
(sfex) task.
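
The layout described above can be read with a few lines of Python. The following is a minimal sketch, not an official loader. It assumes that each `*-fnames.txt` file lists one bare file name per line (with or without a `.ann`/`.txt` extension), which this README does not state, and it only handles the text-bound (`T`) and relation (`R`) lines of the BRAT standoff format. The helper names (`parse_split_names`, `split_txt`, `parse_ann`, `load_paper`) are hypothetical and are not part of this release.

```python
from pathlib import Path
from typing import Dict, List, NamedTuple, Tuple


class Entity(NamedTuple):
    id: str
    type: str
    spans: List[Tuple[int, ...]]  # (start, end) character offsets into the .txt file
    text: str


class Relation(NamedTuple):
    id: str
    type: str
    args: Dict[str, str]  # role name -> entity id


def parse_split_names(split_text):
    """Return the paper names listed in a *-fnames.txt file.

    ASSUMPTION: one bare file name per line; a trailing .ann/.txt is dropped.
    """
    names = []
    for line in split_text.splitlines():
        name = line.strip()
        if not name:
            continue
        for ext in (".ann", ".txt"):
            if name.endswith(ext):
                name = name[: -len(ext)]
        names.append(name)
    return names


def split_txt(raw):
    """Split a .txt file into (doi, title, synthesis text), one per line."""
    doi, title, body = raw.split("\n", 2)
    return doi, title, body


def parse_ann(ann_text):
    """Parse text-bound (T) and relation (R) lines of BRAT standoff text."""
    entities, relations = [], []
    for line in ann_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T"):
            # "T1<TAB>Type start end[;start end ...]<TAB>covered text"
            etype, offsets = fields[1].split(" ", 1)
            spans = [tuple(int(x) for x in frag.split()) for frag in offsets.split(";")]
            text = fields[2] if len(fields) > 2 else ""
            entities.append(Entity(ann_id, etype, spans, text))
        elif ann_id.startswith("R"):
            # "R1<TAB>Type Arg1:T1 Arg2:T2"
            rtype, *args = fields[1].split()
            relations.append(Relation(ann_id, rtype, dict(a.split(":", 1) for a in args)))
        # Any other annotation kinds (events, attributes, notes) are skipped.
    return entities, relations


def load_paper(data_dir, name):
    """Read <name>.txt and <name>.ann from data_dir and return the parsed pieces."""
    base = Path(data_dir)
    raw = (base / (name + ".txt")).read_text(encoding="utf-8")
    ann = (base / (name + ".ann")).read_text(encoding="utf-8")
    doi, title, body = split_txt(raw)
    entities, relations = parse_ann(ann)
    return {"doi": doi, "title": title, "text": body, "raw": raw,
            "entities": entities, "relations": relations}
```

Under these assumptions, something like `[load_paper("data", n) for n in parse_split_names(Path("ner-train-fnames.txt").read_text())]` would load the NER training split. BRAT offsets index into the whole .txt file, so the unsplit `raw` string is kept alongside the three fields.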
Dataset statistics:
ent2count-sorted.txt
: The entities labelled in the data vs the number of
times these entities were labelled.
etype2count-sorted.txt
: The entity types labelled vs the number of
times these entity types were labelled.
sentcount2papercount-sorted.txt
: The number of sentences in a given
synthesis procedure vs the number of papers with that number of sentences.
Sentence segmentation was performed automatically with the ChemDataExtractor
Python package.
sentlen2sentcount-sorted.txt
: The number of tokens in a sentence vs the number
of sentences with the said number of tokens. Sentence tokenization was performed
automatically with the ChemDataExtractor tool.
tok2count-sorted.txt
: The tokens vs the number of times the tokens occur in
the dataset. Sentence tokenization was performed automatically with the
ChemDataExtractor tool.
README.md
: This file.
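
The `*-sorted.txt` statistics files above pair a key with a count. As a rough sketch of how they might be read, the snippet below assumes each line holds the key first and an integer count last, separated by whitespace. That line format is an assumption and is not documented in this README, so check a few lines of the actual files first. The function names `parse_counts` and `weighted_mean` are hypothetical.

```python
def parse_counts(text):
    """Return (key, count) pairs from lines shaped like '<key><whitespace><count>'.

    ASSUMPTION: the count is the last whitespace-separated field on each line.
    Splitting from the right keeps multi-word keys (e.g. entity mentions) intact.
    """
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        key, count = line.rsplit(None, 1)
        pairs.append((key, int(count)))
    return pairs


def weighted_mean(pairs):
    """Mean of numeric keys weighted by their counts.

    For sentlen2sentcount-sorted.txt this would be the mean sentence length
    in tokens, if the file follows the assumed format.
    """
    total = sum(count for _, count in pairs)
    return sum(int(key) * count for key, count in pairs) / total
```

A line with a single field, or with the count placed first, makes `parse_counts` raise a `ValueError`. That failure is a useful signal that the assumed format does not match the file.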
License: MIT Open Source License
Contact: edwardk@mit.edu, smysore@cs.umass.edu
Citation
@inproceedings{mysore-etal-2019-materials,
title = "The Materials Science Procedural Text Corpus: Annotating Materials Synthesis Procedures with Shallow Semantic Structures",
author = "Mysore, Sheshera and
Jensen, Zachary and
Kim, Edward and
Huang, Kevin and
Chang, Haw-Shiuan and
Strubell, Emma and
Flanigan, Jeffrey and
McCallum, Andrew and
Olivetti, Elsa",
booktitle = "Proceedings of the 13th Linguistic Annotation Workshop",
month = aug,
year = "2019",
address = "Florence, Italy",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/W19-4007",
doi = "10.18653/v1/W19-4007",
pages = "56--64"
}