GitHub - olivettigroup/annotated-materials-syntheses

Contents of this data release:

    .
    ├── data/
    ├── ent2count-sorted.txt
    ├── etype2count-sorted.txt
    ├── fex-dev-fnames.txt
    ├── fex-test-fnames.txt
    ├── ner-dev-fnames.txt
    ├── ner-test-fnames.txt
    ├── ner-train-fnames.txt
    ├── README.md
    ├── sentcount2papercount-sorted.txt
    ├── sentlen2sentcount-sorted.txt
    ├── sfex-dev-fnames.txt
    ├── sfex-test-fnames.txt
    ├── sfex-train-fnames.txt
    └── tok2count-sorted.txt

Description of contents:

Data Description:

data/: Raw annotations created with the BRAT annotation tool. Every annotated paper contains a .ann and a .txt file. The .ann file contains, entity and relation annotations. The .txt file contains 3 lines, the doi of the paper, the title of the paper and the synthesis-procedure text of the paper. This directory contains annotations for 230 papers. The annotations are in the BRAT Standoff format.

fex-{dev/test}-fnames.txt: The files in the "data/" subdirectory corresponding to the dev and test splits for an unsupervised frame extraction (fex) task. Since the task is primarily trained on unlabelled data we only release test and development data.

ner-{dev/test/train}-fnames.txt: The files in the "data/" subdirectory corresponding to the dev, test and train files for the named entity extraction (ner) task.

sfex-{dev/test/train}-fnames.txt: The files in the "data/" subdirectory corresponding to the dev, test and train files for the supervised frame extraction task.

Dataset statistics:

ent2count-sorted.txt: The entities labelled in the data vs the number of times these entities were labelled.

etype2count-sorted.txt: The entity types labelled vs the number of times these entity types were labelled.

sentcount2papercount-sorted.txt: The number of sentences that in a given synthesis procedure vs the number of papers with the said number of sentences. Sentence segmentation was performed automatically with the ChemDataExtractor python package.

sentlen2sentcount-sorted.txt: The number of tokens in a sentence vs the number of sentences with the said number of tokens. Sentence tokenization was performed automatically with the ChemDataExtractor tool.

tok2count-sorted.txt: The tokens vs the number of times the tokens occur in the dataset. Sentence tokenization was performed automatically with the ChemDataExtractor tool.

README.md: This file.

License MIT Open Source License

Contact edwardk@mit.edu, smysore@cs.umass.edu

Citation

@inproceedings{mysore-etal-2019-materials,
    title = "The Materials Science Procedural Text Corpus: Annotating Materials Synthesis Procedures with Shallow Semantic Structures",
    author = "Mysore, Sheshera  and
      Jensen, Zachary  and
      Kim, Edward  and
      Huang, Kevin  and
      Chang, Haw-Shiuan  and
      Strubell, Emma  and
      Flanigan, Jeffrey  and
      McCallum, Andrew  and
      Olivetti, Elsa",
    booktitle = "Proceedings of the 13th Linguistic Annotation Workshop",
    month = aug,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W19-4007",
    doi = "10.18653/v1/W19-4007",
    pages = "56--64"
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Releases

Packages

Contributors 2

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
data		data
LICENSE		LICENSE
README.md		README.md
ent2count-sorted.txt		ent2count-sorted.txt
etype2count-sorted.txt		etype2count-sorted.txt
fex-dev-fnames.txt		fex-dev-fnames.txt
fex-test-fnames.txt		fex-test-fnames.txt
ner-dev-fnames.txt		ner-dev-fnames.txt
ner-test-fnames.txt		ner-test-fnames.txt
ner-train-fnames.txt		ner-train-fnames.txt
sentcount2papercount-sorted.txt		sentcount2papercount-sorted.txt
sentlen2sentcount-sorted.txt		sentlen2sentcount-sorted.txt
sfex-dev-fnames.txt		sfex-dev-fnames.txt
sfex-test-fnames.txt		sfex-test-fnames.txt
sfex-train-fnames.txt		sfex-train-fnames.txt
tok2count-sorted.txt		tok2count-sorted.txt

License

olivettigroup/annotated-materials-syntheses

Folders and files

Latest commit

History

Repository files navigation

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Packages