GitHub - StonyBrookNLP/SuMe: [LREC2022] SuMe: A Dataset towards Summarizing biomedical Mechanisms

What is SuMe?

SuMe is a dataset towards Summarizing Biomedical Mechanisms.

This is the official page for the paper SuMe: A Dataset Towards Summarizing Biomedical Mechanisms, accepted at LREC2022.

SuMe is the first dataset towards summarizing biomedical mechanisms and the underlying relations between entities. The dataset contains 22K mechanism summarization instances collected semi-automatically and an evaluation partition of 125 instances that were corrected by domain experts. In addition, it contains a larger set of 611K abstracts for conclusion generation, which we use as a pretraining task for mechanism generation models.

Example

The following example shows an entry in the SuMe dataset. Some supporting text was removed to save space. The input is the set of supporting sentences together with the two main entities. The output is the relation type and a sentence concluding the mechanism underlying the relationship.
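To make the input/output split concrete, here is a minimal Python sketch of how one entry could be represented. The class name, field names, and the serialization format are assumptions made for illustration and are not taken from the released files; the values are placeholders, not real data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SuMeEntry:
    """One mechanism summarization instance (hypothetical field names)."""

    supporting_sentences: List[str]  # input: supporting text from the abstract
    regulator: str                   # input: first main entity
    regulated: str                   # input: second main entity
    relation: str                    # output: relation type
    mechanism_sentence: str          # output: sentence concluding the mechanism

    def model_input(self) -> str:
        # Assumed serialization: the two entities first, then the supporting text.
        entities = f"regulator: {self.regulator} | regulated: {self.regulated}"
        return entities + " | " + " ".join(self.supporting_sentences)

    def model_target(self) -> str:
        # Assumed serialization: relation type followed by the mechanism sentence.
        return f"{self.relation} | {self.mechanism_sentence}"


# Placeholder values only; replace them with fields read from the dataset.
entry = SuMeEntry(
    supporting_sentences=["<supporting sentence 1>", "<supporting sentence 2>"],
    regulator="<entity 1>",
    regulated="<entity 2>",
    relation="<relation type>",
    mechanism_sentence="<mechanism sentence>",
)
print(entry.model_input())
print(entry.model_target())
```

Keeping the input and target as separate methods mirrors the description above: a generation model would read the entities and supporting sentences and produce the relation type and mechanism sentence.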

Dataset Statistics

We construct SuMe using biomedical abstracts from the PubMed open access subset. Starting from 1.1M scientific papers, we applied the following sequence of bootstrapping steps to prepare the SuMe dataset.

Finding Conclusion Sentences
Extracting Main Entities & Relation. We run a biomedical relation extractor, REACH, which can identify entities and the relations between them.
Filtering for Mechanism Sentences. We separate out the abstracts whose conclusion sentences are predicted to be non-mechanism conclusions. These serve as additional related data that can be used for pretraining the generation models we eventually train for the mechanism summarization task.

Dataset Statistics: Each dataset contains a number of unique abstracts, a supporting set, a mechanism sentence, and a pair of entities. The first entity is called the regulator entity (regulator) and the second one is called the regulated entity (regulated).
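The filtering step above amounts to partitioning abstracts by a predicted label on the conclusion sentence. The following Python sketch illustrates that partition only. The function names, the `conclusion` key, and the toy keyword rule are hypothetical stand-ins; the classifier actually used to build SuMe is not reproduced here.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, str]


def split_by_mechanism(
    records: Iterable[Record],
    is_mechanism: Callable[[str], bool],
) -> Tuple[List[Record], List[Record]]:
    """Split records into (mechanism set, pretraining set)."""
    mechanism: List[Record] = []
    pretraining: List[Record] = []
    for record in records:
        # Route each abstract by the prediction on its conclusion sentence.
        if is_mechanism(record["conclusion"]):
            mechanism.append(record)
        else:
            pretraining.append(record)
    return mechanism, pretraining


def toy_classifier(sentence: str) -> bool:
    # Placeholder rule for demonstration only; not the classifier used for SuMe.
    return "mechanism" in sentence.lower()


# Placeholder records; real records would come from the processed abstracts.
records = [
    {"id": "a", "conclusion": "<conclusion describing a mechanism>"},
    {"id": "b", "conclusion": "<conclusion without one>"},
]
mechanism_set, pretraining_set = split_by_mechanism(records, toy_classifier)
print(len(mechanism_set), len(pretraining_set))
```

Passing the classifier as a callable keeps the partition logic independent of whichever model makes the prediction.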

Download the data

The dataset contains four different subsets.

The training set contains about 21K abstracts. You can download the training set from here.

The validation set, with about 1K abstracts, is used to tune the hyperparameters and can be found here.

The test sets are accessible via this link here.

The best model, which is pretrained on the pretraining data and then fine-tuned on the training set, is accessible here.

License

The dataset was collected using the open-source NIH active directory for PMC papers. We generally follow their license, as mentioned here.

Liked us? Cite us!

Please use the following bibtex entry:

@inproceedings{bastan-etal-2022-sume,
    title = "{S}u{M}e: A Dataset Towards Summarizing Biomedical Mechanisms",
    author = "Bastan, Mohaddeseh  and
      Shankar, Nishant  and
      Surdeanu, Mihai  and
      Balasubramanian, Niranjan",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.748",
    pages = "6922--6931",
    
}
