TSE-NER

_This repository contains the theory and backbone of TSE-NER. For the code of the website, please visit https://github.com/mvallet91/SmartPub/ _

This work is part of the following research:

TSE-NER: An Iterative Approach for Long-Tail Entity Extraction in Scientific Publications (2018)
SmartPub: A Platform for Long-Tail Entity Extraction from Scientific Publications (2018)
Coner: A Collaborative Approach for Long-Tail Named Entity Recognition in Scientific Publications (2019)

The main goal of TSE-NER is to generate training data for long-tail entities and train a NER tagger, label such entities in text, and use them for document search and exploration.

Please refer to the paper TSE-NER: An Iterative Approach for Long-Tail Entity Extraction in Scientific Publications (2018) for more information.

This project can be approached in two main ways: as a developer or as a user.

For developers, we try to provide all the code required. However, we rely on external resources such as Gensim, Stanford NLP, and GROBID, so setup may require some effort.
For users, this first implementation is only available as a basic search engine on our website, where arXiv articles can be searched and their main Dataset and Method entities are displayed and can be used for exploration.

Following the main goal described above, SmartPub-TSENER is divided into 3 main modules: (1) generate training data for long-tail entities and train a NER tagger, (2) label entities in text (documents), and (3) use the named entities in documents for search and exploration.

Therefore, this repository provides the code used for each of the 3 main modules, as well as our approach to data collection and preparation:

Data Collection, Extraction and Preparation

Our corpus consists of scientific publications, mainly from Computer Science, but also from the Biomedical domain (PubMed Central) and master's theses from TU Delft. The collection and extraction steps are source-dependent, for example:

Computer Science related topics of arXiv: We chose arXiv because its content is openly available and can be collected with a very friendly crawler (Knoth, P. and Zdrahal, Z. (2012) CORE: Three Access Levels to Underpin Open Access, https://github.com/ronentk/sci-paper-miner). We select over 40 CS-related topics, from Mathematical Software to Information Retrieval. In addition, the content from arXiv is readily available in XML format, so there is no need to use GROBID for text extraction.
PubMed Central (PMC): We take advantage of the Open Access Subset of publications, available using OAI-PMH and FTP. These publications have metadata and full text in XML format, and we use the Pubmed Parser (Titipat Achakulvisut and Daniel E. Acuna (2015) "Pubmed Parser", http://github.com/titipata/pubmed_parser) to extract and store the information in MongoDB.
TU Delft Master's Theses: The collection is similar to PMC. We use OAI-PMH to get the metadata and download links for the PDFs of students' master's theses (with permission from the library, of course!). However, the actual content has to be extracted using GROBID (GROBID (2008-2017), https://github.com/kermitt2/grobid), which does not always guarantee the best performance, since the theses come from different faculties and follow a wide variety of formats.
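
As an illustration of the OAI-PMH harvesting mentioned for PMC and the TU Delft theses, the following is a minimal sketch and not the code used in this repository. It builds a ListRecords request URL and reads record identifiers and the resumption token from one response page. The function names, the base URL, and the sample response are assumptions made for the example.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    if resumption_token:
        # When resuming, the token is sent on its own, without metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, next resumption token or None) for one page."""
    root = ET.fromstring(xml_text)
    ids = [el.text for el in root.iterfind(".//oai:header/oai:identifier", OAI_NS)]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token


# Hand-written sample response, for illustration only.
SAMPLE_PAGE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""
```

A harvester would fetch `list_records_url(...)`, call `parse_page` on the response, and repeat with the returned token until it is `None`.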

See the notebook Pipeline Preparation for more information and a step-by-step example of our workflow.

The important part is that we need the full text of each article in a database (we use MongoDB), so we can index all the content in Elasticsearch (for easy queries). This allows for the quick communication required for the processing in the modules.
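
The step from MongoDB to Elasticsearch can be pictured with the sketch below. It only converts stored documents into bulk-index actions; the field names (`title`, `content`), the index name, and the helper name are assumptions for illustration, not the schema used in this repository.

```python
def to_bulk_actions(documents, index_name="publications"):
    """Yield one Elasticsearch bulk-index action per stored document.

    `documents` is any iterable of dicts, such as the documents read from a
    MongoDB collection. Field and index names here are hypothetical.
    """
    for doc in documents:
        yield {
            "_index": index_name,
            # MongoDB ids are not plain strings, so convert them explicitly.
            "_id": str(doc["_id"]),
            "_source": {
                "title": doc.get("title", ""),
                "content": doc.get("content", ""),
            },
        }
```

With the `pymongo` and `elasticsearch` packages installed, the generator could be passed to `elasticsearch.helpers.bulk(client, to_bulk_actions(collection.find()))`, where `client` and `collection` are a connected Elasticsearch client and a MongoDB collection.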

In addition, we need to prepare data and train word2vec and doc2vec models used in the expansion and filtering steps.
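
As a rough idea of what preparing data for the embedding models can involve, the sketch below turns raw text into one token list per sentence. The splitting and tokenization rules are simplifying assumptions for this example, not the preprocessing used in this repository.

```python
import re

# Naive sentence boundary: whitespace that follows ., ! or ?
_SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")
# Alphanumeric tokens, keeping inner hyphens/underscores (e.g. "trec-8").
_TOKEN = re.compile(r"[A-Za-z0-9]+(?:[-_][A-Za-z0-9]+)*")


def sentences_to_tokens(text):
    """Split text into lowercased token lists, one list per sentence."""
    sentences = []
    for sentence in _SENTENCE_SPLIT.split(text.strip()):
        tokens = [t.lower() for t in _TOKEN.findall(sentence)]
        if tokens:  # skip empty or punctuation-only fragments
            sentences.append(tokens)
    return sentences
```

A list of token lists is the input shape that Gensim's `Word2Vec` accepts through its `sentences` argument; `Doc2Vec` additionally expects each document wrapped in a `TaggedDocument`.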

Module 1: NER Training

This first module provides the environment for anyone interested in training a NER model (Stanford NER) to label long-tail entities.
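
To make the idea of generating training data concrete, here is a minimal sketch that labels tokens matching a set of seed terms and writes them in the tab-separated token/label layout commonly used for Stanford NER training files. It deliberately omits the expansion and filtering steps, and the function names, the `DATASET` label, and the seed terms are assumptions for the example.

```python
def label_sentence(tokens, seed_terms, label="DATASET"):
    """Label tokens that belong to a seed term; all other tokens get 'O'."""
    labels = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    for term in seed_terms:
        term_tokens = term.lower().split()
        n = len(term_tokens)
        if n == 0:
            continue
        # Slide over the sentence and mark every case-insensitive match.
        for i in range(len(tokens) - n + 1):
            if lowered[i:i + n] == term_tokens:
                labels[i:i + n] = [label] * n
    return list(zip(tokens, labels))


def to_tsv(labelled_sentences):
    """Serialize sentences as 'token<TAB>label' lines, blank line between sentences."""
    blocks = [
        "\n".join(f"{token}\t{label}" for token, label in sentence)
        for sentence in labelled_sentences
    ]
    return "\n\n".join(blocks) + "\n"
```

For example, `label_sentence(["We", "evaluate", "on", "Gov2", "."], ["gov2"])` marks only `Gov2` as `DATASET`, and `to_tsv` produces the text that would be written to a training file.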

Module 2: NER Labelling

Once a model is trained, it can be used to label certain types of long-tail entities in text. Given a selected model and a piece of input text, the system returns a list of the entities found.
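
The last step of labelling, turning per-token tagger output into a list of entities, can be sketched as follows. The input is assumed to be a list of `(token, label)` pairs with `O` marking tokens outside any entity; the function name and labels are hypothetical.

```python
def extract_entities(tagged_tokens, outside_label="O"):
    """Merge consecutive tokens that share a non-outside label into entity mentions."""
    entities = []
    current_tokens, current_label = [], None
    for token, label in tagged_tokens:
        if label == current_label and label != outside_label:
            # Same entity continues.
            current_tokens.append(token)
            continue
        if current_tokens:
            # Label changed: close the entity collected so far.
            entities.append((" ".join(current_tokens), current_label))
        if label != outside_label:
            current_tokens, current_label = [token], label
        else:
            current_tokens, current_label = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities
```

Because the tags carry no begin/inside markers in this sketch, two adjacent entities with the same label are merged into one mention.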

For Modules 1 and 2, see the notebook Pipeline TSENER for more information and a step-by-step example of our workflow.

Module 3: NER Search and Navigation System

This is a basic approach to an interface for a collection of documents. It can be simply a metadata repository with links to the actual content, allowing for richer navigation than current systems.
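
One simple way to picture entity-based navigation is an in-memory index from entities to the documents that mention them. This is a swapped-in illustration, not the search system described here; the data layout and function names are assumptions.

```python
from collections import defaultdict


def build_entity_index(doc_entities):
    """Map each (label, entity) pair to the sorted ids of documents mentioning it."""
    index = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for text, label in entities:
            index[(label, text.lower())].add(doc_id)
    return {key: sorted(ids) for key, ids in index.items()}


def related_documents(index, doc_entities, doc_id):
    """Return other documents sharing entities with doc_id, most shared first."""
    # Deduplicate so that a repeated mention is counted only once.
    keys = {(label, text.lower()) for text, label in doc_entities[doc_id]}
    counts = defaultdict(int)
    for key in keys:
        for other in index.get(key, []):
            if other != doc_id:
                counts[other] += 1
    return sorted(counts, key=lambda d: (-counts[d], d))
```

Here `doc_entities` maps a document id to the `(entity, label)` pairs found in it, so a reader viewing one document can jump to others that use the same datasets or methods.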

