Anna Price edited this page Oct 17, 2019 · 8 revisions

NLP-Bio-Tools is a collection of applications for text mining, Natural Language Processing (NLP) and Machine Learning (ML) classification of biomedical PDF documents.

The toolkit consists of three applications (pdf2nlp, mlpipe, loadmodel), which are designed to be run with Docker. The pre-built Docker images can be downloaded from Docker Hub. Docker runs on Windows, Mac and Ubuntu. To run the Docker images you will need to reproduce the directory structure used in this repository.

Workflow

The workflow for NLP-Bio-Tools is shown above. The first stage is to extract and tokenize the text from the input PDFs (pdf2nlp). The workflow then diverges depending on whether you want to train a new ML model (mlpipe) or use a previously saved model to make predictions on new data (loadmodel).

That is, when building a new ML model:

pdf2nlp -> mlpipe

When evaluating a saved ML model on new data:

pdf2nlp -> loadmodel
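
The two paths can be summarised in a few lines of Python. This is only an illustrative sketch of the stage ordering: the stage names (pdf2nlp, mlpipe, loadmodel) come from the toolkit, but the mode names `train` and `predict` and the helper functions are hypothetical and are not part of NLP-Bio-Tools. The sketch does not run any Docker image.

```python
from typing import Dict, List

# Stage order for each path. The keys "train" and "predict" are
# hypothetical labels chosen for this sketch, not toolkit options.
WORKFLOWS: Dict[str, List[str]] = {
    "train": ["pdf2nlp", "mlpipe"],       # build a new ML model
    "predict": ["pdf2nlp", "loadmodel"],  # apply a saved model to new data
}


def plan_workflow(mode: str) -> List[str]:
    """Return the ordered list of applications to run for the given mode."""
    if mode not in WORKFLOWS:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(WORKFLOWS)}")
    return list(WORKFLOWS[mode])


def format_plan(mode: str) -> str:
    """Render the plan in the same arrow notation used on this page."""
    return " -> ".join(plan_workflow(mode))
```

Both paths start with pdf2nlp, so the extracted and tokenized text is produced in the same way whether a model is being trained or applied.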

A further breakdown of the NLP pipeline can be seen below: