Scalable PDF-to-text extraction workflow. This workflow is designed to run on HPC systems and uses Parsl to submit jobs to the scheduler. The workflow supports multiple PDF parsers, including Marker, Nougat, and Oreo (implemented here). Each parser converts PDFs to JSON lines files containing the extracted text and optional metadata. The instructions and examples below are for running the workflow on the Polaris supercomputer at Argonne National Laboratory, but the workflow can be adapted to other HPC systems by adding an appropriate Parsl configuration.
`pdfwf` supports several PDF parsers, each with separate installation instructions below.

Note: You may need a different virtual environment for each parser.

Once you've installed your desired PDF parser, you can install `pdfwf` into your virtual environment by running:
```shell
git clone git@github.com:ramanathanlab/pdfwf.git
cd pdfwf
pip install --upgrade pip setuptools wheel
pip install -e .
```
Running the workflow requires at least one of these tools to be installed (e.g., `marker`, `nougat`, or `oreo`). See Installation for details.
The `pdfwf` workflow can be run at scale using Parsl via the CLI as follows:

```console
> python -m pdfwf.convert --help
usage: convert.py [-h] --config CONFIG

PDF conversion workflow

optional arguments:
  -h, --help       show this help message and exit
  --config CONFIG  Path to workflow configuration file
```
Nougat, Marker, and Oreo are designed and trained to parse academic papers from any scientific domain. Since there are no canonical metrics for evaluating parser output quality, it is non-trivial to compare accuracy in a meaningful way. Regardless, we conducted a small comparison experiment. Nougat appears to be more accurate, as evidenced by slightly higher transcription quality overall (as measured by BLEU score) and, in particular, its ability to detect rare domain-specific terms. Oreo, on the other hand, is notably faster.

While PDFs of any kind (e.g., blog articles, corporate documents) can be parsed with any of these frameworks, it is unclear how accurate each of them is on such inputs. Since scientific PDFs tend to have complex layouts, parsing should produce reasonable output as long as the layout is roughly comparable to that of a scientific paper.
The computing platform, virtual environment, parser settings, and other options are specified via a YAML configuration file. An example YAML file for the `pdfwf` workflow is provided below. After updating the values for your system, this file can be used to run the workflow via the CLI.
```yaml
# The directory containing the PDFs to convert
pdf_dir: /lus/eagle/projects/argonne_tpc/hippekp/small-pdf-set

# The directory to place the converted PDFs in
out_dir: runs/output-text

# The number of PDFs to parse in each Parsl task
chunk_size: 5

# The settings for the PDF parser
parser_settings:
  # The name of the parser to use
  name: marker

# The compute settings for the workflow
compute_settings:
  # The name of the compute platform to use
  name: polaris
  # The number of compute nodes to use
  num_nodes: 1
  # Make sure to update the path to your conda environment and HF cache
  worker_init: "module use /soft/modulefiles; module load conda/2024-04-29; conda activate marker-wf; export HF_HOME=<path-to-your-HF-cache-dir>"
  # The scheduler options to use when submitting jobs
  scheduler_options: "#PBS -l filesystems=home:eagle:grand"
  # Make sure to change the account to the account you want to charge
  account: <your-account-name-to-charge>
  # The HPC queue to submit to
  queue: debug
  # The amount of time to request for your job
  walltime: 01:00:00
```
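As a sanity check before submitting a long job, the configuration can be loaded and inspected with PyYAML. This is only an illustrative sketch (`pdfwf` performs its own validation when it reads the config); the inline snippet and key names simply mirror the example above.

```python
# Minimal sketch: load a pdfwf-style YAML config and fail fast if a
# required top-level section is missing. Assumes PyYAML is installed.
import yaml

config_text = """
pdf_dir: /path/to/pdfs
out_dir: runs/output-text
chunk_size: 5
parser_settings:
  name: marker
compute_settings:
  name: polaris
  num_nodes: 1
"""

config = yaml.safe_load(config_text)

for key in ("pdf_dir", "out_dir", "parser_settings", "compute_settings"):
    if key not in config:
        raise KeyError(f"missing required config key: {key}")

print(config["parser_settings"]["name"])  # -> marker
```

Running a check like this from a login node is cheap and avoids burning queue time on a malformed config.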
We provide example configurations for each parser in these files:
- marker: `examples/marker/marker_test.yaml`
- nougat: `examples/nougat/nougat_test.yaml`
- oreo: `examples/oreo/oreo_test.yaml`
- pymupdf: `examples/pymupdf/pymupdf_test.yaml`

Note: Please see the comments in the example YAML files for documentation on the settings.
Once you've updated the YAML file with your environment, project information, and file paths, the workflow can be run directly from a login node using the CLI as follows:

```shell
nohup python -m pdfwf.convert --config <your-config.yaml> &> nohup.out &
```
The output of the workflow will be written to the `out_dir` specified in the configuration file. The `<out_dir>/parsed_pdfs` directory will contain the parsed PDFs in JSON lines format. Each line of a JSONL file contains (at least) a `"text"` field with the parsed text for a given PDF and a `"path"` field with the path to the PDF. See the example below:

```jsonl
{"path": "/path/to/1.pdf", "text": "This is the text of the first PDF."}
{"path": "/path/to/2.pdf", "text": "This is the text of the second PDF."}
```

Note: If the parser fails to parse a PDF, the JSONL file will not contain an entry for that PDF.
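The JSONL output can be consumed with a few lines of standard-library Python. A minimal sketch, using the two example lines above as stand-ins for the contents of a real file in `<out_dir>/parsed_pdfs`:

```python
# Parse JSONL output: one JSON object per line, each with at least
# "path" and "text" fields (as described above).
import json

jsonl_lines = [
    '{"path": "/path/to/1.pdf", "text": "This is the text of the first PDF."}',
    '{"path": "/path/to/2.pdf", "text": "This is the text of the second PDF."}',
]

documents = [json.loads(line) for line in jsonl_lines]
paths = [doc["path"] for doc in documents]
print(len(documents), paths[0])  # -> 2 /path/to/1.pdf
```

In practice you would iterate over `Path(out_dir, "parsed_pdfs").glob("*.jsonl")` and read each file line by line.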
See the Monitoring the Workflow section for a description of the other log files generated during the workflow.
Once you've started the workflow, you can monitor its progress by watching the output files in the `out_dir` specified in the configuration file. Below are some useful commands. First, `cd` to the `out_dir`.

To see the number of PDFs that have been parsed:

```shell
cat parsed_pdfs/* | grep '{"path":' | wc -l
```

To watch the stdout and stderr of the tasks:

```shell
tail -f parsl/000/submit_scripts/*
```

To check the Parsl workflow log:

```shell
tail -f parsl/000/parsl.log
```

To see the basic workflow log:

```shell
cat pdfwf.log
```
If you'd like to stop the workflow while it's running, you need to stop the Python process and the Parsl high-throughput executor process, and then `qdel` the batch job. The process IDs can be determined using the `ps` command, and the job ID using `qstat`.
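The steps above might look like the following sketch. The process-name patterns and the `qstat` row are illustrative assumptions; check the actual `ps` and `qstat` output on your system before killing anything.

```shell
# List candidate workflow processes (patterns are typical, not guaranteed):
ps -u "$USER" -o pid,cmd | grep -E 'pdfwf|process_worker_pool' || true
# kill <python-driver-pid> <executor-pid>

# A PBS job ID from qstat looks like "1234567.<server>"; qdel accepts the
# numeric prefix. A stand-in row is used here instead of real qstat output:
sample_row='1234567.polaris-pbs-01  user  debug  pdfwf  R  00:12:34'
job_id=${sample_row%%.*}   # strip everything after the first dot
echo "$job_id"
# qdel "$job_id"
```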
This setup will install the Marker tool in a new environment. On a Polaris compute node, follow these instructions.

Note: the environment setup below assumes conda; see Non-Conda Env for an alternative.
```shell
mkdir wf-validation
cd wf-validation/

# Create a base conda environment
module use /soft/modulefiles; module load conda/2024-04-29
conda create -n marker-wf python=3.10
conda activate marker-wf

# Install torch(vision)
pip3 install torch torchvision

# Install Marker
pip install marker-pdf
pip install PyMuPDF
pip install pypdf
```
On a compute node, run:
```shell
# Create a conda environment with Python 3.10
module use /soft/modulefiles; module load conda/2024-04-29
conda create -n nougat-wf python=3.10
conda activate nougat-wf

# Install Nougat
python3 -m pip install --upgrade pip setuptools wheel chardet
pip install -r requirements/nougat_requirements.txt

# Note: If your system has CUDA 12.1, the Nougat environment installation
# should now be complete. At the time of writing, Polaris uses CUDA 11.8,
# so install the matching torch binary with the command below.
conda install pytorch torchvision pytorch-cuda=11.8 -c pytorch -c nvidia
```
This should conclude the installation of Nougat. Now, get a Polaris compute node to confirm it runs successfully in your environment.
```shell
# Copy a sample PDF file to an arbitrary location, then run Nougat inference:
nougat /path/to/your_file.pdf -o /path/to/output_dir -m 0.1.0-base -b 10
# This command writes a your_file.mmd file into your output directory,
# which is created automatically if it doesn't exist.
```
If Nougat inference ran successfully, proceed to install the `pdfwf` workflow using the instructions in Installation.
On a compute node, run:
```shell
module use /soft/modulefiles; module load conda/2024-04-29
conda create -n pdfwf python=3.10 -y
conda activate pdfwf
mamba install pytorch torchvision pytorch-cuda=11.8 -c pytorch -c nvidia -y
pip install -r requirements/oreo_requirements.txt
git clone git@github.com:ultralytics/yolov5.git
```
Then set the `yolov5_path` option to the path of the cloned `yolov5` repository.
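For example, the parser section of the workflow YAML might then look like the fragment below. This is a sketch: the key names mirror the example configuration above, the path is a placeholder, and the oreo parser takes additional settings documented in `examples/oreo/oreo_test.yaml`.

```yaml
parser_settings:
  # The name of the parser to use
  name: oreo
  # Path to the cloned yolov5 repository
  yolov5_path: /path/to/yolov5
```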
On any node, run:

```shell
module use /soft/modulefiles; module load conda/2024-04-29
conda create -n pymupdf-wf python=3.10 -y
conda activate pymupdf-wf
pip install -r requirements/pymupdf_requirements.txt
```

to create a conda environment that serves the PDF extraction tools `PyMuPDF` and `pypdf`. Both tools are lightweight and run from the same conda environment.
For running smaller jobs without the Parsl workflow, the CLI can be used. The CLI provides separate commands for the `marker`, `nougat`, and `oreo` parsers. It does not submit jobs to the scheduler and is intended for small datasets that can be processed on a single interactive node or workstation.
Usage:

```console
$ [OPTIONS] COMMAND [ARGS]...
```

Options:

- `--install-completion`: Install completion for the current shell.
- `--show-completion`: Show completion for the current shell, to copy it or customize the installation.
- `--help`: Show this message and exit.

Commands:

- `marker`: Parse PDFs using the marker parser.
- `nougat`: Parse PDFs using the nougat parser.
- `oreo`: Parse PDFs using the oreo parser.
Parse PDFs using the marker parser.

Usage:

```console
$ pdfwf marker [OPTIONS]
```

Options:

- `-p, --pdf_path PATH`: The directory containing the PDF files to convert (recursive glob). [required]
- `-o, --output_dir PATH`: The directory to write the output JSON lines file to. [required]
- `-nc, --num_conversions INTEGER`: Number of PDFs to convert (useful for debugging; by default, convert every document). [default: 0]
- `--help`: Show this message and exit.
Parse PDFs using the nougat parser.

Usage:

```console
$ pdfwf nougat [OPTIONS]
```

Options:

- `-p, --pdf_path PATH`: The directory containing the PDF files to convert (recursive glob). [required]
- `-o, --output_dir PATH`: The directory to write the output JSON lines file to. [required]
- `-bs, --batchsize INTEGER`: Number of pages per batch. Maximum 10 for an A100 40GB. [default: 10]
- `-c, --checkpoint PATH`: Path to an existing or new Nougat model checkpoint (to be downloaded). [default: nougat_ckpts/base]
- `-m, --mmd_out PATH`: The directory to write optional mmd outputs along with the JSONL files.
- `-r, --recompute`: Override pre-existing parsed outputs.
- `-f, --full_precision`: Use float32 instead of bfloat16.
- `-md, --markdown`: Output PDF content in markdown-compatible format. [default: True]
- `-s, --skipping`: Skip if the model falls into repetition. [default: True]
- `-n, --nougat_logs_path PATH`: The path to the Nougat-specific logs. [default: pdfwf_nougat_logs]
- `-nc, --num_conversions INTEGER`: Number of PDFs to convert (useful for debugging; by default, convert every document). [default: 0]
- `--help`: Show this message and exit.
Parse PDFs using the oreo parser.

Usage:

```console
$ pdfwf oreo [OPTIONS]
```

Options:

- `-p, --pdf_path PATH`: The directory containing the PDF files to convert (recursive glob). [required]
- `-o, --output_dir PATH`: The directory to write the output JSON lines file to. [required]
- `-d, --detection_weights_path PATH`: Weights for the layout detection model.
- `-t, --text_cls_weights_path PATH`: Model weights for the (meta) text classifier.
- `-s, --spv05_category_file_path PATH`: Path to the SPV05 category file.
- `-d, --detect_only`: File type to be parsed (ignores other files in the input_dir).
- `-m, --meta_only`: Only parse PDFs for metadata.
- `-e, --equation`: Include equations in the text categories.
- `-t, --table`: Include table visualizations (will be stored).
- `-f, --figure`: Include figures (will be stored).
- `-s, --secondary_meta`: Include secondary metadata (footnotes, headers).
- `-a, --accelerate`: If true, accelerate inference by packing non-meta text patches.
- `-b, --batch_yolo INTEGER`: Main batch size for detection / number of images loaded per batch. [default: 128]
- `-v, --batch_vit INTEGER`: Batch size of pre-processed patches for ViT pseudo-OCR inference. [default: 512]
- `-c, --batch_cls INTEGER`: Batch size K for subsequent text processing. [default: 512]
- `-x, --bbox_offset INTEGER`: Number of pixels along which [default: 2]
- `-nc, --num_conversions INTEGER`: Number of PDFs to convert (useful for debugging; by default, convert every document). [default: 0]
- `--help`: Show this message and exit.
For development, it is recommended to use a virtual environment. The following commands will create a virtual environment, install the package in editable mode, and install the pre-commit hooks.

```shell
python3.10 -m venv venv
source venv/bin/activate
pip install -U pip setuptools wheel
pip install -e '.[dev,docs]'
pre-commit install
```
To test the code, run the following commands:

```shell
pre-commit run --all-files
tox -e py310
```
To generate the CLI documentation, run:

```shell
typer pdfwf.cli utils docs --output CLI.md
```
To set up a non-conda environment (see the note in the Marker installation), run:

```shell
module load python/3.11.6--gcc--8.5.0
python -m venv pdfwf-venv
source pdfwf-venv/bin/activate
pip install -U pip setuptools wheel
pip install torch
pip install numpy
pip install -r requirements/nougat_requirements.txt
pip install -r requirements/oreo_requirements.txt
pip install -e .
python -m nltk.downloader words
python -m nltk.downloader punkt
```