marker: Convert PDF to markdown quickly with high accuracy #134
Labels: Automation, man-pages, markdown
Marker
Marker converts PDF, EPUB, and MOBI to markdown. It's 10x faster than nougat, more accurate on most documents, and has low hallucination risk.
See `settings.py` for a language list.
How it works
Marker is a pipeline of deep learning models:

- Extract text, and OCR if necessary (tesseract)
- Detect page layout (layoutlmv3)
- Clean and format each block, passing equation blocks through a nougat forward pass
- Combine blocks and postprocess the complete text
Relying on autoregressive forward passes to generate text is slow and prone to hallucination/repetition. From the nougat paper:
We observed [repetition] in 1.5% of pages in the test set, but the frequency increases for out-of-domain documents.
In my anecdotal testing, repetitions happen on 5%+ of out-of-domain (non-arXiv) pages.

Nougat is an amazing model, but I wanted a faster and more general purpose solution. Marker is 10x faster and has low hallucination risk because it only passes equation blocks through an LLM forward pass.
Examples
Performance
The above results are with marker and nougat set up so they each take ~3GB of VRAM on an A6000.
See below for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.
Limitations
PDF is a tricky format, so marker will not always work perfectly; some known limitations are on the roadmap to address.
Installation
This has been tested on Mac and Linux (Ubuntu and Debian). You'll need python 3.9+ and poetry.
First, clone the repo:

```shell
git clone https://github.com/VikParuchuri/marker.git
cd marker
```
Linux

- Install system requirements:
  - Install tesseract by running `scripts/install/tesseract_5_install.sh`.
  - Install ghostscript by running `scripts/install/ghostscript_install.sh`.
  - Install other requirements with `cat scripts/install/apt-requirements.txt | xargs sudo apt-get install -y`.
- Find the tesseract data folder `tessdata` with `find / -name tessdata`. Make sure to use the one corresponding to the latest tesseract version if you have multiple.
- Create a `local.env` file in the root `marker` folder with `TESSDATA_PREFIX=/path/to/tessdata` inside it.
- Run `poetry install`, then `poetry shell` to activate your poetry venv.
- Run `pip install torch` to install other torch dependencies.

Mac
- Install system requirements from `scripts/install/brew-requirements.txt`.
- Find the tesseract data folder `tessdata` with `brew list tesseract`.
- Create a `local.env` file in the root `marker` folder with `TESSDATA_PREFIX=/path/to/tessdata` inside it.
- Run `poetry install`, then `poetry shell` to activate your poetry venv.

Usage
First, some configuration:

- Set your torch device in the `local.env` file. For example, `TORCH_DEVICE=cuda` or `TORCH_DEVICE=mps`. `cpu` is the default.
- If using GPU, set `INFERENCE_RAM` to your GPU VRAM (per GPU). For example, if you have 16 GB of VRAM, set `INFERENCE_RAM=16`.
- You can set `VRAM_PER_TASK` to adjust per-task VRAM usage if you notice tasks failing with GPU out of memory errors.
- Inspect the other settings in `marker/settings.py`. You can override any settings in the `local.env` file, or by setting environment variables.
- The final editor model can be toggled with `ENABLE_EDITOR_MODEL`.
- You can change the OCR engine with the `OCR_ENGINE` setting.
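For example, a `local.env` for a single 16 GB GPU might look like this (the values are illustrative, and `TESSDATA_PREFIX` should point at the folder you found during installation):

```
TORCH_DEVICE=cuda
INFERENCE_RAM=16
TESSDATA_PREFIX=/path/to/tessdata
```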
Convert a single file
Run `convert_single.py`, like the example sketched after the flag descriptions below:

- `--parallel_factor` is how much to increase batch size and parallel OCR workers by. Higher numbers will take more VRAM and CPU, but process faster. Set to 1 by default.
- `--max_pages` is the maximum number of pages to process. Omit this to convert the entire document.

Make sure the `DEFAULT_LANG` setting is set appropriately for your document.
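A minimal sketch of an invocation; the input and output paths are placeholders, and the flag values are illustrative:

```shell
python convert_single.py /path/to/file.pdf /path/to/output.md --parallel_factor 2 --max_pages 10
```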
Convert multiple files
Run `convert.py`, like the example sketched after the flag descriptions below:

- `--workers` is the number of pdfs to convert at once. This is set to 1 by default, but you can increase it to increase throughput, at the cost of more CPU/GPU usage. Parallelism will not increase beyond `INFERENCE_RAM / VRAM_PER_TASK` if you're using GPU.
- `--max` is the maximum number of pdfs to convert. Omit this to convert all pdfs in the folder.
- `--metadata_file` is an optional path to a json file with metadata about the pdfs. If you provide it, it will be used to set the language for each pdf. If not, `DEFAULT_LANG` will be used. The format is sketched below.
- `--min_length` is the minimum number of characters that need to be extracted from a pdf before it will be considered for processing. If you're processing a lot of pdfs, I recommend setting this to avoid OCRing pdfs that are mostly images, which slows everything down.
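A minimal sketch of an invocation; the folder paths are placeholders and the flag values are illustrative:

```shell
python convert.py /path/to/input/folder /path/to/output/folder --workers 4 --max 100 --metadata_file /path/to/metadata.json --min_length 10000
```

And a sketch of the metadata file, assuming a per-file `language` key (the filenames and languages are illustrative):

```json
{
  "pdf1.pdf": {"language": "English"},
  "pdf2.pdf": {"language": "Spanish"}
}
```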
Convert multiple files on multiple GPUs
Run `chunk_convert.sh`, like the example sketched after the variable descriptions below:

- `METADATA_FILE` is an optional path to a json file with metadata about the pdfs. See above for the format.
- `NUM_DEVICES` is the number of GPUs to use. Should be `2` or greater.
- `NUM_WORKERS` is the number of parallel processes to run on each GPU. Per-GPU parallelism will not increase beyond `INFERENCE_RAM / VRAM_PER_TASK`.
- `MIN_LENGTH` is the minimum number of characters that need to be extracted from a pdf before it will be considered for processing. If you're processing a lot of pdfs, I recommend setting this to avoid OCRing pdfs that are mostly images, which slows everything down.
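A minimal sketch of an invocation, assuming the script takes input and output folders as positional arguments (the paths and values are illustrative):

```shell
METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 bash chunk_convert.sh ../pdf_in ../md_out
```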
Benchmarks
Benchmarking PDF extraction quality is hard. I've created a test set by finding books and scientific papers that have a pdf version and a latex source. I convert the latex to text, and compare the reference to the output of text extraction methods.
Benchmarks show that marker is 10x faster than nougat, and more accurate outside arXiv (nougat was trained on arXiv data). We show naive text extraction (pulling text out of the pdf with no processing) for comparison.
Speed
Accuracy
First 3 are non-arXiv books, last 3 are arXiv papers.
Peak GPU memory usage during the benchmark is 3.3GB for nougat, and 3.1GB for marker. Benchmarks were run on an A6000.

Throughput
Marker takes about 2GB of VRAM on average per task, so you can convert 24 documents in parallel on an A6000.
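As a quick sketch of that arithmetic, mirroring the `INFERENCE_RAM / VRAM_PER_TASK` cap described above (the values are illustrative):

```python
# Parallelism is capped at INFERENCE_RAM / VRAM_PER_TASK (see the flag
# descriptions above); an A6000 has 48 GB of VRAM.
inference_ram = 48  # total GPU VRAM in GB
vram_per_task = 2   # average VRAM used per conversion task in GB

max_parallel = inference_ram // vram_per_task
print(max_parallel)  # 24 documents in parallel
```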
Running your own benchmarks
You can benchmark the performance of marker on your machine. First, download the benchmark data here and unzip.
Then run `benchmark.py`, like the sketch below. This will benchmark marker against other text extraction methods. It sets up batch sizes for nougat and marker to use a similar amount of GPU RAM for each.
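A minimal sketch of an invocation, assuming the benchmark data has been unzipped into `data/` and that the script takes pdf, reference, and report paths (the paths are placeholders):

```shell
python benchmark.py data/pdfs data/references report.json --nougat
```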
Omit `--nougat` to exclude nougat from the benchmark. I don't recommend running nougat on CPU, since it is very slow.

Commercial usage
Due to the licensing of the underlying models like layoutlmv3 and nougat, this is only suitable for noncommercial usage.
I'm building a version that can be used commercially, by stripping out the dependencies below. If you would like to get early access, email me at marker@vikas.sh.
The non-commercial/restrictive dependencies are the models mentioned above, such as layoutlmv3 and nougat.
Other dependencies/datasets are openly licensed (doclaynet, byt5), or used in a way that is compatible with commercial usage (ghostscript).
Thanks
This work would not have been possible without amazing open source models and datasets, including (but not limited to) nougat, layoutlmv3, doclaynet, and byt5.
Thank you to the authors of these models and datasets for making them available to the community!