This repository contains the scripts and outputs from our OCR comparison tests.
We identified a few sample documents to run through OCR systems so we could compare the results. The documents we used in our final write-up are these:
- A receipt -- This receipt from the Riker's commissary was included in States of Incarceration, a collaborative storytelling project and traveling exhibition about incarceration in America.
- A heavily redacted document -- Carter Page's FISA warrant is a legal filing with a lot of redacted portions, just the kind of exasperating thing reporters deal with all the time.
- Something historical -- Executive Order 9066 authorized the internment of Japanese Americans in 1942. The scanned image available in the National Archives is fairly high quality, but it is still an old, typewritten document.
- A form -- This Texas campaign finance report, from a Texas Tribune story about abuses in the juvenile justice system, has very clean text, but the formatting is important to understanding the document.
- Something wrinkled -- In early 2014 a group of divers retrieved hundreds of pages of documents from a lake at Ukrainian President Viktor Yanukovych's vast country estate. The former president or his staff had dumped the records there in the hopes of destroying them, but many pages were still at least somewhat legible. Reporters laid them out to dry and began the process of transcribing the waterlogged papers. We selected a page that is more or less readable to the human eye but definitely warped with water damage.
We also tested the OCR engines against a handful of alternate documents. We've preserved two of those documents here so that you can look them over, too. Both are relatively easy to read and all the OCR engines we tested handled them well.
The first, cepr_oversight_order, is an order giving the Puerto Rico Energy Commission oversight powers over the Puerto Rico Electric Power Authority, after the latter's highly unusual $300 million contract with Whitefish Energy came under scrutiny.
The second, whitefish_energy_vs_commonwealth_puerto_rico, is the full text of a legal filing in the protracted fight over who is responsible for delays in rebuilding Puerto Rico's electric grid. These two articles are a great place to get more context:
- Puerto Rico moves to cancel contract with Whitefish Energy to repair electric grid, The Washington Post, Oct 29, 2017; and
- Puerto Rico Grid Contractor Dispute Devolves Into Litigation, The Wall Street Journal, Nov 22, 2017
The /lib/ directory includes the scripts that we used to test each OCR client. Each tool requires some setup, but once you've got a tool installed, you can invoke it with:
ruby ./lib/ocr.rb {command}
For example, once you have installed Tesseract, this command will use Tesseract to OCR all the images in the "documents" directory:
ruby ./lib/ocr.rb tesseract documents
Once you have set up Google Cloud services and stored your credentials, this command will use Google Cloud Vision to OCR a single image:
ruby ./lib/ocr.rb google google_cloud_vision/credentials.json documents/historical-executive_order_9066-japanese_internment.jpg
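The example above passes a single image to the google command. If you want to run it over every image in a directory, one option is a small wrapper like the sketch below. The helper names (image_files, google_commands) and the list of image extensions are assumptions made for illustration and are not part of this repository. The sketch only builds the command lines; running them is left to you.

```ruby
require "shellwords"

# Assumed list of image extensions; adjust to match your documents.
IMAGE_EXTENSIONS = %w[.jpg .jpeg .png .tif .tiff].freeze

# Hypothetical helper: list the image files directly inside dir, sorted by name.
def image_files(dir)
  Dir.children(dir)
     .select { |name| IMAGE_EXTENSIONS.include?(File.extname(name).downcase) }
     .sort
     .map { |name| File.join(dir, name) }
end

# Hypothetical helper: one ocr.rb invocation per image, with the arguments
# in the same order as the Google Cloud Vision example in this README.
def google_commands(dir, credentials)
  image_files(dir).map do |path|
    ["ruby", "./lib/ocr.rb", "google", credentials, path]
  end
end

# Example use (prints the commands instead of running them):
#   google_commands("documents", "google_cloud_vision/credentials.json").each do |cmd|
#     puts Shellwords.join(cmd)   # or: system(*cmd)
#   end
```

Building each command as an array and passing it to system(*cmd) avoids shell quoting problems with file names that contain spaces.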
The scripts in this repository depend on a few Ruby gems. Install them with:
- Install Bundler first:
gem install bundler
- Then install gems in the Gemfile:
bundle install
This script uses mutool, a PDF processing tool included in mupdf, to convert multi-page PDFs into images. Install with:
- Mac/Homebrew
brew install mupdf
- Ubuntu
apt install mupdf-tools
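To check that mutool works before running the OCR scripts, you can try rendering a PDF by hand. The sketch below builds a "mutool draw" command that writes one PNG per page. The helper name mutool_draw_command, the 300 dpi default, and the output naming pattern are assumptions for illustration; the scripts in /lib/ may call mutool with different options.

```ruby
# Hypothetical helper: build (but do not run) a mutool command that renders
# every page of pdf_path to a PNG in out_dir. mutool replaces %d in the
# output name with the page number.
def mutool_draw_command(pdf_path, out_dir, dpi: 300)
  base = File.basename(pdf_path, ".*")
  pattern = File.join(out_dir, "#{base}-%d.png")
  ["mutool", "draw", "-r", dpi.to_s, "-o", pattern, pdf_path]
end

# Example use (out_dir must already exist):
#   system(*mutool_draw_command("filing.pdf", "pages"))
```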
Each of the cloud services we tested requires you to authenticate your account. Our scripts look for those credentials in the credentials.json file in each service's directory.
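A missing or malformed credentials.json is an easy mistake to make, so it can help to check the file before starting a long run. The following is a minimal sketch, not the code the repository's scripts use: the helper name load_credentials is hypothetical, and it makes no assumption about which keys each service expects.

```ruby
require "json"

# Hypothetical helper: read credentials.json from a service directory
# (for example "google_cloud_vision" or "azure") and return the parsed hash.
# Raises ArgumentError with a hint if the file has not been created yet,
# and JSON::ParserError if the file is not valid JSON.
def load_credentials(service_dir)
  path = File.join(service_dir, "credentials.json")
  unless File.file?(path)
    sample = File.join(service_dir, "credentials.sample.json")
    raise ArgumentError, "Missing #{path}; copy #{sample} and fill in your own values"
  end
  JSON.parse(File.read(path))
end

# Example use:
#   creds = load_credentials("google_cloud_vision")
```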
The Ruby Gems that Google Cloud Vision requires are included in the bundle install for this repository.
Google Cloud Vision requires authentication credentials. Use the example in google_cloud_vision/credentials.sample.json to create your own credentials.json and make sure to point to it when you invoke ./lib/ocr.rb, e.g.:
ruby ./lib/ocr.rb google google_cloud_vision/credentials.json documents/document.jpg
The Ruby Gems that Microsoft Azure requires are included in the bundle install for this repository.
Use the example in azure/credentials.sample.json to create your own credentials.json and make sure to point to it when you invoke ./lib/ocr.rb.
Abbyy provides a Python script, which is what we used to test documents in Abbyy. You can pass your application ID and password to the script as environment variables when you run it:
ABBYY_APPID="{YOUR APPID}" ABBYY_PWD="{YOUR PASSWORD}" python process.py {PATH TO IMAGE} {PATH TO OUTPUT}
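If you would rather drive Abbyy's script from Ruby like the other engines, a wrapper along these lines is one possibility. This is a sketch under assumptions: the helper name abbyy_invocation is hypothetical, and it assumes process.py sits in the current directory and takes the image path and output path in the order shown above.

```ruby
# Hypothetical helper: check that both Abbyy variables are set, then return
# the environment hash and argument list in the form Kernel#system accepts.
def abbyy_invocation(image_path, output_path, env: ENV)
  required = %w[ABBYY_APPID ABBYY_PWD]
  missing = required.select { |key| env[key].to_s.empty? }
  raise ArgumentError, "Set #{missing.join(' and ')} before running Abbyy" unless missing.empty?

  abbyy_env = required.to_h { |key| [key, env[key]] }
  [abbyy_env, "python", "process.py", image_path, output_path]
end

# Example use:
#   system(*abbyy_invocation("documents/document.jpg", "output/document.txt"))
```

Failing early with a clear message avoids sending an unauthenticated request for every document in a batch.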
The free and open source tools that we tested are all command line applications that you'll run locally.
tesseract is far and away the best maintained and easiest to use of the command line tools we tested. You should be able to install it with a package manager.
MacOS: brew install tesseract --with-all-languages
Ubuntu/Debian: apt install tesseract-ocr tesseract-ocr-*
Calamari depends on OCRopus's tools to improve contrast, and to deskew and split images. Unfortunately, Calamari requires Python 3.x, and OCRopus requires Python 2.x. Because TensorFlow has issues with Python 3.7, we used Python 3.6. In retrospect, using kraken might have been much smoother, but here's what we actually did:
We used pyenv and virtualenv to manage multiple Python instances. (If you're using pyenv, please also note its installation instructions.)
We installed Python 3.6 with pyenv, and then used virtualenv to create a space to install Calamari and its dependencies.
# from the root of this directory first install Python 3.6 and create a virtual env.
mkdir -p venv
pyenv install 3.6.8
virtualenv -p ~/.pyenv/versions/3.6.8/bin/python venv/calamari
# activate the virtualenv
source venv/calamari/bin/activate
# Clone the calamari code
cd ..
git clone https://github.com/Calamari-OCR/calamari.git
cd calamari
# then install the dependencies and library.
pip install -r requirements.txt
python setup.py install
Calamari provides some pre-trained data models to power its recognizer. You should download them into a models directory in the Calamari directory.
git clone https://github.com/Calamari-OCR/calamari_models.git models
If your installation was successful, calamari-predict will be available at the command line, and you can run ruby ./lib/ocr.rb calamari {filename} to OCR files with Calamari.
OCRopus requires Python 2.7, so it's helpful to use pyenv to manage instances.
mkdir -p venv
pyenv install 2.7
virtualenv -p ~/.pyenv/versions/2.7/bin/python venv/ocropus
Clone OCRopus with
git clone https://github.com/tmbdev/ocropy.git
# activate the ocropus virtualenv
source venv/ocropus/bin/activate
# find the ocropus source directory
cd ../ocropy
# and install the dependencies
pip install -r requirements.txt
python setup.py install
To get OCRopus working you'll also need to download trained models. Prebuilt models for OCRopus can be found on the OCRopus wiki. You should download the English model into a models directory in the OCRopus directory.
If your installation was successful, ocropus-rpred will be available at the command line, and you can run ruby ./lib/ocr.rb calamari {filename} to OCR files with Calamari.
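With several tools spread across different virtualenvs, it is easy to lose track of which executables are visible in your current shell. The sketch below looks up the command names mentioned in this README on your PATH. The helper names which and missing_tools are hypothetical and are not part of the repository's scripts.

```ruby
# Hypothetical helper: return the full path of an executable on PATH, or nil.
def which(command, path: ENV.fetch("PATH", ""))
  path.split(File::PATH_SEPARATOR).each do |dir|
    candidate = File.join(dir, command)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Hypothetical helper: return the subset of tools that cannot be found.
def missing_tools(tools, path: ENV.fetch("PATH", ""))
  tools.reject { |tool| which(tool, path: path) }
end

# Command names taken from this README.
OCR_TOOLS = %w[tesseract mutool calamari-predict ocropus-rpred].freeze

# Example use (run it with the relevant virtualenv activated):
#   puts "Not found: #{missing_tools(OCR_TOOLS).join(', ')}"
```

Because calamari-predict and ocropus-rpred are installed into separate virtualenvs, expect each of them to show up only while its own virtualenv is active.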