Tesseract is a great open-source library for optical character recognition (OCR). But it's a little tricky to use it to make a PDF of scanned images searchable, which is probably the biggest use case for OCR. Here's how to do that on a Mac.
(I tested all this and it worked as of 9 June 2015 on Mac OS X Yosemite 10.10.3. Your mileage may vary.)
- Install Homebrew or update your copy to the latest version:
brew update
- Get ghostscript:
brew install ghostscript
- Make sure you get the latest version of tesseract:
brew install --devel tesseract
- Now, let's say you scanned a magazine to input.pdf. First convert it to a multi-page TIFF at 300 dpi:
gs -sDEVICE=tiff32nc -r300 -o mag.tif input.pdf
- Then OCR the TIFF, writing the result to output.pdf:
tesseract mag.tif output pdf
Then open the resulting output.pdf in Preview.app and search for some words. They should be highlighted in the same locations where they appear in the scanned images. Tada!
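If you do this often, the two commands above can be wrapped in a small shell script. What follows is only a rough sketch under my own assumptions, not the .sh attached to this gist: the function names `ocr_pdf` and `ocr_output_base`, the `-ocr` suffix on the output file, and the temporary directory layout are all made up for illustration. The `gs` and `tesseract` invocations are the same ones used in the steps above.

```shell
#!/bin/sh
# Sketch: make a scanned PDF searchable with ghostscript + tesseract.
# Function names and the "-ocr" suffix are illustrative assumptions.

# Print the output base name for an input PDF, e.g. scan.pdf -> scan-ocr.
# tesseract appends ".pdf" to this base name itself.
ocr_output_base() {
    printf '%s-ocr\n' "${1%.pdf}"
}

# Convert the PDF to a multi-page TIFF, then OCR it into a searchable PDF.
# The body runs in a subshell so the variables and the cleanup trap stay local.
ocr_pdf() (
    input=$1
    [ -f "$input" ] || { echo "no such file: $input" >&2; exit 1; }

    # Scratch directory for the intermediate TIFF, removed when the subshell exits.
    workdir=$(mktemp -d "${TMPDIR:-/tmp}/ocrpdf.XXXXXX") || exit 1
    trap 'rm -rf "$workdir"' EXIT

    # Same commands as in the manual steps above.
    gs -sDEVICE=tiff32nc -r300 -o "$workdir/pages.tif" "$input" || exit 1
    tesseract "$workdir/pages.tif" "$(ocr_output_base "$input")" pdf
)

# Run only when the script is given a file name.
if [ "$#" -gt 0 ]; then
    ocr_pdf "$1"
fi
```

Saved as, say, ocrpdf.sh (a hypothetical name), running `sh ocrpdf.sh input.pdf` should leave input-ocr.pdf next to the original. The intermediate TIFF can get large at 300 dpi, which is why the sketch keeps it in a temporary directory and deletes it afterwards.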
Or, just use the .app attached to this gist. The source .sh is also attached. I used Platypus to turn the shell script into an app.
I was inspired by ryanfb's instructions and this discussion on Stack Overflow about how to get ghostscript to give pretty output.