Extract all the text from a PDF, even when it can't be copied and pasted manually or comes from a scanned image, and optionally translate it on the fly.
Tested on Python 3.10.
- Tesseract
- poppler
- pdf2image
- natsort
- deep_translator
- Inquirer
- progressbar
- Clone this repo or download the latest PDFtoTXT.py file from releases
$ git clone https://github.com/sbritorodr/pdf_to_txt.git
- Install tesseract and poppler with your package manager (pacman, apt, brew...):
$ sudo pacman -S tesseract poppler
- Don't forget to add trained data to tesseract. Download the traineddata files for the languages you need from https://tesseract-ocr.github.io/tessdoc/Data-Files.html and place them in the tessdata directory named in the tesseract documentation (/usr/local/share/tessdata).
- Install all pip requirements. Copy and paste this into your terminal (use pip3 instead if pip doesn't work):
$ pip install -r requirements.txt
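Before running the script, it can help to confirm that the traineddata files ended up where tesseract looks for them. The following is a minimal sketch, not part of pdftotxt.py: the helper names are hypothetical, and the directory is the one quoted above, which may differ on your system.

```python
import shutil
from pathlib import Path


def installed_languages(tessdata_dir):
    """Return the language codes that have a .traineddata file in tessdata_dir."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


def missing_languages(wanted, tessdata_dir):
    """Return the codes from `wanted` that have no traineddata file yet."""
    have = set(installed_languages(tessdata_dir))
    return [code for code in wanted if code not in have]


if __name__ == "__main__":
    # Assumption: tessdata lives in the directory mentioned in the steps above.
    tessdata = "/usr/local/share/tessdata"
    print("tesseract on PATH:", shutil.which("tesseract") is not None)
    print("missing languages:", missing_languages(["eng", "spa"], tessdata))
```

An empty "missing languages" list means every requested language file was found.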
- Place pdftotxt.py where your PDFs are (or move your PDFs into the folder pdf2txt if you cloned the repo)
- Execute the script with python3:
$ python3 pdftotxt.py
- Follow the on-screen instructions. By default, the program picks up any PDFs in the folder, disables translation and merges everything into ./output_ocr_file.txt
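The default behaviour described above (find the PDFs, OCR every page, merge the text into one file) can be sketched roughly as follows. This is an illustration under assumptions, not the code in pdftotxt.py: the function names are hypothetical, pytesseract is assumed as the Python binding for tesseract, and a small stdlib sort key stands in for natsort.

```python
import re
from pathlib import Path


def natural_key(name):
    """Sort key that puts 'scan2' before 'scan10' (stdlib stand-in for natsort)."""
    return [int(tok) if tok.isdigit() else tok.lower()
            for tok in re.split(r"(\d+)", name)]


def find_pdfs(folder):
    """Return the PDF files in `folder`, in natural order."""
    pdfs = (p for p in Path(folder).iterdir() if p.suffix.lower() == ".pdf")
    return sorted(pdfs, key=lambda p: natural_key(p.name))


def ocr_pdf(pdf_path, lang="eng"):
    """Render each page to an image and run OCR on it; returns one string per page."""
    # Third-party imports stay local so the helpers above work without them.
    from pdf2image import convert_from_path  # requires poppler
    import pytesseract  # assumption: the script may call tesseract differently

    pages = convert_from_path(str(pdf_path))
    return [pytesseract.image_to_string(page, lang=lang) for page in pages]


def merge_pages(pages, out_path):
    """Write all page texts into a single file, one page after another."""
    Path(out_path).write_text("\n".join(pages), encoding="utf-8")
```

Running `ocr_pdf` on each path from `find_pdfs(".")` and passing the collected pages to `merge_pages(pages, "./output_ocr_file.txt")` would approximate the default run.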
- Translation does not work for a page with more than 5,000 characters
- If your desired target language is not available, you can add it by editing the script (lines 70 to 75). Check that it works and open a PR if you want to add this option to the main project:
70 questions = [
71 inq.List('lang',
72 message="Select which language you want to use",
73 choices=['spanish', 'english', 'french','italian', 'portuguese', 'german'] # add here your language/s
74 ),
75 ]
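As for the 5,000-character limit noted above, one possible workaround is to split a long page into smaller pieces before translating. The script does not do this; the sketch below only illustrates the idea. The function names are hypothetical, and using `GoogleTranslator` from deep_translator is an assumption about the backend.

```python
def split_for_translation(text, limit=5000):
    """Split text into chunks shorter than `limit`, breaking at line ends when possible."""
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        # A single line that is too long on its own gets cut into hard pieces.
        while len(line) >= limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:limit - 1])
            line = line[limit - 1:]
        # Start a new chunk when adding this line would reach the limit.
        if len(current) + len(line) >= limit:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks


def translate_long_text(text, target="es"):
    """Translate chunk by chunk; whitespace-only chunks are skipped."""
    # Assumption: GoogleTranslator is the deep_translator backend in use.
    from deep_translator import GoogleTranslator

    translator = GoogleTranslator(source="auto", target=target)
    parts = [translator.translate(chunk)
             for chunk in split_for_translation(text) if chunk.strip()]
    return "\n".join(parts)
```

Splitting at line ends keeps sentences together in most cases, although a sentence that spans a chunk boundary may still be translated less accurately.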
- Delete pdftotxt.py
- Delete all installed pip dependencies, tesseract and poppler:
$ pip uninstall -r requirements.txt
$ sudo pacman -Rs tesseract poppler
- Remove your tessdata files from /usr/local/share/tessdata if the uninstall has not already deleted them.