The script extracts Arabic text from a PDF (with a certain format), translates it using `googletrans`, and adds the content of every PDF as a row to a CSV file.
- Create a local copy of this repo
- Run `pip install -r requirements.txt` to install packages. This has been tested on macOS
- Add PDFs to the `pdf_input` folder
- Run `main.py`
- All extracted information is appended to `output.csv`
This script uses `poppler-utils` to convert a PDF into an image and then uses `tesseract` to extract the text via OCR. We first extract the Latin characters and numbers and then run OCR a second time for the Arabic. Finally, we use Python's CSV library to append the data to a CSV file and also write the names of the processed PDFs to a text file. The script checks this text file at the start to see which files have already been processed, so they are not processed again.
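
The flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in `main.py`: it assumes the `pdftoppm` (from poppler-utils) and `tesseract` command-line tools are on the `PATH` with the `eng` and `ara` language data installed, and the file name `processed.txt` and all function names are hypothetical.

```python
import csv
import subprocess
from pathlib import Path

# Hypothetical file names; the real script may use different ones.
PROCESSED_LOG = Path("processed.txt")
OUTPUT_CSV = Path("output.csv")


def pdf_to_png_cmd(pdf_path, out_prefix, dpi=300):
    """Build the pdftoppm command that renders each PDF page to a PNG.

    pdftoppm writes one PNG per page, named from out_prefix.
    """
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(out_prefix)]


def ocr_cmd(image_path, lang):
    """Build the tesseract command that prints the recognised text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_image(image_path):
    """Run OCR twice: once for Latin characters and digits, once for Arabic."""
    results = {}
    for lang in ("eng", "ara"):
        done = subprocess.run(
            ocr_cmd(image_path, lang),
            capture_output=True, text=True, check=True,
        )
        results[lang] = done.stdout
    return results


def load_processed(log_path=PROCESSED_LOG):
    """Return the set of PDF names already recorded in the text file."""
    if not log_path.exists():
        return set()
    lines = log_path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def record_result(pdf_name, fields, csv_path=OUTPUT_CSV, log_path=PROCESSED_LOG):
    """Append one CSV row for the PDF, then note its name as processed."""
    with csv_path.open("a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow([pdf_name, *fields])
    with log_path.open("a", encoding="utf-8") as f:
        f.write(pdf_name + "\n")
```

A driver loop would call `load_processed()` once, skip any file in `pdf_input` whose name is already in the returned set, and call `record_result()` after each successful extraction. Opening both files in append mode keeps earlier rows intact across runs.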