PDF Scraping

The script extracts Arabic text from PDFs that follow a specific format, translates it with googletrans, and appends the content of each PDF as a row to a CSV file.

Getting Started

Create a local copy of this repo.
Run pip install -r requirements.txt to install the required packages. This has been tested on macOS.
Add PDFs to the pdf_input folder.
Run main.py.
All extracted information is appended to output.csv.
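
The append step in the last line above can be sketched with Python's csv module. This is a minimal illustration, not the code in main.py: the helper name append_row and the column names in the usage note are assumptions.

```python
import csv
from pathlib import Path


def append_row(csv_path, row, fieldnames):
    """Append one row to csv_path, writing the header only if the file is new or empty."""
    path = Path(csv_path)
    write_header = not path.exists() or path.stat().st_size == 0
    # newline="" lets the csv module control line endings;
    # utf-8 keeps the Arabic text intact.
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerow(row)
```

A call might look like append_row("output.csv", {"file": "a.pdf", "text": "..."}, ["file", "text"]), where "file" and "text" are hypothetical column names. Opening the file in append mode means earlier rows are never rewritten, so repeated runs only add to the file.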

Logic

This script uses poppler-utils to convert each PDF into an image and then uses Tesseract to extract the text via OCR. We first extract the Latin characters and numbers, then run OCR a second time for the Arabic text. Finally, we use Python's csv library to append the data to a CSV file and write the names of the processed PDFs to a text file. At startup, the script checks this text file so that PDFs that have already been processed are skipped.
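
The flow described above could be sketched roughly as follows. This is a sketch under stated assumptions, not the contents of functions.py: every function name, the "temp" output root, the 300 DPI setting, and the log file path are illustrative. The commands use the pdftoppm tool from poppler-utils and the tesseract command-line tool, with "eng" and "ara" as the Tesseract language codes for the two OCR passes.

```python
from pathlib import Path


def pdf_to_image_cmd(pdf_path, image_root="temp"):
    """Command for pdftoppm: render PDF pages as JPEGs named <image_root>-N.jpg."""
    # 300 DPI is an assumed resolution, chosen here for readable OCR input.
    return ["pdftoppm", "-jpeg", "-r", "300", str(pdf_path), image_root]


def ocr_cmd(image_path, output_base, lang):
    """Command for tesseract: write recognized text to <output_base>.txt."""
    return ["tesseract", str(image_path), output_base, "-l", lang]


def load_processed(log_path):
    """Return the set of PDF names already recorded in the log file."""
    path = Path(log_path)
    if not path.exists():
        return set()
    lines = path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def pending_pdfs(input_dir, log_path):
    """List PDFs in input_dir that are not yet recorded in the log, sorted by name."""
    done = load_processed(log_path)
    return sorted(p for p in Path(input_dir).glob("*.pdf") if p.name not in done)


def mark_processed(log_path, pdf_name):
    """Record a PDF name so that later runs skip it."""
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(pdf_name + "\n")
```

Each command list can be passed to subprocess.run(cmd, check=True). For one PDF, that would mean running pdf_to_image_cmd once, then ocr_cmd twice on the resulting image (once with "eng", once with "ara"), and finally calling mark_processed. The Arabic pass requires the Arabic language data for Tesseract to be installed.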

