PDF Processing Repository

This repository contains scripts and notebooks for processing PDF files: extracting tables and text, and manipulating the extracted data. It is organized into directories that separate the original data, the extracted or manipulated data, and the source code.

Recommended Additional Folders

data/: This directory should contain the original, unprocessed data in its raw form (e.g., PDF or CSV files).

new_data/: This directory will contain extracted, manipulated, or processed data files, such as cleaned-up tables, transposed data, or formatted Excel outputs.
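As a rough illustration of this layout, the sketch below creates both folders and maps a raw input file to an output location. The helper names `ensure_layout` and `output_path_for`, and the default `.xlsx` suffix, are hypothetical and not taken from the repository's scripts.

```python
from pathlib import Path


def ensure_layout(root: Path) -> tuple[Path, Path]:
    """Create data/ and new_data/ under root (hypothetical helper)."""
    raw_dir = root / "data"      # original, unprocessed inputs
    out_dir = root / "new_data"  # extracted or manipulated outputs
    raw_dir.mkdir(parents=True, exist_ok=True)
    out_dir.mkdir(parents=True, exist_ok=True)
    return raw_dir, out_dir


def output_path_for(raw_file: Path, out_dir: Path, suffix: str = ".xlsx") -> Path:
    """Return a same-named file in out_dir with a new suffix (assumed convention)."""
    return out_dir / raw_file.with_suffix(suffix).name
```

Keeping raw inputs and generated outputs in separate folders means a script can be re-run without overwriting the original files.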

Files

extract_pages.py: A Python script that extracts specific pages from PDFs based on certain criteria, for example by dropping a given number of pages from the start or end, or by dropping specific pages.

test_one_page.ipynb: A Jupyter Notebook for testing PDF extraction processes on a single page, used to verify extraction logic before scaling to multiple pages.

extract_text_tables_multiple_pages.ipynb: A Jupyter Notebook that extracts both text and tables from multi-page PDFs, processing them page by page.

remove_pages.py: A Python script to remove certain pages from PDFs, based on criteria such as keywords or page numbers.

test_excel_to_excel.py: A script to convert extracted data into Excel format, ensuring proper formatting and organization across different sheets.

turn_values_excel copy.py: A script that transposes or otherwise rearranges extracted values into the desired layout and saves the result to an Excel file.
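The page handling described for extract_pages.py and remove_pages.py can be sketched as follows. This is a minimal illustration, not the scripts' actual code: it assumes the third-party pypdf package (the README does not say which PDF library is used), zero-based page indices, and the hypothetical names `pages_to_keep` and `trim_pdf`.

```python
from typing import Iterable, List


def pages_to_keep(total: int, skip_start: int = 0, skip_end: int = 0,
                  drop: Iterable[int] = ()) -> List[int]:
    """Return the zero-based indices of pages that survive the removal rules."""
    if skip_start < 0 or skip_end < 0:
        raise ValueError("skip_start and skip_end must be non-negative")
    dropped = set(drop)
    # Clamp the upper bound so over-trimming yields an empty list, not an error.
    last = max(skip_start, total - skip_end)
    return [i for i in range(skip_start, last) if i not in dropped]


def trim_pdf(src: str, dst: str, skip_start: int = 0, skip_end: int = 0,
             drop: Iterable[int] = ()) -> int:
    """Copy the surviving pages of src into dst and return how many were written."""
    # Assumption: pypdf is installed; it is imported lazily so the selection
    # logic above stays dependency-free.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    keep = pages_to_keep(len(reader.pages), skip_start, skip_end, drop)
    writer = PdfWriter()
    for index in keep:
        writer.add_page(reader.pages[index])
    with open(dst, "wb") as handle:
        writer.write(handle)
    return len(keep)
```

For example, `trim_pdf("data/report.pdf", "new_data/report_trimmed.pdf", skip_start=2, skip_end=1)` would drop the first two pages and the last page; the file names here are placeholders. Removing pages by keyword would need a text-extraction step before the indices are chosen, which this sketch leaves out.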

aosinda/pdf_processing