PDFConverter is a Python project that needs to be converted into an executable file. It quickly interprets and converts a large number of tables from PDF files without requiring extensive user interaction.
You can also check the docs branch or the desktop application used to test calls to this script.
Example of a script call:
python pdfconverter.py --ImportPath "C:\\users\\dvp10\\desktop\\EDITAL (2).pdf" --ExportPath "C:\\users\\dvp10\\desktop" --PageNumber "all"
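A call like the one above could be parsed with argparse, which the project lists among its libraries. This is only a sketch, not the script's actual code: the flag names come from the example call, while `build_parser`, the help texts, and the `"all"` default are assumptions made for illustration.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the three flags of the example call."""
    parser = argparse.ArgumentParser(prog="pdfconverter.py")
    parser.add_argument("--ImportPath", required=True, help="PDF file to read")
    parser.add_argument("--ExportPath", required=True, help="folder that receives the exports")
    # Assumed default: convert every page unless told otherwise.
    parser.add_argument("--PageNumber", default="all", help='page to convert, or "all"')
    return parser

# Parse the same values the example call passes on the command line.
example_args = build_parser().parse_args([
    "--ImportPath", r"C:\\users\\dvp10\\desktop\\EDITAL (2).pdf",
    "--ExportPath", r"C:\\users\\dvp10\\desktop",
    "--PageNumber", "all",
])
```

argparse stores each value under the flag name without the dashes, so the import path is read back as `example_args.ImportPath`.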
You can find me on LinkedIn at linkedin.com/in/monambike/. If you want to see videos about my work, check my YouTube channel at youtube.com/@monambike_portfolio, and if you want to see my artwork, check my Instagram at instagram.com/monambike_portfolio.
The license for this repository is available here. Please refer to the provided link for detailed information regarding the terms and conditions governing the use of this project.
- Libraries
- Formatting
List of libraries used for the development of the Python script:
- Pandas, for text conversion and DataFrame manipulation;
- Tabula, for reading PDF files;
- Other standard Python libraries, such as glob for retrieving only PDF files, os for system operations, and argparse for receiving and handling command-line arguments, among others.
Types of formatting and the files to which they were applied. Wherever an export is shown (as a table), all the formatting listed above it has already been applied to the exported file.
Formatting related to reading.
Removes all double quotes from the DataFrame to avoid future issues.
Replaces all semicolons in the DataFrame with commas to avoid conflicts.
Deletes all empty rows in the DataFrame.
Deletes all empty columns in the DataFrame.
Converts the header to body to remove unnecessary and detrimental formatting.
Removes line breaks that occur when the PDF has a very long line.
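The six reading steps above can be sketched with Pandas roughly as follows. This is an illustration under assumptions, not the project's code: `clean_table` and the two helper functions are hypothetical names, and the sample DataFrame stands in for a table that Tabula would have read from a PDF.

```python
import pandas as pd

def strip_quotes_and_semicolons(value):
    """Drop double quotes and turn semicolons into commas in string cells."""
    if isinstance(value, str):
        return value.replace('"', "").replace(";", ",")
    return value

def join_line_breaks(value):
    """Collapse line breaks inside a string cell into single spaces."""
    if isinstance(value, str):
        return " ".join(value.splitlines())
    return value

def clean_table(df: pd.DataFrame) -> pd.DataFrame:
    # Remove double quotes and replace semicolons with commas, cell by cell.
    df = df.apply(lambda column: column.map(strip_quotes_and_semicolons))
    # Delete rows, then columns, that are entirely empty.
    df = df.dropna(axis=0, how="all").dropna(axis=1, how="all")
    # Push the header down into the body as an ordinary first row.
    header = pd.DataFrame([list(df.columns)], columns=df.columns)
    df = pd.concat([header, df], ignore_index=True)
    df.columns = range(df.shape[1])
    # Remove line breaks left behind by very long PDF lines.
    return df.apply(lambda column: column.map(join_line_breaks))

# Stand-in for a table extracted from a PDF.
sample = pd.DataFrame({
    "Name": ['a "b"', None, "x;y"],
    "Empty": [None, None, None],
    "Note": ["line\rbreak", None, "ok"],
})
cleaned = clean_table(sample)
```

The steps run in the order listed above, so the old header row is only touched by the final line-break pass.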
Formatting related to conversion.
Starts the first export, which is the export of the unformatted file that will be formatted later.
EXPORT | |
---|---|
Folder Name: | withoutFormatting |
Folder Path: | (lattice/stream) + "\\withoutFormatting" |
Description: | The 'withoutFormatting' file is exported at this moment without any formatting. |
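One way to read the Folder Path row is that each export lands in a subfolder named after the stage, inside a folder named after the extraction mode. The sketch below builds such a path; `export_folder` is a hypothetical helper, and treating "lattice" and "stream" as folder names under the export path is an assumption.

```python
from pathlib import PureWindowsPath

# The two mode names that appear in the Folder Path row.
MODES = ("lattice", "stream")

def export_folder(export_path: str, mode: str, stage: str) -> str:
    """Build a Windows-style path: export path, then mode, then stage folder."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    # PureWindowsPath joins with backslashes on any operating system.
    return str(PureWindowsPath(export_path) / mode / stage)

folder = export_folder(r"C:\users\dvp10\desktop", "lattice", "withoutFormatting")
```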
Removes empty data in the header.
If it is:
"<data>";"Unnamed: 0";"<data>"
It becomes:
"<data>";"<data>"
Removes line breaks if they occur in the middle of the data.
If it is:
"<data
data>"
It becomes:
"<data data>"
Removes the semicolon ';' if it is at the end of the line.
If it is:
"<data>";"<data>";
It becomes:
"<data>";"<data>"
Removes leading spaces in the lines.
If it is:
"<data>";"<data>"
"<data>";"<data>"
"<data>";"<data>"
It becomes:
"<data>";"<data>"
"<data>";"<data>"
"<data>";"<data>"
Removes the line if it starts and ends with quotes and has one column or fewer.
If it is:
"<data>";"<data>";"<data>";"<data>"
"<data>";"<data>"
"<data>";"<data>";"<data>";"<data>"
"<data>"
"<data>";"<data>";"<data>";"<data>"
"<data>
<data>"
It becomes:
"<data>";"<data>";"<data>";"<data>"
"<data>";"<data>"
"<data>";"<data>";"<data>";"<data>"
"<data>";"<data>";"<data>";"<data>"
"<data>
<data>"
Starts the export of the file to handle the exception when converting a table that has empty cells.
EXPORT | |
---|---|
Folder Name: | tableWithBlankCells |
Folder Path: | (lattice/stream) + "\\tableWithBlankCells" |
Description: | The file 'tableWithBlankCells' is exported at this moment with all the formatting applied above. |
Removes data that is empty, that is, ""; and ;"".
If it is:
"";"<data>";"<data>";"<data>"
"<data>";"<data>";"";"<data>"
"<data>";"<data>";"<data>";""
It becomes:
"<data>";"<data>";"<data>"
"<data>";"<data>";"<data>"
"<data>";"<data>";"<data>"
Inserts a line break if there are double quotes side by side.
If it is:
"<data>";"<data>""<data>";"<data>"
It becomes:
"<data>";"<data>"
"<data>";"<data>"
If there is a semicolon followed by a space, it is replaced by a line break.
If it is:
"<Lorem ipsum>";"<Lorem ipsum>"; "<Lorem ipsum>";"<Lorem ipsum>"
It becomes:
"<Lorem ipsum>";"<Lorem ipsum>"
"<Lorem ipsum>";"<Lorem ipsum>"
Removes the preceding content if there is a space between the separators and the quotes.
If it is:
"<Lorem ipsum>";"<Lorem ipsum>"; "<data>";"<data>"
It becomes:
"<data>";"<data>"
Removes the line if it starts and ends with quotes and has one column or fewer.
If it is:
"<data>";"<data>";"<data>";"<data>"
"<data>";"<data>"
"<data>";"<data>";"<data>";"<data>"
"<data>"
"<data>";"<data>";"<data>";"<data>"
"<data>
<data>"
It becomes:
"<data>";"<data>";"<data>";"<data>"
"<data>";"<data>"
"<data>";"<data>";"<data>";"<data>"
"<data>";"<data>";"<data>";"<data>"
"<data>
<data>"
Starts the export of the main file.
EXPORT | |
---|---|
Folder Name: | main |
Folder Path: | (lattice/stream) + "\\main" |
Description: | The file 'main' is exported at this moment with all the formatting applied above. |
Deletes the line if it doesn't start with quotes.
If it is:
"<data>";"<data>";"<data>"
<data>";"<data>";"<data>"
"<data>";"<data>";"<data>"
It becomes:
"<data>";"<data>";"<data>"
"<data>";"<data>";"<data>"
Deletes the line if it doesn't end with quotes.
If it is:
"<data>";"<data>";"<data>"
"<data>";"<data>";"<data>
"<data>";"<data>";"<data>"
It becomes:
"<data>";"<data>";"<data>"
"<data>";"<data>";"<data>"
Deletes empty lines that contain only a line break '\n' and lines that don't have a double quote anywhere.
If it is:
Lorem
"<data>";"<data>";"<data>"
"<data>";"<data>";"<data>"
Lorem ipsum
"<data>";"<data>"
It becomes:
"<data>";"<data>";"<data>"
"<data>";"<data>";"<data>"
"<data>";"<data>"
Only writes the line if it has at least three columns.
If it is:
"<data>";"<data>";"<data>";"<data>"
"<data>";"<data>"
"<data>";"<data>";"<data>";"<data>"
"<data>"
"<data>";"<data>";"<data>";"<data>"
"<data>";"<data>";"<data>"
It becomes:
"<data>";"<data>";"<data>";"<data>"
"<data>";"<data>";"<data>";"<data>"
"<data>";"<data>";"<data>";"<data>"
"<data>";"<data>";"<data>"
Starts the export of the main file with some stricter formatting modifications.
EXPORT | |
---|---|
Folder Name: | fullClear |
Folder Path: | (lattice/stream) + "\\fullClear" |
Description: | The file 'fullClear' is exported at this moment with all the formatting applied above. |