This document describes how to run the application on your local machine.
Important: This software was developed and tested only on Ubuntu 22.04.
git clone https://github.com/stoyanK7/BG-DE-Anki-Decks.git
cd BG-DE-Anki-Decks
pipenv sync --dev
pipenv shell
./run.sh
python3 src/XX_step_you_want_to_run.py
ruff format . && ruff check --fix .
flowchart LR
inputPdfFile["`**data/input/Goethe-Zertifikat_B1_Wortliste.pdf**
-----------
Input file, downloaded from the Goethe-Institut website.`"]
convertPdfToTxtScript[["`**convert_pdf_to_txt.py**
-----------
Convert *data/input/\*.pdf* to a text file - *data/output/raw/\*.txt*`"]]
rawTxtFile["`**data/output/raw/Goethe-Zertifikat_B1_Wortliste.txt**
-----------
Text representation of the input PDF file.`"]
manuallyEdit["`**Manually edit data/output/raw/\*.txt**
-----------
This step is unavoidable. It's faster to catch edge cases and fix them manually than to devise an algorithm that handles them.
This step can occur during any of the steps below.`"]
cleanTxtScript[["`**clean_txt.py**
-----------
Clean *data/output/raw/\*.txt*`"]]
cleanedTxtFile["Cleaned TXT file"]
preprocessTxtScript[["Preprocess TXT"]]
preprocessedTxtFile["Preprocessed TXT file"]
parseTxtScript[["Parse TXT"]]
rawCsvFile[("Raw CSV file")]
cleanCsvScript[["Clean CSV"]]
cleanedCsvFile[("Cleaned CSV file")]
preprocessCsvScript[["Preprocess CSV"]]
preprocessedCsvFile[("Preprocessed CSV file")]
convertPdfToTxtScript -->|Reads| inputPdfFile
convertPdfToTxtScript -->|Writes| rawTxtFile
manuallyEdit -->|Edits| rawTxtFile
cleanTxtScript -->|Reads| rawTxtFile
cleanTxtScript -->|Writes| cleanedTxtFile
preprocessTxtScript -->|Reads| cleanedTxtFile
preprocessTxtScript -->|Writes| preprocessedTxtFile
parseTxtScript -->|Reads| preprocessedTxtFile
parseTxtScript -->|Writes| rawCsvFile
cleanCsvScript -->|Reads| rawCsvFile
cleanCsvScript -->|Writes| cleanedCsvFile
preprocessCsvScript -->|Reads| cleanedCsvFile
preprocessCsvScript -->|Writes| preprocessedCsvFile
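
The flowchart above describes a linear chain in which each script reads the file written by the previous one. The following is a minimal sketch of that chain as plain data, not the project's actual code. The `Stage` class, the `PIPELINE` list, and both helper functions are hypothetical names introduced for illustration, and the artifact strings are shorthand for the file nodes in the diagram. The idea that a manual edit requires re-running the downstream stages is also an assumption, since the flowchart only says the edit can occur at any point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One script node from the flowchart (hypothetical helper type)."""
    name: str    # label of the script node
    reads: str   # artifact the script reads
    writes: str  # artifact the script writes


# Stages in the order the flowchart connects them.
PIPELINE = [
    Stage("convert_pdf_to_txt.py", "input PDF file", "raw TXT file"),
    Stage("clean_txt.py", "raw TXT file", "cleaned TXT file"),
    Stage("Preprocess TXT", "cleaned TXT file", "preprocessed TXT file"),
    Stage("Parse TXT", "preprocessed TXT file", "raw CSV file"),
    Stage("Clean CSV", "raw CSV file", "cleaned CSV file"),
    Stage("Preprocess CSV", "cleaned CSV file", "preprocessed CSV file"),
]


def is_linear_chain(stages):
    """Return True if every stage reads exactly what the previous one wrote."""
    return all(prev.writes == nxt.reads for prev, nxt in zip(stages, stages[1:]))


def stages_to_rerun(stages, edited_artifact):
    """Return the names of all stages at or after the one reading the edited artifact."""
    for index, stage in enumerate(stages):
        if stage.reads == edited_artifact:
            return [s.name for s in stages[index:]]
    return []  # no stage reads this artifact
```

Under these assumptions, `stages_to_rerun(PIPELINE, "raw TXT file")` lists every stage from `clean_txt.py` onward, which corresponds to the manual editing step in the diagram.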