Sample data for this problem statement can be found in the data folder.
There are two sets of examples, and each set includes a JSON file and an image file.
For example, data/1.json and data/1.jpg are the JSON and image files for the first example.
The JSON file contains the following information:
- words: list of words extracted from the page. Each word is represented by a dictionary with the following keys:
  - text: the text of the word
  - w_min: the x-coordinate of the top-left corner of the word
  - h_min: the y-coordinate of the top-left corner of the word
  - w_max: the x-coordinate of the bottom-right corner of the word
  - h_max: the y-coordinate of the bottom-right corner of the word
- table: the location of the transaction table represented as a dictionary with the following keys:
  - w_min: the x-coordinate of the top-left corner of the table
  - h_min: the y-coordinate of the top-left corner of the table
  - w_max: the x-coordinate of the bottom-right corner of the table
  - h_max: the y-coordinate of the bottom-right corner of the table
- header: the location of the transaction table header represented as a dictionary with the following keys:
  - w_min: the x-coordinate of the top-left corner of the header
  - h_min: the y-coordinate of the top-left corner of the header
  - w_max: the x-coordinate of the bottom-right corner of the header
  - h_max: the y-coordinate of the bottom-right corner of the header
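To make the layout concrete, here is a minimal sketch of reading one of these files and selecting the words that fall inside a box such as header or table. The helper names load_page and words_in_box are hypothetical and not part of the repository, and the sample page is hand-made with invented values. A word is treated as inside a box when its centre point is; that rule is an assumption, not something the task prescribes.

```python
import json
from pathlib import Path


def load_page(json_path):
    """Read one example file such as data/1.json into a dictionary."""
    return json.loads(Path(json_path).read_text(encoding="utf-8"))


def words_in_box(words, box):
    """Return the words whose centre point lies inside the given box.

    Words and boxes both use the w_min/h_min/w_max/h_max keys.
    """
    selected = []
    for word in words:
        center_w = (word["w_min"] + word["w_max"]) / 2
        center_h = (word["h_min"] + word["h_max"]) / 2
        if (box["w_min"] <= center_w <= box["w_max"]
                and box["h_min"] <= center_h <= box["h_max"]):
            selected.append(word)
    return selected


# Hand-made page in the documented layout (all values invented for illustration).
page = {
    "words": [
        {"text": "Date", "w_min": 0.10, "h_min": 0.30, "w_max": 0.16, "h_max": 0.32},
        {"text": "Balance", "w_min": 0.80, "h_min": 0.30, "w_max": 0.90, "h_max": 0.32},
        {"text": "Page", "w_min": 0.10, "h_min": 0.95, "w_max": 0.15, "h_max": 0.97},
    ],
    "table": {"w_min": 0.05, "h_min": 0.29, "w_max": 0.95, "h_max": 0.90},
    "header": {"w_min": 0.05, "h_min": 0.29, "w_max": 0.95, "h_max": 0.33},
}

header_words = [w["text"] for w in words_in_box(page["words"], page["header"])]
print(header_words)  # ['Date', 'Balance']
```

With a real example, page = load_page("data/1.json") would replace the hand-made dictionary.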
Notes
- The coordinates are normalized to be between 0 and 1.
- OCR has already been done and the corresponding text is already provided to you. You do not need to do any OCR or Text Extraction.
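Because the coordinates are normalized, they have to be scaled by the image size before they can be drawn on or cropped from the image file. The sketch below assumes that w values scale with the image width and h values with the image height; to_pixels is a hypothetical helper, and the image size is passed in by hand rather than read from the image.

```python
def to_pixels(box, image_width, image_height):
    """Scale a normalized box to integer pixel coordinates.

    Returns (left, top, right, bottom). Assumes w_* are fractions of the
    image width and h_* are fractions of the image height.
    """
    return (
        round(box["w_min"] * image_width),
        round(box["h_min"] * image_height),
        round(box["w_max"] * image_width),
        round(box["h_max"] * image_height),
    )


# Invented table box on an image assumed to be 1000 x 1400 pixels.
table = {"w_min": 0.05, "h_min": 0.25, "w_max": 0.95, "h_max": 0.75}
print(to_pixels(table, 1000, 1400))  # (50, 350, 950, 1050)
```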
Given a bank statement (pdf), we are interested in extracting the following information from the document:
- Statement Period, which includes a start date and an end date
- Opening balance amount
- Transaction table headers
- Balance amounts from the transaction table
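As an illustration of how the word boxes can be used for the last item, the sketch below collects numeric words that sit under a header word inside the table. It is only one possible approach and rests on several assumptions: the column header is the single word "Balance", amounts look like 1,250.00, and a value belongs to the column when it overlaps the header word horizontally. The names parse_amount and balances_in_column are hypothetical, and the sample words are invented.

```python
import re

# Matches 1,250.00 style or plain 980.50 style numbers, optionally negative.
AMOUNT_RE = re.compile(r"-?\d{1,3}(?:,\d{3})*(?:\.\d+)?|-?\d+(?:\.\d+)?")


def parse_amount(text):
    """Return the amount as a float, or None if the text is not a number."""
    cleaned = text.strip()
    if not AMOUNT_RE.fullmatch(cleaned):
        return None
    return float(cleaned.replace(",", ""))


def center_h(word):
    """Vertical centre of a word box."""
    return (word["h_min"] + word["h_max"]) / 2


def balances_in_column(words, header, table, column_name="Balance"):
    """Collect amounts below the header that overlap the named header word."""
    column = next(
        (w for w in words
         if w["text"].lower() == column_name.lower()
         and header["h_min"] <= center_h(w) <= header["h_max"]),
        None,
    )
    if column is None:
        return []
    found = []
    for word in words:
        below_header = header["h_max"] < center_h(word) <= table["h_max"]
        overlaps = word["w_max"] >= column["w_min"] and word["w_min"] <= column["w_max"]
        amount = parse_amount(word["text"])
        if below_header and overlaps and amount is not None:
            found.append((center_h(word), amount))
    # Return the amounts in top-to-bottom order.
    return [amount for _, amount in sorted(found, key=lambda item: item[0])]


# Invented sample in the documented layout.
header = {"w_min": 0.05, "h_min": 0.29, "w_max": 0.95, "h_max": 0.33}
table = {"w_min": 0.05, "h_min": 0.29, "w_max": 0.95, "h_max": 0.90}
words = [
    {"text": "Balance", "w_min": 0.80, "h_min": 0.30, "w_max": 0.90, "h_max": 0.32},
    {"text": "980.50", "w_min": 0.84, "h_min": 0.40, "w_max": 0.90, "h_max": 0.42},
    {"text": "1,250.00", "w_min": 0.82, "h_min": 0.35, "w_max": 0.90, "h_max": 0.37},
    {"text": "45.00", "w_min": 0.60, "h_min": 0.40, "w_max": 0.66, "h_max": 0.42},
    {"text": "Coffee", "w_min": 0.30, "h_min": 0.40, "w_max": 0.40, "h_max": 0.42},
]
print(balances_in_column(words, header, table))  # [1250.0, 980.5]
```

Real statements may use other header wording, currency symbols, or multi-word amounts, so the matching rules would need to be adapted to the sample data.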
The expected outputs for the two examples are provided in the tests/test_extractors file.
A visual example of extracted data is shown below:
Evaluation will be done by running tests located in the tests folder.
The tests will check the following:
- The statement period is extracted correctly (test_statement_period)
- The opening balance amount is extracted correctly (test_opening_balance)
- The transaction table headers are extracted correctly (test_table_headers)
- The balance amounts from the transaction table are extracted correctly (test_table_balance)
Note: Do not modify the tests.
The following are optional tasks that will be evaluated separately:
- Write an API server that accepts a json file and an image file and returns all the extracted data.
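A minimal sketch of such a server, using only the Python standard library, could look like the following. Everything here is an assumption rather than a required design: the form field names json and image, the port, and the extract_all placeholder, which stands in for calls to the functions in src/extractors.py and only returns simple counts. A web framework would work equally well.

```python
import json
from email.parser import BytesParser
from email.policy import default
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_multipart(content_type, body):
    """Split a multipart/form-data body into {field name: raw bytes}."""
    header = b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n"
    message = BytesParser(policy=default).parsebytes(header + body)
    fields = {}
    for part in message.iter_parts():
        name = part.get_param("name", header="content-disposition")
        fields[name] = part.get_payload(decode=True)
    return fields


def extract_all(page, image_bytes):
    """Hypothetical placeholder: call the functions in src/extractors.py here."""
    return {
        "word_count": len(page.get("words", [])),
        "image_size_bytes": len(image_bytes),
    }


class ExtractHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        content_type = self.headers.get("Content-Type", "")
        fields = parse_multipart(content_type, self.rfile.read(length))
        if "json" not in fields or "image" not in fields:
            self.send_error(400, "Expected form fields 'json' and 'image'")
            return
        result = extract_all(json.loads(fields["json"]), fields["image"])
        payload = json.dumps(result).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def run(host="127.0.0.1", port=8000):
    """Start the server; blocks until interrupted."""
    HTTPServer((host, port), ExtractHandler).serve_forever()
```

Calling run() starts the server locally. A client would then POST a multipart/form-data request with the JSON file in the json field and the image file in the image field.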
- Clone this repository to your local machine.
- You should modify the src/extractors.py file. The functions have already been defined for you.
- Run tests using pytest to check your implementation.
- Create a tarball of your code and share it via email.
Note: If you have used any external libraries, please include them in the requirements.txt file.