This task mirrors a system we recently built internally, and will give you an idea of the problems we need to solve.
Every quarter, new company data is provided to us in PDF format. We need to use an external service to extract this data from the PDF, and then validate it against data we have on file from another source.
Complete the API so that:
- A user can provide a PDF and a company name.
- Data is extracted from the PDF via the external service and compared to the data stored on file.
- A summary of the data is returned, containing all fields from both sources and noting which fields did not match.
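The comparison step in the requirements above could be sketched roughly as follows. This is only an illustration: the function name `summarise_comparison`, the shape of the returned summary, and the choice to treat a field missing from one source as a mismatch are all assumptions, not part of the task or the repository.

```python
from __future__ import annotations

from typing import Any


def summarise_comparison(extracted: dict[str, Any], stored: dict[str, Any]) -> dict[str, Any]:
    """Build a field-by-field summary of two records and list the mismatches.

    Hypothetical helper: `extracted` stands for the data returned by the PDF
    service and `stored` for the matching record held on file.
    """
    fields: dict[str, Any] = {}
    mismatched: list[str] = []
    # Use the union of keys so a field present in only one source is still reported.
    for name in sorted(set(extracted) | set(stored)):
        extracted_value = extracted.get(name)
        stored_value = stored.get(name)
        # Assumption: a field counts as matching only if both sources have it
        # and the values are equal.
        matches = name in extracted and name in stored and extracted_value == stored_value
        fields[name] = {
            "extracted": extracted_value,
            "stored": stored_value,
            "matches": matches,
        }
        if not matches:
            mismatched.append(name)
    return {"fields": fields, "mismatched_fields": mismatched}
```

Returning every field with both values alongside a separate list of mismatched names keeps the response easy to scan, though the real response model is a design decision left to the implementer.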
A selection of example PDFs has been uploaded, and the PDF extraction service has been mocked for use in src/pdf_service.py (DO NOT EDIT THIS FILE). There is simple documentation of the service in PDF_SERVICE_DOCS.md. You can treat this as just another microservice.
The existing data we have on file is available in the data/database.csv file.
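One possible way to look up a company's stored record is sketched below. The column name `Company Name` is a guess, since the CSV header is not described here, and `find_company_row` is a hypothetical helper; check the actual file before relying on either.

```python
from __future__ import annotations

import csv
from pathlib import Path


def find_company_row(
    csv_path: str | Path,
    company_name: str,
    name_column: str = "Company Name",  # assumed header; adjust to the real file
) -> dict[str, str] | None:
    """Return the first CSV row whose name column matches, or None if absent."""
    wanted = company_name.strip().casefold()
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Compare case-insensitively and ignore surrounding whitespace.
            if (row.get(name_column) or "").strip().casefold() == wanted:
                return dict(row)
    return None
```

Note that `csv.DictReader` yields every value as a string, so numeric fields would likely need converting before they are compared with the extracted data.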
Treat this code as if it will be deployed to production, following best practices where possible.
The easiest way to set up the repository is to use python-poetry. The lock file was generated using version 1.8.3.
- Ensure poetry is installed
- Run make install
Alternatively, it's possible to pip install directly using the pyproject.toml or requirements.txt.