Futsch1/form-analyzer

form-analyzer - A library that uses AWS Textract to automatically evaluate filled forms

Python package to analyze scanned questionnaires and forms with AWS Textract and convert the results to an XLSX.

Advanced Python programming skills are not required, but a basic understanding of Python is needed.

Prerequisites

  • Install form-analyzer using pip
pip install form-analyzer

Example

For a comprehensive example, see the example folder in this project.

Prepare questionnaires

To process your input data, the questionnaires need to be converted to a suitable format. form-analyzer requires PNG files for the upload to AWS Textract. If your data is already in this format, make sure that the lexicographic order of the file names matches the order of the forms and of the pages within each form.

Example:

Form1_Page1.png
Form1_Page2.png
Form1_Page3.png
Form2_Page1.png
Form2_Page2.png
Form2_Page3.png
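
Lexicographic order differs from numeric order once page numbers reach two digits: Page10 sorts before Page2 unless the numbers are zero-padded. The following is a minimal sketch of a check that uses only the standard library; the helper name sort_mismatches is hypothetical and not part of form-analyzer:

```python
import re


def sort_mismatches(file_names):
    """Return the names whose lexicographic position differs from their natural (numeric) position."""
    def natural_key(name):
        # Split into digit and non-digit runs so that '10' compares as the number 10
        return [int(part) if part.isdigit() else part
                for part in re.split(r'(\d+)', name)]

    lexicographic = sorted(file_names)
    natural = sorted(file_names, key=natural_key)
    return [name for name, expected in zip(lexicographic, natural) if name != expected]
```

An empty result means that both orders agree. Any name in the result is a candidate for renaming with zero-padded numbers, for example Page02 instead of Page2.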

Convert PDF files

form-analyzer can convert PDF input files to properly named PNG files ready for upload. Each PDF page can optionally be post-processed by a custom function, for example to split it into several images or to skip it.

Create a Python script like this to convert single page PDF files (assuming that the PDFs are located in the folder "questionnaires"):

import form_analyzer

form_analyzer.pdf_to_image('questionnaires')

The following example shows how to split a single PDF page into two images and how to return only the first page:

import form_analyzer


def one_page_to_two(_: int, image):
    left = image.crop((0, 0, image.width // 2, image.height))
    right = image.crop((image.width // 2, 0, image.width, image.height))

    return [form_analyzer.ProcessedImage(left, '_1'), form_analyzer.ProcessedImage(right, '_2')]


# Split every PDF page into a left and a right image
form_analyzer.pdf_to_image('questionnaires', image_processor=one_page_to_two)

# Keep only the first page of each PDF and skip all other pages
form_analyzer.pdf_to_image('questionnaires',
                           image_processor=lambda image_index, image: [form_analyzer.ProcessedImage(image, '') if image_index == 0 else None])

The argument image_processor specifies a function that receives the current PDF page number (starting with 0) and an Image object. It returns a list of form_analyzer.ProcessedImage objects that contain an Image object and a file name suffix. The list may also contain None, in which case the entry is skipped.

The resulting images are stored in the same folder as the PDF source files.
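
The crop arithmetic used for splitting a page in two generalizes to any number of side-by-side form pages on one PDF page. The following sketch covers the box calculation alone and does not depend on form-analyzer. The helper vertical_strip_boxes is hypothetical, and the tuples follow the (left, upper, right, lower) convention of Pillow's Image.crop, assuming that the Image object passed to the processor is a Pillow image:

```python
def vertical_strip_boxes(width, height, parts):
    """Return crop boxes (left, upper, right, lower) that split a page into side-by-side strips."""
    if parts < 1:
        raise ValueError('parts must be at least 1')
    # Boundaries are derived from the full width, so the strips cover the page without gaps
    edges = [width * i // parts for i in range(parts + 1)]
    return [(edges[i], 0, edges[i + 1], height) for i in range(parts)]
```

Inside an image_processor function, each box could be passed to image.crop and the result wrapped in a form_analyzer.ProcessedImage with a suffix such as '_1'.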

AWS Textract

The converted images can now be processed by AWS Textract to extract the form data. You can either provide your AWS access key and region as parameters or set them up according to this manual.

It is also possible to upload the images to an AWS S3 bucket and analyze them from there. If that's desired, pass the S3 bucket name and an optional sub folder.

Assuming that the credentials are already set, this script will upload and process the data.

import form_analyzer

form_analyzer.run_textract('questionnaires')

The result data is saved as JSON files in the target folder. Before using AWS Textract, the function checks if result data is already present. If that is the case, the Textract call is skipped.
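
Because existing results are skipped, repeated runs are cheap. To see in advance which images would still be sent to Textract, a sketch like the following may help. It assumes that each result file carries the name of its image with a .json extension, which is not stated above; images_without_result is a hypothetical helper, not a form-analyzer function:

```python
from pathlib import Path


def images_without_result(folder):
    """List PNG files in folder that have no JSON file with the same stem (assumed naming scheme)."""
    folder = Path(folder)
    return sorted(png.name for png in folder.glob('*.png')
                  if not png.with_suffix('.json').exists())
```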

Work with Textract only

If you do not need the form processing, you can also directly use the generated JSON files with Textract Response Parser.

import glob
import json
import trp

for file_name in glob.glob('*.json'):
    with open(file_name) as f:
        doc = trp.Document([json.load(f)])

    for block in doc.blocks[0]['Blocks']:
        print(block.get('Text'))
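
If the Textract Response Parser is not installed, the JSON files can also be inspected with the standard library alone. The sketch below counts the blocks of one response by their BlockType (for example PAGE, LINE, WORD or KEY_VALUE_SET). It assumes that each file holds a single Textract response with a top-level Blocks list; both function names are hypothetical:

```python
import json
from collections import Counter


def count_block_types(response):
    """Count the blocks of a Textract response dictionary by their BlockType."""
    return Counter(block.get('BlockType', 'UNKNOWN') for block in response.get('Blocks', []))


def count_block_types_in_file(file_name):
    """Load one JSON result file and count its blocks by type."""
    with open(file_name) as f:
        return count_block_types(json.load(f))
```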

Form description

In order to convert your form to a meaningful Excel file, form-analyzer needs to know the expected form fields. A description has to be provided as a Python module.

This module needs to contain two variables:

  • form_fields: The list of form fields
  • keywords_per_page: A list of keywords to expect on each page

form_fields variable

This variable is a list of FormField objects, each of which describes a single field in the form. Each FormField object consists of a title and a Selector object. The title is the column header in the Excel file, and the Selector defines the type of the form field and its location.

Important: The form description greatly affects the result of the form analysis. AWS Textract often makes slight errors and does not yield 100% correct results. The form description needs to account for that: on the one hand, it should describe in detail where to look for form fields; on the other hand, it should keep search strings generic enough that the correct field is still detected despite recognition errors.

Selectors

Some selectors require a key, and all require a filter for initialization. The key is the label of the form field that is searched for in the extracted form data. It is recommended to specify not the full label but a unique part of it, to compensate for potential detection errors.

  • SingleSelect: Describes a list of checkboxes where only one may be marked
  • MultiSelect: Describes a list of checkboxes where none, one or several may be marked
  • TextField: Describes a text input box or input line where free text can be entered
  • TextFieldWithCheckbox: Describes a text input field with an additional checkbox
  • Number: Special case of TextField where only numbers may be entered
  • Placeholder: Results in an empty column in the Excel file

For single and multi selects, additional and alternative text fields can be given. The content of the additional field is always added to the output and can be used to handle optional free text fields. The alternative text field is used when no selection is made. Both additional and alternative fields can be either TextField, Number or TextFieldWithCheckbox.

Note that all text matching is case-insensitive and applies a certain fuzziness, so that no exact match is required.
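
To illustrate what case-insensitive, fuzzy matching means in practice, the following sketch uses difflib from the standard library. It is not the matching algorithm of form-analyzer, which is not described here; the function name fuzzy_equal and the threshold of 0.8 are assumptions chosen for the example:

```python
from difflib import SequenceMatcher


def fuzzy_equal(expected, found, threshold=0.8):
    """Compare two strings case-insensitively and tolerate small recognition errors."""
    ratio = SequenceMatcher(None, expected.lower().strip(), found.lower().strip()).ratio()
    return ratio >= threshold
```

With this threshold, a label that Textract reads with a single wrong character still matches, while a different option label does not.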

See also the documentation.

Filters

Filters restrict the set of extracted form fields that is searched for the current form field. The fewer extracted form fields remain as candidates, the higher the probability of correct results.

Filters can be combined using the & (and) and | (or) operators.

  • Page: Restricts the search to a certain page (page numbers starting with 0, so 0 is the first page)
  • Pages: Restricts the search to a list of pages
  • Location: Restricts the search to a part of the page indicated by horizontal and vertical ranges as page fractions
  • Selected: Restricts the search to fields which are selected checkboxes

Location filters apply to all selection possibilities for single and multi selects and to the label for text and number fields.

Note that when working with location filters and scanned form pages, the position of certain fields on the page must be similar for each scan.

See also the documentation.

Examples

from form_analyzer.filters import *
from form_analyzer.selectors import *

# Single select on the first page with two options
single_select = SingleSelect(['First option', 'Second option'], 
                             Page(0))

# Multi select on the top half of the first page
multi_select = MultiSelect(['First option', 'Second option'],
                           Page(0) & Location(vertical=(.0, .5)))

# Text field on the upper left quarter of the first page
text_field = TextField('Field label',
                       Page(0) & Location(horizontal=(.0, .5), vertical=(.0, .5)))

# Single select on the lowest third of the second page or the top half of the third page
single_select_2 = SingleSelect(['First option', 'Second option', 'Third option'],
                               (Page(1) & Location(vertical=(.66, 1))) |
                               (Page(2) & Location(vertical=(.0, .5))))
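
How filters that combine with & and | can work is sketched below with plain Python. This is an illustration of the idea, not the implementation in form_analyzer.filters: the names FieldFilter, on_page and in_region and the dictionary layout of a field are hypothetical. The field position is taken as the top-left corner in page fractions, similar to the Left and Top values of a Textract bounding box:

```python
class FieldFilter:
    """Wraps a predicate over a field and supports combination with & and |."""

    def __init__(self, predicate):
        self.predicate = predicate

    def __call__(self, field):
        return self.predicate(field)

    def __and__(self, other):
        return FieldFilter(lambda field: self(field) and other(field))

    def __or__(self, other):
        return FieldFilter(lambda field: self(field) or other(field))


def on_page(page):
    """Accept fields on the given zero-based page."""
    return FieldFilter(lambda field: field['page'] == page)


def in_region(horizontal=(0.0, 1.0), vertical=(0.0, 1.0)):
    """Accept fields whose top-left corner lies inside the given page fractions."""
    return FieldFilter(lambda field:
                       horizontal[0] <= field['left'] <= horizontal[1]
                       and vertical[0] <= field['top'] <= vertical[1])


# Hypothetical extracted fields with zero-based page and position as page fractions
fields = [
    {'label': 'Name', 'page': 0, 'left': 0.1, 'top': 0.2},
    {'label': 'Name', 'page': 1, 'left': 0.1, 'top': 0.8},
    {'label': 'Date', 'page': 2, 'left': 0.6, 'top': 0.3},
]

# Lowest third of the second page or top half of the third page
combined = (on_page(1) & in_region(vertical=(0.66, 1.0))) | (on_page(2) & in_region(vertical=(0.0, 0.5)))
candidates = [field for field in fields if combined(field)]
```

Only the second and third field remain as candidates, which shows how a narrow filter removes a same-named field on another page from the search.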

Keywords per page

The variable keywords_per_page in the form description is used to validate that the correct form is being analyzed. It is a list of lists of strings. For each page, a list of strings can be given, at least one of which has to be found in the strings discovered by Textract on that page.

If the outer list is empty, or if the list for a single page is empty, no validation is performed for the affected pages.

Example

# Will search for 'welcome' on the first page and for 'future' or 'past' on the second
keywords_per_page = [['welcome'], ['future', 'past']]
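
The validation rule can be pictured with the following sketch. It is a simplified stand-in, not form-analyzer's code: it uses a plain case-insensitive substring test instead of fuzzy matching, and the function name page_is_valid is hypothetical:

```python
def page_is_valid(page_index, page_strings, keywords_per_page):
    """Return True if the page contains one of its keywords or has no keywords defined."""
    if page_index >= len(keywords_per_page) or not keywords_per_page[page_index]:
        # No keywords given for this page: nothing to validate
        return True
    text = ' '.join(page_strings).lower()
    return any(keyword.lower() in text for keyword in keywords_per_page[page_index])


keywords_per_page = [['welcome'], ['future', 'past']]
first_page_ok = page_is_valid(0, ['Welcome to the survey'], keywords_per_page)
second_page_ok = page_is_valid(1, ['About the present'], keywords_per_page)
```

Here the first page passes because it contains 'welcome', while the second page fails because neither 'future' nor 'past' occurs in its text.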

Form analysis

The data returned from AWS Textract and the form description are the inputs for the final analysis step, which tries to locate all described form fields, reads their values from the respective filled forms and writes them to an Excel file.

To run the analysis, use the following script. It assumes that the AWS Textract JSON files and the PNGs are located in the folder "questionnaires" and that a Python module "my_form" containing the form description exists in the Python search path (usually a file "my_form.py" in the current folder). You can optionally pass the name of the resulting Excel file.

import form_analyzer

form_analyzer.analyze('questionnaires', 'my_form', 'my_form_results')

Results

After analyzing, an Excel file is created. The first column always contains a link to the image of the first page of the form. Each uncertain field (meaning that there was some uncertainty during the analysis and the result might be incorrect) is also linked to the image of the page where the field is located.

The results usually need to be checked manually. Depending on the complexity of the form, the quality of the inputs, the PDF quality and similar factors, the Excel file might contain errors. The number of uncertain fields found is printed after the analysis and can be used as a coarse measure of the quality of the results.
