CO.SHS Project | Digitizing Analog Text using OCR and Image Quilting

vialab/Textension

Textension

The Textension platform automatically adds visualizations and natural language processing applications to analog texts, using any web-based device with a camera. After you take a picture of a page or set of pages from a book, or upload an existing image, our system builds an interactive digital object that automatically inserts modular elements in a digital space. Leveraging the findings of previous studies, our framework augments the reading of analog texts with digital tools, making it possible to work with texts in both digital and analog environments.

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.

Prerequisites

This software was written in Python 2.7 and is not compatible with later versions of Python.

apt-get update && apt-get install -y \
        build-essential \
        python2.7 \
        python-pip \
        tesseract-ocr \
        libtesseract-dev \
        libleptonica-dev \
        libmagickwand-dev
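
Before moving on, it can help to confirm that the interpreter and command-line tools are actually visible. The following is a minimal sketch, not part of the project: the helper names are hypothetical, and it assumes the packages above provide the `tesseract` and `pip` executables.

```python
import sys

try:
    from shutil import which  # Python 3.3+
except ImportError:
    from distutils.spawn import find_executable as which  # Python 2.7


def is_supported_python(version_info):
    """Return True only for Python 2.7.x, the version this README requires."""
    return tuple(version_info[:2]) == (2, 7)


def missing_executables(names):
    """Return the executables from `names` that are not found on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    if not is_supported_python(sys.version_info):
        print("Warning: expected Python 2.7, found %d.%d" % sys.version_info[:2])
    # Executable names are assumptions about what the apt packages install.
    for name in missing_executables(["tesseract", "pip"]):
        print("Not found on PATH: %s" % name)
```

Run it with the interpreter you intend to use for the project; no output means both checks passed.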

Installing

To get a local (non-containerized) version running on your machine, run the commands below (UNIX).

Install and activate a Python 2.7 virtual environment (virtualenv):

cd /path/to/Textension
virtualenv -p python2.7 venv
source venv/bin/activate
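
If you are unsure whether the environment is active, a short check such as the following can tell you. This is a sketch using a common idiom, not project code; `in_virtualenv` is a hypothetical helper name.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; newer virtualenv
    # releases and the venv module set sys.base_prefix instead.
    return hasattr(sys, "real_prefix") or (
        getattr(sys, "base_prefix", sys.prefix) != sys.prefix
    )


if __name__ == "__main__":
    print("virtualenv active: %s" % in_virtualenv())
```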

Install Python dependencies:

Recommended:

pip install -r requirements.txt

Or, alternatively:

pip install flask numpy scikit-learn scipy spacy pypdf2 pdfminer wand pillow \
google-api-python-client matplotlib opencv-python pandas pyocr textstat tesserocr

Then, in either case, download the spaCy English model:

python -m spacy download en
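
To confirm that the dependencies installed correctly, you can try importing each one. The sketch below is not part of the project: `missing_modules` is a hypothetical helper, and the import names are assumptions based on the packages in the pip command above (several differ from the name used on PyPI, for example `scikit-learn` imports as `sklearn`).

```python
import importlib

# Import names assumed for the packages listed in the pip command.
REQUIRED_MODULES = [
    "flask", "numpy", "sklearn", "scipy", "spacy", "PyPDF2", "pdfminer",
    "wand", "PIL", "googleapiclient", "matplotlib", "cv2", "pandas",
    "pyocr", "textstat", "tesserocr",
]


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except Exception:
            # Any failure while importing counts as "missing" for this check,
            # including errors from absent system libraries.
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in missing_modules(REQUIRED_MODULES):
        print("Could not import: %s" % name)
```

Run it inside the activated virtual environment; no output means every module imported.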

Running Flask

After a successful install, you should be able to run the Flask server with the following commands:

export FLASK_APP=file_upload.py
flask run

Debugging

All development was done in Visual Studio Code (https://code.visualstudio.com/), so the /.vscode files are provided for easy debugging. Install the IDE and its Python extension, set the debug configuration to Flask (note that this is not the same as Flask (Old)), and press play.

The project should be available at the URL:

http://localhost:5000/
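
A quick way to confirm the server is answering at that URL is a small smoke test. This is a sketch only: `server_is_up` is a hypothetical helper, and it assumes the root path returns a successful response once the app is running.

```python
try:
    from urllib.request import urlopen  # Python 3
    from urllib.error import URLError
except ImportError:
    from urllib2 import urlopen, URLError  # Python 2.7


def server_is_up(url="http://localhost:5000/", timeout=3.0):
    """Return True if `url` answers with a 2xx or 3xx status, False otherwise."""
    try:
        response = urlopen(url, timeout=timeout)
    except (URLError, IOError):
        # Connection refused, timeout, or an HTTP error status.
        return False
    try:
        return 200 <= response.getcode() < 400
    finally:
        response.close()


if __name__ == "__main__":
    print("server reachable: %s" % server_is_up())
```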

Deployment

Deployment for this project is automated: pushes to this repository automatically build, run, and deploy to the VIALAB production servers at https://textension.vialab.ca/. Deployment handles database connections automatically by injecting the required environment variables in Kubernetes.
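
As an illustration of how an application can consume injected environment variables, here is a minimal sketch. It is not taken from the project: the variable names (`DB_HOST`, `DB_NAME`, `DB_USER`, `DB_PASSWORD`) and the `load_db_config` helper are hypothetical, and the names used in the actual deployment may differ.

```python
import os

# Hypothetical variable names; the real deployment may use different ones.
REQUIRED_VARS = ("DB_HOST", "DB_NAME", "DB_USER", "DB_PASSWORD")


def load_db_config(environ=None):
    """Build a database config dict from environment variables.

    Raises RuntimeError naming every required variable that is unset or
    empty, so a misconfigured deployment fails at startup.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
        )
    return {
        "host": environ["DB_HOST"],
        "name": environ["DB_NAME"],
        "user": environ["DB_USER"],
        "password": environ["DB_PASSWORD"],
    }
```

Passing a plain dict instead of `os.environ` makes the function easy to exercise locally, where the Kubernetes-injected values are not available.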

Built With

  • Flask - The web framework used (Python 2.7)
  • Jinja2 - Template engine
  • Bootstrap - Front-end component library
  • Docker - Container / Dependency management

Versioning

This project is being developed iteratively. No releases have been made yet, and the project is subject to drastic changes. No versioning practices will be followed until release. For a history of changes made to this project, see the commit history.

Authors

  • Adam Bradley, PhD. - Research Associate
  • Christopher Collins, PhD. - Research Supervisor
  • Victor (Jay) Sawal, BSc. - Software Developer

License

This research was conducted as part of the CO.SHS project (co-shs.ca) and has received financial support from the Canada Foundation for Innovation (Cyberinfrastructure Initiative – Challenge 1 – First competition).

Acknowledgments

  • Richard Drake, MSc. - Laboratory Technician (Science Building)
