CLI tool that recognises handwritten text from answer sheets using Tesseract OCR, then uses the extracted text to evaluate marks with NLP.

mayurcybercz/AI-Exam-evaluation

Installation:
Install the Tesseract OCR engine: https://github.com/tesseract-ocr/tesseract/wiki
Install the Python dependencies: pytesseract, pillow, pandas, numpy, matplotlib, lxml (lxml is imported by the HOCR conversion code)
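
A quick way to confirm the setup before running the tool is sketched below. It assumes the Tesseract binary is called `tesseract` and is on the PATH, which may not hold for the Windows install path that main.py points at. The names `REQUIRED_MODULES` and `missing_modules` are illustrative and not part of the repository. Note that pillow is imported as `PIL`.

```python
import importlib.util
import shutil

# Import names of the dependencies listed above (pillow is imported as PIL).
REQUIRED_MODULES = ["pytesseract", "PIL", "pandas", "numpy", "matplotlib", "lxml"]


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing Python packages: " + ", ".join(missing))
    # Assumption: the executable is named 'tesseract' and is on the PATH.
    if shutil.which("tesseract") is None:
        print("tesseract not found on PATH; set the executable path in main.py")
```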

Usage:
1) Clone the repository into your working directory.
2) Update the path to the tesseract executable in main.py.
3) Add the image you want to test to the images folder.
4) Run main.py imagename
It returns an HOCR file, which is very similar to XHTML.
5) Run file_conversion.py hocrfilename
It converts the HOCR file into a dataframe and stores the output in a pickle file and a JSON file.
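
The command-line step in 4) could be handled along the following lines. This is a hypothetical sketch, not the repository's actual main.py: the helper names `parse_args` and `image_path`, and the assumption that the image name is resolved relative to the images folder, are illustrative only.

```python
import argparse
import os

# Assumption: test images live in the repository's "images" folder.
IMAGES_DIR = "images"


def parse_args(argv=None):
    """Parse the single positional argument: the image file name."""
    parser = argparse.ArgumentParser(description="Run OCR on an answer-sheet image.")
    parser.add_argument("imagename", help="file name of an image in the images folder")
    return parser.parse_args(argv)


def image_path(imagename, images_dir=IMAGES_DIR):
    """Build the path of the image inside the images folder."""
    return os.path.join(images_dir, imagename)


# Example with an explicit argument list instead of the real command line.
args = parse_args(["testfile.jpg"])
print(image_path(args.imagename))
```

Passing an explicit list to `parse_args` keeps the example runnable in a notebook; calling it with no argument reads `sys.argv` as a normal CLI would.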

Phase 1: demonstration of OCR on handwritten text and export to JSON
(Rendered Python notebook displayed as markdown using nbconvert)

Phase 2: using nltk to create an NLP model to evaluate answers

Download all the packages using the nltk downloader:

import nltk
nltk.download()
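
The README does not describe the Phase 2 model itself. Purely as an illustration of how marks might be derived from extracted text, the sketch below scores an answer by keyword coverage using only the standard library. It is a stand-in technique, not the repository's nltk model, and the keyword list and maximum marks are made up for the example.

```python
import re


def tokenize(text):
    """Lower-case the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_score(answer, keywords, max_marks):
    """Award marks in proportion to how many expected keywords appear."""
    wanted = {keyword.lower() for keyword in keywords}
    if not wanted:
        return 0.0
    matched = wanted & tokenize(answer)
    return max_marks * len(matched) / len(wanted)


# Hypothetical marking scheme for "Define electromagnetic induction."
expected = ["conductor", "magnetic", "field", "voltage", "current"]
print(keyword_score("Conductor in magnetic Field produce voltage", expected, 5))
```

Exact keyword matching is brittle with OCR noise, so a real model would likely need stemming or fuzzy matching on top of this.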


from pytesseract import pytesseract
import sys
import os
# Edit the path to the tesseract executable if your installation directory differs

pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract'
from datetime import datetime

def replaceMultiple(mainString, toBeReplaces, newString):
    # Replace every substring listed in toBeReplaces with newString
    for elem in toBeReplaces:
        if elem in mainString:
            mainString = mainString.replace(elem, newString)
    return mainString

def generateFilename():
    # Build a filename from the current timestamp by stripping its separators
    mainStr = str(datetime.now())
    file_name = replaceMultiple(mainStr, [':', '-', '.', ' '], "")
    return file_name
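
An alternative to stripping separators with replaceMultiple is to format the timestamp directly. The sketch below uses `datetime.strftime`; the function name `timestamp_filename` is illustrative and not part of the repository. One difference worth knowing: `str(datetime.now())` omits the fractional part when the microsecond value is exactly zero, whereas `%f` always emits six digits, so this version always returns 20 characters.

```python
from datetime import datetime


def timestamp_filename(now=None):
    """Return a timestamp such as 20181021042317089205 for use as a file name."""
    if now is None:
        now = datetime.now()
    # Year, month, day, hour, minute, second, microsecond with no separators.
    return now.strftime("%Y%m%d%H%M%S%f")


example = datetime(2018, 10, 21, 4, 23, 17, 89205)
print(timestamp_filename(example))  # 20181021042317089205
```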
from PIL import Image
from IPython.display import display
import matplotlib.pyplot as plt

im = Image.open("testfile1.jpg")
fig, ax = plt.subplots()
ax.imshow(im)
print("(width,height):"+str(im.size))
(width,height):(3000, 3115)
box=(250,180,2800,400)  # (left, upper, right, lower) in pixels
cropped_image = im.crop(box)
display(cropped_image)
cropped_text= pytesseract.image_to_string(cropped_image, lang = 'eng')
print(cropped_text)


Conductor wn magnetic Field Produce voltage :
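
The crop box passed to `Image.crop` is a `(left, upper, right, lower)` tuple in pixels. When boxes are chosen by hand for different scans, it can help to keep them inside the image bounds first. The helper below is a minimal sketch under that assumption; `clamp_box` is an illustrative name, not a Pillow or repository function.

```python
def clamp_box(box, size):
    """Clamp a (left, upper, right, lower) box to an image of (width, height)."""
    left, upper, right, lower = box
    width, height = size
    left = max(0, min(left, width))
    right = max(left, min(right, width))
    upper = max(0, min(upper, height))
    lower = max(upper, min(lower, height))
    return (left, upper, right, lower)


# The box used above already fits inside the 3000 x 3115 image.
print(clamp_box((250, 180, 2800, 400), (3000, 3115)))  # (250, 180, 2800, 400)
# A box that runs past the right edge is trimmed to the image width.
print(clamp_box((250, 180, 3200, 400), (3000, 3115)))  # (250, 180, 3000, 400)
```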
def createHOCR(imagepath):
	filename= generateFilename()
	pytesseract.run_tesseract(imagepath, filename, lang=None,extension='html', config="hocr")
	print("HOCR file generated: "+str(filename)+".hocr")
createHOCR("testfile.jpg")
HOCR file generated: 20181021042317089205.hocr
from lxml import etree
import pandas as pd
import os
import sys
import generate_filename as gf
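def hocr_to_dataframe(fp):
    # Parse an HOCR file and return a dataframe of recognised words and their confidence
    doc = etree.parse(fp)
    words = []
    wordConf = []

    for path in doc.xpath('//*'):
        # Word elements carry the class 'ocrx_word'
        if 'ocrx_word' in path.values():
            # The title attribute holds the confidence as 'x_wconf <value>'
            conf = [x for x in path.values() if 'x_wconf' in x][0]
            wordConf.append(int(conf.split('x_wconf ')[1]))
            words.append(path.text)

    dfReturn = pd.DataFrame({'word': words,
                             'confidence': wordConf})

    return dfReturn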
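
The confidence is read by splitting the title attribute on 'x_wconf ', which works when the confidence is the last property in the string. A slightly more defensive option, sketched here with an illustrative helper name `word_confidence`, is to pull the number out with a regular expression so that any properties following it do not break the integer conversion. The sample title strings are assumptions about typical HOCR output, not taken from the repository's files.

```python
import re

# Matches 'x_wconf' followed by whitespace and an integer confidence value.
WCONF_PATTERN = re.compile(r"\bx_wconf\s+(\d+)")


def word_confidence(title):
    """Return the integer x_wconf value from an HOCR title string, or None."""
    match = WCONF_PATTERN.search(title)
    return int(match.group(1)) if match else None


print(word_confidence("bbox 36 92 96 116; x_wconf 89"))   # 89
print(word_confidence("x_wconf 42; bbox 36 92 96 116"))   # 42
print(word_confidence("bbox 36 92 96 116"))               # None
```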
filename=generateFilename()
dataframe=hocr_to_dataframe("20181021041156998790.hocr")
dataframe
    word                              confidence
0                                     95
1                                     95
2   Q1.                               89
3   Define                            96
4   electromagnetic                   96
5   induction.                        95
6   Sane                              23
7   |                                 90
8   Conductor                         93
9   mM                                42
10  magnetic                          70
11  Field                             63
12  produce                           67
13  voltage                           65
14  ‘Seconaewctntmnstnn               0
15  esionainsnenaneenrenncconanniiti  0
16  Q2.                               89
17  What                              96
18  are                               96
19  3                                 96
20  examples                          96
21  of                                95
22  transparent                       95
23  objects?                          96
24  (Professor                        96
25  provides                          96
26  5                                 96
27  as                                95
28  input)                            90
29                                    95
30  Q3.                               92
31  Complete                          96
32  the                               96
33  network                           95
34  tree.                             96
35                                    95
dataframe.to_json(filename+".json",orient='columns')
print("JSON generated: "+filename+".json")
dataframe.to_pickle(filename+".pkl")
print("Pickle generated: "+filename+".pkl")
JSON generated: 20181021042319190731.json
Pickle generated: 20181021042319190731.pkl
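
With orient='columns', the JSON file maps each column name to an object keyed by the row index as a string. The sketch below shows one way to read that layout back with the standard library and keep only words above a confidence threshold. The function name `confident_words` and the threshold of 60 are arbitrary choices for illustration, and the sample string is a hand-written excerpt in the same shape rather than the real output file.

```python
import json


def confident_words(columns, threshold=60):
    """Return words in row order whose confidence is at least `threshold`."""
    words = columns["word"]
    confidences = columns["confidence"]
    # Row indices are stored as strings, so sort them numerically.
    keys = sorted(words, key=int)
    return [
        words[key]
        for key in keys
        # Skip empty or missing words as well as low-confidence ones.
        if words[key] and words[key].strip() and confidences[key] >= threshold
    ]


sample = json.loads(
    '{"word": {"2": "Q1.", "6": "Sane", "8": "Conductor", "9": "mM"},'
    ' "confidence": {"2": 89, "6": 23, "8": 93, "9": 42}}'
)
print(confident_words(sample))  # ['Q1.', 'Conductor']
```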
