OCR evaluation brought to you by University of Alicante

tesseract4java/ocrevalUAtion

ocrevalUAtion

This set of classes provides basic support for comparing two text files: a reference file (a ground-truth document) and the output of an OCR engine (a text file).

Options for specific behavior include: ignore case, ignore diacritics, ignore punctuation, ignore stop-words, Unicode and user-defined equivalences between characters.
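To illustrate what options such as ignore case, ignore diacritics and ignore punctuation mean in practice, here is a minimal sketch of a normalization step that could be applied to both texts before comparing them. The class `TextNormalizerSketch` and its `normalize` method are hypothetical names made up for this example; they are not part of the ocrevalUAtion API, and the tool's own implementation may differ.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper for illustration only; not ocrevalUAtion code.
class TextNormalizerSketch {

    static String normalize(String text, boolean ignoreCase,
                            boolean ignoreDiacritics, boolean ignorePunctuation) {
        String s = text;
        if (ignoreDiacritics) {
            // Decompose accented letters (NFD), then drop the combining marks.
            s = Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
        }
        if (ignorePunctuation) {
            // Remove every character in the Unicode punctuation categories.
            s = s.replaceAll("\\p{P}+", "");
        }
        if (ignoreCase) {
            // Locale.ROOT avoids locale-specific surprises (e.g. the Turkish dotless i).
            s = s.toLowerCase(Locale.ROOT);
        }
        return s;
    }

    public static void main(String[] args) {
        // \u00C9 is "E" with an acute accent.
        System.out.println(normalize("\u00C9valuation, OCR!", true, true, true));
    }
}
```

Applying the same normalization to the reference and to the OCR output means that differences in case, accents or punctuation no longer count as errors.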

It can be used through the provided graphical user interface (GUI) or from the command line.

Supported input formats include: plain text, FineReader 10 XML, PAGE XML, ALTO XML and hOCR HTML.

The output is a report with statistics (including the character error rate, CER, and the word error rate, WER) and a table showing the input texts in parallel with their differences highlighted.
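As a rough illustration of the two error rates, the sketch below uses the common textbook definition: the Levenshtein (edit) distance between reference and OCR output, divided by the length of the reference, counted in characters for CER and in whitespace-separated words for WER. This definition, the unit edit costs and the class name `ErrorRateSketch` are assumptions made for the example; the figures reported by ocrevalUAtion may be computed differently.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch for illustration only; not ocrevalUAtion code.
class ErrorRateSketch {

    // Levenshtein distance with unit costs for insertion, deletion and substitution.
    static <T> int editDistance(List<T> ref, List<T> hyp) {
        int[] prev = new int[hyp.size() + 1];
        int[] curr = new int[hyp.size() + 1];
        for (int j = 0; j <= hyp.size(); j++) {
            prev[j] = j; // an empty reference prefix needs j insertions
        }
        for (int i = 1; i <= ref.size(); i++) {
            curr[0] = i; // an empty output prefix needs i deletions
            for (int j = 1; j <= hyp.size(); j++) {
                int cost = ref.get(i - 1).equals(hyp.get(j - 1)) ? 0 : 1;
                int substitution = prev[j - 1] + cost;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                curr[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[hyp.size()];
    }

    // Splits on runs of whitespace; an empty or blank text has no words.
    static List<String> words(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.<String>emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Character error rate: edit distance over code points / reference length.
    static double cer(String reference, String output) {
        List<Integer> ref = reference.codePoints().boxed().collect(Collectors.toList());
        List<Integer> hyp = output.codePoints().boxed().collect(Collectors.toList());
        if (ref.isEmpty()) {
            throw new IllegalArgumentException("reference must not be empty");
        }
        return (double) editDistance(ref, hyp) / ref.size();
    }

    // Word error rate: edit distance over words / number of reference words.
    static double wer(String reference, String output) {
        List<String> ref = words(reference);
        if (ref.isEmpty()) {
            throw new IllegalArgumentException("reference must contain at least one word");
        }
        return (double) editDistance(ref, words(output)) / ref.size();
    }

    public static void main(String[] args) {
        String reference = "the quick brown fox";
        String output = "the quikc brown fax";
        System.out.println("CER = " + cer(reference, output));
        System.out.println("WER = " + wer(reference, output));
    }
}
```

Because the denominator is the reference length, both rates can exceed 1 when the OCR output contains many spurious insertions.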

A gentle introduction to OCR evaluation and to this tool can be found at https://sites.google.com/site/textdigitisation/

You can download the latest release from here.

Instructions on how to use ocrevalUAtion can be found in the wiki.
