Is Grobid able to OCR papers ? #507
Comments
Hello @AaronNGray! GROBID does not perform OCR; it's considered a bit out of scope (though that's debatable), and we prefer to let users pick their OCR engine of choice (with an Abbyy license, the OCR is usually better than with Tesseract, for instance). However, it's quite easy to combine GROBID with Tesseract:
then:
then apply GROBID to the PDF
then:
and then apply GROBID to the PDF. The result is not wonderful, likely due to the low image quality of this old article, but it depends on your requirements: Reynolds74WithText.pdf.tei.xml.gz It would be interesting to compare the final result with Abbyy or other OCR engines (and with Tesseract 4.0, which uses deep learning)!
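For anyone wanting to script this, here is a minimal sketch of one way to chain the steps. It is not necessarily the exact set of commands used in the comment above. It assumes `pdftoppm` (Poppler), `tesseract`, and `curl` are on the PATH and that a GROBID server is listening on the default local port 8070; the `work` directory layout and the `WithText` file suffix are choices made for this illustration only.

```python
import subprocess
from pathlib import Path

# Assumption: GROBID runs locally on its default port; adjust to your setup.
GROBID_URL = "http://localhost:8070/api/processFulltextDocument"


def build_ocr_commands(pdf_path, work_dir, dpi=300):
    """Return the rasterize, OCR, and GROBID commands as argument lists."""
    pdf = Path(pdf_path)
    work = Path(work_dir)
    page_prefix = work / "page"                # pdftoppm writes page-<n>.png
    image_list = work / "pages.txt"            # one image path per line
    ocr_base = work / (pdf.stem + "WithText")  # tesseract appends ".pdf"

    rasterize = ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(page_prefix)]
    # Given a text file of image paths, tesseract builds one multi-page PDF
    # containing the page images plus an invisible text layer.
    ocr = ["tesseract", str(image_list), str(ocr_base), "pdf"]
    grobid = ["curl", "--form", f"input=@{ocr_base}.pdf", GROBID_URL]
    return rasterize, ocr, grobid


def run_pipeline(pdf_path, work_dir):
    """Run the three steps and return the TEI XML produced by GROBID."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    rasterize, ocr, grobid = build_ocr_commands(pdf_path, work)

    subprocess.run(rasterize, check=True)
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    pages = sorted(work.glob("page-*.png"))
    (work / "pages.txt").write_text("\n".join(str(p) for p in pages) + "\n")
    subprocess.run(ocr, check=True)
    result = subprocess.run(grobid, check=True, capture_output=True, text=True)
    return result.stdout
```

Calling `run_pipeline("Reynolds74.pdf", "work")` would return the TEI as a string. Swapping in another OCR engine only means replacing the `ocr` command, as long as it produces a PDF with a text layer.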
Just for reference, your second and third articles already contain a text layer in the PDF. But the quality of the OCR in the second one, 195609-.pdf, is so low that it's like having no OCR at all... The third article, Chomsky_1959.pdf, is fine, and GROBID works with it without any extra steps. Here are the results: 195609-WithText.pdf.tei.xml.gz Chomsky_1959.pdf.tei.xml.gz
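A rough way to check whether a PDF already carries a usable text layer before deciding to re-OCR it is sketched below. It assumes Poppler's `pdftotext` is installed. The "word-like ratio" is an ad hoc heuristic made up for this illustration, not something GROBID computes, and any cutoff for "too low" would need tuning on your own documents.

```python
import subprocess


def extract_text(pdf_path):
    """Dump the existing text layer with pdftotext ("-" means stdout)."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def text_layer_stats(extracted_text):
    """Summarize pdftotext output; pages are separated by form feeds."""
    pages = [p for p in extracted_text.split("\f") if p.strip()]
    tokens = extracted_text.split()
    # A token counts as word-like if it is purely alphabetic once
    # surrounding punctuation is stripped.
    wordlike = [t for t in tokens if t.strip(".,;:()[]\"'").isalpha()]
    return {
        "pages_with_text": len(pages),
        "tokens": len(tokens),
        "wordlike_ratio": len(wordlike) / len(tokens) if tokens else 0.0,
    }
```

A scan with no text layer gives zero tokens, while a poor OCR layer tends to show a low `wordlike_ratio` because of stray symbols and broken words. For example, `text_layer_stats(extract_text("195609-.pdf"))` would give a quick hint about whether re-running OCR is worth trying.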
The References/Citations in Chomsky_1959.pdf are very badly interpreted!
Thanks a lot for Grobid! I have Grobid running in docker locally now, which is nice :)
Ah yes, you're right, the reference list of Chomsky_1959.pdf is very bad (I had only looked at the beginning); the reference section is badly segmented from the body.
At this point, I'd recommend that older papers receive special treatment in the process of digitization. They're likely to differ in the way they present references, and are also likely to contain OCR mistakes; that Chomsky paper certainly contains a few in the references section. It's a hard problem, and it comes close to some of the context issues in speech-to-text recognition: clearly the "h" in Chomsky is an "h" and not an "n", but that kind of entity recognition relies largely on context, as well as on recognition of the font style and casing. I'd like to see something of a more pluggable system: if we can run our OCR and PDF ALTO directives as a specialized loop, Grobid can work its finer magic. With this paper in mind, I wonder if some model hints would be helpful, e.g., "publish date of 1950s".
Working against a compiled dictionary for subject areas from more modern papers might help.
Hello Patrice! @kermitt2 Let me ask a related question, please. It is about understanding why some text is skipped by GROBID. Intro I am using GROBID (dev version) with
The resulting XML markup is available here. I extracted I am able to see that GROBID is missing some pieces of text despite the fact that they are recognized by the OCR engine. I would appreciate knowing your thoughts on the reasons for that. Is it because the resulting text is messy and the Pragmatic Segmenter decides that it is not a sentence? For example, the following highlighted region is recognized in the way shown after the image.
Or maybe the problem occurs because the PDF contains some line that was not recognized by the OCR engine. I would appreciate your thoughts on that. Thank you in advance!
Perhaps that is because, except for the segmentation model, all other models, including the fulltext model and the header model, work at token level. So if they are not able to mark a token as part of, say, the abstract or a paragraph, then the post-processing rules remove it. @kermitt2 can give more insight into why this happens!
I've double-checked. The issue lies in the fulltext model. There are big chunks of text that are tagged as table. It's a known issue, also considering that there isn't much training data for the fulltext model.
There is a WIP PR #963 from @kermitt2 which aims to solve this issue and many others related to table and figure recognition. Due to limited time, the PR has not moved forward much (and it might need some updates).
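To check whether missing body text ended up labelled as a table in your own output, one option is to list the content of the table elements in the TEI. This is a small sketch using only the Python standard library. It assumes tables are serialized as `<figure type="table">` elements in the TEI namespace, which is how GROBID's fulltext output usually looks; the function name is just illustrative.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def table_text_blocks(tei_xml):
    """Return the flattened text of every <figure type="table"> element."""
    root = ET.fromstring(tei_xml)
    blocks = []
    for fig in root.iterfind(".//tei:figure[@type='table']", TEI_NS):
        # Join all nested text nodes and collapse runs of whitespace.
        text = " ".join(" ".join(fig.itertext()).split())
        blocks.append(text)
    return blocks
```

Reading a result file as bytes (for example `Path("paper.tei.xml").read_bytes()`) and printing each block makes it easy to spot long runs of ordinary prose that were classified as table content.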
Does Grobid do OCR? I am trying to get Grobid to process older PDFs like:
https://www.cs.cmu.edu/~crary/819-f09/Reynolds74.pdf
https://chomsky.info/wp-content/uploads/195609-.pdf
http://somr.info/lib/Chomsky_1959.pdf