ErrorMessage repeated many times #165
Comments
Just looking at this from our end as well, and we're seeing the same issue. Attached is a 2 MB XML output from JHOVE 1.14.6 using the JHOVE PDF-HUL module 1.7. The code where the message appears: jhove/jhove-modules/src/main/java/edu/harvard/hul/ois/jhove/module/pdf/Destination.java, line 93 in 08baeef.
Started to investigate. I'm not a PDF expert, but I have context to add. The messages are generated in this block of code.
In the example PDF provided in the original issue, there is an index at the end of the document that is supposed to have internal references to the corresponding sections within the document. When you hover over these in the PDF the cursor turns to a hand, but when you click, it does not go to the text section. So every reference in the index is invalid, and indeed there are a few thousand of these. If you open the PDF as text, you can count lines that look like this:
So the messages appear to be legitimate, but obviously they're excessive. We need to decide how to better handle these. Perhaps one of these options:
(a) each unique failed reference could be listed once (e.g. one message for all "lab:TUP093"); in this case that would still be a few hundred messages,
(b) there could be a maximum on the number of messages of this type generated, or
(c) consolidate into a general message (PDF has x failed internal references).
Any other ideas? What is the preferred approach? I'm not sure why the offset is repetitive, or whether it should be; that needs further investigation.
There is another possible issue mixed in that is adding some confusion. The error messages should be labelled as "149" errors, but instead they are getting caught in an Exception catch and relabelled as 122 errors. I think there should be a change here to preserve the specific error number.
There are 2 commits on this branch that may be helpful for continuing if I don't complete this. One has a test using the PDF provided so that you can step through; the other catches the 149 error. These are rough and incomplete.
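To make options (a) to (c) above easier to compare, here is a minimal sketch of how they could be combined. It is not JHOVE code: `MessageConsolidator`, its methods, and the "Invalid destination: ..." message text are all hypothetical names chosen for illustration, and the sketch assumes messages can be keyed by their text alone.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of JHOVE) that collapses repeated messages. */
final class MessageConsolidator {
    // Insertion-ordered map from message text to number of occurrences.
    private final Map<String, Integer> counts = new LinkedHashMap<>();
    private final int maxReported;

    MessageConsolidator(int maxReported) {
        this.maxReported = maxReported;
    }

    /** Records one occurrence of a message, keyed by its text. */
    void record(String message) {
        counts.merge(message, 1, Integer::sum);
    }

    /** Options (a) and (b): each unique message once with its count, up to the cap. */
    List<String> report() {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (out.size() >= maxReported) {
                break;
            }
            out.add(e.getKey() + " (x" + e.getValue() + ")");
        }
        return out;
    }

    /** Option (c): a single general message covering all occurrences. */
    String summary() {
        int total = 0;
        for (int n : counts.values()) {
            total += n;
        }
        return "PDF has " + total + " failed internal references ("
                + counts.size() + " unique)";
    }
}
```

With a cap of 2 and four recorded failures across three distinct references, `report()` would list the first two unique messages with their counts, and `summary()` would give one line covering all four. Whether the count belongs in the message text or in a separate field is a design decision this sketch does not settle.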
@karenhanson, this is great, thanks, and a lot clearer now. As you say, there needs to be a decision or two before a resolution. I'll pick this up before the week's out, but today is difficult.
Was just testing v1.28-RC with some of the problem files I'd looked at in the past. For this one, I noticed that something interesting has happened. It went from 2123 messages in 1.24, to 4246 messages in 1.26 when something caused 2 variations of each message to appear. One variation (PDF-HUL-122) uses an offset and class. The other has no offset or class (PDF-HUL-149):
Now in 1.28, we have 763 messages due to the new message deduping, but slightly fewer than half of these are still duplicates because of the two variations. Also note that the messages that reference an offset all use the same number. Update: this seems like it might be related to work on this ticket: #277
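The earlier comment in this thread notes that the 149 errors are caught in a general Exception catch and relabelled as 122. As a minimal sketch of what "preserving the specific error number" could look like, assuming the exception carries its own ID: `IdentifiedException`, `IdPreservingHandler`, and `checkDestination` are hypothetical names, not JHOVE classes, and the real exception types may not expose an ID this way.

```java
/** Hypothetical stand-in for a validation exception that carries a message ID. */
class IdentifiedException extends Exception {
    private final String id;

    IdentifiedException(String id, String message) {
        super(message);
        this.id = id;
    }

    String getId() {
        return id;
    }
}

final class IdPreservingHandler {
    // Generic ID used only when no specific ID is available (assumption).
    static final String GENERIC_ID = "PDF-HUL-122";

    /** Hypothetical check that always fails with the specific ID. */
    static void checkDestination() throws IdentifiedException {
        throw new IdentifiedException("PDF-HUL-149", "Invalid destination");
    }

    /** Returns the ID that would be reported, or null if the check passes. */
    static String reportedId() {
        try {
            checkDestination();
            return null;
        } catch (IdentifiedException e) {
            // Specific catch first: keep the ID the exception was raised with.
            return e.getId();
        } catch (Exception e) {
            // Fallback for anything that has no ID of its own.
            return GENERIC_ID;
        }
    }
}
```

Ordering the specific catch before the general one means the failure is reported once under its original ID, instead of appearing under two IDs.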
Hi @karenhanson, thanks for the insight here. I'll take a look and see what's happening. There have been some changes to error reporting to prevent the creation of old, ID-less messages. I'll investigate what look like multiple possible causes, but unless anything was directly "broken" by 1.28 this won't be addressed until we begin 1.30 development.
Dev Effort
1D
Description
I've encountered several PDF files for which the same error message is repeated thousands of times. This is a problem when the JHOVE output is embedded in repositories, where the validation error message is stored in an index so that files with the same error can be handled together later on.
So far, I have identified two error messages that show this behavior:
edu.harvard.hul.ois.jhove.module.pdf.PdfMalformedException: Invalid name tree
edu.harvard.hul.ois.jhove.module.pdf.PdfInvalidException: Invalid destination
Attached is a sample file that repeats "edu.harvard.hul.ois.jhove.module.pdf.PdfInvalidException: Invalid destination" a total of 2122 times, distributed across the following 4 offsets:
1425 repetitions of the error message for Offset: 1146553
663 repetitions of the error message for Offset: 1125321
26 repetitions of the error message for Offset: 1140429
8 repetitions of the error message for Offset: 1113971
Sample file:
conference_guide_v9_20100901.pdf