ErrorMessage repeated many times #165

Open
asciim0 opened this issue Nov 21, 2016 · 6 comments
Labels
bug A product defect that needs fixing good-first-issue Issue suitable for inexperienced developers P2 Medium priority issues to be scheduled in a future release
Milestone

Comments

@asciim0
Contributor

asciim0 commented Nov 21, 2016

Dev Effort

1D

Description

I've encountered several PDF files for which error messages are repeated a great many times. This is a problem when the JHOVE output is embedded in repositories, where the validation error message is stored in an index so that files with the same error can be handled together later on.
So far, two error messages could be identified which show this behavior:
edu.harvard.hul.ois.jhove.module.pdf.PdfMalformedException: Invalid name tree
edu.harvard.hul.ois.jhove.module.pdf.PdfInvalidException: Invalid destination

Attached is a sample file, which repeats "edu.harvard.hul.ois.jhove.module.pdf.PdfInvalidException: Invalid destination" a total of 2122 times, distributed across the following 4 offsets:

1425 repetitions of the error message for Offset: 1146553
663 repetitions of the error message for Offset: 1125321
26 repetitions of the error message for Offset: 1140429
8 repetitions of the error message for Offset: 1113971
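
A tally like the one above can be reproduced from the JHOVE XML output. The following is only a rough sketch: it assumes the messages appear as `<message>` elements with an optional `offset` attribute, as in the XML excerpts quoted in this thread, and `OffsetTally` is a made-up helper name, not something that exists in JHOVE.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of JHOVE.
class OffsetTally {

    // Counts <message> elements per value of their "offset" attribute.
    // Messages without an offset are grouped under "(none)".
    static Map<String, Integer> countByOffset(String xml) {
        try {
            NodeList messages = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getElementsByTagName("message");
            Map<String, Integer> counts = new TreeMap<>();
            for (int i = 0; i < messages.getLength(); i++) {
                String offset = ((Element) messages.item(i)).getAttribute("offset");
                counts.merge(offset.isEmpty() ? "(none)" : offset, 1, Integer::sum);
            }
            return counts;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML output", e);
        }
    }

    public static void main(String[] args) {
        // Tiny made-up input; a real run would read the JHOVE XML file instead.
        String xml = "<messages>"
                + "<message offset=\"1146553\" severity=\"error\">Invalid destination</message>"
                + "<message offset=\"1146553\" severity=\"error\">Invalid destination</message>"
                + "<message offset=\"1125321\" severity=\"error\">Invalid destination</message>"
                + "</messages>";
        countByOffset(xml).forEach((offset, n) ->
                System.out.println(n + " repetitions for offset " + offset));
    }
}
```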

Sample file:
conference_guide_v9_20100901.pdf

@ross-spencer

ross-spencer commented Dec 5, 2016

Just looking at this from our end as well and we're seeing the same issue.

Attached a 2MB XML output from JHOVE 1.14.6 using JHOVE PDF-HUL module 1.7.

Code where the message appears.

throw new PdfInvalidException ("Invalid destination object");

lindlar-pdf-validation-messages.txt

@ghost ghost added this to the Dev hack week initiation milestone Feb 28, 2019
@ghost ghost added bug A product defect that needs fixing P2 Medium priority issues to be scheduled in a future release labels Feb 28, 2019
rosetta-development added a commit to rosetta-development/jhove that referenced this issue Apr 8, 2019
rgfeldman added a commit to rgfeldman/jhove that referenced this issue Apr 10, 2019
@carlwilson carlwilson added the good-first-issue Issue suitable for inexperienced developers label Apr 23, 2020
@karenhanson
Contributor

Started to investigate. I'm not a PDF expert, but have context to add.

The messages are generated in this block of code.

In the example PDF provided in the original issue, there is an index at the end of the document that is supposed to have internal references to the corresponding section within the document. When you hover over these in the PDF the cursor turns to a hand, but when you click it does not go to the text section. So, every reference in the index is invalid, and indeed there are a few thousand of these. If you open the PDF as text, you can count lines that look like this:
2500 0 obj<</Rect[133.417 269.748 164.978 276.302]/Subtype/Link/A<</D(lab:TUP093)/S/GoTo>>/C[1 0 0]/Border[0 0 0]/Type/Annot>>
The number of these lines matches the number of messages.

So, the messages appear to be legitimate, but obviously they're excessive. We need to decide how to better handle these. Perhaps one of these options: (a) each unique failed reference could be listed once (e.g. one message for all "lab:TUP093"), which in this case would still be a few hundred messages, (b) there could be a max on the number of this type of message generated, or (c) consolidate into a general message (PDF has x failed internal references). Any other ideas? What is the preferred approach?
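
To make the three options easier to compare, here is a minimal sketch of each. It works on plain message strings only; `MessageConsolidator`, its method names, and the wording of the summary texts are invented for illustration and do not reflect how JHOVE stores or reports messages.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of JHOVE: illustrates options (a), (b) and (c).
class MessageConsolidator {

    // Option (a): keep each distinct message once, with its occurrence count.
    static Map<String, Integer> countUnique(List<String> messages) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String message : messages) {
            counts.merge(message, 1, Integer::sum);
        }
        return counts;
    }

    // Option (b): report at most `max` messages, then one note about the rest.
    static List<String> cap(List<String> messages, int max) {
        if (messages.size() <= max) {
            return new ArrayList<>(messages);
        }
        List<String> capped = new ArrayList<>(messages.subList(0, max));
        capped.add((messages.size() - max) + " further messages suppressed");
        return capped;
    }

    // Option (c): replace all messages with a single general one.
    static String summarize(List<String> messages) {
        return "PDF has " + messages.size() + " failed internal references";
    }

    public static void main(String[] args) {
        List<String> messages = Arrays.asList(
                "Invalid destination 'lab:TUP093'",
                "Invalid destination 'lab:TUP093'",
                "Invalid destination 'lab:MO302'");
        System.out.println(countUnique(messages));
        System.out.println(cap(messages, 2));
        System.out.println(summarize(messages));
    }
}
```

Option (a) keeps the most detail but, as noted, still leaves a few hundred entries for this file; (b) and (c) bound the output size at the cost of losing which references failed.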

I'm not sure why the offset is repetitive, or if it should be - needs further investigation.

There is another possible issue mixed in that is adding some confusion. The error messages should be labelled as "149" errors, but instead they are getting caught in an Exception catch and being relabelled as 122 errors. I think there should be a change here to preserve the specific error number.
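
Roughly what I mean by preserving the specific error number, as a standalone sketch. None of these classes exist in JHOVE: `IdentifiedException`, `checkDestination` and `validate` are placeholders, and the full ID strings ("PDF-HUL-149", "PDF-HUL-122") are assumed from the short numbers above. The idea is that a generic catch should only assign the general ID when the exception does not already carry a more specific one.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical sketch, not JHOVE source: class and method names are invented.
class ErrorIdSketch {

    // An exception that carries its own message ID.
    static class IdentifiedException extends Exception {
        final String id;

        IdentifiedException(String id, String message) {
            super(message);
            this.id = id;
        }
    }

    static final String GENERIC_ID = "PDF-HUL-122";

    // Stand-in for the destination check: fails when the name is unknown.
    static void checkDestination(String name, Set<String> known) throws IdentifiedException {
        if (!known.contains(name)) {
            throw new IdentifiedException("PDF-HUL-149",
                    "Invalid destination '" + name + "'");
        }
    }

    // Returns the ID the result should be reported under.
    static String validate(String name, Set<String> known) {
        try {
            checkDestination(name, known);
            return "OK";
        } catch (IdentifiedException e) {
            return e.id;        // keep the specific ID instead of relabelling it
        } catch (Exception e) {
            return GENERIC_ID;  // fall back only for failures without an ID
        }
    }

    public static void main(String[] args) {
        System.out.println(validate("lab:TUP093", Collections.<String>emptySet()));
        System.out.println(validate("lab:TUP093", null)); // NullPointerException path
    }
}
```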

There are 2 commits on this branch that may be helpful for continuing if I don't complete this. One has a test using the PDF provided so that you can step through, the other is catching the 149 error. These are rough and incomplete.

@carlwilson
Member

@karenhanson, this is great thanks and a lot clearer now. As you say, there needs to be a decision or two before a resolution. I'll pick this up before the week's out but today is difficult.

@MartinSpeller

@karenhanson karenhanson removed their assignment Sep 9, 2020
@karenhanson
Contributor

karenhanson commented Mar 20, 2023

Was just testing v1.28-RC with some of the problem files I'd looked at in the past. For this one, I noticed that something interesting has happened. It went from 2123 messages in 1.24, to 4246 messages in 1.26 when something caused 2 variations of each message to appear. One variation (PDF-HUL-122) uses an offset and class. The other has no offset or class (PDF-HUL-149):

   <message offset="1146553" severity="error" id="PDF-HUL-122">edu.harvard.hul.ois.jhove.module.pdf.PdfInvalidException: Invalid indirect destination - referenced object 'lab:MO302' cannot be found</message>
   <message severity="error" id="PDF-HUL-149">Invalid indirect destination - referenced object 'lab:MO302' cannot be found</message>

Now in 1.28, we have 763 messages thanks to the new message deduping, but slightly fewer than half of these are still duplicates because of the two variations. Also note that all the messages that reference an offset use the same number.
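
For illustration, the two variations collapse to the same text once the leading exception class name is stripped. This is just a sketch of that comparison, not of JHOVE's deduping: `MessageNormalizer` is a made-up name, and the regular expression assumes the prefix is always a fully qualified class name ending in `Exception` followed by a colon.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of JHOVE.
class MessageNormalizer {

    // Matches a leading fully qualified exception class name plus ": ".
    static final String CLASS_PREFIX =
            "^(?:[A-Za-z_$][\\w$]*\\.)+[A-Za-z_$][\\w$]*Exception:\\s*";

    // Removes the class-name prefix, if present, leaving only the message text.
    static String normalize(String message) {
        return message.replaceFirst(CLASS_PREFIX, "");
    }

    // Returns the distinct message texts after normalization, in first-seen order.
    static Set<String> distinctTexts(List<String> messages) {
        Set<String> distinct = new LinkedHashSet<>();
        for (String message : messages) {
            distinct.add(normalize(message));
        }
        return distinct;
    }

    public static void main(String[] args) {
        List<String> messages = Arrays.asList(
                "edu.harvard.hul.ois.jhove.module.pdf.PdfInvalidException: Invalid indirect destination - referenced object 'lab:MO302' cannot be found",
                "Invalid indirect destination - referenced object 'lab:MO302' cannot be found");
        System.out.println(distinctTexts(messages).size() + " distinct message text(s)");
    }
}
```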

Update: this seems like it might be related to work on this ticket: #277

@carlwilson
Member

Hi @karenhanson, thanks for the insight here. I'll take a look and see what's happening. There have been some changes to error reporting, to prevent the creation of old, IDless messages. I'll investigate what looks like multiple possible causes, but unless anything was directly "broken" by 1.28 this won't be addressed until we begin 1.30 development.

@carlwilson carlwilson modified the milestones: Hackathon tasks , OPF Hackathon 2023 Tasks Jun 21, 2023
@carlwilson carlwilson removed this from the OPF Hackathon 2023 Tasks milestone Mar 6, 2024
@carlwilson carlwilson added this to the JHOVE 1.34 milestone Aug 22, 2024