Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds EPUB validation extension module based on W3C's EPUBCheck #460

Merged
merged 20 commits into from
Dec 10, 2019

Conversation

karenhanson
Copy link
Contributor

@karenhanson karenhanson commented Jul 3, 2019

This change is to add the EPUB-ptc extension module version 1.0. The module wraps the W3C EPUBCheck validation tool as a JHOVE module.

Some notes in advance of full documentation:

  1. Tests use examples derived from IDPF's EPUB 3 samples and test suite on GitHub and National Library of the Netherlands / Research epubPolicyTests, also on GitHub. These may be useful for additional test material.
  2. The checkSignatures() method is based on the Library of Congress EPUB2and EPUB3 documentation. For this valid is always set to UNDETERMINED. This check only looks at "magic numbers" and file extension. If the signature matches, it sets "well formed" to true. Note that it does not perform the additional package checks that contribute to "well formed"-ness in parse() so there is a small inconsistency there.
  3. Well-formed/valid status is currently dependent on the messages returned by EPUBCheck (except when a non EPUB is passed and it throws an exception see this issue). The full list of possible messages and their severity according to EPUBCheck can be seen here.
    • an EPUB is well formed if it has no messages with (severity="FATAL") or (severity="ERROR" and message-id begins with "PKG-"). In other words, fatal errors and package errors make the file "Not well formed".
    • If well formed, an EPUB is not valid if it has any messages with severity="ERROR".
  4. The EPUBCheck provides a lot of interesting properties that I've included in the JHOVE report. Some of them may affect preservation decisions, so I thought they seemed useful. For example, the presence of encryption, external fonts, or resources that are stored remotely but visually or aurally embedded in the EPUB. The result is an unusually large property/value section! You can review what is included here and here.

I look forward to your comments, and especially welcome any feedback on the rationales described above!

Added EPUB module to jhove-ext-modules.
This uses the EPUBcheck library, with a new MasterReport implementation for EPUBcheck that focuses on collecting and exploring relevant data for JHOVE RepInfo data.
Tests included.
Signature check defaults to valid=true. The signature check doesn't do anything to check validity of a file. This overrides the `checkSignature()` method to make  `valid=UNDETERMINED` by default in all cases.
@karenhanson karenhanson force-pushed the epub-ext branch 2 times, most recently from 2690749 to 10f95c7 Compare July 8, 2019 23:08
@karenhanson karenhanson force-pushed the epub-ext branch 6 times, most recently from 4c88d58 to dd9fd54 Compare July 26, 2019 20:32
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<configuration>
<argLine>-Xss1024k</argLine>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is the additional argument for 32bit JVMs. I'm a little curious as to whether this would affect Travis builds as it appears that both of the JVMs on Travis are 64bit. Am also curious if the failure is determinate, i.e. it fails consistently without this arg.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior was quite strange. It seemed that one Travis build or the other would fail but never both or none - I'm not sure if that was just a coincidence. They seemed consistent in their failure or success when I repeated the same Travis build. Running them on a small test Linux instance I had they were much less consistent - if I ran them a few times, eventually they would build. I wonder if using a 32-bit JVM simply increases the chance of failure, but the failure is not impossible in the 64-bit environment for some reason.


}

/**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These generate property methods appear to be useful, but they're also quite generic and might be useful to other module devs. While I don't propose moving them yet it looks like a useful maintenance / refactoring task for a later date.

Copy link
Member

@carlwilson carlwilson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be an issue with the ordering of elements within the XML report. I'm not sure how serious it is and whether the test framework needs fixing or the module. In essence the XML reports are identical but the elements are in a different order. My guess is this has it's roots in the "multi-threadedness" of the underlying EPUB library.

@karenhanson
Copy link
Contributor Author

I dug in to the property ordering. I think it might be because EpubModule.generateProperties() is returning a Set not a List. It's not fundamentally wrong, but to make it easier to compare files, I will change to a List at the top level of the property/value tree and then we can discuss whether it should use Lists instead of Sets everywhere or not.

karenhanson and others added 7 commits August 20, 2019 14:47
- E-PUB module test baseline update added to `jhove-bbt/scripts/create-1.23-target.sh`;
- simplified test process and added more explicit logging;
- fixed small `shellcheck` lint warnings, particularly error prone directory cycling;
- added comments to disable `shellcheck` inclusion warning; and
- removed `bin/.gitignore` generated by Eclipse.
- replaced `ArrayList` with `TreeSet` for elements testing suggested order mattered;
- helper method to avoid null Property values been added to sets;
- dedicated `Comparator` implementations for `Message` and `Property`; and
- added example and error files for E-PUB module testing.
@codecov
Copy link

codecov bot commented Sep 26, 2019

Codecov Report

Merging #460 into integration will increase coverage by 0.59%.
The diff coverage is n/a.

Impacted file tree graph

@@                Coverage Diff                @@
##             integration     #460      +/-   ##
=================================================
+ Coverage          49.15%   49.75%   +0.59%     
+ Complexity           983      982       -1     
=================================================
  Files                 56       55       -1     
  Lines               7895     7670     -225     
  Branches            1432     1392      -40     
=================================================
- Hits                3881     3816      -65     
+ Misses              3542     3402     -140     
+ Partials             472      452      -20
Impacted Files Coverage Δ Complexity Δ
...in/java/edu/harvard/hul/ois/jhove/InfoMessage.java 41.17% <0%> (-11.77%) 4% <0%> (-1%)
...a/edu/harvard/hul/ois/jhove/NisoImageMetadata.java 74.64% <0%> (-1%) 175% <0%> (-7%)
...edu/harvard/hul/ois/jhove/handler/TextHandler.java 0.54% <0%> (-0.01%) 2% <0%> (ø)
...n/java/edu/harvard/hul/ois/jhove/ChecksumType.java 100% <0%> (ø) 3% <0%> (ø) ⬇️
...in/java/edu/harvard/hul/ois/jhove/RFC1766Lang.java 0% <0%> (ø) 0% <0%> (ø) ⬇️
...du/harvard/hul/ois/jhove/handler/AuditHandler.java 9.21% <0%> (ø) 2% <0%> (ø) ⬇️
.../main/java/edu/harvard/hul/ois/jhove/MacStuff.java 0% <0%> (ø) 0% <0%> (ø) ⬇️
...src/main/java/edu/harvard/hul/ois/jhove/Utils.java
...main/java/edu/harvard/hul/ois/jhove/JhoveBase.java 41.79% <0%> (+0.26%) 41% <0%> (ø) ⬇️
...ain/java/edu/harvard/hul/ois/jhove/ModuleBase.java 50.69% <0%> (+0.35%) 70% <0%> (-1%) ⬇️
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e31e254...5199c32. Read the comment docs.

…encies

Inconsistencies were found in how these lists were formed and this opened the door to confusing reports. Removing the distinction between local and remote to match original EPUBCheck module.
@carlwilson carlwilson added this to the v1.24-m4 Release milestone Oct 18, 2019
@carlwilson carlwilson added feature New functionality to be developed P1 High priority issues to be scheduled in the upcoming release labels Oct 18, 2019
@carlwilson
Copy link
Member

FYI @karenhanson am aware of the conflict and how to solve it. Am just waiting on merging this as the conflicts can only be in the test scripts and as this branch / PR is a new module it's easier to untangle conflicts here. It'll get merged this week.

@carlwilson carlwilson merged commit 5199c32 into openpreserve:integration Dec 10, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New functionality to be developed P1 High priority issues to be scheduled in the upcoming release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants