FIX - EPUB validation module based on W3C's EPUBCheck #543

Merged
merged 22 commits into from
Dec 10, 2019


carlwilson
Member

@carlwilson carlwilson commented Dec 10, 2019

This change is to add the EPUB-ptc extension module version 1.0. The module wraps the W3C EPUBCheck validation tool as a JHOVE module.

Some notes in advance of full documentation:

  1. Tests use examples derived from IDPF's EPUB 3 samples and test suite on GitHub and National Library of the Netherlands / Research epubPolicyTests, also on GitHub. These may be useful for additional test material.
  2. The checkSignatures() method is based on the Library of Congress EPUB2 and EPUB3 documentation. This check only looks at "magic numbers" and the file extension, so valid is always set to UNDETERMINED. If the signature matches, it sets "well formed" to true. Note that it does not perform the additional package checks that contribute to "well formed"-ness in parse(), so there is a small inconsistency there.
  3. Well-formed/valid status currently depends on the messages returned by EPUBCheck (except when a non-EPUB is passed and it throws an exception; see this issue). The full list of possible messages and their severity according to EPUBCheck can be seen here.
    • An EPUB is well formed if it has no messages with (severity="FATAL") or (severity="ERROR" and message-id begins with "PKG-"). In other words, fatal errors and package errors make the file "Not well formed".
    • If well formed, an EPUB is not valid if it has any messages with severity="ERROR".
  4. EPUBCheck provides a lot of interesting properties that I've included in the JHOVE report. Some of them may affect preservation decisions, so they seemed useful: for example, the presence of encryption, external fonts, or resources that are stored remotely but visually or aurally embedded in the EPUB. The result is an unusually large property/value section! You can review what is included here and here.
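
The well-formed/valid rules in note 3 can be sketched roughly as below. This is only an illustration under stated assumptions, not the module's actual code: `EpubStatusSketch`, `CheckMessage`, `Severity`, `Status` and `classify` are hypothetical names, the `WARNING` and `INFO` levels are assumed, and the message IDs are paired with severities purely for demonstration rather than taken from EPUBCheck's message list.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of the status rules; not the module's actual code. */
final class EpubStatusSketch {

    /** FATAL and ERROR come from the description; the other levels are assumed. */
    enum Severity { FATAL, ERROR, WARNING, INFO }

    /** The three outcomes the rules distinguish. */
    enum Status { NOT_WELL_FORMED, WELL_FORMED_NOT_VALID, WELL_FORMED_AND_VALID }

    /** Minimal stand-in for one message reported by EPUBCheck. */
    static final class CheckMessage {
        final String id;
        final Severity severity;

        CheckMessage(String id, Severity severity) {
            this.id = id;
            this.severity = severity;
        }
    }

    /** Applies the two rules: package/fatal problems first, then any remaining errors. */
    static Status classify(List<CheckMessage> messages) {
        boolean hasError = false;
        for (CheckMessage m : messages) {
            boolean packageError = m.severity == Severity.ERROR && m.id.startsWith("PKG-");
            if (m.severity == Severity.FATAL || packageError) {
                // Fatal errors and package errors make the file not well formed.
                return Status.NOT_WELL_FORMED;
            }
            if (m.severity == Severity.ERROR) {
                // Any other error leaves the file well formed but not valid.
                hasError = true;
            }
        }
        return hasError ? Status.WELL_FORMED_NOT_VALID : Status.WELL_FORMED_AND_VALID;
    }

    public static void main(String[] args) {
        // Illustrative messages only; IDs and severities are made up for the demo.
        List<CheckMessage> messages = Arrays.asList(
                new CheckMessage("RSC-001", Severity.ERROR),
                new CheckMessage("OPF-001", Severity.WARNING));
        System.out.println(classify(messages));
        System.out.println(classify(Collections.<CheckMessage>emptyList()));
    }
}
```

Returning a three-way status avoids having to decide what "valid" means for a file that is not well formed, which the notes above do not spell out.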

Includes #460 contributed by @karenhanson

karenhanson and others added 21 commits June 10, 2019 15:31
Added EPUB module to jhove-ext-modules.
This uses the EPUBCheck library, with a new MasterReport implementation for EPUBCheck that focuses on collecting and exploring relevant data for JHOVE RepInfo data.
Tests included.
The signature check defaults to valid=true, but it does nothing to check the validity of a file. This overrides the `checkSignatures()` method to make `valid=UNDETERMINED` by default in all cases.
- E-PUB module test baseline update added to `jhove-bbt/scripts/create-1.23-target.sh`;
- simplified test process and added more explicit logging;
- fixed small `shellcheck` lint warnings, particularly error-prone directory cycling;
- added comments to disable `shellcheck` inclusion warning; and
- removed `bin/.gitignore` generated by Eclipse.
- replaced `ArrayList` with `TreeSet` for elements where testing suggested order mattered;
- helper method to avoid null Property values being added to sets;
- dedicated `Comparator` implementations for `Message` and `Property`; and
- added example and error files for E-PUB module testing.
…encies

Inconsistencies were found in how these lists were formed and this opened the door to confusing reports. Removing the distinction between local and remote to match original EPUBCheck module.
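
The commit notes above on replacing `ArrayList` with `TreeSet`, adding dedicated `Comparator` implementations, and guarding sets against null values could look something like the following. This is a minimal sketch, not the code from these commits: `Note`, `NoteComparator` and `addIfNotNull` are hypothetical stand-ins for the JHOVE `Message`/`Property` types and the helper the commit mentions.

```java
import java.util.Comparator;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of ordered, null-free report collections. */
final class OrderedReportSketch {

    /** Stand-in for a report element such as a message or property. */
    static final class Note {
        final String id;
        final String text;

        Note(String id, String text) {
            this.id = id;
            this.text = text;
        }
    }

    /** Dedicated comparator: order by id, then by text, so output order is stable. */
    static final class NoteComparator implements Comparator<Note> {
        @Override
        public int compare(Note a, Note b) {
            int byId = a.id.compareTo(b.id);
            return byId != 0 ? byId : a.text.compareTo(b.text);
        }
    }

    /** Adds the value only when it is non-null; returns true if the set changed. */
    static <T> boolean addIfNotNull(Set<T> target, T value) {
        return value != null && target.add(value);
    }

    public static void main(String[] args) {
        TreeSet<Note> notes = new TreeSet<Note>(new NoteComparator());
        addIfNotNull(notes, new Note("B", "second"));
        addIfNotNull(notes, null);                   // silently skipped
        addIfNotNull(notes, new Note("A", "first"));
        for (Note n : notes) {
            System.out.println(n.id + ": " + n.text); // iterates in comparator order
        }
    }
}
```

A `TreeSet` built with an explicit comparator iterates in a deterministic order regardless of insertion order, which is what makes report comparisons in tests repeatable. Elements the comparator considers equal are stored only once.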
@carlwilson carlwilson self-assigned this Dec 10, 2019
@carlwilson carlwilson added this to the v1.24-m4 Release milestone Dec 10, 2019

codecov bot commented Dec 10, 2019

Codecov Report

Merging #543 into integration will increase coverage by 0.2%.
The diff coverage is n/a.


@@               Coverage Diff                @@
##             integration     #543     +/-   ##
================================================
+ Coverage          49.44%   49.65%   +0.2%     
- Complexity           984      992      +8     
================================================
  Files                 56       56             
  Lines               7768     7768             
  Branches            1409     1409             
================================================
+ Hits                3841     3857     +16     
+ Misses              3457     3445     -12     
+ Partials             470      466      -4
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| .../edu/harvard/hul/ois/jhove/handler/XmlHandler.java | 63% <0%> (+0.27%) | 266% <0%> (+2%) ⬆️ |
| ...src/main/java/edu/harvard/hul/ois/jhove/Utils.java | 72.22% <0%> (+2.38%) | 29% <0%> (+1%) ⬆️ |
| ...c/main/java/edu/harvard/hul/ois/jhove/RepInfo.java | 56.92% <0%> (+3.84%) | 40% <0%> (+4%) ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 88e9211...cf2166a.

@carlwilson carlwilson merged commit dfab9ae into integration Dec 10, 2019
@carlwilson carlwilson deleted the fix/epub-ext branch December 10, 2019 15:29