New WARC documentation #489
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,108 @@ | ||
--- | ||
title: GZIP-kb Module | ||
--- | ||
<!DOCTYPE html> | ||
<html lang="en"> | ||
{% include header.html %} | ||
<body role="document"> | ||
|
||
{% include navbar.html nav=site.data.navbar %} | ||
<div class="container" role="main"> | ||
<h1>GZIP-kb Module</h1> | ||
<a class="name" name="introduction"> | ||
<h2>1 Introduction</h2> | ||
</a> | ||
<p> | ||
The GZIP-kb module recognizes and validates the GZip (GNU zip) format | ||
[<a href="/references#gzip">GZip</a>]. | ||
</p> | ||
<p> | ||
The module is invoked by the: | ||
</p> | ||
<blockquote> | ||
<pre> | ||
jhove ... -m GZIP-kb ... | ||
</pre> | ||
</blockquote> | ||
<p> | ||
command line option. | ||
</p> | ||
<p> | ||
The GZIP-kb module recognizes and validates <a href="/references#gzip">GZip version 4.3</a>. It also supports multi-member GZip files. This module uses the <a href="/references#jwat">JWAT</a> library for GZip parsing. | ||
</p> | ||
|
||
<p> | ||
This module doesn't have configurable parameters. | ||
</p> | ||
|
||
<a class="name" name="coverage"> | ||
<h2>2 Coverage</h2> | ||
</a> | ||
<p> | ||
The GZIP-kb module recognizes and validates the following public profiles: | ||
</p> | ||
<ul> | ||
<li> | ||
<a href="/references#gzip">RFC 1952</a> | ||
</li> | ||
</ul> | ||
|
||
<a class="name" name="well-formedness"> | ||
<h2>3 Well-Formedness</h2> | ||
</a> | ||
<p> | ||
The GZip module checks well-formedness. | ||
</p> | ||
|
||
<a class="name" name="validity"> | ||
<h2>4 Validity</h2></a> | ||
<p> | ||
The following criteria must be met by a GZip file for JHOVE to consider it valid: | ||
</p> | ||
<ul> | ||
<li>The file is well-formed. In the GZIP-kb module, well-formedness and validity are equivalent.</li> | ||
</ul> | ||
|
||
<a class="name" name="repinfo"> | ||
<h2>5 Representation Information</h2></a> | ||
<p> | ||
The MIME type is reported as: application/gzip [<a href="/references#rfc6713">RFC 6713</a>]. The MIME type application/x-gzip is also supported. | ||
|
||
</p> | ||
<p> | ||
In addition to the standard JHOVE | ||
<a href="/documentation#repinfo">representation information</a>, the following | ||
GZip-specific properties are reported: | ||
</p> | ||
<ul> | ||
<li> | ||
Property "GzipEntryProperties" | ||
<ul> | ||
<li>Property "Is non compliant" of type STRING</li> | ||
<li>Property "Offset value" of type STRING</li> | ||
<li>Property "GZip entry name" of type STRING</li> | ||
<li>Property "GZip entry comment" of type STRING</li> | ||
<li>Property "GZip entry date" of type STRING</li> | ||
<li>Property "GZip entry compression method" of type STRING</li> | ||
<li>Property "GZip entry operating system" of type STRING</li> | ||
<li>Property "GZip entry header crc16" of type STRING</li> | ||
<li>Property "GZip entry crc32" of type STRING</li> | ||
<li>Property "GZip entry extracted size (ISIZE)" of type STRING</li> | ||
<li>Property "GZip entry (computed) uncompressed size, in bytes" of type STRING</li> | ||
<li>Property "GZip entry (computed) compressed size, in bytes" of type STRING</li> | ||
<li>Property "GZip entry (computed) compression ratio" of type STRING</li> | ||
</ul> | ||
</li> | ||
</ul> | ||
|
||
|
||
<a class="name" name="extras"> | ||
<h2>6 Additional Module Properties</h2> | ||
</a> | ||
<ul> | ||
<li>Nominal file extension: .gz</li> | ||
</ul> | ||
</div> | ||
{% include footer.html %} | ||
</body> | ||
</html> |
Original file line number | Diff line number | Diff line change | ||||||
---|---|---|---|---|---|---|---|---|
@@ -0,0 +1,114 @@ | ||||||||
--- | ||||||||
title: WARC-kb Module | ||||||||
--- | ||||||||
<!DOCTYPE html> | ||||||||
<html lang="en"> | ||||||||
{% include header.html %} | ||||||||
<body role="document"> | ||||||||
|
||||||||
{% include navbar.html nav=site.data.navbar %} | ||||||||
<div class="container" role="main"> | ||||||||
<h1>WARC-kb Module</h1> | ||||||||
<a class="name" name="introduction"> | ||||||||
<h2>1 Introduction</h2> | ||||||||
</a> | ||||||||
<p> | ||||||||
The WARC-kb module recognizes and validates the WARC (Web ARChive) format | ||||||||
[<a href="/references#warciso">WARC</a>]. It only validates the WARC file format and WARC headers, not the actual payload of the WARC records. | ||||||||
This module uses the <a href="/references#jwat">JWAT</a> library for WARC parsing. | ||||||||
Suggested change
It was difficult to verify this because JHOVE bundles these modules in one jar, but I don't think the GZIP-kb module is required, because it uses the JWAT library for the GZip parsing. Can you explain why you want this change? @tledoux
It seems to me that compressed WARC files (.warc.gz) are very common in the field, so something has to be said about the handling of this "format". Indeed, .warc.gz files are very different from, say, .tar.gz files. The latter is just the compression of a tarball; the former is a sequence of gzip entries that, when individually uncompressed, yield a series of WARC records that make up a WARC file. The two formats are really interwoven.
I looked further into the JWAT library, and it checks the GZip file and its compliance as well. So I'll add a sentence saying that for .warc.gz files GZip compliance is also checked by the JWAT library. Thanks for your feedback, Thomas! @tledoux |
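To make the .warc.gz versus .tar.gz distinction from the comment above concrete, here is a minimal Python sketch. It uses only the standard library and is not JHOVE or JWAT code; the two byte strings are simplified stand-ins for WARC records, and `split_members` is a hypothetical helper written for this illustration.

```python
import gzip
import zlib

# Simplified stand-ins for WARC records (real records carry full headers).
records = [
    b"WARC/1.0\r\nWARC-Type: warcinfo\r\n\r\n",
    b"WARC/1.0\r\nWARC-Type: response\r\n\r\n",
]

# .warc.gz style: each record is compressed on its own, then the
# resulting gzip members are concatenated into one file.
per_record = b"".join(gzip.compress(r) for r in records)

# .tar.gz style: the whole byte stream is compressed as a single member.
whole_file = gzip.compress(b"".join(records))


def split_members(data):
    """Decompress each gzip member separately and return the pieces."""
    members = []
    while data:
        # Adding 16 to wbits tells zlib to expect gzip framing.
        d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
        members.append(d.decompress(data) + d.flush())
        # Bytes after the end of this member belong to the next one.
        data = d.unused_data
    return members
```

Calling `split_members(per_record)` yields one piece per record, while `split_members(whole_file)` yields a single piece. Both inputs decompress to the same bytes overall, which is why a parser has to look at member boundaries to tell the two layouts apart.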
||||||||
The JWAT library is also used to parse compressed WARC files (.warc.gz). | ||||||||
</p> | ||||||||
<p> | ||||||||
The module is invoked by the: | ||||||||
</p> | ||||||||
<blockquote> | ||||||||
<pre> | ||||||||
jhove ... -m WARC-kb ... | ||||||||
</pre> | ||||||||
</blockquote> | ||||||||
<p> | ||||||||
command line option. | ||||||||
</p> | ||||||||
<p> | ||||||||
The WARC-kb module recognizes <a href="/references#warciso">ISO 28500:2009</a>. | ||||||||
</p> | ||||||||
|
||||||||
<p> | ||||||||
This module doesn't have configurable parameters. | ||||||||
</p> | ||||||||
|
||||||||
<a class="name" name="coverage"> | ||||||||
<h2>2 Coverage</h2> | ||||||||
</a> | ||||||||
<p> | ||||||||
The WARC-kb module recognizes and validates the following profiles: | ||||||||
</p> | ||||||||
<ul> | ||||||||
<li> | ||||||||
<a href="/references#warciso">ISO 28500:2009</a> | ||||||||
</li> | ||||||||
</ul> | ||||||||
|
||||||||
<a class="name" name="well-formedness"> | ||||||||
<h2>3 Well-Formedness</h2> | ||||||||
</a> | ||||||||
<p> | ||||||||
The WARC-kb module does not check well-formedness. | ||||||||
</p> | ||||||||
|
||||||||
<a class="name" name="validity"> | ||||||||
<h2>4 Validity</h2></a> | ||||||||
<p> | ||||||||
The WARC-kb module validates only the WARC file format and the WARC headers. It does not check the payload of the WARC records. | ||||||||
</p> | ||||||||
|
||||||||
<a class="name" name="repinfo"> | ||||||||
<h2>5 Representation Information</h2></a> | ||||||||
<p> | ||||||||
The MIME type is reported as: application/warc | ||||||||
[<a href="https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-file-name-size-and-compression">application/warc, application/warc-fields</a>]. | ||||||||
</p> | ||||||||
<p> | ||||||||
In addition to the standard JHOVE | ||||||||
<a href="/documentation#repinfo">representation information</a>, the following | ||||||||
WARC-specific properties are reported: | ||||||||
</p> | ||||||||
<ul> | ||||||||
<li> | ||||||||
Property "WarcRecordProperties" | ||||||||
<ul> | ||||||||
<li>Property "Record offset" of type STRING</li> | ||||||||
<li>Property "Warc-Date" of type STRING</li> | ||||||||
<li>Property "Warc-Record-ID" of type STRING</li> | ||||||||
<li>Property "Record-ID-Scheme" of type STRING</li> | ||||||||
<li>Property "Content-Type" of type STRING</li> | ||||||||
<li>Property "Content-Length" of type STRING</li> | ||||||||
<li>Property "Warc-Type" of type STRING</li> | ||||||||
<li>Property "Warc-Block-Digest" of type STRING</li> | ||||||||
<li>Property "Block-Digest-Algorithm" of type STRING</li> | ||||||||
<li>Property "Block-Digest-Encoding" of type STRING</li> | ||||||||
<li>Property "isValidBlockDigest" of type STRING</li> | ||||||||
<li>Property "Warc-Payload-Digest" of type STRING</li> | ||||||||
<li>Property "Payload-Digest-Algorithm" of type STRING</li> | ||||||||
<li>Property "Payload-Digest-Encoding" of type STRING</li> | ||||||||
<li>Property "isValidPayloadDigest" of type STRING</li> | ||||||||
<li>Property "Warc-Truncated" of type STRING</li> | ||||||||
<li>Property "hasPayload" of type STRING</li> | ||||||||
<li>Property "PayloadLength" of type STRING</li> | ||||||||
<li>Property "Warc-Identified-Payload-Type" of type STRING</li> | ||||||||
<li>Property "Warc-Segment-Number" of type STRING</li> | ||||||||
<li>Property "isNonCompliant value" of type STRING</li> | ||||||||
<li>Property "Computed Block-Digest" of type STRING</li> | ||||||||
</ul> | ||||||||
</li> | ||||||||
</ul> | ||||||||
|
||||||||
<a class="name" name="extras"> | ||||||||
<h2>6 Additional Module Properties</h2> | ||||||||
</a> | ||||||||
<ul> | ||||||||
<li>Nominal file extension: .warc, <a href="https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#annex-c-informative-warc-file-size-and-name-recommendations">.warc.gz</a></li> | ||||||||
</ul> | ||||||||
</div> | ||||||||
{% include footer.html %} | ||||||||
</body> | ||||||||
</html> |
Use the link to the WARC specification here, since it can be accessed directly (as opposed to the ISO standard).
Hi Thomas,
Thanks for the feedback! I was unsure about this myself. My reasoning, in the end, was that the general links (like this one) point to the ISO standard, and the specific links (for example, the file extension information) point to the publicly available specs.
I also point to the 2009 version (ISO 28500:2009) although there is a newer version from 2017, because the module documentation points to the 2009 version.
So @tledoux, do you want me to change this? @carlwilson, do you have an opinion about this (or does anybody else)?