New WARC documentation #489

Merged
merged 3 commits into from
Oct 22, 2019
108 changes: 108 additions & 0 deletions modules/gzip/index.html
@@ -0,0 +1,108 @@
---
title: GZIP-kb Module
---
<!DOCTYPE html>
<html lang="en">
{% include header.html %}
<body role="document">

{% include navbar.html nav=site.data.navbar %}
<div class="container" role="main">
<h1>GZIP-kb Module</h1>
<a class="name" name="introduction">
<h2>1 Introduction</h2>
</a>
<p>
The GZIP-kb module recognizes and validates the GZip (GNU zip) format
[<a href="/references#gzip">GZip</a>].
</p>
<p>
The module is invoked by the:
</p>
<blockquote>
<pre>
jhove ... -m GZIP-kb ...
</pre>
</blockquote>
<p>
command line option.
</p>
<p>
The GZIP-kb module recognizes and validates <a href="/references#gzip">GZip version 4.3</a>. It also supports multi-member GZip files. This module uses the <a href="/references#jwat">JWAT</a> library for GZip parsing.
</p>

<p>
This module doesn't have configurable parameters.
</p>

<a class="name" name="coverage">
<h2>2 Coverage</h2>
</a>
<p>
The GZIP-kb module recognizes and validates the following public profiles:
</p>
<ul>
<li>
<a href="/references#gzip">RFC 1952</a>
</li>
</ul>

<a class="name" name="well-formedness">
<h2>3 Well-Formedness</h2>
</a>
<p>
The GZip module checks well-formedness.
</p>

<a class="name" name="validity">
<h2>4 Validity</h2></a>
<p>
The following criteria must be met by a GZip file for JHOVE to consider it valid:
</p>
<ul>
<li>The file is well-formed. In the GZIP-kb module, well-formedness and validity are equivalent.</li>
</ul>

<a class="name" name="repinfo">
<h2>5 Representation Information</h2></a>
<p>
The MIME type is reported as: application/gzip [<a href="/references#rfc6713">RFC 6713</a>]. The MIME type application/x-gzip is also supported.

</p>
<p>
In addition to the standard JHOVE
<a href="/documentation#repinfo">representation information</a>, the following
GZip-specific properties are reported:
</p>
<ul>
<li>
Property "GzipEntryProperties"
<ul>
<li>Property "Is non compliant" of type STRING</li>
<li>Property "Offset value" of type STRING</li>
<li>Property "GZip entry name" of type STRING</li>
<li>Property "GZip entry comment" of type STRING</li>
<li>Property "GZip entry date" of type STRING</li>
<li>Property "GZip entry compression method" of type STRING</li>
<li>Property "GZip entry operating system" of type STRING</li>
<li>Property "GZip entry header crc16" of type STRING</li>
<li>Property "GZip entry crc32" of type STRING</li>
<li>Property "GZip entry extracted size (ISIZE)" of type STRING</li>
<li>Property "GZip entry (computed) uncompressed size, in bytes" of type STRING</li>
<li>Property "GZip entry (computed) compressed size, in bytes" of type STRING</li>
<li>Property "GZip entry (computed) compression ratio" of type STRING</li>
</ul>
</li>
</ul>


<a class="name" name="extras">
<h2>6 Additional Module Properties</h2>
</a>
<ul>
<li>Nominal file extension: .gz</li>
</ul>
</div>
{% include footer.html %}
</body>
</html>
114 changes: 114 additions & 0 deletions modules/warc/index.html
@@ -0,0 +1,114 @@
---
title: WARC-kb Module
---
<!DOCTYPE html>
<html lang="en">
{% include header.html %}
<body role="document">

{% include navbar.html nav=site.data.navbar %}
<div class="container" role="main">
<h1>WARC-kb Module</h1>
<a class="name" name="introduction">
<h2>1 Introduction</h2>
</a>
<p>
The WARC-kb module recognizes and validates the WARC (Web ARChive) format
[<a href="/references#warciso">WARC</a>]. It only validates the WARC file format and WARC headers, not the actual payload of the WARC records.
Contributor

Suggested change
[<a href="/references#warciso">WARC</a>]. It only validates the WARC file format and WARC headers, not the actual payload of the WARC records.
[<a href="/references#warc">WARC</a>]. It only validates the WARC file format and WARC headers, not the actual payload of the WARC records.

Use the link to the WARC specification here, since it can be accessed directly (unlike the ISO standard).

Collaborator Author

Hi Thomas,

Thanks for the feedback! I was unsure about this myself. My reasoning in the end was that the general links (like this one) point to the ISO standard, and the specific links (for example, the file extension information) point to the publicly available specs.

I also point to the 2009 version (ISO28500:2009) even though there is a newer version from 2017, because the module documentation points to the 2009 version.

So @tledoux, do you want me to change this? @carlwilson, do you (or anybody else) have an opinion about this?

This module uses the <a href="/references#jwat">JWAT</a> library for WARC parsing.
Contributor

Suggested change
This module uses the <a href="/references#jwat">JWAT</a> library for WARC parsing.
This module uses the <a href="/references#jwat">JWAT</a> library for WARC parsing.
It also expects the GZIP-kb module to be present in order to parse compressed WARCs (.warc.gz) which interweave WARC records and GZIP entries.

Collaborator Author

It was difficult to verify this because JHOVE bundles these modules in one jar, but I don't think the GZIP-kb module is required, because the WARC-kb module uses the JWAT library for the GZip parsing. Can you explain why you want this change, @tledoux?

Contributor

It seems to me that compressed WARC files (.warc.gz) are very common in the field, so something has to be said about how this "format" is handled. Indeed, a .warc.gz is very different from, say, a .tar.gz. The latter is just a compressed tarball; the former is a sequence of gzip entries that, when uncompressed individually, yield the series of WARC records that make up a WARC file. The two formats are really interwoven.
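
To make the layout concrete, here is a minimal Python sketch of the idea. It is not JHOVE or JWAT code: the helper name `iter_gzip_members` is made up for illustration, and the record bodies are simplified placeholders rather than valid WARC records. It only shows that each gzip member can be decompressed on its own and gives back exactly one record.

```python
import gzip
import zlib


def iter_gzip_members(data: bytes):
    """Yield the decompressed bytes of each gzip member found in `data`."""
    while data:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(wbits=31)
        member = decomp.decompress(data) + decomp.flush()
        yield member
        # Bytes after the end of this member belong to the next one.
        data = decomp.unused_data


# Placeholder records (assumption: real WARC records carry more headers).
records = [
    b"WARC/1.0\r\nWARC-Type: warcinfo\r\n\r\n",
    b"WARC/1.0\r\nWARC-Type: response\r\n\r\n",
]

# A .warc.gz-style blob: every record compressed as its own gzip member.
blob = b"".join(gzip.compress(record) for record in records)

members = list(iter_gzip_members(blob))
```

With a .tar.gz, by contrast, the whole archive would normally be a single member, so the member boundaries would say nothing about the entries inside.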

Collaborator Author

I looked further into the JWAT library, and it checks the GZip file and its compliance as well. So I'll add a sentence saying that for .warc.gz files the GZip compliance is also checked by the JWAT library.

Thanks for your feedback, Thomas! @tledoux

The JWAT library is also used to parse compressed WARC files (.warc.gz).
</p>
<p>
The module is invoked by the:
</p>
<blockquote>
<pre>
jhove ... -m WARC-kb ...
</pre>
</blockquote>
<p>
command line option.
</p>
<p>
The WARC-kb module recognizes <a href="/references#warciso">ISO28500:2009</a>.
</p>

<p>
This module doesn't have configurable parameters.
</p>

<a class="name" name="coverage">
<h2>2 Coverage</h2>
</a>
<p>
The WARC-kb module recognizes and validates the following profiles:
</p>
<ul>
<li>
<a href="/references#warciso">ISO28500:2009</a>
</li>
</ul>

<a class="name" name="well-formedness">
<h2>3 Well-Formedness</h2>
</a>
<p>
The WARC module doesn't check well-formedness.
</p>

<a class="name" name="validity">
<h2>4 Validity</h2></a>
<p>
The WARC module validates only the WARC file format and the WARC headers. It doesn't check the payload of the WARC records.
</p>

<a class="name" name="repinfo">
<h2>5 Representation Information</h2></a>
<p>
The MIME type is reported as: application/warc
[<a href="https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-file-name-size-and-compression">application/warc, application/warc-fields</a>].
</p>
<p>
In addition to the standard JHOVE
<a href="/documentation#repinfo">representation information</a>, the following
WARC-specific properties are reported:
</p>
<ul>
<li>
Property "WarcRecordProperties"
<ul>
<li>Property "Record offset" of type STRING</li>
<li>Property "Warc-Date" of type STRING</li>
<li>Property "Warc-Record-ID" of type STRING</li>
<li>Property "Record-ID-Scheme" of type STRING</li>
<li>Property "Content-Type" of type STRING</li>
<li>Property "Content-Length" of type STRING</li>
<li>Property "Warc-Type" of type STRING</li>
<li>Property "Warc-Block-Digest" of type STRING</li>
<li>Property "Block-Digest-Algorithm" of type STRING</li>
<li>Property "Block-Digest-Encoding" of type STRING</li>
<li>Property "isValidBlockDigest" of type STRING</li>
<li>Property "Warc-Payload-Digest" of type STRING</li>
<li>Property "Payload-Digest-Algorithm" of type STRING</li>
<li>Property "Payload-Digest-Encoding" of type STRING</li>
<li>Property "isValidPayloadDigest" of type STRING</li>
<li>Property "Warc-Truncated" of type STRING</li>
<li>Property "hasPayload" of type STRING</li>
<li>Property "PayloadLength" of type STRING</li>
<li>Property "Warc-Identified-Payload-Type" of type STRING</li>
<li>Property "Warc-Segment-Number" of type STRING</li>
<li>Property "isNonCompliant value" of type STRING</li>
<li>Property "Computed Block-Digest" of type STRING</li>
</ul>
</li>
</ul>

<a class="name" name="extras">
<h2>6 Additional Module Properties</h2>
</a>
<ul>
<li>Nominal file extension: .warc, <a href="https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#annex-c-informative-warc-file-size-and-name-recommendations">.warc.gz</a></li>
</ul>
</div>
{% include footer.html %}
</body>
</html>