New WARC documentation #489
Hi @carlwilson, the commit with the GZip documentation was automatically added to this pull request, and I don't know how to separate them. Is this fine by you? I'm now going to work on the Dutch translation. Thanks, Sam
</a>
<p>
The WARC-kb module recognizes and validates the WARC (Web ARChive) format.
[<a href="/references#warciso">WARC</a>]. It only validates the WARC file format and WARC headers, not the actual payload of the WARC records.
- [<a href="/references#warciso">WARC</a>]. It only validates the WARC file format and WARC headers, not the actual payload of the WARC records.
+ [<a href="/references#warc">WARC</a>]. It only validates the WARC file format and WARC headers, not the actual payload of the WARC records.
Link here to the WARC specification, which can be accessed directly (as opposed to the ISO standard).
Hi Thomas,
Thanks for the feedback! I was in doubt about this myself. My reasoning in the end was that general links (like this one) point to the ISO standard, while specific links (for example, the file extension information) point to the publicly available specs.
I also point to the 2009 version (ISO 28500:2009) even though there is a newer 2017 version, because the module documentation references the 2009 version.
So @tledoux, do you want me to change this? @carlwilson, do you (or anybody else) have an opinion about this?
<p>
The WARC-kb module recognizes and validates the WARC (Web ARChive) format.
[<a href="/references#warciso">WARC</a>]. It only validates the WARC file format and WARC headers, not the actual payload of the WARC records.
This module uses the <a href="/references#jwat">JWAT</a> library for WARC parsing.
- This module uses the <a href="/references#jwat">JWAT</a> library for WARC parsing.
+ This module uses the <a href="/references#jwat">JWAT</a> library for WARC parsing.
+ It also expects the GZIP-kb module to be present in order to parse compressed WARCs (.warc.gz) which interweave WARC records and GZIP entries.
It was difficult to verify this because JHOVE bundles these modules in one jar, but I don't think the GZIP-kb module is required, since it uses the JWAT library for the GZip parsing. Can you explain why you want this change, @tledoux?
It seems to me that compressed WARC files (.warc.gz) are very common in the field, so something should be said about how this "format" is handled. Indeed, .warc.gz files are very different from, say, .tar.gz files. The latter is just a compressed tarball; the former is a sequence of gzip entries that, when individually uncompressed, yield the series of WARC records that make up a WARC file. The two formats are really interwoven.
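To illustrate the point about interwoven formats, here is a minimal Python sketch (not JHOVE or JWAT code) that builds a tiny .warc.gz-style byte stream with one gzip entry per record and then splits it back into records. The `make_record` and `split_members` helpers and the example URIs are hypothetical, and the records are deliberately simplified: they omit headers a real WARC record requires.

```python
import gzip
import zlib
from typing import List


def make_record(uri: str, body: bytes) -> bytes:
    """Build a simplified WARC-like record (illustrative, not a complete valid record)."""
    header = (
        "WARC/1.0\r\n"
        "WARC-Type: resource\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    # A record is its header block, the payload, and a trailing blank-line separator.
    return header + body + b"\r\n\r\n"


def split_members(data: bytes) -> List[bytes]:
    """Decompress each gzip entry separately and return one payload per entry."""
    members = []
    while data:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decompressor = zlib.decompressobj(wbits=31)
        members.append(decompressor.decompress(data) + decompressor.flush())
        # Bytes after the end of this entry belong to the next one.
        data = decompressor.unused_data
    return members


records = [
    make_record("http://example.org/a", b"first payload"),
    make_record("http://example.org/b", b"second payload"),
]

# Unlike .tar.gz, every record is compressed on its own and the entries are concatenated.
warc_gz = b"".join(gzip.compress(record) for record in records)

recovered = split_members(warc_gz)
```

Each element of `recovered` is one record, so a reader can seek to the start of a gzip entry and decompress that record alone, which would not be possible if the whole file were compressed as a single stream.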
I looked further into the JWAT library, and it checks the GZip file and its compliance as well. So I'll add a sentence saying that for .warc.gz files, GZip compliance is also checked using the JWAT library.
Thanks for your feedback, Thomas! @tledoux
Hi @carlwilson, I updated this pull request to address the comments from @tledoux. Sam
Hi @samalloing, I had been following the conversation but was happy for you and @tledoux to discuss it. I have begun the great merge-up but will be at it into next week.
This commit resolves #307