New WARC documentation #489

Merged
merged 3 commits into from
Oct 22, 2019

Conversation

samalloing
Collaborator

This commit resolves #307

@samalloing
Collaborator Author

Hi @carlwilson

The commit with the GZip documentation is automatically added to this pull request. I don't know how to separate them. Is this fine by you?

Now I'm going to work on the Dutch translation.

Thanks

Sam

@carlwilson added the feature (New functionality to be developed), hacktoberfest, and P1 (High priority issues to be scheduled in the upcoming release) labels on Oct 16, 2019
@carlwilson added this to the Doc hack week October 2019 milestone on Oct 16, 2019
modules/gzip/index.html (2 review threads, outdated and resolved)
</a>
<p>
The WARC-kb module recognizes and validates the WARC (Web ARChive) format.
[<a href="/references#warciso">WARC</a>]. It only validates the WARC file format and WARC headers, not the actual payload of the WARC records.
Contributor

Suggested change
- [<a href="/references#warciso">WARC</a>]. It only validates the WARC file format and WARC headers, not the actual payload of the WARC records.
+ [<a href="/references#warc">WARC</a>]. It only validates the WARC file format and WARC headers, not the actual payload of the WARC records.

Use the link to the WARC specification here, since it can be accessed directly (as opposed to the ISO standard).

Collaborator Author

Hi Thomas,

Thanks for the feedback! I was in doubt about this myself. My reasoning in the end was that the general links (like this one) point to the ISO standard, while specific links (for example, the file extension information) point to the publicly available specs.

I also point to the 2009 version (ISO 28500:2009) even though there is a newer 2017 version, because the module documentation points to the 2009 version.

So @tledoux, do you want me to change this? @carlwilson, do you (or does somebody else) have an opinion about this?

<p>
The WARC-kb module recognizes and validates the WARC (Web ARChive) format.
[<a href="/references#warciso">WARC</a>]. It only validates the WARC file format and WARC headers, not the actual payload of the WARC records.
This module uses the <a href="/references#jwat">JWAT</a> library for WARC parsing.
Contributor

Suggested change
- This module uses the <a href="/references#jwat">JWAT</a> library for WARC parsing.
+ This module uses the <a href="/references#jwat">JWAT</a> library for WARC parsing.
+ It also expects the GZIP-kb module to be present in order to parse compressed WARCs (.warc.gz) which interweave WARC records and GZIP entries.

Collaborator Author

It was difficult to verify this because JHOVE bundles these modules in one jar, but I don't think the GZIP-kb module is required, because the WARC-kb module uses the JWAT library for the GZip parsing. Can you explain why you want this change, @tledoux?

Contributor

It seems to me that compressed WARC files (.warc.gz) are very common in the field, so something has to be said about how this "format" is handled. Indeed, a .warc.gz file is very different from, say, a .tar.gz file. The latter is just a compressed tarball; the former is a sequence of gzip entries that, when individually uncompressed, yield the series of WARC records that make up a WARC file. The two formats are really interwoven.
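
To make the difference concrete, here is a minimal Python sketch of the "one gzip entry per WARC record" layout. It is only an illustration under stated assumptions: the helper name `split_gzip_members` is hypothetical, the two records are simplified placeholders rather than valid WARC records, and none of this reflects how JWAT or the WARC-kb module is implemented.

```python
import gzip
import zlib


def split_gzip_members(data: bytes) -> list[bytes]:
    """Decompress each gzip member separately and return the payloads."""
    members = []
    while data:
        # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
        members.append(decomp.decompress(data))
        # Bytes after the end of this member belong to the next member.
        data = decomp.unused_data
    return members


# Placeholder records (assumption: heavily simplified, not valid WARC).
records = [
    b"WARC/1.0\r\nWARC-Type: warcinfo\r\nContent-Length: 0\r\n\r\n\r\n\r\n",
    b"WARC/1.0\r\nWARC-Type: response\r\nContent-Length: 0\r\n\r\n\r\n\r\n",
]

# .warc.gz style: every record is compressed on its own, then concatenated.
warc_gz = b"".join(gzip.compress(record) for record in records)

# .tar.gz style: the whole stream is compressed once as a single member.
single_member = gzip.compress(b"".join(records))

print(len(split_gzip_members(warc_gz)))        # one payload per record
print(len(split_gzip_members(single_member)))  # one payload in total
```

Compressing each record on its own is what allows a reader to seek to a member boundary and decompress one record without touching the rest of the file, which is why the gzip structure and the WARC structure have to be checked together.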

Collaborator Author

I looked further into the JWAT library, and it checks the GZip file and its compliance as well. So I'll add a sentence saying that for .warc.gz files the GZip compliance is also checked with the JWAT library.

Thanks for your feedback, Thomas! @tledoux

@samalloing
Collaborator Author

Hi @carlwilson

I updated this pull request with the comments of @tledoux.

Sam

@carlwilson
Member

Hi @samalloing, I had at least been following the conversation but was happy for you and @tledoux to discuss it. I have begun the great merge-up but will be at it into next week.

@carlwilson carlwilson merged commit ccb22f4 into openpreserve:gh-pages Oct 22, 2019