Proposed changes for 1.0 (updated source repo) #19

Open
wants to merge 169 commits into
base: master

@acdha acdha commented Nov 2, 2017

This replaces #17 to reflect the move from the old loc-rdc organization to the primary LibraryOfCongress organization. The most notable change from #17 is the restoration of the fetch.txt section, following discussion with @jkunze, @dbrunton, and @johnscancella.

johnscancella and others added 30 commits November 23, 2016 14:34
This appears to have been commented out since at least 2008.
This follows the guidelines in RFC-3629
* Note the existence of namespaces in the security considerations section
* Update previously un-displayed list of reserved DOS/Windows filenames
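To illustrate why the reserved-name list matters to implementors, here is a minimal sketch of a check for path components that collide with DOS/Windows device names. The function name `reserved_components` is hypothetical, and the set below is the commonly documented one (CON, PRN, AUX, NUL, COM1 to COM9, LPT1 to LPT9), assumed here rather than copied from the specification text.

```python
# Commonly documented DOS/Windows device names (an assumption for this sketch,
# not a copy of the list in the specification).
RESERVED_NAMES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def reserved_components(path):
    """Return the components of a "/"-separated path that use a reserved name."""
    hits = []
    for component in path.split("/"):
        # Windows treats "NUL.txt" like "NUL", so compare only the part
        # before the first dot, ignoring case and trailing spaces.
        stem = component.split(".", 1)[0].rstrip(" ").upper()
        if stem in RESERVED_NAMES:
            hits.append(component)
    return hits
```

A bag creator could run such a check over every payload path before writing manifests and warn when the result is non-empty.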
Clarify that this section is part of the specification
but is not considered a hard requirement for an implementation.
Update the section describing md5sum’s output format and clarify that accepting bags which were produced using md5sum and which will not pass strict validation is strictly optional.
This adds background information for problems related to
case-sensitivity and Unicode normalization and adds a list of
recommendations for implementors.
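As an illustration of the class of problem this commit describes, the following is a minimal sketch of detecting payload paths that would collide on a filesystem that folds case or normalizes Unicode. The function name `find_path_collisions` and the choice of NFC normalization followed by `casefold()` are assumptions made for the example, not requirements taken from the specification.

```python
import unicodedata
from collections import defaultdict


def find_path_collisions(paths):
    """Group paths that become identical after NFC normalization and case folding."""
    groups = defaultdict(list)
    for path in paths:
        # Normalize first so precomposed and decomposed forms compare equal,
        # then fold case so "README" and "readme" share a key.
        key = unicodedata.normalize("NFC", path).casefold()
        groups[key].append(path)
    # Only keys shared by more than one original path indicate a collision.
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

For example, `data/README` and `data/readme` end up in the same group, as do the precomposed and decomposed spellings of `café.txt`.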
This adds the note that, unlike other metadata tags, this element must
not be repeated and clarifies that the Payload-Oxum value is not
sufficient for validation.
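A short sketch may help show why Payload-Oxum is only a quick sanity check. The value has the form `<octet count>.<stream count>`: the total number of payload bytes and the number of payload files. The helper name `payload_oxum` is hypothetical.

```python
import os


def payload_oxum(bag_dir):
    """Return the Payload-Oxum string ("<octets>.<files>") for bag_dir/data."""
    total_octets = 0
    file_count = 0
    for root, _dirs, files in os.walk(os.path.join(bag_dir, "data")):
        for name in files:
            total_octets += os.path.getsize(os.path.join(root, name))
            file_count += 1
    return f"{total_octets}.{file_count}"
```

Because only sizes and counts are compared, a payload file whose bytes were altered without changing its length yields the same value, which is why checksum validation against the manifests is still needed.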
This triggers the standard formatting in HTML, etc. outputs
* Use <organization> for relevant <author> entries
* Omit empty <date> attributes
* Remove reference to GRABIT since the spec is now 
  returning HTTP 404 and there are no known public
  implementations.
* Add METALINK (RFC 5854) as an alternative which
  supports mirrors and protocols such as BitTorrent.
This wording is shorter and doesn’t distinguish between
validation for payload and tag files.
The spec shouldn't need to include mechanistic transfer details: if the
results validate, it's a bag.
jkunze added 5 commits May 24, 2018 10:24
…rror handling

"Upon discovering errors in bags, an implementation is free to take action (for example, logging or reporting) in an application-specific manner. This document does not mandate any particular action."
some displays ended with an extra blank line
Per reviewer comment:
>    Section 2.1.3), a file named "bagit.txt" (see Section 2.1.1), and
>    zero or more additional tag files (see Section 2.2).  The tag files
>    in the optional tag directories are arbitrary file hierarchies and
>    the tag directories MAY have any name that is not reserved for a file
>    or directory in this specification.

Above (2) seems to say that all tag directories are optional.  Hence
constantly including the word 'optional' for them, in the rest of the
document, is distracting.

>
>    The base directory MAY have any name.
>
>            <base directory>/
>            |   bagit.txt
>            |   manifest-<algorithm>.txt
>            |   [optional additional tag files]
>            \--- data/
>                  |   [payload files]
>            \--- [optional tag directories]/
>                  |   [optional tag files]

The square brackets are probably enough to indicate being optional. The
word just makes things wordier.

_The word “optional” has been removed as redundant, given the bracketing and that all tag directories have been described previously as optional._
bagit.xml Outdated
@@ -287,8 +287,7 @@ The base directory can have any name.
 |
 +-- [optional tag directories]/
 |
-+-- [optional tag files]
-</artwork>
++-- [optional tag files] </artwork>
Author

Was the intention of this to avoid an extra line in the rendered text output? I'm not a huge fan of the closing tag being on the end of the line like this but I'm not sure it's worth changing everything to go the other way.

Owner

Exactly, and it definitely makes the XML ugly. It may be a defect in the xml2rfc tool. If you can find a way around it, go for it, but if push comes to shove, I think getting the rendered human-oriented document consistent and correct is more important than making the XML pretty.

jkunze and others added 21 commits May 25, 2018 10:34
Per reviewer comment:
>    A payload manifest is a tag file that lists payload files and

probably:

    that lists payload file names and

_Clarified._

Saying "lists" does imply names and not the file contents, but for some
reason I think the modified form will be clearer.


>    checksums for those payload files generated using a particular bag

I'm pretty sure it's not the payload files that are generated using a
checksum algorithm...  I assume it's a manifest payload file listing...

_That sentence was stricken during recent editing rounds. A similar sentence has been reworded: “Every payload manifest MUST list every payload file name exactly once.”_
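The "exactly once" requirement quoted above can be checked mechanically. The following is a minimal sketch, assuming each manifest line has the form `<checksum> <path>` separated by whitespace; the function name `check_manifest_coverage` is hypothetical, and malformed lines are not handled.

```python
from collections import Counter


def check_manifest_coverage(manifest_lines, payload_paths):
    """Return (missing, duplicated) payload paths for one payload manifest."""
    listed = Counter()
    for line in manifest_lines:
        line = line.strip()
        if not line:
            continue
        # Split once on whitespace so that paths containing spaces survive.
        _checksum, path = line.split(None, 1)
        listed[path] += 1
    # Payload files the manifest never mentions.
    missing = sorted(set(payload_paths) - set(listed))
    # Paths the manifest mentions more than once.
    duplicated = sorted(path for path, count in listed.items() if count > 1)
    return missing, duplicated
```

A manifest satisfies the rule only when both returned lists are empty.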
Per reviewer comment:
>    checksum algorithm.  Every bag MUST contain one payload manifest
>    file, and MAY contain more than one.  A payload manifest file MUST

I think this is unusual enough to warrant, again, an initial, summary
statement.  If I'm understanding, it should be something like:

      A bag can have more than one data integrity manifest, with each
using a different validation algorithm.

_This sentence has been added: A bag can have more than one payload manifest, with each
using a different validation algorithm._
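As a rough illustration of a bag carrying more than one payload manifest, the sketch below builds one `manifest-<algorithm>.txt` body per algorithm over the same payload. The function name `build_manifests` and the default algorithm pair are assumptions for the example; files are read fully into memory for brevity, which a real tool would likely avoid for large payloads.

```python
import hashlib
import os


def build_manifests(bag_dir, algorithms=("sha256", "sha512")):
    """Return {manifest filename: manifest text} covering every file under data/."""
    manifests = {}
    for algorithm in algorithms:
        entries = []
        for root, _dirs, files in os.walk(os.path.join(bag_dir, "data")):
            for name in files:
                full_path = os.path.join(root, name)
                with open(full_path, "rb") as handle:
                    digest = hashlib.new(algorithm, handle.read()).hexdigest()
                # Manifest paths are relative to the base directory and use "/".
                relative = os.path.relpath(full_path, bag_dir).replace(os.sep, "/")
                entries.append((relative, digest))
        # Sort by path so the output is stable across runs and platforms.
        lines = [f"{digest} {relative}" for relative, digest in sorted(entries)]
        manifests[f"manifest-{algorithm}.txt"] = "\n".join(lines) + "\n"
    return manifests
```

Each manifest lists the same payload paths; only the checksum column differs between algorithms.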
Per reviewer comment:
>    Source-Organization  Organization transferring the content.
...
>    Organization-Address  Mailing address of the organization.

    organization  ->  source organization

>    Contact-Name  Person at the source organization who is responsible
>       for the content transfer.
>
>    Contact-Phone  International format telephone number of person or
>       position responsible.
>
>    Contact-Email  Fully qualified email address of person or position
>       responsible.
> ...
>    External-Description  A brief explanation of the contents and
>       provenance.
...
>    Bagging-Date  Date (YYYY-MM-DD) that the content was prepared for
>       delivery.

I think you mean 'transfer' rather than 'delivery'...
Per reviewer comment:
>    The "fetch.txt" file allows a bag to be transmitted with "holes" in
>    it, which can be practical for several reasons.  For example, it
>    obviates the need for the sender to stage a large serialized copy of
>    the content while the bag is transferred to the receiver.  Also, this
>    method allows a sender to construct a bag from components that are
>    either a subset of logically related components (e.g., the localized
>    logical object could be much larger than what is intended for export)
>    or assembled from logically distributed sources (e.g., the object
>    components for export are not stored locally under one filesystem
>    tree).

This paragraph would be a better introduction to the section.

_Done._
Per reviewer comment:
>    Implementors of tools that complete bags by retrieving URLs listed in
>    a "fetch.txt" file need to be aware that some of those URLs may point
>    to hosts, intentionally or unintentionally, that are not under
>    control of the bag's sender.  Checksums are intended as a reasonable
>    guarantee against corruption during transit, not a strong
>    cryptographic protection against intentional spoofing.

Oh?

_This wording was meant to apply to checksums as they are used in bags, as well as to address criticism that many legacy bags used easily broken MD5 checksums. That last sentence has now been reworded to: Moreover, older checksum algorithms, even if reasonable for detecting corruption during transit, may not offer strong cryptographic protection against intentional spoofing._
Per reviewer comment:
>    In all text tag files except for the bag declaration file, text MUST
>    be encoded in the character encoding specified in the "bagit.txt" bag

    be encoded in the character encoding  -> use the character encoding

_Done._
Per reviewer comment:
>    The size of files, as optionally reported in the "fetch.txt" file,
>    cannot be guaranteed to match the actual file size to be downloaded.
>    Implementors SHOULD take care to appropriately handle cases where the
>    actual file size does not match the file size reported in the
>    fetch.txt.  Implementors SHOULD NOT use the file size in the
>    "fetch.txt" file for critical resource allocation, such as buffer
>    sizing or storage requisitioning.

Absent specification of what "appropriately handle" means, this guidance
lacks substance.

_Reworded the second sentence to be: Implementers SHOULD take steps to monitor and abort transfer when the received file size exceeds the file size reported in the fetch file._
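The reworded guidance can be sketched as a copy loop that counts received bytes and stops as soon as the count exceeds the size declared in the fetch file. The names `copy_with_limit` and `TransferTooLarge` are hypothetical, and `source` and `destination` stand for any binary file-like objects (for example an HTTP response body and a local file).

```python
class TransferTooLarge(Exception):
    """Raised when more bytes arrive than the fetch file declared."""


def copy_with_limit(source, destination, declared_size, chunk_size=65536):
    """Copy source to destination, aborting once declared_size is exceeded."""
    received = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            # End of stream: report how many bytes were actually received.
            return received
        received += len(chunk)
        if received > declared_size:
            # Abort before writing the chunk that crosses the declared size.
            raise TransferTooLarge(
                f"received {received} bytes, expected at most {declared_size}"
            )
        destination.write(chunk)
```

The declared size is used here only as an upper bound for aborting, not for allocating buffers or storage up front, and a caller would still validate the completed file against the payload manifest.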
Update Justin's contact info
Changed reference to character set registry.
Added clarification about malicious attackers.

7 participants