From 480c5f672325f3162ee94ef889a57eed21f1a871 Mon Sep 17 00:00:00 2001 From: John Scancella Date: Wed, 23 Nov 2016 14:34:11 -0500 Subject: [PATCH 001/144] first cut at bagit 1.0 spec --- bagit.xml | 610 ++++++------------------------------------------------ 1 file changed, 62 insertions(+), 548 deletions(-) diff --git a/bagit.xml b/bagit.xml index 68e83397..0b1c8b79 100644 --- a/bagit.xml +++ b/bagit.xml @@ -37,7 +37,7 @@ Change "algorithm" (crypto hash) to "function"? (Steffen Fritz email to jak) - + ]> @@ -93,8 +93,9 @@ Change "algorithm" (crypto hash) to "function"? (Steffen Fritz email to jak) - Library of Congress + George Washington University Libraries +
101 Independence Avenue SE @@ -106,6 +107,23 @@ Change "algorithm" (crypto hash) to "function"? (Steffen Fritz email to jak)
+ + + University of Maryland + + +
+ + 101 Independence Avenue SE + Washington DC + 20540 + USA + + ehs@pobox.com +
+
+ @@ -122,8 +140,8 @@ Change "algorithm" (crypto hash) to "function"? (Steffen Fritz email to jak) - + Library of Congress @@ -134,9 +152,25 @@ Change "algorithm" (crypto hash) to "function"? (Steffen Fritz email to jak) 20540 USA - ehs@pobox.com + jsca@loc.gov - + + + + + Library of Congress + +
+ + 101 Independence Avenue SE + Washington DC + 20540 + USA + + cadams@loc.gov +
+
@@ -202,11 +236,7 @@ document are to be interpreted as described in . An implementation is not compliant if it fails to satisfy one or more of the MUST or REQUIRED level requirements for the protocols -it implements. An implementation that satisfies all the MUST or -REQUIRED level and all the SHOULD level requirements for its protocols -is said to be "unconditionally compliant"; one that satisfies all -the MUST level requirements but not all the SHOULD level requirements -for its protocols is said to be "conditionally compliant." +it implements. @@ -241,7 +271,7 @@ in this section are intended to clarify any ambiguity. A bag which comprises all elements required by this specification, with all files listed in all payload and tag manifests present, - all payload files present listed in at least one manifest. See + all payload files present listed in all manifests. See . @@ -364,7 +394,8 @@ specification, and is not otherwise prescribed. A payload manifest is a tag file that lists payload files and checksums for those payload files generated using a particular bag checksum algorithm. Every bag &must; contain one payload manifest file, and &may; contain -more than one. A payload manifest file &must; +more than one. If there is more than one, each payload manifest +file must list all payload files. A payload manifest file &must; have a name of the form manifest-algorithm.txt, where algorithm is a string specifying the bag checksum algorithm used in that manifest, such as: @@ -390,16 +421,14 @@ CHECKSUM FILENAME -where FILENAME is the pathname of a file relative to the base directory -and CHECKSUM is a hex-encoded checksum calculated according to algorithm over every octet in the file. The hex-encoded -checksum &may; use uppercase and/or lowercase letters. The slash +where FILENAME is the pathname of a file relative to the base directory, + and CHECKSUM is a hex-encoded checksum calculated according to algorithm over every octet in the file. 
+The hex-encoded checksum &may; use uppercase and/or lowercase letters. The slash character ('/') &must; be used as a path separator in FILENAME. One or more linear whitespace characters (spaces or tabs) &must; separate -CHECKSUM from FILENAME. An asterisk ('*') &may; preceed FILENAME for -interoperability on some platforms (see ). There is no limitation on the length of a pathname. The payload -manifest &must-not; reference files outside the payload directory. If +CHECKSUM from FILENAME. There is no limitation on the length of a pathname. +The payload manifest &must-not; reference files outside the payload directory. If a FILENAME includes a newline (LF), a carriage return (CR), or carriage return plus newline (CRLF) it &must; be percent-encoded . @@ -454,7 +483,7 @@ The "bag-info.txt" file is a tag file that contains metadata elements describing the bag and the payload. The metadata elements contained in the "bag-info.txt" file are intended primarily for human readability. All metadata elements are optional and &may; be repeated. Implementations -&should; assume that the ordering is significant and provide access to the +&must; assume that the ordering is significant and provide access to the metadata elements in the order they are given in the "bag-info.txt" file. @@ -466,41 +495,9 @@ or carriage return plus newline (CRLF) and indenting the next line with linear white space (spaces or tabs). -Reserved metadata element names are case-insensitive and defined as follows. - - - - - - Organization transferring the content. - - - Mailing address of the organization. - - - Person at the source organization who is responsible for the content - transfer. - - - International format telephone number of person or position responsible. - - - Fully qualified email address of person or position responsible. - - - A brief explanation of the contents and provenance. - - - Date (YYYY-MM-DD) that the content was prepared for delivery. 
- - - A sender-supplied identifier for the bag. - - - Size or approximate size of the bag being transferred, followed - by an abbreviation such as MB (megabytes), GB, or TB; for example, - 42600 MB, 42.6 GB, or .043 TB. Compared to Payload-Oxum (described - next), Bag-Size is intended for human consumption. +An implementation &may; assume the metadata element Payload-Oxum exists +for the purpose of quickly verifying the bag. If it does exist it &must; +be in the form defined as follows: The "octetstream sum" of the payload, namely, a two-part number @@ -511,33 +508,6 @@ Reserved metadata element names are case-insensitive and defined as follows. possible. Compared to Bag-Size (above), Payload-Oxum is intended for machine consumption. - - A sender-supplied identifier for the set, if any, of bags - to which it logically belongs. - This identifier must be unique across the sender's content, and if - recognizable as belonging to a globally unique scheme, the receiver - should make an effort to honor reference to it. - - - Two numbers separated by "of", in particular, "N of T", - where T is the total number of bags in a group of bags and N is the - ordinal number within the group; if T is not known, specify it as "?" - (question mark). Examples: 1 of 2, 4 of 4, 3 of ?, 89 of 145. - - - An alternate sender-specific identifier for the content - and/or bag. - - - A sender-local prose description of the contents of the - bag. - - - - - -In addition to these metadata elements, other arbitrary metadata elements may also be present. - Here is an example "bag-info.txt" file. @@ -566,46 +536,6 @@ Here is an example "bag-info.txt" file. -
- - -For reasons of efficiency, a bag &may; be sent with a list of files to be -fetched and added to the payload before it can meaningfully be checked -for completeness. An &optional; tag file named "fetch.txt" -contains such a list. Each line of "fetch.txt" has the form - -
- -URL LENGTH FILENAME - -
- -where URL identifies the file to be fetched, LENGTH is the number of -octets in the file (or "-", to leave it unspecified), and FILENAME -identifies the corresponding payload file, relative to the base directory. -The slash character ('/') &must; be used as a path separator in FILENAME. -If FILENAME begins with a slash character, the destination &must; still be -treated as relative to the bag base directory. -One or more linear whitespace characters (spaces or tabs) &must; separate these -three values, and any such characters in the URL &must; be percent-encoded -. There is no limitation on the length of any -of the fields in the "fetch.txt". -
- - -The "fetch.txt" file allows a bag to be transmitted with -"holes" in it, which can be practical for several reasons. For example, -it obviates the need for the sender to stage a large serialized copy of -the content while the bag is transferred to the receiver. Also, this -method allows a sender to construct a bag from components that are either -a subset of logically related components (e.g., the localized logical -object could be much larger than what is intended for export) or -assembled from logically distributed sources (e.g., the object components -for export are not stored locally under one filesystem tree). - - -
-
A bag &may; contain other tag files that are not defined by this @@ -691,8 +621,7 @@ attributes: Every file in every payload manifest &must; be present. Every file in every tag manifest &must; be present. Tag files not listed in a tag manifest &may; be present. - Every payload file &must; be listed in at least one manifest. - Payload files &may; be listed in more than one payload manifest. + Every payload file &must; be listed in every payload manifest. Every element present &must; comply with this specification. @@ -706,9 +635,6 @@ the following exceptions to the attributes of a complete bag: One or more files in any payload manifest are absent. One or more files in any tag manifest are absent. - A fetch.txt is present. Any files listed in - any payload manifest or any tag manifest which are - absent &must; be listed in the fetch.txt. @@ -864,70 +790,6 @@ or '*' (matching any number of characters). For example,
--> - -
- - -The following example bag contains content from a web crawler. -As before, lines of file content are shown in parentheses beneath the -file name, with long lines continued indented on subsequent lines. -This bag is not complete until every -component listed in the "fetch.txt" file is retrieved. - -
- -mysecondbag/ -| -| manifest-md5.txt -| (93c53193ef96732c76e00b3fdd8f9dd3 data/Collection Overview.txt ) -| (e9c5753d65b1ef5aeb281c0bb880c6c8 data/Seed List.txt ) -| (61c96810788283dc7be157b340e4eff4 data/gov-20060601-050019.arc.gz) -| (55c7c80c6635d5a4c8fe76a940bf353e data/gov-20060601-100002.arc.gz) -| -| fetch.txt -| (http://WB20.Stanford.Edu/gov-06-2006/gov-20060601-050019.arc.gz -| 26583985 data/gov-20060601-050019.arc.gz ) -| (http://WB20.Stanford.Edu/gov-06-2006/gov-20060601-100002.arc.gz -| 99509720 data/gov-20060601-100002.arc.gz ) -| ( ...............................................................) -| -| bag-info.txt -| (Source-organization: California Digital Library ) -| (Organization-address: 415 20th St, 4th Floor, Oakland, CA 94612) -| (Contact-name: A. E. Newman ) -| (Contact-phone: +1 510-555-1234 ) -| (Contact-email: alfred@ucop.edu ) -| (External-Description: The collection "Local Davis Flood Control ) -| Collection" includes captured California State and local ) -| websites containing information on flood control resources for ) -| the Davis and Sacramento area. Sites were captured by UC Davis) -| curator Wrigley Spyder using the Web Archiving Service in ) -| February 2007 and October 2007. ) -| (Bag-date: 2008.04.15 ) -| (External-identifier: ark:/13030/fk4jm2bcp ) -| (Bag-size: about 22Gb ) -| (Payload-Oxum: 21836794142.831 ) -| (Internal-sender-identifier: UCDL ) -| (Internal-sender-description: UC Davis Libraries ) -| -| bagit.txt -| (BagIt-version: 0.96 ) -| (Tag-File-Character-Encoding: UTF-8 ) -| -\--- data/ - | - | Collection Overview.txt - | (... narrative description ... ) - | - | Seed List.txt - | (... list of crawler starting point URLs ... ) - .... - -
- -
- -
@@ -935,8 +797,8 @@ mysecondbag/
-The paths specified in the payload manifest, tag manifest, and -"fetch.txt" file do not prohibit special directory characters which might be +The paths specified in the payload manifest and tag manifest +files do not prohibit special directory characters which might be significant on implementing systems. Implementors &should; take care that files outside the bag directory structure are not accessed when reading or writing files based on paths specified in a bag. @@ -944,34 +806,10 @@ writing files based on paths specified in a bag. For example, path characters such as ".." or "~" -in a maliciously crafted "fetch.txt" file might cause a naive implementation to +in a maliciously crafted "tagmanifest-md5.txt" file might cause a naive implementation to overwrite critical system files.
- -
- -Implementors of tools that complete bags by retrieving URLs listed in a -"fetch.txt" file need to be aware that some of those URLs may point to hosts, -intentionally or unintentionally, that are not under control of the bag's -sender. Checksums are intended as a reasonable guarantee against corruption -during transit, not a strong cryptographic protection against intentional -spoofing. - -
- -
- - -The size of files, as optionally reported in the "fetch.txt" file, cannot be -guaranteed to match the actual file size to be downloaded. Implementors &should; -take care to appropriately handle cases where the actual file size does not -match the file size reported in the fetch.txt. Implementors &should-not; use -the file size in the "fetch.txt" file for critical resource allocation, such as -buffer sizing or storage requisitioning. - -
-
@@ -997,11 +835,6 @@ large bag into several smaller bags. -Transmitting a whole bag in serialized form as a single file will tend -to be the most straightforward mode of transfer. When throughput is a -priority, use of "fetch.txt" lends itself to an easy, application-level -parallelism in which the list of URL-addressed items to fetch is divided -among multiple processes. The mechanics of sending and receiving bags over networks is otherwise out of scope of the present document and may be facilitated by protocols such as and . @@ -1013,7 +846,7 @@ such as and . This section is not part of the BagIt specification. It describes some -practical considerations for bag creators and receivers circa 2010. +practical considerations for bag creators and receivers circa 2016.
@@ -1025,11 +858,8 @@ package (md5sum, sha1sum, etc.), collectively referred to here as "md5sum". First, md5sum can be run in binary or text mode; text mode sometimes normalizes line-endings. While these modes appear to produce the same checksums under Unix-like systems, they -can produce different checksums under Windows. When using md5sum, it -may be safest to run it in binary mode, with one caveat: a side-effect -of binary mode is that md5sum requires a space and an asterisk ('*'), -compared to two spaces in text mode, between the CHECKSUM and FILENAME in -its manifest format. +can produce different checksums under Windows. When using md5sum, you +&should; always run it in binary mode. @@ -1054,7 +884,7 @@ backslashes inside FILENAME may be doubled. As specified above, only the Unix-based path separator ('/') may be -used inside filenames listed in BagIt manifests and "fetch.txt" files. +used inside filenames listed in BagIt manifest files. When bags are exchanged between Windows and Unix platforms, care should be taken to translate the path separator as needed. Receivers of bags on physical media should be prepared for filesystems created under either @@ -1157,322 +987,6 @@ This draft does not request any action from IANA. -
- - -(This appendix to be removed in the final draft.) - - -
- -Allowing tag directories. - - - -Fixed definition of valid. - - - -Clarified that tag files do not need to be text files. - - - -Clarified that repeatability and ordering of metadata elements in bag-info.txt. - - - -Clarified case of hex-encoding in manifests. - - -
- -
- -Re-replaced entity reference for current version number in artwork, -where it doesn't appear to work (xml2rfc bug?). -Updated to latest IETF Trust Legal Provisions 200902. (jak) - - - -Re-wording Tag File Format section. - - - -Adding new section for Other Tag Files. - - - -Minor clarification on the Fetch File description. - - - -Synchronized the language between the Payload Manifest and the Tag Manifest sections. - - - -Minor grammatical corrections and clarifications to the Payload Manifest section. - - - -Re-worded and re-ordered payload section and structure intro. Except for the base directory naming, the structure intro is strictly explanatory. - - - -Replaced current version number with entity reference. - - - -Move checksum algorithm information into its own section. - - - -Major re-wording of section on validity and completeness to provide -explicit, enumerated definitions for "valid", "complete", and "incomplete" bags. - - - -Added explicit wording about byte order marks (BOM) in UTF-8. - - - -Re-named section titles for better clarity. - - - -Re-wording security consideration on checksum purposes to more accurately -reflect the real purposes of the checksums. - - - -Major restructuring of the document for brevity and -precision. - - - -Added RFC 2119 language. - - - -Added terminology section. - - - -Cleaning up example artwork so that parenthesis are more consistently used. - - - -Explicitly stated version number required for comforming to the current -version of the specification. - - - -Various minor tweaks to grammar and wording. - -
- -
- -Re-worded interoperability statement in the Introduction. (Justin) - - - -Added statements regarding no limitations on various paths, URI, and other -lengths. - - - -Clarified that the bag directory may not contain any other directories except -for the "data" directory. - - - -A soel carriage return character is now explicitly allowed as a valid line -separator. - - - -Tag file encoding requirements are now required to be as-stated in the -"bagit.txt". The "bagit.txt" file is explicitly required to be in UTF-8. - - - -Wording cleanup, clarifying payload file manifests and tag file manifests. - - - -Tags in "bag-info.txt" no longer have any ordering requirement. - - - -Tag formatting now explicitly states where significant whitespace begins -in the tag. - - - -After some consideration, added some security considerations. - - - -Made it clear that a bag may contain other bags, re: serialization. - - - -Re-worded interoperabiilty to concerns to require creators to be -spec-compliant, and readers to be tolerant of known potential issues. - - - -Specificity to the FILENAME element in "fetch.txt" is relative to the bag -root, and to make sure to treat leading slashes as relative. - - - -Updated acknowledgements. - - - -Various other minor edits for clarity and readibility. - -
- -
- - -Added language to require the slash ('/') as path separator, -regardless of the platform where the bag was created. -Added an extra co-author and an Acknowledgements section. - - - -Deleted the unnecessary "(optional)" from four of the metadata elements, -since all metadata elements are optional. Softened the equivalence of -the serialization name and name of the contained bag base directory. -Replaced the reference to RFC2822 with an inline description of the -simpler bag-info.txt format. - - - -Changed to a variable linear whitespace separator in the description -of manifest layout and in manifest examples. -Added two paragraphs under a new "Checksum tools" subsection of the -Interoperability section to describe some of the peculiarities of -dealing with the widely used GNU Coreutils checksum tools. - - - -With the new version, 0.96, there is an important and incompatible change -of file name (package-info.txt -> bag-info.txt), metadata element names -(Package-Size -> Bag-Size, Packing-Date -> Bagging-Date), and -descriptive language to replace the noun "package" with "bag" throughout -the spec. This was to reduce unnecessary synonymy and free up the noun -"package" to name the physical container (e.g., a mailing carton) used to -transfer hard disks. - - - -In section 7, another important change is the introduction of the -Payload-Oxum ("octetstream sum") metadata element to convey precise, -machine-readable payload size information for capacity planning -(especially useful when preparing to receive files listed in fetch.txt). -The Bag-size definition was adjusted to steer it more towards human -consumption. - - - -In section 2.2 the spec now requires exactly two spaces between checksum -and filename in manifests. This results from the experience that as of -2008, not all widely available validation tools are flexible in the -kind of separating whitespace recognized. The examples have been -updated to include use the two-space form as well. 
- - - -Comment added that while overall bag size is unlimited, practical -limitations on the amount of data that a receiver can stage may -warrant splitting a large bag into several smaller bags. - - - -Added a reference to the SWORD protocol. - - - -Minor edits for scanning and reformatting to cut down line length for -some figures that exceeded 72 chars (limit for Internet-Drafts). - - -
- -
- - -Added mention of preserving empty directories. - - - -Simplified function of "tag checksum file" to "tag manifest", having same -format as payload manifest. The tag manifest is optional and need not -include every tag file. - - - -Loosened interpretation of payload manifest to "union" concept: -every payload file must be listed in at least one manifest but -need not be listed in every manifest. - - - -Shortened the Introduction's first paragraph to be less duplicative -of text in the Abstract. - - - -Changed Delivery-Date to Packing-Date. - - - -Correctly sorted the author list and clarification of -deserialization wording. - - -
- -
- - -Author address corrections and miscellaneous stylistic edits. - - - -Added some mention of physical media-based transfers, preferred -characteristics of transfer filesystems, and network transfer issues. - - - -Added basic bag example early and changed the narrative to more clearly -delineate component files. - - - -Wording changes under fetch.txt, and note that fetch.txt will need to be -modified before bag return. - - - -Fixed checksum encoding reference to base64 rather than hex. (B. Vargas) - - - -Described simple normalization approach for checksum algorithm names. (B. Vargas) - - - -In the example bag, add the ARC files found in the fetch.txt to the manifest as well (A. Turoff) - - -
- -
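Review note on PATCH 001: the patch tightens completeness so that every payload file is listed in every payload manifest, and it keeps Payload-Oxum as the one defined "bag-info.txt" element. The Python sketch below illustrates both checks under stated assumptions. The function names are hypothetical, percent-encoded filenames are not decoded, and the Payload-Oxum form is assumed to be "OctetCount.StreamCount" (total octets, then number of payload files), as in the removed example value 21836794142.831.

```python
import re
from pathlib import Path

# CHECKSUM and FILENAME are separated by one or more spaces or tabs.
MANIFEST_LINE = re.compile(r"^(?P<checksum>[0-9A-Fa-f]+)[ \t]+(?P<filename>.+)$")


def parse_manifest(text):
    """Return {filename: lowercase checksum} for one payload manifest."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines (an assumption, not a spec rule)
        match = MANIFEST_LINE.match(line)
        if match is None:
            raise ValueError(f"malformed manifest line: {line!r}")
        # Hex checksums may be upper or lower case, so normalize for comparison.
        entries[match["filename"]] = match["checksum"].lower()
    return entries


def missing_from_manifests(payload_files, manifests):
    """Map each manifest name to the payload files it fails to list."""
    payload = set(payload_files)
    return {
        name: sorted(payload - set(entries))
        for name, entries in manifests.items()
        if payload - set(entries)
    }


def payload_oxum(bag_dir):
    """Compute 'OctetCount.StreamCount' over the files under data/."""
    files = [p for p in Path(bag_dir, "data").rglob("*") if p.is_file()]
    return f"{sum(p.stat().st_size for p in files)}.{len(files)}"
```

An empty result from `missing_from_manifests` means the "listed in every payload manifest" condition holds. Comparing `payload_oxum` with the declared value is only a quick sanity check and does not replace checksum verification.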
From 8b266faa52c568c4328e4a0bbcc27029dece41d8 Mon Sep 17 00:00:00 2001 From: John Scancella Date: Tue, 6 Dec 2016 11:40:39 -0500 Subject: [PATCH 002/144] updated ed summers, and justin littman addresses --- bagit.xml | 14 ++++++-------- 1 file changed, 6 insertions(+), 8 deletions(-) diff --git a/bagit.xml b/bagit.xml index 0b1c8b79..0d82bde6 100644 --- a/bagit.xml +++ b/bagit.xml @@ -95,12 +95,11 @@ Change "algorithm" (crypto hash) to "function"? (Steffen Fritz email to jak) George Washington University Libraries -
- 101 Independence Avenue SE + 2130 H St NW Washington DC - 20540 + 20052 USA jlit@loc.gov @@ -112,12 +111,11 @@ Change "algorithm" (crypto hash) to "function"? (Steffen Fritz email to jak) University of Maryland -
- 101 Independence Avenue SE - Washington DC - 20540 + 4130 Campus Drive + College Park MD + 20742 USA ehs@pobox.com @@ -185,7 +183,7 @@ Change "algorithm" (crypto hash) to "function"? (Steffen Fritz email to jak)
- + From 988036bb41faeddee7a401a728e503d8aed8846d Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Fri, 9 Dec 2016 13:01:34 -0500 Subject: [PATCH 003/144] Remove legacy comment about old DNS results for xml.resource.org --- bagit.xml | 3 --- 1 file changed, 3 deletions(-) diff --git a/bagit.xml b/bagit.xml index 0d82bde6..49d8eb67 100644 --- a/bagit.xml +++ b/bagit.xml @@ -15,9 +15,6 @@ Change "algorithm" (crypto hash) to "function"? (Steffen Fritz email to jak) - From 5b1cc8745d1fd030a50f5ed8108dbf760e2e3430 Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Fri, 9 Dec 2016 13:02:09 -0500 Subject: [PATCH 004/144] Trim trailing whitespace --- bagit.xml | 141 +++++++++++++++++++++++++++--------------------------- 1 file changed, 70 insertions(+), 71 deletions(-) diff --git a/bagit.xml b/bagit.xml index 49d8eb67..08806551 100644 --- a/bagit.xml +++ b/bagit.xml @@ -7,7 +7,6 @@ Change "algorithm" (crypto hash) to "function"? (Steffen Fritz email to jak) --> - @@ -20,7 +19,7 @@ Change "algorithm" (crypto hash) to "function"? (Steffen Fritz email to jak) - + @@ -32,9 +31,9 @@ Change "algorithm" (crypto hash) to "function"? (Steffen Fritz email to jak) - + - + ]> @@ -58,8 +57,8 @@ Change "algorithm" (crypto hash) to "function"? (Steffen Fritz email to jak) The BagIt File Packaging Format (V&current-bagit-version;) - +
1438 Kingfisher Way @@ -71,10 +70,10 @@ Change "algorithm" (crypto hash) to "function"? (Steffen Fritz email to jak)
- + - California Digital Library + California Digital Library
@@ -87,8 +86,8 @@ Change "algorithm" (crypto hash) to "function"? (Steffen Fritz email to jak)
- + George Washington University Libraries @@ -103,8 +102,8 @@ Change "algorithm" (crypto hash) to "function"? (Steffen Fritz email to jak)
- + University of Maryland @@ -119,8 +118,8 @@ Change "algorithm" (crypto hash) to "function"? (Steffen Fritz email to jak) - + Library of Congress @@ -135,8 +134,8 @@ Change "algorithm" (crypto hash) to "function"? (Steffen Fritz email to jak) - + Library of Congress @@ -151,8 +150,8 @@ Change "algorithm" (crypto hash) to "function"? (Steffen Fritz email to jak) - + Library of Congress @@ -167,8 +166,8 @@ Change "algorithm" (crypto hash) to "function"? (Steffen Fritz email to jak) - +
1354 Quincy St. NW @@ -215,12 +214,12 @@ and transfer of the bag. The name, BagIt, is inspired by the "enclose and depos Implementors of BagIt tools should consider interoperability between different platforms, operating systems, toolsets, and languages. Differences in path separators, newline characters, reserved -file names, and maximum path lengths are all possible barriers to +file names, and maximum path lengths are all possible barriers to moving bags between different systems. Discussion of these issues may be found in the Interoperability section of this document.
- +
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", @@ -231,10 +230,10 @@ document are to be interpreted as described in . An implementation is not compliant if it fails to satisfy one or more of the MUST or REQUIRED level requirements for the protocols -it implements. +it implements.
- +
This specification uses a number of terms to describe BagIt, some @@ -250,19 +249,19 @@ in this section are intended to clarify any ambiguity. A set of opaque data contained within the structure defined by this specification. - + The tag file required to be in all bags conforming to this specification. Contains tags necessary for bootstrapping the reading and processing of the rest of a bag. See . - + - A reference to a cryptographic checksum algorithm, such as MD5 or - SHA-1, with its name normalized for use in a manifest or tag + A reference to a cryptographic checksum algorithm, such as MD5 or + SHA-1, with its name normalized for use in a manifest or tag manifest file name. See . - + A bag which comprises all elements required by this specification, with all files listed in all payload and tag manifests present, @@ -271,11 +270,11 @@ in this section are intended to clarify any ambiguity. - The data encapsulated by the bag. The contents of the payload - are opaque to this specification, and are always considered as a + The data encapsulated by the bag. The contents of the payload + are opaque to this specification, and are always considered as a set of octet streams. See . - + A bag that has been serialized into a single, monolithic file. See . @@ -292,7 +291,7 @@ in this section are intended to clarify any ambiguity. A complete bag wherein every checksum in every payload manifest and - tag manifest can be successfully verified against the corresponding + tag manifest can be successfully verified against the corresponding payload file. See . @@ -321,7 +320,7 @@ The tag files in the base directory consist of one or more files named (see ), and zero or more additional tag files (see ). The tag files in the optional tag directories are arbitrary file hierarchies and the tag directories -&may; have any name that is not reserved for a file or directory in this specification. +&may; have any name that is not reserved for a file or directory in this specification. 
@@ -355,7 +354,7 @@ Tag-File-Character-Encoding: UTF-8 where M.N identifies the BagIt major (M) and minor (N) version numbers, and UTF-8 identifies the character set encoding of tag files. The bag -declaration &must; be encoded in UTF-8, and &must-not; contain a byte-order +declaration &must; be encoded in UTF-8, and &must-not; contain a byte-order mark (BOM). @@ -389,7 +388,7 @@ specification, and is not otherwise prescribed. A payload manifest is a tag file that lists payload files and checksums for those payload files generated using a particular bag checksum algorithm. Every bag &must; contain one payload manifest file, and &may; contain -more than one. If there is more than one, each payload manifest +more than one. If there is more than one, each payload manifest file must list all payload files. A payload manifest file &must; have a name of the form manifest-algorithm.txt, where algorithm is a string specifying @@ -416,25 +415,25 @@ CHECKSUM FILENAME -where FILENAME is the pathname of a file relative to the base directory, +where FILENAME is the pathname of a file relative to the base directory, and CHECKSUM is a hex-encoded checksum calculated according to algorithm over every octet in the file. The hex-encoded checksum &may; use uppercase and/or lowercase letters. The slash character ('/') &must; be used as a path separator in FILENAME. One or more linear whitespace characters (spaces or tabs) &must; separate -CHECKSUM from FILENAME. There is no limitation on the length of a pathname. +CHECKSUM from FILENAME. There is no limitation on the length of a pathname. The payload manifest &must-not; reference files outside the payload directory. If a FILENAME includes a newline (LF), a carriage return (CR), or carriage -return plus newline (CRLF) it &must; be percent-encoded +return plus newline (CRLF) it &must; be percent-encoded . Payload manifests only include the pathnames of files. Because of this, -a payload manifest cannot reference empty directories. 
To account for -an empty directory, a bag creator may wish to include at least one file -in that directory; it suffices, for example, to include a zero-length +a payload manifest cannot reference empty directories. To account for +an empty directory, a bag creator may wish to include at least one file +in that directory; it suffices, for example, to include a zero-length file named ".keep".
@@ -479,7 +478,7 @@ describing the bag and the payload. The metadata elements contained in the "bag-info.txt" file are intended primarily for human readability. All metadata elements are optional and &may; be repeated. Implementations &must; assume that the ordering is significant and provide access to the -metadata elements in the order they are given in the "bag-info.txt" file. +metadata elements in the order they are given in the "bag-info.txt" file.
A metadata element &must; consist of a label, a colon, and a value, @@ -547,8 +546,8 @@ purpose of checksum verification.
All tag files specifically described in this specification &must; adhere to -the text tag file format described below. Other tag files &may; adhere to -the text tag file format described below. +the text tag file format described below. Other tag files &may; adhere to +the text tag file format described below. Text tag files are line-oriented, and each line &must; be @@ -558,8 +557,8 @@ Text tag files &must; end in the extension ".txt". -In all text tag files except for the bag declaration file, text &must; be -encoded in the character encoding specified in the "bagit.txt" bag declaration +In all text tag files except for the bag declaration file, text &must; be +encoded in the character encoding specified in the "bagit.txt" bag declaration file. Text tag files except for the bag declaration file &may; include a byte-order mark (BOM) only if the specified encoding requires it for proper decoding. (Note that UTF-8 does not.) @@ -589,13 +588,13 @@ output format &must; be able to fit in the manifest format specified in -The name of the checksum algorithm &must; be normalized for use in the -manifest's filename by lowercasing the common name of the algorithm and +The name of the checksum algorithm &must; be normalized for use in the +manifest's filename by lowercasing the common name of the algorithm and removing all non-alphanumeric characters. -Implementors of tools that create and validate bags &should; support at +Implementors of tools that create and validate bags &should; support at least two widely implemented checksum algorithms: "md5" and "sha1" . @@ -622,7 +621,7 @@ attributes: -A bag is incomplete when it exhibits any of +A bag is incomplete when it exhibits any of the following exceptions to the attributes of a complete bag: @@ -642,7 +641,7 @@ attributes: The bag &must; be complete. 
Every CHECKSUM in every payload manifest and tag manifest - can be sucessfully verified against the contents of its + can be successfully verified against the contents of its corresponding FILENAME. @@ -654,8 +653,8 @@ ways a bag may be invalid are not covered by this specification. -Tag files that do not appear in a tag manifest can be modified, added -to, or removed from a bag without impacting the completeness or validity +Tag files that do not appear in a tag manifest can be modified, added +to, or removed from a bag without impacting the completeness or validity of the bag. @@ -664,7 +663,7 @@ of the bag.
-In some scenarios, it may be convenient to serialize the +In some scenarios, it may be convenient to serialize the bag's filesystem hierarchy (i.e., the base directory) into a single-file archive format such as TAR or ZIP (the serialization) and then later deserialize the serialization to recreate the filesystem hierarchy. @@ -708,7 +707,7 @@ as any other payload file. When serializing a bag, care must be taken to ensure that the archive format's restrictions on file naming, such as allowable characters, length, or character encoding, will support the -requirements of the systems on which it will be used. See +requirements of the systems on which it will be used. See . @@ -718,8 +717,8 @@ requirements of the systems on which it will be used. See
-This is the layout of a basic bag containing an image and a companion -OCR file. Lines of file content are shown in parentheses beneath the +This is the layout of a basic bag containing an image and a companion +OCR file. Lines of file content are shown in parentheses beneath the file name.
From 16258e730f6bd0302a8185a18cc01aeee05e1826 Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Fri, 9 Dec 2016 13:11:03 -0500 Subject: [PATCH 006/144] Strengthen guidance on UTF-8 BOMs in tag files This follows the guidelines in RFC-3629 --- bagit.xml | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/bagit.xml b/bagit.xml index 8a90cf91..f6e42bd0 100644 --- a/bagit.xml +++ b/bagit.xml @@ -559,9 +559,10 @@ Text tag files &must; end in the extension ".txt". In all text tag files except for the bag declaration file, text &must; be encoded in the character encoding specified in the "bagit.txt" bag declaration -file. Text tag files except for the bag declaration file &may; include a +file. Text tag files except for the bag declaration file &may; include a byte-order mark (BOM) only if the specified encoding requires it for -proper decoding. (Note that UTF-8 does not.) +proper decoding. In accordance with RFC 3629, when "bagit.txt" +specifies UTF-8 the tag files &must-not; begin with a byte-order mark (BOM). From cd901f1a0dda3295ba39167a3748e9dd713684a8 Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Fri, 9 Dec 2016 16:24:22 -0500 Subject: [PATCH 007/144] Recommend modern hash algorithms --- bagit.xml | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/bagit.xml b/bagit.xml index f6e42bd0..eccd868c 100644 --- a/bagit.xml +++ b/bagit.xml @@ -19,6 +19,7 @@ Change "algorithm" (crypto hash) to "function"? (Steffen Fritz email to jak) + @@ -595,9 +596,15 @@ removing all non-alphanumeric characters. -Implementors of tools that create and validate bags &should; support at -least two widely implemented checksum algorithms: "md5" - and "sha1" . + Bag creation and validation tools &must; support the SHA-2 family of + algorithms and &should; enable SHA-512 by default + when creating new bags. + + For backwards compatibility, implementors &should; support + MD-5 and SHA-1 .
+ + Implementors are encouraged to simplify the process of adding additional + manifests using new algorithms to streamline the process of in-place upgrades.
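The in-place upgrade encouraged above can be sketched in Python. This is an illustrative sketch, not part of the patch: the helper names `normalize_algorithm_name` and `write_payload_manifest` are hypothetical, and the code assumes an existing bag directory with a "data" payload directory. It applies the normalization rule quoted in this hunk (lowercase the common name and remove non-alphanumeric characters), then writes an additional payload manifest without touching the existing ones.

```python
import hashlib
import re
from pathlib import Path


def normalize_algorithm_name(common_name: str) -> str:
    """Lowercase the common name and drop non-alphanumeric characters."""
    return re.sub(r"[^a-z0-9]", "", common_name.lower())


def write_payload_manifest(bag_dir: Path, common_name: str = "SHA-512") -> Path:
    """Hypothetical helper: add manifest-<algorithm>.txt to an existing bag.

    Assumes hashlib knows the normalized name (true for md5, sha1, sha256,
    sha512). Percent-encoding of CR/LF in filenames is not handled here.
    """
    algorithm = normalize_algorithm_name(common_name)
    lines = []
    for path in sorted((bag_dir / "data").rglob("*")):
        if not path.is_file():
            continue
        digest = hashlib.new(algorithm)
        with path.open("rb") as handle:
            # Hash every octet of the file in 1 MiB chunks.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        relative = path.relative_to(bag_dir).as_posix()
        lines.append(f"{digest.hexdigest()} {relative}\n")
    manifest = bag_dir / f"manifest-{algorithm}.txt"
    manifest.write_text("".join(lines), encoding="utf-8")
    return manifest
```

A tool following this approach would probably also need to regenerate any tag manifests afterwards, since a new payload manifest is itself a tag file.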
@@ -929,9 +936,10 @@ This draft does not request any action from IANA. target="http://msdn2.microsoft.com/en-us/library/aa365247.aspx" /> - &rfc1321; &rfc2119; + &rfc1321; &rfc3174; + &rfc6234; &rfc3629; &rfc3986; From 6673683d31473318b7d03f803b44336fc6f31319 Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Fri, 9 Dec 2016 17:00:38 -0500 Subject: [PATCH 008/144] Update recommendations related to Windows filenaming * Note the existence of namespaces in the security considerations section * Update previously un-displayed list of reserved DOS/Windows filenames --- bagit.xml | 55 ++++++++++++++++++++++++++++++++++++++----------------- 1 file changed, 38 insertions(+), 17 deletions(-) diff --git a/bagit.xml b/bagit.xml index eccd868c..ed3d0707 100644 --- a/bagit.xml +++ b/bagit.xml @@ -765,19 +765,26 @@ myfirstbag/
-The paths specified in the payload manifest, and tag manifest -file do not prohibit special directory characters which might be -significant on implementing systems. Implementors &should; take care that -files outside the bag directory structure are not accessed when reading or -writing files based on paths specified in a bag. +The paths specified in the payload manifest, and tag manifest file do not +prohibit special directory characters which might be significant on +implementing systems. Implementors &must; ensure that files outside the bag +directory structure are not accessed when reading or writing files based on +paths specified in a bag. -For example, path characters such as ".." or "~" -in a maliciously crafted "tagmanifest-md5.txt" file might cause a naive implementation to -overwrite critical system files. +For example, a maliciously crafted "tagmanifest-md5.txt" file might +contain entries which begin with a path character such as "/", "..", +or a "~username" home directory reference in an attempt to cause a +naive implementation to leak or overwrite targeted files. -
+ + + Windows implementations &should; test their implementations to ensure + that safety-checks prevent use of drive letters and the less commonly used + namespace sequences (e.g. "\\?\C:\…") described in . + +
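The containment requirement in this hunk can be illustrated with a short Python sketch. It is not part of the patch, and the function name `is_safe_manifest_path` is hypothetical. It is deliberately conservative: it also rejects some names that would be legal on a single platform (for example a file whose name really begins with "~"), on the assumption that a general-purpose tool prefers a false alarm to a path escape.

```python
from pathlib import PurePosixPath, PureWindowsPath


def is_safe_manifest_path(entry: str) -> bool:
    """Return True only if a manifest FILENAME cannot leave the bag directory."""
    # Reject empty entries and "~username" home directory references.
    if not entry or entry.startswith("~"):
        return False
    # Interpreting the entry as a Windows path exposes drive letters, UNC
    # prefixes and namespace sequences such as \\?\C:\ as a drive or root.
    windows_view = PureWindowsPath(entry)
    if windows_view.drive or windows_view.root:
        return False
    posix_view = PurePosixPath(entry)
    if posix_view.is_absolute():
        return False
    # Reject parent-directory references under either separator convention.
    return ".." not in posix_view.parts and ".." not in windows_view.parts
```

A purely lexical check like this does not account for symbolic links inside the bag, so an implementation would likely also want to resolve the final path and confirm that it is still beneath the base directory.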
@@ -858,22 +865,36 @@ be taken to translate the path separator as needed. Receivers of bags on physical media should be prepared for filesystems created under either Windows or Unix. Besides the fundamental difference between path separators ('\' and '/'), generally, Windows filesystems have more -limitations than Unix filesystems. Windows path names have a maximum of -255 characters, and none of these characters may be used in a path -component: +limitations than Unix filesystems. +
+ + Windows path names have a maximum of + 255 characters, and none of these characters may be used in a path + component: + + < > : " / | ? *
-Windows also reserves the following names: CON, PRN, AUX, NUL, COM1, -COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3, LPT4, -LPT5, LPT6, LPT7, LPT8, and LPT9. See for more -information. - +
+ + Windows also reserves the following names, with or without a file extension: + + + + CON, PRN, AUX, NUL + COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9 + LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, LPT9 + +
+ + See for more information and possible alternatives. +
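As an illustration of the Windows restrictions listed in this hunk, a bag creator could screen payload paths before writing them. The Python sketch below is not part of the patch; `windows_portability_problems` is a hypothetical name, and the limits and lists are taken directly from the text above (255 characters, the listed characters, the reserved names with or without an extension).

```python
from typing import List

RESERVED_NAMES = {"CON", "PRN", "AUX", "NUL"} | {
    f"{prefix}{number}" for prefix in ("COM", "LPT") for number in range(1, 10)
}
# "/" is the BagIt path separator, so it is handled by splitting on it.
FORBIDDEN_CHARACTERS = set('<>:"|?*')
MAX_PATH_LENGTH = 255


def windows_portability_problems(path: str) -> List[str]:
    """Return human-readable reasons why a '/'-separated path may fail on Windows."""
    problems = []
    if len(path) > MAX_PATH_LENGTH:
        problems.append(f"path is longer than {MAX_PATH_LENGTH} characters")
    for component in path.split("/"):
        bad = sorted(FORBIDDEN_CHARACTERS.intersection(component))
        if bad:
            problems.append(f"{component!r} contains forbidden characters: {' '.join(bad)}")
        # Reserved names apply with or without a file extension.
        stem = component.split(".", 1)[0].upper()
        if stem in RESERVED_NAMES:
            problems.append(f"{component!r} uses the reserved name {stem}")
    return problems
```

Returning a list of problems rather than a boolean lets a tool report them as warnings, which fits the advisory nature of this section.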
From 1da8da7c0f7bbcf1dce1316f5dae9daa81a6c5ad Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Fri, 9 Dec 2016 17:06:14 -0500 Subject: [PATCH 009/144] Update interoperability disclaimer Clarify that this section is part of the specification but is not considered a hard requirement for an implementation. --- bagit.xml | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/bagit.xml b/bagit.xml index ed3d0707..1f841e18 100644 --- a/bagit.xml +++ b/bagit.xml @@ -820,8 +820,9 @@ such as and .
-This section is not part of the BagIt specification. It describes some -practical considerations for bag creators and receivers circa 2016. +This section lists practical considerations for implementors and users. None of +the points below are required but they are recommended for general-purpose +usage.
From 766324f25f3b51f970e52385e134f62e1ab0f914 Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Fri, 9 Dec 2016 17:28:35 -0500 Subject: [PATCH 010/144] Discourage the use of manual bag creation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Update the section describing md5sum’s output format and clarify that it is strictly optional to accept bags which are produced using md5sum and will not pass a strict validation. --- bagit.xml | 66 +++++++++++++++++++++++++++++-------------------------- 1 file changed, 35 insertions(+), 31 deletions(-) diff --git a/bagit.xml b/bagit.xml index 1f841e18..01e0f2a9 100644 --- a/bagit.xml +++ b/bagit.xml @@ -825,37 +825,6 @@ the points below are required but they are recommended for general-purpose usage. -
- - -Some cautions regarding bag interchange arise in regard to the -commonly available checksum tools distributed with the GNU Coreutils -package (md5sum, sha1sum, etc.), collectively referred to here as -"md5sum". First, md5sum can be run in binary or text -mode; text mode sometimes normalizes line-endings. While these -modes appear to produce the same checksums under Unix-like systems, they -can produce different checksums under Windows. When using md5sum, you -&should; always run it in binary mode. - - - -Due to the widespread use of md5sum (and its relatives), it is not -unexpected for bag receivers to see manifests in which CHECKSUM and -FILENAME are separated by a space followed by an asterisk. Implementors -creating or processing bags with md5sum should be aware of these subtle -differences, and ensure compliance with the manifest specification in this -document. Implementors creating and processing bags with other tools may wish -to be tolerant of asterisks found in the manifests. - - -A final note about md5sum-generated manifests is that for a -FILENAME containing a backslash ('\'), the manifest line will have a -backslash inserted in front of the CHECKSUM and, under Windows, the -backslashes inside FILENAME may be doubled. - - -
-
@@ -898,6 +867,41 @@ limitations than Unix filesystems.
+
+ + +Some bags have been manually assembled using checksum utilities such as those +contained in the GNU Coreutils package (md5sum, sha1sum, etc.), collectively +referred to here as "md5sum". Implementors who desire wide support of legacy +content should be aware of some known quirks of these tools: + + + +md5sum can be run in “text mode”, which causes it to normalize line-endings +on some operating systems. On Unix-like systems both modes will usually produce +the same results but on systems like Windows they may produce different results +based on the file contents. + +The md5sum output format has two characters between the checksum and the +filename: the first is always a space and the second is an asterisk ("*") for +binary mode and a space for text mode. + + + +A final note about md5sum-generated manifests is that for a FILENAME containing +a backslash ('\'), the manifest line will have a backslash inserted in front of +the CHECKSUM and, under Windows, the backslashes inside FILENAME may be doubled. + + + +Implementors &may; wish to accept this format by ignoring a leading asterisk or +handling differences in line termination gracefully but, if so, implementations +&must; warn the user that the bag in question will fail strict validation. In +such cases it is strongly encouraged that tools provide an easy option to +update the bag with valid manifests. + +
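The tolerant handling described above might look like the following Python sketch. It is not part of the patch: `parse_manifest_line` is a hypothetical name, and the third return value is an assumed way of signalling that a warning about strict validation is due. One known ambiguity is that a filename that really begins with "*" cannot be told apart from the md5sum binary-mode marker.

```python
import re
from typing import Tuple

_ENTRY = re.compile(r"^([0-9A-Fa-f]+)[ \t]+(.+)$")


def parse_manifest_line(line: str) -> Tuple[str, str, bool]:
    """Return (checksum, filename, strict) for one manifest line.

    strict is False when an md5sum quirk had to be tolerated, in which case
    the caller is expected to warn that the bag fails strict validation.
    """
    line = line.rstrip("\r\n")
    # md5sum prefixes the line with a backslash when the filename is escaped.
    escaped = line.startswith("\\")
    if escaped:
        line = line[1:]
    match = _ENTRY.match(line)
    if match is None:
        raise ValueError(f"not a manifest entry: {line!r}")
    checksum, filename = match.groups()
    # In binary mode md5sum writes "*" immediately before the filename.
    binary_marker = filename.startswith("*")
    if binary_marker:
        filename = filename[1:]
    if escaped:
        # Undo md5sum escaping: "\\" becomes "\" and "\n" becomes a newline.
        filename = re.sub(
            r"\\(.)",
            lambda m: "\n" if m.group(1) == "n" else m.group(1),
            filename,
        )
    return checksum, filename, not (escaped or binary_marker)
```

A validator built on this sketch would collect the non-strict lines and report them together, optionally offering to rewrite the manifest in the strict format.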
+
From f898aff4ee89c441ee6931f708d942551ad549a4 Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Fri, 9 Dec 2016 19:11:45 -0500 Subject: [PATCH 011/144] Add filename-related normalization discussion This adds background information for problems related to case-sensitivity and Unicode normalization and adds a list of recommendations for implementors. --- bagit.xml | 153 +++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 152 insertions(+), 1 deletion(-) diff --git a/bagit.xml b/bagit.xml index 01e0f2a9..75b4f70d 100644 --- a/bagit.xml +++ b/bagit.xml @@ -825,7 +825,141 @@ the points below are required but they are recommended for general-purpose usage.
-
+
+ + + This section provides background information on various challenges caused by + differences in how operating systems, filesystems, and common tools handle + filenames followed by a list of recommendations for implementors in + . + + +
+ + There are two challenges for interoperability related to filename case: + + + Filesystems such as FAT or exFAT always convert filenames to uppercase: + "example.txt" will be stored as "EXAMPLE.TXT" + + + Many Unix filesystems save filenames exactly as provided, allowing + multiple files which differ only in case: "example.txt" and + "Example.txt" are separate files + + + NTFS and HFS+ usually preserve case when storing files but are + case-insensitive when retrieving them. A file saved as "Example.txt" + will be retrieved by that name but will also be retrieved as + "EXAMPLE.TXT", "example.txt", etc. + + + +
+ +
+ + +The Unicode specification has common cases where different character sequences +produce the same human-meaningful text. These are referred to as “canonically +equivalent” and the Unicode specification defines different normalization +forms — see for the full details and a brief +example below: + + +
+ + The common surname "Núñez" normalized in different forms + + + +
+ + + Unicode normalization is relevant to BagIt implementors because different + systems have different standards for normalization: + + + + Apple's HFS Plus filesystem always normalizes filenames to a + fully-decomposed form based on the Unicode 2.0 specification (see ). + + + Windows treats filenames as opaque character sequences (see ) and will store and return the encoded bytes exactly + as provided. + + + Linux and other common Unix systems are generally similar to Windows in + storing and returning opaque byte streams but this behaviour is + technically filesystem-dependent. + + + Utilities used for file management, transfer, and archival may ignore this + issue, apply an arbitrary normalization form, or allow the user to control + how normalization is applied. + + + + + + In practice, this means that the encoded filename stored in a manifest may + fail a simple file existence check because the filename's normalization was + changed at some point after the manifest was written. This situation is very + confusing for users because the filenames are visually indistinguishable and + the “missing” file is obviously present in the payload directory. + +
+ +
+ + + + Implementations &should; discourage the creation of bags containing + files which differ only in case. + + + Implementations &must; prevent the creation of bags containing files + which differ only in normalization form. + + + BagIt implementations &should; tolerate differences in normalization + form by comparing the lists of filesystem names and manifest names after + applying the same normalization form to both. + + + Implementations &should; issue a warning when multiple manifests are + present which differ only in case or normalization form. + + + +
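The recommendations above can be illustrated with Python's standard `unicodedata` module. The sketch is not part of the patch; the function names are hypothetical, and the choice of NFC as the common form is an assumption (any single form works as long as it is applied to both sides).

```python
import unicodedata
from collections import defaultdict
from typing import Dict, Iterable, List


def find_collisions(names: Iterable[str]) -> Dict[str, List[str]]:
    """Group names that differ only in case or normalization form."""
    groups = defaultdict(set)
    for name in names:
        key = unicodedata.normalize("NFC", name).casefold()
        groups[key].add(name)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}


def missing_after_normalization(
    manifest_names: Iterable[str],
    filesystem_names: Iterable[str],
    form: str = "NFC",
) -> List[str]:
    """Return manifest names with no filesystem match once both are normalized."""
    on_disk = {unicodedata.normalize(form, name) for name in filesystem_names}
    return [
        name
        for name in manifest_names
        if unicodedata.normalize(form, name) not in on_disk
    ]
```

For example, a manifest entry written as precomposed "N\u00fa\u00f1ez.txt" and a file stored by HFS Plus as decomposed "Nu\u0301n\u0303ez.txt" compare unequal as raw strings, but `missing_after_normalization` treats them as the same name.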
+
+ +
As specified above, only the Unix-based path separator ('/') may be @@ -980,6 +1114,23 @@ This draft does not request any action from IANA. target="http://www.ukoln.ac.uk/repositories/digirep/index/SWORD" /> + + + Unicode® Standard Annex #15: Unicode Normalization Forms + + + + + + + + + Technical Note TN1150: HFS Plus Volume Format + + + + From 67c81c8f4189ed3fbeec5ad0d91a29879fdf3002 Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Mon, 12 Dec 2016 11:39:11 -0500 Subject: [PATCH 012/144] Update "Payload-Oxum" documentation This adds the note that, unlike other metadata tags, this element must not be repeated and clarifies that the Payload-Oxum value is not sufficient for validation. --- bagit.xml | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/bagit.xml b/bagit.xml index 75b4f70d..f8525597 100644 --- a/bagit.xml +++ b/bagit.xml @@ -490,17 +490,18 @@ or carriage return plus newline (CRLF) and indenting the next line with linear white space (spaces or tabs). -A implementation &may; assume the metadata element Payload-Oxum exists -for the purpose of quickly verifying the bag. If it does exist it &must; -be in the form as defined as follows: +An implementation &should; add the optional "Payload-Oxum" element for the +purpose of quickly detecting incomplete bags before performing checksum +validation. This is strictly an optimization and implementations &must; perform +the standard checksum validation process before proclaiming a bag to be valid. +This element &must-not; be present more than once and, if present, &must; +conform to this format: - The "octetstream sum" of the payload, namely, a two-part number - of the form "OctetCount.StreamCount", where OctetCount is the - total number of octets (8-bit bytes) across all payload file content - and StreamCount is the total number of payload files. Payload-Oxum - should be included in "bag-info.txt" if at all - possible. 
Compared to Bag-Size (above), Payload-Oxum is + The "octet-stream sum" of the payload, namely, a two-part number of the + form "OctetCount.StreamCount", where OctetCount is the total number of + octets (8-bit bytes) across all payload file content and StreamCount is the + total number of payload files. Compared to "Bag-Size", "Payload-Oxum" is intended for machine consumption. From 0361e96a7b49b67d40bae24dc129196405580a6a Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Tue, 13 Dec 2016 10:00:08 -0500 Subject: [PATCH 013/144] Consistent number of spaces after a period --- bagit.xml | 72 +++++++++++++++++++++++++++---------------------------- 1 file changed, 36 insertions(+), 36 deletions(-) diff --git a/bagit.xml b/bagit.xml index f8525597..fd69a944 100644 --- a/bagit.xml +++ b/bagit.xml @@ -186,9 +186,9 @@ Change "algorithm" (crypto hash) to "function"? (Steffen Fritz email to jak) This document specifies BagIt, a hierarchical file packaging format for -storage and transfer of arbitrary digital content. A "bag" has just enough +storage and transfer of arbitrary digital content. A "bag" has just enough structure to enclose descriptive "tags" and a "payload" but -does not require knowledge of the payload's internal semantics. This +does not require knowledge of the payload's internal semantics. This BagIt format should be suitable for disk-based or network-based storage and transfer. @@ -203,10 +203,10 @@ transfer. BagIt is a hierarchical file packaging format designed to support disk-based or network-based storage and transfer of arbitrary digital -content. A bag consists of a "payload" and "tags". The content of the payload +content. A bag consists of a "payload" and "tags". The content of the payload is the custodial focus of the bag and is treated as semantically opaque. The "tags" are metadata files intended to facilitate and document the storage -and transfer of the bag. The name, BagIt, is inspired by the "enclose and deposit" method +and transfer of the bag. 
The name, BagIt, is inspired by the "enclose and deposit" method , sometimes referred to as "bag it and tag it". @@ -216,7 +216,7 @@ Implementors of BagIt tools should consider interoperability between different platforms, operating systems, toolsets, and languages. Differences in path separators, newline characters, reserved file names, and maximum path lengths are all possible barriers to -moving bags between different systems. Discussion of these issues may be +moving bags between different systems. Discussion of these issues may be found in the Interoperability section of this document.
@@ -240,7 +240,7 @@ it implements. This specification uses a number of terms to describe BagIt, some of which are in common use, some of which are newly defined by this specification, and others which may have meanings obvious only -to those in the community from which this spec arose. Terms defined +to those in the community from which this spec arose. Terms defined in this section are intended to clarify any ambiguity. @@ -253,31 +253,31 @@ in this section are intended to clarify any ambiguity. The tag file required to be in all bags conforming to this - specification. Contains tags necessary for bootstrapping the - reading and processing of the rest of a bag. See . + specification. Contains tags necessary for bootstrapping the + reading and processing of the rest of a bag. See . A reference to a cryptographic checksum algorithm, such as MD5 or SHA-1, with its name normalized for use in a manifest or tag - manifest file name. See . + manifest file name. See . A bag which comprises all elements required by this specification, with all files listed in all payload and tag manifests present, - all payload files present listed in all manifests. See + all payload files present listed in all manifests. See . - The data encapsulated by the bag. The contents of the payload + The data encapsulated by the bag. The contents of the payload are opaque to this specification, and are always considered as a - set of octet streams. See . + set of octet streams. See . - A bag that has been serialized into a single, monolithic file. See + A bag that has been serialized into a single, monolithic file. See . @@ -293,7 +293,7 @@ in this section are intended to clarify any ambiguity. A complete bag wherein every checksum in every payload manifest and tag manifest can be successfully verified against the corresponding - payload file. See . + payload file. See . 
@@ -354,7 +354,7 @@ Tag-File-Character-Encoding: UTF-8 where M.N identifies the BagIt major (M) and minor (N) version numbers, -and UTF-8 identifies the character set encoding of tag files. The bag +and UTF-8 identifies the character set encoding of tag files. The bag declaration &must; be encoded in UTF-8, and &must-not; contain a byte-order mark (BOM). @@ -431,8 +431,8 @@ return plus newline (CRLF) it &must; be percent-encoded -Payload manifests only include the pathnames of files. Because of this, -a payload manifest cannot reference empty directories. To account for +Payload manifests only include the pathnames of files. Because of this, +a payload manifest cannot reference empty directories. To account for an empty directory, a bag creator may wish to include at least one file in that directory; it suffices, for example, to include a zero-length file named ".keep". @@ -475,7 +475,7 @@ As a result, no FILENAME listed in a tag manifest begins "data/". The "bag-info.txt" file is a tag file that contains metadata elements -describing the bag and the payload. The metadata elements contained in +describing the bag and the payload. The metadata elements contained in the "bag-info.txt" file are intended primarily for human readability. All metadata elements are optional and &may; be repeated. Implementations &must; assume that the ordering is significant and provide access to the @@ -483,7 +483,7 @@ metadata elements in the order they are given in the "bag-info.txt" file. A metadata element &must; consist of a label, a colon, and a value, -each separated by optional whitespace. It is &recommended; that +each separated by optional whitespace. It is &recommended; that lines not exceed 79 characters in length. Long values may be continued onto the next line by inserting a newline (LF), a carriage return (CR), or carriage return plus newline (CRLF) and indenting the next line with @@ -583,7 +583,7 @@ The backslash character ('\', U+005C) escapes from special processing any
The payload manifest and tag manifests assert integrity of the payload -and tags in a bag using checksum algorithms. The operation +and tags in a bag using checksum algorithms. The operation of those algorithms, and the formatting of their output within a manifest file, are generally beyond the scope of this specification, except that the output format &must; be able to fit in the manifest format specified in @@ -657,7 +657,7 @@ attributes: If a bag is neither valid, complete, nor incomplete, it is -invalid. Definitions for the various +invalid. Definitions for the various ways a bag may be invalid are not covered by this specification. @@ -687,15 +687,15 @@ The top-level directory of a serialization &must; contain only one bag. The serialization &should; have the same name as the bag's base directory, -but &must; have an extension added to identify the format. For example, the +but &must; have an extension added to identify the format. For example, the receiver of "mybag.tar.gz" expects the corresponding base directory to be created as "mybag". A bag &must-not; be serialized from within its base directory, but from the parent of the base directory (where the base directory appears as an -entry). Thus, after a bag is deserialized in an empty directory, -a listing of that directory shows exactly one entry. For example, +entry). Thus, after a bag is deserialized in an empty directory, +a listing of that directory shows exactly one entry. For example, deserializing "mybag.zip" in an empty directory causes the creation of the base directory "mybag" and, beneath "mybag", the creation of all payload and tag files. @@ -703,9 +703,9 @@ all payload and tag files. The deserialization of a bag &must; produce a single base directory bag with the top-level structure as described in this specification without -requiring any additional un-archiving step. For example, after one +requiring any additional un-archiving step. 
For example, after one un-archiving step it would be an error for the "data/" directory to -appear as "data.tar.gz". TAR and ZIP files may appear inside the payload +appear as "data.tar.gz". TAR and ZIP files may appear inside the payload beneath the "data/" directory, where they would be treated as any other payload file. @@ -713,10 +713,10 @@ as any other payload file. -When serializing a bag, care must be taken to -ensure that the archive format's restrictions on file naming, such as allowable -characters, length, or character encoding, will support the -requirements of the systems on which it will be used. See +When serializing a bag, care must be taken to ensure that the archive format's +restrictions on file naming, such as allowable characters, length, or character +encoding, will support the requirements of the systems on which it will be +used. See . @@ -727,7 +727,7 @@ requirements of the systems on which it will be used. See This is the layout of a basic bag containing an image and a companion -OCR file. Lines of file content are shown in parentheses beneath the +OCR file. Lines of file content are shown in parentheses beneath the file name. @@ -642,7 +642,7 @@ the following exceptions to the attributes of a complete bag: -A valid bag must have the following +A valid bag &must; have the following attributes: From a8dc5dfd3f03de83ff82ba2f19aa90cb68e36544 Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Tue, 13 Dec 2016 10:01:27 -0500 Subject: [PATCH 015/144] Strengthen wording: all manifests MUST list all files --- bagit.xml | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/bagit.xml b/bagit.xml index 839ca842..516889f7 100644 --- a/bagit.xml +++ b/bagit.xml @@ -389,11 +389,11 @@ specification, and is not otherwise prescribed. A payload manifest is a tag file that lists payload files and checksums for those payload files generated using a particular bag checksum algorithm. 
Every bag &must; contain one payload manifest file, and &may; contain -more than one. If there is more than one, each payload manifest -file must list all payload files. A payload manifest file &must; -have a name of the form manifest-algorithm.txt, where -algorithm is a string specifying -the bag checksum algorithm used in that manifest, such as: +more than one. Every payload manifest &must; list every payload file. A payload +manifest file &must; have a name of the form manifest-algorithm.txt, where algorithm +is a string specifying the bag checksum algorithm used in that manifest, such +as:
From 8edbef1f4b2fa67c00d997a34223b59a1a7f7d30 Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Tue, 13 Dec 2016 14:54:41 -0500 Subject: [PATCH 016/144] Better references syntax * Use for relevant entries * Omit empty attributes --- bagit.xml | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/bagit.xml b/bagit.xml index 516889f7..f4a1b3ee 100644 --- a/bagit.xml +++ b/bagit.xml @@ -1069,7 +1069,7 @@ This draft does not request any action from IANA. A Collaboration Model between Archival Systems to Enhance the Reliability of Preservation by an Enclose-and-Deposit Method - + @@ -1090,7 +1090,7 @@ This draft does not request any action from IANA. target="http://msdn2.microsoft.com/en-us/library/aa365247.aspx"> Naming a File - + Microsoft, Inc. Simple Web-service Offering Repository Deposit (SWORD) - - + UKOLN/JISC CETIS + @@ -1119,7 +1119,7 @@ This draft does not request any action from IANA. target="http://www.unicode.org/reports/tr15/"> Unicode® Standard Annex #15: Unicode Normalization Forms - + Unicode Consortium @@ -1128,7 +1128,7 @@ This draft does not request any action from IANA. Technical Note TN1150: HFS Plus Volume Format - + Apple Inc. From 5e0f3c1b1bd109f77d931e005e8d2abe19be3e2a Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Tue, 13 Dec 2016 14:57:37 -0500 Subject: [PATCH 017/144] Update transfer recommendations * Remove reference to GRABIT since the spec is now returning HTTP 404 and there are no known public implementations. * Add METALINK (RFC 5854) as an alternative which supports mirrors and protocols such as BitTorrent. --- bagit.xml | 77 ++++++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 65 insertions(+), 12 deletions(-) diff --git a/bagit.xml b/bagit.xml index f4a1b3ee..d7aa9d9a 100644 --- a/bagit.xml +++ b/bagit.xml @@ -813,7 +813,7 @@ large bag into several smaller bags. 
The mechanics of sending and receiving bags over networks is otherwise out of scope of the present document and may be facilitated by protocols -such as and . +such as or .
@@ -1075,17 +1075,6 @@ This draft does not request any action from IANA. target="http://www.iwaw.net/05/papers/iwaw05-tabata.pdf" /> - - - The GrabIt File Exchange Protocol - - - - - - @@ -1115,6 +1104,70 @@ This draft does not request any action from IANA. target="http://www.ukoln.ac.uk/repositories/digirep/index/SWORD" /> + + + The Metalink Download Description Format + + +
+ + + Pompano Beach + FL + USA + + anthonybryan@gmail.com + http://www.METALINK.org +
+
+ + +
+ neil@nabber.org + http://www.nabber.org +
+
+ + +
+ + + + Shiga + Japan + + tatsuhiro.t@gmail.com + http://aria2.sourceforge.net +
+
+ + MirrorBrain +
+ + Venloer Str. 317 + Koeln + 50823 + DE + + +49 221 6778 333 8 + peter@poeml.de + http://mirrorbrain.org/~poeml/ +
+
+ + +
+ henrik@henriknordstrom.net + http://www.henriknordstrom.net/ +
+
+ +
+ +
+ From c8ef7c6d52cd5f16632a0223e4d0a5681e3b2722 Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Wed, 14 Dec 2016 15:25:25 -0500 Subject: [PATCH 018/144] Consistent indentation in Terminology --- bagit.xml | 28 ++++++++++++++-------------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/bagit.xml b/bagit.xml index d7aa9d9a..1001bce9 100644 --- a/bagit.xml +++ b/bagit.xml @@ -258,22 +258,22 @@ in this section are intended to clarify any ambiguity.
- A reference to a cryptographic checksum algorithm, such as MD5 or - SHA-1, with its name normalized for use in a manifest or tag - manifest file name. See . + A reference to a cryptographic checksum algorithm, such as MD5 or + SHA-1, with its name normalized for use in a manifest or tag + manifest file name. See . - - A bag which comprises all elements required by this specification, - with all files listed in all payload and tag manifests present, - all payload files present listed in all manifests. See - . - + + A bag which comprises all elements required by this specification, + with all files listed in all payload and tag manifests present, + all payload files present listed in all manifests. See + . + - The data encapsulated by the bag. The contents of the payload - are opaque to this specification, and are always considered as a - set of octet streams. See . + The data encapsulated by the bag. The contents of the payload + are opaque to this specification, and are always considered as a + set of octet streams. See . @@ -286,8 +286,8 @@ in this section are intended to clarify any ambiguity. - A file that contains metadata intended to facilitate and document - the storage and transfer of the bag. + A file that contains metadata intended to facilitate and document + the storage and transfer of the bag. From 900ce0ccf692deebcca5c31fbef69d7b360187d6 Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Wed, 14 Dec 2016 15:33:52 -0500 Subject: [PATCH 019/144] Clarify the format for tag manifest algorithm names --- bagit.xml | 28 +++++++++++++++++----------- 1 file changed, 17 insertions(+), 11 deletions(-) diff --git a/bagit.xml b/bagit.xml index 1001bce9..91b3bd94 100644 --- a/bagit.xml +++ b/bagit.xml @@ -315,13 +315,16 @@ and optional tag files; (2) a sub-directory named "data", called the payload directory; and (3) a set of optional tag directories. The payload files in the payload directory are an arbitrary file hierarchy (see ). 
+ The tag files in the base directory consist of one or more files named "manifest-algorithm.txt" -(see ), a file named "bagit.txt" -(see ), and zero or more additional tag -files (see ). The tag files in the -optional tag directories are arbitrary file hierarchies and the tag directories -&may; have any name that is not reserved for a file or directory in this specification. +(see and +), +a file named "bagit.txt" (see ), +and zero or more additional tag files (see +). The tag files in the optional tag +directories are arbitrary file hierarchies and the tag directories &may; have +any name that is not reserved for a file or directory in this specification. @@ -449,17 +452,20 @@ file named ".keep". A tag manifest is a tag file that lists other tag files and checksums for those tag files generated using a particular bag checksum algorithm. A bag &may; contain one or more tag manifests. -A tag manifest file &must; have a name of the form -"tagmanifest-algorithm.txt", where -algorithm is a string specifying -the bag checksum algorithm used in that manifest, such as: + +A tag manifest file &must; have a name of the form "tagmanifest-algorithm.txt", where algorithm +is a string following the format described in + specifying the bag checksum algorithm +used in that manifest.
- + Example tag manifest filenames: + tagmanifest-md5.txt tagmanifest-sha1.txt - +
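As an illustration of the naming rule this patch clarifies, the following Python sketch extracts the algorithm name from a payload manifest or tag manifest file name. The function name `manifest_algorithm` and the regular expression are assumptions made for this example and are not part of the specification; in particular, the pattern assumes algorithm names consist only of lowercase letters and digits, as in the "md5" and "sha1" examples.

```python
import re

# Hypothetical pattern for "manifest-<algorithm>.txt" and
# "tagmanifest-<algorithm>.txt". Assumption: algorithm names use only
# lowercase letters and digits (e.g. "md5", "sha1").
_MANIFEST_NAME = re.compile(r"(tag)?manifest-([a-z0-9]+)\.txt")


def manifest_algorithm(filename):
    """Return (is_tag_manifest, algorithm) for a manifest file name, or None."""
    match = _MANIFEST_NAME.fullmatch(filename)
    if match is None:
        return None
    return (match.group(1) is not None, match.group(2))
```

A caller might use the returned algorithm name to choose a hash function before reading the manifest body; file names that do not follow the pattern yield `None`.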
From e87dda81cd7f69b9daa5230112855cb8b4964fce Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Thu, 5 Jan 2017 14:06:39 -0500 Subject: [PATCH 020/144] Update interoperability reference --- bagit.xml | 15 +++++---------- 1 file changed, 5 insertions(+), 10 deletions(-) diff --git a/bagit.xml b/bagit.xml index 91b3bd94..68507b17 100644 --- a/bagit.xml +++ b/bagit.xml @@ -209,16 +209,6 @@ The "tags" are metadata files intended to facilitate and document the storage and transfer of the bag. The name, BagIt, is inspired by the "enclose and deposit" method , sometimes referred to as "bag it and tag it". - - - -Implementors of BagIt tools should consider interoperability -between different platforms, operating systems, toolsets, and languages. -Differences in path separators, newline characters, reserved -file names, and maximum path lengths are all possible barriers to -moving bags between different systems. Discussion of these issues may be -found in the Interoperability section of this document. -
@@ -233,6 +223,11 @@ An implementation is not compliant if it fails to satisfy one or more of the MUST or REQUIRED level requirements for the protocols it implements. + + + Implementors are strongly encouraged to review the interoperability + considerations described in . +
From 2f2ca4bc7f966efe0c124c486852b37b553f92ed Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Thu, 5 Jan 2017 14:10:26 -0500 Subject: [PATCH 021/144] Update terminology for bag checksum algorithms --- bagit.xml | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/bagit.xml b/bagit.xml index 68507b17..4ced7558 100644 --- a/bagit.xml +++ b/bagit.xml @@ -253,9 +253,9 @@ in this section are intended to clarify any ambiguity. - A reference to a cryptographic checksum algorithm, such as MD5 or - SHA-1, with its name normalized for use in a manifest or tag - manifest file name. See . + The name of a cryptographic checksum algorithm which has been normalized + for use in a manifest or tag manifest file name (e.g. “sha1”) in the + format described in . From b975a0ef4cea540ce2fd34b3636dca4f0fbf288b Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Thu, 5 Jan 2017 14:11:01 -0500 Subject: [PATCH 022/144] Consistent indentation for terminology list --- bagit.xml | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/bagit.xml b/bagit.xml index 4ced7558..47b49199 100644 --- a/bagit.xml +++ b/bagit.xml @@ -242,14 +242,14 @@ in this section are intended to clarify any ambiguity. - A set of opaque data contained within the structure defined - by this specification. + A set of opaque data contained within the structure defined + by this specification. - The tag file required to be in all bags conforming to this - specification. Contains tags necessary for bootstrapping the - reading and processing of the rest of a bag. See . + The tag file required to be in all bags conforming to this + specification. Contains tags necessary for bootstrapping the + reading and processing of the rest of a bag. See . @@ -258,12 +258,12 @@ in this section are intended to clarify any ambiguity. format described in . 
- + A bag which comprises all elements required by this specification, with all files listed in all payload and tag manifests present, all payload files present listed in all manifests. See . - + The data encapsulated by the bag. The contents of the payload @@ -272,12 +272,12 @@ in this section are intended to clarify any ambiguity. - A bag that has been serialized into a single, monolithic file. See - . + A bag that has been serialized into a single, monolithic file. See + . - A directory that contains one or more tag files. + A directory that contains one or more tag files. @@ -285,11 +285,11 @@ in this section are intended to clarify any ambiguity. the storage and transfer of the bag. - + A complete bag wherein every checksum in every payload manifest and tag manifest can be successfully verified against the corresponding payload file. See . - +
 From 37926c026303017d01be804db974ecc012c24a58 Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Thu, 5 Jan 2017 14:22:11 -0500 Subject: [PATCH 023/144] Terminology: simplify “tag file” definition MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- bagit.xml | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/bagit.xml b/bagit.xml index 47b49199..87bda172 100644 --- a/bagit.xml +++ b/bagit.xml @@ -281,8 +281,7 @@ in this section are intended to clarify any ambiguity. - A file that contains metadata intended to facilitate and document - the storage and transfer of the bag. + A file which contains the metadata required to validate the bag. From 5fbfdf663bae1a11635fc480adb2f0428e269e9f Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Thu, 5 Jan 2017 14:23:00 -0500 Subject: [PATCH 024/144] Terminology: simplify “valid” definition MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This wording is shorter and doesn’t distinguish between validation for payload and tag files. --- bagit.xml | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/bagit.xml b/bagit.xml index 87bda172..54e59703 100644 --- a/bagit.xml +++ b/bagit.xml @@ -285,9 +285,8 @@ in this section are intended to clarify any ambiguity. - A complete bag wherein every checksum in every payload manifest and - tag manifest can be successfully verified against the corresponding - payload file. See . + A complete bag where every checksum in every manifest has been + successfully verified against the corresponding file. 
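The simplified definition of "valid" above can be sketched in Python as a check of one manifest against the files it lists. This is a minimal sketch under stated assumptions, not a reference implementation: the function name `verify_manifest` is hypothetical, the algorithm name is assumed to be one that `hashlib.new` accepts, checksums are compared case-insensitively, each file is read into memory in one piece, and percent-encoded file names are not decoded.

```python
import hashlib
from pathlib import Path


def verify_manifest(bag_dir, manifest_name, algorithm):
    """Return the listed file names that are missing or fail verification."""
    bag = Path(bag_dir)
    failures = []
    text = (bag / manifest_name).read_text(encoding="utf-8")
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each line has the form "CHECKSUM FILENAME", separated by whitespace.
        parts = line.split(None, 1)
        if len(parts) != 2:
            raise ValueError(f"malformed manifest line: {line!r}")
        checksum, filename = parts
        target = bag / filename
        if not target.is_file():
            failures.append(filename)
            continue
        # Assumption: the algorithm name is known to hashlib (e.g. "sha256").
        digest = hashlib.new(algorithm, target.read_bytes()).hexdigest()
        # Assumption: hex checksums are compared case-insensitively.
        if digest != checksum.lower():
            failures.append(filename)
    return failures
```

Under this sketch, a bag would be treated as valid only when the function returns an empty list for every payload manifest and tag manifest it contains. A production implementation would also need to reject file names that escape the base directory, as the security considerations warn.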
From 8236209f98b538ecaa98a7d7e3db2d006019d6b4 Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Thu, 5 Jan 2017 14:28:03 -0500 Subject: [PATCH 025/144] Convert numbered list in Structure section to --- bagit.xml | 37 ++++++++++++++++++++++--------------- 1 file changed, 22 insertions(+), 15 deletions(-) diff --git a/bagit.xml b/bagit.xml index 54e59703..8f7e2b6d 100644 --- a/bagit.xml +++ b/bagit.xml @@ -303,21 +303,28 @@ in this section are intended to clarify any ambiguity.
-A bag consists of a base directory containing (1) a set of required -and optional tag files; (2) a sub-directory named "data", called the payload -directory; and (3) a set of optional tag directories. The payload files in the -payload directory are an arbitrary file hierarchy -(see ). - -The tag files in the base directory consist of one or more files named -"manifest-algorithm.txt" -(see and -), -a file named "bagit.txt" (see ), -and zero or more additional tag files (see -). The tag files in the optional tag -directories are arbitrary file hierarchies and the tag directories &may; have -any name that is not reserved for a file or directory in this specification. + A bag consists of a base directory containing: + + + a set of required and optional tag files + a sub-directory named "data", called the payload directory + a set of optional tag directories + + + The payload files in the payload directory are an arbitrary file hierarchy + (see ). + + The tag files in the base directory consist of one or more files named + "manifest-algorithm.txt" + (see and + ), + a file named "bagit.txt" (see ), + and zero or more additional tag files (see + ). + + The tag files in the optional tag directories are arbitrary file hierarchies + and the tag directories &may; have any name that is not reserved for a file + or directory in this specification. From 94bcdaaa310f41f53844ae2d25afcba2306b58cb Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Thu, 5 Jan 2017 14:36:44 -0500 Subject: [PATCH 026/144] Remove Serialization section The spec shouldn't need to include mechanistic transfer details: if the results validate, it's a bag. --- bagit.xml | 58 ------------------------------------------------------- 1 file changed, 58 deletions(-) diff --git a/bagit.xml b/bagit.xml index 8f7e2b6d..b9d31e3f 100644 --- a/bagit.xml +++ b/bagit.xml @@ -271,11 +271,6 @@ in this section are intended to clarify any ambiguity. set of octet streams. See . 
- - A bag that has been serialized into a single, monolithic file. See - . - - A directory that contains one or more tag files. @@ -675,59 +670,6 @@ of the bag.
-
- - -In some scenarios, it may be convenient to serialize the -bag's filesystem hierarchy (i.e., the base directory) into a -single-file archive format such as TAR or ZIP (the serialization) and then -later deserialize the serialization to recreate the filesystem hierarchy. -Several rules govern the serialization of a bag and apply equally -to all types of archive files: - - - - - -The top-level directory of a serialization &must; contain only one bag. - - -The serialization &should; have the same name as the bag's base directory, -but &must; have an extension added to identify the format. For example, the -receiver of "mybag.tar.gz" expects the corresponding base directory -to be created as "mybag". - - -A bag &must-not; be serialized from within its base directory, but from the -parent of the base directory (where the base directory appears as an -entry). Thus, after a bag is deserialized in an empty directory, -a listing of that directory shows exactly one entry. For example, -deserializing "mybag.zip" in an empty directory causes the creation -of the base directory "mybag" and, beneath "mybag", the creation of -all payload and tag files. - - -The deserialization of a bag &must; produce a single base directory -bag with the top-level structure as described in this specification without -requiring any additional un-archiving step. For example, after one -un-archiving step it would be an error for the "data/" directory to -appear as "data.tar.gz". TAR and ZIP files may appear inside the payload -beneath the "data/" directory, where they would be treated -as any other payload file. - - - - - -When serializing a bag, care must be taken to ensure that the archive format's -restrictions on file naming, such as allowable characters, length, or character -encoding, will support the requirements of the systems on which it will be -used. See -. - - -
-
From cd5d6fe253e8630b443f6c1a7910cc3857b21017 Mon Sep 17 00:00:00 2001 From: John Scancella Date: Thu, 5 Jan 2017 13:46:47 -0500 Subject: [PATCH 027/144] Cherry pick: first cut at changes from recommendations by Dave Crocker --- bagit.xml | 221 ++++++++++++++++++++++++++++-------------------------- 1 file changed, 113 insertions(+), 108 deletions(-) diff --git a/bagit.xml b/bagit.xml index b9d31e3f..973f7411 100644 --- a/bagit.xml +++ b/bagit.xml @@ -202,12 +202,15 @@ transfer.
BagIt is a hierarchical file packaging format designed to support -disk-based or network-based storage and transfer of arbitrary digital -content. A bag consists of a "payload" and "tags". The content of the payload -is the custodial focus of the bag and is treated as semantically opaque. -The "tags" are metadata files intended to facilitate and document the storage -and transfer of the bag. The name, BagIt, is inspired by the "enclose and deposit" method -, sometimes referred to as "bag it and tag it". +storage and transfer of arbitrary digital content (i.e. files). +A bag consists of a directory containing the payload files and other accompanying +files known as "tag" files. The "tags" are metadata files intended to facilitate +and document the storage and transfer of the bag. +BagIt does not require the processing of the entire structure to understand to +involve understanding the payload. The name, BagIt, is inspired by the "enclose +and deposit" method , sometimes referred to as "bag it and tag it". +This differs from other specifications (like ZIP) that only ensure the integrity +of the files in that BagIt also ensures the complete set of files are included.
@@ -218,12 +221,6 @@ The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", document are to be interpreted as described in . - -An implementation is not compliant if it fails to satisfy one or -more of the MUST or REQUIRED level requirements for the protocols -it implements. - - Implementors are strongly encouraged to review the interoperability considerations described in . @@ -234,22 +231,21 @@ it implements. This specification uses a number of terms to describe BagIt, some of which are in common use, some of which are newly defined by this -specification, and others which may have meanings obvious only -to those in the community from which this spec arose. Terms defined -in this section are intended to clarify any ambiguity. +specification. Terms defined in this section are intended to clarify +any ambiguity. - A set of opaque data contained within the structure defined - by this specification. + A order independent set of opaque files contained within the structure defined + by this specification. - The tag file required to be in all bags conforming to this - specification. Contains tags necessary for bootstrapping the - reading and processing of the rest of a bag. See . + The file required to be in all bags conforming to this + specification. Contains values necessary for bootstrapping the + reading and processing of the rest of a bag. See . @@ -258,17 +254,17 @@ in this section are intended to clarify any ambiguity. format described in . - + A bag which comprises all elements required by this specification, - with all files listed in all payload and tag manifests present, - all payload files present listed in all manifests. See + and all payload files are present. It also includes any optional files + specified. See . - + - The data encapsulated by the bag. The contents of the payload - are opaque to this specification, and are always considered as a - set of octet streams. See . + The data encapsulated by the bag. 
The contents of the payload + are opaque to this specification, and, with respect to BagIt processing, + are always considered as a set of uninterpreted octets. See . @@ -309,17 +305,15 @@ in this section are intended to clarify any ambiguity. The payload files in the payload directory are an arbitrary file hierarchy (see ). - The tag files in the base directory consist of one or more files named - "manifest-algorithm.txt" - (see and - ), - a file named "bagit.txt" (see ), - and zero or more additional tag files (see - ). - - The tag files in the optional tag directories are arbitrary file hierarchies - and the tag directories &may; have any name that is not reserved for a file - or directory in this specification. +The tag files in the base directory consist of one or more files named +"manifest-algorithm.txt" +(see and +), +a file named "bagit.txt" (see ), +and zero or more additional tag files (see +). The tag files and directories are +arbitrary file hierarchies and &may; have +any name that is not reserved for a file or directory in this specification. @@ -329,13 +323,20 @@ The base directory &may; have any name.
<base directory>/ - | bagit.txt - | manifest-<algorithm>.txt - | [optional additional tag files] - \--- data/ - | [payload files] - \--- [optional tag directories]/ - | [optional tag files] + | + +-- bagit.txt + | + +-- manifest-<algorithm>.txt + | + +-- [additional tag files] + | + +-- data/ + | | + | +-- [payload files] + | + +-- [optional tag directories]/ + | + +-- [optional tag files]
@@ -359,23 +360,22 @@ mark (BOM).
 -The appropriate version for a bag that conforms to -this version of the specification is "&current-bagit-version;". +The number for this version of the specification is "&current-bagit-version;". 
 -The base directory &must; contain a sub-directory named "data", called the -payload directory. +The base directory &must; contain a sub-directory named "data". -The payload directory contains the custodial content within the bag. +The payload directory contains the arbitrary digital content within the bag. The files under the payload directory are called payload files, or the payload. -The payload is treated as octet streams for all purposes relating to this -specification, and is not otherwise prescribed. +Each payload file is treated as an opaque octet stream for the purpose of verifying file correctness. +The payload directory structure may contain meaning and thus should be preserved, +but is otherwise ignored for purposes relating to this specification. 
@@ -384,8 +384,12 @@ specification, and is not otherwise prescribed. section on Tag Manifests. --> -A payload manifest is a tag file that lists payload files and checksums for those -payload files generated using a particular bag checksum algorithm. +A payload manifest file provides a complete listing of the files contained in the payload, +and a checksum for each payload file, to permit data integrity checking. + +A payload manifest is a tag file that lists payload file names and their checksums +generated using a checksum algorithm. + Every bag &must; contain one payload manifest file, and &may; contain more than one. Every payload manifest &must; list every payload file. A payload manifest file &must; have a name of the form manifest-A bag &must-not; contain more than one payload manifest for a particular bag checksum algorithm. + Each line of a payload manifest file &must; be of the form: @@ -425,14 +430,29 @@ The payload manifest &must-not; reference files outside the payload directory. I a FILENAME includes a newline (LF), a carriage return (CR), or carriage return plus newline (CRLF) it &must; be percent-encoded . + + +The ABNF form of this is: + + payload-manifest = checksum 1*WSP filename ending + checksum = 1*hex-val + hex-val = "x" 1*case-hexdig + [ 1*("." 1*case-hexdig) / ("-" 1*case-hexdig) ] + case-hexdig = DIGIT / "A" / "a" / "B" / "b" / "C" / "c" / "D" / "d" / "F" / "f" + case-hexdig = DIGIT / "A" / "B" / "C" / "D" / "F" + filename = ("data" "/" *( unreserved / pct-encoded / sub-delims )) + unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" + sub-delims = "!" / "$" / "&" / "'" / "(" / ")" + / "*" / "+" / "," / ";" / "=" + pct-encoded = "%" case-hexdig case-hexdig + ending = CR / LF / CRLF -Payload manifests only include the pathnames of files. Because of this, -a payload manifest cannot reference empty directories. To account for +A payload manifest &must-not; reference empty directories. 
 To account for an empty directory, a bag creator may wish to include at least one file -in that directory; it suffices, for example, to include a zero-length +in that directory; for example, a bag creator might include a zero-length file named ".keep". 
@@ -473,11 +493,10 @@ As a result, no FILENAME listed in a tag manifest begins "data/".
- The "bag-info.txt" file is a tag file that contains metadata elements describing the bag and the payload. The metadata elements contained in -the "bag-info.txt" file are intended primarily for human readability. +the "bag-info.txt" file are intended primarily for human use. All metadata elements are optional and &may; be repeated. Implementations &must; assume that the ordering is significant and provide access to the metadata elements in the order they are given in the "bag-info.txt" file. @@ -490,6 +509,13 @@ onto the next line by inserting a newline (LF), a carriage return (CR), or carriage return plus newline (CRLF) and indenting the next line with linear white space (spaces or tabs). +the ABNF form for this is: + + metadata = key ":" value ending + key = ALPHA / DIGIT + value = ([1*WSP] 1*[ALPHA / DIGIT]) + ending = CR / LF / CRLF + An implementation &should; add the optional "Payload-Oxum" element for the purpose of quickly detecting incomplete bags before performing checksum @@ -505,41 +531,42 @@ conform to this format: total number of payload files. Compared to "Bag-Size", "Payload-Oxum" is intended for machine consumption. - +The ABNF form for this is: + + payload-oxum = (1*DIGIT "." 1*DIGIT) + Here is an example "bag-info.txt" file. - +
 - Source-Organization: Spengler University - Organization-Address: 1400 Elm St., Cupertino, California, 95014 - Contact-Name: Edna Janssen - Contact-Phone: +1 408-555-1212 - Contact-Email: ej@spengler.edu + Source-Organization: FOO University + Organization-Address: 1 Main St., Cupertino, California, 11111 + Contact-Name: Jane Doe + Contact-Phone: +1 111-111-1111 + Contact-Email: example@example.com External-Description: Uncompressed greyscale TIFF images from the - Yoshimuri papers colle... + FOO papers colle... Bagging-Date: 2008-01-15 - External-Identifier: spengler_yoshimuri_001 + External-Identifier: university_foo_001 Bag-Size: 260 GB Payload-Oxum: 279164409832.1198 - Bag-Group-Identifier: spengler_yoshimuri + Bag-Group-Identifier: university_foo Bag-Count: 1 of 15 - Internal-Sender-Identifier: /storage/images/yoshimuri + Internal-Sender-Identifier: /storage/images/foo Internal-Sender-Description: Uncompressed greyscale TIFFs created from microfilm and are... 
- -
A bag &may; contain other tag files that are not defined by this specification. -Implementations &should; ignore the content of any unexpected tag files, -except when they are listed in a tag manifest. -When unexpected tag files are listed in a tag manifest, implementations +Implementations &should; ignore the content of any tag files not +defined in this specification, except when they are listed in a tag manifest. +When tag files are listed in a tag manifest, implementations &must; only treat the content of those tag files as octet streams for the purpose of checksum verification. @@ -556,7 +583,7 @@ the text tag file format described below. Text tag files are line-oriented, and each line &must; be terminated by a newline (LF), a carriage return (CR), or carriage return plus newline (CRLF). -Text tag files &must; end in the extension ".txt". +Text tag file names &must; end in the extension ".txt". @@ -583,12 +610,11 @@ The backslash character ('\', U+005C) escapes from special processing any
-The payload manifest and tag manifests assert integrity of the payload -and tags in a bag using checksum algorithms. The operation -of those algorithms, and the formatting of their output within a manifest -file, are generally beyond the scope of this specification, except that the -output format &must; be able to fit in the manifest format specified in -. +The payload manifest and tag manifests permit validating the integrity of the payload +and tag files in a bag produced by the checksum algorithms. +Checksum values &must; be encoded so as to conform to the manifest format +specified in . However, the internal details +of a checksum are outside the scope of this document. @@ -622,23 +648,10 @@ attributes: Every required element &must; be present (). - Every file in every payload manifest &must; be present. - Every file in every tag manifest &must; be present. + Every file in every payload manifest &must; be present on the filesystem. + Every file in every tag manifest &must; be present on the filesystem. Tag files not listed in a tag manifest &may; be present. Every payload file &must; be listed in all manifests. - Every element present &must; comply with this specification. - - - - -A bag is incomplete when it exhibits any of -the following exceptions to the attributes of a complete bag: - - - - - One or more files in any payload manifest are absent. - One or more files in any tag manifest are absent. @@ -653,22 +666,13 @@ attributes: Every CHECKSUM in every payload manifest and tag manifest can be sucessfully verified against the contents of its corresponding FILENAME. + Every element present &must; comply with this specification. - -If a bag is neither valid, complete, nor incomplete, it is -invalid. Definitions for the various -ways a bag may be invalid are not covered by this specification. - +
- -Tag files that do not appear in a tag manifest can be modified, added -to, or removed from a bag without impacting the completeness or validity -of the bag. - -
@@ -726,6 +730,7 @@ For example, a maliciously crafted "tagmanifest-md5.txt" file might contain entries which begin with a path character such as "/", "..", or a "~username" home directory reference in an attempt to cause a naive implementation to leak or overwrite targeted files. +All implementations &should; have a test suite to guard against these cases. From 545e072df5b5252191652b9985fc3025f4cc3f89 Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Thu, 5 Jan 2017 16:06:05 -0500 Subject: [PATCH 028/144] Use
for ABNF diagrams --- bagit.xml | 69 ++++++++++++++++++++++++++++++++----------------------- 1 file changed, 40 insertions(+), 29 deletions(-) diff --git a/bagit.xml b/bagit.xml index 973f7411..ba7096b5 100644 --- a/bagit.xml +++ b/bagit.xml @@ -432,22 +432,28 @@ return plus newline (CRLF) it &must; be percent-encoded . -The ABNF form of this is: - - - payload-manifest = checksum 1*WSP filename ending - checksum = 1*hex-val - hex-val = "x" 1*case-hexdig - [ 1*("." 1*case-hexdig) / ("-" 1*case-hexdig) ] - case-hexdig = DIGIT / "A" / "a" / "B" / "b" / "C" / "c" / "D" / "d" / "F" / "f" - case-hexdig = DIGIT / "A" / "B" / "C" / "D" / "F" - filename = ("data" "/" *( unreserved / pct-encoded / sub-delims )) - unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" - sub-delims = "!" / "$" / "&" / "'" / "(" / ")" - / "*" / "+" / "," / ";" / "=" - pct-encoded = "%" case-hexdig case-hexdig - ending = CR / LF / CRLF - +
+ Payload Manifest ABNF + +
A payload manifest &must-not; reference empty directories. To account for @@ -509,13 +515,17 @@ onto the next line by inserting a newline (LF), a carriage return (CR), or carriage return plus newline (CRLF) and indenting the next line with linear white space (spaces or tabs). -the ABNF form for this is: - - metadata = key ":" value ending - key = ALPHA / DIGIT - value = ([1*WSP] 1*[ALPHA / DIGIT]) - ending = CR / LF / CRLF - + +
+ "bag-info.txt" ABNF + +
+ An implementation &should; add the optional "Payload-Oxum" element for the purpose of quickly detecting incomplete bags before performing checksum @@ -531,13 +541,14 @@ conform to this format: total number of payload files. Compared to "Bag-Size", "Payload-Oxum" is intended for machine consumption. -The ABNF form for this is: - + +
+ Payload-Oxum ABNF + - -Here is an example "bag-info.txt" file. - +]]> +
+
Source-Organization: FOO University From 35c32724ffdb5db9d16721e31d334837b10020da Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Thu, 5 Jan 2017 16:06:53 -0500 Subject: [PATCH 029/144] Minor
consistency cleanup --- bagit.xml | 41 ++++++++++++++++++++--------------------- 1 file changed, 20 insertions(+), 21 deletions(-) diff --git a/bagit.xml b/bagit.xml index ba7096b5..300b1096 100644 --- a/bagit.xml +++ b/bagit.xml @@ -321,23 +321,23 @@ The base directory &may; have any name.
- - <base directory>/ - | - +-- bagit.txt - | - +-- manifest-<algorithm>.txt - | - +-- [additional tag files] - | - +-- data/ - | | - | +-- [payload files] - | - +-- [optional tag directories]/ - | - +-- [optional tag files] - + + <base directory>/ + | + +-- bagit.txt + | + +-- manifest-<algorithm>.txt + | + +-- [additional tag files] + | + +-- data/ + | | + | +-- [payload files] + | + +-- [optional tag directories]/ + | + +-- [optional tag files] +
@@ -550,7 +550,8 @@ conform to this format:
- + An example "bag-info.txt" file + Source-Organization: FOO University Organization-Address: 1 Main St., Cupertino, California, 11111 Contact-Name: Jane Doe @@ -567,7 +568,7 @@ conform to this format: Internal-Sender-Identifier: /storage/images/foo Internal-Sender-Description: Uncompressed greyscale TIFFs created from microfilm and are... - +
@@ -836,7 +837,6 @@ example below: The common surname "Núñez" normalized in different forms - Windows also reserves the following names, with or without a file extension: - CON, PRN, AUX, NUL COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9 From 4407055e608fa1aae8ea15591086ef854a8f09fb Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Thu, 5 Jan 2017 16:08:21 -0500 Subject: [PATCH 030/144] Terminology: whitespace cleanup --- bagit.xml | 23 +++++++++++------------ 1 file changed, 11 insertions(+), 12 deletions(-) diff --git a/bagit.xml b/bagit.xml index 300b1096..bf0f89a0 100644 --- a/bagit.xml +++ b/bagit.xml @@ -238,14 +238,14 @@ any ambiguity. - A order independent set of opaque files contained within the structure defined - by this specification. + A order independent set of opaque files contained within the structure + defined by this specification. - The file required to be in all bags conforming to this - specification. Contains values necessary for bootstrapping the - reading and processing of the rest of a bag. See . + The file required to be in all bags conforming to this specification. + Contains values necessary for bootstrapping the reading and processing of + the rest of a bag. See . @@ -254,17 +254,16 @@ any ambiguity. format described in . - + A bag which comprises all elements required by this specification, and all payload files are present. It also includes any optional files - specified. See - . - + specified. See . + - The data encapsulated by the bag. The contents of the payload - are opaque to this specification, and, with respect to BagIt processing, - are always considered as a set of uninterpreted octets. See . + The data encapsulated by the bag. The contents of the payload + are opaque to this specification, and, with respect to BagIt processing, + are always considered as a set of uninterpreted octets. See . 
From b07d6c0c442d15ab2af85034f797153e5c9e61b0 Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Thu, 5 Jan 2017 16:09:14 -0500 Subject: [PATCH 031/144] Convert tabs to spaces --- bagit.xml | 90 +++++++++++++++++++++++++++---------------------------- 1 file changed, 45 insertions(+), 45 deletions(-) diff --git a/bagit.xml b/bagit.xml index bf0f89a0..43bd3165 100644 --- a/bagit.xml +++ b/bagit.xml @@ -21,20 +21,20 @@ Change "algorithm" (crypto hash) to "function"? (Steffen Fritz email to jak) - - - - - - - - - - - - - - + + + + + + + + + + + + + + ]> @@ -237,22 +237,22 @@ any ambiguity. - - A order independent set of opaque files contained within the structure - defined by this specification. - - - - The file required to be in all bags conforming to this specification. - Contains values necessary for bootstrapping the reading and processing of - the rest of a bag. See . - - - + + A order independent set of opaque files contained within the structure + defined by this specification. + + + + The file required to be in all bags conforming to this specification. + Contains values necessary for bootstrapping the reading and processing of + the rest of a bag. See . + + + The name of a cryptographic checksum algorithm which has been normalized for use in a manifest or tag manifest file name (e.g. “sha1”) in the format described in . - + A bag which comprises all elements required by this specification, @@ -260,19 +260,19 @@ any ambiguity. specified. See . - + The data encapsulated by the bag. The contents of the payload are opaque to this specification, and, with respect to BagIt processing, are always considered as a set of uninterpreted octets. See . - + - - A directory that contains one or more tag files. - + + A directory that contains one or more tag files. + - + A file which contains the metadata required to validate the bag. - + A complete bag where every checksum in every manifest has been @@ -361,7 +361,7 @@ mark (BOM). 
The number for this version of the specification is "&current-bagit-version;". -
+
@@ -657,11 +657,11 @@ attributes: - Every required element &must; be present - (). - Every file in every payload manifest &must; be present on the filesystem. - Every file in every tag manifest &must; be present on the filesystem. - Tag files not listed in a tag manifest &may; be present. + Every required element &must; be present + (). + Every file in every payload manifest &must; be present on the filesystem. + Every file in every tag manifest &must; be present on the filesystem. + Tag files not listed in a tag manifest &may; be present. Every payload file &must; be listed in all manifests. @@ -673,9 +673,9 @@ attributes: - The bag &must; be complete. - Every CHECKSUM in every payload manifest and tag manifest - can be sucessfully verified against the contents of its + The bag &must; be complete. + Every CHECKSUM in every payload manifest and tag manifest + can be sucessfully verified against the contents of its corresponding FILENAME. Every element present &must; comply with this specification. @@ -694,8 +694,8 @@ OCR file. Lines of file content are shown in parentheses beneath the file name. + for the fact that the entity value is much shorter than the entity + name. -->
myfirstbag/ From ce06c32995881375a2d48a090cc0a701f5b28ab9 Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Thu, 5 Jan 2017 16:10:17 -0500 Subject: [PATCH 032/144] =?UTF-8?q?Terminology:=20update=20definition=20of?= =?UTF-8?q?=20=E2=80=9Ccomplete=E2=80=9D?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- bagit.xml | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/bagit.xml b/bagit.xml index 43bd3165..4c815209 100644 --- a/bagit.xml +++ b/bagit.xml @@ -255,9 +255,9 @@ any ambiguity. - A bag which comprises all elements required by this specification, - and all payload files are present. It also includes any optional files - specified. See . + A bag which contains every elements required by this specification, + every payload file listed in a manifest, and any optional files which are + listed in a tag manifest. See . From 757f68ae9332944c771f0b3bd7491027b220e93d Mon Sep 17 00:00:00 2001 From: John Scancella Date: Fri, 6 Jan 2017 12:39:37 -0500 Subject: [PATCH 033/144] fixed some wording of sentences and removed section on Disk and network transfer since it doesn't really belong. --- bagit.xml | 76 +++++++++++++++---------------------------------------- 1 file changed, 21 insertions(+), 55 deletions(-) diff --git a/bagit.xml b/bagit.xml index 4c815209..85b2d9b6 100644 --- a/bagit.xml +++ b/bagit.xml @@ -206,11 +206,13 @@ storage and transfer of arbitrary digital content (i.e. files). A bag consists of a directory containing the payload files and other accompanying files known as "tag" files. The "tags" are metadata files intended to facilitate and document the storage and transfer of the bag. -BagIt does not require the processing of the entire structure to understand to -involve understanding the payload. The name, BagIt, is inspired by the "enclose -and deposit" method , sometimes referred to as "bag it and tag it". 
-This differs from other specifications (like ZIP) that only ensure the integrity -of the files in that BagIt also ensures the complete set of files are included. +BagIt does not require the processing of the entire structure to understand +the payload. The name, BagIt, is inspired by the "enclose and deposit" method +, sometimes referred to as "bag it and tag it". +BagIt differs from other specifications (like ZIP) by allowing for stronger +checks then CRC-32, the ability to upgrade to stronger hash algorithms without +breaking backwards compatibility, the ability to verify each file without +having to read the entire archive, and no file size limitation
@@ -244,7 +246,7 @@ any ambiguity. The file required to be in all bags conforming to this specification. - Contains values necessary for bootstrapping the reading and processing of + Contains values necessary for the reading and processing of the rest of a bag. See . @@ -378,7 +380,7 @@ but is otherwise ignored for purposes relating to this specification. -
+
@@ -386,10 +388,7 @@ but is otherwise ignored for purposes relating to this specification. A payload manifest file provides a complete listing of the files contained in the payload, and a checksum for each payload file, to permit data integrity checking. -A payload manifest is a tag file that lists payload file names and their checksums -generated using a checksum algorithm. - -Every bag &must; contain one payload manifest file, and &may; contain +Every bag &must; contain at least one payload manifest file, and &may; contain more than one. Every payload manifest &must; list every payload file. A payload manifest file &must; have a name of the form manifest-algorithm.txt, where algorithm @@ -440,7 +439,7 @@ hex-val = "x" 1*case-hexdig [ 1*("." 1*case-hexdig) / ("-" 1*case-hexdig) ] case-hexdig = DIGIT / "A" / "a" / "B" / "b" / "C" / "c" / "D" / "d" / "F" / "f" -case-hexdig = DIGIT / "A" / "B" / "C" / "D" / "F" +HEXDIG = DIGIT / "A" / "B" / "C" / "D" / "F" filename = ( "data" "/" @@ -449,7 +448,7 @@ filename = ( unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "=" -pct-encoded = "%" case-hexdig case-hexdig +pct-encoded = "%" HEXDIG HEXDIG ending = CR / LF / CRLF ]]> @@ -518,10 +517,11 @@ linear white space (spaces or tabs).
"bag-info.txt" ABNF
@@ -537,14 +537,13 @@ conform to this format: The "octet-stream sum" of the payload, namely, a two-part number of the form "OctetCount.StreamCount", where OctetCount is the total number of octets (8-bit bytes) across all payload file content and StreamCount is the - total number of payload files. Compared to "Bag-Size", "Payload-Oxum" is - intended for machine consumption. + total number of payload files.
Payload-Oxum ABNF
@@ -604,11 +603,7 @@ file. Text tag files except for the bag declaration file &may; include a byte-order mark (BOM) only if the specified encoding requires it for proper decoding. In accordance with , when "bagit.txt" specifies UTF-8 the tag files &must-not; begin with a byte-order mark (BOM). - - - -As specified in , the bag declaration -file &must; be encoded in UTF-8 and &must-not; include a byte-order mark. +See @@ -662,7 +657,7 @@ attributes: Every file in every payload manifest &must; be present on the filesystem. Every file in every tag manifest &must; be present on the filesystem. Tag files not listed in a tag manifest &may; be present. - Every payload file &must; be listed in all manifests. + Every payload file &must; be listed in all payload manifests. @@ -753,35 +748,6 @@ All implementations &should; have a test suite to guard against these cases.
-
- - -When creating a bag on physical media (such as hard disk, CD-ROM, or -DVD) for transfer to another organization, the sender should select -and format the media in a manner compatible with both the content -requirements (e.g., file names and sizes) and the receiver's technical -infrastructure. If the receiver's infrastructure is not known or the -media needs to be compatible with a range of potential receivers, -consideration should be given to portability and common usage. For -example, a "lowest common denominator" for some potential receivers -could be USB disk drives formatted with the FAT32 filesystem. - - - -Although overall bag size is unlimited in principle, network-based -transfers may involve constraints on the amount of bag data that a -receiver can receive at one time. It may be practical to split a -large bag into several smaller bags. - - - -The mechanics of sending and receiving bags over networks is otherwise -out of scope of the present document and may be facilitated by protocols -such as or . - - -
-
From e304db46b11d97d2c5d49173413b4d1037e3d9b1 Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Fri, 6 Jan 2017 16:39:47 -0500 Subject: [PATCH 034/144] Remove stale references The prose which referenced these was removed in 757f68ae9332944c771f0b3bd7491027b220e93d --- bagit.xml | 75 ------------------------------------------------------- 1 file changed, 75 deletions(-) diff --git a/bagit.xml b/bagit.xml index 85b2d9b6..f2e29cf3 100644 --- a/bagit.xml +++ b/bagit.xml @@ -1021,81 +1021,6 @@ This draft does not request any action from IANA. &rfc3629; &rfc3986; - - - Simple Web-service Offering Repository Deposit (SWORD) - UKOLN/JISC CETIS - - - - - - - - The Metalink Download Description Format - - -
- - - Pompano Beach - FL - USA - - anthonybryan@gmail.com - http://www.METALINK.org -
-
- - -
- neil@nabber.org - http://www.nabber.org -
-
- - -
- - - - Shiga - Japan - - tatsuhiro.t@gmail.com - http://aria2.sourceforge.net -
-
- - MirrorBrain -
- - Venloer Str. 317 - Koeln - 50823 - DE - - +49 221 6778 333 8 - peter@poeml.de - http://mirrorbrain.org/~poeml/ -
-
- - -
- henrik@henriknordstrom.net - http://www.henriknordstrom.net/ -
-
- -
- -
- From 318fe5c91dec073d08bdb7738102e142256dffa4 Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Fri, 6 Jan 2017 16:42:12 -0500 Subject: [PATCH 035/144] Remove trailing whitespace --- bagit.xml | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/bagit.xml b/bagit.xml index f2e29cf3..d6ffc376 100644 --- a/bagit.xml +++ b/bagit.xml @@ -206,11 +206,11 @@ storage and transfer of arbitrary digital content (i.e. files). A bag consists of a directory containing the payload files and other accompanying files known as "tag" files. The "tags" are metadata files intended to facilitate and document the storage and transfer of the bag. -BagIt does not require the processing of the entire structure to understand -the payload. The name, BagIt, is inspired by the "enclose and deposit" method +BagIt does not require the processing of the entire structure to understand +the payload. The name, BagIt, is inspired by the "enclose and deposit" method , sometimes referred to as "bag it and tag it". BagIt differs from other specifications (like ZIP) by allowing for stronger -checks then CRC-32, the ability to upgrade to stronger hash algorithms without +checks then CRC-32, the ability to upgrade to stronger hash algorithms without breaking backwards compatibility, the ability to verify each file without having to read the entire archive, and no file size limitation
@@ -537,7 +537,7 @@ conform to this format: The "octet-stream sum" of the payload, namely, a two-part number of the form "OctetCount.StreamCount", where OctetCount is the total number of octets (8-bit bytes) across all payload file content and StreamCount is the - total number of payload files. + total number of payload files.
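Reviewer note on the Payload-Oxum hunk above: the "OctetCount.StreamCount" definition can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the patch and not a reference implementation. The function name payload_oxum and its bag_dir parameter are hypothetical, and the sketch assumes the bag is an ordinary directory on a local filesystem with its payload under "data/".

```python
import os


def payload_oxum(bag_dir):
    """Return the "OctetCount.StreamCount" string for the payload of a bag.

    Hypothetical helper: bag_dir is assumed to be the bag's base directory,
    with all payload files stored beneath its "data" sub-directory.
    """
    octet_count = 0   # total number of octets across all payload files
    stream_count = 0  # total number of payload files
    data_dir = os.path.join(bag_dir, "data")
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            octet_count += os.path.getsize(os.path.join(root, name))
            stream_count += 1
    return f"{octet_count}.{stream_count}"
```

A receiver could compare this string with the Payload-Oxum value in "bag-info.txt" as a cheap sanity check before running full checksum validation. Tag files outside "data/" are deliberately not counted.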
From 58152841a21ac2bc3424698ca2fe9787edee3075 Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Fri, 6 Jan 2017 16:42:23 -0500 Subject: [PATCH 036/144] Remove stale inline TODO items --- bagit.xml | 7 ------- 1 file changed, 7 deletions(-) diff --git a/bagit.xml b/bagit.xml index d6ffc376..0a4499dd 100644 --- a/bagit.xml +++ b/bagit.xml @@ -1,11 +1,4 @@ - + - - - - - - + + + + + + + + + @@ -30,21 +33,14 @@ ]> - - + + - - - - - + @@ -1007,12 +1003,12 @@ This draft does not request any action from IANA. target="http://msdn2.microsoft.com/en-us/library/aa365247.aspx" /> </reference> - &rfc2119; <!-- Requirements --> - &rfc1321; <!-- MD5 --> - &rfc3174; <!-- SHA-1 --> - &rfc6234; <!-- SHA-2 --> - &rfc3629; <!-- utf-8 --> - &rfc3986; <!-- URLs --> + &RFC2119; <!-- Requirements --> + &RFC1321; <!-- MD5 --> + &RFC3174; <!-- SHA-1 --> + &RFC6234; <!-- SHA-2 --> + &RFC3629; <!-- utf-8 --> + &RFC3986; <!-- URLs --> <reference anchor="UNICODE-TR15" target="http://www.unicode.org/reports/tr15/"> From c6f0a08f147ffc4543adcd6e2612060a0195cdce Mon Sep 17 00:00:00 2001 From: Chris Adams <cadams@loc.gov> Date: Fri, 6 Jan 2017 18:04:12 -0500 Subject: [PATCH 038/144] Prose review for section 1 --- bagit.xml | 89 ++++++++++++++++++++++++++++++++++--------------------- 1 file changed, 55 insertions(+), 34 deletions(-) diff --git a/bagit.xml b/bagit.xml index 8a45b18b..7fe73867 100644 --- a/bagit.xml +++ b/bagit.xml @@ -176,10 +176,9 @@ <t> This document specifies BagIt, a hierarchical file packaging format for storage and transfer of arbitrary digital content. A "bag" has just enough -structure to enclose descriptive "tags" and a "payload" but +structure to enclose descriptive metadata "tags" and a file "payload" but does not require knowledge of the payload's internal semantics. This -BagIt format should be suitable for disk-based or network-based storage and -transfer. +BagIt format should be suitable for reliable storage and transfer. </t> </abstract> @@ -191,17 +190,34 @@ transfer. 
<section title="Purpose"> <t> BagIt is a hierarchical file packaging format designed to support -storage and transfer of arbitrary digital content (i.e. files). +storage and transfer of arbitrary digital content. A bag consists of a directory containing the payload files and other accompanying -files known as "tag" files. The "tags" are metadata files intended to facilitate -and document the storage and transfer of the bag. -BagIt does not require the processing of the entire structure to understand -the payload. The name, BagIt, is inspired by the "enclose and deposit" method +metadata files known as "tag" files. The "tags" are metadata files intended to +facilitate and document the storage and transfer of the bag. Processing a bag +does not require any understanding of the payload file contents and the payload +can be accessed without processing the BagIt metadata. +</t> + +<t> +The name, BagIt, is inspired by the "enclose and deposit" method <xref target="ENCDEP" />, sometimes referred to as "bag it and tag it". 
-BagIt differs from other specifications (like ZIP) by allowing for stronger -checks then CRC-32, the ability to upgrade to stronger hash algorithms without -breaking backwards compatibility, the ability to verify each file without -having to read the entire archive, and no file size limitation +BagIt differs from traditional archive formats such as TAR or ZIP in two general +areas: + +<list style="numbers"> + <t> + Strong integrity assurances: the format supports only cryptographic-quality + hash algorithms (see <xref target="bag-checksum-algorithms" />) and allows + for in-place upgrades to add additional manifests using stronger algorithms + without breaking backwards compatibility + </t> + <t> + Direct file access: files may be accessed using standard operating system + utilities, implementations do not need to process a potentially large + archive file to extract a subset of data, and the format imposes no size + limits for either individual files or a bag. + </t> +</list> </t> </section> <!-- /Purpose --> @@ -220,10 +236,7 @@ document are to be interpreted as described in <xref target="RFC2119"/>. <section title="Terminology"> <t> -This specification uses a number of terms to describe BagIt, some -of which are in common use, some of which are newly defined by this -specification. Terms defined in this section are intended to clarify -any ambiguity. + The following terms have precise definitions as used in this specification: </t> <t> @@ -235,26 +248,21 @@ any ambiguity. <t hangText="bag declaration"> The file required to be in all bags conforming to this specification. - Contains values necessary for the reading and processing of - the rest of a bag. See <xref target="sec-bag-decl"/>. + Contains values necessary to process the rest of a bag. + See <xref target="sec-bag-decl"/>. </t> <t hangText="bag checksum algorithm"> The name of a cryptographic checksum algorithm which has been normalized - for use in a manifest or tag manifest file name (e.g. 
“sha1”) in the - format described in <xref target="bag-checksum-algorithms" />. - </t> - - <t hangText="complete"> - A bag which contains every elements required by this specification, - every payload file listed in a manifest, and any optional files which are - listed in a tag manifest. See <xref target="sec-complete-valid" />. + for use in a manifest or tag manifest file name (e.g. "SHA-1" becomes + "sha1") as described in <xref target="bag-checksum-algorithms" />. </t> <t hangText="payload"> The data encapsulated by the bag. The contents of the payload are opaque to this specification, and, with respect to BagIt processing, - are always considered as a set of uninterpreted octets. See <xref target="sec-payload-dir" />. + are always considered as an opaque octet stream. + See <xref target="sec-payload-dir" />. </t> <t hangText="tag directory"> @@ -262,7 +270,19 @@ any ambiguity. </t> <t hangText="tag file"> - A file which contains the metadata required to validate the bag. + A file which contains metadata. The specification defines two standard tag + files: tag manifests, which describe other tag files + <xref target="sec-tag-manifest" />, and the "bag-info.txt" file containing + human-meaningful metadata <xref target="sec-bag-info" />. + + The specification also allows other arbitrary tag files as described in + <xref target="sec-other-tag-files" />. + </t> + + <t hangText="complete"> + A bag which contains every element required by this specification, + every payload file listed in a manifest, and any optional files which are + listed in a tag manifest. See <xref target="sec-complete-valid" />. </t> <t hangText="valid"> @@ -452,7 +472,7 @@ file named ".keep". 
</section> <!-- /Required Elements --> <section title="Optional Elements" anchor="sec-optional-elements"> -<section title="Tag Manifest: tagmanifest-<alg>.txt"> +<section anchor="sec-tag-manifest" title="Tag Manifest: tagmanifest-<alg>.txt"> <!-- WARNING: This section should be kept in relative sync with the section on Payload Manifests. --> @@ -485,7 +505,7 @@ As a result, no FILENAME listed in a tag manifest begins "data/". </section> <!-- /Tag Manifest --> -<section title="Bag Metadata: bag-info.txt"> +<section anchor="sec-bag-info" title="Bag Metadata: bag-info.txt"> <t> The "bag-info.txt" file is a tag file that contains metadata elements describing the bag and the payload. The metadata elements contained in @@ -658,10 +678,11 @@ attributes: <t> <list style="numbers"> <t>The bag &must; be complete.</t> - <t>Every CHECKSUM in every payload manifest and tag manifest - can be sucessfully verified against the contents of its - corresponding FILENAME.</t> - <t>Every element present &must; comply with this specification.</t> + <t> + Every CHECKSUM in every payload manifest and tag manifest can be + sucessfully verified against the contents of its corresponding FILENAME. + </t> + <t>Every element present &must; comply with this specification.</t> </list> </t> From b6413fdf683f86ba1cc9c555b00e52951e40d5f0 Mon Sep 17 00:00:00 2001 From: Chris Adams <cadams@loc.gov> Date: Fri, 6 Jan 2017 18:08:08 -0500 Subject: [PATCH 039/144] =?UTF-8?q?Fix=20formatting=20for=20Section=202=20?= =?UTF-8?q?(=E2=80=9CStructure=E2=80=9D)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- bagit.xml | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/bagit.xml b/bagit.xml index 7fe73867..7c1aeb8d 100644 --- a/bagit.xml +++ b/bagit.xml @@ -305,16 +305,17 @@ document are to be interpreted as described in <xref target="RFC2119"/>. 
<section title="Structure"> <t> A bag consists of a base directory containing: +</t> +<t> <list style="numbers"> - <t>a set of required and optional tag files</t> - <t>a sub-directory named "data", called the payload directory</t> + <t>a set of required and optional tag files <xref target="sec-optional-elements" /></t> + <t>a sub-directory named "data", called the payload directory. <xref target="sec-payload-dir" /></t> <t>a set of optional tag directories</t> </list> +</t> - The payload files in the payload directory are an arbitrary file hierarchy - (see <xref target="sec-payload-dir" />). - +<t> The tag files in the base directory consist of one or more files named "manifest-<spanx style="emph">algorithm</spanx>.txt" (see <xref target="sec-payload-manifest" /> and From b35ceb54a54443ff7fda99af64a23372acd28b33 Mon Sep 17 00:00:00 2001 From: Chris Adams <cadams@loc.gov> Date: Fri, 6 Jan 2017 18:10:10 -0500 Subject: [PATCH 040/144] =?UTF-8?q?Fix=20formatting=20for=20=E2=80=9CBag?= =?UTF-8?q?=20Declaration=E2=80=9D?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- bagit.xml | 20 +++++++++++--------- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/bagit.xml b/bagit.xml index 7c1aeb8d..00bdf7c5 100644 --- a/bagit.xml +++ b/bagit.xml @@ -353,22 +353,24 @@ The base directory &may; have any name. <section title="Required Elements" anchor="sec-required-elements"> <section title="Bag Declaration: bagit.txt" anchor="sec-bag-decl"> + <t> The "bagit.txt" tag file &must; consist of exactly two lines: +</t> <figure> - <artwork> + <artwork> BagIt-Version: M.N Tag-File-Character-Encoding: UTF-8 - </artwork> -</figure> + </artwork> + <postamble> + M.N identifies the BagIt major (M) and minor (N) version numbers, + and UTF-8 identifies the character set encoding used by the tag files. -where M.N identifies the BagIt major (M) and minor (N) version numbers, -and UTF-8 identifies the character set encoding of tag files. 
The bag -declaration &must; be encoded in UTF-8, and &must-not; contain a byte-order -mark (BOM). -<xref target="RFC3629"/> -</t> + The bag declaration &must; be encoded in UTF-8, and &must-not; contain a + byte-order mark (BOM) <xref target="RFC3629"/>. + </postamble> +</figure> <t> The number for this version of the specification is "¤t-bagit-version;". From 45685bc2038cacbea32dae45d196343da4a8e431 Mon Sep 17 00:00:00 2001 From: Chris Adams <cadams@loc.gov> Date: Fri, 6 Jan 2017 18:12:41 -0500 Subject: [PATCH 041/144] =?UTF-8?q?Copy-editing=20for=20=E2=80=9CPayload?= =?UTF-8?q?=20Directory=E2=80=9D?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- bagit.xml | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/bagit.xml b/bagit.xml index 00bdf7c5..5abc28fc 100644 --- a/bagit.xml +++ b/bagit.xml @@ -384,11 +384,11 @@ The base directory &must; contain a sub-directory named "data". <t> The payload directory contains the arbitrary digital content within the bag. -The files under the payload directory are called payload files, or -the payload. -Each payload file is treated as an opaque octet stream for all verifing file correctness. -The payload directory structure may contain meaning and thus should be preserved, -but is otherwise ignored for purposes relating to this specification. +The files under the payload directory are called payload files, or the payload. +Each payload file is treated as an opaque octet stream when verifying file +correctness. +Any sub-directory structure within the payload &must; be preserved but is +otherwise ignored for purposes relating to this specification. 
</t> </section> <!-- /Payload Directory --> From 272e6ce6a57c9384ff3ae641d5ad703a5c706cc6 Mon Sep 17 00:00:00 2001 From: Chris Adams <cadams@loc.gov> Date: Fri, 6 Jan 2017 18:17:02 -0500 Subject: [PATCH 042/144] =?UTF-8?q?Copy=20editing=20for=20=E2=80=9CPayload?= =?UTF-8?q?=20Manifest=E2=80=9D?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- bagit.xml | 52 ++++++++++++++++++++++++++++++---------------------- 1 file changed, 30 insertions(+), 22 deletions(-) diff --git a/bagit.xml b/bagit.xml index 5abc28fc..7292af34 100644 --- a/bagit.xml +++ b/bagit.xml @@ -397,22 +397,25 @@ otherwise ignored for purposes relating to this specification. section on Tag Manifests. --> <t> -A payload manifest file provides a complete listing of the files contained in the payload, -and a checksum for each payload file, to permit data integrity checking. +A payload manifest file provides a complete listing of each payload file along +with a corresponding checksum to permit data integrity checking. +</t> -Every bag &must; contain at least one payload manifest file, and &may; contain +<t> +Every bag &must; contain at least one payload manifest file and &may; contain more than one. Every payload manifest &must; list every payload file. A payload -manifest file &must; have a name of the form manifest-<spanx -style="emph">algorithm</spanx>.txt, where <spanx style="emph">algorithm</spanx> -is a string specifying the bag checksum algorithm used in that manifest, such -as: +manifest file &must; have a name of the form "manifest-<spanx +style="emph">algorithm</spanx>.txt", where <spanx style="emph">algorithm</spanx> +is a string specifying the checksum algorithm used by that manifest as described +in <xref target="bag-checksum-algorithms" />. 
</t> <figure> - <artwork> + <preamble>Example payload manifest filenames</preamble> + <artwork> manifest-md5.txt manifest-sha1.txt - </artwork> + </artwork> </figure> <t>A bag &must-not; contain more than one payload manifest for a particular @@ -423,22 +426,27 @@ Each line of a payload manifest file &must; be of the form: </t> <figure> - <artwork> -CHECKSUM FILENAME - </artwork> + <artwork>CHECKSUM FILENAME</artwork> + <postamble> + where FILENAME is the pathname of a file relative to the base directory, + and CHECKSUM is a hex-encoded checksum calculated according to + <spanx style="emph">algorithm</spanx> over every octet in the file. + </postamble> </figure> <t> -where FILENAME is the pathname of a file relative to the base directory, - and CHECKSUM is a hex-encoded checksum calculated according to <spanx - style="emph">algorithm</spanx> over every octet in the file. -The hex-encoded checksum &may; use uppercase and/or lowercase letters. The slash -character ('/') &must; be used as a path separator in FILENAME. One -or more linear whitespace characters (spaces or tabs) &must; separate -CHECKSUM from FILENAME. There is no limitation on the length of a pathname. -The payload manifest &must-not; reference files outside the payload directory. If -a FILENAME includes a newline (LF), a carriage return (CR), or carriage -return plus newline (CRLF) it &must; be percent-encoded +The hex-encoded checksum &may; use uppercase and/or lowercase letters. + +The slash character ('/') &must; be used as a path separator in FILENAME. + +One or more linear whitespace characters (spaces or tabs) &must; separate CHECKSUM from FILENAME. + +There is no limitation on the length of a pathname. + +The payload manifest &must-not; reference files outside the payload directory. + +If a FILENAME includes a newline (LF), a carriage return (CR), or carriage +return plus newline (CRLF) it &must; be percent-encoded following <xref target="RFC3986"/>. 
</t> From 3bbebd5c7d1109393850e7aa81928570d8130cb7 Mon Sep 17 00:00:00 2001 From: Chris Adams <cadams@loc.gov> Date: Fri, 6 Jan 2017 18:21:03 -0500 Subject: [PATCH 043/144] Remove injunction against reusing payload manifest algorithms This appears to be redundant since the only way an algorithm could be reused would involve violating the naming convention. --- bagit.xml | 3 --- 1 file changed, 3 deletions(-) diff --git a/bagit.xml b/bagit.xml index 7292af34..5d85e57a 100644 --- a/bagit.xml +++ b/bagit.xml @@ -418,9 +418,6 @@ manifest-sha1.txt </artwork> </figure> -<t>A bag &must-not; contain more than one payload manifest for a particular -bag checksum algorithm.</t> - <t> Each line of a payload manifest file &must; be of the form: </t> From d9ceeee5c364995524e30ad392d2a056e3b5cc59 Mon Sep 17 00:00:00 2001 From: Chris Adams <cadams@loc.gov> Date: Fri, 6 Jan 2017 18:31:05 -0500 Subject: [PATCH 044/144] =?UTF-8?q?Copy-editing=20for=20=E2=80=9CPayload-O?= =?UTF-8?q?xum=E2=80=9D?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- bagit.xml | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/bagit.xml b/bagit.xml index 5d85e57a..869b0d62 100644 --- a/bagit.xml +++ b/bagit.xml @@ -550,11 +550,14 @@ the standard checksum validation process before proclaiming a bag to be valid. This element &must-not; be present more than once and, if present, &must; conform to this format: </t> + <t hangText="Payload-Oxum"> - The "octet-stream sum" of the payload, namely, a two-part number of the - form "OctetCount.StreamCount", where OctetCount is the total number of - octets (8-bit bytes) across all payload file content and StreamCount is the - total number of payload files. 
+ The "octet-stream sum" of the payload is a pair of two numbers in the form + "<spanx style="emph">OctetCount</spanx>.<spanx style="emph">StreamCount</spanx>", + where <spanx style="emph">OctetCount</spanx> is the total number of octets + (8-bit bytes) across all payload file content and + <spanx style="emph">StreamCount</spanx> is the total number of payload + files. </t> <figure> From 38b8b59f0f453b4e4ecbda8c45d03fecf07cb004 Mon Sep 17 00:00:00 2001 From: Chris Adams <cadams@loc.gov> Date: Fri, 6 Jan 2017 18:38:13 -0500 Subject: [PATCH 045/144] =?UTF-8?q?Simplify=20=E2=80=9COther=20Tag=20Files?= =?UTF-8?q?=E2=80=9D=20text?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- bagit.xml | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/bagit.xml b/bagit.xml index 869b0d62..376c0a9c 100644 --- a/bagit.xml +++ b/bagit.xml @@ -594,11 +594,9 @@ conform to this format: <t> A bag &may; contain other tag files that are not defined by this specification. -Implementations &should; ignore the content of any tag files not -defined in this specification, except when they are listed in a tag manifest. -When tag files are listed in a tag manifest, implementations -&must; only treat the content of those tag files as octet streams for the -purpose of checksum verification. + +Implementations &must; perform standard checksum validation on any tag file +which is listed in a tag manifest but &must; otherwise ignore their contents. </t> </section> <!-- /Other Tag Files --> </section> <!-- /Optional Elements --> From 00eb4ab86485844b33c1f6fc508118f40aedd3b4 Mon Sep 17 00:00:00 2001 From: Chris Adams <cadams@loc.gov> Date: Fri, 6 Jan 2017 18:46:21 -0500 Subject: [PATCH 046/144] Spelling --- bagit.xml | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/bagit.xml b/bagit.xml index 376c0a9c..88b2daf8 100644 --- a/bagit.xml +++ b/bagit.xml @@ -990,9 +990,10 @@ update the bag with valid manifests. 
<section title="Acknowledgements"> <t> -BagIt owes much to many thoughtful contributers and reviewers, including -Stephen Abrams, Mike Ashenfelder, Dan Chudnov, Brad Hards, Scott Fisher, Keith Johnson, Erik -Hetzner, Leslie Johnston, David Loy, Mark Phillips, Tracy Seneca, Brian Tingle, Adam Turoff, and Jim Tuttle. +BagIt owes much to many thoughtful contributors and reviewers, including +Stephen Abrams, Mike Ashenfelder, Dan Chudnov, Brad Hards, Scott Fisher, Keith +Johnson, Erik Hetzner, Leslie Johnston, David Loy, Mark Phillips, Tracy Seneca, +Brian Tingle, Adam Turoff, and Jim Tuttle. </t> <section title="IANA Considerations"> From 8c9d62155522db7e9823cfd0e9120efbcdaeae10 Mon Sep 17 00:00:00 2001 From: Chris Adams <cadams@loc.gov> Date: Fri, 6 Jan 2017 18:46:51 -0500 Subject: [PATCH 047/144] =?UTF-8?q?Copy-editing=20for=20=E2=80=9CComplete,?= =?UTF-8?q?=20Incomplete,=20and=20Valid=20bags=E2=80=9D?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- bagit.xml | 18 ++++++++---------- 1 file changed, 8 insertions(+), 10 deletions(-) diff --git a/bagit.xml b/bagit.xml index 88b2daf8..71e09708 100644 --- a/bagit.xml +++ b/bagit.xml @@ -664,32 +664,30 @@ removing all non-alphanumeric characters. <section title="Complete, Incomplete, and Valid bags" anchor="sec-complete-valid"> <t> -A <spanx style="emph">complete</spanx> bag &must; have the following -attributes: +A <spanx style="emph">complete</spanx> bag &must; meet the following +requirements: </t> <t> <list style="numbers"> <t>Every required element &must; be present (<xref target="sec-required-elements" />).</t> + <t>Every file in every tag manifest &must; be present on the filesystem.</t> <t>Every file in every payload manifest &must; be present on the filesystem.</t> - <t>Every file in every tag manifest &must; be present on the filesystem. 
- Tag files not listed in a tag manifest &may; be present.</t> - <t>Every payload file &must; be listed in all payload manifests.</t> + <t>Every payload file &must; be listed in every payload manifest.</t> </list> </t> <t> -A <spanx style="emph">valid</spanx> bag &must; have the following -attributes: +A <spanx style="emph">valid</spanx> bag &must; meet the following requirements: </t> <t> <list style="numbers"> - <t>The bag &must; be complete.</t> + <t>The bag &must; be <spanx style="emph">complete</spanx>.</t> <t> - Every CHECKSUM in every payload manifest and tag manifest can be - sucessfully verified against the contents of its corresponding FILENAME. + Every checksum in every payload manifest and tag manifest has been + successfully verified against the contents of the corresponding file. </t> <t>Every element present &must; comply with this specification.</t> </list> From ef7f6dffd1073b39943073c2c4f99287830e019f Mon Sep 17 00:00:00 2001 From: Chris Adams <cadams@loc.gov> Date: Fri, 6 Jan 2017 18:47:00 -0500 Subject: [PATCH 048/144] Update version number in example bag --- bagit.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/bagit.xml b/bagit.xml index 71e09708..fa1a789a 100644 --- a/bagit.xml +++ b/bagit.xml @@ -717,7 +717,7 @@ myfirstbag/ | (408ad21d50cef31da4df6d9ed81b01a7 data/27613-h/images/q172.txt) | | bagit.txt -| (BagIt-version: 0.96 ) +| (BagIt-version: 1.0 ) | (Tag-File-Character-Encoding: UTF-8 ) | \--- data/ From a60e64481ff0868a74ee178de3efad7c5763ff70 Mon Sep 17 00:00:00 2001 From: Chris Adams <cadams@loc.gov> Date: Tue, 17 Jan 2017 12:57:17 -0500 Subject: [PATCH 049/144] Update authors * Add Rosie Storey, David Brunton, Kate Zwaard * Update email address for Justin Littman --- bagit.xml | 50 +++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 49 insertions(+), 1 deletion(-) diff --git a/bagit.xml b/bagit.xml index fa1a789a..d578a726 100644 --- a/bagit.xml +++ b/bagit.xml @@ -88,7 +88,7 @@ <code>20052</code> 
<country>USA</country> </postal> - <email>jlit@loc.gov</email> + <email>justinlittman@gwu.edu</email> </address> </author> @@ -138,8 +138,56 @@ </postal> <email>jsca@loc.gov</email> </address> + </author> + + <author initials="R." surname="Storey" + fullname="Rosie Storey"> + <organization> + Library of Congress + </organization> + <address> + <postal> + <street>101 Independence Avenue SE</street> + <city>Washington</city> <region>DC</region> + <code>20540</code> + <country>USA</country> + </postal> + <email>rstorey@loc.gov</email> + </address> + </author> + + <author initials="D." surname="Brunton" + fullname="David Brunton"> + <organization> + Library of Congress + </organization> + <address> + <postal> + <street>101 Independence Avenue SE</street> + <city>Washington</city> <region>DC</region> + <code>20540</code> + <country>USA</country> + </postal> + <email>dbrun@loc.gov</email> + </address> </author> + <author initials="K." surname="Zwaard" + fullname="Kate Zwaard"> + <organization> + Library of Congress + </organization> + <address> + <postal> + <street>101 Independence Avenue SE</street> + <city>Washington</city> <region>DC</region> + <code>20540</code> + <country>USA</country> + </postal> + <email>kzwa@loc.gov</email> + </address> +</author> + <author initials="C." surname="Adams" fullname="Chris Adams"> <organization> From 169a094b0c8e2b690ccff25f2a17332f0ed6eb3f Mon Sep 17 00:00:00 2001 From: Chris Adams <cadams@loc.gov> Date: Mon, 23 Jan 2017 09:40:39 -0500 Subject: [PATCH 050/144] Clarify wording about directories in manifests --- bagit.xml | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/bagit.xml b/bagit.xml index d578a726..a5ddd7c7 100644 --- a/bagit.xml +++ b/bagit.xml @@ -519,10 +519,9 @@ ending = CR / LF / CRLF </figure> <t> -A payload manifest &must-not; reference empty directories. 
To account for -an empty directory, a bag creator may wish to include at least one file -in that directory; a bag creator might wish for example, to include a zero-length -file named ".keep". +A manifest &must-not; reference directories. Bag creators who wish to create +an otherwise empty directory have typically done so by creating an empty +placeholder file with a name such as ".keep". </t> </section> <!-- /Payload Manifest --> </section> <!-- /Required Elements --> From 3d3bb73c5e0e3d01381c44da68dc3a76a447814c Mon Sep 17 00:00:00 2001 From: Chris Adams <cadams@loc.gov> Date: Mon, 23 Jan 2017 09:58:26 -0500 Subject: [PATCH 051/144] Remove duplicate wording --- bagit.xml | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/bagit.xml b/bagit.xml index a5ddd7c7..a23c515d 100644 --- a/bagit.xml +++ b/bagit.xml @@ -699,11 +699,12 @@ removing all non-alphanumeric characters. algorithms <xref target="RFC6234"/> and &should; enable SHA-512 by default when creating new bags. - For backwards-compatibility implementors should support &should; support + For backwards-compatibility implementors &should; support MD-5 <xref target="RFC1321"/> and SHA-1 <xref target="RFC3174"/>. Implementors are encouraged to simplify the process of adding additional - manifests using new algorithms to streamline the process of in-place upgrades. + manifests using new algorithms to streamline the process of in-place + upgrades. </t> </section> <!-- /Bag Checksum Algorithms --> From 0257f7bc35bc8e56a990d52e196459b33c66381e Mon Sep 17 00:00:00 2001 From: Chris Adams <cadams@loc.gov> Date: Mon, 23 Jan 2017 10:07:37 -0500 Subject: [PATCH 052/144] =?UTF-8?q?Update=20=E2=80=9Cfilesystem=E2=80=9D?= =?UTF-8?q?=20reference=20in=20=E2=80=9Ccomplete=E2=80=9D=20requirements?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Since people may wish to use things which aren’t commonly considered filesystems (object stores, archive files, etc.) 
this should be more generic. --- bagit.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/bagit.xml b/bagit.xml index a23c515d..96e466ad 100644 --- a/bagit.xml +++ b/bagit.xml @@ -720,8 +720,8 @@ requirements: <list style="numbers"> <t>Every required element &must; be present (<xref target="sec-required-elements" />).</t> - <t>Every file in every tag manifest &must; be present on the filesystem.</t> - <t>Every file in every payload manifest &must; be present on the filesystem.</t> + <t>Every file listed in every tag manifest &must; be present.</t> + <t>Every file listed in every payload manifest &must; be present.</t> <t>Every payload file &must; be listed in every payload manifest.</t> </list> </t> From e6770927bbd594d4a0a3e49dd7b8bc5897e4a394 Mon Sep 17 00:00:00 2001 From: Chris Adams <cadams@loc.gov> Date: Mon, 23 Jan 2017 10:07:57 -0500 Subject: [PATCH 053/144] =?UTF-8?q?Update=20=E2=80=9Cspecial=20directory?= =?UTF-8?q?=20characters=E2=80=9D=20prose?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- bagit.xml | 20 ++++++++++++-------- 1 file changed, 12 insertions(+), 8 deletions(-) diff --git a/bagit.xml b/bagit.xml index 96e466ad..8ad9fd60 100644 --- a/bagit.xml +++ b/bagit.xml @@ -789,25 +789,29 @@ myfirstbag/ <section title="Special directory characters"> <!-- Added by Brian Vargas, 2009-04-09 --> <t> -The paths specified in the payload manifest, and tag manifest file do not -prohibit special directory characters which might be significant on -implementing systems. Implementors &must; ensure that files outside the bag +The paths specified in the payload manifest and tag manifest file do not +prohibit special directory characters which have special meaning on some +operating systems. Implementors &must; ensure that files outside the bag directory structure are not accessed when reading or writing files based on paths specified in a bag. 
</t> +<t> +All implementations &should; have a test suite to guard against these cases. +</t> + <t> For example, a maliciously crafted "tagmanifest-md5.txt" file might contain entries which begin with a path character such as "/", "..", or a "~username" home directory reference in an attempt to cause a -naive implementation to leak or overwrite targeted files. -All implementations &should; have a test suite to guard against these cases. +naive implementation to leak or overwrite targeted files on a POSIX operating +system. </t> <t> - Windows implementations &should; test their implementations to ensure - that safety-checks prevent use of drive letters and the less commonly used - namespace sequences (e.g. "\\?\C:\…") described in <xref target="MSFNAM" />. +Windows implementations &should; test their implementations to ensure +that safety-checks prevent use of drive letters and the less commonly used +namespace sequences (e.g. "\\?\C:\…") described in <xref target="MSFNAM" />. </t> </section> <!-- End Section: Special directory characters --> </section> <!-- End Section: Security considerations --> From 1f97d39af8ebaa7af1be7f7e877347ef63fd7148 Mon Sep 17 00:00:00 2001 From: Chris Adams <cadams@loc.gov> Date: Mon, 23 Jan 2017 11:07:18 -0500 Subject: [PATCH 054/144] Note that bag-info.txt fields are intended for human consumption --- bagit.xml | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/bagit.xml b/bagit.xml index 8ad9fd60..f1dc3449 100644 --- a/bagit.xml +++ b/bagit.xml @@ -565,9 +565,10 @@ As a result, no FILENAME listed in a tag manifest begins "data/". The "bag-info.txt" file is a tag file that contains metadata elements describing the bag and the payload. The metadata elements contained in the "bag-info.txt" file are intended primarily for human use. -All metadata elements are optional and &may; be repeated. 
Implementations -&must; assume that the ordering is significant and provide access to the -metadata elements in the order they are given in the "bag-info.txt" file. +All metadata elements are optional and &may; be repeated. Because +“bag-info.txt” is intended for human reading and editing, implementations +&must; assume that the order of metadata elements is significant and &must; be +preserved. </t> <t> A metadata element &must; consist of a label, a colon, and a value, From aaffd0acc4da5617d7d5ea82297ffbf3083cac0e Mon Sep 17 00:00:00 2001 From: Chris Adams <cadams@loc.gov> Date: Mon, 23 Jan 2017 14:43:02 -0500 Subject: [PATCH 055/144] Update Makefile and add a format target --- Makefile | 13 +++++++++++++ makefile | 7 ------- 2 files changed, 13 insertions(+), 7 deletions(-) create mode 100644 Makefile delete mode 100644 makefile diff --git a/Makefile b/Makefile new file mode 100644 index 00000000..7f9cadc8 --- /dev/null +++ b/Makefile @@ -0,0 +1,13 @@ +default: html text + +html: + xml2rfc bagit.xml + +text: + xml2rfc --html bagit.xml + +format: + # We can't enable c14n because that triggers external DTD fetching and + # libxml2 currently does not support HTTPS, which is a problem now that all + # of the xml.resource.org URLs redirect: + xmllint --format --output bagit.xml bagit.xml \ No newline at end of file diff --git a/makefile b/makefile deleted file mode 100644 index 58e03ec3..00000000 --- a/makefile +++ /dev/null @@ -1,7 +0,0 @@ -default: html text - -html: - xml2rfc bagit.xml - -text: - xml2rfc --html bagit.xml From bc862cf916cfcb1424b3fd0deae3e0c65814c746 Mon Sep 17 00:00:00 2001 From: Chris Adams <cadams@loc.gov> Date: Mon, 23 Jan 2017 14:45:48 -0500 Subject: [PATCH 056/144] XML formatting This applies consistent XML formatting going forward. It's a big diff but does not change the generated text or HTML output. 
--- bagit.xml | 1063 +++++++++++++++++++++++------------------------------ 1 file changed, 454 insertions(+), 609 deletions(-) diff --git a/bagit.xml b/bagit.xml index f1dc3449..43c393d2 100644 --- a/bagit.xml +++ b/bagit.xml @@ -1,242 +1,216 @@ -<?xml version='1.0' ?> +<?xml version="1.0"?> <!-- See http://xml.resource.org/ for formatting tools to work with the RFC 7749 XML format --> <!DOCTYPE rfc SYSTEM "rfc2629.dtd" [ - - <!ENTITY mdash '—' > - - <!ENTITY RFC1321 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.1321.xml"> - <!ENTITY RFC2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml"> - <!ENTITY RFC2629 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2629.xml"> - <!ENTITY RFC3174 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3174.xml"> - <!ENTITY RFC3552 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3552.xml"> - <!ENTITY RFC3629 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3629.xml"> - <!ENTITY RFC3986 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3986.xml"> - <!ENTITY RFC5226 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5226.xml"> - <!ENTITY RFC6234 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.6234.xml"> - - <!-- RFC 2119 entities - for convenience --> - <!ENTITY must 'MUST' > - <!ENTITY must-not 'MUST NOT' > - <!ENTITY required 'REQUIRED' > - <!ENTITY shall 'SHALL' > - <!ENTITY shall-not 'SHALL NOT' > - <!ENTITY should 'SHOULD' > - <!ENTITY should-not 'SHOULD NOT' > - <!ENTITY recommended 'RECOMMENDED' > - <!ENTITY may 'MAY' > - <!ENTITY optional 'OPTIONAL' > - - <!-- The current bagit version, for convenience. 
--> - <!ENTITY current-bagit-version '1.00' > +<!ENTITY mdash "—"> +<!ENTITY RFC1321 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.1321.xml"> +<!ENTITY RFC2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml"> +<!ENTITY RFC2629 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2629.xml"> +<!ENTITY RFC3174 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3174.xml"> +<!ENTITY RFC3552 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3552.xml"> +<!ENTITY RFC3629 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3629.xml"> +<!ENTITY RFC3986 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3986.xml"> +<!ENTITY RFC5226 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5226.xml"> +<!ENTITY RFC6234 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.6234.xml"> +<!-- RFC 2119 entities - for convenience --><!ENTITY must "MUST"> +<!ENTITY must-not "MUST NOT"> +<!ENTITY required "REQUIRED"> +<!ENTITY shall "SHALL"> +<!ENTITY shall-not "SHALL NOT"> +<!ENTITY should "SHOULD"> +<!ENTITY should-not "SHOULD NOT"> +<!ENTITY recommended "RECOMMENDED"> +<!ENTITY may "MAY"> +<!ENTITY optional "OPTIONAL"> +<!-- The current bagit version, for convenience. --><!ENTITY current-bagit-version "1.00"> ]> - <?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?> <?rfc strict="yes" ?> <?rfc comments="no"?> <?rfc inline="yes"?> <?rfc symrefs="yes"?> <?rfc toc="yes"?> - <rfc category="info" docName="draft-kunze-bagit-14" ipr="trust200902"> - <front> - - <title abbrev="BagIt"> + <front> + <title abbrev="BagIt"> The BagIt File Packaging Format (V&current-bagit-version;) - - -
- - 1438 Kingfisher Way - Sunnyvale CA - 94087 - USA - - andy@boyko.net -
-
- - - + +
+ + 1438 Kingfisher Way + Sunnyvale + CA + 94087 + USA + + andy@boyko.net +
+
+ + California Digital Library -
- - 415 20th St, 4th Floor - Oakland CA - 94612 - US - - jak@ucop.edu -
-
- - - +
+ + 415 20th St, 4th Floor + Oakland + CA + 94612 + US + + jak@ucop.edu +
+
+ + George Washington University Libraries -
- - 2130 H St NW - Washington DC - 20052 - USA - - justinlittman@gwu.edu -
-
- - - +
+ + 2130 H St NW + Washington + DC + 20052 + USA + + justinlittman@gwu.edu +
+
+ + University of Maryland -
- - 4130 Campus Drive - College Park MD - 20742 - USA - - ehs@pobox.com -
-
- - - +
+ + 4130 Campus Drive + College Park + MD + 20742 + USA + + ehs@pobox.com +
+
+ + Library of Congress -
- - 101 Independence Avenue SE - Washington DC - 20540 - USA - - emad@loc.gov -
-
- - - +
+ + 101 Independence Avenue SE + Washington + DC + 20540 + USA + + emad@loc.gov +
+
+ + Library of Congress -
- - 101 Independence Avenue SE - Washington DC - 20540 - USA - - jsca@loc.gov -
-
- - - +
+ + 101 Independence Avenue SE + Washington + DC + 20540 + USA + + jsca@loc.gov +
+
+ + Library of Congress -
- - 101 Independence Avenue SE - Washington DC - 20540 - USA - - rstorey@loc.gov -
-
- - - +
+ + 101 Independence Avenue SE + Washington + DC + 20540 + USA + + rstorey@loc.gov +
+
+ + Library of Congress -
- - 101 Independence Avenue SE - Washington DC - 20540 - USA - - dbrun@loc.gov -
-
- - - +
+ + 101 Independence Avenue SE + Washington + DC + 20540 + USA + + dbrun@loc.gov +
+
+ + Library of Congress -
- - 101 Independence Avenue SE - Washington DC - 20540 - USA - - kzwa@loc.gov -
-
- - - +
+ + 101 Independence Avenue SE + Washington + DC + 20540 + USA + + kzwa@loc.gov +
+
+ + Library of Congress -
- - 101 Independence Avenue SE - Washington DC - 20540 - USA - - cadams@loc.gov -
-
- - -
- - 1354 Quincy St. NW - Washington DC - 20011 - USA - - brian@ardvaark.net -
-
- - - - - - +
+ + 101 Independence Avenue SE + Washington + DC + 20540 + USA + + cadams@loc.gov +
+
+ +
+ + 1354 Quincy St. NW + Washington + DC + 20011 + USA + + brian@ardvaark.net +
+
+ + + This document specifies BagIt, a hierarchical file packaging format for storage and transfer of arbitrary digital content. A "bag" has just enough structure to enclose descriptive metadata "tags" and a file "payload" but does not require knowledge of the payload's internal semantics. This BagIt format should be suitable for reliable storage and transfer. - - - -
- - -
-
- + + + +
+
+ BagIt is a hierarchical file packaging format designed to support storage and transfer of arbitrary digital content. A bag consists of a directory containing the payload files and other accompanying @@ -245,142 +219,126 @@ facilitate and document the storage and transfer of the bag. Processing a bag does not require any understanding of the payload file contents and the payload can be accessed without processing the BagIt metadata. - - + The name, BagIt, is inspired by the "enclose and deposit" method -, sometimes referred to as "bag it and tag it". +, sometimes referred to as "bag it and tag it". BagIt differs from traditional archive formats such as TAR or ZIP in two general areas: - - + Strong integrity assurances: the format supports only cryptographic-quality - hash algorithms (see ) and allows + hash algorithms (see ) and allows for in-place upgrades to add additional manifests using stronger algorithms without breaking backwards compatibility - - + Direct file access: files may be accessed using standard operating system utilities, implementations do not need to process a potentially large archive file to extract a subset of data, and the format imposes no size limits for either individual files or a bag. - - + -
- -
- +
+ +
+ The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in . - - + Implementors are strongly encouraged to review the interoperability - considerations described in . + considerations described in . -
- -
- +
+ +
+ The following terms have precise definitions as used in this specification: - - - - + + + A order independent set of opaque files contained within the structure defined by this specification. - - + The file required to be in all bags conforming to this specification. Contains values necessary to process the rest of a bag. See . - - + The name of a cryptographic checksum algorithm which has been normalized for use in a manifest or tag manifest file name (e.g. "SHA-1" becomes - "sha1") as described in . + "sha1") as described in . - - + The data encapsulated by the bag. The contents of the payload are opaque to this specification, and, with respect to BagIt processing, are always considered as an opaque octet stream. - See . + See . - - + A directory that contains one or more tag files. - - + A file which contains metadata. The specification defines two standard tag files: tag manifests, which describe other tag files - , and the "bag-info.txt" file containing - human-meaningful metadata . + , and the "bag-info.txt" file containing + human-meaningful metadata . The specification also allows other arbitrary tag files as described in - . + . - - + A bag which contains every element required by this specification, every payload file listed in a manifest, and any optional files which are - listed in a tag manifest. See . + listed in a tag manifest. See . - - + A complete bag where every checksum in every manifest has been successfully verified against the corresponding file. - - -
- - - + + -
- -
- +--> + +
+ +
+ A bag consists of a base directory containing: - - - - a set of required and optional tag files - a sub-directory named "data", called the payload directory. - a set of optional tag directories - - - - + + + a set of required and optional tag files + a sub-directory named "data", called the payload directory. + a set of optional tag directories + + + The tag files in the base directory consist of one or more files named "manifest-algorithm.txt" -(see and -), -a file named "bagit.txt" (see ), +(see and +), +a file named "bagit.txt" (see ), and zero or more additional tag files (see -). The tag files and directories are +). The tag files and directories are arbitrary file hierarchies and &may; have any name that is not reserved for a file or directory in this specification. - - + The base directory &may; have any name. - -
- +
+ <base directory>/ | +-- bagit.txt @@ -397,40 +355,35 @@ The base directory &may; have any name. | +-- [optional tag files] -
- -
-
- - +
+
+
+ The "bagit.txt" tag file &must; consist of exactly two lines: - -
- +
+ BagIt-Version: M.N Tag-File-Character-Encoding: UTF-8 - + M.N identifies the BagIt major (M) and minor (N) version numbers, and UTF-8 identifies the character set encoding used by the tag files. The bag declaration &must; be encoded in UTF-8, and &must-not; contain a byte-order mark (BOM) . -
- - +
+ The number for this version of the specification is "&current-bagit-version;". -
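Reviewer note: the two-line bag declaration described in this hunk is simple enough to check mechanically. The following is a minimal sketch, assuming a hypothetical helper named `parse_bag_declaration` (not part of any existing BagIt library) that receives the raw bytes of "bagit.txt".

```python
import re

# Hypothetical helper for illustration; names and return shape are assumptions.
_VERSION = re.compile(r"BagIt-Version: ([0-9]+)\.([0-9]+)")
_ENCODING = re.compile(r"Tag-File-Character-Encoding: (\S+)")


def parse_bag_declaration(raw: bytes):
    """Return (major, minor, encoding) or raise ValueError."""
    # The declaration must be UTF-8 and must not begin with a byte-order mark.
    if raw.startswith(b"\xef\xbb\xbf"):
        raise ValueError("bagit.txt must not begin with a BOM")
    text = raw.decode("utf-8")  # UnicodeDecodeError is a subclass of ValueError

    # Lines may be terminated by LF, CR, or CRLF.
    lines = re.split(r"\r\n|\r|\n", text)
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece left by the final line terminator
    if len(lines) != 2:
        raise ValueError("bagit.txt must consist of exactly two lines")

    version = _VERSION.fullmatch(lines[0])
    encoding = _ENCODING.fullmatch(lines[1])
    if version is None or encoding is None:
        raise ValueError("malformed bag declaration")
    return int(version.group(1)), int(version.group(2)), encoding.group(1)
```

The sketch treats any deviation as an error; a real tool might prefer to report all problems at once rather than stop at the first.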
- -
- +
+ +
+ The base directory &must; contain a sub-directory named "data". - - + The payload directory contains the arbitrary digital content within the bag. The files under the payload directory are called payload files, or the payload. Each payload file is treated as an opaque octet stream when verifying file @@ -438,48 +391,42 @@ correctness. Any sub-directory structure within the payload &must; be preserved but is otherwise ignored for purposes relating to this specification. -
- -
- +
+ - + A payload manifest file provides a complete listing of each payload file along with a corresponding checksum to permit data integrity checking. - - + Every bag &must; contain at least one payload manifest file and &may; contain more than one. Every payload manifest &must; list every payload file. A payload -manifest file &must; have a name of the form "manifest-algorithm.txt", where algorithm +manifest file &must; have a name of the form "manifest-algorithm.txt", where algorithm is a string specifying the checksum algorithm used by that manifest as described -in . +in . - -
- Example payload manifest filenames - +
+ Example payload manifest filenames + manifest-md5.txt manifest-sha1.txt -
- - +
+ Each line of a payload manifest file &must; be of the form: - -
- CHECKSUM FILENAME - +
+ CHECKSUM FILENAME + where FILENAME is the pathname of a file relative to the base directory, and CHECKSUM is a hex-encoded checksum calculated according to algorithm over every octet in the file. -
- - +
+ The hex-encoded checksum &may; use uppercase and/or lowercase letters. The slash character ('/') &must; be used as a path separator in FILENAME. @@ -494,10 +441,9 @@ If a FILENAME includes a newline (LF), a carriage return (CR), or carriage return plus newline (CRLF) it &must; be percent-encoded following . - -
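Reviewer note: the manifest line rules above (hex checksum in either case, "/" as the path separator, percent-encoded CR and LF in FILENAME) can be illustrated with a short parser. This is a sketch only: `parse_manifest_line` is a hypothetical name, and splitting on the first run of whitespace is an assumption, since the separator rule sits in a part of the hunk that is elided here.

```python
import re

_HEX = re.compile(r"[0-9A-Fa-f]+")


def parse_manifest_line(line: str):
    """Split one manifest line into (checksum, filename)."""
    # Assumption: CHECKSUM and FILENAME are separated by a run of whitespace.
    parts = line.rstrip("\r\n").split(None, 1)
    if len(parts) != 2:
        raise ValueError("expected 'CHECKSUM FILENAME'")
    checksum, filename = parts
    if _HEX.fullmatch(checksum) is None:
        raise ValueError("checksum is not hex-encoded")
    # Decode only the percent-encoded LF (%0A) and CR (%0D) named in the text.
    filename = re.sub(
        r"%0[AaDd]", lambda m: chr(int(m.group(0)[1:], 16)), filename
    )
    # Upper- and lowercase hex digits are both allowed, so normalize for comparison.
    return checksum.lower(), filename
```

Lowercasing the checksum is a design choice that makes later comparison against a freshly computed digest a plain string equality.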
- Payload Manifest ABNF - + Payload Manifest ABNF + -
- - +
+ A manifest &must-not; reference directories. Bag creators who wish to create an otherwise empty directory have typically done so by creating an empty placeholder file with a name such as ".keep". -
-
- -
-
- +
+ +
+
+ - + A tag manifest is a tag file that lists other tag files and checksums for those tag files generated using a particular bag checksum algorithm. A bag &may; contain one or more tag manifests. -A tag manifest file &must; have a name of the form "tagmanifest-algorithm.txt", where algorithm +A tag manifest file &must; have a name of the form "tagmanifest-algorithm.txt", where algorithm is a string following the format described in - specifying the bag checksum algorithm + specifying the bag checksum algorithm used in that manifest. - -
- Example tag manifest filenames: - +
+ Example tag manifest filenames: + tagmanifest-md5.txt tagmanifest-sha1.txt -
- - +
+ A tag manifest file has the same form as the payload file manifest -file described in , +file described in , but &must-not; list any payload files. As a result, no FILENAME listed in a tag manifest begins "data/". - -
- -
- +
+ +
+ The "bag-info.txt" file is a tag file that contains metadata elements describing the bag and the payload. The metadata elements contained in the "bag-info.txt" file are intended primarily for human use. All metadata elements are optional and &may; be repeated. Because -“bag-info.txt” is intended for human reading and editing, implementations +“bag-info.txt” is intended for human reading and editing, implementations &must; assume that the order of metadata elements is significant and &must; be preserved. - + A metadata element &must; consist of a label, a colon, and a value, each separated by optional whitespace. It is &recommended; that lines not exceed 79 characters in length. Long values may be continued @@ -578,19 +520,17 @@ onto the next line by inserting a newline (LF), a carriage return (CR), or carriage return plus newline (CRLF) and indenting the next line with linear white space (spaces or tabs). - -
- "bag-info.txt" ABNF - + "bag-info.txt" ABNF + -
- - + + An implementation &should; add the optional "Payload-Oxum" element for the purpose of quickly detecting incomplete bags before performing checksum validation. This is strictly an optimization and implementations &must; perform @@ -598,8 +538,7 @@ the standard checksum validation process before proclaiming a bag to be valid. This element &must-not; be present more than once and, if present, &must; conform to this format: - - + The "octet-stream sum" of the payload is a pair of two numbers in the form "OctetCount.StreamCount", where OctetCount is the total number of octets @@ -607,17 +546,15 @@ conform to this format: StreamCount is the total number of payload files. - -
- Payload-Oxum ABNF - + Payload-Oxum ABNF + -
- -
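Reviewer note: the "OctetCount.StreamCount" pair defined above is cheap to compute, which is the point of the optimization. A minimal sketch, assuming a hypothetical `payload_oxum` helper that walks the "data" directory of a bag stored on a local filesystem:

```python
import os


def payload_oxum(bag_dir: str) -> str:
    """Return 'OctetCount.StreamCount' for the payload under bag_dir/data."""
    octet_count = 0   # total bytes across all payload files
    stream_count = 0  # number of payload files
    for root, _dirs, files in os.walk(os.path.join(bag_dir, "data")):
        for name in files:
            octet_count += os.path.getsize(os.path.join(root, name))
            stream_count += 1
    return f"{octet_count}.{stream_count}"
```

A mismatch between this value and the one recorded in "bag-info.txt" indicates an incomplete bag early, but a match does not replace checksum validation.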
- An example "bag-info.txt" file - +
+
+ An example "bag-info.txt" file + Source-Organization: FOO University Organization-Address: 1 Main St., Cupertino, California, 11111 Contact-Name: Jane Doe @@ -635,34 +572,34 @@ conform to this format: Internal-Sender-Description: Uncompressed greyscale TIFFs created from microfilm and are... -
-
- -
- + +
+ +
+ A bag &may; contain other tag files that are not defined by this specification. Implementations &must; perform standard checksum validation on any tag file which is listed in a tag manifest but &must; otherwise ignore their contents. -
-
- -
- +
+ +
+ +
+ All tag files specifically described in this specification &must; adhere to the text tag file format described below. Other tag files &may; adhere to the text tag file format described below. - + Text tag files are line-oriented, and each line &must; be terminated by a newline (LF), a carriage return (CR), or carriage return plus newline (CRLF). Text tag file names &must; end in the extension ".txt". - - + In all text tag files except for the bag declaration file, text &must; be encoded in the character encoding specified in the "bagit.txt" bag declaration file. Text tag files except for the bag declaration file &may; include a @@ -671,31 +608,28 @@ proper decoding. In accordance with , when "bagit.txt" specifies UTF-8 the tag files &must-not; begin with a byte-order mark (BOM). See - - - + -
- -
- +
+ +
+ The payload manifest and tag manifests permit validating the integrity of the payload and tag files in a bag produced by the checksum algorithms. Checksum values &must; be encoded so as to conform to the manifest format specified in . However, the internal details of a checksum are outside the scope of this document. - - + The name of the checksum algorithm &must; be normalized for use in the manifest's filename by lowercasing the common name of the algorithm and removing all non-alphanumeric characters. - - + Bag creation and validation tools &must; support the SHA-2 family of algorithms and &should; enable SHA-512 by default when creating new bags. @@ -707,49 +641,42 @@ removing all non-alphanumeric characters. manifests using new algorithms to streamline the process of in-place upgrades. -
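Reviewer note: the normalization rule in this hunk (lowercase the common name, then remove all non-alphanumeric characters) can be sketched as follows. The function names are hypothetical, and restricting "alphanumeric" to ASCII letters and digits is an assumption.

```python
import re


def normalize_algorithm_name(common_name: str) -> str:
    """Lowercase the name and drop every non-alphanumeric character."""
    # Assumption: only ASCII letters and digits count as alphanumeric.
    return re.sub(r"[^a-z0-9]", "", common_name.lower())


def manifest_filename(common_name: str, tag: bool = False) -> str:
    """Build the manifest or tag manifest file name for an algorithm."""
    prefix = "tagmanifest" if tag else "manifest"
    return f"{prefix}-{normalize_algorithm_name(common_name)}.txt"
```

For example, "SHA-1" normalizes to "sha1", matching the example given in the terminology section.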
- -
- -
- +
+ + + +
+ A complete bag &must; meet the following requirements: - - - - Every required element &must; be present - (). - Every file listed in every tag manifest &must; be present. - Every file listed in every payload manifest &must; be present. - Every payload file &must; be listed in every payload manifest. - - - - + + + Every required element &must; be present + (). + Every file listed in every tag manifest &must; be present. + Every file listed in every payload manifest &must; be present. + Every payload file &must; be listed in every payload manifest. + + + A valid bag &must; meet the following requirements: - - - - The bag &must; be complete. - + + + The bag &must; be complete. + Every checksum in every payload manifest and tag manifest has been successfully verified against the contents of the corresponding file. - Every element present &must; comply with this specification. - - - -
- - - -
-
- - + Every element present &must; comply with this specification. + + +
+ +
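Reviewer note: the completeness requirements listed above reduce to a few set comparisons. The sketch below is illustrative only: `completeness_problems` and its arguments are hypothetical, the manifests are assumed to be parsed already into sets of file names, and the check for the "data" directory itself is omitted because an empty directory cannot be represented in a set of file paths.

```python
def completeness_problems(present, payload_manifests, tag_manifests=None):
    """Return a list of problems; an empty list means these checks passed.

    present           -- set of file paths relative to the base directory
    payload_manifests -- dict: manifest name -> set of listed file names
    tag_manifests     -- dict: tag manifest name -> set of listed file names
    """
    tag_manifests = tag_manifests or {}
    problems = []

    # 1. Required elements (the "data" directory check is not modelled here).
    if "bagit.txt" not in present:
        problems.append("missing bagit.txt")
    if not payload_manifests:
        problems.append("no payload manifest")

    # 2 and 3. Every file listed in any manifest must be present.
    for manifests in (tag_manifests, payload_manifests):
        for name in sorted(manifests):
            for missing in sorted(manifests[name] - present):
                problems.append(f"{name} lists missing file {missing}")

    # 4. Every payload file must be listed in every payload manifest.
    payload_files = {p for p in present if p.startswith("data/")}
    for name in sorted(payload_manifests):
        for unlisted in sorted(payload_files - payload_manifests[name]):
            problems.append(f"{unlisted} is not listed in {name}")
    return problems
```

A validity check would run on top of this by recomputing each checksum and comparing it with the manifest entry.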
+
+ This is the layout of a basic bag containing an image and a companion OCR file. Lines of file content are shown in parentheses beneath the file name. @@ -757,8 +684,7 @@ file name. -
- +
myfirstbag/ | | manifest-md5.txt @@ -777,102 +703,86 @@ myfirstbag/ | 27613-h/images/q172.txt | (... OCR text ... ) .... - -
+
-
- -
- -
- -
- - +
+
+ +
+
+ + The paths specified in the payload manifest and tag manifest file do not prohibit special directory characters which have special meaning on some operating systems. Implementors &must; ensure that files outside the bag directory structure are not accessed when reading or writing files based on paths specified in a bag. - - + All implementations &should; have a test suite to guard against these cases. - - + For example, a maliciously crafted "tagmanifest-md5.txt" file might contain entries which begin with a path character such as "/", "..", or a "~username" home directory reference in an attempt to cause a naive implementation to leak or overwrite targeted files on a POSIX operating system. - - + Windows implementations &should; test their implementations to ensure that safety-checks prevent use of drive letters and the less commonly used -namespace sequences (e.g. "\\?\C:\…") described in . - -
-
- -
-
- - +namespace sequences (e.g. "\\?\C:\…") described in . + +
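Reviewer note: a test suite for the cases named in this section could start from a predicate like the one below. It is a deliberately conservative sketch, not a complete defence: `is_safe_bag_path` is a hypothetical name, and rejecting every backslash and every leading "letter plus colon" is an assumption that also refuses some names that are legal on POSIX systems.

```python
import re


def is_safe_bag_path(filename: str) -> bool:
    """Return False for manifest paths that could escape the bag directory."""
    if not filename:
        return False
    # Absolute POSIX paths and "~username" home directory references.
    if filename.startswith(("/", "~")):
        return False
    # Assumption: refuse backslashes outright, which covers Windows
    # separators and namespace prefixes that begin with backslashes.
    if "\\" in filename:
        return False
    # Windows drive letters such as "C:".
    if re.match(r"[A-Za-z]:", filename):
        return False
    # Parent-directory components anywhere in the path.
    if ".." in filename.split("/"):
        return False
    return True
```

An implementation would normally combine a check like this with resolving the final path and confirming it still lies inside the bag's base directory.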
+ +
+ +
+
+ This section lists practical considerations for implementors and users. None of the points below are required but they are recommended for general-purpose usage. - -
- - +
+ This section provides background information on various challenges caused by differences in how operating systems, filesystems, and common tools handle filenames followed by a list of recommendations for implementors in - . + . - -
- +
+ There are two challenges for interoperability related to filename case: - - + Filesystems such as FAT or EXFAT always convert filenames to uppercase: "example.txt" will be stored as "EXAMPLE.TXT" - - + Many Unix filesystems save filenames exactly as provided, allowing multiple files which differ only in case: "example.txt" and "Example.txt" are separate files - - + NTFS and HFS+ usually preserve case when storing files but are case-insensitive when retrieving them. A file saved as "Example.txt" will be retrieved by that name but will also be retrieved as "EXAMPLE.TXT", "example.txt", etc. - - + -
- -
- - +
+
+ The Unicode specification has common cases where different character sequences -produce the same human-meaningful text. These are referred to as “canonically -equivalent” and the Unicode specification defines different normalization -forms — see for the full details and a brief +produce the same human-meaningful text. These are referred to as “canonically +equivalent” and the Unicode specification defines different normalization +forms — see for the full details and a brief example below: - -
- - The common surname "Núñez" normalized in different forms +
+ + The common surname "Núñez" normalized in different forms - -
- - +
+ Unicode normalization is relevant to BagIt implementors because different systems have different standards for normalization: - - + Apple's HFS Plus filesystem always normalizes filenames to a - fully-decomposed form based on the Unicode 2.0 specification (see ). - - - Windows treats filenames as opaque character sequences (see ) and will store and return the encoded bytes exactly + fully-decomposed form based on the Unicode 2.0 specification (see ). + + Windows treats filenames as opaque character sequences (see ) and will store and return the encoded bytes exactly as provided. - - + Linux and other common Unix systems are generally similar to Windows in storing and returning opaque byte streams but this behaviour is technically filesystem-dependent. - - + Utilities used for file management, transfer, and archival may ignore this issue, apply an arbitrary normalization form, or allow the user to control how normalization is applied. - - + - - + In practice, this means that the encoded filename stored in a manifest may fail a simple file existence check because the filename's normalization was changed at some point after the manifest was written. This situation is very confusing for users because the filenames are visually indistinguishable and - the “missing” file is obviously present in the payload directory. + the “missing” file is obviously present in the payload directory. -
- -
- - - +
+
+ + + Implementations &should; discourage the creation of bags containing files which differ only in case. - + Implementations &must; prevent the creation of bags containing files which differ only in normalization form. - + BagIt implementations &should; tolerate differences in normalization form by comparing both the list of filesystem and manifest names after applying the same normalization form to both. - + Implementations &should; issue a warning when multiple manifests are present which differ only in case or normalization form. - - -
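The normalization-tolerant comparison recommended above can be sketched in Python as follows. This is only an illustration: the function name is hypothetical, and the choice of NFC is an assumption, since the text only asks that the same normalization form be applied to both lists.

```python
import unicodedata


def missing_after_normalization(manifest_names, filesystem_names, form="NFC"):
    """Return manifest names with no filesystem match once both are normalized.

    The default form (NFC) is an assumption for this sketch; any form works
    as long as it is applied to both lists.
    """
    on_disk = {unicodedata.normalize(form, name) for name in filesystem_names}
    return [
        name
        for name in manifest_names
        if unicodedata.normalize(form, name) not in on_disk
    ]


# "Núñez" written with precomposed characters (NFC) and with combining marks (NFD).
nfc_name = "data/N\u00fa\u00f1ez.txt"
nfd_name = "data/Nu\u0301n\u0303ez.txt"

print(nfc_name == nfd_name)  # a naive comparison sees two different names
print(missing_after_normalization([nfc_name], [nfd_name]))
```

A simple existence check would report the file as missing here, while the normalized comparison finds it.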
-
- -
- - + + +
+
+
+ As specified above, only the Unix-based path separator ('/') may be used inside filenames listed in BagIt manifests files. When bags are exchanged between Windows and Unix platforms, care should @@ -971,46 +869,40 @@ Windows or Unix. Besides the fundamental difference between path separators ('\' and '/'), generally, Windows filesystems have more limitations than Unix filesystems. - -
- +
+ Windows path names have a maximum of 255 characters, and none of these characters may be used in a path component: - + < > : " / | ? * -
- -
- +
+
+ Windows also reserves the following names, with or without a file extension: - + CON, PRN, AUX, NUL COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9 LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, LPT9 -
- - - See for more information and possible alternatives. +
+ + See for more information and possible alternatives. -
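As a rough sketch of how a tool might flag manifest paths that would cause trouble on Windows, the following Python checks each path component against the reserved characters and names listed above. The function names are hypothetical, and treating the reserved names as case-insensitive is an assumption of this example.

```python
# Characters and device names taken from the lists above.
RESERVED_CHARACTERS = set('<>:"/|?*')
RESERVED_NAMES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{n}" for n in range(1, 10)}
    | {f"LPT{n}" for n in range(1, 10)}
)


def windows_component_problems(component):
    """List the reasons a single path component would be a problem on Windows."""
    problems = []
    bad = sorted(set(component) & RESERVED_CHARACTERS)
    if bad:
        problems.append("reserved characters: " + " ".join(bad))
    # Reserved names apply with or without a file extension; comparing in
    # upper case is an assumption of this sketch.
    stem = component.split(".", 1)[0].upper()
    if stem in RESERVED_NAMES:
        problems.append("reserved name: " + stem)
    return problems


def windows_path_problems(manifest_path):
    """Check each '/'-separated component of a manifest FILENAME."""
    report = {}
    for component in manifest_path.split("/"):
        problems = windows_component_problems(component)
        if problems:
            report[component] = problems
    return report
```

For example, `windows_path_problems("data/aux.txt")` reports the component `aux.txt` as using a reserved name, while an ordinary path yields an empty report.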
- -
- - +
+
+ Some bags have been manually assembled using checksum utilities such as those contained in the GNU Coreutils package (md5sum, sha1sum, etc.), collectively referred to here as "md5sum". Implementors who desire wide support of legacy content should be aware of some known quirks of these tools: - - -md5sum can be run in “text mode” which causes it to normalize line-endings + +md5sum can be run in “text mode” which causes it to normalize line-endings on some operating systems. On Unix-like systems both modes will usually produce the same results but on systems like Windows they may produce different results based on the file contents. @@ -1019,71 +911,40 @@ The md5sum output format has two characters between the checksum and the filename: the first is always a space and the second is an asterisk ("*") for binary mode and a space for text mode. - - + A final note about md5sum-generated manifests is that for a FILENAME containing a backslash ('\'), the manifest line will have a backslash inserted in front of the CHECKSUM and, under Windows, the backslashes inside FILENAME may be doubled. - - + Implementers &may; wish to accept this format by ignoring a leading asterisk or handling differences in line termination gracefully but, if so, implementations &must; warn the user that the bag in question will fail strict validation. In such cases it is strongly encouraged that tools provide an easy option to update the bag with valid manifests. -
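A lenient reader for md5sum-style lines might look like the Python sketch below. It is an illustration only: the function name and warning strings are hypothetical, and it handles just the quirks described above (the text or binary mode marker, the leading backslash, and doubled backslashes), not every escape md5sum can emit.

```python
import re

# Optional leading backslash, CHECKSUM, a space, then a second separator
# character: a space (text mode) or an asterisk (binary mode), then FILENAME.
MD5SUM_LINE = re.compile(r"^(\\?)([0-9A-Fa-f]+) ([ *])(.+)$")


def parse_md5sum_line(line):
    """Parse one md5sum-style line; returns (checksum, filename, warnings)."""
    match = MD5SUM_LINE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError("not an md5sum-style line: %r" % line)
    escaped, checksum, mode, filename = match.groups()
    warnings = []
    if mode == "*":
        warnings.append("binary-mode asterisk before FILENAME")
    if escaped:
        # Only doubled backslashes are undone here; other escapes are left as-is.
        filename = filename.replace("\\\\", "\\")
        warnings.append("md5sum backslash escaping")
    return checksum.lower(), filename, warnings
```

A non-empty warnings list is the cue for a tool to tell the user that the bag will fail strict validation and to offer to rewrite the manifest.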
- -
-
- -
- - +
+
+ +
+ +
+ BagIt owes much to many thoughtful contributors and reviewers, including Stephen Abrams, Mike Ashenfelder, Dan Chudnov, Brad Hards, Scott Fisher, Keith Johnson, Erik Hetzner, Leslie Johnston, David Loy, Mark Phillips, Tracy Seneca, Brian Tingle, Adam Turoff, and Jim Tuttle. - -
- - +
+ This draft does not request any action from IANA. - -
- -
- - - - - - - - - A Collaboration Model between Archival Systems to Enhance - the Reliability of Preservation by an Enclose-and-Deposit Method - - - - - - - - - Naming a File - Microsoft, Inc. - - - - +
+ + + + A Collaboration Model between Archival Systems to Enhance + the Reliability of Preservation by an Enclose-and-Deposit MethodNaming a FileMicrosoft, Inc. &RFC2119; &RFC1321; @@ -1092,25 +953,9 @@ This draft does not request any action from IANA. &RFC3629; &RFC3986; - - - Unicode® Standard Annex #15: Unicode Normalization Forms - Unicode Consortium - - - - + Unicode® Standard Annex #15: Unicode Normalization FormsUnicode Consortium - - - Technical Note TN1150: HFS Plus Volume Format - Apple Inc. - - - + Technical Note TN1150: HFS Plus Volume FormatApple Inc. - - - + From 56c069e41f2ee02355d8650eb6e541b03db9178c Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Tue, 28 Feb 2017 11:47:14 -0500 Subject: [PATCH 057/144] Remove `Bag-Size` from example bag-info.txt This field is not specified but its presence in the examples can cause confusion that it's a standard field. Since any app which needs a human-formatted total file size may trivially calculate it from the Payload-Oxum (also see #2) or filesystem metadata this commit removes it to avoid confusion or the implication that tools must maintain the value. --- bagit.xml | 1 - 1 file changed, 1 deletion(-) diff --git a/bagit.xml b/bagit.xml index 43c393d2..873f615f 100644 --- a/bagit.xml +++ b/bagit.xml @@ -564,7 +564,6 @@ conform to this format: FOO papers colle... 
Bagging-Date: 2008-01-15 External-Identifier: university_foo_001 - Bag-Size: 260 GB Payload-Oxum: 279164409832.1198 Bag-Group-Identifier: univerisity_foo Bag-Count: 1 of 15 From c4a350dcd4428ebee88c3591bdfaac800e982728 Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Tue, 15 Aug 2017 10:31:38 -0400 Subject: [PATCH 058/144] Fix Makefile targets --- Makefile | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/Makefile b/Makefile index 7f9cadc8..6e033680 100644 --- a/Makefile +++ b/Makefile @@ -1,13 +1,13 @@ default: html text -html: +text: xml2rfc bagit.xml -text: +html: xml2rfc --html bagit.xml format: # We can't enable c14n because that triggers external DTD fetching and # libxml2 currently does not support HTTPS, which is a problem now that all # of the xml.resource.org URLs redirect: - xmllint --format --output bagit.xml bagit.xml \ No newline at end of file + xmllint --format --output bagit.xml bagit.xml From 5204986eec8a42bfa7363f0ece3f6d8002e4e1cc Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Tue, 15 Aug 2017 11:15:33 -0400 Subject: [PATCH 059/144] Tidy source formatting in the references section --- bagit.xml | 49 +++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 39 insertions(+), 10 deletions(-) diff --git a/bagit.xml b/bagit.xml index 873f615f..c0a97e20 100644 --- a/bagit.xml +++ b/bagit.xml @@ -942,19 +942,48 @@ This draft does not request any action from IANA. - A Collaboration Model between Archival Systems to Enhance - the Reliability of Preservation by an Enclose-and-Deposit MethodNaming a FileMicrosoft, Inc. + + + + A Collaboration Model between Archival Systems to Enhance the Reliability of Preservation by an Enclose-and-Deposit Method + + + + + + + + Naming a File + Microsoft, Inc. 
+ + + + - &RFC2119; - &RFC1321; - &RFC3174; - &RFC6234; - &RFC3629; - &RFC3986; + &RFC2119; + &RFC1321; + &RFC3174; + &RFC6234; + &RFC3629; + &RFC3986; - Unicode® Standard Annex #15: Unicode Normalization FormsUnicode Consortium + + + Unicode® Standard Annex #15: Unicode Normalization Forms + Unicode Consortium + + + + + + + + Technical Note TN1150: HFS Plus Volume Format + Apple Inc. + + + - Technical Note TN1150: HFS Plus Volume FormatApple Inc. From fc58679b48931af3b9bfe65cecfa03437d422c80 Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Tue, 15 Aug 2017 11:16:22 -0400 Subject: [PATCH 060/144] Add reference to conformance suite project --- bagit.xml | 22 ++++++++++++++++++---- 1 file changed, 18 insertions(+), 4 deletions(-) diff --git a/bagit.xml b/bagit.xml index c0a97e20..26d169f6 100644 --- a/bagit.xml +++ b/bagit.xml @@ -740,10 +740,17 @@ namespace sequences (e.g. "\\?\C:\…") described in
-This section lists practical considerations for implementors and users. None of -the points below are required but they are recommended for general-purpose -usage. - + This section lists practical considerations for implementors and + users. None of the points below are required but they are recommended + for general-purpose usage. + + + + The Library of Congress conformance suite + is provided as a public resource to test new implementations for compatibility and + error handling. + +
This section provides background information on various challenges caused by @@ -984,6 +991,13 @@ This draft does not request any action from IANA. + + + BagIt Conformance Suite + The Library of Congress + + + From 9a2787c5ff955ee8ccff55d279028f09ea2b90d4 Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Tue, 15 Aug 2017 11:16:53 -0400 Subject: [PATCH 061/144] Restore fetch.txt We'll table discussion about the issue of transfer versus storage for BagIt 2. --- bagit.xml | 242 ++++++++++++++++++++++++++++++++++++++---------------- 1 file changed, 169 insertions(+), 73 deletions(-) diff --git a/bagit.xml b/bagit.xml index 26d169f6..5c7aebca 100644 --- a/bagit.xml +++ b/bagit.xml @@ -474,19 +474,16 @@ placeholder file with a name such as ".keep".
- -A tag manifest is a tag file that lists other tag files and checksums for -those tag files generated using a particular bag checksum algorithm. -A bag &may; contain one or more tag manifests. + A tag manifest is a tag file that lists other tag files and checksums for + those tag files generated using a particular bag checksum algorithm. + A bag &may; contain one or more tag manifests. -A tag manifest file &must; have a name of the form "tagmanifest-algorithm.txt", where algorithm -is a string following the format described in - specifying the bag checksum algorithm -used in that manifest. - + A tag manifest file &must; have a name of the form "tagmanifest-algorithm.txt", where algorithm + is a string following the format described in + specifying the bag checksum algorithm + used in that manifest. +
Example tag manifest filenames: @@ -548,40 +545,85 @@ conform to this format:
Payload-Oxum ABNF -
An example "bag-info.txt" file - Source-Organization: FOO University - Organization-Address: 1 Main St., Cupertino, California, 11111 - Contact-Name: Jane Doe - Contact-Phone: +1 111-111-1111 - Contact-Email: example@example.com - External-Description: Uncompressed greyscale TIFF images from the - FOO papers colle... - Bagging-Date: 2008-01-15 - External-Identifier: university_foo_001 - Payload-Oxum: 279164409832.1198 - Bag-Group-Identifier: univerisity_foo - Bag-Count: 1 of 15 - Internal-Sender-Identifier: /storage/images/foo - Internal-Sender-Description: Uncompressed greyscale TIFFs created - from microfilm and are... - +Source-Organization: FOO University +Organization-Address: 1 Main St., Cupertino, California, 11111 +Contact-Name: Jane Doe +Contact-Phone: +1 111-111-1111 +Contact-Email: example@example.com +External-Description: Uncompressed greyscale TIFF images from the + FOO papers colle... +Bagging-Date: 2008-01-15 +External-Identifier: university_foo_001 +Payload-Oxum: 279164409832.1198 +Bag-Group-Identifier: univerisity_foo +Bag-Count: 1 of 15 +Internal-Sender-Identifier: /storage/images/foo +Internal-Sender-Description: Uncompressed greyscale TIFFs created + from microfilm and are... +
+ +
+ + + For reasons of efficiency, a bag &may; be sent with a list of files to be + fetched and added to the payload before it can meaningfully be checked + for completeness. An &optional; tag file named "fetch.txt" + contains such a list. + + +
+ Each line of "fetch.txt" has the form: + URL LENGTH FILENAME + + URL identifies the file to be fetched + LENGTH is the number of octets in the file (or "-", to leave it unspecified) + FILENAME identifies the corresponding payload file, relative to the base directory. + +
+ + + The slash character ('/') &must; be used as a path separator in FILENAME. + If FILENAME begins with a slash character, the destination &must; still be + treated as relative to the bag base directory. + One or more linear whitespace characters (spaces or tabs) &must; separate these + three values, and any such characters in the URL &must; be percent-encoded + . There is no limitation on the length of any + of the fields in the "fetch.txt". + + + + The "fetch.txt" file allows a bag to be transmitted with + "holes" in it, which can be practical for several reasons. For example, + it obviates the need for the sender to stage a large serialized copy of + the content while the bag is transferred to the receiver. Also, this + method allows a sender to construct a bag from components that are either + a subset of logically related components (e.g., the localized logical + object could be much larger than what is intended for export) or + assembled from logically distributed sources (e.g., the object components + for export are not stored locally under one filesystem tree). + + +
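The "URL LENGTH FILENAME" line format described above could be read with something like the following Python sketch. The function name is hypothetical, and the example.org URL in the usage note is a placeholder, not a real resource.

```python
import re


def parse_fetch_line(line):
    """Split one fetch.txt line into (url, length, filename)."""
    # Fields are separated by one or more spaces or tabs; FILENAME is the
    # remainder of the line, so it may itself contain spaces.
    fields = re.split(r"[ \t]+", line.strip("\r\n"), maxsplit=2)
    if len(fields) != 3:
        raise ValueError("expected URL LENGTH FILENAME: %r" % line)
    url, length_text, filename = fields
    if length_text == "-":
        length = None  # length left unspecified
    elif length_text.isdigit():
        length = int(length_text)
    else:
        raise ValueError("LENGTH must be digits or '-': %r" % length_text)
    # A leading slash is still treated as relative to the bag base directory.
    filename = filename.lstrip("/")
    return url, length, filename
```

For instance, `parse_fetch_line("http://example.org/x\t-\t/data/x y.txt")` yields a length of `None` and the relative filename `data/x y.txt`. The returned filename still needs the usual path-safety checks before anything is written to disk.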
+ +
-A bag &may; contain other tag files that are not defined by this -specification. + A bag &may; contain other tag files that are not defined by this + specification. -Implementations &must; perform standard checksum validation on any tag file -which is listed in a tag manifest but &must; otherwise ignore their contents. - + Implementations &must; perform standard checksum validation on any tag file + which is listed in a tag manifest but &must; otherwise ignore their contents. +
@@ -651,8 +693,7 @@ requirements:
- Every required element &must; be present - (). + Every required element &must; be present (). Every file listed in every tag manifest &must; be present. Every file listed in every payload manifest &must; be present. Every payload file &must; be listed in every payload manifest. @@ -676,14 +717,13 @@ A valid bag &must; meet the following requirements:
-This is the layout of a basic bag containing an image and a companion -OCR file. Lines of file content are shown in parentheses beneath the -file name. - - -
+ This is the layout of a basic bag containing an image and a companion + OCR file. Lines of file content are shown with added parentheses to + indicate each complete line. + + +
+ myfirstbag/ | | manifest-md5.txt @@ -702,9 +742,36 @@ myfirstbag/ | 27613-h/images/q172.txt | (... OCR text ... ) .... -
+
+
+
+
+ + This is the layout of a bag which expects the receiver to download the + files listed in the payload manifests prior to validation. Lines of + file content are shown with added parentheses to indicate each + complete line. + - +
+ +highsmith-tahoe/ +| +| manifest-md5.txt +| (102b0e6effe208ef9b29864946de9e22 data/23364a.tif ) +| +| fetch.txt +| (https://cdn.loc.gov/master/pnp/highsm/23300/23364a.tif +| 216951362 data/23364a.tif ) +| +| bagit.txt +| (BagIt-version: 1.0 ) +| (Tag-File-Character-Encoding: UTF-8 ) +| +| bag-info.txt +| (Source-URL: https://www.loc.gov/resource/highsm.23364/ ) + +
@@ -712,28 +779,57 @@ myfirstbag/
-The paths specified in the payload manifest and tag manifest file do not -prohibit special directory characters which have special meaning on some -operating systems. Implementors &must; ensure that files outside the bag -directory structure are not accessed when reading or writing files based on -paths specified in a bag. - + The paths specified in the payload manifests, tag manifests, and + fetch.txt do not prohibit special directory characters which have + special meaning on some operating systems. Implementors &must; ensure + that files outside the bag directory structure are not accessed when + reading or writing files based on paths specified in a bag. + -All implementations &should; have a test suite to guard against these cases. - + All implementations &should; have a test suite to guard against these + cases. + -For example, a maliciously crafted "tagmanifest-md5.txt" file might -contain entries which begin with a path character such as "/", "..", -or a "~username" home directory reference in an attempt to cause a -naive implementation to leak or overwrite targeted files on a POSIX operating -system. - + For example, a maliciously crafted "tagmanifest-md5.txt" file might + contain entries which begin with a path character such as "/", "..", + or a "~username" home directory reference in an attempt to cause a + naive implementation to leak or overwrite targeted files on a POSIX + operating system. + -Windows implementations &should; test their implementations to ensure -that safety-checks prevent use of drive letters and the less commonly used -namespace sequences (e.g. "\\?\C:\…") described in . - + Windows implementations &should; test their implementations to ensure + that safety-checks prevent use of drive letters and the less commonly used + namespace sequences (e.g. "\\?\C:\…") described in . 
+ + + The Library of Congress conformance suite + includes tests for invalid bags that are expected to fail on POSIX or Windows clients; + these should be useful for implementors. + +
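One way to guard against the path tricks described above is to reject suspicious names before touching the filesystem. The Python sketch below is deliberately stricter than the specification requires (it refuses any backslash or colon, which are legal in POSIX filenames); that policy and the function names are assumptions of this example, not requirements from the text.

```python
from pathlib import PurePosixPath


def safe_relative_path(filename):
    """Return FILENAME as a PurePosixPath, or raise ValueError if it looks unsafe."""
    if "\\" in filename or ":" in filename:
        # Covers drive letters ("C:...") and "\\?\" namespace prefixes.
        raise ValueError("backslash or colon in path: %r" % filename)
    if filename.startswith("~"):
        raise ValueError("home directory reference: %r" % filename)
    path = PurePosixPath(filename)
    if path.is_absolute():
        raise ValueError("absolute path: %r" % filename)
    if ".." in path.parts:
        raise ValueError("parent directory reference: %r" % filename)
    return path


def is_safe_path(filename):
    """Boolean convenience wrapper around safe_relative_path."""
    try:
        safe_relative_path(filename)
    except ValueError:
        return False
    return True
```

A check like this only inspects the name; an implementation that follows symbolic links would also need to confirm that the resolved location stays inside the bag directory.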
+
+ + Implementors of tools that complete bags by retrieving URLs listed in + a "fetch.txt" file need to be aware that some of those URLs may point + to hosts, intentionally or unintentionally, that are not under control + of the bag's sender. Checksums are intended as a reasonable guarantee + against corruption during transit, not a strong cryptographic + protection against intentional spoofing. +
+ +
+ + The size of files, as optionally reported in the "fetch.txt" file, + cannot be guaranteed to match the actual file size to be downloaded. + Implementors &should; take care to appropriately handle cases where + the actual file size does not match the file size reported in the + fetch.txt. Implementors &should-not; use the file size in the + "fetch.txt" file for critical resource allocation, such as buffer + sizing or storage requisitioning. + +
+
@@ -866,15 +962,15 @@ z 7a LATIN SMALL LETTER Z
-As specified above, only the Unix-based path separator ('/') may be -used inside filenames listed in BagIt manifests files. -When bags are exchanged between Windows and Unix platforms, care should -be taken to translate the path separator as needed. Receivers of bags on -physical media should be prepared for filesystems created under either -Windows or Unix. Besides the fundamental difference between path -separators ('\' and '/'), generally, Windows filesystems have more -limitations than Unix filesystems. - + As specified above, only the Unix-based path separator ('/') may be + used inside filenames listed in BagIt manifest and fetch.txt files. + When bags are exchanged between Windows and Unix platforms, care + should be taken to translate the path separator as needed. Receivers + of bags on physical media should be prepared for filesystems created + under either Windows or Unix. Besides the fundamental difference + between path separators ('\' and '/'), generally, Windows + filesystems have more limitations than Unix filesystems. +
Windows path names have a maximum of From c2fe6ce26d552dbf49e23185620162b7a261b787 Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Tue, 15 Aug 2017 11:34:42 -0400 Subject: [PATCH 062/144] Add a reminder that fetch.txt entries must be listed in manifests --- bagit.xml | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/bagit.xml b/bagit.xml index 5c7aebca..246bfa3d 100644 --- a/bagit.xml +++ b/bagit.xml @@ -580,6 +580,10 @@ Internal-Sender-Description: Uncompressed greyscale TIFFs created contains such a list. + + Every file listed in fetch.txt &must; be listed in every payload manifest. + +
Each line of "fetch.txt" has the form: URL LENGTH FILENAME From 1bd0a8f0f3537814c15000b5b09d8fb1adcf1d02 Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Tue, 15 Aug 2017 12:08:56 -0400 Subject: [PATCH 063/144] Convert fetch.txt artwork to ABNF --- bagit.xml | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/bagit.xml b/bagit.xml index 246bfa3d..bb5dbc70 100644 --- a/bagit.xml +++ b/bagit.xml @@ -586,10 +586,22 @@ Internal-Sender-Description: Uncompressed greyscale TIFFs created
Each line of "fetch.txt" has the form: - URL LENGTH FILENAME + +length = DIGIT +filename = ( + "data" + "/" + *( unreserved / pct-encoded / sub-delims ) + ) +line-terminator = CR / LF / CRLF +]]> + - URL identifies the file to be fetched - LENGTH is the number of octets in the file (or "-", to leave it unspecified) + URL identifies the file to be fetched and must be an absolute URI + as defined in . + LENGTH is the number of octets in the file (or "-", to leave it unspecified). FILENAME identifies the corresponding payload file, relative to the base directory.
From 7e2b7e520b59de95189f8f4b8c389a555fee3e24 Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Thu, 2 Nov 2017 12:01:41 -0400 Subject: [PATCH 064/144] Remove dead comments --- bagit.xml | 14 -------------- 1 file changed, 14 deletions(-) diff --git a/bagit.xml b/bagit.xml index bb5dbc70..2e69f7cb 100644 --- a/bagit.xml +++ b/bagit.xml @@ -302,14 +302,6 @@ document are to be interpreted as described in .
- - -
@@ -665,12 +657,6 @@ proper decoding. In accordance with , when "bagit.txt" specifies UTF-8 the tag files &must-not; begin with a byte-order mark (BOM). See - -
From 708b8f8f3444a89a30d1cd7a755ecf39bd2da41f Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Thu, 2 Nov 2017 12:14:40 -0400 Subject: [PATCH 065/144] Tidy format of manifest section * Convert free-form list of statements to an actual * Indentation --- bagit.xml | 74 ++++++++++++++++++++++++++----------------------------- 1 file changed, 35 insertions(+), 39 deletions(-) diff --git a/bagit.xml b/bagit.xml index 2e69f7cb..00e25f9f 100644 --- a/bagit.xml +++ b/bagit.xml @@ -367,28 +367,25 @@ Tag-File-Character-Encoding: UTF-8 -The number for this version of the specification is "&current-bagit-version;". - + The number for this version of the specification is "&current-bagit-version;". +
-The base directory &must; contain a sub-directory named "data". - + The base directory &must; contain a sub-directory named "data". + -The payload directory contains the arbitrary digital content within the bag. -The files under the payload directory are called payload files, or the payload. -Each payload file is treated as an opaque octet stream when verifying file -correctness. -Any sub-directory structure within the payload &must; be preserved but is -otherwise ignored for purposes relating to this specification. - + The payload directory contains the arbitrary digital content within the bag. + The files under the payload directory are called payload files, or the payload. + Each payload file is treated as an opaque octet stream when verifying file + correctness. + Any sub-directory structure within the payload &must; be preserved but is + otherwise ignored for purposes relating to this specification. +
- A payload manifest file provides a complete listing of each payload file along with a corresponding checksum to permit data integrity checking. @@ -408,8 +405,8 @@ manifest-sha1.txt -Each line of a payload manifest file &must; be of the form: - + Each line of a payload manifest file &must; be of the form: +
CHECKSUM FILENAME @@ -419,20 +416,19 @@ Each line of a payload manifest file &must; be of the form:
-The hex-encoded checksum &may; use uppercase and/or lowercase letters. - -The slash character ('/') &must; be used as a path separator in FILENAME. - -One or more linear whitespace characters (spaces or tabs) &must; separate CHECKSUM from FILENAME. - -There is no limitation on the length of a pathname. - -The payload manifest &must-not; reference files outside the payload directory. - -If a FILENAME includes a newline (LF), a carriage return (CR), or carriage -return plus newline (CRLF) it &must; be percent-encoded following -. - + + The hex-encoded checksum &may; use uppercase and/or lowercase letters. + The slash character ('/') &must; be used as a path separator in FILENAME. + One or more linear whitespace characters (spaces or tabs) &must; separate CHECKSUM from FILENAME. + There is no limitation on the length of a pathname. + The payload manifest &must-not; reference files outside the payload directory. + + If a FILENAME includes a newline (LF), a carriage return (CR), + or carriage return plus newline (CRLF) it &must; be + percent-encoded following . + + +
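The line-level rules in the list above can be illustrated with a small Python parser. This is a sketch under stated assumptions: the function name is hypothetical, and decoding only %0A and %0D is this example's reading of the percent-encoding requirement for line terminators.

```python
import re

# CHECKSUM, one or more spaces or tabs, then FILENAME.
MANIFEST_LINE = re.compile(r"^([0-9A-Fa-f]+)[ \t]+(.+)$")


def parse_manifest_line(line):
    """Return (checksum, filename) for one 'CHECKSUM FILENAME' manifest line."""
    match = MANIFEST_LINE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError("malformed manifest line: %r" % line)
    checksum, filename = match.groups()
    # Decode percent-encoded LF and CR back into the real characters.
    filename = re.sub(
        r"%0[AaDd]", lambda m: chr(int(m.group(0)[1:], 16)), filename
    )
    # Payload manifests may only reference files under the payload directory.
    # This prefix test does not catch ".." segments; those need a separate check.
    if not filename.startswith("data/"):
        raise ValueError("entry is outside the payload directory: %r" % filename)
    # Upper and lower case hex are both allowed, so compare in one case.
    return checksum.lower(), filename
```

Lower-casing the checksum is a convenience so that later comparisons against freshly computed digests do not depend on the case used by the bag's creator.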
Payload Manifest ABNF
-All tag files specifically described in this specification &must; adhere to -the text tag file format described below. Other tag files &may; adhere to -the text tag file format described below. - + All tag files specifically described in this specification &must; adhere to + the text tag file format described below. Other tag files &may; adhere to + the text tag file format described below. + -Text tag files are line-oriented, and each line &must; be -terminated by a newline (LF), a carriage return (CR), or carriage return -plus newline (CRLF). -Text tag file names &must; end in the extension ".txt". - + Text tag files are line-oriented, and each line &must; be terminated + by a newline (LF), a carriage return (CR), or carriage return plus + newline (CRLF). Text tag file names &must; end in the extension + ".txt". + In all text tag files except for the bag declaration file, text &must; be encoded in the character encoding specified in the "bagit.txt" bag declaration From 66c24ebd46cf28c4543e0741ae56502cae141020 Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Thu, 2 Nov 2017 12:16:12 -0400 Subject: [PATCH 066/144] Update payload manifest to prohibit duplicates * Specify that files are only listed once * Convert list of requirements to an actual so each point is notable --- bagit.xml | 33 ++++++++++++++++++++++++--------- 1 file changed, 24 insertions(+), 9 deletions(-) diff --git a/bagit.xml b/bagit.xml index 00e25f9f..b1e27811 100644 --- a/bagit.xml +++ b/bagit.xml @@ -387,16 +387,31 @@ Tag-File-Character-Encoding: UTF-8
-A payload manifest file provides a complete listing of each payload file along -with a corresponding checksum to permit data integrity checking. - + A payload manifest file provides a complete listing of each payload file along + with a corresponding checksum to permit data integrity checking. Manifest entries + &must; satisfy the following constraints: + + -Every bag &must; contain at least one payload manifest file and &may; contain -more than one. Every payload manifest &must; list every payload file. A payload -manifest file &must; have a name of the form "manifest-algorithm.txt", where algorithm -is a string specifying the checksum algorithm used by that manifest as described -in . - + + + Every bag &must; contain at least one payload manifest file and &may; contain + more than one. + + + Every payload manifest &must; list every payload file exactly once. + + + A payload manifest file &must; have a name of the form + "manifest-algorithm.txt", where + algorithm + is a string specifying the checksum algorithm used by that + manifest as described in . + + + +
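The "every payload file exactly once" constraint above lends itself to a simple set comparison. The Python below is a minimal sketch with a hypothetical function name and result keys; it assumes the filenames have already been parsed from the manifest and listed from the payload directory.

```python
from collections import Counter


def compare_manifest_to_payload(manifest_filenames, payload_filenames):
    """Report entries that break the 'every payload file exactly once' rule."""
    counts = Counter(manifest_filenames)
    listed = set(counts)
    payload = set(payload_filenames)
    return {
        # Listed more than once in the same manifest.
        "duplicated": sorted(name for name, n in counts.items() if n > 1),
        # Present in the payload but absent from the manifest.
        "unlisted": sorted(payload - listed),
        # Listed in the manifest but absent from the payload.
        "missing": sorted(listed - payload),
    }
```

A manifest satisfies the constraint when all three lists come back empty; the same comparison would be repeated for each payload manifest in the bag.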
Example payload manifest filenames From 87893e21172cf138e09a58a96769e34ceb5b36e6 Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Fri, 3 Nov 2017 15:05:14 -0400 Subject: [PATCH 067/144] =?UTF-8?q?Clarify=20that=20=E2=80=9Cwhitespace?= =?UTF-8?q?=E2=80=9D=20means=20tabs=20/=20spaces?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit We don't promise that every unicode code point flagged as a space separator is valid. --- bagit.xml | 31 ++++++++++++++++--------------- 1 file changed, 16 insertions(+), 15 deletions(-) diff --git a/bagit.xml b/bagit.xml index b1e27811..df2b2c4e 100644 --- a/bagit.xml +++ b/bagit.xml @@ -504,22 +504,23 @@ As a result, no FILENAME listed in a tag manifest begins "data/".
-The "bag-info.txt" file is a tag file that contains metadata elements -describing the bag and the payload. The metadata elements contained in -the "bag-info.txt" file are intended primarily for human use. -All metadata elements are optional and &may; be repeated. Because -“bag-info.txt” is intended for human reading and editing, implementations -&must; assume that the order of metadata elements is significant and &must; be -preserved. - + The "bag-info.txt" file is a tag file that contains metadata + elements describing the bag and the payload. The metadata elements + contained in the "bag-info.txt" file are intended primarily for + human use. All metadata elements are optional and &may; be repeated. + Because “bag-info.txt” is intended for human reading + and editing, implementations &must; assume that the order of + metadata elements is significant and &must; be preserved. + -A metadata element &must; consist of a label, a colon, and a value, -each separated by optional whitespace. It is &recommended; that -lines not exceed 79 characters in length. Long values may be continued -onto the next line by inserting a newline (LF), a carriage return (CR), -or carriage return plus newline (CRLF) and indenting the next line with -linear white space (spaces or tabs). - + A metadata element &must; consist of a label, a colon, and a value, + each separated by optional whitespace (spaces or tabs). It is + &recommended; that lines not exceed 79 characters in length. Long + values may be continued onto the next line by inserting a newline + (LF), a carriage return (CR), or carriage return plus newline (CRLF) + and indenting the next line with linear white space (spaces or + tabs). +
"bag-info.txt" ABNF Date: Mon, 6 Nov 2017 16:10:49 -0500 Subject: [PATCH 068/144] Stricter bagit.txt wording This file isn't intended to be human edited and the cost of lax parsing is not worthwhile. --- bagit.xml | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/bagit.xml b/bagit.xml index df2b2c4e..2635371e 100644 --- a/bagit.xml +++ b/bagit.xml @@ -351,20 +351,20 @@ The base directory &may; have any name.
-The "bagit.txt" tag file &must; consist of exactly two lines: - + The "bagit.txt" tag file &must; consist of exactly two lines in this order: +
BagIt-Version: M.N Tag-File-Character-Encoding: UTF-8 - M.N identifies the BagIt major (M) and minor (N) version numbers, - and UTF-8 identifies the character set encoding used by the tag files. + M.N identifies the BagIt major (M) and minor (N) version numbers, + and UTF-8 identifies the character set encoding used by the tag files. - The bag declaration &must; be encoded in UTF-8, and &must-not; contain a - byte-order mark (BOM) . - + The bag declaration &must; be encoded in UTF-8, and &must-not; contain a + byte-order mark (BOM) . +
 The number for this version of the specification is "&current-bagit-version;".

From c72f1e427dd073b527b4480b291670f425beb022 Mon Sep 17 00:00:00 2001
From: Chris Adams
Date: Mon, 6 Nov 2017 16:17:30 -0500
Subject: [PATCH 069/144] Tighten restrictions for bag-info.txt syntax

Previously the spec allowed variable whitespace before and after the
separator in bag-info.txt files. This commit changes that to a single
whitespace character for 1.0 to reduce implementation complexity and
reflect the widespread availability of validation tools.
---
 bagit.xml | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/bagit.xml b/bagit.xml
index 2635371e..5e9f3ade 100644
--- a/bagit.xml
+++ b/bagit.xml
@@ -513,24 +513,32 @@ As a result, no FILENAME listed in a tag manifest begins "data/".
 metadata elements is significant and &must; be preserved.
- A metadata element &must; consist of a label, a colon, and a value,
- each separated by optional whitespace (spaces or tabs). It is
- &recommended; that lines not exceed 79 characters in length. Long
- values may be continued onto the next line by inserting a newline
- (LF), a carriage return (CR), or carriage return plus newline (CRLF)
- and indenting the next line with linear white space (spaces or
- tabs).
+ A metadata element &must; consist of a label, a colon, a single
+ linear whitespace character, and a value. It is &recommended; that
+ lines not exceed 79 characters in length. Long values may be
+ continued onto the next line by inserting a newline (LF), a carriage
+ return (CR), or carriage return plus newline (CRLF) and indenting
+ the next line with linear white space (spaces or tabs).
"bag-info.txt" ABNF
+
+
+ For BagIt 1.0, the colon separating the key from the value &must; be
+ followed by a single linear whitespace character. For compatibility
+ with previous versions, implementations &must; accept multiple
+ linear whitespace before and after the colon when the bag version is
+ earlier than 1.0.
+
+
 An implementation &should; add the optional "Payload-Oxum" element for the
 purpose of quickly detecting incomplete bags before performing checksum

From b6060cfc73e3c9a57ca0f5635569bb7ba5d379d2 Mon Sep 17 00:00:00 2001
From: John Scancella
Date: Tue, 7 Nov 2017 14:29:30 -0500
Subject: [PATCH 070/144] fixed ABNF for bag-info.txt and added one for bagit.txt

---
 bagit.xml | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/bagit.xml b/bagit.xml
index 5e9f3ade..b5f78ecc 100644
--- a/bagit.xml
+++ b/bagit.xml
@@ -369,6 +369,15 @@ Tag-File-Character-Encoding: UTF-8
 The number for this version of the specification is "&current-bagit-version;".
+
+ bagit.txt ABNF: + +
@@ -445,7 +454,7 @@ manifest-sha1.txt
- Payload Manifest ABNF + Payload Manifest ABNF:
- "bag-info.txt" ABNF + bag-info.txt ABNF: @@ -556,7 +565,7 @@ conform to this format: files.
- Payload-Oxum ABNF + Payload-Oxum ABNF: @@ -597,7 +606,7 @@ Internal-Sender-Description: Uncompressed greyscale TIFFs created
- Each line of "fetch.txt" has the form: + fetch.txt ABNF: From d2d5e1ef4b0d1b9dfbe887148dab7b3d8d2e4486 Mon Sep 17 00:00:00 2001 From: Chris Adams Date: Tue, 12 Dec 2017 13:59:41 -0500 Subject: [PATCH 071/144] Include Git configuration used for pretty diffs The included .gitattributes and config change will diff the XML file by running it through xml2rfc so you can see the visible text changes rather than the XML source. This requires one config change which cannot be included in the repo for security reasons: git config --local diff.xml2rfc.textconv "xml2rfc --quiet --out=/dev/stdout" If someone does care about the actual XML differences, the default conversion can be disabled: git diff --no-textconv --- .gitattributes | 5 +++++ 1 file changed, 5 insertions(+) create mode 100644 .gitattributes diff --git a/.gitattributes b/.gitattributes new file mode 100644 index 00000000..dd89b145 --- /dev/null +++ b/.gitattributes @@ -0,0 +1,5 @@ +# Activate this after installing xml2rfc with this configuration: +# git config --local diff.xml2rfc.textconv "xml2rfc --quiet --out=/dev/stdout" +# +# To temporarily disable this formatting, use `git diff --no-textconv` +bagit.xml diff=xml2rfc From c287f0bfcaf481674ea3ee32e804fc13ee412089 Mon Sep 17 00:00:00 2001 From: Justin Littman Date: Mon, 5 Mar 2018 14:36:07 -0500 Subject: [PATCH 072/144] Changes in preparation for 1.0 --- bagit.xml | 425 ++++++++++++++++++++++++++++++++++-------------------- 1 file changed, 266 insertions(+), 159 deletions(-) diff --git a/bagit.xml b/bagit.xml index b5f78ecc..9d0f497c 100644 --- a/bagit.xml +++ b/bagit.xml @@ -24,7 +24,7 @@ - + ]> @@ -199,7 +199,7 @@ -This document specifies BagIt, a hierarchical file packaging format for +This document specifies BagIt, a set of hierarchical file layout conventions for storage and transfer of arbitrary digital content. 
A "bag" has just enough structure to enclose descriptive metadata "tags" and a file "payload" but does not require knowledge of the payload's internal semantics. This @@ -211,7 +211,7 @@ BagIt format should be suitable for reliable storage and transfer.
-BagIt is a hierarchical file packaging format designed to support +BagIt is a set of hierarchical file layout conventions designed to support storage and transfer of arbitrary digital content. A bag consists of a directory containing the payload files and other accompanying metadata files known as "tag" files. The "tags" are metadata files intended to @@ -222,20 +222,30 @@ can be accessed without processing the BagIt metadata. The name, BagIt, is inspired by the "enclose and deposit" method , sometimes referred to as "bag it and tag it". -BagIt differs from traditional archive formats such as TAR or ZIP in two general -areas: +BagIt differs from serialized archive formats such as MIME, TAR, or ZIP +in two general areas: - Strong integrity assurances: the format supports only cryptographic-quality + Strong integrity assurances. The format supports only cryptographic-quality hash algorithms (see ) and allows for in-place upgrades to add additional manifests using stronger algorithms - without breaking backwards compatibility + without breaking backwards compatibility. - Direct file access: files may be accessed using standard operating system - utilities, implementations do not need to process a potentially large - archive file to extract a subset of data, and the format imposes no size - limits for either individual files or a bag. + Direct file access. Because BagIt specifies an actual filesystem hierarchy + rather than a serialized representation of one, files can be accessed + using standard operating system utilities, implementations do not need + to process a potentially large archive file to extract a subset of data, + and the format imposes no size limits for either individual files or a bag. + + +BagIt is widely used for preserving digital assets originating from different +domains. Organizations involved in digital preservation with BagIt include +the Library of Congress, Dryad Data Repository, NSF DataONE, and the +Rockefeller Archive Center. 
Software implementations are available for many +languages including Python, Ruby, Java, Perl, and PHP. It is also used in +the libraries of many universities, such as Cornell, Purdue, Stanford, +Ghent University, New York University, and the University of California.
@@ -246,7 +256,7 @@ The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", document are to be interpreted as described in . - Implementors are strongly encouraged to review the interoperability + Implementers are strongly encouraged to review the interoperability considerations described in .
@@ -258,7 +268,7 @@ document are to be interpreted as described in . - A order independent set of opaque files contained within the structure + A set of opaque files contained within the structure defined by this specification. @@ -274,7 +284,7 @@ document are to be interpreted as described in . The data encapsulated by the bag. The contents of the payload are opaque to this specification, and, with respect to BagIt processing, - are always considered as an opaque octet stream. + are always considered as a sequence of uninterpreted octets. See . @@ -306,7 +316,7 @@ document are to be interpreted as described in .
- A bag consists of a base directory containing: + A bag MUST consist of a base directory containing: @@ -323,7 +333,7 @@ The tag files in the base directory consist of one or more files named a file named "bagit.txt" (see ), and zero or more additional tag files (see ). The tag files and directories are -arbitrary file hierarchies and &may; have +in arbitrary file hierarchies and &may; have any name that is not reserved for a file or directory in this specification. @@ -337,7 +347,7 @@ The base directory &may; have any name. | +-- manifest-<algorithm>.txt | - +-- [additional tag files] + +-- [optional additional tag files] | +-- data/ | | @@ -369,15 +379,6 @@ Tag-File-Character-Encoding: UTF-8 The number for this version of the specification is "&current-bagit-version;". -
- bagit.txt ABNF: - -
@@ -389,12 +390,12 @@ ending = CR / LF / CRLF The files under the payload directory are called payload files, or the payload. Each payload file is treated as an opaque octet stream when verifying file correctness. - Any sub-directory structure within the payload &must; be preserved but is + Any sub-directory structure within the payload &must-not; be changed but is otherwise ignored for purposes relating to this specification.
-
+
A payload manifest file provides a complete listing of each payload file along with a corresponding checksum to permit data integrity checking. Manifest entries @@ -432,49 +433,32 @@ manifest-sha1.txt Each line of a payload manifest file &must; be of the form:
- CHECKSUM FILENAME + checksum filename - where FILENAME is the pathname of a file relative to the base directory, - and CHECKSUM is a hex-encoded checksum calculated according to + where filename is the pathname of a file + relative to the base directory, and checksum is + a hex-encoded checksum calculated according to algorithm over every octet in the file.
The hex-encoded checksum &may; use uppercase and/or lowercase letters. - The slash character ('/') &must; be used as a path separator in FILENAME. - One or more linear whitespace characters (spaces or tabs) &must; separate CHECKSUM from FILENAME. + The slash character ('/') &must; be used as a path separator + in filename. + One or more linear whitespace characters (spaces or tabs) + &must; separate checksum from + filename. There is no limitation on the length of a pathname. The payload manifest &must-not; reference files outside the payload directory. - If a FILENAME includes a newline (LF), a carriage return (CR), + If a filename includes a newline + (LF), a carriage return (CR), or carriage return plus newline (CRLF) it &must; be percent-encoded following . -
- Payload Manifest ABNF: - -
A manifest &must-not; reference directories. Bag creators who wish to create an otherwise empty directory have typically done so by creating an empty @@ -485,7 +469,7 @@ placeholder file with a name such as ".keep".
-
+
A tag manifest is a tag file that lists other tag files and checksums for those tag files generated using a particular bag checksum algorithm. @@ -507,7 +491,7 @@ tagmanifest-sha1.txt A tag manifest file has the same form as the payload file manifest file described in , but &must-not; list any payload files. -As a result, no FILENAME listed in a tag manifest begins "data/". +As a result, no filename listed in a tag manifest begins "data/".
@@ -517,29 +501,20 @@ As a result, no FILENAME listed in a tag manifest begins "data/". elements describing the bag and the payload. The metadata elements contained in the "bag-info.txt" file are intended primarily for human use. All metadata elements are optional and &may; be repeated. - Because “bag-info.txt” is intended for human reading - and editing, implementations &must; assume that the order of - metadata elements is significant and &must; be preserved. + Because "bag-info.txt" is intended for human reading + and editing, ordering &may; be significant and the ordering of + metadata elements &must; be preserved. - A metadata element &must; consist of a label, a colon, a single - linear whitespace character, and a value. It is &recommended; that - lines not exceed 79 characters in length. Long values may be + A metadata element &must; consist of a label, a colon, at least one + linear whitespace character, and a value. The label &may; contain + linear whitespace characters, but &must-not; be preceded by + linear whitespace. It is &recommended; that + lines not exceed 79 characters in length. Long values &may; be continued onto the next line by inserting a newline (LF), a carriage return (CR), or carriage return plus newline (CRLF) and indenting the next line with linear white space (spaces or tabs). -
- bag-info.txt ABNF: - -
- For BagIt 1.0, the colon separating the key from the value &must; be followed by a single linear whitespace character. For compatibility @@ -547,29 +522,84 @@ ending = CR / LF / CRLF linear whitespace before and after the colon when the bag version is earlier than 1.0. + + Following are reserved metadata elements. The use of these reserved + metadata elements are &optional; but encouraged. Reserved metadata + element names are case-insensitive. + -An implementation &should; add the optional "Payload-Oxum" element for the -purpose of quickly detecting incomplete bags before performing checksum -validation. This is strictly an optimization and implementations &must; perform -the standard checksum validation process before proclaiming a bag to be valid. -This element &must-not; be present more than once and, if present, &must; -conform to this format: - + + + Organization transferring the content. + + + Mailing address of the organization. + + + Person at the source organization who is responsible for the content + transfer. + + + International format telephone number of person or position responsible. + + + Fully qualified email address of person or position responsible. + + + A brief explanation of the contents and provenance. + + + Date (YYYY-MM-DD) that the content was prepared for delivery. + + + A sender-supplied identifier for the bag. + + + Size or approximate size of the bag being transferred, followed + by an abbreviation such as MB (megabytes), GB, or TB; for example, + 42600 MB, 42.6 GB, or .043 TB. Compared to Payload-Oxum (described + next), Bag-Size is intended for human consumption. + - The "octet-stream sum" of the payload is a pair of two numbers in the form - "OctetCount.StreamCount", - where OctetCount is the total number of octets - (8-bit bytes) across all payload file content and - StreamCount is the total number of payload - files. - -
- Payload-Oxum ABNF: - -
+ The "octetstream sum" of the payload, intended for the + purpose of quickly detecting incomplete bags before performing checksum + validation. This is strictly an optimization and implementations &must; perform + the standard checksum validation process before proclaiming a bag to be valid. + This element &must-not; be present more than once and, if present, &must; + be in the form "OctetCount.StreamCount", + where OctetCount is the total number of + octets (8-bit bytes) across all payload file content and + StreamCount is the total number of + payload files. + + + A sender-supplied identifier for the set, if any, of bags + to which it logically belongs. + This identifier must be unique across the sender's content, and if + recognizable as belonging to a globally unique scheme, the receiver + should make an effort to honor reference to it. + + + Two numbers separated by "of", in particular, "N of T", + where T is the total number of bags in a group of bags and N is the + ordinal number within the group; if T is not known, specify it as "?" + (question mark). Examples: 1 of 2, 4 of 4, 3 of ?, 89 of 145. + + + An alternate sender-specific identifier for the content + and/or bag. + + + A sender-local prose description of the contents of the + bag. + +
+ + + In addition to these metadata elements, other arbitrary metadata + elements &may; also be present. +
An example "bag-info.txt" file @@ -583,7 +613,7 @@ External-Description: Uncompressed greyscale TIFF images from the Bagging-Date: 2008-01-15 External-Identifier: university_foo_001 Payload-Oxum: 279164409832.1198 -Bag-Group-Identifier: univerisity_foo +Bag-Group-Identifier: university_foo Bag-Count: 1 of 15 Internal-Sender-Identifier: /storage/images/foo Internal-Sender-Description: Uncompressed greyscale TIFFs created @@ -597,48 +627,41 @@ Internal-Sender-Description: Uncompressed greyscale TIFFs created For reasons of efficiency, a bag &may; be sent with a list of files to be fetched and added to the payload before it can meaningfully be checked - for completeness. An &optional; tag file named "fetch.txt" + for completeness. An &optional; tag file called the fetch file contains such a list. - Every file listed in fetch.txt &must; be listed in every payload manifest. + The fetch file &must; be named "fetch.txt". Every file listed in + the fetch file &must; be listed in every + payload manifest. A fetch file &must-not; list any tag files. + + + Each line of a fetch file &must; be of the form: -
- fetch.txt ABNF: - -length = DIGIT -filename = ( - "data" - "/" - *( unreserved / pct-encoded / sub-delims ) - ) -line-terminator = CR / LF / CRLF -]]> - + url length filename - URL identifies the file to be fetched and must be an absolute URI - as defined in . - LENGTH is the number of octets in the file (or "-", to leave it unspecified). - FILENAME identifies the corresponding payload file, relative to the base directory. + where url identifies the file to be + fetched and must be an absolute URI as defined in + , length is + the number of octets in the file (or "-", to leave it unspecified), + and filename identifies the + corresponding payload file, relative to the base directory.
- The slash character ('/') &must; be used as a path separator in FILENAME. - If FILENAME begins with a slash character, the destination &must; still be - treated as relative to the bag base directory. - One or more linear whitespace characters (spaces or tabs) &must; separate these - three values, and any such characters in the URL &must; be percent-encoded - . There is no limitation on the length of any - of the fields in the "fetch.txt". + The slash character ('/') &must; be used as a path separator in + filename. One or more linear whitespace + characters (spaces or tabs) &must; separate these + three values, and any such characters in the url + &must; be percent-encoded . There is no + limitation on the length of any of the fields in the fetch file. - The "fetch.txt" file allows a bag to be transmitted with + The fetch file allows a bag to be transmitted with "holes" in it, which can be practical for several reasons. For example, it obviates the need for the sender to stage a large serialized copy of the content while the bag is transferred to the receiver. Also, this @@ -649,10 +672,8 @@ line-terminator = CR / LF / CRLF for export are not stored locally under one filesystem tree). -
- - - +
+
A bag &may; contain other tag files that are not defined by this @@ -685,6 +706,10 @@ byte-order mark (BOM) only if the specified encoding requires it for proper decoding. In accordance with , when "bagit.txt" specifies UTF-8 the tag files &must-not; begin with a byte-order mark (BOM). See + + +The use of UTF-8 for text tag files is strongly &recommended;. Future version +of BagIt may disallow encodings other than UTF-8.
@@ -699,17 +724,24 @@ of a checksum are outside the scope of this document. The name of the checksum algorithm &must; be normalized for use in the manifest's filename by lowercasing the common name of the algorithm and -removing all non-alphanumeric characters. - +removing all non-alphanumeric characters. Following is a partial list +mapping common algorithm names to normalized names: + + MD-5: md5 + SHA-1: sha1 + SHA-256: sha256 + SHA-512: sha512 + + - Bag creation and validation tools &must; support the SHA-2 family of + For BagIt 1.0, bag creation and validation tools &must; support the SHA-2 family of algorithms and &should; enable SHA-512 by default when creating new bags. - For backwards-compatibility implementors &should; support + For backwards-compatibility implementers &should; support MD-5 and SHA-1 . - Implementors are encouraged to simplify the process of adding additional + Implementers are encouraged to simplify the process of adding additional manifests using new algorithms to streamline the process of in-place upgrades. @@ -717,7 +749,7 @@ removing all non-alphanumeric characters.
-
+
A complete bag &must; meet the following requirements: @@ -728,6 +760,7 @@ requirements: Every file listed in every tag manifest &must; be present. Every file listed in every payload manifest &must; be present. Every payload file &must; be listed in every payload manifest. + Every element present &must; comply with this specification. @@ -740,7 +773,6 @@ A valid bag &must; meet the following requirements: Every checksum in every payload manifest and tag manifest has been successfully verified against the contents of the corresponding file. - Every element present &must; comply with this specification.
@@ -808,17 +840,16 @@ highsmith-tahoe/
- The paths specified in the payload manifests, tag manifests, and - fetch.txt do not prohibit special directory characters which have - special meaning on some operating systems. Implementors &must; ensure + fetch files do not prohibit special directory characters which have + special meaning on some operating systems. Implementers &must; ensure that files outside the bag directory structure are not accessed when reading or writing files based on paths specified in a bag. - All implementations &should; have a test suite to guard against these - cases. + All implementations &should; have a test suite to guard against + special directory characters. For example, a maliciously crafted "tagmanifest-md5.txt" file might @@ -833,15 +864,15 @@ highsmith-tahoe/ namespace sequences (e.g. "\\?\C:\…") described in . - The Library of Congress conformance suite - has some tests for invalid bags which are expected to fail on POSIX or Windows clients - which should be useful for implementors. + To assist implementers, the Library + of Congress conformance suite has some tests for invalid bags + which are expected to fail on POSIX or Windows clients.
- Implementors of tools that complete bags by retrieving URLs listed in - a "fetch.txt" file need to be aware that some of those URLs may point + Implementers of tools that complete bags by retrieving URLs listed in + a fetch file need to be aware that some of those URLs might point to hosts, intentionally or unintentionally, that are not under control of the bag's sender. Checksums are intended as a reasonable guarantee against corruption during transit, not a strong cryptographic @@ -851,12 +882,12 @@ highsmith-tahoe/
- The size of files, as optionally reported in the "fetch.txt" file, + The size of files, as optionally reported in the fetch file, cannot be guaranteed to match the actual file size to be downloaded. - Implementors &should; take care to appropriately handle cases where + Implementers &should; take care to appropriately handle cases where the actual file size does not match the file size reported in the - fetch.txt. Implementors &should-not; use the file size in the - "fetch.txt" file for critical resource allocation, such as buffer + fetch file. Implementers &should-not; use the file size in the + fetch file for critical resource allocation, such as buffer sizing or storage requisitioning.
@@ -867,7 +898,7 @@ highsmith-tahoe/
- This section lists practical considerations for implementors and + This section lists practical considerations for implementers and users. None of the points below are required but they are recommended for general-purpose usage. @@ -882,7 +913,7 @@ highsmith-tahoe/ This section provides background information on various challenges caused by differences in how operating systems, filesystems, and common tools handle - filenames followed by a list of recommendations for implementors in + filenames followed by a list of recommendations for implementers in .
@@ -896,7 +927,7 @@ highsmith-tahoe/ multiple files which differ only in case: "example.txt" and "Example.txt" are separate files - NTFS and HFS+ usually preserve case when storing files but are + NTFS and Apple's HFS Plus usually preserve case when storing files but are case-insensitive when retrieving them. A file saved as "Example.txt" will be retrieved by that name but will also be retrieved as "EXAMPLE.TXT", "example.txt", etc. @@ -916,7 +947,7 @@ example below: The common surname "Núñez" normalized in different forms Some bags have been manually assembled using checksum utilities such as those contained in the GNU Coreutils package (md5sum, sha1sum, etc.), collectively -referred to here as "md5sum". Implementors who desire wide support of legacy +referred to here as "md5sum". Implementers who desire wide support of legacy content should be aware of some known quirks of these tools: md5sum can be run in “text mode” which causes it to normalize line-endings on some operating systems. On Unix-like systems both modes will usually produce -the same results but on systems like Windows they may produce different results +the same results but on systems like Windows they can produce different results based on the file contents. The md5sum output format has two characters between the checksum and the @@ -1045,9 +1076,10 @@ filename: the first is always a space and the second is an asterisk ("*") for binary mode and a space for text mode. -A final note about md5sum-generated manifests is that for a FILENAME containing +A final note about md5sum-generated manifests is that for a filename containing a backslash ('\'), the manifest line will have a backslash inserted in front of -the CHECKSUM and, under Windows, the backslashes inside FILENAME may be doubled. +the checksum and, under Windows, the backslashes inside +filename can be doubled. 
Implementers &may; wish to accept this format by ignoring a leading asterisk or @@ -1061,10 +1093,85 @@ update the bag with valid manifests.
+
+ +The Augmented Backus-Naur form (ABNF) provided below are non-normative. If +there is a discrepancy between requirements in the normative sections and +the ABNF, the requirements in the normative sections prevail. + +
+
+ bagit.txt ABNF: + +
+
+ +
+
+ Payload Manifest ABNF: + +
+
+ +
+
+ bag-info.txt ABNF: + +
+
+ +
+
+ fetch.txt ABNF: + +length = DIGIT +filename = ( + "data" + "/" + *( unreserved / pct-encoded / sub-delims ) + ) +line-terminator = CR / LF / CRLF +]]> +
+
+ +
+
 BagIt owes much to many thoughtful contributors and reviewers, including
-Stephen Abrams, Mike Ashenfelder, Dan Chudnov, Brad Hards, Scott Fisher, Keith
+Stephen Abrams, Mike Ashenfelder, Dan Chudnov, Dave Crocker, Brad Hards, Scott Fisher, Keith
 Johnson, Erik Hetzner, Leslie Johnston, David Loy, Mark Phillips, Tracy Seneca,
 Brian Tingle, Adam Turoff, and Jim Tuttle.

From 5d7f9dbf1b42c7cf0c36a036bbe457956c413873 Mon Sep 17 00:00:00 2001
From: Justin Littman
Date: Wed, 7 Mar 2018 09:48:48 -0500
Subject: [PATCH 073/144] Requested tweaks.

---
 bagit.xml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/bagit.xml b/bagit.xml
index 9d0f497c..45051f4c 100644
--- a/bagit.xml
+++ b/bagit.xml
@@ -523,7 +523,7 @@ As a result, no filename listed in a tag manifest be
 earlier than 1.0.
- Following are reserved metadata elements. The use of these reserved
+ The following are reserved metadata elements. The use of these reserved
 metadata elements are &optional; but encouraged. Reserved metadata
 element names are case-insensitive.
@@ -734,7 +734,7 @@ mapping common algorithm names to normalized names:
- For BagIt 1.0, bag creation and validation tools &must; support the SHA-2 family of
+ Starting with BagIt 1.0, bag creation and validation tools &must; support the SHA-2 family of
 algorithms and &should; enable SHA-512 by default
 when creating new bags.

From 36cd8fea3a53e20e535bc2cd8668ac4d2b710ece Mon Sep 17 00:00:00 2001
From: Chris Adams
Date: Thu, 29 Mar 2018 14:05:05 -0400
Subject: [PATCH 074/144] Clarify that tag manifests should also list every manifest

This clarifies an issue raised by @stain:
https://github.com/jkunze/bagitspec/pull/19#discussion_r178067527
---
 bagit.xml | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/bagit.xml b/bagit.xml
index 45051f4c..331d788a 100644
--- a/bagit.xml
+++ b/bagit.xml
@@ -471,14 +471,22 @@ placeholder file with a name such as ".keep".
- A tag manifest is a tag file that lists other tag files and checksums for - those tag files generated using a particular bag checksum algorithm. + A tag manifest is a tag file that lists other tag files and + checksums for those tag files generated using a particular bag + checksum algorithm. + + A bag &may; contain one or more tag manifests. - - A tag manifest file &must; have a name of the form "tagmanifest-algorithm.txt", where algorithm - is a string following the format described in - specifying the bag checksum algorithm - used in that manifest. + + + Each tag manifest &must; list every payload manifest. + + + A tag manifest file &must; have a name of the form + "tagmanifest-algorithm.txt", + where algorithm is a string following + the format described in + specifying the bag checksum algorithm used in that manifest.
Example tag manifest filenames: From ae9ead91f9d3977d2deeace7231e79b39cef4734 Mon Sep 17 00:00:00 2001 From: John Scancella Date: Mon, 2 Apr 2018 08:03:10 -0400 Subject: [PATCH 075/144] refs #3 - updated ABNF to better match the normative --- bagit.xml | 63 +++++++++++++++++++++++++------------------------------ 1 file changed, 28 insertions(+), 35 deletions(-) diff --git a/bagit.xml b/bagit.xml index 331d788a..f51921ff 100644 --- a/bagit.xml +++ b/bagit.xml @@ -716,7 +716,7 @@ specifies UTF-8 the tag files &must-not; begin with a byte-order mark (BOM). See -The use of UTF-8 for text tag files is strongly &recommended;. Future version +The use of UTF-8 for text tag files is strongly &recommended;. A future version of BagIt may disallow encodings other than UTF-8.
@@ -1111,10 +1111,10 @@ the ABNF, the requirements in the normative sections prevail.
bagit.txt ABNF:
@@ -1123,23 +1123,16 @@ the ABNF, the requirements in the normative sections prevail.
Payload Manifest ABNF:
@@ -1148,10 +1141,13 @@ ending = CR / LF / CRLF
bag-info.txt ABNF:
@@ -1161,15 +1157,12 @@ ending = CR / LF / CRLF
fetch.txt ABNF: -length = DIGIT -filename = ( - "data" - "/" - *( unreserved / pct-encoded / sub-delims ) - ) -line-terminator = CR / LF / CRLF +fetch = 1*fetch-line +fetch-line = url 1*WSP length 1*WSP filename ending +url = +length = 1*DIGIT / "-" +filename = ("data/" 1*( unreserved / pct-encoded / sub-delims )) +ending = CR / LF / CRLF ]]>
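
As a rough illustration of the revised fetch.txt grammar in the patch above, a fetch line can be parsed along the following lines. This is a minimal sketch, not part of the specification: the names `FetchEntry` and `parse_fetch_line` are hypothetical, the URL is only checked to be a whitespace-free token rather than validated as an absolute URI, and the filename is only checked for the "data/" prefix that the ABNF requires.

```python
import re
from typing import NamedTuple, Optional


class FetchEntry(NamedTuple):
    """One parsed line of a fetch file (hypothetical helper type)."""
    url: str
    length: Optional[int]  # None when the length field is "-"
    filename: str


# url, length and filename are separated by one or more spaces or tabs (1*WSP).
_FETCH_LINE = re.compile(r"^(\S+)[ \t]+(\d+|-)[ \t]+(.+)$")


def parse_fetch_line(line: str) -> FetchEntry:
    """Split a fetch line into its three fields, following the ABNF loosely."""
    # Drop the line ending (CR, LF, or CRLF) before matching.
    match = _FETCH_LINE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"malformed fetch line: {line!r}")
    url, length, filename = match.groups()
    # The ABNF only allows filenames under the payload directory.
    if not filename.startswith("data/"):
        raise ValueError(f"fetch filename must be under data/: {filename!r}")
    return FetchEntry(url, None if length == "-" else int(length), filename)
```

With this sketch, a line such as "http://example.org/a.txt 1024 data/a.txt" yields a length of 1024, and a "-" in the length field yields None. A real implementation would also need the path-traversal checks described in the security considerations.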
From 0b07d738030ec75fa39b28043e2605dca6741513 Mon Sep 17 00:00:00 2001
From: John Scancella
Date: Thu, 5 Apr 2018 12:42:21 -0400
Subject: [PATCH 076/144] refs #3 - added ABNF reference and note about using core rules

---
 bagit.xml | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/bagit.xml b/bagit.xml
index f51921ff..ba3f7489 100644
--- a/bagit.xml
+++ b/bagit.xml
@@ -7,6 +7,7 @@
+
@@ -1105,7 +1106,9 @@ update the bag with valid manifests.
 The Augmented Backus-Naur form (ABNF) provided below are non-normative. If
 there is a discrepancy between requirements in the normative sections and
-the ABNF, the requirements in the normative sections prevail.
+the ABNF, the requirements in the normative sections prevail. Some
+definitions use the core rules (e.g. DIGIT, HEXDIG, etc) as defined in
+
@@ -1208,6 +1211,7 @@ This draft does not request any action from IANA. &RFC6234; &RFC3629; &RFC3986; + &RFC2234; From c3b62bbc18db406a7d72ba0d3ad1188c9537e475 Mon Sep 17 00:00:00 2001 From: Simeon Warner Date: Thu, 5 Apr 2018 14:25:47 -0400 Subject: [PATCH 077/144] Metadata elements are normatively OPTIONAL --- bagit.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/bagit.xml b/bagit.xml index ba3f7489..6ebe323d 100644 --- a/bagit.xml +++ b/bagit.xml @@ -509,7 +509,7 @@ As a result, no filename listed in a tag manifest be The "bag-info.txt" file is a tag file that contains metadata elements describing the bag and the payload. The metadata elements contained in the "bag-info.txt" file are intended primarily for - human use. All metadata elements are optional and &may; be repeated. + human use. All metadata elements are &optional; and &may; be repeated. Because "bag-info.txt" is intended for human reading and editing, ordering &may; be significant and the ordering of metadata elements &must; be preserved. From f30661f8bb900729ba2a44a8060a97f2bf6353a8 Mon Sep 17 00:00:00 2001 From: Simeon Warner Date: Thu, 5 Apr 2018 14:36:57 -0400 Subject: [PATCH 078/144] Be more specific than SHA-2 family --- bagit.xml | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/bagit.xml b/bagit.xml index ba3f7489..cd150956 100644 --- a/bagit.xml +++ b/bagit.xml @@ -743,9 +743,9 @@ mapping common algorithm names to normalized names: - Starting with BagIt 1.0, bag creation and validation tools &must; support the SHA-2 family of - algorithms and &should; enable SHA-512 by default - when creating new bags. + Starting with BagIt 1.0, bag creation and validation tools &must; support the + SHA-256 and SHA-512 algorithms and &should; enable + SHA-512 by default when creating new bags. For backwards-compatibility implementers &should; support MD-5 and SHA-1 . 
From f4d59117de25cc08b1b9fead3615cd2147f26cc8 Mon Sep 17 00:00:00 2001 From: John Scancella Date: Wed, 18 Apr 2018 14:33:14 -0400 Subject: [PATCH 079/144] refs #2 - added language to specify that all files must be in all manifests starting with version 1.0 --- bagit.xml | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/bagit.xml b/bagit.xml index 8c6e3ff2..4b2d8b36 100644 --- a/bagit.xml +++ b/bagit.xml @@ -768,7 +768,10 @@ requirements: Every required element &must; be present (). Every file listed in every tag manifest &must; be present. Every file listed in every payload manifest &must; be present. - Every payload file &must; be listed in every payload manifest. + For BagIt 1.0, Every payload file &must; be listed in every payload manifest. + For compatibility with previous versions, every payload file &must; + be listed in at least one payload manifest. Additionally in previous versions, + a payload file &may; be listed in multiple payload manifests. Every element present &must; comply with this specification. From ac339b4625ef716dc03a8c51e78e2555364a1cec Mon Sep 17 00:00:00 2001 From: John Scancella Date: Wed, 18 Apr 2018 14:36:28 -0400 Subject: [PATCH 080/144] refs #7 - changed section about bags differing in normalization only from 'MUST' to 'SHOULD' be prevented --- bagit.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/bagit.xml b/bagit.xml index 4b2d8b36..fe32f6de 100644 --- a/bagit.xml +++ b/bagit.xml @@ -1018,7 +1018,7 @@ z 7a LATIN SMALL LETTER Z files which differ only in case. - Implementations &must; prevent the creation of bags containing files + Implementations &should; prevent the creation of bags containing files which differ only in normalization form. 
From 4a8e2a3af80b5cb2d515335199c60a027e2e9b37 Mon Sep 17 00:00:00 2001 From: John Scancella Date: Thu, 19 Apr 2018 07:43:23 -0400 Subject: [PATCH 081/144] refs #2 - fixing grammer --- bagit.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/bagit.xml b/bagit.xml index fe32f6de..f49574ee 100644 --- a/bagit.xml +++ b/bagit.xml @@ -768,8 +768,8 @@ requirements: Every required element &must; be present (). Every file listed in every tag manifest &must; be present. Every file listed in every payload manifest &must; be present. - For BagIt 1.0, Every payload file &must; be listed in every payload manifest. - For compatibility with previous versions, every payload file &must; + For BagIt 1.0, every payload file &must; be listed in every payload manifest. + For compatibility with previous versions every payload file &must; be listed in at least one payload manifest. Additionally in previous versions, a payload file &may; be listed in multiple payload manifests. Every element present &must; comply with this specification. From 65f5e7e21bfdfb504a7d6753d427914b1760c995 Mon Sep 17 00:00:00 2001 From: Stian Soiland-Reyes Date: Fri, 20 Apr 2018 08:43:23 +0100 Subject: [PATCH 082/144] Tidy ABNF (e.g. shorter line length) .. to avoid xml2rfc warnings --- bagit.xml | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/bagit.xml b/bagit.xml index f49574ee..ee05531a 100644 --- a/bagit.xml +++ b/bagit.xml @@ -1132,11 +1132,13 @@ ending = CR / LF / CRLF payload-manifest = 1*payload-manifest-line payload-manifest-line = checksum 1*WSP filename ending checksum = 1*case-hexdig -case-hexdig = DIGIT / "A" / "a" / "B" / "b" / "C" / "c" / "D" / - "d" / "E"/ "e"/ "F" / "f" -filename = "data/" 1*( unreserved / pct-encoded / sub-delims ) +case-hexdig = DIGIT / "A" / "a" / "B" / "b" / "C" / "c" / + "D" / "d" / "E"/ "e"/ "F" / "f" +filename = "data/" + 1*( unreserved / pct-encoded / sub-delims ) unreserved = ALPHA / DIGIT / "-" / "." 
/ "_" / "~" -sub-delims = "!" / "$" / "&" / DQUOTE / "'" / "(" / ")" / "*" / "+" / "," / ";" / "=" / "/" +sub-delims = "!" / "$" / "&" / DQUOTE / "'" / "(" / ")" / + "*" / "+" / "," / ";" / "=" / "/" pct-encoded = "%" HEXDIG HEXDIG ending = CR / LF / CRLF ]]> @@ -1153,7 +1155,8 @@ key = 1*non-reserved value = 1*non-reserved continuation = WSP 1*non-reserved non-reserved = VCHAR / WSP - ;any valid character for the specific encoding except those that match "ending" + ; any valid character for the specific encoding + ; except those that match "ending" ending = CR / LF / CRLF ]]>
@@ -1167,7 +1170,8 @@ fetch = 1*fetch-line fetch-line = url 1*WSP length 1*WSP filename ending url = length = 1*DIGIT / "-" -filename = ("data/" 1*( unreserved / pct-encoded / sub-delims )) +filename = ("data/" + 1*( unreserved / pct-encoded / sub-delims )) ending = CR / LF / CRLF ]]>
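As a reading aid for the fetch.txt ABNF in the hunk above, here is a minimal Python sketch of a line parser. It is not part of the patch series. The name `parse_fetch_line` is hypothetical, and the regular expression is a deliberate simplification: it only checks the `url 1*WSP length 1*WSP filename` shape and the `data/` prefix, and does not validate the URL or the path characters against RFC 3986.

```python
import re

# Simplified mirror of the non-normative ABNF:
#   fetch-line = url 1*WSP length 1*WSP filename ending
# Assumption: "\S+" stands in for the url and path character rules.
FETCH_LINE = re.compile(
    r"^(?P<url>\S+)[ \t]+(?P<length>\d+|-)[ \t]+(?P<path>data/\S+)$"
)


def parse_fetch_line(line):
    """Return (url, length, path); length is None when given as "-"."""
    match = FETCH_LINE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not a valid fetch line: {line!r}")
    length = None if match["length"] == "-" else int(match["length"])
    return match["url"], length, match["path"]
```

A stricter implementation would also check that the URL is absolute and that whitespace inside it is percent-encoded, as the normative text requires.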
From 6dfaf3e69369dcd0004868c228c02d48f339bdbf Mon Sep 17 00:00:00 2001 From: Stian Soiland-Reyes Date: Fri, 20 Apr 2018 08:50:21 +0100 Subject: [PATCH 083/144] Can't say "only cryptographic-quality" .. and then link to a section that also lists md5 and sha1 --- bagit.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/bagit.xml b/bagit.xml index f49574ee..34b35868 100644 --- a/bagit.xml +++ b/bagit.xml @@ -227,7 +227,7 @@ BagIt differs from serialized archive formats such as MIME, TAR, or ZIP in two general areas: - Strong integrity assurances. The format supports only cryptographic-quality + Strong integrity assurances. The format supports cryptographic-quality hash algorithms (see ) and allows for in-place upgrades to add additional manifests using stronger algorithms without breaking backwards compatibility. From 851cf0dba37922a46e9661e97090ecbd1e4c3ba3 Mon Sep 17 00:00:00 2001 From: Stian Soiland-Reyes Date: Fri, 20 Apr 2018 08:57:11 +0100 Subject: [PATCH 084/144] Use sha512 as example, not sha1 .. also no need to show normalization here, as we link to subsection --- bagit.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/bagit.xml b/bagit.xml index f49574ee..48b9425e 100644 --- a/bagit.xml +++ b/bagit.xml @@ -279,8 +279,8 @@ document are to be interpreted as described in . The name of a cryptographic checksum algorithm which has been normalized - for use in a manifest or tag manifest file name (e.g. "SHA-1" becomes - "sha1") as described in . + for use in a manifest or tag manifest file name (e.g. "sha512") + as described in . The data encapsulated by the bag. The contents of the payload From 72bcdfceb54851f0a5fc68bc180e8f8e95a937bf Mon Sep 17 00:00:00 2001 From: Stian Soiland-Reyes Date: Fri, 20 Apr 2018 09:01:05 +0100 Subject: [PATCH 085/144] Payload is a set of named files .. not a single stream of octets (Should we have "payload files" in the terminology as well?) 
--- bagit.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/bagit.xml b/bagit.xml index f49574ee..062cea86 100644 --- a/bagit.xml +++ b/bagit.xml @@ -283,7 +283,7 @@ document are to be interpreted as described in . "sha1") as described in . - The data encapsulated by the bag. The contents of the payload + The data encapsulated by the bag as a set of named files. The contents of the payload files are opaque to this specification, and, with respect to BagIt processing, are always considered as a sequence of uninterpreted octets. See . From d5617564813cdfb3bd2ac3986d44f3f7a433882d Mon Sep 17 00:00:00 2001 From: Stian Soiland-Reyes Date: Fri, 20 Apr 2018 09:11:41 +0100 Subject: [PATCH 086/144] List all types of tag files .. not just bag-info and tagmanifest --- bagit.xml | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/bagit.xml b/bagit.xml index f49574ee..3f9d56d1 100644 --- a/bagit.xml +++ b/bagit.xml @@ -292,12 +292,16 @@ document are to be interpreted as described in . A directory that contains one or more tag files. - A file which contains metadata. The specification defines two standard tag - files: tag manifests, which describe other tag files - , and the "bag-info.txt" file containing - human-meaningful metadata . + A file which contains metadata about the bag or its payload. + This specification defines the standard + BagIt tag files: + the bag declaration in "bagit.txt" , + payload manifests , + tag manifests , + bag metadata in "bag-info.txt" , + and remote payload in "fetch.txt" . - The specification also allows other arbitrary tag files as described in + This specification also allows other arbitrary tag files as described in . From e0b411d40433efab1b46b89f5c6b0a7948a1bcb5 Mon Sep 17 00:00:00 2001 From: Stian Soiland-Reyes Date: Fri, 20 Apr 2018 09:32:09 +0100 Subject: [PATCH 087/144] *manifest-sha512* in examples .. not md5/sha1 -- people might get the wrong idea! 
--- bagit.xml | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/bagit.xml b/bagit.xml index 48b9425e..deb25775 100644 --- a/bagit.xml +++ b/bagit.xml @@ -426,8 +426,8 @@ Tag-File-Character-Encoding: UTF-8
Example payload manifest filenames -manifest-md5.txt -manifest-sha1.txt +manifest-sha256.txt +manifest-sha512.txt
@@ -492,8 +492,8 @@ placeholder file with a name such as ".keep".
Example tag manifest filenames: -tagmanifest-md5.txt -tagmanifest-sha1.txt +tagmanifest-sha256.txt +tagmanifest-sha512.txt
@@ -864,7 +864,7 @@ highsmith-tahoe/ special directory characters. - For example, a maliciously crafted "tagmanifest-md5.txt" file might + For example, a maliciously crafted "tagmanifest-sha512.txt" file might contain entries which begin with a path character such as "/", "..", or a "~username" home directory reference in an attempt to cause a naive implementation to leak or overwrite targeted files on a POSIX From 853c870e58e2d017dff5f084648232110fb5b430 Mon Sep 17 00:00:00 2001 From: Stian Soiland-Reyes Date: Fri, 20 Apr 2018 09:32:41 +0100 Subject: [PATCH 088/144] MD5 has never been called "MD-5" --- bagit.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/bagit.xml b/bagit.xml index deb25775..d526ab90 100644 --- a/bagit.xml +++ b/bagit.xml @@ -736,7 +736,7 @@ manifest's filename by lowercasing the common name of the algorithm and removing all non-alphanumeric characters. Following is a partial list mapping common algorithm names to normalized names: - MD-5: md5 + MD5: md5 SHA-1: sha1 SHA-256: sha256 SHA-512: sha512 @@ -748,7 +748,7 @@ mapping common algorithm names to normalized names: SHA-512 by default when creating new bags. For backwards-compatibility implementers &should; support - MD-5 and SHA-1 . + MD5 and SHA-1 . Implementers are encouraged to simplify the process of adding additional manifests using new algorithms to streamline the process of in-place From f1e34b1d3564e55f6b1c3d39861c2ae187754950 Mon Sep 17 00:00:00 2001 From: Stian Soiland-Reyes Date: Fri, 20 Apr 2018 09:32:52 +0100 Subject: [PATCH 089/144] Explain why example use md5 sha512 would break our figure width! --- bagit.xml | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/bagit.xml b/bagit.xml index d526ab90..8b827d30 100644 --- a/bagit.xml +++ b/bagit.xml @@ -795,6 +795,8 @@ A valid bag &must; meet the following requirements: This is the layout of a basic bag containing an image and a companion OCR file. 
Lines of file content are shown with added parentheses to indicate each complete line. + For brevity this example uses the algorithm "md5" + ather than the recommended "sha512".
@@ -826,6 +828,8 @@ myfirstbag/ files listed in the payload manifests prior to validation. Lines of file content are shown with added parentheses to indicate each complete line. + For brevity this example uses the algorithm "md5" + rather than the recommended "sha512".
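The algorithm-name rule quoted in patch 088 above (lowercase the common name, then remove all non-alphanumeric characters) can be sketched in Python as follows. The function names are hypothetical, and restricting "alphanumeric" to ASCII letters and digits is an assumption of this sketch.

```python
import re


def normalize_algorithm_name(common_name):
    """Lowercase the name and drop every non-alphanumeric character."""
    # Assumption: only ASCII letters and digits count as alphanumeric.
    return re.sub(r"[^a-z0-9]", "", common_name.lower())


def manifest_filenames(common_name):
    """Payload and tag manifest filenames for one checksum algorithm."""
    normalized = normalize_algorithm_name(common_name)
    return f"manifest-{normalized}.txt", f"tagmanifest-{normalized}.txt"
```

With these definitions "SHA-512" maps to `sha512`, which yields the filenames used in the examples updated by patch 087.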
From 75add1b5de1bf825dc9899ede587d7fcdf589899 Mon Sep 17 00:00:00 2001 From: Stian Soiland-Reyes Date: Fri, 20 Apr 2018 09:44:11 +0100 Subject: [PATCH 090/144] Avoid confusing "sub-directory must not be changed" It is out of scope for this specification to specify "changes" to bags or the process of archiving an existing directory file structure into bagit. Rather we should say that the payload files may be in sub-directories (data/ is not flat), and just that the directory and filenames have no given meaning .. HERE. Obviously, just like the octet bytes they have a meaning to someone. (If we are going to talk about "change" this opens a can of worms about changes to checksums, changes to tag files, augmenting bag-info.txt etc etc) --- bagit.xml | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/bagit.xml b/bagit.xml index f49574ee..e9d81eaf 100644 --- a/bagit.xml +++ b/bagit.xml @@ -391,8 +391,9 @@ Tag-File-Character-Encoding: UTF-8 The files under the payload directory are called payload files, or the payload. Each payload file is treated as an opaque octet stream when verifying file correctness. - Any sub-directory structure within the payload &must-not; be changed but is - otherwise ignored for purposes relating to this specification. + Payload files &may; be organized in arbitrary sub-directory structures + within the payload directory, however for the purpose of this specification + such sub-directory structures and file names have no given meaning.
From 7bc88cc1636ebb7105833c45f22f2f6ff015d112 Mon Sep 17 00:00:00 2001 From: John Scancella Date: Fri, 20 Apr 2018 08:03:58 -0400 Subject: [PATCH 091/144] refs #2 - changed text to be more clear that the previous version is just a note --- bagit.xml | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/bagit.xml b/bagit.xml index ee05531a..896edc33 100644 --- a/bagit.xml +++ b/bagit.xml @@ -768,10 +768,10 @@ requirements: Every required element &must; be present (). Every file listed in every tag manifest &must; be present. Every file listed in every payload manifest &must; be present. - For BagIt 1.0, every payload file &must; be listed in every payload manifest. - For compatibility with previous versions every payload file &must; - be listed in at least one payload manifest. Additionally in previous versions, - a payload file &may; be listed in multiple payload manifests. + For BagIt 1.0, every payload file &must; be listed in every payload manifest. + Note that older versions of this specification allowed payload files to be + listed in just one of the manifests. + Every element present &must; comply with this specification. 
From 4bf26894f13a751bdc64b0b1244745cd37f25317 Mon Sep 17 00:00:00 2001 From: John Scancella Date: Fri, 20 Apr 2018 08:05:07 -0400 Subject: [PATCH 092/144] updated date --- bagit.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/bagit.xml b/bagit.xml index 896edc33..1eca8309 100644 --- a/bagit.xml +++ b/bagit.xml @@ -197,7 +197,7 @@ brian@ardvaark.net - + This document specifies BagIt, a set of hierarchical file layout conventions for From fa015eb5319275ab0cc283d1a1a0f910f8aa8445 Mon Sep 17 00:00:00 2001 From: John Scancella Date: Fri, 20 Apr 2018 08:08:29 -0400 Subject: [PATCH 093/144] changed the bagit.txt example to make it more clear that you can choose which encoding the other tag files are in --- bagit.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/bagit.xml b/bagit.xml index 1eca8309..0c57231b 100644 --- a/bagit.xml +++ b/bagit.xml @@ -367,11 +367,11 @@ The base directory &may; have any name.
BagIt-Version: M.N -Tag-File-Character-Encoding: UTF-8 +Tag-File-Character-Encoding: ENCODING M.N identifies the BagIt major (M) and minor (N) version numbers, - and UTF-8 identifies the character set encoding used by the tag files. + and ENCODING identifies the character set encoding used by the tag files. The bag declaration &must; be encoded in UTF-8, and &must-not; contain a byte-order mark (BOM) . From f5a72aa85d8942a15d219e87ac29b4f3333aeeee Mon Sep 17 00:00:00 2001 From: Stian Soiland-Reyes Date: Mon, 23 Apr 2018 10:30:29 +0100 Subject: [PATCH 094/144] Typo "ather" -> "rather" Fixes https://github.com/LibraryOfCongress/bagit-spec/pull/10#pullrequestreview-114182845 --- bagit.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/bagit.xml b/bagit.xml index 8b827d30..14214934 100644 --- a/bagit.xml +++ b/bagit.xml @@ -796,7 +796,7 @@ A valid bag &must; meet the following requirements: OCR file. Lines of file content are shown with added parentheses to indicate each complete line. For brevity this example uses the algorithm "md5" - ather than the recommended "sha512". + rather than the recommended "sha512".
From b6bbf229c073ba6d49c7ad51f48c4142fb7bcf2d Mon Sep 17 00:00:00 2001 From: Stian Soiland-Reyes Date: Mon, 23 Apr 2018 10:32:12 +0100 Subject: [PATCH 095/144] filenames, not file names Fixes https://github.com/LibraryOfCongress/bagit-spec/pull/13#pullrequestreview-114182762 --- bagit.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/bagit.xml b/bagit.xml index e9d81eaf..843bda2c 100644 --- a/bagit.xml +++ b/bagit.xml @@ -393,7 +393,7 @@ Tag-File-Character-Encoding: UTF-8 correctness. Payload files &may; be organized in arbitrary sub-directory structures within the payload directory, however for the purpose of this specification - such sub-directory structures and file names have no given meaning. + such sub-directory structures and filenames have no given meaning.
From 7e296f1f1ebb3ead640370c49c4705b96623290e Mon Sep 17 00:00:00 2001 From: Stian Soiland-Reyes Date: Mon, 23 Apr 2018 10:35:59 +0100 Subject: [PATCH 096/144] files may be organized in subdirectories --- bagit.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/bagit.xml b/bagit.xml index 062cea86..52135d3c 100644 --- a/bagit.xml +++ b/bagit.xml @@ -283,7 +283,7 @@ document are to be interpreted as described in . "sha1") as described in . - The data encapsulated by the bag as a set of named files. The contents of the payload files + The data encapsulated by the bag as a set of named files, which may be organized in sub-directories. The contents of the payload files are opaque to this specification, and, with respect to BagIt processing, are always considered as a sequence of uninterpreted octets. See . From 0bfdbb5277ac4fbc55626994720fd87e8fc45dba Mon Sep 17 00:00:00 2001 From: Stian Soiland-Reyes Date: Mon, 23 Apr 2018 10:38:18 +0100 Subject: [PATCH 097/144] correct plural (many files, many sequences) --- bagit.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/bagit.xml b/bagit.xml index 52135d3c..04227e47 100644 --- a/bagit.xml +++ b/bagit.xml @@ -285,7 +285,7 @@ document are to be interpreted as described in . The data encapsulated by the bag as a set of named files, which may be organized in sub-directories. The contents of the payload files are opaque to this specification, and, with respect to BagIt processing, - are always considered as a sequence of uninterpreted octets. + are always considered as sequences of uninterpreted octets. See . From 9e3417bd8261161861719c8b70882e47973b03dd Mon Sep 17 00:00:00 2001 From: Stian Soiland-Reyes Date: Mon, 23 Apr 2018 15:49:19 +0100 Subject: [PATCH 098/144] Avoid RFC "MAY" for "base directory" ... as it implies you should only use "any name" if you really know what you are doing. But really we don't recommend any particular name, so we can't use MAY here. 
--- bagit.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/bagit.xml b/bagit.xml index 1eca8309..400127b5 100644 --- a/bagit.xml +++ b/bagit.xml @@ -338,7 +338,7 @@ in arbitrary file hierarchies and &may; have any name that is not reserved for a file or directory in this specification. -The base directory &may; have any name. +The base directory can have any name.
From 5fa2515404949e92afcd69581053f64c146cf636 Mon Sep 17 00:00:00 2001 From: Stian Soiland-Reyes Date: Mon, 23 Apr 2018 16:07:07 +0100 Subject: [PATCH 099/144] Use "filepath" instead of "filename" .. to imply that the filepath also will include relative directories. Filename is often seen as just the last segment of a filepath. "pathname" is on the otherside only the "directory" side. filepath: data/fred/soup.txt filename: soup.txt pathname: data/fred/ --- bagit.xml | 32 ++++++++++++++++---------------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/bagit.xml b/bagit.xml index 1eca8309..f4ab8f58 100644 --- a/bagit.xml +++ b/bagit.xml @@ -434,9 +434,9 @@ manifest-sha1.txt Each line of a payload manifest file &must; be of the form:
- checksum filename + checksum filepath - where filename is the pathname of a file + where filepath is the pathname of a file relative to the base directory, and checksum is a hex-encoded checksum calculated according to algorithm over every octet in the file. @@ -446,14 +446,14 @@ manifest-sha1.txt The hex-encoded checksum &may; use uppercase and/or lowercase letters. The slash character ('/') &must; be used as a path separator - in filename. + in filepath. One or more linear whitespace characters (spaces or tabs) &must; separate checksum from - filename. + filepath. There is no limitation on the length of a pathname. The payload manifest &must-not; reference files outside the payload directory. - If a filename includes a newline + If a filepath includes a newline (LF), a carriage return (CR), or carriage return plus newline (CRLF) it &must; be percent-encoded following . @@ -500,7 +500,7 @@ tagmanifest-sha1.txt A tag manifest file has the same form as the payload file manifest file described in , but &must-not; list any payload files. -As a result, no filename listed in a tag manifest begins "data/". +As a result, no filepath listed in a tag manifest begins "data/".
@@ -649,20 +649,20 @@ Internal-Sender-Description: Uncompressed greyscale TIFFs created Each line of a fetch file &must; be of the form:
- url length filename + url length filepath where url identifies the file to be fetched and must be an absolute URI as defined in , length is the number of octets in the file (or "-", to leave it unspecified), - and filename identifies the + and filepath identifies the corresponding payload file, relative to the base directory.
The slash character ('/') &must; be used as a path separator in - filename. One or more linear whitespace + filepath. One or more linear whitespace characters (spaces or tabs) &must; separate these three values, and any such characters in the url &must; be percent-encoded . There is no @@ -1084,14 +1084,14 @@ the same results but on systems like Windows they can produce different results based on the file contents. The md5sum output format has two characters between the checksum and the -filename: the first is always a space and the second is an asterisk ("*") for +filepath: the first is always a space and the second is an asterisk ("*") for binary mode and a space for text mode. -A final note about md5sum-generated manifests is that for a filename containing +A final note about md5sum-generated manifests is that for a filepath containing a backslash ('\'), the manifest line will have a backslash inserted in front of the checksum and, under Windows, the backslashes inside -filename can be doubled. +filepath can be doubled. 
Implementers &may; wish to accept this format by ignoring a leading asterisk or @@ -1130,11 +1130,11 @@ ending = CR / LF / CRLF Payload Manifest ABNF: fetch.txt ABNF: length = 1*DIGIT / "-" -filename = ("data/" +filepath = ("data/" 1*( unreserved / pct-encoded / sub-delims )) ending = CR / LF / CRLF ]]> From b86cd545020096196ca7e3e5adbe7a0312ca33f3 Mon Sep 17 00:00:00 2001 From: Stian Soiland-Reyes Date: Mon, 23 Apr 2018 17:13:18 +0100 Subject: [PATCH 100/144] Avoid unknown Source-URL in example If we want to show "custom" keys then that should be a different example --- bagit.xml | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/bagit.xml b/bagit.xml index 1eca8309..9d7d416b 100644 --- a/bagit.xml +++ b/bagit.xml @@ -844,7 +844,8 @@ highsmith-tahoe/ | (Tag-File-Character-Encoding: UTF-8 ) | | bag-info.txt -| (Source-URL: https://www.loc.gov/resource/highsm.23364/ ) +| (Internal-Sender-Description: Download link found at ) +| ( https://www.loc.gov/resource/highsm.23364/ )
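The payload manifest line form from patch 099 above (`checksum filepath`, separated by spaces or tabs) can be read with a short Python sketch like the one below. The name `parse_manifest_line` is hypothetical. Lowercasing the checksum and decoding only percent-encoded CR and LF are choices made for this sketch, not requirements taken from the patches.

```python
import re

# checksum, one or more spaces or tabs, then the filepath.
MANIFEST_LINE = re.compile(r"^(?P<checksum>[0-9A-Fa-f]+)[ \t]+(?P<filepath>.+)$")


def parse_manifest_line(line):
    """Return (checksum, filepath) for one payload manifest line."""
    match = MANIFEST_LINE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not a valid manifest line: {line!r}")
    # Hex digits may be upper or lower case; normalize for comparison.
    checksum = match["checksum"].lower()
    # Assumption: only encoded line breaks (%0A, %0D) are decoded.
    filepath = re.sub(
        r"%0[AaDd]",
        lambda m: chr(int(m.group(0)[1:], 16)),
        match["filepath"],
    )
    # A payload manifest must not reference files outside data/.
    if not filepath.startswith("data/") or ".." in filepath.split("/"):
        raise ValueError(f"path escapes the payload directory: {filepath!r}")
    return checksum, filepath
```

A tag manifest reader would invert the prefix check, since no filepath in a tag manifest begins with "data/".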
From 92711ec2025e6daad562d06a48fc015b753bb412 Mon Sep 17 00:00:00 2001 From: Stian Soiland-Reyes Date: Mon, 23 Apr 2018 17:51:50 +0100 Subject: [PATCH 101/144] Attempt to clarify whitespace in bag-info.txt The more liberal <1.0 format is not part of this spec, and only mentioned as a SHOULD below. Note that this still allows space WITHIN the label (but not start or end). Example: Valid: Long Description Label: And even longer description value Invalid (in 1.0): Can't end with space : Can we? Cantstartwithspace: even on first line --- bagit.xml | 28 +++++++++++++++++----------- 1 file changed, 17 insertions(+), 11 deletions(-) diff --git a/bagit.xml b/bagit.xml index 1eca8309..c002bd83 100644 --- a/bagit.xml +++ b/bagit.xml @@ -515,21 +515,27 @@ As a result, no filename listed in a tag manifest be metadata elements &must; be preserved.
- A metadata element &must; consist of a label, a colon, at least one - linear whitespace character, and a value. The label &may; contain - linear whitespace characters, but &must-not; be preceded by - linear whitespace. It is &recommended; that - lines not exceed 79 characters in length. Long values &may; be + A metadata element &must; consist of a label, a colon ":", a single + linear whitespace character (space or tab), and a value, terminated with a newline (CR), carriage return (LF) or + carriage return plus newline (CRLF). + + + The label &must-not; contain colon (:), newlines (CR) or carriage returns (LF). + The label &may; contain linear whitespace characters, but &must-not; start or + end with whitespace. + + + It is &recommended; that lines not exceed 79 characters in length. Long values &may; be continued onto the next line by inserting a newline (LF), a carriage return (CR), or carriage return plus newline (CRLF) and indenting - the next line with linear white space (spaces or tabs). + the next line with one or more linear white space (spaces or tabs). + Except for linebreaks such padding does not form part of the value. - For BagIt 1.0, the colon separating the key from the value &must; be - followed by a single linear whitespace character. For compatibility - with previous versions, implementations &must; accept multiple - linear whitespace before and after the colon when the bag version is - earlier than 1.0. + Implementations wishing to support previous BagIt versions + &must; accept multiple linear whitespace before and after the + colon when the bag version is earlier than 1.0; such whitespace + does not form part of the label or value. The following are reserved metadata elements. 
The use of these reserved From 84c0ae3efb31bf230f6f39fa055c3fba1b97206a Mon Sep 17 00:00:00 2001 From: Stian Soiland-Reyes Date: Mon, 23 Apr 2018 18:10:31 +0100 Subject: [PATCH 102/144] About multiplicity of metadata elements Some of these don't make sense to repeat, but except for Payload-Oxum there is probably no need to be strict. --- bagit.xml | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/bagit.xml b/bagit.xml index 1eca8309..cea973a8 100644 --- a/bagit.xml +++ b/bagit.xml @@ -534,7 +534,8 @@ As a result, no filename listed in a tag manifest be The following are reserved metadata elements. The use of these reserved metadata elements are &optional; but encouraged. Reserved metadata - element names are case-insensitive. + element names are case-insensitive. Except where indicated otherwise, + these metadata element names &may; be repeated to capture multiple values. @@ -560,6 +561,7 @@ As a result, no filename listed in a tag manifest be Date (YYYY-MM-DD) that the content was prepared for delivery. + This metadata element &should-not; be repeated. A sender-supplied identifier for the bag. @@ -569,6 +571,7 @@ As a result, no filename listed in a tag manifest be by an abbreviation such as MB (megabytes), GB, or TB; for example, 42600 MB, 42.6 GB, or .043 TB. Compared to Payload-Oxum (described next), Bag-Size is intended for human consumption. + This metadata element &should-not; be repeated. The "octetstream sum" of the payload, intended for the @@ -581,6 +584,7 @@ As a result, no filename listed in a tag manifest be octets (8-bit bytes) across all payload file content and StreamCount is the total number of payload files. + This metadata element &must-not; be repeated. 
A sender-supplied identifier for the set, if any, of bags @@ -588,12 +592,16 @@ As a result, no filename listed in a tag manifest be This identifier must be unique across the sender's content, and if recognizable as belonging to a globally unique scheme, the receiver should make an effort to honor reference to it. + This metadata element &should-not; be repeated. Two numbers separated by "of", in particular, "N of T", where T is the total number of bags in a group of bags and N is the ordinal number within the group; if T is not known, specify it as "?" (question mark). Examples: 1 of 2, 4 of 4, 3 of ?, 89 of 145. + This metadata element &should-not; be repeated. + If this metadata element is present, the Bag-Group-Identifier element + &should; be present. An alternate sender-specific identifier for the content From c6af1ccd0cb770f4c491236689af9c5eb0f522c2 Mon Sep 17 00:00:00 2001 From: John Scancella Date: Tue, 24 Apr 2018 07:25:09 -0400 Subject: [PATCH 103/144] clarified that the encoding is used by the remaining tag files --- bagit.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/bagit.xml b/bagit.xml index 0c57231b..c99713cf 100644 --- a/bagit.xml +++ b/bagit.xml @@ -371,7 +371,7 @@ Tag-File-Character-Encoding: ENCODING M.N identifies the BagIt major (M) and minor (N) version numbers, - and ENCODING identifies the character set encoding used by the tag files. + and ENCODING identifies the character set encoding used by the remaining tag files. The bag declaration &must; be encoded in UTF-8, and &must-not; contain a byte-order mark (BOM) . 
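The Payload-Oxum element discussed in patch 102 above combines the total octet count and the total number of payload files. A minimal Python sketch, assuming the conventional `OctetCount.StreamCount` rendering and a bag laid out on a local filesystem, could look like this. The function name `payload_oxum` is hypothetical.

```python
import os


def payload_oxum(bag_dir):
    """Return "OctetCount.StreamCount" for the files under data/."""
    octet_count = 0
    stream_count = 0
    # Walk only the payload directory; tag files are not counted.
    for root, _dirs, files in os.walk(os.path.join(bag_dir, "data")):
        for name in files:
            octet_count += os.path.getsize(os.path.join(root, name))
            stream_count += 1
    return f"{octet_count}.{stream_count}"
```

Because the value is cheap to compute, a receiver can compare it against bag-info.txt as a quick completeness check before running full checksum validation.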
From 72a60fefc5c1cc94e0b23b92093ee66a40aa6290 Mon Sep 17 00:00:00 2001 From: John Scancella Date: Tue, 24 Apr 2018 10:20:33 -0400 Subject: [PATCH 104/144] added Stian Soiland-Reyes to list of contributors --- bagit.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/bagit.xml b/bagit.xml index d7055426..a85ac527 100644 --- a/bagit.xml +++ b/bagit.xml @@ -1200,7 +1200,7 @@ ending = CR / LF / CRLF BagIt owes much to many thoughtful contributors and reviewers, including Stephen Abrams, Mike Ashenfelder, Dan Chudnov, Dave Crocker, Brad Hards, Scott Fisher, Keith Johnson, Erik Hetzner, Leslie Johnston, David Loy, Mark Phillips, Tracy Seneca, -Brian Tingle, Adam Turoff, and Jim Tuttle. +Brian Tingle, Adam Turoff, Jim Tuttle, and Stian Soiland-Reyes.
From e45f0f1ccebc247f5cf6285acb456b6d18f688ee Mon Sep 17 00:00:00 2001 From: Stian Soiland-Reyes Date: Tue, 24 Apr 2018 15:25:47 +0100 Subject: [PATCH 105/144] RECOMMENDED instead of SHOULD BagCount/-GroupID link .. as this is more of a best practice recommendation --- bagit.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/bagit.xml b/bagit.xml index cea973a8..9609a91c 100644 --- a/bagit.xml +++ b/bagit.xml @@ -600,8 +600,8 @@ As a result, no filename listed in a tag manifest be ordinal number within the group; if T is not known, specify it as "?" (question mark). Examples: 1 of 2, 4 of 4, 3 of ?, 89 of 145. This metadata element &should-not; be repeated. - If this metadata element is present, the Bag-Group-Identifier element - &should; be present. + If this metadata element is present, it is &recommended; to also + include the Bag-Group-Identifier element. An alternate sender-specific identifier for the content From 772a59bab467bfbc41cab1b9bd3a2b43a307c75e Mon Sep 17 00:00:00 2001 From: Stian Soiland-Reyes Date: Tue, 1 May 2018 12:03:04 +0100 Subject: [PATCH 106/144] Consistent "line feed (LF)" instead of "newline" As agreed in https://github.com/LibraryOfCongress/bagit-spec/pull/17#discussion_r184976972 --- bagit.xml | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/bagit.xml b/bagit.xml index a85ac527..a01e91b3 100644 --- a/bagit.xml +++ b/bagit.xml @@ -458,9 +458,9 @@ manifest-sha512.txt There is no limitation on the length of a pathname. The payload manifest &must-not; reference files outside the payload directory. - If a filename includes a newline + If a filename includes a line feed (LF), a carriage return (CR), - or carriage return plus newline (CRLF) it &must; be + or carriage return plus line feed (CRLF) it &must; be percent-encoded following . 
@@ -521,18 +521,18 @@ As a result, no filename listed in a tag manifest be A metadata element &must; consist of a label, a colon ":", a single - linear whitespace character (space or tab), and a value, terminated with a newline (CR), carriage return (LF) or - carriage return plus newline (CRLF). + linear whitespace character (space or tab), and a value, terminated with a line feed (CR), carriage return (LF) or + carriage return plus line feed (CRLF). - The label &must-not; contain colon (:), newlines (CR) or carriage returns (LF). + The label &must-not; contain colon (:), line feeds (LF) or carriage returns (CR). The label &may; contain linear whitespace characters, but &must-not; start or end with whitespace. It is &recommended; that lines not exceed 79 characters in length. Long values &may; be - continued onto the next line by inserting a newline (LF), a carriage - return (CR), or carriage return plus newline (CRLF) and indenting + continued onto the next line by inserting a line feed (LF), a carriage + return (CR), or carriage return plus line feed (CRLF) and indenting the next line with one or more linear white space (spaces or tabs). Except for linebreaks such padding does not form part of the value. @@ -714,8 +714,8 @@ Internal-Sender-Description: Uncompressed greyscale TIFFs created Text tag files are line-oriented, and each line &must; be terminated - by a newline (LF), a carriage return (CR), or carriage return plus - newline (CRLF). Text tag file names &must; end in the extension + by a line feed (LF), a carriage return (CR), or carriage return plus + line feed (CRLF). Text tag file names &must; end in the extension ".txt". 
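Reviewer note: the metadata element rules in this patch (label, colon, one space or tab, value, and indented continuation lines) can be illustrated with a short parser. This is a sketch under stated assumptions: `parse_bag_info` is a hypothetical name, and joining a continuation line to the previous value with a single space is one reading of "such padding does not form part of the value", not something the text fixes precisely.

```python
import re


def parse_bag_info(text):
    """Return a list of (label, value) pairs from already-decoded bag-info.txt text.

    A list of pairs is used because metadata elements may be repeated.
    """
    elements = []
    # Lines may end with LF, CR, or CRLF.
    for line in re.split(r"\r\n|\r|\n", text):
        if not line:
            continue  # empty piece after a final line terminator

        if line[0] in " \t":
            # Continuation line: indentation is padding, not part of the value.
            if not elements:
                raise ValueError("continuation line before any element")
            label, value = elements[-1]
            # Assumption: fold the continuation with a single space.
            elements[-1] = (label, value + " " + line.strip(" \t"))
            continue

        label, colon, rest = line.partition(":")
        # The label must exist and must not start or end with whitespace.
        if not colon or not label or label != label.strip(" \t"):
            raise ValueError("malformed label in line: %r" % line)
        # Exactly one space or tab separates the colon from the value.
        if rest[:1] not in (" ", "\t"):
            raise ValueError("colon must be followed by a space or tab")
        elements.append((label, rest[1:]))
    return elements
```

For bags older than version 1.0 a caller would additionally need to tolerate extra whitespace around the colon; that case is not handled here.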
From 5bcb9c507c59c7c020aa5592b95ff516c964dca4 Mon Sep 17 00:00:00 2001 From: Stian Soiland-Reyes Date: Tue, 1 May 2018 12:21:10 +0100 Subject: [PATCH 107/144] Typo fiilepath Fixes https://github.com/LibraryOfCongress/bagit-spec/pull/16#discussion_r183479826 --- bagit.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/bagit.xml b/bagit.xml index f4ab8f58..09cedf18 100644 --- a/bagit.xml +++ b/bagit.xml @@ -1091,7 +1091,7 @@ binary mode and a space for text mode. A final note about md5sum-generated manifests is that for a filepath containing a backslash ('\'), the manifest line will have a backslash inserted in front of the checksum and, under Windows, the backslashes inside -fiilepath can be doubled. +filepath can be doubled. Implementers &may; wish to accept this format by ignoring a leading asterisk or From 1bbf1f5438b1f8fa44c4d8c156e95827b2c77362 Mon Sep 17 00:00:00 2001 From: Stian Soiland-Reyes Date: Tue, 1 May 2018 12:35:42 +0100 Subject: [PATCH 108/144] Reference RFC2989 registry for ENCODING text as suggested by @acdha in https://github.com/LibraryOfCongress/bagit-spec/pull/14#issuecomment-385392299 I also used to separate ENCODING/MUST/UTF-8 as it got a bit capsy :) --- bagit.xml | 19 +++++++++++++------ 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/bagit.xml b/bagit.xml index c99713cf..e26da389 100644 --- a/bagit.xml +++ b/bagit.xml @@ -9,6 +9,7 @@ + @@ -370,10 +371,15 @@ BagIt-Version: M.N Tag-File-Character-Encoding: ENCODING - M.N identifies the BagIt major (M) and minor (N) version numbers, - and ENCODING identifies the character set encoding used by the remaining tag files. + M.N identifies the BagIt major (M) and minor (N) version numbers. + ENCODING identifies the character set encoding used by the remaining tag files. - The bag declaration &must; be encoded in UTF-8, and &must-not; contain a + ENCODING &should; + be UTF-8 but + for backwards compatibility it &may; be any + other encoding registered in . 
+ + The bag declaration itself &must; be encoded in UTF-8, and &must-not; contain a byte-order mark (BOM) .
@@ -1212,13 +1218,14 @@ This draft does not request any action from IANA. - &RFC2119; &RFC1321; + &RFC2119; + &RFC2234; + &RFC2978; &RFC3174; - &RFC6234; &RFC3629; &RFC3986; - &RFC2234; + &RFC6234; From 44be57234b2cd4469b4e437d80543f2eca39b016 Mon Sep 17 00:00:00 2001 From: John Scancella Date: Fri, 11 May 2018 12:17:21 -0400 Subject: [PATCH 109/144] refs #28 - changed wording for pull request --- bagit.xml | 3 +++ 1 file changed, 3 insertions(+) diff --git a/bagit.xml b/bagit.xml index bae1a79d..a1fe64a5 100644 --- a/bagit.xml +++ b/bagit.xml @@ -500,6 +500,9 @@ placeholder file with a name such as ".keep". the format described in specifying the bag checksum algorithm used in that manifest. + + Tag manifests &should; use the same algorithms as the payload manifests that are present in the bag. +
Example tag manifest filenames: From 040fb7ef3b03db641d5c68c216e40fce28539721 Mon Sep 17 00:00:00 2001 From: John Scancella Date: Fri, 11 May 2018 12:26:48 -0400 Subject: [PATCH 110/144] refs #17 - fixed merge conflicts for only percent encode CR, LF, and CRLF --- bagit.xml | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/bagit.xml b/bagit.xml index a1fe64a5..ea9e2abb 100644 --- a/bagit.xml +++ b/bagit.xml @@ -466,7 +466,8 @@ manifest-sha512.txt If a filepath includes a line feed (LF), a carriage return (CR), - or carriage return plus line feed (CRLF) it &must; be + carriage return plus line feed (CRLF) or + percent sign (%), those characters (and only those) &must; be percent-encoded following . @@ -693,7 +694,13 @@ Internal-Sender-Description: Uncompressed greyscale TIFFs created filepath. One or more linear whitespace characters (spaces or tabs) &must; separate these three values, and any such characters in the url - &must; be percent-encoded . There is no + &must; be percent-encoded . + If filename includes a line feed + (LF), a carriage return (CR), + carriage return plus line feed (CRLF) or + percent sign (%), those characters (and only those) &must; be + percent-encoded following . + There is no limitation on the length of any of the fields in the fetch file. @@ -1172,7 +1179,7 @@ filepath = "data/" unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" sub-delims = "!" / "$" / "&" / DQUOTE / "'" / "(" / ")" / "*" / "+" / "," / ";" / "=" / "/" -pct-encoded = "%" HEXDIG HEXDIG +pct-encoded = "%0D" / "%0d" / "%0A" / "%0a" / "%25" ending = CR / LF / CRLF ]]>
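Reviewer note: the "those characters (and only those)" wording in this patch can be sketched as a pair of helpers that encode and decode only CR, LF, and the percent sign. The names `encode_filepath` and `decode_filepath` are hypothetical, and leaving any other `%XX` sequence untouched on decode is an assumption of this sketch.

```python
import re

# Only these three characters are percent-encoded in manifest filepaths.
_ENCODE = {"%": "%25", "\r": "%0D", "\n": "%0A"}


def encode_filepath(path):
    """Percent-encode CR, LF, and '%' (and nothing else) in a filepath."""
    return "".join(_ENCODE.get(ch, ch) for ch in path)


def decode_filepath(encoded):
    """Reverse encode_filepath, accepting upper or lower case hex digits."""
    # A single left-to-right pass matters: "%250A" must decode to the
    # literal text "%0A", not to a line feed.
    return re.sub(
        r"%25|%0[AaDd]",
        lambda m: chr(int(m.group(0)[1:], 16)),
        encoded,
    )
```

Spaces and other characters pass through unchanged, which matches the ABNF in this patch where `pct-encoded` is limited to `%0D`, `%0d`, `%0A`, `%0a`, and `%25`.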
From aee2ff3cbcbe965c0c0bad03ec8d14f912a2ce41 Mon Sep 17 00:00:00 2001 From: John Scancella Date: Fri, 11 May 2018 12:30:15 -0400 Subject: [PATCH 111/144] refs #22 - fixed merge conflict for tag files have optional newline at end of file --- bagit.xml | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/bagit.xml b/bagit.xml index ea9e2abb..62b2af40 100644 --- a/bagit.xml +++ b/bagit.xml @@ -739,8 +739,9 @@ Internal-Sender-Description: Uncompressed greyscale TIFFs created Text tag files are line-oriented, and each line &must; be terminated by a line feed (LF), a carriage return (CR), or carriage return plus - line feed (CRLF). Text tag file names &must; end in the extension - ".txt". + newline (CRLF). It is &recommended; that the last line in a tag + file also ends with LF, CR, or CRLF. + Text tag file names &must; end in the extension ".txt". In all text tag files except for the bag declaration file, text &must; be @@ -1158,7 +1159,7 @@ definitions use the core rules (e.g. DIGIT, HEXDIG, etc) as defined in bagit.txt ABNF: From 7f021d4d080f12f68f4ef4bd2997319094af3d3e Mon Sep 17 00:00:00 2001 From: John Scancella Date: Fri, 11 May 2018 12:36:20 -0400 Subject: [PATCH 112/144] refs #23 - fixing merge conflict on using registry for hash names --- bagit.xml | 22 +++++++++++++++++++--- 1 file changed, 19 insertions(+), 3 deletions(-) diff --git a/bagit.xml b/bagit.xml index 62b2af40..bfc87dc0 100644 --- a/bagit.xml +++ b/bagit.xml @@ -16,6 +16,7 @@ + @@ -765,6 +766,12 @@ and tag files in a bag produced by the checksum algorithms. Checksum values &must; be encoded so as to conform to the manifest format specified in . However, the internal details of a checksum are outside the scope of this document. + + + To avoid future ambiguity, the checksum algorithm &should; be registered + in IANA's "Named Information Hash Algorithm Registry" + according to , but &may; for backwards compatibility also be the + MD5 or SHA-1 . 
The name of the checksum algorithm &must; be normalized for use in the @@ -774,8 +781,8 @@ mapping common algorithm names to normalized names: MD5: md5 SHA-1: sha1 - SHA-256: sha256 - SHA-512: sha512 + sha-256: sha256 + sha-512: sha512 @@ -783,7 +790,7 @@ mapping common algorithm names to normalized names: SHA-256 and SHA-512 algorithms and &should; enable SHA-512 by default when creating new bags. - For backwards-compatibility implementers &should; support + For backwards compatibility implementers &should; support MD5 and SHA-1 . Implementers are encouraged to simplify the process of adding additional @@ -1256,6 +1263,15 @@ This draft does not request any action from IANA. &RFC1321; &RFC2119; &RFC2234; + &RFC6920; + + + + Named Information Hash Algorithm Registry + IANA + + + &RFC2978; &RFC3174; &RFC3629; From 96a7809854e04e4283cafb9b042d80131f238662 Mon Sep 17 00:00:00 2001 From: John Scancella Date: Fri, 11 May 2018 12:39:23 -0400 Subject: [PATCH 113/144] refs #19 - tag manifest should be complete and have the same list of tag files --- bagit.xml | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/bagit.xml b/bagit.xml index bfc87dc0..daae4799 100644 --- a/bagit.xml +++ b/bagit.xml @@ -490,10 +490,12 @@ placeholder file with a name such as ".keep". checksum algorithm. - A bag &may; contain one or more tag manifests. + A bag &may; contain one or more tag manifests, in which case each tag manifest &should; list the same set of tag files. Each tag manifest &must; list every payload manifest. + Each tag manifest &must-not; list any tag manifests, + but &should; list the remaining tag files present in the bag. A tag manifest file &must; have a name of the form From ffaa6262270f6b6198224e4388e92fd146772fd6 Mon Sep 17 00:00:00 2001 From: Justin Littman Date: Mon, 14 May 2018 11:00:36 -0400 Subject: [PATCH 114/144] Tweaks. 
--- bagit.xml | 66 +++++++++++++++++++++++++++---------------------------- 1 file changed, 32 insertions(+), 34 deletions(-) diff --git a/bagit.xml b/bagit.xml index daae4799..a830a9ab 100644 --- a/bagit.xml +++ b/bagit.xml @@ -220,7 +220,7 @@ A bag consists of a directory containing the payload files and other accompanyin metadata files known as "tag" files. The "tags" are metadata files intended to facilitate and document the storage and transfer of the bag. Processing a bag does not require any understanding of the payload file contents and the payload -can be accessed without processing the BagIt metadata. +files can be accessed without processing the BagIt metadata. The name, BagIt, is inspired by the "enclose and deposit" method @@ -296,10 +296,10 @@ document are to be interpreted as described in . A file which contains metadata about the bag or its payload. This specification defines the standard - BagIt tag files: + BagIt tag files: the bag declaration in "bagit.txt" , payload manifests , - tag manifests , + tag manifests , bag metadata in "bag-info.txt" , and remote payload in "fetch.txt" . @@ -379,9 +379,9 @@ Tag-File-Character-Encoding: ENCODING M.N identifies the BagIt major (M) and minor (N) version numbers. ENCODING identifies the character set encoding used by the remaining tag files. - ENCODING &should; - be UTF-8 but - for backwards compatibility it &may; be any + ENCODING &should; + be UTF-8 but + for backwards compatibility it &may; be any other encoding registered in . 
The bag declaration itself &must; be encoded in UTF-8, and &must-not; contain a @@ -534,14 +534,14 @@ As a result, no filepath listed in a tag manifest be A metadata element &must; consist of a label, a colon ":", a single - linear whitespace character (space or tab), and a value, terminated with a line feed (CR), carriage return (LF) or + linear whitespace character (space or tab), and a value, terminated with a line feed (CR), carriage return (LF) or carriage return plus line feed (CRLF). - The label &must-not; contain colon (:), line feeds (LF) or carriage returns (CR). + The label &must-not; contain colons (:), line feeds (LF) or carriage returns (CR). The label &may; contain linear whitespace characters, but &must-not; start or - end with whitespace. - + end with whitespace. + It is &recommended; that lines not exceed 79 characters in length. Long values &may; be continued onto the next line by inserting a line feed (LF), a carriage @@ -550,15 +550,15 @@ As a result, no filepath listed in a tag manifest be Except for linebreaks such padding does not form part of the value. - Implementations wishing to support previous BagIt versions - &must; accept multiple linear whitespace before and after the - colon when the bag version is earlier than 1.0; such whitespace + Implementations wishing to support previous BagIt versions + &must; accept multiple linear whitespace before and after the + colon when the bag version is earlier than 1.0; such whitespace does not form part of the label or value. The following are reserved metadata elements. The use of these reserved metadata elements are &optional; but encouraged. Reserved metadata - element names are case-insensitive. Except where indicated otherwise, + element names are case-insensitive. Except where indicated otherwise, these metadata element names &may; be repeated to capture multiple values. 
@@ -613,9 +613,9 @@ As a result, no filepath listed in a tag manifest be A sender-supplied identifier for the set, if any, of bags to which it logically belongs. - This identifier must be unique across the sender's content, and if + This identifier &should; be unique across the sender's content, and if recognizable as belonging to a globally unique scheme, the receiver - should make an effort to honor reference to it. + &should; make an effort to honor reference to it. This metadata element &should-not; be repeated. @@ -632,8 +632,7 @@ As a result, no filepath listed in a tag manifest be and/or bag. - A sender-local prose description of the contents of the - bag. + A sender-local explanation of the contents and provenance. @@ -684,7 +683,7 @@ Internal-Sender-Description: Uncompressed greyscale TIFFs created url length filepath where url identifies the file to be - fetched and must be an absolute URI as defined in + fetched and &must; be an absolute URI as defined in , length is the number of octets in the file (or "-", to leave it unspecified), and filepath identifies the @@ -772,7 +771,7 @@ of a checksum are outside the scope of this document. To avoid future ambiguity, the checksum algorithm &should; be registered in IANA's "Named Information Hash Algorithm Registry" - according to , but &may; for backwards compatibility also be the + according to , but &may; for backwards compatibility also be MD5 or SHA-1 . @@ -814,7 +813,7 @@ requirements: Every file listed in every tag manifest &must; be present. Every file listed in every payload manifest &must; be present. For BagIt 1.0, every payload file &must; be listed in every payload manifest. - Note that older versions of this specification allowed payload files to be + Note that older versions of this specification allowed payload files to be listed in just one of the manifests. Every element present &must; comply with this specification. 
@@ -840,8 +839,7 @@ A valid bag &must; meet the following requirements: This is the layout of a basic bag containing an image and a companion OCR file. Lines of file content are shown with added parentheses to indicate each complete line. - For brevity this example uses the algorithm "md5" - rather than the recommended "sha512". + For brevity this example uses MD5 rather than the recommended SHA-512.
@@ -873,8 +871,7 @@ myfirstbag/ files listed in the payload manifests prior to validation. Lines of file content are shown with added parentheses to indicate each complete line. - For brevity this example uses the algorithm "md5" - rather than the recommended "sha512". + For brevity this example uses MD5 rather than the recommended SHA-512.
@@ -926,8 +923,9 @@ highsmith-tahoe/ namespace sequences (e.g. "\\?\C:\…") described in . - To assist implementers, the Library - of Congress conformance suite has some tests for invalid bags + To assist implementers, the Library + of Congress conformance suite [] + has some tests for invalid bags which are expected to fail on POSIX or Windows clients.
@@ -966,7 +964,7 @@ highsmith-tahoe/ - The Library of Congress conformance suite + The Library of Congress conformance suite [] is provided as a public resource to test new implementations for compatibility and error handling. @@ -1159,8 +1157,8 @@ update the bag with valid manifests. The Augmented Backus-Naur form (ABNF) provided below are non-normative. If there is a discrepancy between requirements in the normative sections and -the ABNF, the requirements in the normative sections prevail. Some -definitions use the core rules (e.g. DIGIT, HEXDIG, etc) as defined in +the ABNF, the requirements in the normative sections prevail. Some +definitions use the core rules (e.g. DIGIT, HEXDIG, etc) as defined in
@@ -1182,13 +1180,13 @@ ending = CR / LF / CRLF payload-manifest = 1*payload-manifest-line payload-manifest-line = checksum 1*WSP filepath ending checksum = 1*case-hexdig -case-hexdig = DIGIT / "A" / "a" / "B" / "b" / "C" / "c" / +case-hexdig = DIGIT / "A" / "a" / "B" / "b" / "C" / "c" / "D" / "d" / "E"/ "e"/ "F" / "f" filepath = "data/" 1*( unreserved / pct-encoded / sub-delims ) unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" sub-delims = "!" / "$" / "&" / DQUOTE / "'" / "(" / ")" / - "*" / "+" / "," / ";" / "=" / "/" + "*" / "+" / "," / ";" / "=" / "/" pct-encoded = "%0D" / "%0d" / "%0A" / "%0a" / "%25" ending = CR / LF / CRLF ]]> @@ -1204,8 +1202,8 @@ metadata-line = key ":" WSP value ending *(continuation ending) key = 1*non-reserved value = 1*non-reserved continuation = WSP 1*non-reserved -non-reserved = VCHAR / WSP - ; any valid character for the specific encoding +non-reserved = VCHAR / WSP + ; any valid character for the specific encoding ; except those that match "ending" ending = CR / LF / CRLF ]]> From 8cc9b11e33e0d193e55ed311c2f47f4046520edf Mon Sep 17 00:00:00 2001 From: Justin Littman Date: Thu, 17 May 2018 14:44:19 -0400 Subject: [PATCH 115/144] Fixed keywords in non-normative section. --- bagit.xml | 48 ++++++++++++++++++++++++------------------------ 1 file changed, 24 insertions(+), 24 deletions(-) diff --git a/bagit.xml b/bagit.xml index daae4799..4c2e939d 100644 --- a/bagit.xml +++ b/bagit.xml @@ -296,10 +296,10 @@ document are to be interpreted as described in . A file which contains metadata about the bag or its payload. This specification defines the standard - BagIt tag files: + BagIt tag files: the bag declaration in "bagit.txt" , payload manifests , - tag manifests , + tag manifests , bag metadata in "bag-info.txt" , and remote payload in "fetch.txt" . @@ -379,9 +379,9 @@ Tag-File-Character-Encoding: ENCODING M.N identifies the BagIt major (M) and minor (N) version numbers. 
ENCODING identifies the character set encoding used by the remaining tag files. - ENCODING &should; - be UTF-8 but - for backwards compatibility it &may; be any + ENCODING &should; + be UTF-8 but + for backwards compatibility it &may; be any other encoding registered in . The bag declaration itself &must; be encoded in UTF-8, and &must-not; contain a @@ -534,14 +534,14 @@ As a result, no filepath listed in a tag manifest be A metadata element &must; consist of a label, a colon ":", a single - linear whitespace character (space or tab), and a value, terminated with a line feed (CR), carriage return (LF) or + linear whitespace character (space or tab), and a value, terminated with a line feed (CR), carriage return (LF) or carriage return plus line feed (CRLF). - The label &must-not; contain colon (:), line feeds (LF) or carriage returns (CR). + The label &must-not; contain colon (:), line feeds (LF) or carriage returns (CR). The label &may; contain linear whitespace characters, but &must-not; start or - end with whitespace. - + end with whitespace. + It is &recommended; that lines not exceed 79 characters in length. Long values &may; be continued onto the next line by inserting a line feed (LF), a carriage @@ -550,15 +550,15 @@ As a result, no filepath listed in a tag manifest be Except for linebreaks such padding does not form part of the value. - Implementations wishing to support previous BagIt versions - &must; accept multiple linear whitespace before and after the - colon when the bag version is earlier than 1.0; such whitespace + Implementations wishing to support previous BagIt versions + &must; accept multiple linear whitespace before and after the + colon when the bag version is earlier than 1.0; such whitespace does not form part of the label or value. The following are reserved metadata elements. The use of these reserved metadata elements are &optional; but encouraged. Reserved metadata - element names are case-insensitive. 
Except where indicated otherwise, + element names are case-insensitive. Except where indicated otherwise, these metadata element names &may; be repeated to capture multiple values. @@ -814,7 +814,7 @@ requirements: Every file listed in every tag manifest &must; be present. Every file listed in every payload manifest &must; be present. For BagIt 1.0, every payload file &must; be listed in every payload manifest. - Note that older versions of this specification allowed payload files to be + Note that older versions of this specification allowed payload files to be listed in just one of the manifests. Every element present &must; comply with this specification. @@ -1088,9 +1088,9 @@ z 7a LATIN SMALL LETTER Z As specified above, only the Unix-based path separator ('/') may be used inside filenames listed in BagIt manifest and fetch.txt files. - When bags are exchanged between Windows and Unix platforms, care - should be taken to translate the path separator as needed. Receivers - of bags on physical media should be prepared for filesystems created + When bags are exchanged between Windows and Unix platforms, + the path separator &should; be translated as needed. Receivers + of bags on physical media &should; be prepared for filesystems created under either Windows or Unix. Besides the fundamental difference between path separators ('\' and '/'), generally, Windows filesystems have more limitations than Unix filesystems. @@ -1147,7 +1147,7 @@ the checksum and, under Windows, the backslashes ins Implementers &may; wish to accept this format by ignoring a leading asterisk or handling differences in line termination gracefully but, if so, implementations &must; warn the user that the bag in question will fail strict validation. In -such cases it is strongly encouraged that tools provide an easy option to +such cases it is &recommended; that tools provide an easy option to update the bag with valid manifests.
@@ -1159,8 +1159,8 @@ update the bag with valid manifests. The Augmented Backus-Naur form (ABNF) provided below are non-normative. If there is a discrepancy between requirements in the normative sections and -the ABNF, the requirements in the normative sections prevail. Some -definitions use the core rules (e.g. DIGIT, HEXDIG, etc) as defined in +the ABNF, the requirements in the normative sections prevail. Some +definitions use the core rules (e.g. DIGIT, HEXDIG, etc) as defined in
@@ -1182,13 +1182,13 @@ ending = CR / LF / CRLF payload-manifest = 1*payload-manifest-line payload-manifest-line = checksum 1*WSP filepath ending checksum = 1*case-hexdig -case-hexdig = DIGIT / "A" / "a" / "B" / "b" / "C" / "c" / +case-hexdig = DIGIT / "A" / "a" / "B" / "b" / "C" / "c" / "D" / "d" / "E"/ "e"/ "F" / "f" filepath = "data/" 1*( unreserved / pct-encoded / sub-delims ) unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" sub-delims = "!" / "$" / "&" / DQUOTE / "'" / "(" / ")" / - "*" / "+" / "," / ";" / "=" / "/" + "*" / "+" / "," / ";" / "=" / "/" pct-encoded = "%0D" / "%0d" / "%0A" / "%0a" / "%25" ending = CR / LF / CRLF ]]> @@ -1204,8 +1204,8 @@ metadata-line = key ":" WSP value ending *(continuation ending) key = 1*non-reserved value = 1*non-reserved continuation = WSP 1*non-reserved -non-reserved = VCHAR / WSP - ; any valid character for the specific encoding +non-reserved = VCHAR / WSP + ; any valid character for the specific encoding ; except those that match "ending" ending = CR / LF / CRLF ]]> From 9606be25e8ffd026418c4d5ab2e2e3df1b57a1b3 Mon Sep 17 00:00:00 2001 From: John Scancella Date: Tue, 22 May 2018 13:48:36 -0400 Subject: [PATCH 116/144] updated date --- bagit.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/bagit.xml b/bagit.xml index d0734d8e..846c3dcb 100644 --- a/bagit.xml +++ b/bagit.xml @@ -199,7 +199,7 @@ brian@ardvaark.net - + This document specifies BagIt, a set of hierarchical file layout conventions for From 3b69c59af7a3dae451063f3cf769f5657bcc202b Mon Sep 17 00:00:00 2001 From: "John A. Kunze" Date: Tue, 22 May 2018 11:54:10 -0700 Subject: [PATCH 117/144] increment draft number --- bagit.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/bagit.xml b/bagit.xml index 846c3dcb..accf5551 100644 --- a/bagit.xml +++ b/bagit.xml @@ -35,7 +35,7 @@ - + The BagIt File Packaging Format (V&current-bagit-version;) From 59ebeadb76f82926107366f7cca194f82ddb6790 Mon Sep 17 00:00:00 2001 From: "John A. 
Kunze" <jak@ucop.edu> Date: Tue, 22 May 2018 11:56:20 -0700 Subject: [PATCH 118/144] tweak to author order --- bagit.xml | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/bagit.xml b/bagit.xml index accf5551..6bfa0332 100644 --- a/bagit.xml +++ b/bagit.xml @@ -40,6 +40,18 @@ <title abbrev="BagIt"> The BagIt File Packaging Format (V&current-bagit-version;) - -
- - 1438 Kingfisher Way - Sunnyvale - CA - 94087 - USA - - andy@boyko.net -
-
California Digital Library @@ -82,6 +70,18 @@ justinlittman@gwu.edu + +
+ + 1438 Kingfisher Way + Sunnyvale + CA + 94087 + USA + + andy@boyko.net +
+
University of Maryland From 414de088d0076261a08a1125605181bdd42387aa Mon Sep 17 00:00:00 2001 From: "John A. Kunze" Date: Tue, 22 May 2018 13:40:28 -0700 Subject: [PATCH 119/144] grouped unaffiliated authors together --- bagit.xml | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/bagit.xml b/bagit.xml index 6bfa0332..cc84579f 100644 --- a/bagit.xml +++ b/bagit.xml @@ -70,18 +70,6 @@ justinlittman@gwu.edu - -
- - 1438 Kingfisher Way - Sunnyvale - CA - 94087 - USA - - andy@boyko.net -
-
University of Maryland @@ -187,6 +175,18 @@ cadams@loc.gov + +
+ + 1438 Kingfisher Way + Sunnyvale + CA + 94087 + USA + + andy@boyko.net +
+
From 87a19dfccc728e3d2b4b1dea6e7605b83fca304e Mon Sep 17 00:00:00 2001 From: "John A. Kunze" Date: Tue, 22 May 2018 14:43:06 -0700 Subject: [PATCH 120/144] added '-' affiliation to fix formatting glitch An empty affiliation outputs a blank line after all but the last unaffiliated author in a group. --- bagit.xml | 2 ++ 1 file changed, 2 insertions(+) diff --git a/bagit.xml b/bagit.xml index cc84579f..539ab0d6 100644 --- a/bagit.xml +++ b/bagit.xml @@ -176,6 +176,7 @@
+ -
1438 Kingfisher Way @@ -188,6 +189,7 @@
+ -
1354 Quincy St. NW From ea112c94498fcb775e4ad6cfbdca07467be97d98 Mon Sep 17 00:00:00 2001 From: Justin Littman Date: Wed, 23 May 2018 11:37:25 -0400 Subject: [PATCH 121/144] refs #31. Makes changes requested from ISE review. --- bagit.xml | 195 ++++++++++++++++++------------------------------------ 1 file changed, 63 insertions(+), 132 deletions(-) diff --git a/bagit.xml b/bagit.xml index 539ab0d6..fa2cd99d 100644 --- a/bagit.xml +++ b/bagit.xml @@ -7,7 +7,7 @@ - + @@ -17,6 +17,7 @@ + @@ -25,6 +26,7 @@ + @@ -70,22 +72,7 @@ justinlittman@gwu.edu
- - - University of Maryland - -
- - 4130 Campus Drive - College Park - MD - 20742 - USA - - ehs@pobox.com -
-
- + Library of Congress @@ -115,51 +102,6 @@ jsca@loc.gov - - - Library of Congress - -
- - 101 Independence Avenue SE - Washington - DC - 20540 - USA - - rstorey@loc.gov -
-
- - - Library of Congress - -
- - 101 Independence Avenue SE - Washington - DC - 20540 - USA - - dbrun@loc.gov -
-
- - - Library of Congress - -
- - 101 Independence Avenue SE - Washington - DC - 20540 - USA - - kzwa@loc.gov -
-
Library of Congress @@ -175,40 +117,14 @@ cadams@loc.gov - - - -
- - 1438 Kingfisher Way - Sunnyvale - CA - 94087 - USA - - andy@boyko.net -
-
- - - -
- - 1354 Quincy St. NW - Washington - DC - 20011 - USA - - brian@ardvaark.net -
-
-This document specifies BagIt, a set of hierarchical file layout conventions for +This document describes BagIt, a set of hierarchical file layout conventions for storage and transfer of arbitrary digital content. A "bag" has just enough structure to enclose descriptive metadata "tags" and a file "payload" but does not require knowledge of the payload's internal semantics. This -BagIt format should be suitable for reliable storage and transfer. +BagIt format is suitable for reliable storage and transfer.
@@ -257,8 +173,10 @@ Ghent University, New York University, and the University of California.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", -"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this -document are to be interpreted as described in . +"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" +in this document are to be interpreted as described in BCP 14 + when, and only when, they +appear in all capitals as shown here. Implementers are strongly encouraged to review the interoperability @@ -268,16 +186,16 @@ document are to be interpreted as described in .
- The following terms have precise definitions as used in this specification: + The following terms have precise definitions as used in this document: A set of opaque files contained within the structure - defined by this specification. + defined by this document. - The file required to be in all bags conforming to this specification. + The file required to be in all bags conforming to this document. Contains values necessary to process the rest of a bag. See . @@ -285,10 +203,15 @@ document are to be interpreted as described in . The name of a cryptographic checksum algorithm which has been normalized for use in a manifest or tag manifest file name (e.g. "sha512") as described in . + + + A tag file that maps filepaths to checksums. A manifest can be a payload + manifest or a tag manifest . - The data encapsulated by the bag as a set of named files, which may be organized in sub-directories. The contents of the payload files - are opaque to this specification, and, with respect to BagIt processing, + The data encapsulated by the bag as a set of named files, which may be + organized in sub-directories. The contents of the payload files + are opaque to this document, and, with respect to BagIt processing, are always considered as sequences of uninterpreted octets. See . @@ -297,7 +220,7 @@ document are to be interpreted as described in . A file which contains metadata about the bag or its payload. - This specification defines the standard + This document defines the standard BagIt tag files: the bag declaration in "bagit.txt" , payload manifests , @@ -305,11 +228,11 @@ document are to be interpreted as described in . bag metadata in "bag-info.txt" , and remote payload in "fetch.txt" . - This specification also allows other arbitrary tag files as described in + This document also allows other arbitrary tag files as described in . 
- A bag which contains every element required by this specification, + A bag which contains every element required by this document, every payload file listed in a manifest, and any optional files which are listed in a tag manifest. See . @@ -343,7 +266,7 @@ a file named "bagit.txt" (see ), and zero or more additional tag files (see ). The tag files and directories are in arbitrary file hierarchies and &may; have -any name that is not reserved for a file or directory in this specification. +any name that is not reserved for a file or directory in this document. The base directory can have any name. @@ -391,7 +314,7 @@ Tag-File-Character-Encoding: ENCODING - The number for this version of the specification is "&current-bagit-version;". + The number for this version of BagIt is "&current-bagit-version;".
@@ -405,7 +328,7 @@ Tag-File-Character-Encoding: ENCODING Each payload file is treated as an opaque octet stream when verifying file correctness. Payload files &may; be organized in arbitrary sub-directory structures - within the payload directory, however for the purpose of this specification + within the payload directory, however for the purpose of this document such sub-directory structures and filenames have no given meaning.
@@ -725,7 +648,7 @@ Internal-Sender-Description: Uncompressed greyscale TIFFs created
A bag &may; contain other tag files that are not defined by this - specification. + document. Implementations &must; perform standard checksum validation on any tag file which is listed in a tag manifest but &must; otherwise ignore their contents. @@ -736,7 +659,7 @@ Internal-Sender-Description: Uncompressed greyscale TIFFs created
- All tag files specifically described in this specification &must; adhere to + All tag files specifically described in this document &must; adhere to the text tag file format described below. Other tag files &may; adhere to the text tag file format described below. @@ -815,10 +738,10 @@ requirements: Every file listed in every tag manifest &must; be present. Every file listed in every payload manifest &must; be present. For BagIt 1.0, every payload file &must; be listed in every payload manifest. - Note that older versions of this specification allowed payload files to be + Note that older versions of BagIt allowed payload files to be listed in just one of the manifests. - Every element present &must; comply with this specification. + Every element present &must; conform to BagIt &current-bagit-version;. @@ -926,7 +849,7 @@ highsmith-tahoe/ To assist implementers, the Library - of Congress conformance suite [] + of Congress conformance suite has some tests for invalid bags which are expected to fail on POSIX or Windows clients. @@ -966,7 +889,7 @@ highsmith-tahoe/ - The Library of Congress conformance suite [] + The Library of Congress conformance suite is provided as a public resource to test new implementations for compatibility and error handling. @@ -1161,7 +1084,7 @@ The Augmented Backus-Naur form (ABNF) provided below are non-normative. If there is a discrepancy between requirements in the normative sections and the ABNF, the requirements in the normative sections prevail. Some definitions use the core rules (e.g. DIGIT, HEXDIG, etc) as defined in - +
@@ -1229,42 +1152,30 @@ ending = CR / LF / CRLF
+
+ +Additional contributors to the authoring of BagIt are David Brunton, Rosie Storey, +Ed Summers, Andy Boyko, Brian Vargas, and Kate Zwaard. + +
-BagIt owes much to many thoughtful contributors and reviewers, including +BagIt benefitted from the thoughful assistance of Stephen Abrams, Mike Ashenfelder, Dan Chudnov, Dave Crocker, Brad Hards, Scott Fisher, Keith Johnson, Erik Hetzner, Leslie Johnston, David Loy, Mark Phillips, Tracy Seneca, Brian Tingle, Adam Turoff, Jim Tuttle, and Stian Soiland-Reyes. +
This draft does not request any action from IANA.
-
- - - - A Collaboration Model between Archival Systems to Enhance the Reliability of Preservation by an Enclose-and-Deposit Method - - - - - - - - Naming a File - Microsoft, Inc. - - - - - + &RFC1321; &RFC2119; - &RFC2234; &RFC6920; @@ -1279,7 +1190,27 @@ This draft does not request any action from IANA. &RFC3629; &RFC3986; &RFC6234; + &RFC8174; + + + + + A Collaboration Model between Archival Systems to Enhance the Reliability of Preservation by an Enclose-and-Deposit Method + + + + + + + + Naming a File + Microsoft, Inc. + + + + + &RFC4234; Unicode® Standard Annex #15: Unicode Normalization Forms @@ -1304,6 +1235,6 @@ This draft does not request any action from IANA. - + From 36ec049b144543e7da410b6e70cb496742cb9ed9 Mon Sep 17 00:00:00 2001 From: "John A. Kunze" Date: Thu, 24 May 2018 10:24:08 -0700 Subject: [PATCH 122/144] alphabetize contributors; fix typo --- bagit.xml | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/bagit.xml b/bagit.xml index fa2cd99d..5111ccec 100644 --- a/bagit.xml +++ b/bagit.xml @@ -1154,16 +1154,16 @@ ending = CR / LF / CRLF
-Additional contributors to the authoring of BagIt are David Brunton, Rosie Storey, -Ed Summers, Andy Boyko, Brian Vargas, and Kate Zwaard. +Additional contributors to the authoring of BagIt are Andy Boyko, David Brunton, Rosie Storey, +Ed Summers, Brian Vargas, and Kate Zwaard.
-BagIt benefitted from the thoughful assistance of -Stephen Abrams, Mike Ashenfelder, Dan Chudnov, Dave Crocker, Brad Hards, Scott Fisher, Keith -Johnson, Erik Hetzner, Leslie Johnston, David Loy, Mark Phillips, Tracy Seneca, -Brian Tingle, Adam Turoff, Jim Tuttle, and Stian Soiland-Reyes. +BagIt benefitted from the thoughtful assistance of +Stephen Abrams, Mike Ashenfelder, Dan Chudnov, Dave Crocker, Scott Fisher, Brad Hards, Erik Hetzner, +Keith Johnson, Leslie Johnston, David Loy, Mark Phillips, Tracy Seneca, Stian Soiland-Reyes, +Brian Tingle, Adam Turoff, and Jim Tuttle.
From 26be6b30861898005d4b2c9e64f720e52180bb47 Mon Sep 17 00:00:00 2001 From: "John A. Kunze" Date: Thu, 24 May 2018 10:52:13 -0700 Subject: [PATCH 123/144] add two non-normative sentences to address ISE review comment about error handling "Upon discovering errors in bags, an implementation is free to take action (for example, logging or reporting) in an application-specific manner. This document does not mandate any particular action." --- bagit.xml | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/bagit.xml b/bagit.xml index 5111ccec..5d3b423a 100644 --- a/bagit.xml +++ b/bagit.xml @@ -887,6 +887,12 @@ highsmith-tahoe/ users. None of the points below are required but they are recommended for general-purpose usage. + + + Upon discovering errors in bags, an implementation is free to take action + (for example, logging or reporting) in an application-specific manner. + This document does not mandate any particular action. + The Library of Congress conformance suite From 5988dd735f5c2fc0c82cde95b7738cccba00bcdc Mon Sep 17 00:00:00 2001 From: "John A. Kunze" Date: Thu, 24 May 2018 12:08:01 -0700 Subject: [PATCH 124/144] added the missing noun ("rules") that the term "ABNF" qualifies --- bagit.xml | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/bagit.xml b/bagit.xml index 5d3b423a..bf3958cd 100644 --- a/bagit.xml +++ b/bagit.xml @@ -1086,7 +1086,7 @@ update the bag with valid manifests.
-The Augmented Backus-Naur form (ABNF) provided below are non-normative. If +The Augmented Backus-Naur Form (ABNF) rules provided below are non-normative. If there is a discrepancy between requirements in the normative sections and the ABNF, the requirements in the normative sections prevail. Some definitions use the core rules (e.g. DIGIT, HEXDIG, etc) as defined in @@ -1094,7 +1094,7 @@ definitions use the core rules (e.g. DIGIT, HEXDIG, etc) as defined in
- bagit.txt ABNF: + bagit.txt ABNF rules:
- Payload Manifest ABNF: + Payload Manifest ABNF rules:
- bag-info.txt ABNF: + bag-info.txt ABNF rules:
- fetch.txt ABNF: + fetch.txt ABNF rules: Date: Thu, 24 May 2018 17:53:25 -0700 Subject: [PATCH 125/144] made artwork displays all end consistently some displays ended with an extra blank line --- bagit.xml | 42 ++++++++++++++---------------------------- 1 file changed, 14 insertions(+), 28 deletions(-) diff --git a/bagit.xml b/bagit.xml index bf3958cd..c607c518 100644 --- a/bagit.xml +++ b/bagit.xml @@ -287,8 +287,7 @@ The base directory can have any name. | +-- [optional tag directories]/ | - +-- [optional tag files] - + +-- [optional tag files]
@@ -298,8 +297,7 @@ The base directory can have any name.
BagIt-Version: M.N -Tag-File-Character-Encoding: ENCODING - +Tag-File-Character-Encoding: ENCODING M.N identifies the BagIt major (M) and minor (N) version numbers. ENCODING identifies the character set encoding used by the remaining tag files. @@ -364,8 +362,7 @@ Tag-File-Character-Encoding: ENCODING Example payload manifest filenames manifest-sha256.txt -manifest-sha512.txt - +manifest-sha512.txt
Each line of a payload manifest file &must; be of the form: @@ -436,8 +433,7 @@ placeholder file with a name such as ".keep". Example tag manifest filenames: tagmanifest-sha256.txt -tagmanifest-sha512.txt - +tagmanifest-sha512.txt
A tag manifest file has the same form as the payload file manifest @@ -582,8 +578,7 @@ Bag-Group-Identifier: university_foo Bag-Count: 1 of 15 Internal-Sender-Identifier: /storage/images/foo Internal-Sender-Description: Uncompressed greyscale TIFFs created - from microfilm and are... - + from microfilm and are...
@@ -786,8 +781,7 @@ myfirstbag/ | | 27613-h/images/q172.txt | (... OCR text ... ) - .... -
+ ....
@@ -816,8 +810,7 @@ highsmith-tahoe/ | | bag-info.txt | (Internal-Sender-Description: Download link found at ) -| ( https://www.loc.gov/resource/highsm.23364/ ) - +| ( https://www.loc.gov/resource/highsm.23364/ )
@@ -958,8 +951,7 @@ N 4e LATIN CAPITAL LETTER N ú c3ba LATIN SMALL LETTER U WITH ACUTE ñ c3b1 LATIN SMALL LETTER N WITH TILDE e 65 LATIN SMALL LETTER E -z 7a LATIN SMALL LETTER Z - ]]> +z 7a LATIN SMALL LETTER Z ]]> Unicode normalization is relevant to BagIt implementors because different @@ -1032,8 +1024,7 @@ z 7a LATIN SMALL LETTER Z - < > : " / | ? * - + < > : " / | ? *
@@ -1042,8 +1033,7 @@ z 7a LATIN SMALL LETTER Z CON, PRN, AUX, NUL COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9 - LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, LPT9 - + LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, LPT9
See for more information and possible alternatives. @@ -1099,8 +1089,7 @@ definitions use the core rules (e.g. DIGIT, HEXDIG, etc) as defined in bagit-txt = "BagIt-Version: " 1*DIGIT "." 1*DIGIT ending "Tag-File-Character-Encoding: " encoding ending encoding = 1*CHAR -ending = CR / LF / CRLF - ]]> +ending = CR / LF / CRLF ]]>
@@ -1119,8 +1108,7 @@ unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" sub-delims = "!" / "$" / "&" / DQUOTE / "'" / "(" / ")" / "*" / "+" / "," / ";" / "=" / "/" pct-encoded = "%0D" / "%0d" / "%0A" / "%0a" / "%25" -ending = CR / LF / CRLF -]]> +ending = CR / LF / CRLF ]]>
@@ -1136,8 +1124,7 @@ continuation = WSP 1*non-reserved non-reserved = VCHAR / WSP ; any valid character for the specific encoding ; except those that match "ending" -ending = CR / LF / CRLF -]]> +ending = CR / LF / CRLF ]]>
@@ -1151,8 +1138,7 @@ url = length = 1*DIGIT / "-" filepath = ("data/" 1*( unreserved / pct-encoded / sub-delims )) -ending = CR / LF / CRLF -]]> +ending = CR / LF / CRLF ]]> From f165f2d74049b8618a99b62213aa6ece22715245 Mon Sep 17 00:00:00 2001 From: "John A. Kunze" Date: Fri, 25 May 2018 10:20:35 -0700 Subject: [PATCH 126/144] reduced redundant use of "optional" MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Per reviewer comment: > Section 2.1.3), a file named "bagit.txt" (see Section 2.1.1), and > zero or more additional tag files (see Section 2.2). The tag files > in the optional tag directories are arbitrary file hierarchies and > the tag directories MAY have any name that is not reserved for a file > or directory in this specification. Above (2) seems to say that all tag directories are optional. Hence constantly including the word 'optional' for them, in the rest of the document, is distracting. > > The base directory MAY have any name. > > / > | bagit.txt > | manifest-.txt > | [optional additional tag files] > \--- data/ > | [payload files] > \--- [optional tag directories]/ > | [optional tag files] The square brackets are probably enough to indicate being optional. The word just makes things wordier. _The word “optional” has been removed as redundant, given the bracketing and that all tag directories have been described previously as optional._ --- bagit.xml | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/bagit.xml b/bagit.xml index c607c518..892b3598 100644 --- a/bagit.xml +++ b/bagit.xml @@ -279,15 +279,15 @@ The base directory can have any name. | +-- manifest-<algorithm>.txt | - +-- [optional additional tag files] + +-- [additional tag files] | +-- data/ | | | +-- [payload files] | - +-- [optional tag directories]/ + +-- [tag directories]/ | - +-- [optional tag files] + +-- [tag files]
From a5b1efcf08259d9e515a73c8cd4f399bf7b069ea Mon Sep 17 00:00:00 2001 From: "John A. Kunze" Date: Fri, 25 May 2018 10:34:08 -0700 Subject: [PATCH 127/144] tightened "file" reference to "file name" MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Per reviewer comment: > A payload manifest is a tag file that lists payload files and probably: that lists payload file names and Clarified. Saying "lists" does imply names and not the file contents, but for some reason I think the modified form will be clearer. > checksums for those payload files generated using a particular bag I'm pretty sure it's not the payload files that are generated using a checksum algorithm... I assume it's a manifest payload file listing... _That sentence was stricken during recent editing rounds. A similar sentence has been reworded: “Every payload manifest MUST list every payload file name exactly once.”_ --- bagit.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/bagit.xml b/bagit.xml index 892b3598..bf934aaf 100644 --- a/bagit.xml +++ b/bagit.xml @@ -333,7 +333,7 @@ Tag-File-Character-Encoding: ENCODING
- A payload manifest file provides a complete listing of each payload file along + A payload manifest file provides a complete listing of each payload file name along with a corresponding checksum to permit data integrity checking. Manifest entries &must; satisfy the following constraints: @@ -345,7 +345,7 @@ Tag-File-Character-Encoding: ENCODING more than one. - Every payload manifest &must; list every payload file exactly once. + Every payload manifest &must; list every payload file name exactly once. A payload manifest file &must; have a name of the form From d8ad39fbd18b344cdd967a775be20404e86f7ffb Mon Sep 17 00:00:00 2001 From: "John A. Kunze" Date: Fri, 25 May 2018 10:40:02 -0700 Subject: [PATCH 128/144] added an extra summary section sentence Per reviewer comment: > checksum algorithm. Every bag MUST contain one payload manifest > file, and MAY contain more than one. A payload manifest file MUST I think this is unusual enough to warrant, again, an initial, summary statement. If I'm understanding, it should be something like: A bag can have more than one data integrity manifest, with each using a different validation algorithm. _This sentence has been added: A bag can have more than one payload manifest, with each using a different validation algorithm._ --- bagit.xml | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/bagit.xml b/bagit.xml index bf934aaf..0789e345 100644 --- a/bagit.xml +++ b/bagit.xml @@ -334,8 +334,9 @@ Tag-File-Character-Encoding: ENCODING
A payload manifest file provides a complete listing of each payload file name along - with a corresponding checksum to permit data integrity checking. Manifest entries - &must; satisfy the following constraints: + with a corresponding checksum to permit data integrity checking. A bag can have more + than one payload manifest, with each using a different validation algorithm. + Manifest entries &must; satisfy the following constraints: From 04e7d849177f117ecf7863f84fb02420a09260c4 Mon Sep 17 00:00:00 2001 From: "John A. Kunze" Date: Fri, 25 May 2018 10:47:23 -0700 Subject: [PATCH 129/144] clarified metadata Per reviewer comment: > Source-Organization Organization transferring the content. ... > Organization-Address Mailing address of the organization. organization -> source organization > Contact-Name Person at the source organization who is responsible > for the content transfer. > > Contact-Phone International format telephone number of person or > position responsible. > > Contact-Email Fully qualified email address of person or position > responsible. > ... > External-Description A brief explanation of the contents and > provenance. ... > Bagging-Date Date (YYYY-MM-DD) that the content was prepared for > delivery. I think you mean 'transfer' rather than 'delivery'... --- bagit.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/bagit.xml b/bagit.xml index 0789e345..bcac4767 100644 --- a/bagit.xml +++ b/bagit.xml @@ -490,7 +490,7 @@ As a result, no filepath listed in a tag manifest be Organization transferring the content. - Mailing address of the organization. + Mailing address of the source organization. Person at the source organization who is responsible for the content @@ -506,7 +506,7 @@ As a result, no filepath listed in a tag manifest be A brief explanation of the contents and provenance. - Date (YYYY-MM-DD) that the content was prepared for delivery. + Date (YYYY-MM-DD) that the content was prepared for transfer. 
This metadata element &should-not; be repeated. From e4f0c7d7d00ef6acfb59fc7e135afdbe66e80703 Mon Sep 17 00:00:00 2001 From: "John A. Kunze" Date: Fri, 25 May 2018 10:53:32 -0700 Subject: [PATCH 130/144] moved summary paragraph from end of section to beginning of section Per reviewer comment: > The "fetch.txt" file allows a bag to be transmitted with "holes" in > it, which can be practical for several reasons. For example, it > obviates the need for the sender to stage a large serialized copy of > the content while the bag is transferred to the receiver. Also, this > method allows a sender to construct a bag from components that are > either a subset of logically related components (e.g., the localized > logical object could be much larger than what is intended for export) > or assembled from logically distributed sources (e.g., the object > components for export are not stored locally under one filesystem > tree). This paragraph would be a better introduction to the section. _Done._ --- bagit.xml | 25 +++++++++++-------------- 1 file changed, 11 insertions(+), 14 deletions(-) diff --git a/bagit.xml b/bagit.xml index bcac4767..756fce3b 100644 --- a/bagit.xml +++ b/bagit.xml @@ -588,8 +588,17 @@ Internal-Sender-Description: Uncompressed greyscale TIFFs created For reasons of efficiency, a bag &may; be sent with a list of files to be fetched and added to the payload before it can meaningfully be checked - for completeness. An &optional; tag file called the fetch file - contains such a list. + for completeness. + The fetch file allows a bag to be transmitted with + "holes" in it, which can be practical for several reasons. For example, + it obviates the need for the sender to stage a large serialized copy of + the content while the bag is transferred to the receiver. 
Also, this + method allows a sender to construct a bag from components that are either + a subset of logically related components (e.g., the localized logical + object could be much larger than what is intended for export) or + assembled from logically distributed sources (e.g., the object components + for export are not stored locally under one filesystem tree). + An &optional; tag file called the fetch file contains such a list. @@ -627,18 +636,6 @@ Internal-Sender-Description: Uncompressed greyscale TIFFs created limitation on the length of any of the fields in the fetch file. - - The fetch file allows a bag to be transmitted with - "holes" in it, which can be practical for several reasons. For example, - it obviates the need for the sender to stage a large serialized copy of - the content while the bag is transferred to the receiver. Also, this - method allows a sender to construct a bag from components that are either - a subset of logically related components (e.g., the localized logical - object could be much larger than what is intended for export) or - assembled from logically distributed sources (e.g., the object components - for export are not stored locally under one filesystem tree). - -
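Reviewer illustration (not part of the patch series): a minimal sketch of how a tool might read one line of the fetch file described above. It assumes each line carries three whitespace-separated fields (URL, length, filepath), that the length is either a decimal number or "-" when unknown, and that the filepath begins with "data/", as the fetch.txt ABNF rules later in this series suggest. The names `FetchEntry` and `parse_fetch_line` and the example.org URL are hypothetical. Percent-decoding of the filepath is deliberately left out.

```python
from typing import NamedTuple, Optional


class FetchEntry(NamedTuple):
    """One parsed fetch file line (hypothetical helper type)."""
    url: str
    length: Optional[int]  # None when the fetch file gives "-"
    filepath: str


def parse_fetch_line(line: str) -> FetchEntry:
    """Split a fetch file line into URL, length, and filepath.

    Assumption: fields are separated by whitespace and only the
    filepath (the last field) may itself contain spaces.
    """
    parts = line.rstrip("\r\n").split(None, 2)
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
    url, length, filepath = parts

    if length == "-":
        size = None  # sender did not state a size
    elif length.isascii() and length.isdigit():
        size = int(length)
    else:
        raise ValueError(f"length must be digits or '-': {length!r}")

    # Fetched files fill "holes" in the payload, so they live under data/.
    if not filepath.startswith("data/"):
        raise ValueError(f"filepath must start with 'data/': {filepath!r}")
    return FetchEntry(url, size, filepath)


if __name__ == "__main__":
    entry = parse_fetch_line("https://example.org/img/q172.png 1024 data/images/q172.png\n")
    print(entry)
```

Returning `None` for an unknown length keeps callers from mistaking "-" for a zero-byte file. Because the stated length is only advisory, a caller would still need to treat it as a hint rather than as a basis for buffer sizing.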
From 2613492779443892b7364f9328449ede92643d35 Mon Sep 17 00:00:00 2001 From: "John A. Kunze" Date: Fri, 25 May 2018 11:01:55 -0700 Subject: [PATCH 131/144] clarified security implications of older checksum algorithms Per reviewer comment: > Implementors of tools that complete bags by retrieving URLs listed in > a "fetch.txt" file need to be aware that some of those URLs may point > to hosts, intentionally or unintentionally, that are not under > control of the bag's sender. Checksums are intended as a reasonable > guarantee against corruption during transit, not a strong > cryptographic protection against intentional spoofing. Oh? _This wording was meant to apply to checksums as they are used in bags, as well as to address criticism that many legacy bags used easily broken MD5 checksums. That last sentence has now been reworded to: Moreover, older checksum algorithms, even if reasonable for detecting corruption during transit, may not offer strong cryptographic protection against intentional spoofing._ --- bagit.xml | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/bagit.xml b/bagit.xml index 756fce3b..cd8f2150 100644 --- a/bagit.xml +++ b/bagit.xml @@ -850,9 +850,9 @@ highsmith-tahoe/ Implementers of tools that complete bags by retrieving URLs listed in a fetch file need to be aware that some of those URLs might point to hosts, intentionally or unintentionally, that are not under control - of the bag's sender. Checksums are intended as a reasonable guarantee - against corruption during transit, not a strong cryptographic - protection against intentional spoofing. + of the bag's sender. Moreover, older checksum algorithms, even if + reasonable for detecting corruption during transit, may not offer strong + cryptographic protection against intentional spoofing.
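Reviewer illustration (not part of the patch series): a minimal sketch of checking payload files against one payload manifest, to make the checksum discussion above concrete. It assumes manifest lines have the form "CHECKSUM FILENAME", that the algorithm name can be read from the manifest file name (for example "manifest-sha512.txt"), that tag files are UTF-8, and that the algorithm name is one Python's `hashlib.new` accepts. The function names are hypothetical. Percent-decoding of filepaths and protection against paths that escape the bag directory are omitted.

```python
import hashlib
from pathlib import Path


def algorithm_from_manifest_name(name: str) -> str:
    """Map e.g. 'manifest-sha512.txt' to 'sha512' (hypothetical helper)."""
    if not name.endswith(".txt") or "-" not in name:
        raise ValueError(f"not a manifest file name: {name!r}")
    return name[: -len(".txt")].split("-", 1)[1]


def verify_manifest(bag_dir: str, manifest_name: str) -> list:
    """Return the filepaths whose checksum is missing or does not match."""
    bag = Path(bag_dir)
    algorithm = algorithm_from_manifest_name(manifest_name)
    failures = []
    # Assumption: the bag declares UTF-8 as its tag file encoding.
    with open(bag / manifest_name, encoding="utf-8") as manifest:
        for line in manifest:
            line = line.rstrip("\r\n")
            if not line:
                continue
            expected, filepath = line.split(None, 1)
            digest = hashlib.new(algorithm)
            try:
                with open(bag / filepath, "rb") as payload:
                    # Hash in chunks so large payload files are not loaded whole.
                    for chunk in iter(lambda: payload.read(1 << 20), b""):
                        digest.update(chunk)
            except FileNotFoundError:
                failures.append(filepath)
                continue
            # Compare case-insensitively; this leniency is a choice of this sketch.
            if digest.hexdigest() != expected.lower():
                failures.append(filepath)
    return failures
```

Because the algorithm is taken from the file name, the same routine can be pointed at a stronger manifest when a bag carries more than one. A passing check indicates the bytes were not corrupted, not that they came from a trusted host.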
From d8692c97a922d0dcd3243724cbf100ec876b56a7 Mon Sep 17 00:00:00 2001 From: "John A. Kunze" Date: Fri, 25 May 2018 11:04:17 -0700 Subject: [PATCH 132/144] usage tweak Per reviewer comment: > In all text tag files except for the bag declaration file, text MUST > be encoded in the character encoding specified in the "bagit.txt" bag be encoded in the character encoding -> use the character encoding _Done._ --- bagit.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/bagit.xml b/bagit.xml index cd8f2150..824de89f 100644 --- a/bagit.xml +++ b/bagit.xml @@ -664,8 +664,8 @@ Internal-Sender-Description: Uncompressed greyscale TIFFs created Text tag file names &must; end in the extension ".txt".
-In all text tag files except for the bag declaration file, text &must; be -encoded in the character encoding specified in the "bagit.txt" bag declaration +In all text tag files except for the bag declaration file, text &must; use +the character encoding specified in the "bagit.txt" bag declaration file. Text tag files except for the bag declaration file &may; include a byte-order mark (BOM) only if the specified encoding requires it for proper decoding. In accordance with , when "bagit.txt" From 855c3f3e40b2041f42ab2e3540be06c50c854115 Mon Sep 17 00:00:00 2001 From: "John A. Kunze" Date: Fri, 25 May 2018 11:09:23 -0700 Subject: [PATCH 133/144] beefed up response to security case Per reviewer comment: > The size of files, as optionally reported in the "fetch.txt" file, > cannot be guaranteed to match the actual file size to be downloaded. > Implementors SHOULD take care to appropriately handle cases where the > actual file size does not match the file size reported in the > fetch.txt. Implementors SHOULD NOT use the file size in the > "fetch.txt" file for critical resource allocation, such as buffer > sizing or storage requisitioning. Absent specification of what "appropriately handle" means, this guidance lacks substance. _Reworded the second sentence to be: Implementers SHOULD take steps to monitor and abort transfer when the received file size exceeds the file size reported in the fetch file._ --- bagit.xml | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/bagit.xml b/bagit.xml index 824de89f..0c7a84be 100644 --- a/bagit.xml +++ b/bagit.xml @@ -860,9 +860,9 @@ highsmith-tahoe/ The size of files, as optionally reported in the fetch file, cannot be guaranteed to match the actual file size to be downloaded. - Implementers &should; take care to appropriately handle cases where - the actual file size does not match the file size reported in the - fetch file. 
Implementers &should-not; use the file size in the + Implementers &should; take steps to monitor and abort transfer when the + received file size exceeds the file size reported in the fetch file. + Implementers &should-not; use the file size in the fetch file for critical resource allocation, such as buffer sizing or storage requisitioning. From f08da8f3663d5d665582f4b37eef4187df3ac392 Mon Sep 17 00:00:00 2001 From: "John A. Kunze" Date: Fri, 25 May 2018 11:19:00 -0700 Subject: [PATCH 134/144] changed a MUST to a standard &must; --- bagit.xml | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/bagit.xml b/bagit.xml index 0c7a84be..888d4d17 100644 --- a/bagit.xml +++ b/bagit.xml @@ -18,7 +18,8 @@ - + + @@ -248,7 +249,7 @@ appear in all capitals as shown here.
- A bag MUST consist of a base directory containing: + A bag &must; consist of a base directory containing: From cbc7d77f859d80e5cecb1b5840dece81f03e4dc6 Mon Sep 17 00:00:00 2001 From: "John A. Kunze" Date: Tue, 29 May 2018 13:26:00 -0700 Subject: [PATCH 135/144] more consistent terminology (validation -> checksum) --- bagit.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/bagit.xml b/bagit.xml index 888d4d17..86fde497 100644 --- a/bagit.xml +++ b/bagit.xml @@ -336,7 +336,7 @@ Tag-File-Character-Encoding: ENCODING A payload manifest file provides a complete listing of each payload file name along with a corresponding checksum to permit data integrity checking. A bag can have more - than one payload manifest, with each using a different validation algorithm. + than one payload manifest, with each using a different checksum algorithm. Manifest entries &must; satisfy the following constraints: From 5829d837a2e2f97acbabcfdb506413f8e2501b41 Mon Sep 17 00:00:00 2001 From: "John A. 
Kunze" Date: Mon, 4 Jun 2018 14:54:53 -0700 Subject: [PATCH 136/144] bumped draft number and current date --- bagit.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/bagit.xml b/bagit.xml index 86fde497..85d8dd4a 100644 --- a/bagit.xml +++ b/bagit.xml @@ -38,7 +38,7 @@ - + The BagIt File Packaging Format (V&current-bagit-version;) @@ -118,7 +118,7 @@ <email>cadams@loc.gov</email> </address> </author> - <date day="22" month="May" year="2018"/> + <date day="4" month="June" year="2018"/> <abstract> <t> This document describes BagIt, a set of hierarchical file layout conventions for From 8d243f07f9b7620b81f56d51615fe9b29f852b83 Mon Sep 17 00:00:00 2001 From: Justin Littman <justinlittman@gmail.com> Date: Wed, 22 Aug 2018 21:38:23 -0700 Subject: [PATCH 137/144] Update Justin's contact info --- bagit.xml | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/bagit.xml b/bagit.xml index 85d8dd4a..428b2ebd 100644 --- a/bagit.xml +++ b/bagit.xml @@ -60,17 +60,17 @@ </author> <author initials="J." surname="Littman" fullname="Justin Littman"> <organization> - George Washington University Libraries + Stanford Libraries </organization> <address> <postal> - <street>2130 H St NW</street> - <city>Washington</city> - <region>DC</region> - <code>20052</code> + <street>518 Memorial Way</street> + <city>Stanford</city> + <region>CA</region> + <code>94305</code> <country>USA</country> </postal> - <email>justinlittman@gwu.edu</email> + <email>justinlittman@stanford.edu</email> </address> </author> <author initials="E." 
surname="Madden" fullname="Liz Madden"> From fd267b68e091d310d8a87dd3ffe7d67253e5dd1a Mon Sep 17 00:00:00 2001 From: John Scancella <john.scancella@gmail.com> Date: Tue, 28 Aug 2018 06:42:48 -0400 Subject: [PATCH 138/144] updated email address --- bagit.xml | 12 +----------- 1 file changed, 1 insertion(+), 11 deletions(-) diff --git a/bagit.xml b/bagit.xml index 428b2ebd..08199a7d 100644 --- a/bagit.xml +++ b/bagit.xml @@ -89,18 +89,8 @@ </address> </author> <author initials="J." surname="Scancella" fullname="John Scancella"> - <organization> - Library of Congress - </organization> <address> - <postal> - <street>101 Independence Avenue SE</street> - <city>Washington</city> - <region>DC</region> - <code>20540</code> - <country>USA</country> - </postal> - <email>jsca@loc.gov</email> + <email>john.scancella@gmail.com</email> </address> </author> <author initials="C." surname="Adams" fullname="Chris Adams"> From 3d0721758459a1ce800100a4e243efedcb198886 Mon Sep 17 00:00:00 2001 From: Justin Littman <justinlittman@gmail.com> Date: Wed, 29 Aug 2018 16:51:25 -0700 Subject: [PATCH 139/144] Added clarification about malicious attackers. --- bagit.xml | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/bagit.xml b/bagit.xml index 428b2ebd..efee0d54 100644 --- a/bagit.xml +++ b/bagit.xml @@ -151,7 +151,9 @@ in two general areas: Strong integrity assurances. The format supports cryptographic-quality hash algorithms (see <xref target="bag-checksum-algorithms"/>) and allows for in-place upgrades to add additional manifests using stronger algorithms - without breaking backwards compatibility. + without breaking backwards compatibility. This integrity assurance is with + respect to data corruption, not with respect to a malicious attack seeking + to replace payload file contents. </t><t> Direct file access. 
Because BagIt specifies an actual filesystem hierarchy rather than a serialized representation of one, files can be accessed @@ -869,6 +871,16 @@ highsmith-tahoe/ </t> </section> + <section title="Attacks on payload file content"> + <t> + The integrity assurance provided by manifests is with respect to data + corruption, not with respect to a malicious attack seeking to replace + payload file contents. + </t> + </section> + + + <!-- End Section: Special directory characters --> </section> <!-- End Section: Security considerations --> From 0082b4614fecc41e67445d2d8508e58cda7148fa Mon Sep 17 00:00:00 2001 From: Justin Littman <justinlittman@gmail.com> Date: Wed, 29 Aug 2018 17:02:21 -0700 Subject: [PATCH 140/144] Changed reference to character set registry. --- bagit.xml | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/bagit.xml b/bagit.xml index 428b2ebd..6c6295a6 100644 --- a/bagit.xml +++ b/bagit.xml @@ -9,7 +9,6 @@ <!ENTITY RFC2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml"> <!ENTITY RFC4234 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.4234.xml"> <!ENTITY RFC2629 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2629.xml"> -<!ENTITY RFC2978 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2978.xml"> <!ENTITY RFC3174 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3174.xml"> <!ENTITY RFC3552 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3552.xml"> <!ENTITY RFC3629 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3629.xml"> @@ -306,7 +305,7 @@ Tag-File-Character-Encoding: ENCODING </artwork> <spanx style="emph">ENCODING</spanx> &should; be <spanx style="verb">UTF-8</spanx> but for backwards compatibility it &may; be any - other encoding registered in <xref target="RFC2978"/>. + other encoding registered in <xref target="cs-registry"/>. 
The bag declaration itself &must; be encoded in UTF-8, and &must-not; contain a byte-order mark (BOM) <xref target="RFC3629"/>. @@ -1176,7 +1175,13 @@ This draft does not request any action from IANA. <date year="2016" month="9" day="14"/> </front> </reference> - &RFC2978; <!-- character sets --> + <reference anchor="cs-registry" target="https://www.iana.org/assignments/character-sets/character-sets.xhtml"> + <front> + <title>Character Set Registry + IANA + + + &RFC3174; &RFC3629; &RFC3986; From 27da533362c66095f993d969246d3be6b967982d Mon Sep 17 00:00:00 2001 From: Justin Littman Date: Thu, 13 Sep 2018 06:58:44 -0400 Subject: [PATCH 141/144] Updates of wording. --- bagit.xml | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/bagit.xml b/bagit.xml index efee0d54..56be629c 100644 --- a/bagit.xml +++ b/bagit.xml @@ -151,9 +151,9 @@ in two general areas: Strong integrity assurances. The format supports cryptographic-quality hash algorithms (see ) and allows for in-place upgrades to add additional manifests using stronger algorithms - without breaking backwards compatibility. This integrity assurance is with - respect to data corruption, not with respect to a malicious attack seeking - to replace payload file contents. + without breaking backwards compatibility. This provides high + levels of confidence against data corruption but is not designed + to be secure against active attacks. Direct file access. Because BagIt specifies an actual filesystem hierarchy rather than a serialized representation of one, files can be accessed @@ -873,9 +873,12 @@ highsmith-tahoe/
- The integrity assurance provided by manifests is with respect to data - corruption, not with respect to a malicious attack seeking to replace - payload file contents. + The integrity assurance provided by manifests is designed to provide + high levels of confidence against data corruption but is not designed + to be secure against active attacks. Organizations which need to + secure bags against such threats will need to agree upon additional + measures such as digital signatures which are out + of scope for this specification.
From d597d91d922471af2870577df5ee6e461bd429d5 Mon Sep 17 00:00:00 2001 From: "John A. Kunze" Date: Thu, 13 Sep 2018 08:52:09 -0700 Subject: [PATCH 142/144] Update bagit.xml --- bagit.xml | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/bagit.xml b/bagit.xml index 2354fb8c..296f1c94 100644 --- a/bagit.xml +++ b/bagit.xml @@ -864,9 +864,9 @@ highsmith-tahoe/ The integrity assurance provided by manifests is designed to provide high levels of confidence against data corruption but is not designed - to be secure against active attacks. Organizations which need to - secure bags against such threats will need to agree upon additional - measures such as digital signatures which are out + to be secure against active attacks. Organizations that need to + secure bags against such threats &should; agree on additional + measures, such as digital signatures, that are out of scope for this specification.
From df1db486d80fc41fbc22482e944a33c3e59ecd4d Mon Sep 17 00:00:00 2001 From: "John A. Kunze" Date: Mon, 17 Sep 2018 11:17:24 -0700 Subject: [PATCH 143/144] update draft number and draft date --- bagit.xml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/bagit.xml b/bagit.xml index 296f1c94..0c6476f8 100644 --- a/bagit.xml +++ b/bagit.xml @@ -37,7 +37,7 @@ - + The BagIt File Packaging Format (V&current-bagit-version;) @@ -107,7 +107,7 @@ <email>cadams@loc.gov</email> </address> </author> - <date day="4" month="June" year="2018"/> + <date day="17" month="September" year="2018"/> <abstract> <t> This document describes BagIt, a set of hierarchical file layout conventions for From 73a516bc83b09aa334ec5f073d10743156677707 Mon Sep 17 00:00:00 2001 From: Chris Adams <cadams@loc.gov> Date: Wed, 14 Nov 2018 15:36:02 -0500 Subject: [PATCH 144/144] Final XML from IETF --- bagit.xml | 633 +++++++++++++++++++++++++++--------------------------- 1 file changed, 319 insertions(+), 314 deletions(-) diff --git a/bagit.xml b/bagit.xml index 0c6476f8..63f8c985 100644 --- a/bagit.xml +++ b/bagit.xml @@ -1,13 +1,9 @@ -<?xml version="1.0"?> -<!-- - See http://xml.resource.org/ for formatting tools to work with the RFC 7749 - XML format ---> +<?xml version="1.0" encoding="US-ASCII"?> <!DOCTYPE rfc SYSTEM "rfc2629.dtd" [ -<!ENTITY mdash "—"> + <!ENTITY RFC1321 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.1321.xml"> <!ENTITY RFC2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml"> -<!ENTITY RFC4234 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.4234.xml"> +<!ENTITY RFC5234 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5234.xml"> <!ENTITY RFC2629 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2629.xml"> <!ENTITY RFC3174 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3174.xml"> <!ENTITY RFC3552 SYSTEM
"http://xml.resource.org/public/rfc/bibxml/reference.RFC.3552.xml"> @@ -17,30 +13,24 @@ <!ENTITY RFC6234 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.6234.xml"> <!ENTITY RFC6920 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.6920.xml"> <!ENTITY RFC8174 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.8174.xml"> -<!-- RFC 2119 entities - for convenience --> -<!ENTITY must "MUST"> -<!ENTITY must-not "MUST NOT"> -<!ENTITY required "REQUIRED"> -<!ENTITY shall "SHALL"> -<!ENTITY shall-not "SHALL NOT"> -<!ENTITY should "SHOULD"> -<!ENTITY should-not "SHOULD NOT"> -<!ENTITY recommended "RECOMMENDED"> -<!ENTITY not-recommended "NOT RECOMMENDED"> -<!ENTITY may "MAY"> -<!ENTITY optional "OPTIONAL"> -<!-- The current bagit version, for convenience. --><!ENTITY current-bagit-version "1.0"> ]> + + <?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?> <?rfc strict="yes" ?> -<?rfc comments="no"?> <?rfc inline="yes"?> <?rfc symrefs="yes"?> +<?rfc sortrefs="yes"?> <?rfc toc="yes"?> -<rfc category="info" docName="draft-kunze-bagit-17" ipr="trust200902"> + +<rfc number="8493" + category="info" + submissionType="independent" + consensus="yes" + ipr="trust200902"> <front> <title abbrev="BagIt"> - The BagIt File Packaging Format (V¤t-bagit-version;) + The BagIt File Packaging Format (V1.0) @@ -52,7 +42,7 @@ Oakland CA 94612 - US + United States of America jak@ucop.edu @@ -67,7 +57,7 @@ Stanford CA 94305 - USA + United States of America justinlittman@stanford.edu @@ -82,7 +72,7 @@ Washington DC 20540 - USA + United States of America emad@loc.gov @@ -102,12 +92,12 @@ Washington DC 20540 - USA + United States of America cadams@loc.gov - + This document describes BagIt, a set of hierarchical file layout conventions for @@ -124,16 +114,16 @@ BagIt format is suitable for reliable storage and transfer. BagIt is a set of hierarchical file layout conventions designed to support storage and transfer of arbitrary digital content. 
-A bag consists of a directory containing the payload files and other accompanying +A "bag" consists of a directory containing the payload files and other accompanying metadata files known as "tag" files. The "tags" are metadata files intended to facilitate and document the storage and transfer of the bag. Processing a bag -does not require any understanding of the payload file contents and the payload +does not require any understanding of the payload file contents, and the payload files can be accessed without processing the BagIt metadata. The name, BagIt, is inspired by the "enclose and deposit" method , sometimes referred to as "bag it and tag it". -BagIt differs from serialized archive formats such as MIME, TAR, or ZIP +BagIt differs from serialized archival formats such as MIME, TAR, or ZIP in two general areas: @@ -141,13 +131,13 @@ in two general areas: hash algorithms (see ) and allows for in-place upgrades to add additional manifests using stronger algorithms without breaking backwards compatibility. This provides high - levels of confidence against data corruption but is not designed + levels of confidence against data corruption, but it is not designed to be secure against active attacks. Direct file access. Because BagIt specifies an actual filesystem hierarchy rather than a serialized representation of one, files can be accessed using standard operating system utilities, implementations do not need - to process a potentially large archive file to extract a subset of data, + to process a potentially large archival file to extract a subset of data, and the format imposes no size limits for either individual files or a bag. @@ -156,7 +146,7 @@ BagIt is widely used for preserving digital assets originating from different domains. Organizations involved in digital preservation with BagIt include the Library of Congress, Dryad Data Repository, NSF DataONE, and the Rockefeller Archive Center. 
Software implementations are available for many -languages including Python, Ruby, Java, Perl, and PHP. It is also used in +languages, including Python, Ruby, Java, Perl, and PHP. It is also used in the libraries of many universities, such as Cornell, Purdue, Stanford, Ghent University, New York University, and the University of California. @@ -164,11 +154,11 @@ Ghent University, New York University, and the University of California.
-The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", -"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" -in this document are to be interpreted as described in BCP 14 - when, and only when, they -appear in all capitals as shown here. + The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL + NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", + "MAY", and "OPTIONAL" in this document are to be interpreted as + described in BCP 14 + when, and only when, they appear in all capitals, as shown here. Implementers are strongly encouraged to review the interoperability @@ -182,53 +172,54 @@ appear in all capitals as shown here. - + A set of opaque files contained within the structure defined by this document. - + The file required to be in all bags conforming to this document. Contains values necessary to process the rest of a bag. See . - - The name of a cryptographic checksum algorithm which has been normalized - for use in a manifest or tag manifest file name (e.g. "sha512") + + The name of a cryptographic checksum algorithm that has been normalized + for use in a manifest or tag manifest file name (e.g., "sha512") as described in . - - A tag file thats maps filepaths to checksums. A manifest can be a payload - manifest or a tag manifest . + + A tag file that maps filepaths to checksums. A manifest can be a payload + manifest (see ) or a + tag manifest (see ). - + The data encapsulated by the bag as a set of named files, which may be - organized in sub-directories. The contents of the payload files + organized in subdirectories. The contents of the payload files are opaque to this document, and, with respect to BagIt processing, are always considered as sequences of uninterpreted octets. See . - + A directory that contains one or more tag files. - - A file which contains metadata about the bag or its payload. + + A file that contains metadata about the bag or its payload. 
This document defines the standard BagIt tag files: - the bag declaration in "bagit.txt" , - payload manifests , - tag manifests , - bag metadata in "bag-info.txt" , - and remote payload in "fetch.txt" . + the bag declaration in "bagit.txt" (see ), + payload manifests (see ), + tag manifests (see ), + bag metadata in "bag-info.txt" (see ), + and remote payload in "fetch.txt" (see ). This document also allows other arbitrary tag files as described in . - - A bag which contains every element required by this document, - every payload file listed in a manifest, and any optional files which are + + A bag that contains every element required by this document, + every payload file listed in a manifest, and any optional files that are listed in a tag manifest. See . - + A complete bag where every checksum in every manifest has been successfully verified against the corresponding file. @@ -240,28 +231,30 @@ appear in all capitals as shown here.
- A bag &must; consist of a base directory containing: + A bag MUST consist of a base directory containing the following: - a set of required and optional tag files - a sub-directory named "data", called the payload directory. - a set of optional tag directories + a set of required and optional tag files (see ); + a subdirectory named "data", called the payload directory (see + ); and + a set of optional tag directories. The tag files in the base directory consist of one or more files named "manifest-algorithm.txt" -(see and -), +(see Sections and +), a file named "bagit.txt" (see ), and zero or more additional tag files (see ). The tag files and directories are -in arbitrary file hierarchies and &may; have +in arbitrary file hierarchies and MAY have any name that is not reserved for a file or directory in this document. + -The base directory can have any name. +The base directory can have any name, as illustrated by the figure below.
@@ -284,7 +277,7 @@ The base directory can have any name.
- The "bagit.txt" tag file &must; consist of exactly two lines in this order: + The "bagit.txt" tag file MUST consist of exactly two lines in this order:
@@ -294,32 +287,32 @@ Tag-File-Character-Encoding: ENCODING M.N identifies the BagIt major (M) and minor (N) version numbers. ENCODING identifies the character set encoding used by the remaining tag files. - ENCODING &should; - be UTF-8 but - for backwards compatibility it &may; be any + ENCODING SHOULD + be UTF-8, but + for backwards compatibility it MAY be any other encoding registered in . - The bag declaration itself &must; be encoded in UTF-8, and &must-not; contain a - byte-order mark (BOM) . + The bag declaration itself MUST be encoded in UTF-8 and MUST NOT contain a + Byte Order Mark (BOM) .
- The number for this version of BagIt is "&current-bagit-version;". + The number for this version of BagIt is "1.0".
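The two-line bag declaration described in this hunk can be checked mechanically. The following is a minimal sketch, not part of the patch: the function name, the regular expressions, and the error messages are hypothetical, and it assumes exactly one space after each colon, as in the layout shown in the specification text.

```python
import re

# Hypothetical helper names; only the two-line layout comes from the spec text.
VERSION_RE = re.compile(r"BagIt-Version: ([0-9]+)\.([0-9]+)")
ENCODING_RE = re.compile(r"Tag-File-Character-Encoding: (\S+)")
UTF8_BOM = b"\xef\xbb\xbf"


def parse_bag_declaration(raw: bytes):
    """Parse the bytes of a bagit.txt file into ((major, minor), encoding)."""
    if raw.startswith(UTF8_BOM):
        raise ValueError("bagit.txt must not contain a byte order mark")
    text = raw.decode("utf-8")  # the declaration itself is always UTF-8
    lines = re.split(r"\r\n|\r|\n", text)
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after a final line terminator
    if len(lines) != 2:
        raise ValueError("bagit.txt must consist of exactly two lines")
    version = VERSION_RE.fullmatch(lines[0])
    encoding = ENCODING_RE.fullmatch(lines[1])
    if version is None or encoding is None:
        raise ValueError("unrecognized bag declaration")
    return (int(version.group(1)), int(version.group(2))), encoding.group(1)
```

The file is read as bytes so that the byte order mark check happens before any decoding; the returned encoding name would then be used to decode the remaining tag files.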
- The base directory &must; contain a sub-directory named "data". + The base directory MUST contain a subdirectory named "data". The payload directory contains the arbitrary digital content within the bag. The files under the payload directory are called payload files, or the payload. Each payload file is treated as an opaque octet stream when verifying file correctness. - Payload files &may; be organized in arbitrary sub-directory structures - within the payload directory, however for the purpose of this document - such sub-directory structures and filenames have no given meaning. + Payload files MAY be organized in arbitrary subdirectory structures + within the payload directory; however, for the purpose of this document, + such subdirectory structures and filenames have no given meaning.
@@ -328,20 +321,20 @@ Tag-File-Character-Encoding: ENCODING A payload manifest file provides a complete listing of each payload file name along with a corresponding checksum to permit data integrity checking. A bag can have more than one payload manifest, with each using a different checksum algorithm. - Manifest entries &must; satisfy the following constraints: + Manifest entries MUST satisfy the following constraints: - Every bag &must; contain at least one payload manifest file and &may; contain + Every bag MUST contain at least one payload manifest file and MAY contain more than one. - Every payload manifest &must; list every payload file name exactly once. + Every payload manifest MUST list every payload file name exactly once. - A payload manifest file &must; have a name of the form + A payload manifest file MUST have a name of the form "manifest-algorithm.txt", where algorithm is a string specifying the checksum algorithm used by that @@ -351,45 +344,46 @@ Tag-File-Character-Encoding: ENCODING +Example payload manifest filenames:
- Example payload manifest filenames manifest-sha256.txt -manifest-sha512.txt +manifest-sha512.txt +
- Each line of a payload manifest file &must; be of the form: + Each line of a payload manifest file MUST be of the form
checksum filepath - - where filepath is the pathname of a file - relative to the base directory, and checksum is - a hex-encoded checksum calculated according to - algorithm over every octet in the file. -
+where filepath is the pathname of a file +relative to the base directory, and checksum is a +hex-encoded checksum calculated by applying algorithm over the file. + + - The hex-encoded checksum &may; use uppercase and/or lowercase letters. - The slash character ('/') &must; be used as a path separator + The hex-encoded checksum MAY use uppercase and/or lowercase letters. + The slash character ('/') MUST be used as a path separator in filepath. One or more linear whitespace characters (spaces or tabs) - &must; separate checksum from + MUST separate checksum from filepath. There is no limitation on the length of a pathname. - The payload manifest &must-not; reference files outside the payload directory. + The payload manifest MUST NOT reference files outside the payload directory. - If a filepath includes a line feed - (LF), a carriage return (CR), - carriage return plus line feed (CRLF) or - percent sign (%), those characters (and only those) &must; be + If a filepath includes a Line Feed + (LF), a Carriage Return (CR), + a Carriage-Return Line Feed (CRLF), or a + percent sign (%), those characters (and only those) MUST be percent-encoded following . -A manifest &must-not; reference directories. Bag creators who wish to create +A manifest MUST NOT reference directories. Bag creators who wish to create an otherwise empty directory have typically done so by creating an empty placeholder file with a name such as ".keep". @@ -405,33 +399,33 @@ placeholder file with a name such as ".keep". checksum algorithm. - A bag &may; contain one or more tag manifests, in which case each tag manifest &should; list the same set of tag files. + A bag MAY contain one or more tag manifests, in which case each tag manifest SHOULD list the same set of tag files. - Each tag manifest &must; list every payload manifest. - Each tag manifest &must-not; list any tag manifests, - but &should; list the remaining tag files present in the bag. + Each tag manifest MUST list every payload manifest. 
+ Each tag manifest MUST NOT list any tag manifests + but SHOULD list the remaining tag files present in the bag. - A tag manifest file &must; have a name of the form + A tag manifest file MUST have a name of the form "tagmanifest-algorithm.txt", where algorithm is a string following the format described in - specifying the bag checksum algorithm used in that manifest. + that specifies the bag checksum algorithm used in that manifest. - Tag manifests &should; use the same algorithms as the payload manifests that are present in the bag. + Tag manifests SHOULD use the same algorithms as the payload manifests that are present in the bag. +Example tag manifest filenames:
- Example tag manifest filenames: tagmanifest-sha256.txt tagmanifest-sha512.txt
-A tag manifest file has the same form as the payload file manifest -file described in , -but &must-not; list any payload files. +A tag manifest file has the same form as the payload manifest file +described in +but MUST NOT list any payload files. As a result, no filepath listed in a tag manifest begins "data/".
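The manifest line rules in the preceding hunks (hex checksum, linear whitespace, filepath with only LF, CR, and percent sign percent-encoded, payload entries under "data/", tag manifest entries outside it) can be sketched as follows. All function names are hypothetical, and lowercasing the checksum for later comparison is a design choice of this sketch rather than a requirement of the text.

```python
import re

LINE_RE = re.compile(r"([0-9A-Fa-f]+)[ \t]+(.+)")
ESCAPE_RE = re.compile(r"%(0[AaDd]|25)")


def decode_filepath(encoded: str) -> str:
    """Undo percent-encoding of LF, CR and '%' only, in a single pass."""
    return ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), encoded)


def encode_filepath(path: str) -> str:
    """Percent-encode '%', LF and CR; '%' goes first to avoid re-encoding."""
    return path.replace("%", "%25").replace("\n", "%0A").replace("\r", "%0D")


def parse_manifest_line(line: str, tag_manifest: bool = False):
    """Return (lowercase checksum, decoded filepath) for one manifest line.

    The line is expected without its terminator (LF, CR, or CRLF).
    """
    match = LINE_RE.fullmatch(line)
    if match is None:
        raise ValueError("expected 'checksum filepath'")
    checksum = match.group(1).lower()
    filepath = decode_filepath(match.group(2))
    if ".." in filepath.split("/"):
        raise ValueError("filepath must not climb out of the bag")
    in_payload = filepath.startswith("data/")
    if tag_manifest and in_payload:
        raise ValueError("a tag manifest must not list payload files")
    if not tag_manifest and not in_payload:
        raise ValueError("a payload manifest must only list payload files")
    return checksum, filepath
```

Decoding in one regular-expression pass matters: decoding "%25" first and "%0A" afterwards would turn the literal text "%250A" into a line feed instead of "%0A".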
@@ -441,121 +435,121 @@ As a result, no filepath listed in a tag manifest be The "bag-info.txt" file is a tag file that contains metadata elements describing the bag and the payload. The metadata elements contained in the "bag-info.txt" file are intended primarily for - human use. All metadata elements are &optional; and &may; be repeated. + human use. All metadata elements are OPTIONAL and MAY be repeated. Because "bag-info.txt" is intended for human reading - and editing, ordering &may; be significant and the ordering of - metadata elements &must; be preserved. + and editing, ordering MAY be significant and the ordering of + metadata elements MUST be preserved. - A metadata element &must; consist of a label, a colon ":", a single - linear whitespace character (space or tab), and a value, terminated with a line feed (CR), carriage return (LF) or - carriage return plus line feed (CRLF). + A metadata element MUST consist of a label, a colon ":", a single + linear whitespace character (space or tab), and a value that is + terminated with an LF, a CR, or a CRLF. - The label &must-not; contain colon (:), line feeds (LF) or carriage returns (CR). - The label &may; contain linear whitespace characters, but &must-not; start or + The label MUST NOT contain a colon (:), LF, or CR. + The label MAY contain linear whitespace characters but MUST NOT start or end with whitespace. - It is &recommended; that lines not exceed 79 characters in length. Long values &may; be - continued onto the next line by inserting a line feed (LF), a carriage - return (CR), or carriage return plus line feed (CRLF) and indenting - the next line with one or more linear white space (spaces or tabs). - Except for linebreaks such padding does not form part of the value. + It is RECOMMENDED that lines not exceed 79 characters in length. 
Long values MAY be + continued onto the next line by inserting a LF, CR, or CRLF, and then indenting + the next line with one or more linear white space characters (spaces or tabs). + Except for linebreaks, such padding does not form part of the value. Implementations wishing to support previous BagIt versions - &must; accept multiple linear whitespace before and after the + MUST accept multiple linear whitespace characters before and after the colon when the bag version is earlier than 1.0; such whitespace does not form part of the label or value. The following are reserved metadata elements. The use of these reserved - metadata elements are &optional; but encouraged. Reserved metadata - element names are case-insensitive. Except where indicated otherwise, - these metadata element names &may; be repeated to capture multiple values. + metadata elements is OPTIONAL but encouraged. Reserved metadata + element names are case insensitive. Except where indicated otherwise, + these metadata element names MAY be repeated to capture multiple values. - + Organization transferring the content. - + Mailing address of the source organization. - + Person at the source organization who is responsible for the content transfer. - + International format telephone number of person or position responsible. - + Fully qualified email address of person or position responsible. - + A brief explanation of the contents and provenance. - + Date (YYYY-MM-DD) that the content was prepared for transfer. - This metadata element &should-not; be repeated. + This metadata element SHOULD NOT be repeated. - + A sender-supplied identifier for the bag. - - Size or approximate size of the bag being transferred, followed - by an abbreviation such as MB (megabytes), GB, or TB; for example, + + The size or approximate size of the bag being transferred, followed + by an abbreviation such as MB (megabytes), GB (gigabytes), or + TB (terabytes): for example, 42600 MB, 42.6 GB, or .043 TB. 
Compared to Payload-Oxum (described next), Bag-Size is intended for human consumption. - This metadata element &should-not; be repeated. + This metadata element SHOULD NOT be repeated. - - The "octetstream sum" of the payload, intended for the + + The "octetstream sum" of the payload, which is intended for the purpose of quickly detecting incomplete bags before performing checksum - validation. This is strictly an optimization and implementations &must; perform + validation. This is strictly an optimization, and implementations MUST perform the standard checksum validation process before proclaiming a bag to be valid. - This element &must-not; be present more than once and, if present, &must; + This element MUST NOT be present more than once and, if present, MUST be in the form "OctetCount.StreamCount", where OctetCount is the total number of octets (8-bit bytes) across all payload file content and StreamCount is the total number of payload files. - This metadata element &must-not; be repeated. + This metadata element MUST NOT be repeated. - + A sender-supplied identifier for the set, if any, of bags to which it logically belongs. - This identifier &should; be unique across the sender's content, and if - recognizable as belonging to a globally unique scheme, the receiver - &should; make an effort to honor reference to it. - This metadata element &should-not; be repeated. + This identifier SHOULD be unique across the sender's content, + and if it is recognizable as belonging to a globally unique scheme, the receiver + SHOULD make an effort to honor the reference to it. + This metadata element SHOULD NOT be repeated. - + Two numbers separated by "of", in particular, "N of T", where T is the total number of bags in a group of bags and N is the - ordinal number within the group; if T is not known, specify it as "?" - (question mark). Examples: 1 of 2, 4 of 4, 3 of ?, 89 of 145. - This metadata element &should-not; be repeated. 
- If this metadata element is present, it is &recommended; to also + ordinal number within the group. If T is not known, specify it as "?" + (question mark): for example, 1 of 2, 4 of 4, 3 of ?, 89 of 145. + This metadata element SHOULD NOT be repeated. + If this metadata element is present, it is RECOMMENDED to also include the Bag-Group-Identifier element. - + An alternate sender-specific identifier for the content and/or bag. - + A sender-local explanation of the contents and provenance. In addition to these metadata elements, other arbitrary metadata - elements &may; also be present. + elements MAY also be present.
- An example "bag-info.txt" file + An example of "bag-info.txt" file is as follows: Source-Organization: FOO University Organization-Address: 1 Main St., Cupertino, California, 11111 @@ -578,7 +572,7 @@ Internal-Sender-Description: Uncompressed greyscale TIFFs created
- For reasons of efficiency, a bag &may; be sent with a list of files to be + For reasons of efficiency, a bag MAY be sent with a list of files to be fetched and added to the payload before it can meaningfully be checked for completeness. The fetch file allows a bag to be transmitted with @@ -590,22 +584,22 @@ Internal-Sender-Description: Uncompressed greyscale TIFFs created object could be much larger than what is intended for export) or assembled from logically distributed sources (e.g., the object components for export are not stored locally under one filesystem tree). - An &optional; tag file called the fetch file contains such a list. + An OPTIONAL tag file, called the fetch file, contains such a list. - The fetch file &must; be named "fetch.txt". Every file listed in - the fetch file &must; be listed in every - payload manifest. A fetch file &must-not; list any tag files. + The fetch file MUST be named "fetch.txt". Every file listed in + the fetch file MUST be listed in every + payload manifest. A fetch file MUST NOT list any tag files. - Each line of a fetch file &must; be of the form: + Each line of a fetch file MUST be of the form
url length filepath where url identifies the file to be - fetched and &must; be an absolute URI as defined in + fetched and MUST be an absolute URI as defined in , length is the number of octets in the file (or "-", to leave it unspecified), and filepath identifies the @@ -614,16 +608,14 @@ Internal-Sender-Description: Uncompressed greyscale TIFFs created
- The slash character ('/') &must; be used as a path separator in + The slash character ('/') MUST be used as a path separator in filepath. One or more linear whitespace - characters (spaces or tabs) &must; separate these + characters (spaces or tabs) MUST separate these three values, and any such characters in the url - &must; be percent-encoded . - If filename includes a line feed - (LF), a carriage return (CR), - carriage return plus line feed (CRLF) or - percent sign (%), those characters (and only those) &must; be - percent-encoded following . + MUST be percent-encoded . + If filename includes an LF, a CR, + a CRLF, or a percent sign (%), those characters (and only those) MUST be + percent-encoded as described in . There is no limitation on the length of any of the fields in the fetch file. @@ -632,11 +624,11 @@ Internal-Sender-Description: Uncompressed greyscale TIFFs created
- A bag &may; contain other tag files that are not defined by this + A bag MAY contain other tag files that are not defined by this document. - Implementations &must; perform standard checksum validation on any tag file - which is listed in a tag manifest but &must; otherwise ignore their contents. + Implementations MUST perform standard checksum validation on any tag file + that is listed in a tag manifest but MUST otherwise ignore their contents.
@@ -644,51 +636,51 @@ Internal-Sender-Description: Uncompressed greyscale TIFFs created
- All tag files specifically described in this document &must; adhere to - the text tag file format described below. Other tag files &may; adhere to + All tag files specifically described in this document MUST adhere to + the text tag file format described below. Other tag files MAY adhere to the text tag file format described below. + - Text tag files are line-oriented, and each line &must; be terminated - by a line feed (LF), a carriage return (CR), or carriage return plus - newline (CRLF). It is &recommended; that the last line in a tag - file also ends with LF, CR, or CRLF. - Text tag file names &must; end in the extension ".txt". + Text tag files are line oriented, and each line MUST be terminated + by an LF, a CR, or a CRLF. It is RECOMMENDED that the last line in a tag + file also end with LF, CR, or CRLF. + Text tag file names MUST end in the extension ".txt". -In all text tag files except for the bag declaration file, text &must; use +In all text tag files except for the bag declaration file, text MUST use the character encoding specified in the "bagit.txt" bag declaration -file. Text tag files except for the bag declaration file &may; include a -byte-order mark (BOM) only if the specified encoding requires it for +file. Text tag files except for the bag declaration file MAY include a +Byte Order Mark (BOM) only if the specified encoding requires it for proper decoding. In accordance with , when "bagit.txt" -specifies UTF-8 the tag files &must-not; begin with a byte-order mark (BOM). -See +specifies UTF-8, the tag files MUST NOT begin with a BOM. +See . -The use of UTF-8 for text tag files is strongly &recommended;. A future version +The use of UTF-8 for text tag files is strongly RECOMMENDED. A future version of BagIt may disallow encodings other than UTF-8.
-The payload manifest and tag manifests permit validating the integrity of the payload +The payload manifest and tag manifest permit validating the integrity of the payload and tag files in a bag produced by the checksum algorithms. -Checksum values &must; be encoded so as to conform to the manifest format +Checksum values MUST be encoded so as to conform to the manifest format specified in . However, the internal details of a checksum are outside the scope of this document. - To avoid future ambiguity, the checksum algorithm &should; be registered + To avoid future ambiguity, the checksum algorithm SHOULD be registered in IANA's "Named Information Hash Algorithm Registry" - according to , but &may; for backwards compatibility also be + according to but MAY, for backwards compatibility, also be MD5 or SHA-1 . -The name of the checksum algorithm &must; be normalized for use in the +The name of the checksum algorithm MUST be normalized for use in the manifest's filename by lowercasing the common name of the algorithm and removing all non-alphanumeric characters. Following is a partial list -mapping common algorithm names to normalized names: +that maps common algorithm names to normalized names: MD5: md5 SHA-1: sha1 @@ -697,11 +689,11 @@ mapping common algorithm names to normalized names: - Starting with BagIt 1.0, bag creation and validation tools &must; support the - SHA-256 and SHA-512 algorithms and &should; enable + Starting with BagIt 1.0, bag creation and validation tools MUST support the + SHA-256 and SHA-512 algorithms and SHOULD enable SHA-512 by default when creating new bags. - For backwards compatibility implementers &should; support + For backwards compatibility, implementers SHOULD support MD5 and SHA-1 . Implementers are encouraged to simplify the process of adding additional @@ -712,29 +704,29 @@ mapping common algorithm names to normalized names:
-
+
-A complete bag &must; meet the following +A complete bag MUST meet the following requirements: - Every required element &must; be present (). - Every file listed in every tag manifest &must; be present. - Every file listed in every payload manifest &must; be present. - For BagIt 1.0, every payload file &must; be listed in every payload manifest. + Every required element MUST be present (see ). + Every file listed in every tag manifest MUST be present. + Every file listed in every payload manifest MUST be present. + For BagIt 1.0, every payload file MUST be listed in every payload manifest. Note that older versions of BagIt allowed payload files to be listed in just one of the manifests. - Every element present &must; conform to BagIt ¤t-bagit-version;. + Every element present MUST conform to BagIt 1.0. -A valid bag &must; meet the following requirements: +A valid bag MUST meet the following requirements: - The bag &must; be complete. + The bag MUST be complete. Every checksum in every payload manifest and tag manifest has been successfully verified against the contents of the corresponding file. @@ -744,12 +736,12 @@ A valid bag &must; meet the following requirements:
-
+
This is the layout of a basic bag containing an image and a companion - OCR file. Lines of file content are shown with added parentheses to + Optical Character Recognition (OCR) file. Lines of file content are shown with added parentheses to indicate each complete line. - For brevity this example uses MD5 rather than the recommended SHA-512. + For brevity, this example uses MD5 rather than the recommended SHA-512.
@@ -774,13 +766,13 @@ myfirstbag/ ....
-
+
- This is the layout of a bag which expects the receiver to download the + This is the layout of a bag that expects the receiver to download the files listed in the payload manifests prior to validation. Lines of file content are shown with added parentheses to indicate each complete line. - For brevity this example uses MD5 rather than the recommended SHA-512. + For brevity, this example uses MD5 rather than the recommended SHA-512.
@@ -806,35 +798,35 @@ highsmith-tahoe/
-
+
The paths specified in the payload manifests, tag manifests, and - fetch files do not prohibit special directory characters which have - special meaning on some operating systems. Implementers &must; ensure + fetch files do not prohibit special directory characters that have + special meaning on some operating systems. Implementers MUST ensure that files outside the bag directory structure are not accessed when reading or writing files based on paths specified in a bag. - All implementations &should; have a test suite to guard against + All implementations SHOULD have a test suite to guard against special directory characters. For example, a maliciously crafted "tagmanifest-sha512.txt" file might - contain entries which begin with a path character such as "/", "..", + contain entries that begin with a path character such as "/", "..", or a "~username" home directory reference in an attempt to cause a naive implementation to leak or overwrite targeted files on a POSIX operating system. - Windows implementations &should; test their implementations to ensure - that safety-checks prevent use of drive letters and the less commonly used - namespace sequences (e.g. "\\?\C:\…") described in . + Windows implementations SHOULD test their implementations to ensure + that safety checks prevent use of drive letters and the less commonly used + namespace sequences (e.g., "\\?\C:\...") described in . To assist implementers, the Library of Congress conformance suite has some tests for invalid bags - which are expected to fail on POSIX or Windows clients. + that are expected to fail on POSIX or Windows clients.
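The path checks described above can be sketched as follows. This is a minimal illustration in Python, not part of the specification or of any existing BagIt library: the helper name `resolve_in_bag` and the exact rejection rules are assumptions chosen for the example, and they are deliberately stricter than the text requires (any ".." component is refused, even one that would stay inside the bag).

```python
import os
import re

# Hypothetical helper for illustration only.
# Matches a Windows drive letter ("C:") or a path starting with two
# backslashes (UNC and "\\?\" namespace sequences).
_DRIVE_OR_NAMESPACE = re.compile(r"^(?:[A-Za-z]:|\\\\)")


def resolve_in_bag(bag_dir: str, entry: str) -> str:
    """Return the absolute path for a manifest or fetch entry.

    Raises ValueError when the entry could refer to a location
    outside the bag directory.
    """
    # Absolute paths, home directory references, drive letters, namespaces.
    if entry.startswith(("/", "~")) or _DRIVE_OR_NAMESPACE.match(entry):
        raise ValueError(f"unsafe path in bag: {entry!r}")

    # Any ".." component, regardless of which separator was used.
    if ".." in entry.replace("\\", "/").split("/"):
        raise ValueError(f"parent directory reference in bag: {entry!r}")

    # Final check after resolving symbolic links on both sides.
    root = os.path.realpath(bag_dir)
    candidate = os.path.realpath(os.path.join(root, entry))
    if os.path.commonpath([root, candidate]) != root:
        raise ValueError(f"path escapes the bag directory: {entry!r}")
    return candidate
```

A test suite of the kind recommended above could feed this helper entries such as "/etc/passwd", "../x", "~root/.ssh", and "\\?\C:\x" and assert that each one is rejected.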
@@ -848,24 +840,24 @@ highsmith-tahoe/
-
+
The size of files, as optionally reported in the fetch file, cannot be guaranteed to match the actual file size to be downloaded. - Implementers &should; take steps to monitor and abort transfer when the + Implementers SHOULD take steps to monitor and abort transfer when the received file size exceeds the file size reported in the fetch file. - Implementers &should-not; use the file size in the + Implementers SHOULD NOT use the file size in the fetch file for critical resource allocation, such as buffer sizing or storage requisitioning.
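One way to follow both recommendations is sketched below in Python. The names `copy_with_limit` and `TransferTooLarge` are hypothetical, and the sketch assumes the caller has already opened the remote resource as a binary stream. The declared size is used only as an upper bound for aborting the transfer; the read buffer has a fixed size that does not depend on it.

```python
import io
from typing import BinaryIO, Optional


class TransferTooLarge(Exception):
    """Raised when more bytes arrive than the fetch file declared."""


def copy_with_limit(src: BinaryIO, dst: BinaryIO,
                    declared_size: Optional[int],
                    chunk_size: int = 64 * 1024) -> int:
    """Copy src to dst and return the number of bytes received.

    declared_size is the length from the fetch file, or None when no
    length was given. The buffer size is fixed by chunk_size and is
    never derived from declared_size.
    """
    received = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            return received
        received += len(chunk)
        if declared_size is not None and received > declared_size:
            # Abort before writing the excess data.
            raise TransferTooLarge(
                f"received {received} bytes, expected {declared_size}")
        dst.write(chunk)
```

A transfer that stays within the declared size returns normally; the checksum in the payload manifest remains the authority on whether the downloaded file is correct.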
-
+
The integrity assurance provided by manifests is designed to provide high levels of confidence against data corruption but is not designed to be secure against active attacks. Organizations that need to - secure bags against such threats &should; agree on additional + secure bags against such threats SHOULD agree on additional measures, such as digital signatures, that are out of scope for this specification. @@ -876,11 +868,11 @@ highsmith-tahoe/
-
+
This section lists practical considerations for implementers and - users. None of the points below are required but they are recommended + users. None of the points below are required, but they are recommended for general-purpose usage. @@ -896,45 +888,47 @@ highsmith-tahoe/ error handling. -
+
This section provides background information on various challenges caused by differences in how operating systems, filesystems, and common tools handle - filenames followed by a list of recommendations for implementers in + filenames. This section is followed by a list of recommendations for implementers in . -
+ +
- There are two challenges for interoperability related to filename case: + There are three challenges for interoperability related to filename case: - Filesystems such as FAT or EXFAT always convert filenames to uppercase: - "example.txt" will be stored as "EXAMPLE.TXT" + Filesystems such as File Allocation Table (FAT) or Extended File + Allocation Table (EXFAT) always convert filenames to uppercase: + "example.txt" will be stored as "EXAMPLE.TXT". - Many Unix filesystems save filenames exactly as provided, allowing - multiple files which differ only in case: "example.txt" and - "Example.txt" are separate files + Many Unix filesystems save filenames exactly as provided, which allows + multiple files that differ only in case: "example.txt" and + "Example.txt" are separate files. - NTFS and Apple's HFS Plus usually preserve case when storing files but are - case-insensitive when retrieving them. A file saved as "Example.txt" + New Technology File System (NTFS) and Apple's Hierarchical File System + (HFS) Plus usually preserve case when storing files but are + case insensitive when retrieving them. A file saved as "Example.txt" will be retrieved by that name but will also be retrieved as "EXAMPLE.TXT", "example.txt", etc.
-
+
The Unicode specification has common cases where different character sequences -produce the same human-meaningful text. These are referred to as “canonically -equivalent” and the Unicode specification defines different normalization -forms — see for the full details and a brief -example below: - +produce the same human-meaningful text. +These are referred to as "canonically equivalent" and the Unicode +specification defines different normalization forms - see for the full details.
- The common surname "Núñez" normalized in different forms +The example below shows the common surname "Nunez" normalized in different forms.
@@ -962,16 +956,16 @@ z 7a LATIN SMALL LETTER Z ]]> Apple's HFS Plus filesystem always normalizes filenames to a - fully-decomposed form based on the Unicode 2.0 specification (see ). + fully decomposed form based on the Unicode 2.0 specification (see ). Windows treats filenames as opaque character sequences (see ) and will store and return the encoded bytes exactly as provided. Linux and other common Unix systems are generally similar to Windows in - storing and returning opaque byte streams but this behaviour is - technically filesystem-dependent. + storing and returning opaque byte streams, but this behavior is + technically dependent on the filesystem. - Utilities used for file management, transfer, and archival may ignore this + Utilities used for file management, transfer, and archiving may ignore this issue, apply an arbitrary normalization form, or allow the user to control how normalization is applied. @@ -980,41 +974,41 @@ z 7a LATIN SMALL LETTER Z ]]> In practice, this means that the encoded filename stored in a manifest may fail a simple file existence check because the filename's normalization was changed at some point after the manifest was written. This situation is very - confusing for users because the filenames are visually indistinguishable and - the “missing” file is obviously present in the payload directory. + confusing for users because the filenames are visually indistinguishable, and + the "missing" file is obviously present in the payload directory.
- Implementations &should; discourage the creation of bags containing - files which differ only in case. + Implementations SHOULD discourage the creation of bags containing + files that differ only in case. - Implementations &should; prevent the creation of bags containing files - which differ only in normalization form. + Implementations SHOULD prevent the creation of bags containing files + that differ only in normalization form. - BagIt implementations &should; tolerate differences in normalization + BagIt implementations SHOULD tolerate differences in normalization form by comparing both the list of filesystem and manifest names after applying the same normalization form to both. - Implementations &should; issue a warning when multiple manifests are - present which differ only in case or normalization form. + Implementations SHOULD issue a warning when multiple manifests are + present that differ only in case or normalization form.
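The recommendations above can be sketched in Python using the standard `unicodedata` module. The function names are hypothetical, NFC is an arbitrary choice of common normalization form for the comparison, and combining normalization with `str.casefold()` is only an approximation of how a case-insensitive filesystem compares names.

```python
import unicodedata
from collections import defaultdict
from typing import Iterable, List


def nfc(name: str) -> str:
    """Normalize a filename to NFC for comparison purposes only."""
    return unicodedata.normalize("NFC", name)


def missing_after_normalization(manifest_names: Iterable[str],
                                filesystem_names: Iterable[str]) -> List[str]:
    """Return manifest names with no match on disk, comparing both
    lists after applying the same normalization form."""
    on_disk = {nfc(n) for n in filesystem_names}
    return [n for n in manifest_names if nfc(n) not in on_disk]


def collisions(names: Iterable[str]) -> List[List[str]]:
    """Group names that differ only in case or normalization form."""
    groups = defaultdict(list)
    for n in names:
        groups[nfc(n).casefold()].append(n)
    return [g for g in groups.values() if len(g) > 1]
```

A creation tool could refuse or warn when `collisions()` returns a non-empty list, while a validation tool could use `missing_after_normalization()` instead of a byte-for-byte existence check.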
-
+
As specified above, only the Unix-based path separator ('/') may be used inside filenames listed in BagIt manifest and fetch.txt files. When bags are exchanged between Windows and Unix platforms, - the path separator &should; be translated as needed. Receivers - of bags on physical media &should; be prepared for filesystems created + the path separator SHOULD be translated as needed. Receivers + of bags on physical media SHOULD be prepared for filesystems created under either Windows or Unix. Besides the fundamental difference between path separators ('\' and '/'), generally, Windows filesystems have more limitations than Unix filesystems. @@ -1042,34 +1036,34 @@ z 7a LATIN SMALL LETTER Z ]]> See for more information and possible alternatives.
-
+
Some bags have been manually assembled using checksum utilities such as those contained in the GNU Coreutils package (md5sum, sha1sum, etc.), collectively referred to here as "md5sum". Implementers who desire wide support of legacy -content should be aware of some known quirks of these tools: +content should be aware of some known quirks of these tools. -md5sum can be run in “text mode” which causes it to normalize line-endings -on some operating systems. On Unix-like systems both modes will usually produce -the same results but on systems like Windows they can produce different results +md5sum can be run in "text mode", which causes it to normalize line endings +on some operating systems. On Unix-like systems, both modes will usually produce +the same results; on systems like Windows, they can produce different results based on the file contents. The md5sum output format has two characters between the checksum and the -filepath: the first is always a space and the second is an asterisk ("*") for +filepath: the first is always a space, and the second is an asterisk ("*") for binary mode and a space for text mode. -A final note about md5sum-generated manifests is that for a filepath containing +A final note about md5sum-generated manifests is that, for a filepath containing a backslash ('\'), the manifest line will have a backslash inserted in front of the checksum and, under Windows, the backslashes inside filepath can be doubled. -Implementers &may; wish to accept this format by ignoring a leading asterisk or +Implementers MAY wish to accept this format by ignoring a leading asterisk or handling differences in line termination gracefully but, if so, implementations -&must; warn the user that the bag in question will fail strict validation. In -such cases it is &recommended; that tools provide an easy option to +MUST warn the user that the bag in question will fail strict validation. 
In +such cases, it is RECOMMENDED that tools provide an easy option to update the bag with valid manifests.
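A tolerant reader for md5sum-style lines might look like the following Python sketch. The function name `parse_md5sum_line` and the wording of the warnings are hypothetical. The sketch handles only the quirks listed above (the mode character, the leading backslash, and doubled backslashes), and it cannot distinguish a binary-mode asterisk from a filename that genuinely begins with "*" after a single space.

```python
import string
from typing import List, Tuple


def parse_md5sum_line(line: str) -> Tuple[str, str, List[str]]:
    """Split a manifest line into (checksum, filepath, warnings).

    A non-empty warnings list means the line relies on md5sum quirks,
    so the caller should warn that the bag will fail strict validation.
    """
    warnings: List[str] = []
    text = line.rstrip("\r\n")

    # md5sum puts a backslash in front of the checksum when the
    # filepath contained a backslash.
    escaped = text.startswith("\\")
    if escaped:
        text = text[1:]
        warnings.append("line starts with an md5sum escape backslash")

    checksum, sep, rest = text.partition(" ")
    if not sep or not rest:
        raise ValueError(f"not a manifest line: {line!r}")
    if not checksum or any(c not in string.hexdigits for c in checksum):
        raise ValueError(f"checksum is not hex-encoded: {checksum!r}")

    # Second separator character: "*" for binary mode, " " for text mode.
    if rest[0] == "*":
        rest = rest[1:]
        warnings.append("binary-mode asterisk before the filepath")
    elif rest[0] == " ":
        rest = rest[1:]

    if not rest:
        raise ValueError(f"missing filepath: {line!r}")
    if escaped:
        rest = rest.replace("\\\\", "\\")  # undo doubled backslashes
    return checksum, rest, warnings
```

A tool that offers to update the bag could rewrite each parsed entry in the manifest format defined by the specification once the checksums have been verified.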
@@ -1077,13 +1071,13 @@ update the bag with valid manifests.
-
+
 The Augmented Backus-Naur Form (ABNF) rules provided below are non-normative. If there is a discrepancy between requirements in the normative sections and the ABNF, the requirements in the normative sections prevail. Some -definitions use the core rules (e.g. DIGIT, HEXDIG, etc) as defined in - +definitions use the core rules (e.g., DIGIT, HEXDIG, etc.) as defined in +.
@@ -1147,44 +1141,31 @@ ending = CR / LF / CRLF ]]>
-
- -Additional contributors to the authoring of BagIt are Andy Boyko, David Brunton, Rosie Storey, -Ed Summers, Brian Vargas, and Kate Zwaard. - -
-
- -BagIt benefitted from the thoughtful assistance of -Stephen Abrams, Mike Ashenfelder, Dan Chudnov, Dave Crocker, Scott Fisher, Brad Hards, Erik Hetzner, -Keith Johnson, Leslie Johnston, David Loy, Mark Phillips, Tracy Seneca, Stian Soiland-Reyes, -Brian Tingle, Adam Turoff, and Jim Tuttle. - -
-This draft does not request any action from IANA. +This document has no IANA actions.
+ &RFC1321; &RFC2119; &RFC6920; - + - Named Information Hash Algorithm Registry + Named Information Hash Algorithm IANA - + - + - Character Set Registry + Character Set IANA - + &RFC3174; @@ -1194,30 +1175,39 @@ This draft does not request any action from IANA. &RFC8174; - + + A Collaboration Model between Archival Systems to Enhance the Reliability of Preservation by an Enclose-and-Deposit Method + + + + + - Naming a File + Naming Files, Paths, and Namespaces Microsoft, Inc. - + - &RFC4234; + &RFC5234; + - Unicode® Standard Annex #15: Unicode Normalization Forms + Unicode Standard Annex #15: Unicode Normalization Forms Unicode Consortium - + + @@ -1225,17 +1215,32 @@ This draft does not request any action from IANA. Technical Note TN1150: HFS Plus Volume Format Apple Inc. - + - BagIt Conformance Suite + Test cases for validating Bagit Implementations The Library of Congress - + + + +
+ BagIt benefitted from the thoughtful assistance of Stephen Abrams, + Mike Ashenfelder, Dan Chudnov, Dave Crocker, Scott Fisher, Brad Hards, + Erik Hetzner, Keith Johnson, Leslie Johnston, David Loy, Mark Phillips, + Tracy Seneca, Stian Soiland-Reyes, Brian Tingle, Adam Turoff, and Jim + Tuttle. + +
+
+ Additional contributors to the authoring of BagIt are Andy Boyko, + David Brunton, Rosie Storey, Ed Summers, Brian Vargas, and Kate + Zwaard. +