diff --git a/.gitattributes b/.gitattributes
new file mode 100644
index 00000000..dd89b145
--- /dev/null
+++ b/.gitattributes
@@ -0,0 +1,5 @@
+# Activate this after installing xml2rfc with this configuration:
+# git config --local diff.xml2rfc.textconv "xml2rfc --quiet --out=/dev/stdout"
+#
+# To temporarily disable this formatting, use `git diff --no-textconv`
+bagit.xml diff=xml2rfc
diff --git a/Makefile b/Makefile
new file mode 100644
index 00000000..6e033680
--- /dev/null
+++ b/Makefile
@@ -0,0 +1,13 @@
+default: html text
+
+text:
+	xml2rfc bagit.xml
+
+html:
+	xml2rfc --html bagit.xml
+
+format:
+	# We can't enable c14n because that triggers external DTD fetching and
+	# libxml2 currently does not support HTTPS, which is a problem now that all
+	# of the xml.resource.org URLs redirect:
+	xmllint --format --output bagit.xml bagit.xml
diff --git a/bagit.xml b/bagit.xml
index 68e83397..63f8c985 100644
--- a/bagit.xml
+++ b/bagit.xml
@@ -1,811 +1,751 @@
- - - - - + - - - - - - - - - - - - - - - - - - - - - - + + + + + + + + + + + + ]> - - + + + - - - - - - - -
- The BagIt File Packaging Format (V&current-bagit-version;)
+<rfc number="8493"
+     category="info"
+     submissionType="independent"
+     consensus="yes"
+     ipr="trust200902">
+  <front>
+    <title abbrev="BagIt">
+      The BagIt File Packaging Format (V1.0)
- - -
- - 1438 Kingfisher Way - Sunnyvale CA - 94087 - USA - - andy@boyko.net -
-
- - - - California Digital Library + + + California Digital Library -
- - 415 20th St, 4th Floor - Oakland CA - 94612 - US - - jak@ucop.edu -
-
- - - - Library of Congress +
+ + 415 20th St, 4th Floor + Oakland + CA + 94612 + United States of America + + jak@ucop.edu +
+
+ + + Stanford Libraries -
- - 101 Independence Avenue SE - Washington DC - 20540 - USA - - jlit@loc.gov -
-
- - - +
+ + 518 Memorial Way + Stanford + CA + 94305 + United States of America + + justinlittman@stanford.edu +
+
+ + Library of Congress -
- - 101 Independence Avenue SE - Washington DC - 20540 - USA - - emad@loc.gov -
-
- - - +
+ + 101 Independence Avenue SE + Washington + DC + 20540 + United States of America + + emad@loc.gov +
+
+ +
+ john.scancella@gmail.com +
+
+ + Library of Congress -
- - 101 Independence Avenue SE - Washington DC - 20540 - USA - - ehs@pobox.com -
-
- - -
- - 1354 Quincy St. NW - Washington DC - 20011 - USA - - brian@ardvaark.net -
-
- - - - - - -This document specifies BagIt, a hierarchical file packaging format for -storage and transfer of arbitrary digital content. A "bag" has just enough -structure to enclose descriptive "tags" and a "payload" but -does not require knowledge of the payload's internal semantics. This -BagIt format should be suitable for disk-based or network-based storage and -transfer. - - - - -
- - -
-
- -BagIt is a hierarchical file packaging format designed to support -disk-based or network-based storage and transfer of arbitrary digital -content. A bag consists of a "payload" and "tags". The content of the payload -is the custodial focus of the bag and is treated as semantically opaque. -The "tags" are metadata files intended to facilitate and document the storage -and transfer of the bag. The name, BagIt, is inspired by the "enclose and deposit" method -, sometimes referred to as "bag it and tag it". - - - - -Implementors of BagIt tools should consider interoperability -between different platforms, operating systems, toolsets, and languages. -Differences in path separators, newline characters, reserved -file names, and maximum path lengths are all possible barriers to -moving bags between different systems. Discussion of these issues may be -found in the Interoperability section of this document. - -
- -
- -The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", -"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this -document are to be interpreted as described in . - - - -An implementation is not compliant if it fails to satisfy one or -more of the MUST or REQUIRED level requirements for the protocols -it implements. An implementation that satisfies all the MUST or -REQUIRED level and all the SHOULD level requirements for its protocols -is said to be "unconditionally compliant"; one that satisfies all -the MUST level requirements but not all the SHOULD level requirements -for its protocols is said to be "conditionally compliant." - -
- -
- -This specification uses a number of terms to describe BagIt, some -of which are in common use, some of which are newly defined by this -specification, and others which may have meanings obvious only -to those in the community from which this spec arose. Terms defined -in this section are intended to clarify any ambiguity. - - - - - - A set of opaque data contained within the structure defined - by this specification. - - - - The tag file required to be in all bags conforming to this - specification. Contains tags necessary for bootstrapping the - reading and processing of the rest of a bag. See . - - - - A reference to a cryptographic checksum algorithm, such as MD5 or - SHA-1, with its name normalized for use in a manifest or tag - manifest file name. See . - - - - A bag which comprises all elements required by this specification, - with all files listed in all payload and tag manifests present, - all payload files present listed in at least one manifest. See - . +
+ + 101 Independence Avenue SE + Washington + DC + 20540 + United States of America + + cadams@loc.gov +
+ + + + +This document describes BagIt, a set of hierarchical file layout conventions for +storage and transfer of arbitrary digital content. A "bag" has just enough +structure to enclose descriptive metadata "tags" and a file "payload" but +does not require knowledge of the payload's internal semantics. This +BagIt format is suitable for reliable storage and transfer. + + + + +
+
+ +BagIt is a set of hierarchical file layout conventions designed to support +storage and transfer of arbitrary digital content. +A "bag" consists of a directory containing the payload files and other accompanying +metadata files known as "tag" files. The "tags" are metadata files intended to +facilitate and document the storage and transfer of the bag. Processing a bag +does not require any understanding of the payload file contents, and the payload +files can be accessed without processing the BagIt metadata. + + +The name, BagIt, is inspired by the "enclose and deposit" method +, sometimes referred to as "bag it and tag it". +BagIt differs from serialized archival formats such as MIME, TAR, or ZIP +in two general areas: + + + Strong integrity assurances. The format supports cryptographic-quality + hash algorithms (see ) and allows + for in-place upgrades to add additional manifests using stronger algorithms + without breaking backwards compatibility. This provides high + levels of confidence against data corruption, but it is not designed + to be secure against active attacks. + + Direct file access. Because BagIt specifies an actual filesystem hierarchy + rather than a serialized representation of one, files can be accessed + using standard operating system utilities, implementations do not need + to process a potentially large archival file to extract a subset of data, + and the format imposes no size limits for either individual files or a bag. + + + +BagIt is widely used for preserving digital assets originating from different +domains. Organizations involved in digital preservation with BagIt include +the Library of Congress, Dryad Data Repository, NSF DataONE, and the +Rockefeller Archive Center. Software implementations are available for many +languages, including Python, Ruby, Java, Perl, and PHP. 
It is also used in +the libraries of many universities, such as Cornell, Purdue, Stanford, +Ghent University, New York University, and the University of California. + +
+ +
+ + The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL + NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", + "MAY", and "OPTIONAL" in this document are to be interpreted as + described in BCP 14 + when, and only when, they appear in all capitals, as shown here. + + + Implementers are strongly encouraged to review the interoperability + considerations described in . + +
+ +
+ + The following terms have precise definitions as used in this document: + + + + + A set of opaque files contained within the structure + defined by this document. + + + The file required to be in all bags conforming to this document. + Contains values necessary to process the rest of a bag. + See . + + + The name of a cryptographic checksum algorithm that has been normalized + for use in a manifest or tag manifest file name (e.g., "sha512") + as described in . + + + A tag file that maps filepaths to checksums. A manifest can be a payload + manifest (see ) or a + tag manifest (see ). + + + The data encapsulated by the bag as a set of named files, which may be + organized in subdirectories. The contents of the payload files + are opaque to this document, and, with respect to BagIt processing, + are always considered as sequences of uninterpreted octets. + See . + + + A directory that contains one or more tag files. + + + A file that contains metadata about the bag or its payload. + This document defines the standard + BagIt tag files: + the bag declaration in "bagit.txt" (see ), + payload manifests (see ), + tag manifests (see ), + bag metadata in "bag-info.txt" (see ), + and remote payload in "fetch.txt" (see ). + + This document also allows other arbitrary tag files as described in + . + + + A bag that contains every element required by this document, + every payload file listed in a manifest, and any optional files that are + listed in a tag manifest. See . + + + A complete bag where every checksum in every manifest has been + successfully verified against the corresponding file. + + - - - The data encapsulated by the bag. The contents of the payload - are opaque to this specification, and are always considered as a - set of octet streams. See . - - - - A bag that has been serialized into a single, monolithic file. See - . - - - - A directory that contains one or more tag files. 
- - - - A file that contains metadata intended to facilitate and document - the storage and transfer of the bag. - - - - A complete bag wherein every checksum in every payload manifest and - tag manifest can be successfully verified against the corresponding - payload file. See . - - - -
- - - -
- -
- -A bag consists of a base directory containing (1) a set of required -and optional tag files; (2) a sub-directory named "data", called the payload -directory; and (3) a set of optional tag directories. The payload files in the -payload directory are an arbitrary file hierarchy -(see ). +
+ +
+ +
+ + A bag MUST consist of a base directory containing the following: + + + + a set of required and optional tag files (see ); + a subdirectory named "data", called the payload directory (see + ); and + a set of optional tag directories. + + + The tag files in the base directory consist of one or more files named "manifest-algorithm.txt" -(see ), a file named "bagit.txt" -(see ), and zero or more additional tag -files (see ). The tag files in the -optional tag directories are arbitrary file hierarchies and the tag directories -&may; have any name that is not reserved for a file or directory in this specification. - - - -The base directory &may; have any name. - - -
- - <base directory>/ - | bagit.txt - | manifest-<algorithm>.txt - | [optional additional tag files] - \--- data/ - | [payload files] - \--- [optional tag directories]/ - | [optional tag files] - -
- -
-
- -The "bagit.txt" tag file &must; consist of exactly two lines: - -
- +(see Sections and +), +a file named "bagit.txt" (see ), +and zero or more additional tag files (see +). The tag files and directories are +in arbitrary file hierarchies and MAY have +any name that is not reserved for a file or directory in this document. + + + +The base directory can have any name, as illustrated by the figure below. + +
+ + <base directory>/ + | + +-- bagit.txt + | + +-- manifest-<algorithm>.txt + | + +-- [additional tag files] + | + +-- data/ + | | + | +-- [payload files] + | + +-- [tag directories]/ + | + +-- [tag files] +
+
+
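The layout above can be checked mechanically. The following is a minimal Python sketch, not part of the specification: the function name `check_bag_layout` and its problem messages are hypothetical, and it only tests for the three elements the text names as required (the "bagit.txt" file, the "data" directory, and at least one "manifest-algorithm.txt" file).

```python
from pathlib import Path


def check_bag_layout(base_dir: Path) -> list[str]:
    """Return a list of problems found in the top-level bag layout.

    An empty list means the required top-level elements are present.
    This does not validate file contents or checksums.
    """
    problems = []
    # The bag declaration must exist in the base directory.
    if not (base_dir / "bagit.txt").is_file():
        problems.append("missing bagit.txt")
    # The payload directory must be a subdirectory named "data".
    if not (base_dir / "data").is_dir():
        problems.append("missing data/ payload directory")
    # At least one payload manifest must be present.
    if not any(p.is_file() for p in base_dir.glob("manifest-*.txt")):
        problems.append("missing manifest-<algorithm>.txt")
    return problems
```

A caller would typically run this first and stop early if the returned list is non-empty, before doing any checksum work.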
+ + The "bagit.txt" tag file MUST consist of exactly two lines in this order: + +
+ BagIt-Version: M.N -Tag-File-Character-Encoding: UTF-8 - -
- -where M.N identifies the BagIt major (M) and minor (N) version numbers, -and UTF-8 identifies the character set encoding of tag files. The bag -declaration &must; be encoded in UTF-8, and &must-not; contain a byte-order -mark (BOM). - - - - -The appropriate version for a bag that conforms to -this version of the specification is "&current-bagit-version;". - -
- -
- -The base directory &must; contain a sub-directory named "data", called the -payload directory. - - - -The payload directory contains the custodial content within the bag. -The files under the payload directory are called payload files, or -the payload. -The payload is treated as octet streams for all purposes relating to this -specification, and is not otherwise prescribed. - -
- -
- - -A payload manifest is a tag file that lists payload files and checksums for those -payload files generated using a particular bag checksum algorithm. -Every bag &must; contain one payload manifest file, and &may; contain -more than one. A payload manifest file &must; -have a name of the form manifest-algorithm.txt, where -algorithm is a string specifying -the bag checksum algorithm used in that manifest, such as: - - -
- -manifest-md5.txt -manifest-sha1.txt - -
- -A bag &must-not; contain more than one payload manifest for a particular -bag checksum algorithm. - -Each line of a payload manifest file &must; be of the form: - - -
- -CHECKSUM FILENAME - -
- - -where FILENAME is the pathname of a file relative to the base directory -and CHECKSUM is a hex-encoded checksum calculated according to algorithm over every octet in the file. The hex-encoded -checksum &may; use uppercase and/or lowercase letters. The slash -character ('/') &must; be used as a path separator in FILENAME. One -or more linear whitespace characters (spaces or tabs) &must; separate -CHECKSUM from FILENAME. An asterisk ('*') &may; precede FILENAME for -interoperability on some platforms (see ). There is no limitation on the length of a pathname. The payload -manifest &must-not; reference files outside the payload directory. If -a FILENAME includes a newline (LF), a carriage return (CR), or carriage -return plus newline (CRLF) it &must; be percent-encoded -. - - - - -Payload manifests only include the pathnames of files. Because of this, -a payload manifest cannot reference empty directories. To account for -an empty directory, a bag creator may wish to include at least one file -in that directory; it suffices, for example, to include a zero-length -file named ".keep". - -
-
- -
-
- - -A tag manifest is a tag file that lists other tag files and checksums for -those tag files generated using a particular bag checksum algorithm. -A bag &may; contain one or more tag manifests. -A tag manifest file &must; have a name of the form -"tagmanifest-algorithm.txt", where -algorithm is a string specifying -the bag checksum algorithm used in that manifest, such as: - - -
- -tagmanifest-md5.txt -tagmanifest-sha1.txt - -
- - -A tag manifest file has the same form as the payload file manifest -file described in , -but &must-not; list any payload files. -As a result, no FILENAME listed in a tag manifest begins "data/". - - -
- -
- - -The "bag-info.txt" file is a tag file that contains metadata elements -describing the bag and the payload. The metadata elements contained in -the "bag-info.txt" file are intended primarily for human readability. -All metadata elements are optional and &may; be repeated. Implementations -&should; assume that the ordering is significant and provide access to the -metadata elements in the order they are given in the "bag-info.txt" file. - - -A metadata element &must; consist of a label, a colon, and a value, -each separated by optional whitespace. It is &recommended; that -lines not exceed 79 characters in length. Long values may be continued -onto the next line by inserting a newline (LF), a carriage return (CR), -or carriage return plus newline (CRLF) and indenting the next line with -linear white space (spaces or tabs). - - -Reserved metadata element names are case-insensitive and defined as follows. - - - - - - Organization transferring the content. - - - Mailing address of the organization. - - - Person at the source organization who is responsible for the content - transfer. - - - International format telephone number of person or position responsible. - - - Fully qualified email address of person or position responsible. - - - A brief explanation of the contents and provenance. - - - Date (YYYY-MM-DD) that the content was prepared for delivery. - - - A sender-supplied identifier for the bag. - - - Size or approximate size of the bag being transferred, followed - by an abbreviation such as MB (megabytes), GB, or TB; for example, - 42600 MB, 42.6 GB, or .043 TB. Compared to Payload-Oxum (described - next), Bag-Size is intended for human consumption. - - - The "octetstream sum" of the payload, namely, a two-part number - of the form "OctetCount.StreamCount", where OctetCount is the - total number of octets (8-bit bytes) across all payload file content - and StreamCount is the total number of payload files. 
Payload-Oxum - should be included in "bag-info.txt" if at all - possible. Compared to Bag-Size (above), Payload-Oxum is - intended for machine consumption. - - - A sender-supplied identifier for the set, if any, of bags - to which it logically belongs. - This identifier must be unique across the sender's content, and if - recognizable as belonging to a globally unique scheme, the receiver - should make an effort to honor reference to it. - - - Two numbers separated by "of", in particular, "N of T", - where T is the total number of bags in a group of bags and N is the - ordinal number within the group; if T is not known, specify it as "?" - (question mark). Examples: 1 of 2, 4 of 4, 3 of ?, 89 of 145. - - - An alternate sender-specific identifier for the content - and/or bag. - - - A sender-local prose description of the contents of the - bag. - - - - - -In addition to these metadata elements, other arbitrary metadata elements may also be present. - - - -Here is an example "bag-info.txt" file. - -
- - Source-Organization: Spengler University - Organization-Address: 1400 Elm St., Cupertino, California, 95014 - Contact-Name: Edna Janssen - Contact-Phone: +1 408-555-1212 - Contact-Email: ej@spengler.edu - External-Description: Uncompressed greyscale TIFF images from the - Yoshimuri papers colle... - Bagging-Date: 2008-01-15 - External-Identifier: spengler_yoshimuri_001 - Bag-Size: 260 GB - Payload-Oxum: 279164409832.1198 - Bag-Group-Identifier: spengler_yoshimuri - Bag-Count: 1 of 15 - Internal-Sender-Identifier: /storage/images/yoshimuri - Internal-Sender-Description: Uncompressed greyscale TIFFs created - from microfilm and are... - -
-
- -
- -
- - -For reasons of efficiency, a bag &may; be sent with a list of files to be -fetched and added to the payload before it can meaningfully be checked -for completeness. An &optional; tag file named "fetch.txt" -contains such a list. Each line of "fetch.txt" has the form - -
- -URL LENGTH FILENAME - -
- -where URL identifies the file to be fetched, LENGTH is the number of -octets in the file (or "-", to leave it unspecified), and FILENAME -identifies the corresponding payload file, relative to the base directory. -The slash character ('/') &must; be used as a path separator in FILENAME. -If FILENAME begins with a slash character, the destination &must; still be -treated as relative to the bag base directory. -One or more linear whitespace characters (spaces or tabs) &must; separate these -three values, and any such characters in the URL &must; be percent-encoded -. There is no limitation on the length of any -of the fields in the "fetch.txt". -
- - -The "fetch.txt" file allows a bag to be transmitted with -"holes" in it, which can be practical for several reasons. For example, -it obviates the need for the sender to stage a large serialized copy of -the content while the bag is transferred to the receiver. Also, this -method allows a sender to construct a bag from components that are either -a subset of logically related components (e.g., the localized logical -object could be much larger than what is intended for export) or -assembled from logically distributed sources (e.g., the object components -for export are not stored locally under one filesystem tree). - - -
- -
- -A bag &may; contain other tag files that are not defined by this -specification. -Implementations &should; ignore the content of any unexpected tag files, -except when they are listed in a tag manifest. -When unexpected tag files are listed in a tag manifest, implementations -&must; only treat the content of those tag files as octet streams for the -purpose of checksum verification. - -
-
- -
- -All tag files specifically described in this specification &must; adhere to -the text tag file format described below. Other tag files &may; adhere to -the text tag file format described below. - - -Text tag files are line-oriented, and each line &must; be -terminated by a newline (LF), a carriage return (CR), or carriage return -plus newline (CRLF). -Text tag files &must; end in the extension ".txt". - - - -In all text tag files except for the bag declaration file, text &must; be -encoded in the character encoding specified in the "bagit.txt" bag declaration -file. Text tag files except for the bag declaration file &may; include a -byte-order mark (BOM) only if the specified encoding requires it for -proper decoding. (Note that UTF-8 does not.) - - - -As specified in , the bag declaration -file must be encoded in UTF-8 and must not include a byte-order mark. - - - - -
- -
- -The payload manifest and tag manifests assert integrity of the payload -and tags in a bag using checksum algorithms. The operation -of those algorithms, and the formatting of their output within a manifest -file, are generally beyond the scope of this specification, except that the -output format &must; be able to fit in the manifest format specified in -. - - - -The name of the checksum algorithm &must; be normalized for use in the -manifest's filename by lowercasing the common name of the algorithm and -removing all non-alphanumeric characters. - - - -Implementors of tools that create and validate bags &should; support at -least two widely implemented checksum algorithms: "md5" - and "sha1" . - -
- -
- -
- -A complete bag &must; have the following -attributes: - - - - - Every required element &must; be present - (). - Every file in every payload manifest &must; be present. - Every file in every tag manifest &must; be present. - Tag files not listed in a tag manifest &may; be present. - Every payload file &must; be listed in at least one manifest. - Payload files &may; be listed in more than one payload manifest. - Every element present &must; comply with this specification. - - - - -A bag is incomplete when it exhibits any of -the following exceptions to the attributes of a complete bag: - - - - - One or more files in any payload manifest are absent. - One or more files in any tag manifest are absent. - A fetch.txt is present. Any files listed in - any payload manifest or any tag manifest which are - absent &must; be listed in the fetch.txt. - - - - -A valid bag must have the following -attributes: - - - - - The bag &must; be complete. - Every CHECKSUM in every payload manifest and tag manifest - can be successfully verified against the contents of its - corresponding FILENAME. - - - - -If a bag is neither valid, complete, nor incomplete, it is -invalid. Definitions for the various -ways a bag may be invalid are not covered by this specification. - - - -Tag files that do not appear in a tag manifest can be modified, added -to, or removed from a bag without impacting the completeness or validity -of the bag. - - -
- -
- - -In some scenarios, it may be convenient to serialize the -bag's filesystem hierarchy (i.e., the base directory) into a -single-file archive format such as TAR or ZIP (the serialization) and then -later deserialize the serialization to recreate the filesystem hierarchy. -Several rules govern the serialization of a bag and apply equally -to all types of archive files: - - - - - -The top-level directory of a serialization &must; contain only one bag. - - -The serialization &should; have the same name as the bag's base directory, -but &must; have an extension added to identify the format. For example, the -receiver of "mybag.tar.gz" expects the corresponding base directory -to be created as "mybag". - - -A bag &must-not; be serialized from within its base directory, but from the -parent of the base directory (where the base directory appears as an -entry). Thus, after a bag is deserialized in an empty directory, -a listing of that directory shows exactly one entry. For example, -deserializing "mybag.zip" in an empty directory causes the creation -of the base directory "mybag" and, beneath "mybag", the creation of -all payload and tag files. - - -The deserialization of a bag &must; produce a single base directory -bag with the top-level structure as described in this specification without -requiring any additional un-archiving step. For example, after one -un-archiving step it would be an error for the "data/" directory to -appear as "data.tar.gz". TAR and ZIP files may appear inside the payload -beneath the "data/" directory, where they would be treated -as any other payload file. - - - - - -When serializing a bag, care must be taken to -ensure that the archive format's restrictions on file naming, such as allowable -characters, length, or character encoding, will support the -requirements of the systems on which it will be used. See -. - - -
- -
-
+Tag-File-Character-Encoding: ENCODING + + M.N identifies the BagIt major (M) and minor (N) version numbers. + ENCODING identifies the character set encoding used by the remaining tag files. + + ENCODING SHOULD + be UTF-8, but + for backwards compatibility it MAY be any + other encoding registered in . + + The bag declaration itself MUST be encoded in UTF-8 and MUST NOT contain a + Byte Order Mark (BOM) . + + + + The number for this version of BagIt is "1.0". + +
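The bag declaration rules above (exactly two lines, in order, UTF-8, no BOM) can be sketched in Python as follows. This is an illustration under assumptions, not a reference parser: the name `parse_bag_declaration` is hypothetical, and the sketch assumes exactly one space after each colon, as shown in the figure.

```python
import re


def parse_bag_declaration(data: bytes) -> tuple[tuple[int, int], str]:
    """Parse the raw bytes of bagit.txt into ((major, minor), encoding)."""
    # The declaration itself must be UTF-8 without a byte order mark.
    if data.startswith(b"\xef\xbb\xbf"):
        raise ValueError("bagit.txt must not start with a byte order mark")
    text = data.decode("utf-8")

    # Accept LF, CR, or CRLF line endings; drop the empty tail after the
    # final terminator so that a well-formed file yields exactly two lines.
    lines = re.split(r"\r\n|\r|\n", text)
    if lines and lines[-1] == "":
        lines.pop()
    if len(lines) != 2:
        raise ValueError("bagit.txt must consist of exactly two lines")

    version = re.fullmatch(r"BagIt-Version: (\d+)\.(\d+)", lines[0])
    encoding = re.fullmatch(r"Tag-File-Character-Encoding: (\S+)", lines[1])
    if version is None or encoding is None:
        raise ValueError("malformed bag declaration")
    return (int(version.group(1)), int(version.group(2))), encoding.group(1)
```

The returned encoding name is what a reader would then use to decode the remaining tag files.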
+ +
+ + The base directory MUST contain a subdirectory named "data". + + + The payload directory contains the arbitrary digital content within the bag. + The files under the payload directory are called payload files, or the payload. + Each payload file is treated as an opaque octet stream when verifying file + correctness. + Payload files MAY be organized in arbitrary subdirectory structures + within the payload directory; however, for the purpose of this document, + such subdirectory structures and filenames have no given meaning. + +
+ +
+ + A payload manifest file provides a complete listing of each payload file name along + with a corresponding checksum to permit data integrity checking. A bag can have more + than one payload manifest, with each using a different checksum algorithm. + Manifest entries MUST satisfy the following constraints: + + + + + + Every bag MUST contain at least one payload manifest file and MAY contain + more than one. + + + Every payload manifest MUST list every payload file name exactly once. + + + A payload manifest file MUST have a name of the form + "manifest-algorithm.txt", where + algorithm + is a string specifying the checksum algorithm used by that + manifest as described in . + + + + +Example payload manifest filenames: +
+ +manifest-sha256.txt +manifest-sha512.txt + +
+ + Each line of a payload manifest file MUST be of the form + +
+ checksum filepath +
+where filepath is the pathname of a file +relative to the base directory, and checksum is a +hex-encoded checksum calculated by applying algorithm over the file. + + + + + The hex-encoded checksum MAY use uppercase and/or lowercase letters. + The slash character ('/') MUST be used as a path separator + in filepath. + One or more linear whitespace characters (spaces or tabs) + MUST separate checksum from + filepath. + There is no limitation on the length of a pathname. + The payload manifest MUST NOT reference files outside the payload directory. + + If a filepath includes a Line Feed + (LF), a Carriage Return (CR), + a Carriage-Return Line Feed (CRLF), or a + percent sign (%), those characters (and only those) MUST be + percent-encoded following . + + + + +A manifest MUST NOT reference directories. Bag creators who wish to create +an otherwise empty directory have typically done so by creating an empty +placeholder file with a name such as ".keep". + +
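As an illustration of the manifest line rules above, the following Python sketch parses one "checksum filepath" line and verifies a payload file against it. The function names are hypothetical. It assumes the algorithm string from the manifest file name (e.g., "sha256") is a name that `hashlib.new` accepts, and it decodes only the three percent-encoded sequences the text requires (LF, CR, and the percent sign).

```python
import hashlib
import re
from pathlib import Path

# Only these three sequences are decoded, per the constraints above.
_ESCAPES = {"%0A": "\n", "%0D": "\r", "%25": "%"}


def parse_manifest_line(line: str) -> tuple[str, str]:
    """Split a payload manifest line into (lowercase checksum, filepath)."""
    match = re.fullmatch(r"([0-9A-Fa-f]+)[ \t]+(.+)", line)
    if match is None:
        raise ValueError(f"malformed manifest line: {line!r}")
    checksum = match.group(1).lower()
    # Single-pass substitution, so "%250A" decodes to "%0A", not to LF.
    filepath = re.sub(
        r"%(0A|0D|25)",
        lambda m: _ESCAPES["%" + m.group(1).upper()],
        match.group(2),
        flags=re.IGNORECASE,
    )
    # A payload manifest must not reference files outside the payload directory.
    if not filepath.startswith("data/") or ".." in filepath.split("/"):
        raise ValueError(f"path outside payload directory: {filepath!r}")
    return checksum, filepath


def verify_payload_file(base_dir: Path, algorithm: str,
                        checksum: str, filepath: str) -> bool:
    """Return True if the file's digest matches the manifest checksum."""
    digest = hashlib.new(algorithm)
    with open(base_dir / filepath, "rb") as handle:
        # Read in chunks so large payload files are not loaded into memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == checksum.lower()
```

Lowercasing the checksum on both sides reflects the rule that hex digits may use either case.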
+ +
+ +
+
+ + A tag manifest is a tag file that lists other tag files and + checksums for those tag files generated using a particular bag + checksum algorithm. + + + A bag MAY contain one or more tag manifests, in which case each tag manifest SHOULD list the same set of tag files. + + + Each tag manifest MUST list every payload manifest. + Each tag manifest MUST NOT list any tag manifests + but SHOULD list the remaining tag files present in the bag. + + + A tag manifest file MUST have a name of the form + "tagmanifest-algorithm.txt", + where algorithm is a string following + the format described in + that specifies the bag checksum algorithm used in that manifest. + + + Tag manifests SHOULD use the same algorithms as the payload manifests that are present in the bag. + +Example tag manifest filenames: +
+ +tagmanifest-sha256.txt +tagmanifest-sha512.txt +
+ +A tag manifest file has the same form as the payload manifest file +described in +but MUST NOT list any payload files. +As a result, no filepath listed in a tag manifest begins "data/". + +
+ +
+ + The "bag-info.txt" file is a tag file that contains metadata + elements describing the bag and the payload. The metadata elements + contained in the "bag-info.txt" file are intended primarily for + human use. All metadata elements are OPTIONAL and MAY be repeated. + Because "bag-info.txt" is intended for human reading + and editing, ordering MAY be significant and the ordering of + metadata elements MUST be preserved. + + + A metadata element MUST consist of a label, a colon ":", a single + linear whitespace character (space or tab), and a value that is + terminated with an LF, a CR, or a CRLF. + + + The label MUST NOT contain a colon (:), LF, or CR. + The label MAY contain linear whitespace characters but MUST NOT start or + end with whitespace. + + + It is RECOMMENDED that lines not exceed 79 characters in length. Long values MAY be + continued onto the next line by inserting a LF, CR, or CRLF, and then indenting + the next line with one or more linear white space characters (spaces or tabs). + Except for linebreaks, such padding does not form part of the value. + + + Implementations wishing to support previous BagIt versions + MUST accept multiple linear whitespace characters before and after the + colon when the bag version is earlier than 1.0; such whitespace + does not form part of the label or value. + + + The following are reserved metadata elements. The use of these reserved + metadata elements is OPTIONAL but encouraged. Reserved metadata + element names are case insensitive. Except where indicated otherwise, + these metadata element names MAY be repeated to capture multiple values. + + + + + + Organization transferring the content. + + + Mailing address of the source organization. + + + Person at the source organization who is responsible for the content + transfer. + + + International format telephone number of person or position responsible. + + + Fully qualified email address of person or position responsible. 
+ + + A brief explanation of the contents and provenance. + + + Date (YYYY-MM-DD) that the content was prepared for transfer. + This metadata element SHOULD NOT be repeated. + + + A sender-supplied identifier for the bag. + + + The size or approximate size of the bag being transferred, followed + by an abbreviation such as MB (megabytes), GB (gigabytes), or + TB (terabytes): for example, + 42600 MB, 42.6 GB, or .043 TB. Compared to Payload-Oxum (described + next), Bag-Size is intended for human consumption. + This metadata element SHOULD NOT be repeated. + + + The "octetstream sum" of the payload, which is intended for the + purpose of quickly detecting incomplete bags before performing checksum + validation. This is strictly an optimization, and implementations MUST perform + the standard checksum validation process before proclaiming a bag to be valid. + This element MUST NOT be present more than once and, if present, MUST + be in the form "OctetCount.StreamCount", + where OctetCount is the total number of + octets (8-bit bytes) across all payload file content and + StreamCount is the total number of + payload files. + This metadata element MUST NOT be repeated. + + + A sender-supplied identifier for the set, if any, of bags + to which it logically belongs. + This identifier SHOULD be unique across the sender's content, + and if it is recognizable as belonging to a globally unique scheme, the receiver + SHOULD make an effort to honor the reference to it. + This metadata element SHOULD NOT be repeated. + + + Two numbers separated by "of", in particular, "N of T", + where T is the total number of bags in a group of bags and N is the + ordinal number within the group. If T is not known, specify it as "?" + (question mark): for example, 1 of 2, 4 of 4, 3 of ?, 89 of 145. + This metadata element SHOULD NOT be repeated. + If this metadata element is present, it is RECOMMENDED to also + include the Bag-Group-Identifier element. 
+ + + An alternate sender-specific identifier for the content + and/or bag. + + + A sender-local explanation of the contents and provenance. + + + + + In addition to these metadata elements, other arbitrary metadata + elements MAY also be present. + +
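The Payload-Oxum value described above can be computed with a simple directory walk. The following Python sketch is illustrative only: the function name `payload_oxum` is hypothetical, and it assumes the payload is stored in a "data" directory directly under the bag's base directory.

```python
import os


def payload_oxum(bag_dir):
    """Return "OctetCount.StreamCount" for the payload under bag_dir/data."""
    octets = 0
    streams = 0
    for root, _dirs, files in os.walk(os.path.join(bag_dir, "data")):
        for name in files:
            # OctetCount: total size in bytes of all payload file content.
            octets += os.path.getsize(os.path.join(root, name))
            # StreamCount: total number of payload files.
            streams += 1
    return f"{octets}.{streams}"
```

A receiver could compare this string with the Payload-Oxum element before starting checksum validation. A mismatch indicates an incomplete bag, while a match still requires the full checksum validation.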
+ An example of a "bag-info.txt" file is as follows:
+
+Source-Organization: FOO University
+Organization-Address: 1 Main St., Cupertino, California, 11111
+Contact-Name: Jane Doe
+Contact-Phone: +1 111-111-1111
+Contact-Email: example@example.com
+External-Description: Uncompressed greyscale TIFF images from the
+ FOO papers colle...
+Bagging-Date: 2008-01-15
+External-Identifier: university_foo_001
+Payload-Oxum: 279164409832.1198
+Bag-Group-Identifier: university_foo
+Bag-Count: 1 of 15
+Internal-Sender-Identifier: /storage/images/foo
+Internal-Sender-Description: Uncompressed greyscale TIFFs created
+ from microfilm and are...
+
+
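The label, colon, and continuation-line rules for "bag-info.txt" can be sketched as a small parser. This is a minimal Python illustration, not a reference implementation: the function name `parse_bag_info` is hypothetical, and joining continuation lines with a single space is an assumption about how a reader might choose to present folded values.

```python
import re


def parse_bag_info(text):
    """Return (label, value) pairs, preserving order and repeated labels."""
    entries = []
    # Lines may be terminated by CRLF, CR, or LF.
    for line in re.split(r"\r\n|\r|\n", text):
        if not line.strip():
            continue  # skip blank lines, e.g. after the final terminator
        if line[0] in " \t" and entries:
            # Continuation line: the indentation is padding, not value text.
            label, value = entries[-1]
            entries[-1] = (label, value + " " + line.strip())
            continue
        label, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"missing colon in line: {line!r}")
        # Stripping both sides also tolerates the extra whitespace that
        # bags earlier than version 1.0 may place around the colon.
        entries.append((label.strip(), value.strip()))
    return entries
```

Returning a list rather than a dictionary keeps the ordering of elements and allows labels to repeat, as the text above requires.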
+ +
+ + + For reasons of efficiency, a bag MAY be sent with a list of files to be + fetched and added to the payload before it can meaningfully be checked + for completeness. + The fetch file allows a bag to be transmitted with + "holes" in it, which can be practical for several reasons. For example, + it obviates the need for the sender to stage a large serialized copy of + the content while the bag is transferred to the receiver. Also, this + method allows a sender to construct a bag from components that are either + a subset of logically related components (e.g., the localized logical + object could be much larger than what is intended for export) or + assembled from logically distributed sources (e.g., the object components + for export are not stored locally under one filesystem tree). + An OPTIONAL tag file, called the fetch file, contains such a list. + + + + The fetch file MUST be named "fetch.txt". Every file listed in + the fetch file MUST be listed in every + payload manifest. A fetch file MUST NOT list any tag files. + + + Each line of a fetch file MUST be of the form + +
+ url length filepath + + where url identifies the file to be + fetched and MUST be an absolute URI as defined in + , length is + the number of octets in the file (or "-", to leave it unspecified), + and filepath identifies the + corresponding payload file, relative to the base directory. + +
+ + + The slash character ('/') MUST be used as a path separator in + filepath. One or more linear whitespace + characters (spaces or tabs) MUST separate these + three values, and any such characters in the url + MUST be percent-encoded as described in RFC 3986. + If filepath includes an LF, a CR, + a CRLF, or a percent sign (%), those characters (and only those) MUST be + percent-encoded as described in RFC 3986. + There is no + limitation on the length of any of the fields in the fetch file. + + +
+ +
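The "url length filepath" line format can be read with a few lines of Python. The sketch below is illustrative: the function name `parse_fetch_line` is hypothetical, and the example.org URL used in the usage note is a placeholder. It relies on the rule that whitespace inside the url is percent-encoded, so the first two whitespace-separated tokens are always the url and the length.

```python
from urllib.parse import unquote


def parse_fetch_line(line):
    """Split one fetch.txt line into (url, length_or_None, filepath)."""
    # Split on runs of spaces or tabs at most twice, so that spaces
    # inside the filepath are kept intact.
    parts = line.rstrip("\r\n").split(None, 2)
    if len(parts) != 3:
        raise ValueError(f"expected 'url length filepath': {line!r}")
    url, length, filepath = parts
    # "-" means that the sender left the length unspecified.
    size = None if length == "-" else int(length)
    # Undo the percent-encoding of LF, CR, and "%" in the filepath.
    return url, size, unquote(filepath)
```

For example, `parse_fetch_line("http://example.org/a%20b -\tdata/100%25 done.txt")` yields the url unchanged, `None` for the length, and the filepath `data/100% done.txt`.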
+ + A bag MAY contain other tag files that are not defined by this + document. + + Implementations MUST perform standard checksum validation on any tag file + that is listed in a tag manifest but MUST otherwise ignore their contents. + +
+ +
+ +
+ + All tag files specifically described in this document MUST adhere to + the text tag file format described below. Other tag files MAY adhere to + the text tag file format described below. + - -This is the layout of a basic bag containing an image and a companion -OCR file. Lines of file content are shown in parentheses beneath the -file name. + + Text tag files are line oriented, and each line MUST be terminated + by an LF, a CR, or a CRLF. It is RECOMMENDED that the last line in a tag + file also end with LF, CR, or CRLF. + Text tag file names MUST end in the extension ".txt". + + +In all text tag files except for the bag declaration file, text MUST use +the character encoding specified in the "bagit.txt" bag declaration +file. Text tag files except for the bag declaration file MAY include a +Byte Order Mark (BOM) only if the specified encoding requires it for +proper decoding. In accordance with , when "bagit.txt" +specifies UTF-8, the tag files MUST NOT begin with a BOM. +See . + + +The use of UTF-8 for text tag files is strongly RECOMMENDED. A future version +of BagIt may disallow encodings other than UTF-8. + +
+ +
+ +The payload manifest and tag manifest permit validating the integrity of the payload +and tag files in a bag produced by the checksum algorithms. +Checksum values MUST be encoded so as to conform to the manifest format +specified in . However, the internal details +of a checksum are outside the scope of this document. + + + To avoid future ambiguity, the checksum algorithm SHOULD be registered + in IANA's "Named Information Hash Algorithm Registry" + according to but MAY, for backwards compatibility, also be + MD5 or SHA-1 . + + +The name of the checksum algorithm MUST be normalized for use in the +manifest's filename by lowercasing the common name of the algorithm and +removing all non-alphanumeric characters. Following is a partial list +that maps common algorithm names to normalized names: + + MD5: md5 + SHA-1: sha1 + sha-256: sha256 + sha-512: sha512 + + + + Starting with BagIt 1.0, bag creation and validation tools MUST support the + SHA-256 and SHA-512 algorithms and SHOULD enable + SHA-512 by default when creating new bags. + + For backwards compatibility, implementers SHOULD support + MD5 and SHA-1 . + + Implementers are encouraged to simplify the process of adding additional + manifests using new algorithms to streamline the process of in-place + upgrades. + +
+ +
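The name normalization rule (lowercase the common name, then remove all non-alphanumeric characters) is small enough to show directly. In this Python sketch the function name `manifest_filename` and its `tag` parameter are hypothetical; the resulting file names follow the "manifest-md5.txt" and "tagmanifest-sha512.txt" patterns used elsewhere in this document.

```python
import re


def manifest_filename(algorithm, tag=False):
    """Build a manifest file name from a common checksum algorithm name."""
    # Lowercase first, then drop everything that is not a letter or digit.
    normalized = re.sub(r"[^a-z0-9]", "", algorithm.lower())
    prefix = "tagmanifest" if tag else "manifest"
    return f"{prefix}-{normalized}.txt"
```

With this rule, "SHA-512" and "sha-512" both map to the same file name, which avoids duplicate manifests that differ only in how the algorithm name was written.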
+ +
+ +A complete bag MUST meet the following +requirements: + + + + Every required element MUST be present (see ). + Every file listed in every tag manifest MUST be present. + Every file listed in every payload manifest MUST be present. + For BagIt 1.0, every payload file MUST be listed in every payload manifest. + Note that older versions of BagIt allowed payload files to be + listed in just one of the manifests. + + Every element present MUST conform to BagIt 1.0. + + + +A valid bag MUST meet the following requirements: + + + + The bag MUST be complete. + + Every checksum in every payload manifest and tag manifest has been + successfully verified against the contents of the corresponding file. + + + +
+ +
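The distinction between a complete bag (every listed file is present) and a valid bag (every checksum also verifies) can be illustrated with a manifest check. The Python sketch below is a simplified illustration under several assumptions: the names `file_digest` and `verify_manifest` are hypothetical, each manifest line is assumed to hold a checksum followed by whitespace and a path, and percent-decoding and path safety checks are omitted for brevity. It also does not check that every payload file is listed in every manifest.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm):
    """Hash a file in chunks so that large payload files are not loaded whole."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(bag_dir, manifest_name, algorithm):
    """Return a list of problems; an empty list means the manifest verified."""
    bag = Path(bag_dir)
    problems = []
    text = (bag / manifest_name).read_text(encoding="utf-8")
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            if line.strip():
                problems.append(f"malformed line: {line}")
            continue
        expected, filepath = parts
        target = bag / filepath
        if not target.is_file():
            problems.append(f"missing: {filepath}")  # bag is incomplete
        elif file_digest(target, algorithm) != expected.lower():
            problems.append(f"checksum mismatch: {filepath}")  # bag is invalid
    return problems
```

Collecting problems instead of stopping at the first one lets a tool report missing files separately from checksum failures.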
+
+ + This is the layout of a basic bag containing an image and a companion + Optical Character Recognition (OCR) file. Lines of file content are shown with added parentheses to + indicate each complete line. + For brevity, this example uses MD5 rather than the recommended SHA-512. + - -
- +
+ myfirstbag/ | | manifest-md5.txt @@ -813,7 +753,7 @@ myfirstbag/ | (408ad21d50cef31da4df6d9ed81b01a7 data/27613-h/images/q172.txt) | | bagit.txt -| (BagIt-version: 0.96 ) +| (BagIt-version: 1.0 ) | (Tag-File-Character-Encoding: UTF-8 ) | \--- data/ @@ -823,656 +763,484 @@ myfirstbag/ | | 27613-h/images/q172.txt | (... OCR text ... ) - .... - -
- - -
- - - - -
- - -The following example bag contains content from a web crawler. -As before, lines of file content are shown in parentheses beneath the -file name, with long lines continued indented on subsequent lines. -This bag is not complete until every -component listed in the "fetch.txt" file is retrieved. + .... + +
+
+ + This is the layout of a bag that expects the receiver to download the + files listed in the payload manifests prior to validation. Lines of + file content are shown with added parentheses to indicate each + complete line. + For brevity, this example uses MD5 rather than the recommended SHA-512. + -
- -mysecondbag/ +
+ +highsmith-tahoe/ | | manifest-md5.txt -| (93c53193ef96732c76e00b3fdd8f9dd3 data/Collection Overview.txt ) -| (e9c5753d65b1ef5aeb281c0bb880c6c8 data/Seed List.txt ) -| (61c96810788283dc7be157b340e4eff4 data/gov-20060601-050019.arc.gz) -| (55c7c80c6635d5a4c8fe76a940bf353e data/gov-20060601-100002.arc.gz) +| (102b0e6effe208ef9b29864946de9e22 data/23364a.tif ) | -| fetch.txt -| (http://WB20.Stanford.Edu/gov-06-2006/gov-20060601-050019.arc.gz -| 26583985 data/gov-20060601-050019.arc.gz ) -| (http://WB20.Stanford.Edu/gov-06-2006/gov-20060601-100002.arc.gz -| 99509720 data/gov-20060601-100002.arc.gz ) -| ( ...............................................................) -| -| bag-info.txt -| (Source-organization: California Digital Library ) -| (Organization-address: 415 20th St, 4th Floor, Oakland, CA 94612) -| (Contact-name: A. E. Newman ) -| (Contact-phone: +1 510-555-1234 ) -| (Contact-email: alfred@ucop.edu ) -| (External-Description: The collection "Local Davis Flood Control ) -| Collection" includes captured California State and local ) -| websites containing information on flood control resources for ) -| the Davis and Sacramento area. Sites were captured by UC Davis) -| curator Wrigley Spyder using the Web Archiving Service in ) -| February 2007 and October 2007. ) -| (Bag-date: 2008.04.15 ) -| (External-identifier: ark:/13030/fk4jm2bcp ) -| (Bag-size: about 22Gb ) -| (Payload-Oxum: 21836794142.831 ) -| (Internal-sender-identifier: UCDL ) -| (Internal-sender-description: UC Davis Libraries ) +| fetch.txt +| (https://cdn.loc.gov/master/pnp/highsm/23300/23364a.tif +| 216951362 data/23364a.tif ) | | bagit.txt -| (BagIt-version: 0.96 ) -| (Tag-File-Character-Encoding: UTF-8 ) +| (BagIt-version: 1.0 ) +| (Tag-File-Character-Encoding: UTF-8 ) | -\--- data/ - | - | Collection Overview.txt - | (... narrative description ... ) - | - | Seed List.txt - | (... list of crawler starting point URLs ... ) - .... - -
- - - -
-
- -
- -
- - -The paths specified in the payload manifest, tag manifest, and -"fetch.txt" file do not prohibit special directory characters which might be -significant on implementing systems. Implementors &should; take care that -files outside the bag directory structure are not accessed when reading or -writing files based on paths specified in a bag. - - - -For example, path characters such as ".." or "~" -in a maliciously crafted "fetch.txt" file might cause a naive implementation to -overwrite critical system files. - -
- -
- -Implementors of tools that complete bags by retrieving URLs listed in a -"fetch.txt" file need to be aware that some of those URLs may point to hosts, -intentionally or unintentionally, that are not under control of the bag's -sender. Checksums are intended as a reasonable guarantee against corruption -during transit, not a strong cryptographic protection against intentional -spoofing. - -
- -
- - -The size of files, as optionally reported in the "fetch.txt" file, cannot be -guaranteed to match the actual file size to be downloaded. Implementors &should; -take care to appropriately handle cases where the actual file size does not -match the file size reported in the fetch.txt. Implementors &should-not; use -the file size in the "fetch.txt" file for critical resource allocation, such as -buffer sizing or storage requisitioning. - -
- -
- -
-
- - -When creating a bag on physical media (such as hard disk, CD-ROM, or -DVD) for transfer to another organization, the sender should select -and format the media in a manner compatible with both the content -requirements (e.g., file names and sizes) and the receiver's technical -infrastructure. If the receiver's infrastructure is not known or the -media needs to be compatible with a range of potential receivers, -consideration should be given to portability and common usage. For -example, a "lowest common denominator" for some potential receivers -could be USB disk drives formatted with the FAT32 filesystem. - - - -Although overall bag size is unlimited in principle, network-based -transfers may involve constraints on the amount of bag data that a -receiver can receive at one time. It may be practical to split a -large bag into several smaller bags. - - - -Transmitting a whole bag in serialized form as a single file will tend -to be the most straightforward mode of transfer. When throughput is a -priority, use of "fetch.txt" lends itself to an easy, application-level -parallelism in which the list of URL-addressed items to fetch is divided -among multiple processes. -The mechanics of sending and receiving bags over networks is otherwise -out of scope of the present document and may be facilitated by protocols -such as and . - - -
- -
- - -This section is not part of the BagIt specification. It describes some -practical considerations for bag creators and receivers circa 2010. - - -
- - -Some cautions regarding bag interchange arise in regard to the -commonly available checksum tools distributed with the GNU Coreutils -package (md5sum, sha1sum, etc.), collectively referred to here as -"md5sum". First, md5sum can be run in binary or text -mode; text mode sometimes normalizes line-endings. While these -modes appear to produce the same checksums under Unix-like systems, they -can produce different checksums under Windows. When using md5sum, it -may be safest to run it in binary mode, with one caveat: a side-effect -of binary mode is that md5sum requires a space and an asterisk ('*'), -compared to two spaces in text mode, between the CHECKSUM and FILENAME in -its manifest format. - - - -Due to the widespread use of md5sum (and its relatives), it is not -unexpected for bag receivers to see manifests in which CHECKSUM and -FILENAME are separated by a space followed by an asterisk. Implementors -creating or processing bags with md5sum should be aware of these subtle -differences, and ensure compliance with the manifest specification in this -document. Implementors creating and processing bags with other tools may wish -to be tolerant of asterisks found in the manifests. - - -A final note about md5sum-generated manifests is that for a -FILENAME containing a backslash ('\'), the manifest line will have a -backslash inserted in front of the CHECKSUM and, under Windows, the -backslashes inside FILENAME may be doubled. - - -
- -
- - -As specified above, only the Unix-based path separator ('/') may be -used inside filenames listed in BagIt manifests and "fetch.txt" files. -When bags are exchanged between Windows and Unix platforms, care should -be taken to translate the path separator as needed. Receivers of bags on -physical media should be prepared for filesystems created under either -Windows or Unix. Besides the fundamental difference between path -separators ('\' and '/'), generally, Windows filesystems have more -limitations than Unix filesystems. Windows path names have a maximum of -255 characters, and none of these characters may be used in a path -component: - -
- - < > : " / | ? * - -
- -Windows also reserves the following names: CON, PRN, AUX, NUL, COM1, -COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3, LPT4, -LPT5, LPT6, LPT7, LPT8, and LPT9. See for more -information. -
- -
- -
-
- -
- - -BagIt owes much to many thoughtful contributers and reviewers, including -Stephen Abrams, Mike Ashenfelder, Dan Chudnov, Brad Hards, Scott Fisher, Keith Johnson, Erik -Hetzner, Leslie Johnston, David Loy, Mark Phillips, Tracy Seneca, Brian Tingle, Adam Turoff, and Jim Tuttle. - - -
- - -This draft does not request any action from IANA. - - -
- -
- - - - - - - - - A Collaboration Model between Archival Systems to Enhance - the Reliability of Preservation by an Enclose-and-Deposit Method - - - - - - - - - The GrabIt File Exchange Protocol - - - - - - - - - Naming a File - - - - - - - &rfc1321; - &rfc2119; - &rfc3174; - &rfc3629; - &rfc3986; - - - - Simple Web-service Offering Repository Deposit (SWORD) - - - - - - - - -
- - -(This appendix to be removed in the final draft.) - - -
- -Allowing tag directories. - - - -Fixed definition of valid. - - - -Clarified that tag files do not need to be text files. - - - -Clarified that repeatability and ordering of metadata elements in bag-info.txt. - - - -Clarified case of hex-encoding in manifests. - - -
- -
- -Re-replaced entity reference for current version number in artwork, -where it doesn't appear to work (xml2rfc bug?). -Updated to latest IETF Trust Legal Provisions 200902. (jak) - - - -Re-wording Tag File Format section. - - - -Adding new section for Other Tag Files. - - - -Minor clarification on the Fetch File description. - - - -Synchronized the language between the Payload Manifest and the Tag Manifest sections. - - - -Minor grammatical corrections and clarifications to the Payload Manifest section. - - - -Re-worded and re-ordered payload section and structure intro. Except for the base directory naming, the structure intro is strictly explanatory. - - - -Replaced current version number with entity reference. - - - -Move checksum algorithm information into its own section. - - - -Major re-wording of section on validity and completeness to provide -explicit, enumerated definitions for "valid", "complete", and "incomplete" bags. - - - -Added explicit wording about byte order marks (BOM) in UTF-8. - - - -Re-named section titles for better clarity. - - - -Re-wording security consideration on checksum purposes to more accurately -reflect the real purposes of the checksums. - - - -Major restructuring of the document for brevity and -precision. - - - -Added RFC 2119 language. - - - -Added terminology section. - - - -Cleaning up example artwork so that parenthesis are more consistently used. - - - -Explicitly stated version number required for comforming to the current -version of the specification. - - - -Various minor tweaks to grammar and wording. - -
- -
- -Re-worded interoperability statement in the Introduction. (Justin) - - - -Added statements regarding no limitations on various paths, URI, and other -lengths. - - - -Clarified that the bag directory may not contain any other directories except -for the "data" directory. - - - -A soel carriage return character is now explicitly allowed as a valid line -separator. - - - -Tag file encoding requirements are now required to be as-stated in the -"bagit.txt". The "bagit.txt" file is explicitly required to be in UTF-8. - - - -Wording cleanup, clarifying payload file manifests and tag file manifests. - - - -Tags in "bag-info.txt" no longer have any ordering requirement. - - - -Tag formatting now explicitly states where significant whitespace begins -in the tag. - - - -After some consideration, added some security considerations. - - - -Made it clear that a bag may contain other bags, re: serialization. - - - -Re-worded interoperabiilty to concerns to require creators to be -spec-compliant, and readers to be tolerant of known potential issues. - - - -Specificity to the FILENAME element in "fetch.txt" is relative to the bag -root, and to make sure to treat leading slashes as relative. - - - -Updated acknowledgements. - - - -Various other minor edits for clarity and readibility. - -
- -
- - -Added language to require the slash ('/') as path separator, -regardless of the platform where the bag was created. -Added an extra co-author and an Acknowledgements section. - - - -Deleted the unnecessary "(optional)" from four of the metadata elements, -since all metadata elements are optional. Softened the equivalence of -the serialization name and name of the contained bag base directory. -Replaced the reference to RFC2822 with an inline description of the -simpler bag-info.txt format. - - - -Changed to a variable linear whitespace separator in the description -of manifest layout and in manifest examples. -Added two paragraphs under a new "Checksum tools" subsection of the -Interoperability section to describe some of the peculiarities of -dealing with the widely used GNU Coreutils checksum tools. - - - -With the new version, 0.96, there is an important and incompatible change -of file name (package-info.txt -> bag-info.txt), metadata element names -(Package-Size -> Bag-Size, Packing-Date -> Bagging-Date), and -descriptive language to replace the noun "package" with "bag" throughout -the spec. This was to reduce unnecessary synonymy and free up the noun -"package" to name the physical container (e.g., a mailing carton) used to -transfer hard disks. - - - -In section 7, another important change is the introduction of the -Payload-Oxum ("octetstream sum") metadata element to convey precise, -machine-readable payload size information for capacity planning -(especially useful when preparing to receive files listed in fetch.txt). -The Bag-size definition was adjusted to steer it more towards human -consumption. - - - -In section 2.2 the spec now requires exactly two spaces between checksum -and filename in manifests. This results from the experience that as of -2008, not all widely available validation tools are flexible in the -kind of separating whitespace recognized. The examples have been -updated to include use the two-space form as well. 
- - - -Comment added that while overall bag size is unlimited, practical -limitations on the amount of data that a receiver can stage may -warrant splitting a large bag into several smaller bags. - - - -Added a reference to the SWORD protocol. - - - -Minor edits for scanning and reformatting to cut down line length for -some figures that exceeded 72 chars (limit for Internet-Drafts). - - -
- -
- - -Added mention of preserving empty directories. - - - -Simplified function of "tag checksum file" to "tag manifest", having same -format as payload manifest. The tag manifest is optional and need not -include every tag file. - - - -Loosened interpretation of payload manifest to "union" concept: -every payload file must be listed in at least one manifest but -need not be listed in every manifest. - - - -Shortened the Introduction's first paragraph to be less duplicative -of text in the Abstract. - - - -Changed Delivery-Date to Packing-Date. - - - -Correctly sorted the author list and clarification of -deserialization wording. - - -
- -
- - -Author address corrections and miscellaneous stylistic edits. - - - -Added some mention of physical media-based transfers, preferred -characteristics of transfer filesystems, and network transfer issues. - - - -Added basic bag example early and changed the narrative to more clearly -delineate component files. - - - -Wording changes under fetch.txt, and note that fetch.txt will need to be -modified before bag return. - - - -Fixed checksum encoding reference to base64 rather than hex. (B. Vargas) - - - -Described simple normalization approach for checksum algorithm names. (B. Vargas) - - - -In the example bag, add the ARC files found in the fetch.txt to the manifest as well (A. Turoff) - - -
+| bag-info.txt +| (Internal-Sender-Description: Download link found at ) +| ( https://www.loc.gov/resource/highsm.23364/ ) + +
+
+ +
+
+ + The paths specified in the payload manifests, tag manifests, and + fetch files do not prohibit directory characters that have + special meaning on some operating systems. Implementers MUST ensure + that files outside the bag directory structure are not accessed when + reading or writing files based on paths specified in a bag. + + + All implementations SHOULD have a test suite to guard against + special directory characters. + + + For example, a maliciously crafted "tagmanifest-sha512.txt" file might + contain entries that begin with a path character such as "/", "..", + or a "~username" home directory reference in an attempt to cause a + naive implementation to leak or overwrite targeted files on a POSIX + operating system. + + + Implementers targeting Windows SHOULD test their implementations to ensure + that safety checks prevent use of drive letters and the less commonly used + namespace sequences (e.g., "\\?\C:\...") described in "Naming Files, Paths, and Namespaces". + + + To assist implementers, the Library + of Congress conformance suite + has some tests for invalid bags + that are expected to fail on POSIX or Windows clients. + +
+
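A path check along the lines described above might look like the following Python sketch. It is deliberately conservative and is not a complete defense: the function name `is_safe_bag_path` is hypothetical, and rejecting every backslash is an assumption made here to rule out Windows separators and namespace prefixes, even though a backslash can be a legal filename character on POSIX systems.

```python
from pathlib import PurePosixPath


def is_safe_bag_path(filepath):
    """Return True if a manifest or fetch path stays inside the bag."""
    # Absolute paths, home-directory references, and backslashes (which
    # could smuggle Windows separators or "\\?\" namespaces) are refused.
    if filepath.startswith(("/", "~")) or "\\" in filepath:
        return False
    parts = PurePosixPath(filepath).parts
    # An empty path or any ".." component could escape the bag directory.
    if not parts or ".." in parts:
        return False
    # A first component such as "C:" would select a Windows drive.
    if len(parts[0]) >= 2 and parts[0][1] == ":":
        return False
    return True
```

A tool would apply such a check to every path read from a manifest or fetch file before opening or writing the corresponding file.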
+ + Implementers of tools that complete bags by retrieving URLs listed in + a fetch file need to be aware that some of those URLs might point + to hosts, intentionally or unintentionally, that are not under control + of the bag's sender. Moreover, older checksum algorithms, even if + reasonable for detecting corruption during transit, may not offer strong + cryptographic protection against intentional spoofing. + +
+ +
+ + The size of files, as optionally reported in the fetch file, + cannot be guaranteed to match the actual file size to be downloaded. + Implementers SHOULD take steps to monitor and abort transfer when the + received file size exceeds the file size reported in the fetch file. + Implementers SHOULD NOT use the file size in the + fetch file for critical resource allocation, such as buffer + sizing or storage requisitioning. + +
+ +
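Aborting a transfer once it exceeds the declared size can be done while copying the stream. The Python sketch below is illustrative: the function name `copy_with_limit` and its parameters are hypothetical, and `src` and `dst` stand for any binary file-like objects, such as an HTTP response body and a local file.

```python
import io


def copy_with_limit(src, dst, expected_size, chunk_size=64 * 1024):
    """Copy src to dst, aborting if more than expected_size octets arrive."""
    received = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        received += len(chunk)
        # expected_size is None when fetch.txt gave "-" for the length.
        if expected_size is not None and received > expected_size:
            raise ValueError("received more data than the fetch file declared")
        dst.write(chunk)
    return received


# Usage with in-memory streams standing in for a download and a local file.
copied = copy_with_limit(io.BytesIO(b"x" * 10), io.BytesIO(), 10)
```

Because the data is written chunk by chunk, no buffer is ever sized from the untrusted length value.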
+ + The integrity assurance provided by manifests is designed to provide + high levels of confidence against data corruption but is not designed + to be secure against active attacks. Organizations that need to + secure bags against such threats SHOULD agree on additional + measures, such as digital signatures, that are out + of scope for this specification. + +
+ + + + +
+ +
+
+ + This section lists practical considerations for implementers and + users. None of the points below are required, but they are recommended + for general-purpose usage. + + + + Upon discovering errors in bags, an implementation is free to take action + (for example, logging or reporting) in an application-specific manner. + This document does not mandate any particular action. + -
- + + The Library of Congress conformance suite + is provided as a public resource to test new implementations for compatibility and + error handling. + +
+ + This section provides background information on various challenges caused by + differences in how operating systems, filesystems, and common tools handle + filenames. This section is followed by a list of recommendations for implementers in + . + + +
+ + There are three challenges for interoperability related to filename case: + + Filesystems such as File Allocation Table (FAT) or Extended File + Allocation Table (EXFAT) always convert filenames to uppercase: + "example.txt" will be stored as "EXAMPLE.TXT". + + Many Unix filesystems save filenames exactly as provided, which allows + multiple files that differ only in case: "example.txt" and + "Example.txt" are separate files. + + New Technology File System (NTFS) and Apple's Hierarchical File System + (HFS) Plus usually preserve case when storing files but are + case insensitive when retrieving them. A file saved as "Example.txt" + will be retrieved by that name but will also be retrieved as + "EXAMPLE.TXT", "example.txt", etc. + + +
+
+ +The Unicode specification has common cases where different character sequences +produce the same human-meaningful text. +Such sequences are referred to as "canonically equivalent", and the Unicode +specification defines several normalization forms; see Unicode Standard Annex #15 for the full details. +
+ +The example below shows the common surname "Nunez" normalized in different forms. + + +
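The difference between normalization forms can be made visible by listing code points. The following Python sketch uses the standard `unicodedata` module; the helper name `code_points` is hypothetical, and the surname is written with escape sequences so that the source text itself is unambiguous.

```python
import unicodedata


def code_points(form, text):
    """Return the code points of text after applying a normalization form."""
    normalized = unicodedata.normalize(form, text)
    return [f"U+{ord(char):04X}" for char in normalized]


surname = "N\u00fa\u00f1ez"  # "Nunez" with an acute accent and a tilde

# NFC keeps each accented letter as one precomposed code point.
print("NFC", " ".join(code_points("NFC", surname)))
# NFD splits each accented letter into a base letter plus a combining mark.
print("NFD", " ".join(code_points("NFD", surname)))
```

The NFC form has five code points and the NFD form has seven, although both render as the same visible text.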
+ + Unicode normalization is relevant to BagIt implementers because different + systems have different standards for normalization: + + + Apple's HFS Plus filesystem always normalizes filenames to a + fully decomposed form based on the Unicode 2.0 specification (see Technical Note TN1150). + + Windows treats filenames as opaque character sequences (see "Naming Files, Paths, and Namespaces") and will store and return the encoded bytes exactly + as provided. + + Linux and other common Unix systems are generally similar to Windows in + storing and returning opaque byte streams, but this behavior is + technically dependent on the filesystem. + + Utilities used for file management, transfer, and archiving may ignore this + issue, apply an arbitrary normalization form, or allow the user to control + how normalization is applied. + + + + In practice, this means that the encoded filename stored in a manifest may + fail a simple file existence check because the filename's normalization was + changed at some point after the manifest was written. This situation is very + confusing for users because the filenames are visually indistinguishable, and + the "missing" file is obviously present in the payload directory. + +
+
+ + + + Implementations SHOULD discourage the creation of bags containing + files whose names differ only in case. + + + Implementations SHOULD prevent the creation of bags containing files + whose names differ only in normalization form. + + + BagIt implementations SHOULD tolerate differences in normalization + form by comparing the list of filesystem names with the list of manifest names after + applying the same normalization form to both. + + + Implementations SHOULD issue a warning when multiple manifests are + present that differ only in case or normalization form. + + + +
+
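The recommendations above can be sketched with the standard `unicodedata` module. This Python illustration makes two assumptions that the text does not mandate: NFC is chosen as the common normalization form, and `str.casefold` is used to detect names that differ only in case. The function names are hypothetical.

```python
import unicodedata


def nfc(name):
    """Normalize a filename to NFC so that equivalent names compare equal."""
    return unicodedata.normalize("NFC", name)


def compare_names(manifest_names, filesystem_names):
    """Return (listed but not found, found but not listed) after normalizing."""
    listed = {nfc(name) for name in manifest_names}
    present = {nfc(name) for name in filesystem_names}
    return sorted(listed - present), sorted(present - listed)


def find_collisions(names):
    """Group names that differ only in case or normalization form."""
    groups = {}
    for name in names:
        groups.setdefault(nfc(name).casefold(), []).append(name)
    return [group for group in groups.values() if len(group) > 1]
```

A bag creation tool could call `find_collisions` on the payload file list and refuse or warn when it returns any group, while a validator could use `compare_names` in place of a plain file existence check.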
+
+ + As specified above, only the Unix-based path separator ('/') may be + used inside filenames listed in BagIt manifest and fetch.txt files. + When bags are exchanged between Windows and Unix platforms, + the path separator SHOULD be translated as needed. Receivers + of bags on physical media SHOULD be prepared for filesystems created + under either Windows or Unix. Besides the fundamental difference + between path separators ('\' and '/'), generally, Windows + filesystems have more limitations than Unix filesystems. + +
+ + Windows path names have a maximum of + 255 characters, and none of these characters may be used in a path + component: + + + + < > : " / | ? * +
+
+ + Windows also reserves the following names, with or without a file extension: + + + CON, PRN, AUX, NUL + COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9 + LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, LPT9 +
+ + See "Naming Files, Paths, and Namespaces" for more information and possible alternatives. + +
+
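A bag creator that wants its bags to unpack cleanly on Windows could screen paths against the restrictions listed above. The Python sketch below is illustrative: the names `RESERVED_NAMES`, `FORBIDDEN_CHARS`, and `windows_problems` are hypothetical, and the character and name lists are copied from this section rather than from an exhaustive source.

```python
RESERVED_NAMES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{n}" for n in range(1, 10)}
    | {f"LPT{n}" for n in range(1, 10)}
)
# "/" is omitted because it is the BagIt path separator and is split on.
FORBIDDEN_CHARS = set('<>:"|?*')


def windows_problems(filepath):
    """Return a list of reasons why filepath may not be usable on Windows."""
    problems = []
    for component in filepath.split("/"):
        # Reserved names apply with or without a file extension.
        stem = component.split(".", 1)[0].upper()
        if stem in RESERVED_NAMES:
            problems.append(f"reserved name: {component}")
        if FORBIDDEN_CHARS & set(component):
            problems.append(f"forbidden character in: {component}")
    return problems
```

For example, `windows_problems("data/aux.txt")` reports one problem because "AUX" is reserved even when an extension is present.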
+ +Some bags have been manually assembled using checksum utilities such as those +contained in the GNU Coreutils package (md5sum, sha1sum, etc.), collectively +referred to here as "md5sum". Implementers who desire wide support of legacy +content should be aware of some known quirks of these tools. + + +md5sum can be run in "text mode", which causes it to normalize line endings +on some operating systems. On Unix-like systems, both modes will usually produce +the same results; on systems like Windows, they can produce different results +based on the file contents. + +The md5sum output format has two characters between the checksum and the +filepath: the first is always a space, and the second is an asterisk ("*") for +binary mode and a space for text mode. + + +A final note about md5sum-generated manifests is that, for a filepath containing +a backslash ('\'), the manifest line will have a backslash inserted in front of +the checksum and, under Windows, the backslashes inside +filepath can be doubled. + + +Implementers MAY wish to accept this format by ignoring a leading asterisk or +handling differences in line termination gracefully but, if so, implementations +MUST warn the user that the bag in question will fail strict validation. In +such cases, it is RECOMMENDED that tools provide an easy option to +update the bag with valid manifests. + +
+
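A tolerant reader for md5sum-style lines, following the quirks described above, might look like this Python sketch. The function name `parse_md5sum_line` and its return shape are hypothetical, and only the doubled-backslash case mentioned in the text is undone. The returned flags let the caller warn the user that the bag will fail strict validation.

```python
def parse_md5sum_line(line):
    """Return (checksum, filepath, binary_mode, escaped) for one md5sum line."""
    line = line.rstrip("\r\n")
    # A leading backslash marks a line whose filepath contains escapes.
    escaped = line.startswith("\\")
    if escaped:
        line = line[1:]
    checksum, sep, rest = line.partition(" ")
    if not sep or not rest:
        raise ValueError(f"not an md5sum line: {line!r}")
    # The second separator character is "*" (binary mode) or " " (text mode).
    mode, filepath = rest[0], rest[1:]
    if mode not in (" ", "*"):
        raise ValueError(f"unexpected mode character: {mode!r}")
    if escaped:
        filepath = filepath.replace("\\\\", "\\")
    return checksum, filepath, mode == "*", escaped
```

A tool that accepts such lines can use the `binary_mode` and `escaped` flags to decide when to warn and when to offer rewriting the manifest in the strict format.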
+ +
+ +
+ +The Augmented Backus-Naur Form (ABNF) rules provided below are non-normative. If +there is a discrepancy between requirements in the normative sections and +the ABNF, the requirements in the normative sections prevail. Some +definitions use the core rules (e.g., DIGIT, HEXDIG, etc) as defined in +. + +
+
+ bagit.txt ABNF rules: + +
+
+ +
+
+ Payload Manifest ABNF rules: + +
+
+ +
+
+ bag-info.txt ABNF rules: + +
+
+ +
+
+ fetch.txt ABNF rules:
+
+length = 1*DIGIT / "-"
+filepath = ("data/"
+ 1*( unreserved / pct-encoded / sub-delims ))
+ending = CR / LF / CRLF
+
+ +
+ +
+ +This document has no IANA actions. + +
+
+ + + + &RFC1321; + &RFC2119; + &RFC6920; + + + + Named Information Hash Algorithm + IANA + + + + + + Character Set + IANA + + + + &RFC3174; + &RFC3629; + &RFC3986; + &RFC6234; + &RFC8174; + + + + + + A Collaboration Model between Archival Systems to Enhance the Reliability of Preservation by an Enclose-and-Deposit Method + + + + + + + + + + + + + Naming Files, Paths, and Namespaces + Microsoft, Inc. + + + + + + &RFC5234; + + + + Unicode Standard Annex #15: Unicode Normalization Forms + Unicode Consortium + + + + + + + + + Technical Note TN1150: HFS Plus Volume Format + Apple Inc. + + + + + + + Test cases for validating Bagit Implementations + The Library of Congress + + + + + + +
+ BagIt benefitted from the thoughtful assistance of Stephen Abrams, + Mike Ashenfelder, Dan Chudnov, Dave Crocker, Scott Fisher, Brad Hards, + Erik Hetzner, Keith Johnson, Leslie Johnston, David Loy, Mark Phillips, + Tracy Seneca, Stian Soiland-Reyes, Brian Tingle, Adam Turoff, and Jim + Tuttle. + +
+
+ Additional contributors to the authoring of BagIt are Andy Boyko, + David Brunton, Rosie Storey, Ed Summers, Brian Vargas, and Kate + Zwaard. +
+
diff --git a/makefile b/makefile deleted file mode 100644 index 58e03ec3..00000000 --- a/makefile +++ /dev/null @@ -1,7 +0,0 @@ -default: html text - -html: - xml2rfc bagit.xml - -text: - xml2rfc --html bagit.xml