BagIt
BagIt is a hierarchical file packaging format designed to support disk-based storage and network transfer of arbitrary digital content. A "bag" consists of a "payload" (the arbitrary content) and "tags", which are metadata files intended to document the storage and transfer of the bag. A required tag file contains a manifest listing every file in the payload together with its corresponding checksum. The name, BagIt, is inspired by the "enclose and deposit" method, sometimes referred to as "bag it and tag it".

Bags are ideal for digital content normally kept as a collection of files. They are also well-suited to the export, for archival purposes, of content normally kept in database structures that receiving parties are unlikely to support. Relying on cross-platform (Windows and Unix) filesystem naming conventions, a bag's payload may include any number of directories and sub-directories (folders and sub-folders). A bag can specify payload content indirectly via a "fetch.txt" file that lists URLs for content that can be fetched over the network to complete the bag; simple parallelization (e.g., running 10 instances of "wget") can exploit this feature to transfer large bags very quickly. Benefits of bags include:
  • Wide adoption in digital libraries (e.g., the Library of Congress).
  • Easy to implement using ubiquitous and ordinary filesystem tools.
  • Content that originates as files need only be copied to the payload directory.
  • Compared to XML wrapping, content need not be encoded, saving time and storage space.
  • Received content is ready-to-go in a familiar filesystem tree.
  • Easy to implement fast network transfer by running ordinary transfer tools in parallel.
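
The parallel retrieval via "fetch.txt" described above can be sketched in Python. This is a minimal illustration under stated assumptions, not part of the specification text quoted here: it assumes each "fetch.txt" line holds a URL, a length in bytes (or "-") and a path inside the bag, separated by whitespace, and the names parse_fetch, download and complete_bag are hypothetical.

```python
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def parse_fetch(bag_dir):
    """Yield (url, length, relative_path) for each non-empty line of fetch.txt."""
    with open(os.path.join(bag_dir, "fetch.txt"), encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            # Assumed layout: URL, length in bytes (or "-"), path inside the bag.
            url, length, path = line.split(None, 2)
            yield url, length, path


def download(url, dest):
    """Default downloader: fetch one URL into dest using the standard library."""
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    urllib.request.urlretrieve(url, dest)


def complete_bag(bag_dir, workers=10, downloader=download):
    """Fetch every fetch.txt entry, running up to `workers` downloads at once."""
    jobs = [(url, os.path.join(bag_dir, path)) for url, _, path in parse_fetch(bag_dir)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for all downloads and re-raises the first error, if any.
        list(pool.map(lambda job: downloader(*job), jobs))
    return [dest for _, dest in jobs]
```

Passing the downloader as a parameter keeps the sketch easy to try without network access; a caller who prefers "wget" or another transfer tool could substitute a function that invokes it.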

Specification

BagIt is currently defined in an IETF internet draft that defines a simple file naming convention used by the digital curation community for packaging up arbitrary digital content, so that it can be reliably transported both via physical media (hard disk drive, CD-ROM, DVD) and via network transfers (FTP, HTTP, rsync, etc.). BagIt is also used for managing the digital preservation of content over time. Discussion about the specification and its future directions takes place on the Digital Curation discussion list.

The BagIt specification is organized around the notion of a “bag”. A bag is a named file system directory that minimally contains:
  • a “data” directory that includes the payload, or data files that comprise the digital content being preserved. Files can also be placed in subdirectories, but empty directories are not supported
  • at least one manifest file that itemizes the filenames present in the “data” directory, as well as their checksums. The particular checksum algorithm is included as part of the manifest filename. For instance a manifest file with MD5 checksums is named “manifest-md5.txt”
  • a “bagit.txt” file that identifies the directory as a bag, the version of the BagIt specification that it adheres to, and the character encoding used for tag files


On receipt of a bag, a piece of software can examine the manifest file to make sure that the payload files are present and that their checksums are correct. This allows accidentally removed or corrupted files to be identified. Below is an example of a minimal bag “myfirstbag” that encloses two payload files. The contents of the tag files are shown below their filenames.


myfirstbag/
|-- data
|   \-- 27613-h
|       \-- images
|           |-- q172.png
|           \-- q172.txt
|-- manifest-md5.txt
|     49afbd86a1ca9f34b677a3f09655eae9 data/27613-h/images/q172.png
|     408ad21d50cef31da4df6d9ed81b01a7 data/27613-h/images/q172.txt
\-- bagit.txt
      BagIt-Version: 0.96
      Tag-File-Character-Encoding: UTF-8


In this example the payload happens to consist of a Portable Network Graphics image file and an Optical Character Recognition text file. In general the identification and definition of file formats is out of the scope of the BagIt specification. File attributes are not covered either.
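
The receipt check described above can be sketched in Python using only the standard library. This is an illustrative sketch rather than a reference implementation: it assumes manifest lines of the form "checksum path", as in the example bag, and the names file_md5 and verify_bag are hypothetical.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_bag(bag_dir, manifest="manifest-md5.txt"):
    """Return (missing, corrupted): payload paths listed in the manifest
    that are absent, or whose checksum differs from the recorded one."""
    missing, corrupted = [], []
    with open(os.path.join(bag_dir, manifest), encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            # Assumed layout: checksum, whitespace, path relative to the bag.
            expected, rel_path = line.split(None, 1)
            full_path = os.path.join(bag_dir, rel_path)
            if not os.path.isfile(full_path):
                missing.append(rel_path)
            elif file_md5(full_path) != expected.lower():
                corrupted.append(rel_path)
    return missing, corrupted
```

For the example bag, verify_bag("myfirstbag") would return two empty lists if both payload files are present and unchanged. The sketch only checks files named in the manifest; it does not report payload files that the manifest omits.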

The specification allows for several optional tag files (in addition to the manifest). Their character encoding must be identified in “bagit.txt”, which itself must always be encoded in UTF-8. The specification defines the following optional tag files:
  • a “bag-info.txt” file which details metadata for the bag, using colon-separated key/value pairs (similar to HTTP headers)
  • a tag manifest file which lists tag files and their associated checksums (e.g. “tagmanifest-md5.txt”)
  • a “fetch.txt” file that lists URLs from which payload files can be retrieved, in addition to or in place of payload files in the “data” directory
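
The colon-separated key/value layout used by “bag-info.txt” (and by “bagit.txt”) can be read with a few lines of Python. This is a minimal sketch: the function name parse_tag_file is hypothetical, the sample values are invented for illustration, and the handling of indented continuation lines is an assumption about how long values are wrapped.

```python
def parse_tag_file(text):
    """Parse 'Label: value' lines into a list of (label, value) pairs.

    A list is used instead of a dict so that repeated labels are preserved.
    """
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and pairs:
            # Assumption: an indented line continues the previous value.
            label, value = pairs[-1]
            pairs[-1] = (label, value + " " + line.strip())
            continue
        # Split on the first colon only, so values may contain colons (e.g. URLs).
        label, _, value = line.partition(":")
        pairs.append((label.strip(), value.strip()))
    return pairs


sample = "Source-Organization: Example Library\nBagging-Date: 2010-01-15\n"
print(parse_tag_file(sample))
# [('Source-Organization', 'Example Library'), ('Bagging-Date', '2010-01-15')]
```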


The draft also describes how to serialize a bag in an archive file, such as ZIP or TAR.
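
Serialization of this kind can be sketched with Python's standard shutil.make_archive. The function name serialize_bag is hypothetical, and keeping the bag directory as the single top-level entry of the archive is a design choice of this sketch rather than something stated in the text above.

```python
import os
import shutil


def serialize_bag(bag_dir, fmt="zip"):
    """Pack bag_dir into an archive placed next to it and return the archive path.

    fmt is any format name accepted by shutil.make_archive, e.g. "zip" or "tar".
    """
    bag_dir = os.path.abspath(bag_dir)
    parent, name = os.path.split(bag_dir)
    # root_dir/base_dir make the archive contain "name/..." rather than bare files;
    # make_archive appends the extension for the chosen format.
    return shutil.make_archive(os.path.join(parent, name), fmt,
                               root_dir=parent, base_dir=name)
```

Calling serialize_bag("myfirstbag", "tar") would write "myfirstbag.tar" beside the bag directory; unpacking it recreates the "myfirstbag" directory with its tag files and payload.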

History

The BagIt specification was a natural outgrowth of work done by the Library of Congress and the California Digital Library in transferring digital content created as part of the National Digital Information Infrastructure and Preservation Program. The origins of the idea date back to work done at the University of Tsukuba on the "enclose and deposit" model, for mutually depositing archived resources to enable long-term digital preservation. The use of manifests and checksums is fairly common practice, as evidenced by their use in the ZIP file format, the Deb package format, and on public FTP sites.

In 2007 the California Digital Library needed to transfer several terabytes of content (largely Web archiving data) to the Library of Congress. The BagIt specification allowed the content to be packaged up in "bags" with package metadata and a manifest that detailed file checksums, which were later verified on receipt of the bags. The specification was written up as an IETF draft by John Kunze in December 2008, and it has since seen several revisions. In 2009 the Library of Congress produced a video that describes the specification and the use cases around it.

Use

  • The Library of Congress is using the BagIt specification in several projects, including its Content Transfer Services, which allow digital content to be inventoried and copied to production access and storage environments.
  • Archivematica is an open source digital preservation system which uses BagIt to create OAIS Archival Information Packages (AIP).
  • Ghent University library is using the BagIt specification as an archival format for its digital collections and as an interchange format when adding new external collections (such as Google Books) to the local repositories.
  • The Dryad Data Repository, a repository of data underlying scientific publications, is using the BagIt specification to share data and related metadata with TreeBASE, a repository of phylogenetic information.
  • Towards Interoperable Preservation Repositories (TIPR) is a partnership between the Florida Center for Library Automation, Cornell University and New York University to develop, test and promote a standard interchange format for exchanging information packages among OAIS-based repositories. The proposed RXP format is using the BagIt specification to exchange package bundles via HTTP.
  • The Stanford Digital Repository (SDR) uses BagIt as the primary transfer format for content being deposited into the SDR.
  • Chronopolis, a large scale preservation system, uses BagIt as the transfer format for content that is deposited into the system.
  • The University of North Texas Libraries uses the BagIt specification as an archival container format in its digital repository and as an interchange format for importing and exporting digital objects from its repository.
  • The ERIS software from the Central Connecticut State University Library uses BagIt to verify archival packages that are deposited on Amazon S3.

Tools

The BagIt specification was designed for ease of use with familiar Unix utilities such as md5deep. However, several BagIt-specific tools have been created that can ease bag creation in several programming environments.
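
To illustrate how little is needed to build a bag by hand, the following Python sketch does roughly what a copy plus a checksum utility would do. It is not one of the BagIt-specific tools mentioned above: the function name create_bag is hypothetical, and the default version string simply mirrors the example bag shown earlier.

```python
import hashlib
import os
import shutil


def create_bag(source_dir, bag_dir, version="0.96"):
    """Copy source_dir into bag_dir/data, then write manifest-md5.txt and bagit.txt."""
    data_dir = os.path.join(bag_dir, "data")
    shutil.copytree(source_dir, data_dir)  # creates bag_dir and data/; fails if data/ exists

    entries = []
    for folder, _, files in os.walk(data_dir):
        for name in files:
            path = os.path.join(folder, name)
            with open(path, "rb") as fh:
                # Reads each file whole; acceptable for a sketch, not for huge files.
                checksum = hashlib.md5(fh.read()).hexdigest()
            # Manifest paths are written relative to the bag, with "/" separators.
            rel = os.path.relpath(path, bag_dir).replace(os.sep, "/")
            entries.append((rel, checksum))

    with open(os.path.join(bag_dir, "manifest-md5.txt"), "w", encoding="utf-8") as fh:
        for rel, checksum in sorted(entries):
            fh.write(f"{checksum} {rel}\n")
    with open(os.path.join(bag_dir, "bagit.txt"), "w", encoding="utf-8") as fh:
        fh.write(f"BagIt-Version: {version}\nTag-File-Character-Encoding: UTF-8\n")
    return bag_dir
```

Note that empty directories in source_dir are copied but never appear in the manifest, which matches the remark above that empty directories are not supported.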

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 