Sanitization is the process of removing sensitive information from a document or other medium, so that it may be distributed to a broader audience. When dealing with classified information, sanitization attempts to reduce the document's classification level, possibly yielding an unclassified document. Originally, the term sanitization was applied to printed documents; it has since been extended to apply to computer media and the problem of data remanence as well.
Redaction generally refers to the editing or blacking out of text in a document, or to the result of such an effort. It is intended to allow the selective disclosure of information in a document while keeping other parts of the document secret. Typically the result is a document that is suitable for publication, or for dissemination to people other than the intended audience of the original document. For example, when a document is subpoenaed in a court case, information not specifically relevant to the case at hand is often redacted.
Government secrecy
In the context of government documents, redaction (also called sanitization) generally refers more specifically to the process of removing sensitive or classified information from a document prior to its publication, during declassification.
Secure document redaction techniques
The traditional technique of redacting confidential material from a paper document before its public release involves crossing out portions of text with a wide black pen, followed by photocopying the result. This process is easy to understand and carries only minor risks. For example, if the black pen is not wide enough, careful examination of the resulting photocopy may still reveal partial information about the text, such as the difference between short and tall letters. The exact length of the removed text also remains recognizable, which may help a reader guess plausible wordings for shorter redacted sections. Where computer-generated proportional fonts were used, even more information can leak out of the redacted section in the form of the exact position of nearby visible characters.
The National Archives (UK) published a document, Redaction Toolkit, Guidelines for the Editing of Exempt Information from Documents Prior to Release (2004), "to provide guidance on the editing of exempt material from information held by public bodies."
Secure redacting is a far more complicated problem with word processing file formats. Such formats may save a revision history of the edited text that still contains the redacted text. In some file formats, unused portions of memory are saved that may still contain fragments of previous versions of the text. Where text is redacted by overlaying graphical elements (usually black rectangles) on top of text, the original text remains in the file and can be uncovered by simply deleting the overlaying graphics. Effective redaction of electronic documents requires the actual removal of the text or image data from the document file. This requires either a very detailed understanding of the internal operation of the document processing software and file formats used, which most computer users lack, or software tools designed for sanitizing electronic documents (see external links below).
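To illustrate the revision-history risk, the sketch below scans WordprocessingML (the XML stored inside a .docx file) for text that change tracking has marked as deleted but still stores. It is a minimal sketch, not a sanitization tool: the function name and the sample fragment are hypothetical, and it checks only tracked deletions, not comments, metadata, or other places where text can persist.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by .docx files.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def find_tracked_deletions(document_xml: str) -> list[str]:
    """Return text that tracked changes mark as deleted but still store."""
    root = ET.fromstring(document_xml)
    leftovers = []
    # Deleted runs sit inside <w:del>; their text is held in <w:delText>.
    for deleted in root.iter(f"{{{W_NS}}}del"):
        text = "".join(t.text or "" for t in deleted.iter(f"{{{W_NS}}}delText"))
        if text:
            leftovers.append(text)
    return leftovers


# Hypothetical fragment standing in for word/document.xml.
SAMPLE = f"""<w:document xmlns:w="{W_NS}"><w:body><w:p>
<w:r><w:t>The source was </w:t></w:r>
<w:del w:id="1" w:author="editor"><w:r><w:delText>Agent Smith</w:delText></w:r></w:del>
<w:r><w:t>[redacted].</w:t></w:r>
</w:p></w:body></w:document>"""

print(find_tracked_deletions(SAMPLE))
```

In this fragment the visible text reads "[redacted]", yet the deleted name is still present in the file. A real check would read word/document.xml out of the .docx archive and would need to cover the other storage locations as well.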
Redaction usually requires a marking of the redacted area with the reason that the content is being restricted. Government documents being released under the Freedom of Information Act are marked with exemption codes that denote the reason why the content has been sanitized.
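The marking step can be sketched for plain text as follows. This is an illustrative sketch under stated assumptions: the function name, the marker format, and the sample sentence are hypothetical, and the code label "(b)(6)" is used only as an example of an exemption code. Each removed span is replaced by a marker whose form depends only on the code, so the marker does not reveal the length of the removed text.

```python
def redact_with_codes(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Replace each (start, end, code) span with a marker naming the reason."""
    out = []
    cursor = 0
    for start, end, code in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping redaction spans")
        out.append(text[cursor:start])
        # The removed characters are dropped, not hidden: only the marker remains.
        out.append(f"[REDACTED {code}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


# Hypothetical example sentence and spans.
original = "Contact John Doe at 555-0100 for details."
name_at = original.index("John Doe")
phone_at = original.index("555-0100")
released = redact_with_codes(
    original,
    [
        (name_at, name_at + len("John Doe"), "(b)(6)"),
        (phone_at, phone_at + len("555-0100"), "(b)(6)"),
    ],
)
print(released)
```

Because the output string is built only from the unredacted slices and the markers, the sensitive characters never reach the released text.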
The National Security Agency published a document, Redacting with Confidence: How to Safely Publish Sanitized Reports Converted from Word to PDF, which provides instructions for redacting PDF files generated from Microsoft Word.
Printed matter
A printed document which contains classified or sensitive information will frequently contain a great deal of information which is less sensitive. There may be a need to release the less sensitive portions to uncleared personnel. The printed document will thus be sanitized to obscure or remove the sensitive information. The term redaction is also used to describe this process, though that term is more often used in literary contexts.
In some cases, sanitizing a classified document removes enough information to reduce the classification from a higher level to a lower one. For example, raw intelligence reports may contain highly classified information, like the identities of spies, that is removed before the reports are distributed outside the intelligence agency: the initial report may be classified as Top Secret while the sanitized report may be classified as Secret.
In other cases, like the U.S. National Security Agency's report on the USS Liberty incident, the report may be sanitized to remove all sensitive data, so that the report may be released to the general public.
As is seen in the USS Liberty report, paper documents are generally sanitized by covering the classified and sensitive portions and then photocopying the document, resulting in a sanitized document suitable for distribution.
Computer media and files
Computer (electronic or digital) documents are more difficult to sanitize. In many cases, when information in an information system is modified or erased, some or all of the data remains in storage. This may be an accident of design, where the underlying storage mechanism (disk, RAM, etc.) still allows information to be read, despite its nominal erasure. The general term for this problem is data remanence. In some contexts (notably the US NSA, DoD, and related organizations), sanitization typically refers to countering the data remanence problem; redaction is used in the sense of this article.
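Data remanence after a nominal erasure can be shown with a toy storage model. The sketch below is an illustration only: ToyDisk and its methods are hypothetical and do not correspond to any real file system. Its delete method drops the name-to-block entry and leaves the stored bytes in place, while its sanitize method overwrites the block first.

```python
class ToyDisk:
    """A toy storage model: a block array plus a name -> block index table."""

    def __init__(self, blocks: int, block_size: int = 16) -> None:
        self.block_size = block_size
        self.data = bytearray(blocks * block_size)
        self.table: dict[str, int] = {}
        self.free = list(range(blocks))

    def write(self, name: str, payload: bytes) -> None:
        if len(payload) > self.block_size:
            raise ValueError("payload larger than one block")
        index = self.free.pop(0)
        start = index * self.block_size
        self.data[start:start + len(payload)] = payload
        self.table[name] = index

    def delete(self, name: str) -> None:
        # Nominal deletion: drop the table entry, leave the block contents alone.
        self.free.append(self.table.pop(name))

    def sanitize(self, name: str) -> None:
        # Overwrite the whole block with zeros before releasing it.
        index = self.table.pop(name)
        start = index * self.block_size
        self.data[start:start + self.block_size] = bytes(self.block_size)
        self.free.append(index)


disk = ToyDisk(blocks=4)
disk.write("report.txt", b"TOP SECRET")
disk.delete("report.txt")
print("report.txt" in disk.table)  # False: the file appears to be gone
print(b"TOP SECRET" in disk.data)  # True: the bytes are still on the medium
```

Real media are more complicated than this model, and a single overwrite is not guaranteed to reach every copy of the data on every kind of device.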
The retention may also be a deliberate feature, in the form of an undo buffer, revision history, "trash can", backups, or the like. For example, word processing programs like Microsoft Word will sometimes be used to edit out the sensitive information. Unfortunately, these products do not always show the user all of the information stored in a file, so it is possible that a file may still contain sensitive information. In other cases, inexperienced users will use ineffective methods which fail to sanitize the document.
Metadata removal tools are designed to effectively sanitize documents by removing potentially sensitive hidden information.
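As a rough illustration of the kind of hidden information such tools look for, the sketch below lists the core properties (author, last editor, and so on) stored in a .docx package, which is a ZIP archive that keeps these fields in docProps/core.xml. The function names and the sample package are hypothetical, and the sketch only reports the fields; it does not remove them.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CORE_PATH = "docProps/core.xml"


def list_core_properties(docx_bytes: bytes) -> dict[str, str]:
    """Return the non-empty core properties stored in a .docx package."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        if CORE_PATH not in package.namelist():
            return {}
        root = ET.fromstring(package.read(CORE_PATH))
    props = {}
    for child in root:
        # Strip the "{namespace}" prefix that ElementTree puts on tag names.
        name = child.tag.rsplit("}", 1)[-1]
        if child.text and child.text.strip():
            props[name] = child.text.strip()
    return props


def build_sample_package() -> bytes:
    """Build a tiny in-memory archive holding only a core properties part."""
    core = (
        '<cp:coreProperties '
        'xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" '
        'xmlns:dc="http://purl.org/dc/elements/1.1/">'
        "<dc:creator>J. Analyst</dc:creator>"
        "<cp:lastModifiedBy>J. Analyst</cp:lastModifiedBy>"
        "</cp:coreProperties>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr(CORE_PATH, core)
    return buffer.getvalue()


print(list_core_properties(build_sample_package()))
```

None of these fields appear on the printed page, which is why a document can look sanitized and still identify its author.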
In May 2005, the US military published a report on the death of Nicola Calipari, an Italian secret agent, at a US military checkpoint in Iraq. The report was published in PDF format and had been incorrectly redacted using commercial word processing tools. Shortly thereafter, readers discovered that the blocked-out portions could be retrieved using simple cut and paste operations on the posted document.
Similarly, on May 24, 2006, lawyers for the communications service provider AT&T filed a legal brief regarding their cooperation with domestic wiretapping by the NSA. Text on pages 12 through 14 of the PDF document was incorrectly redacted, and the covered text could be retrieved using cut and paste.
At the end of 2005, the NSA released a report giving recommendations on how to safely sanitize a Word document.
Issues such as these make it difficult to reliably implement multilevel security systems, in which computer users of differing security clearances may share documents.
The Challenge of Multilevel Security gives an example of a sanitization failure caused by unexpected behavior in Microsoft Word's change tracking feature.
The two most common redaction mistakes are adding an image layer over the sensitive text without removing the underlying text, and setting the background color to match the text color. In both cases, the redacted material still exists in the document beneath its visible appearance and is subject to searching and even simple copy and paste extraction. Proper redaction tools and procedures must be used to permanently remove the sensitive information. This is often accomplished in a multi-user workflow where one group marks sections of the document as proposed redactions, another group verifies that the proposals are correct, and a final group operates the redaction tool to permanently remove the proposed items.
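The difference between covering text and removing it can be sketched with a toy page model. Everything here is hypothetical and greatly simplified: a page is a list of positioned text runs plus a list of drawn boxes, and a box is treated as covering a run when it contains the run's origin point. Real document formats are far more involved.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    x: int      # left edge of the text run
    y: int      # baseline of the text run
    text: str


@dataclass
class Box:
    x0: int
    y0: int
    x1: int
    y1: int

    def covers(self, run: Run) -> bool:
        # Simplification: test only the run's origin point.
        return self.x0 <= run.x <= self.x1 and self.y0 <= run.y <= self.y1


@dataclass
class Page:
    runs: list[Run] = field(default_factory=list)
    overlays: list[Box] = field(default_factory=list)


def extract_text(page: Page) -> str:
    # Like search or copy and paste, this reads the text layer and ignores drawings.
    return " ".join(run.text for run in page.runs)


def overlay_redact(page: Page, box: Box) -> None:
    # Ineffective: only draws a black box on top of the text.
    page.overlays.append(box)


def remove_redact(page: Page, box: Box) -> None:
    # Effective in this model: delete the covered runs, then draw the box.
    page.runs = [run for run in page.runs if not box.covers(run)]
    page.overlays.append(box)


page = Page(runs=[Run(0, 0, "Meeting with"), Run(120, 0, "Agent Smith"), Run(240, 0, "at noon")])
box = Box(110, -5, 230, 5)

overlay_redact(page, box)
print(extract_text(page))  # the covered name is still extracted

remove_redact(page, box)
print(extract_text(page))  # the name is no longer in the text layer
```

After overlay_redact the page looks redacted but extract_text still returns the name; only remove_redact takes it out of the stored data.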
See also
- Censorship
- Data erasure
- Data remanence
- Freedom of Information Act