Base64



 
 


The term Base64 refers to a specific MIME content transfer encoding. It is also used as a generic term for any similar encoding scheme that encodes binary data by treating it numerically and translating it into a base 64 representation. The particular choice of base is due to the history of character set encoding: one can choose a set of 64 characters that is both part of the subset common to most encodings, and also printable. This combination leaves the data unlikely to be modified in transit through systems, such as email, which were traditionally not 8-bit clean.

The precise choice of characters is difficult. The earliest instances of this type of encoding were created for dialup communication between systems running the same operating system, e.g. Uuencode for UNIX and BinHex for the TRS-80 (later adapted for the Macintosh), and could therefore make more assumptions about what characters were safe to use. For instance, Uuencode uses uppercase letters, digits, and many punctuation characters, but no lowercase, since UNIX was sometimes used with terminals that did not support distinct letter case. Unfortunately for interoperability with non-UNIX systems, some of the punctuation characters do not exist in other traditional character sets. The MIME Base64 encoding replaces most of the punctuation characters with the lowercase letters, a reasonable requirement by the time it was designed.

MIME Base64 uses A–Z, a–z, and 0–9 for the first 62 values. There are other similar systems, usually derived from Base64, that share this property but differ in the symbols chosen for the last two values; an example is UTF-7.

History of Base64 encoding schemes


Privacy-Enhanced Mail (PEM)


The first known use of the encoding now called MIME Base 64 was in the Privacy-enhanced Electronic Mail (PEM) protocol, proposed by RFC 989 in 1987. PEM defines a "printable encoding" scheme that uses Base 64 encoding to transform an arbitrary sequence of octets to a format that can be expressed in short lines of 7-bit characters, as required by transfer protocols such as SMTP.

The current version of PEM (specified in RFC 1421) uses a 64-character alphabet consisting of upper- and lower-case Roman alphabet characters (A–Z, a–z), the numerals (0–9), and the "+" and "/" symbols. The "=" symbol is also used as a special suffix code. The original specification, RFC 989, additionally used the "*" symbol to delimit encoded but unencrypted data within the output stream.

To convert data to PEM printable encoding, the first byte is placed in the most significant eight bits of a 24-bit buffer, the next in the middle eight, and the third in the least significant eight bits. If there are fewer than three bytes left to encode (or in total), the remaining buffer bits will be zero. The buffer is then used, six bits at a time, most significant first, as indices into the string: "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/", and the indicated character is output.

The process is repeated on the remaining data until fewer than four octets remain. If three octets remain, they are processed normally. If fewer than three octets (24 bits) are remaining to encode, the input data is right-padded with zero bits to form an integral multiple of six bits.

After encoding the non-padded data, if two octets of the 24-bit buffer consist of padding zeros, two "=" characters are appended to the output; if one octet of the 24-bit buffer consists of padding zeros, one "=" character is appended. This signals the decoder that the zero bits added due to padding should be excluded from the reconstructed data. This also guarantees that the encoded output length is a multiple of 4 characters.
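The buffer-and-padding procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the function name `pem_encode` is hypothetical, and line wrapping is deliberately left out.

```python
# The 64-character alphabet quoted in the text above.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def pem_encode(data: bytes) -> str:
    """Encode bytes three octets at a time through a 24-bit buffer (no line wrapping)."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        missing = 3 - len(chunk)  # 0, 1 or 2 octets of zero padding in the last group
        # First byte lands in the most significant eight bits of the buffer.
        buffer = int.from_bytes(chunk + b"\x00" * missing, "big")
        # Read the buffer six bits at a time, most significant first.
        chars = [ALPHABET[(buffer >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # Characters made up only of padding bits are replaced by "=".
        out.append("".join(chars[:4 - missing]) + "=" * missing)
    return "".join(out)

print(pem_encode(b"Man"))  # TWFu
print(pem_encode(b"Ma"))   # TWE=
print(pem_encode(b"M"))    # TQ==
```

Each group of up to three octets always yields exactly four output characters, which is why the total length is a multiple of four.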

PEM requires that all encoded lines consist of exactly 64 printable characters, with the exception of the last line, which may contain fewer printable characters. Lines are delimited by whitespace characters according to local (platform-specific) conventions.

MIME

The MIME (Multipurpose Internet Mail Extensions) specification, defined in RFC 2045, lists "base64" as one of several binary-to-text encoding schemes. MIME's base64 encoding is based on that of the RFC 1421 version of PEM: it uses the same 64-character alphabet and encoding mechanism as PEM, and uses the "=" symbol for output padding in the same way.

MIME does not specify a fixed length for base64-encoded lines, but it does specify a maximum line length of 76 characters. Additionally it specifies that any extra-alphabetic characters must be ignored by a compliant decoder, although most implementations use a CR/LF newline pair to delimit encoded lines.

Thus, the actual length of MIME-compliant base64-encoded binary data is usually about 137% of the original data length: the encoding itself turns every 3 bytes into 4 characters, and a two-byte CR/LF pair after every 76 characters adds a little more (4/3 × 78/76 ≈ 1.37). For very short messages the relative overhead can be a lot higher because of the headers. Very roughly, the final size of base64-encoded binary data is equal to 1.37 times the original data size + 814 bytes (for headers).
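The 137% figure can be checked with a small calculation. The sketch below assumes full 76-character lines and a CR/LF pair after every line, including the last; the function name `mime_base64_body_size` is hypothetical, and header overhead is not modelled.

```python
import math

LINE_LENGTH = 76  # maximum encoded line length allowed by MIME
LINE_BREAK = 2    # bytes in one CR/LF pair

def mime_base64_body_size(n_bytes: int) -> int:
    """Approximate size in bytes of the encoded body for n_bytes of input (headers excluded)."""
    chars = 4 * math.ceil(n_bytes / 3)      # four output characters per three input bytes
    lines = math.ceil(chars / LINE_LENGTH)  # assumption: every line ends with CR/LF
    return chars + LINE_BREAK * lines

size = mime_base64_body_size(57_000)
print(size)                     # 78000
print(round(size / 57_000, 3))  # 1.368
```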

UTF-7

UTF-7, described in RFC 2152, introduced a system called Modified Base64. This data encoding scheme is used to encode UTF-16 as ASCII characters for use in 7-bit transports such as SMTP. It is a variant of the base64 encoding used in MIME.

The "Modified Base64" alphabet consists of the MIME base64 alphabet, but does not use the "=" padding character. UTF-7 is intended for use in mail headers (defined in RFC 2047), and the "=" character is reserved in that context as the escape character for "quoted-printable" encoding. Modified base64 simply omits the padding and ends immediately after the last base64 digit containing useful bits (leaving 0-4 unused bits in the last base64 digit).
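As a rough illustration of the padding-free form, the following sketch encodes text as big-endian UTF-16 and strips the "=" characters. The helper name `modified_base64` is hypothetical, and the sketch covers only the base64 portion, not the "+" and "-" shift characters that UTF-7 places around it.

```python
import base64

def modified_base64(text: str) -> str:
    """Base64-encode the big-endian UTF-16 form of text, without '=' padding."""
    raw = text.encode("utf-16-be")
    return base64.b64encode(raw).decode("ascii").rstrip("=")

# U+00A3 (pound sign) is the two octets 00 A3, which standard base64 encodes as "AKM=".
print(modified_base64("\u00a3"))  # AKM
```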

OpenPGP

OpenPGP, described in RFC 4880, defines Radix-64 encoding, also known as "ASCII Armor". Radix-64 is identical to the "base64" encoding described by MIME, with the addition of an optional 24-bit CRC checksum. The checksum is calculated on the input data before encoding; the checksum is then encoded with the same base64 algorithm and, using an additional "=" symbol as separator, appended to the encoded output data.
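The checksum step might look like the sketch below. The initial value and generator polynomial are the CRC-24 constants given in RFC 4880, not in the text above; the function names are hypothetical, and only the checksum line is produced, not the armor headers or the encoded body.

```python
import base64

CRC24_INIT = 0xB704CE   # initial value, from RFC 4880
CRC24_POLY = 0x1864CFB  # generator polynomial, from RFC 4880

def crc24(data: bytes) -> int:
    """Compute the 24-bit CRC over the raw (not yet encoded) input data."""
    crc = CRC24_INIT
    for byte in data:
        crc ^= byte << 16
        for _ in range(8):
            crc <<= 1
            if crc & 0x1000000:
                crc ^= CRC24_POLY
    return crc & 0xFFFFFF

def armor_checksum(data: bytes) -> str:
    """Return the '='-prefixed, base64-encoded checksum for data."""
    three_octets = crc24(data).to_bytes(3, "big")
    return "=" + base64.b64encode(three_octets).decode("ascii")

print(armor_checksum(b""))  # =twTO
```

Because the CRC is exactly three octets, its base64 form is always four characters with no padding of its own.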

RFC 3548

RFC 3548 (The Base16, Base32, and Base64 Data Encodings) is an informational (non-normative) memo that attempts to unify the RFC 1421 and RFC 2045 specifications of base64 encodings, alternative-alphabet encodings, and the seldom-used Base 32 and Base 16 encodings.

RFC 3548 forbids implementations from generating messages containing characters outside the encoding alphabet, unless they are written to a specification that refers to RFC 3548 and specifically requires otherwise; it also declares that decoder implementations must reject data that contains characters outside the encoding alphabet, unless they are written to a specification that refers to RFC 3548 and specifically requires otherwise.
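The difference between a MIME-style decoder, which ignores characters outside the alphabet, and the stricter behaviour RFC 3548 asks for can be seen with Python's standard `base64` module. The wrapper names `decode_lenient` and `decode_strict` are hypothetical; the `validate` flag is the standard library's own.

```python
import base64
import binascii

def decode_lenient(encoded: bytes) -> bytes:
    """MIME-style: characters outside the alphabet (such as CR/LF) are discarded."""
    return base64.b64decode(encoded)

def decode_strict(encoded: bytes) -> bytes:
    """RFC 3548-style: any character outside the alphabet is an error."""
    return base64.b64decode(encoded, validate=True)

wrapped = b"TWFu\r\nTWFu"  # "ManMan" split across two lines

print(decode_lenient(wrapped))  # b'ManMan'
try:
    decode_strict(wrapped)
except binascii.Error as exc:
    print("rejected:", exc)
```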

RFC 4648

This RFC obsoletes RFC 3548 and focuses on base 64/32/16:

This document describes the commonly used base 64, base 32, and base 16 encoding schemes. It also discusses the use of line-feeds in encoded data, use of padding in encoded data, use of non-alphabet characters in encoded data, use of different encoding alphabets, and canonical encodings.


Example


A quote from Thomas Hobbes's Leviathan:

Man is distinguished, not only by his reason, but by this singular passion from other animals, which is a lust of the mind, that by a perseverance of delight in the continued and indefatigable generation of knowledge, exceeds the short vehemence of any carnal pleasure.


is encoded in MIME's base64 scheme as follows:

TWFuIGlzIGRpc3Rpbmd1aXNoZWQsIG5vdCBvbmx5IGJ5IGhpcyByZWFzb24sIGJ1dCBieSB0aGlz
IHNpbmd1bGFyIHBhc3Npb24gZnJvbSBvdGhlciBhbmltYWxzLCB3aGljaCBpcyBhIGx1c3Qgb2Yg
dGhlIG1pbmQsIHRoYXQgYnkgYSBwZXJzZXZlcmFuY2Ugb2YgZGVsaWdodCBpbiB0aGUgY29udGlu
dWVkIGFuZCBpbmRlZmF0aWdhYmxlIGdlbmVyYXRpb24gb2Yga25vd2xlZGdlLCBleGNlZWRzIHRo
ZSBzaG9ydCB2ZWhlbWVuY2Ugb2YgYW55IGNhcm5hbCBwbGVhc3VyZS4=

In the above quote the encoded value of Man is TWFu. Encoded in ASCII, the characters M, a, n are stored as the bytes 77, 97, 110, which are 01001101, 01100001, 01101110 in base 2. These three bytes are joined together in a 24-bit buffer, producing 010011010110000101101110. Groups of 6 bits (6 bits can take 64 different binary values) are converted into 4 numbers (24 = 6 × 4), which are then converted to their corresponding characters in Base 64.

Text content     M                 a                 n
ASCII            77                97                110
Bit pattern      01001101          01100001          01101110
6-bit groups     010011      010110      000101      101110
Index            19          22          5           46
Base64-encoded   T           W           F           u

As this example illustrates, Base 64 encoding converts 3 uncoded bytes (in this case, ASCII characters) into 4 encoded ASCII characters.
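The same conversion can be traced step by step in code. The sketch below is illustrative only: the helper name `trace_group` is hypothetical, and it handles exactly one full three-character group.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def trace_group(text: str):
    """Return (bit string, 6-bit indices, encoded characters) for three ASCII characters."""
    octets = text.encode("ascii")
    assert len(octets) == 3, "this sketch handles exactly one full group"
    bits = "".join(f"{octet:08b}" for octet in octets)          # 24-bit buffer as text
    indices = [int(bits[i:i + 6], 2) for i in range(0, 24, 6)]  # four 6-bit numbers
    encoded = "".join(ALPHABET[i] for i in indices)
    return bits, indices, encoded

bits, indices, encoded = trace_group("Man")
print(bits)     # 010011010110000101101110
print(indices)  # [19, 22, 5, 46]
print(encoded)  # TWFu
```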

The example below illustrates how shortening the input changes the output padding:

  • Input ends with: carnal pleasure. Output ends with: c3VyZS4=
  • Input ends with: carnal pleasure Output ends with: c3VyZQ==
  • Input ends with: carnal pleasur Output ends with: c3Vy
  • Input ends with: carnal pleasu Output ends with: c3U=

Note that the same characters will be encoded differently depending on their position within the three-octet group which is encoded to produce the four characters. For example:

  • The input: leasure. Encodes to bGVhc3VyZS4=
  • The input: easure. Encodes to ZWFzdXJlLg==
  • The input: asure. Encodes to YXN1cmUu
  • The input: sure. Encodes to c3VyZS4=

URL applications

Base64 encoding can be helpful when fairly lengthy identifying information is used in an HTTP environment. Hibernate, a database persistence framework for Java objects, uses Base64 encoding to encode a relatively large unique id (generally 128-bit UUIDs) into a string for use as an HTTP parameter in HTTP forms or HTTP GET URLs. Also, many applications need to encode binary data in a way that is convenient for inclusion in URLs, including in hidden web form fields. Base64 renders such data compactly and, where the goal is to obscure its nature from a casual human observer, in a relatively unreadable form.

Using a URL-encoder on standard Base64, however, is inconvenient as it will translate the '+' and '/' characters into special percent-encoded hexadecimal sequences ('+' = '%2B' and '/' = '%2F'). When the result is later used with database storage or across heterogeneous systems, the '%' character generated by URL-encoders can itself cause problems (for instance, '%' is also used in ANSI SQL as a wildcard).

For this reason, a modified Base64 for URL variant exists, where no padding '=' will be used, and the '+' and '/' characters of standard Base64 are respectively replaced by '-' and '_', so that using URL encoders/decoders is no longer necessary and has no impact on the length of the encoded value, leaving the same encoded form intact for use in relational databases, web forms, and object identifiers in general.
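A minimal sketch of this URL variant, built on the '-' and '_' alphabet that Python's standard library already provides. The helper names are hypothetical, and the decoder assumes the padding was stripped exactly as the encoder does.

```python
import base64
import uuid

def urlsafe_encode(data: bytes) -> str:
    """Encode with '-' and '_' in place of '+' and '/', and drop the '=' padding."""
    return base64.urlsafe_b64encode(data).decode("ascii").rstrip("=")

def urlsafe_decode(text: str) -> bytes:
    """Restore the stripped padding, then decode."""
    padding = "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(text + padding)

print(urlsafe_encode(b"\xfb\xff"))  # -_8  (standard Base64 would give +/8=)

# A 128-bit identifier, as in the UUID use described earlier, fits in 22 characters.
token = urlsafe_encode(uuid.uuid4().bytes)
print(len(token))  # 22
```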

Another variant called modified Base64 for regexps uses '!-' instead of '*-' to replace the standard Base64 '+/', because both '+' and '*' may be reserved for regular expressions (note that the '[]' used in the IRCu variant would not work in that context).

Another variant called modified Base64 for filename uses '-' instead of '/', because Unix and Windows filenames cannot contain '/'.

There are other variants that use '_-' or '._' when the Base64 variant string must be used within valid identifiers for programs, or '.-' for use in XML name tokens (Nmtoken), or even '_:' for use in more restricted XML identifiers (Name).

Other applications

Base64 can be used in a variety of contexts:

  • Evolution and Thunderbird use Base64 to obfuscate e-mail passwords.
  • Base64 can be used to transmit and store text that might otherwise cause delimiter collision.
  • Base64 is often used as a quick but insecure shortcut to obscure secrets without incurring the overhead of cryptographic key management.
  • Base64 (in a variant with its own alphabet) is used to store passwords hashed with crypt in the /etc/passwd file.
  • Spammers use Base64 to evade basic anti-spamming tools, which often do not decode Base64 and therefore cannot detect keywords in encoded messages.
  • Base64 is used to encode character strings in LDIF files.
  • Base64 is sometimes used to embed binary data in an XML file, using a syntax similar to ...... e.g. Firefox's bookmarks.html.
  • Base64 is used to encode binary files such as images within scripts, to avoid depending on external files.
  • The data URI scheme can use Base64 to represent files. For instance, background images can be specified in a CSS stylesheet file as data: URIs, instead of being supplied in separate image files.
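As an illustration of the data URI use listed above, a payload can be wrapped as follows. The helper name `data_uri` is hypothetical, and the media type is supplied by the caller rather than detected.

```python
import base64

def data_uri(payload: bytes, media_type: str) -> str:
    """Build a data: URI carrying payload in Base64 form."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{media_type};base64,{encoded}"

print(data_uri(b"hi", "text/plain"))  # data:text/plain;base64,aGk=
```

For an image, the payload would be the raw file contents and the media type something like image/png.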


See also

  • Base32
  • Base16
  • Ascii85
  • Quoted-printable
  • uuencode
  • yEnc
  • 8BITMIME
  • URL


External links

  • RFC 989 and RFC 1421 (Privacy Enhancement for Electronic Internet Mail)
  • RFC 2045 (Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies)
  • RFC 3548 and RFC 4648 (The Base16, Base32, and Base64 Data Encodings)