Extended ASCII - AbsoluteAstronomy.com

The term extended ASCII (or high ASCII) describes eight-bit or larger character encodings that include the standard seven-bit ASCII characters as well as others. The use of the term is sometimes criticized, because it can be misread as implying that the ASCII standard has been updated to include more than 128 characters or that the term unambiguously identifies a single encoding, both of which are untrue.

Motives for extending

Because the number of written symbols (or glyphs) used in common natural languages far exceeds the limited range of the ASCII code, many extensions to it have been used to facilitate handling of those languages. Markets for computers and communication equipment outside English-speaking countries were historically open long before standards bodies had time to deliberate upon the best way to accommodate them, so there are many incompatible proprietary extensions to ASCII.

Since ASCII is a seven-bit code and most computers manipulate data in eight-bit bytes, many extensions use the additional 128 codes available by using all eight bits of each byte. This helps include many languages otherwise not easily representable in ASCII, but is still not enough to cover all languages of countries in which computers are sold, so even these eight-bit extensions had to have local variants.
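As a rough illustration of the arithmetic above, the following Python sketch counts the codes available with seven and eight bits and checks whether a byte string stays inside the seven-bit range. The helper name `is_seven_bit` is chosen here for illustration and does not come from any standard.

```python
# Seven bits give 128 codes; using all eight bits of a byte gives 256,
# leaving 128 additional codes for an extension to define.
ASCII_CODES = 2 ** 7                      # 128
BYTE_CODES = 2 ** 8                       # 256
EXTRA_CODES = BYTE_CODES - ASCII_CODES    # 128


def is_seven_bit(data: bytes) -> bool:
    """Return True if every byte has its high-order bit clear (0x00-0x7F)."""
    return all(byte < 0x80 for byte in data)


plain = b"Hello"                          # ASCII only
extended = bytes([0x48, 0x82])            # 0x82 has the high bit set
```

Any byte for which `is_seven_bit` fails belongs to the upper half of the range, whose meaning depends on which extension is in use.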

Proprietary extensions

Various proprietary extensions appeared on non-EBCDIC mainframe computers and minicomputers, especially in universities. Atari and Commodore home computers added many graphic symbols to their non-standard ASCII (respectively, ATASCII and PETSCII, based on the original ASCII standard of 1963).

IBM introduced eight-bit extended ASCII codes on the original IBM PC and later produced variations for different languages and cultures. IBM called such character sets code pages and assigned numbers both to those it invented itself and to many invented and used by other manufacturers. Accordingly, character sets are very often indicated by their IBM code page number. In ASCII-compatible code pages, the lower 128 characters maintained their standard US-ASCII values, and different pages (or sets of characters) could be made available in the upper 128 characters. DOS computers built for the North American market, for example, used code page 437, which included accented characters needed for French, German, and a few other European languages, as well as some graphical line-drawing characters. The larger character set made it possible to create documents in a combination of languages such as English and French (though French computers usually use code page 850), but not, for example, in English and Greek (which required code page 737).
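The effect of switching code pages can be sketched in Python, whose standard library ships codecs named `cp437`, `cp850` and `cp737`. The example below is only a demonstration of the idea, not a description of how DOS itself selected a code page: it decodes one high byte under each page and checks that the lower 128 byte values agree.

```python
# One byte with the high bit set, interpreted under three DOS code pages.
# The codec names are those shipped with Python's standard library.
raw = bytes([0x82])

as_cp437 = raw.decode("cp437")   # North American DOS code page
as_cp850 = raw.decode("cp850")   # Western European DOS code page
as_cp737 = raw.decode("cp737")   # Greek DOS code page

# The lower 128 byte values decode identically under all three pages.
lower = bytes(range(128))
lower_matches = (
    lower.decode("cp437") == lower.decode("cp850") == lower.decode("cp737")
)
```

Under code page 437 the byte decodes to an accented Latin letter, while under code page 737 the same byte is a Greek letter, which is why a single document could not mix French and Greek text.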

Apple Computer introduced their own 8-bit extended ASCII codes in Mac OS, such as Mac OS Roman.

Digital Equipment Corporation developed the Multinational Character Set, which had fewer characters but more letter and diacritic combinations, based on draft versions of ISO 8859. It was supported by the VT220 and later DEC computer terminals.

ISO 8859 and proprietary adaptations

Eventually, ISO released this standard as ISO 8859, describing its own set of eight-bit ASCII extensions. The most popular was ISO 8859-1, also called ISO Latin1, which contained characters sufficient for the most common Western European languages. Variations were standardized for other languages as well: ISO 8859-2 for Eastern European languages and ISO 8859-5 for Cyrillic languages, for example.

One notable way in which ISO character sets differ from code pages is that the character positions 128 to 159, corresponding to ASCII control characters with the high-order bit set, are specifically unused and undefined in the ISO standards, though they had often been used for printable characters in proprietary code pages, a breaking of ISO standards that was almost universal.

Microsoft later created code page 1252, a compatible superset of ISO 8859-1 with extra characters in the ISO unused range. Code page 1252 is the standard character encoding of western European language versions of Microsoft Windows, including English versions. ISO 8859-1 is the common character encoding used by the X Window System, and most Internet standards.
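The relationship between ISO 8859-1 and code page 1252 can be checked with a short Python sketch. It relies on the behaviour of Python's own `iso-8859-1` and `cp1252` codecs, which is an assumption about one implementation rather than a statement about the standards themselves: Python maps bytes 128 to 159 in ISO 8859-1 to control code points of the same number.

```python
# Byte 0x80 lies in the 128-159 range that ISO 8859-1 leaves without
# printable characters; code page 1252 assigns it the euro sign.
raw = bytes([0x80])
in_latin1 = raw.decode("iso-8859-1")   # Python yields code point U+0080
in_cp1252 = raw.decode("cp1252")       # a printable character

# From 0xA0 upward the two encodings agree byte for byte.
upper = bytes(range(0xA0, 0x100))
upper_matches = upper.decode("iso-8859-1") == upper.decode("cp1252")
```

Text that uses only bytes from 0xA0 upward therefore reads the same under either label, while bytes in the 128 to 159 range reveal which of the two was intended.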

Character set confusion

Because these ASCII extensions have so many variants, it is necessary to identify which set is being used for a particular text for it to be interpreted correctly. However, because the most-used characters (those in ASCII, the seven-bit code points) are common to all sets, even most proprietary ones, failing to identify a character set correctly often has no adverse consequences if the user is typing in English. Further, because many Internet standards use ISO 8859-1, and because Microsoft Windows (using the code page 1252 superset of ISO 8859-1) is the dominant operating system for personal computers today, unannounced use of ISO 8859-1 is quite commonplace, and may generally be assumed in the absence of evidence to the contrary.
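A small Python sketch shows why a wrong guess is harmless for English text but not for other languages. The choice of ISO 8859-1 as the writer's encoding and code page 437 as the reader's guess is arbitrary and made only for illustration.

```python
# Text written under one encoding and read back under another.
english = "plain text"
french = "caf\u00e9"                      # "cafe" with an acute accent

# ASCII-only text survives a wrong guess, because both sets share 0x00-0x7F.
english_misread = english.encode("iso-8859-1").decode("cp437")

# Non-ASCII text does not: byte 0xE9 means different things in each set.
french_misread = french.encode("iso-8859-1").decode("cp437")
```

Only the final character of the French word is corrupted, since the first three letters are ordinary ASCII.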

In many protocols, most importantly e-mail and HTTP, the character encoding of content has to be tagged with IANA-assigned character set identifiers.
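As a minimal sketch of how such a tag is consumed, the Python example below parses a hand-written message with the standard `email` package and uses the declared `charset` parameter to decode the body. The message bytes are made up for this illustration.

```python
from email import message_from_bytes

# A minimal message whose header declares the encoding of the body.
# "iso-8859-1" is an IANA-registered character set name.
wire = (
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"caf\xe9\r\n"
)

msg = message_from_bytes(wire)
declared = msg.get_content_charset()      # charset parameter, lowercased
body = msg.get_payload(decode=True).decode(declared)
```

Without the `charset` parameter the receiver would have to guess how to interpret byte 0xE9 in the body.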

Multi-byte character sets

There are multi-byte character sets (character sets that can handle more than 256 different characters) that are also true extended ASCII. That means all bytes 0x00-0x7F have the same meaning as in ASCII. UTF-8 is such a character set.

Such character sets can be used in file formats where only ASCII bytes are used for keywords and file format syntax, while bytes 0x80-0xFF might be used for free text; this includes most programming languages. This makes it much easier to introduce a multi-byte character set into existing systems that use extended ASCII.
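The property that makes UTF-8 a true extended ASCII can be observed directly in Python. The sample characters below are arbitrary and chosen only for illustration.

```python
# An ASCII character keeps its single ASCII byte value in UTF-8.
ascii_encoded = "A".encode("utf-8")

# Characters outside ASCII take several bytes, here two and three.
two_bytes = "\u00ef".encode("utf-8")      # i with diaeresis
three_bytes = "\u20ac".encode("utf-8")    # euro sign

# Every byte of a multi-byte sequence has the high bit set, so an
# ASCII byte never appears inside a non-ASCII character.
all_high = all(byte >= 0x80 for byte in two_bytes + three_bytes)
```

Software that looks only for ASCII bytes such as quotes or angle brackets can therefore scan UTF-8 data without misreading part of another character as syntax.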

Other character sets such as Shift JIS and UTF-16 are not true extended ASCII, since ASCII bytes (0x00-0x7F) can appear as part of other characters. Shift JIS is sometimes called extended ASCII because ASCII characters are stored as ASCII bytes, but other characters can also include ASCII bytes. Shift JIS can nevertheless be used directly in programming languages and in languages such as HTML, since the bytes used for free-text delimiters are not used as part of non-ASCII characters. UTF-16 is even further from extended ASCII, since ASCII characters are stored as two bytes, one of which is equal to 0x00. Porting an existing system to support character sets such as Shift JIS or UTF-16 is complicated and bug-prone.
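Both cases can be reproduced with Python's `shift_jis` and `utf-16-le` codecs. The katakana letter used below is a commonly cited example and is picked here only for illustration.

```python
# The katakana letter "so" (U+30BD) occupies two bytes in Shift JIS,
# and the second byte has the same value as the ASCII backslash (0x5C).
sjis = "\u30bd".encode("shift_jis")
contains_backslash_byte = 0x5C in sjis

# UTF-16 stores an ASCII letter as two bytes, one of which is 0x00.
utf16 = "A".encode("utf-16-le")           # little-endian, no byte order mark
```

Code that scans such data byte by byte for ASCII values can mistake the second half of a character for a backslash, or treat the 0x00 byte as a string terminator.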

Usage in computer-readable languages

For programming languages and document languages such as HTML, the principle of extended ASCII is important, since it enables many different encodings, and therefore many human languages, to be supported with little extra programming effort in the software that interprets the computer-readable language files.

The principle of extended ASCII means that:

all ASCII bytes (0x00 to 0x7F) have the same meaning in all variants of extended ASCII,
bytes that are not ASCII bytes are used only for free text, and not for tags, keywords and other features having special meaning to the interpreting software.
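The two rules above can be sketched with a toy parser that works on raw bytes. The `key=value` format and the function name `split_key_value` are hypothetical and exist only for this example; the point is that the parser inspects ASCII bytes alone and passes everything else through untouched.

```python
def split_key_value(line: bytes) -> tuple[bytes, bytes]:
    """Split a 'key=value' line at the first ASCII '=' byte.

    Hypothetical toy format: only ASCII bytes carry syntax, and the value
    is free text in whatever extended ASCII encoding the file uses.
    """
    key, _, value = line.partition(b"=")
    return key.strip(), value.strip()


# The same parser handles free text in different encodings unchanged.
latin1_line = "name = Ren\u00e9".encode("iso-8859-1")
utf8_line = "name = Ren\u00e9".encode("utf-8")

key1, value1 = split_key_value(latin1_line)
key2, value2 = split_key_value(utf8_line)
```

The parser never needs to know which encoding is in use; only the code that finally displays the value has to decode it, with `iso-8859-1` in the first case and `utf-8` in the second.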

The source of this article is wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.