Wide character
A wide character is a computer character datatype that generally has a size greater than the traditional 8-bit character. The increased datatype size allows for the use of larger coded character sets.

History

During the 1960s, mainframe and minicomputer manufacturers began to standardize around the 8-bit byte as their smallest datatype. Meanwhile, the 7-bit ASCII character set became the industry standard method for encoding alphanumeric characters for teletype machines and computer terminals. As a result, the 8-bit byte became the de facto datatype for computer systems storing ASCII characters in memory.

Later, computer manufacturers began to make use of the spare bit to extend the ASCII character set beyond its limited set of English alphabet characters. 8-bit character sets such as IBM code page 37 (an EBCDIC code page), PETSCII and ISO 8859 became commonplace, offering terminal support for Greek, Cyrillic, and many others. However, such character sets were still limited in that they were region-specific and often could not be used in tandem. Special conversion routines had to be used to convert from one character set to another, often resulting in destructive translation when no equivalent character existed in the target set.

In 1989, the International Organization for Standardization began work on the Universal Character Set (UCS), a multilingual character set that could be encoded using either a 16-bit (2-byte) or 32-bit (4-byte) value. These larger values required the use of a datatype larger than 8 bits to store the new character values in memory. Thus the term wide character was used to differentiate them from traditional 8-bit character datatypes.

Relation to UCS and Unicode

A wide character refers to the size of the datatype in memory. It does not state how each value in a character set is defined. Those values are instead defined using character sets, with UCS and Unicode simply being two common character sets that contain more characters than an 8-bit value would allow.

Relation to multibyte characters

Just as earlier data transmission systems suffered from the lack of an 8-bit clean data path, modern transmission systems often lack support for 16-bit or 32-bit data paths for character data. This has led to character encoding systems such as UTF-8 that can use multiple bytes to encode a value that is too large for a single 8-bit symbol.

Size of a wide character

The Microsoft Windows application programming interfaces Win32 and Win64, as well as the Java and .NET Framework platforms, require that wide character variables be defined as 16-bit values and that characters be encoded using UTF-16 (a legacy of their former use of UCS-2). Modern Unix-like systems generally require 32-bit values encoded using UTF-32.

C/C++

The standard library of the C programming language includes many facilities for dealing with wide characters and strings composed of them. Wide characters are defined using the datatype wchar_t, whose width the C standard (beginning with C90) leaves to the implementation; some implementations keep a 16-bit wchar_t for historical compatibility reasons. C and C++ compilers that comply with ISO/IEC 10646-1:2000 generally assume 32-bit values. However, version 4.0 of the Unicode Standard, which is aligned with ISO/IEC 10646:2003, says that:
"ANSI/ISO C leaves the semantics of the wide character set to the specific implementation but requires that the characters from the portable C execution set correspond to their wide character equivalents by zero extension."


and that
"The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers."


Wide character literals and wide string literals are written with the prefix L in front of the opening quote. Some examples are:

#include <locale.h>
#include <stddef.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");  // adopt the user's locale so wide output can be converted

    wchar_t myChar1 = L'Ω';
    wchar_t myChar2 = 0x2126;  // Ω given by its code point, U+2126 (OHM SIGN)
    wchar_t myString1[] = L"♠♣♥♦";
    // null-terminated string ♠♣♥♦ given by code points
    wchar_t myString2[] = { 0x2660, 0x2663, 0x2665, 0x2666, 0x0000 };

    wprintf(L"This is char: %lc \n", (wint_t)myChar1);
    wprintf(L"This is char: %lc \n", (wint_t)myChar2);
    wprintf(L"This is a wide string: %ls \n", myString1);
    wprintf(L"This is a wide string: %ls \n", myString2);
    return 0;
}

Python

According to Python's documentation, the language sometimes uses wchar_t as the basis for its character type Py_UNICODE. It depends on whether wchar_t is "compatible with the chosen Python Unicode build variant" on that system.

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 