Character (computing)
In computer and machine-based telecommunications terminology, a character is a unit of information that roughly corresponds to a grapheme, grapheme-like unit, or symbol, such as in an alphabet or syllabary in the written form of a natural language.

Examples of characters include letters, numerical digits, and common punctuation marks (such as '.' or '-'). The concept also includes control characters, which do not correspond to symbols in a particular natural language, but rather to other bits of information used to process text in one or more languages. Examples of control characters include carriage return and tab, as well as instructions to printers or other devices that display or otherwise process text.
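
As a rough illustration of the distinction between printable and control characters, the following Python sketch uses the standard unicodedata module. The helper name is_control is hypothetical, and treating Unicode general category "Cc" as the definition of a control character is an assumption made for this example.

```python
import unicodedata


def is_control(ch: str) -> bool:
    """Return True if ch belongs to Unicode general category Cc (control).

    Hypothetical helper: using category Cc as the test is an assumption
    made for illustration, not a definition taken from the article.
    """
    return unicodedata.category(ch) == "Cc"


# A letter, a digit, a punctuation mark, carriage return, and tab.
for ch in ["A", "7", ".", "\r", "\t"]:
    print(repr(ch), unicodedata.category(ch), is_control(ch))
```

Under this assumption, carriage return and tab are reported as control characters, while the letter, digit, and punctuation mark are not.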

Characters are typically combined into strings.

Character encoding

Computers and communication equipment represent characters using a character encoding that assigns each character to something, typically an integer quantity represented by a sequence of bits, that can be stored or transmitted through a network. Two examples of popular encodings are ASCII and the UTF-8 encoding for Unicode. While most character encodings map characters to numbers and/or bit sequences, Morse code instead represents characters using a series of electrical impulses of varying length.
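
The mapping from a character to an integer and then to a bit sequence can be sketched in Python as follows. The helper name to_bits is hypothetical and exists only for this illustration.

```python
def to_bits(ch: str, encoding: str = "utf-8") -> str:
    """Encode one character and show each resulting byte as 8 binary digits.

    Hypothetical helper for illustration only.
    """
    return " ".join(format(b, "08b") for b in ch.encode(encoding))


# The integer assigned to the letter 'A'.
print(ord("A"))

# The bit sequence for 'A' under ASCII and under UTF-8.
print(to_bits("A", "ascii"))
print(to_bits("A", "utf-8"))
```

For characters in the ASCII range the two encodings produce the same single byte, which is why UTF-8 is described as backward-compatible with ASCII.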

Terminology

Historically, the term character has been widely used by industry professionals to refer to an encoded character (often as defined by the programming language or API). Likewise, character set has been widely used to refer to a specific repertoire of characters that have been mapped to specific bit sequences or numerical codes. The term glyph is used to describe a particular physical appearance of a character. Many computer fonts consist of glyphs that are indexed by the numerical code of the corresponding character.

With the advent and widespread acceptance of Unicode and bit-agnostic encoding forms, a character is increasingly being seen as a unit of information, independent of any particular visual manifestation. The ISO/IEC 10646 (Unicode) International Standard defines character, or abstract character, as "a member of a set of elements used for the organisation, control, or representation of data". Unicode's definition supplements this with explanatory notes that encourage the reader to differentiate between characters, graphemes, and glyphs, among other things.

For example, the Hebrew letter aleph ("א") is often used by mathematicians to denote certain kinds of infinity, but it is also used in ordinary Hebrew text. In Unicode, these two uses are considered different characters, and have two different Unicode numerical identifiers ("code points"), though they may be rendered identically. Conversely, the Chinese logogram for water ("水") may have a slightly different appearance in Japanese texts than it does in Chinese texts, and local typefaces may reflect this. Nonetheless, in Unicode they are considered the same character, and share the same code point.
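
The two cases above can be inspected with Python's standard unicodedata module. This is only a sketch: the variable names are hypothetical, and the character names printed depend on the Unicode database bundled with the Python version in use.

```python
import unicodedata

hebrew_alef = "\u05d0"  # the letter as used in ordinary Hebrew text
alef_symbol = "\u2135"  # the symbol used in mathematics
water = "\u6c34"        # the logogram for water

# Print each code point in U+XXXX notation with its Unicode name.
for ch in (hebrew_alef, alef_symbol, water):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# The two alephs may look alike but are distinct characters.
print(hebrew_alef == alef_symbol)
```

The logogram for water has a single code point regardless of whether a Chinese or Japanese typeface is used to draw it; the difference in appearance is a property of the glyph, not of the character.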

The Unicode standard also differentiates between these abstract characters and coded characters or encoded characters that have been paired with numeric codes that facilitate their representation in computers.

char

A char in the C programming language is a fixed-size entity of one byte, which at one time was large enough to store a character value from ASCII or other encodings. Since a byte can often store only 256 different values, it is impossible to store every character from Unicode and other modern sets in a single char. Instead, larger storage units such as wchar_t, or encodings that use more than one byte per character such as UTF-8, are used.
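
The size limit can be illustrated in Python rather than C; this is a sketch only, and the helper name utf8_length is hypothetical. It shows how many bytes UTF-8 needs for characters whose code numbers fall inside and outside the 0 to 255 range of an 8-bit byte.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 uses for one character (hypothetical helper)."""
    return len(ch.encode("utf-8"))


# 'A', 'e' with acute accent, and the logogram for water.
for ch in ("A", "\u00e9", "\u6c34"):
    print(f"U+{ord(ch):04X}", "fits in 8 bits:", ord(ch) < 256,
          "UTF-8 bytes:", utf8_length(ch))

# A single byte cannot hold a value above 255.
try:
    bytes([0x6C34])
except ValueError as exc:
    print("ValueError:", exc)
```

Note that a code number below 256 does not guarantee a one-byte UTF-8 form: only the ASCII range (0 to 127) is encoded in a single byte.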

Unfortunately, the fact that a character was stored in a byte led to the two terms being used interchangeably in most documentation. This often makes the documentation confusing or misleading. It has also led to extremely inefficient UTF-8 implementations, in which offsets are replaced with repeated counting of characters, and to bugs when different systems disagree on the count.
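
The difference between a byte count and a character count, and the cost of locating a character by counting, can be sketched as follows. The function names count_chars and byte_offset are hypothetical, and the sketch assumes its input is valid UTF-8.

```python
def count_chars(data: bytes) -> int:
    """Count code points in valid UTF-8 by skipping continuation bytes.

    Continuation bytes have the bit pattern 10xxxxxx. Hypothetical helper.
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def byte_offset(data: bytes, char_index: int) -> int:
    """Byte offset of the char_index-th code point (0-based).

    Requires a scan from the start of the data, which is the repeated
    counting that a stored byte offset would avoid.
    """
    seen = 0
    for offset, b in enumerate(data):
        if (b & 0xC0) != 0x80:  # first byte of a code point
            if seen == char_index:
                return offset
            seen += 1
    raise IndexError(char_index)


text = "a\u6c34b"            # 'a', the logogram for water, 'b'
data = text.encode("utf-8")

print(len(text), len(data), count_chars(data))
print(byte_offset(data, 2))
```

A program that treats the byte length as the character count would report a different number for this string than one that counts code points, which is the kind of disagreement described above.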

word character

A 'word character' has a special meaning in some aspects of computing. It typically means one of the letters A-Z (upper or lower case), the digits 0 to 9, or the underscore.
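
One place this definition appears is the \w class in regular expressions. The sketch below uses Python's re module with the re.ASCII flag so that \w is limited to the set described above; the helper name is_word_char is hypothetical.

```python
import re

# With re.ASCII, \w matches only [a-zA-Z0-9_].
WORD_CHAR = re.compile(r"\w", re.ASCII)


def is_word_char(ch: str) -> bool:
    """Return True if ch is a single ASCII word character (hypothetical helper)."""
    return WORD_CHAR.fullmatch(ch) is not None


for ch in ["a", "Z", "5", "_", "-", " "]:
    print(repr(ch), is_word_char(ch))
```

Without the re.ASCII flag, Python's \w also matches letters and digits from other scripts, so the exact set of word characters depends on the tool and its settings.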

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.