GB 2312
GB2312 is the registered internet name for a key official character set of the People's Republic of China, used for simplified Chinese characters. GB abbreviates Guojia Biaozhun (国家标准), which means national standard in Chinese.

GB2312 (1980) has been superseded by GBK and GB18030, which include additional characters, but GB2312 is nonetheless still in widespread use.

While GB2312 covers 99.75% of the characters used for Chinese input, historical texts and many names remain out of scope. GB2312 includes 6,763 Chinese characters (on two levels: the first is arranged by reading, the second by radical then number of strokes), along with symbols and punctuation, Japanese kana, the Greek and Cyrillic alphabets, Zhuyin, and a double-byte set of Pinyin letters with tone marks.

There is an analogous character set known as GB/T 12345, closely related to GB2312, but with traditional character forms replacing simplified forms. GB-encoded fonts often come in pairs, one with the GB 2312 (jianti) character set and the other with the GB/T 12345 (fanti) character set.

Characters

Characters in GB2312 are arranged in a 94x94 grid (as in ISO 2022), and the two-byte codepoint of each character is expressed in the kuten (or quwei) form, which specifies a row (ku or qu) and the position of the character within the row (ten or wei).

The rows (numbered from 1 to 94) contain characters as follows:
  • 01-09, comprising punctuation and other special characters; also Hiragana, Katakana, Greek, Cyrillic, Pinyin, Bopomofo
  • 16-55, the first plane for Chinese characters, arranged according to Pinyin. (3755 characters).
  • 56-87, the second plane for Chinese characters, arranged according to radical and strokes. (3008 characters).
  • 88-89, further Chinese characters. (103 characters). Defined only for GB/T 12345, not GB 2312.


The rows 10-15 and 90-94 are unassigned.
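The row layout above can be sketched as a small lookup. This is only an illustration: the region labels and the names REGIONS and region_of are made up for this example and are not part of the standard. The capacity arithmetic at the end shows how the character counts relate to the 94 cells per row.

```python
# Row ranges taken from the list above; the labels are illustrative only.
REGIONS = [
    (range(1, 10), "symbols and non-Chinese scripts"),
    (range(16, 56), "Chinese characters, level 1 (by Pinyin)"),
    (range(56, 88), "Chinese characters, level 2 (by radical and strokes)"),
]

CELLS_PER_ROW = 94


def region_of(row: int) -> str:
    """Return a label for a GB2312 row number (1 to 94)."""
    if not 1 <= row <= 94:
        raise ValueError(f"row must be in 1..94, got {row}")
    for rows, label in REGIONS:
        if row in rows:
            return label
    # Rows 10-15 and 90-94 are unassigned; rows 88-89 are defined
    # only for GB/T 12345, so they are treated as unassigned here.
    return "unassigned"


# Capacity of each level if every cell were filled.
level1_capacity = len(range(16, 56)) * CELLS_PER_ROW  # 40 rows * 94 = 3760
level2_capacity = len(range(56, 88)) * CELLS_PER_ROW  # 32 rows * 94 = 3008

if __name__ == "__main__":
    print(region_of(45))
    print(level1_capacity - 3755)  # cells left empty in level 1
    print(level2_capacity - 3008)  # cells left empty in level 2
```

Level 2 fills its 32 rows exactly, while level 1 falls 5 cells short of 40 full rows.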

EUC-CN

EUC-CN is often used as the character encoding (i.e. for external storage) in programs that deal with GB2312, thus maintaining compatibility with ASCII. Two bytes are used to represent every character not found in ASCII. The first byte ranges from 0xA1 to 0xF7 (161-247), and the second byte ranges from 0xA1 to 0xFE (161-254).
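As a rough sketch of how these byte ranges can be used, the following function walks an EUC-CN byte string and splits it into single-byte ASCII units and two-byte GB2312 units. The function name split_euc_cn is hypothetical, and the check covers only the byte ranges stated above; it does not verify that a given pair is actually assigned a character.

```python
def split_euc_cn(data: bytes) -> list:
    """Split EUC-CN bytes into 1-byte ASCII units and 2-byte GB2312 units."""
    units = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:
            # Plain ASCII: one byte per character.
            units.append(data[i:i + 1])
            i += 1
        elif (0xA1 <= first <= 0xF7
              and i + 1 < len(data)
              and 0xA1 <= data[i + 1] <= 0xFE):
            # Lead byte followed by a trail byte in the permitted range.
            units.append(data[i:i + 2])
            i += 2
        else:
            raise ValueError(f"invalid EUC-CN byte 0x{first:02X} at offset {i}")
    return units


if __name__ == "__main__":
    print(split_euc_cn(b"A\xcd\xe2B"))
```

Because every byte of a two-byte character has its high bit set, ASCII bytes can never be mistaken for part of a GB2312 character.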

Compared to UTF-8, GB2312 (whether native or encoded in EUC-CN) is more storage-efficient for the characters it covers, because no bits are reserved to indicate three- or four-byte sequences, and no bit is reserved for detecting trailing bytes.
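The size difference can be checked with a short sketch, assuming Python's built-in "gb2312" codec, which produces EUC-CN bytes. The helper name encoded_sizes is made up for this example.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of `text` under EUC-CN and under UTF-8."""
    return {
        "gb2312": len(text.encode("gb2312")),  # Python's codec for EUC-CN
        "utf-8": len(text.encode("utf-8")),
    }


if __name__ == "__main__":
    print(encoded_sizes("外"))   # one Chinese character
    print(encoded_sizes("ABC"))  # ASCII only
```

A Chinese character covered by GB2312 takes two bytes in EUC-CN and three in UTF-8, while ASCII text takes one byte per character in both.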

To map a code point to bytes, add 160 (0xA0) to the row number (the thousands and hundreds digits of the code point) to form the high byte, and add 160 (0xA0) to the cell number (the tens and ones digits) to form the low byte.

For example, for the GB2312 code point 4566 (外, "foreign"), the high byte comes from the row, 45, and the low byte comes from the cell, 66. For the high byte, add 45 to 160, giving 205 or 0xCD. For the low byte, do the same: add 66 to 160, giving 226 or 0xE2. The full encoding is therefore 0xCDE2.
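The mapping described above can be written as a minimal sketch. The function names kuten_to_euc_cn and euc_cn_to_kuten are hypothetical; the decoding step relies on Python's built-in "gb2312" codec.

```python
def kuten_to_euc_cn(code_point: int) -> bytes:
    """Convert a four-digit row/cell code point (e.g. 4566) to EUC-CN bytes."""
    row, cell = divmod(code_point, 100)
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError(f"row and cell must each be in 1..94, got {code_point}")
    # Add 0xA0 (160) to each half to obtain the high and low bytes.
    return bytes([row + 0xA0, cell + 0xA0])


def euc_cn_to_kuten(pair: bytes) -> int:
    """Inverse mapping: two EUC-CN bytes back to the row/cell code point."""
    if len(pair) != 2:
        raise ValueError("expected exactly two bytes")
    return (pair[0] - 0xA0) * 100 + (pair[1] - 0xA0)


if __name__ == "__main__":
    encoded = kuten_to_euc_cn(4566)
    print(encoded.hex())             # cde2
    print(encoded.decode("gb2312"))  # 外
    print(euc_cn_to_kuten(encoded))  # 4566
```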

HZ

HZ is another encoding of GB2312 that is used mostly for Usenet postings.
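As a small illustration, Python ships an "hz" codec, so the same character can be encoded both ways and compared. HZ (codified in RFC 1843) is a 7-bit encoding, which the sketch checks by testing every output byte. The helper name is_seven_bit is made up for this example.

```python
def is_seven_bit(data: bytes) -> bool:
    """True if every byte has its high bit clear (i.e. is plain 7-bit)."""
    return all(b < 0x80 for b in data)


if __name__ == "__main__":
    text = "外"
    hz_bytes = text.encode("hz")       # HZ encoding of GB2312
    euc_bytes = text.encode("gb2312")  # EUC-CN encoding of GB2312

    print(is_seven_bit(hz_bytes))   # True
    print(is_seven_bit(euc_bytes))  # False
    print(hz_bytes.decode("hz") == text)
```

The 7-bit property is what made HZ usable on mail and news systems that could not carry bytes with the high bit set.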

See also

  • Guobiao code
  • CJK
  • Chinese character encoding
  • Unicode
  • GB18030
  • GBK
  • Big5 - standard used in Taiwan and Hong Kong

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 