GB 18030
GB18030 is a Chinese government standard describing the required language and character support necessary for software in China. In addition to the "GB18030 code page", this standard contains requirements about which scripts must be supported, font support, etc.

GB18030 as a code page

GB18030 is the registered Internet name for the official character set of the People's Republic of China (PRC), superseding GB2312. This character set is formally called "Chinese National Standard GB 18030-2005: Information technology — Chinese coded character set". GB abbreviates Guójiā Biāozhǔn (国家标准), which means national standard in Chinese. The standard was published by the China Standard Press, Beijing, on November 8, 2005. Only a portion of the standard is mandatory. Since May 1, 2006, support for the mandatory subset has been officially required for all software products sold in the PRC. Due to its Unicode equivalence, GB18030 supports both simplified and traditional Chinese characters.

An older version of the standard, known as "Chinese National Standard GB 18030-2000: Information Technology — Chinese ideograms coded character set for information interchange — Extension for the basic set", was published on March 17, 2000. The encoding scheme remains the same in the new version, except that the code points of two characters have been exchanged. More code points are now associated with characters because of updates to Unicode, especially the appearance of CJK Unified Ideographs Extension B. Some characters used by ethnic minorities in China, such as Mongolian characters and Tibetan characters (GB 16959-1997 and GB/T 20542-2006), have been added as well, which accounts for the renaming of the standard.

GB18030 can be considered a Unicode Transformation Format (i.e. an encoding of all Unicode code points) that maintains compatibility with a legacy character set. Like UTF-8, GB18030 is a superset of ASCII and can represent the whole range of Unicode code points; in addition, it is also a superset of GB2312. GB18030 also maintains compatibility with Windows Codepage 936, sometimes known as GBK, which is Microsoft's extended version of GB2312, with the exception of the euro sign, which is given a single byte code of 0x80 in Microsoft's later versions of GBK and a two byte code of A2 E3 in GB18030. GB 18030-2005 is also compatible with Chinese Internal Code Specification, Version 1.0, known as GBK 1.0, which is a slight extension of Windows Codepage 936 made in 1995. The mapping to Unicode, however, has been modified for the 81 characters that were provisionally assigned a Unicode PUA code point in GBK 1.0 and that have later been encoded in Unicode. This is specified in Appendix E of GB 18030-2005. There are 14 characters in GB 18030-2005 that are still mapped to Unicode PUA.
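The compatibility claims above can be checked with a short sketch. It assumes Python's built-in `gb18030` and `gb2312` codecs as stand-ins for the standards themselves; codec tables may differ in detail from a particular edition of the standard, and the sample strings are chosen only for illustration.

```python
# Sketch: GB18030 as a superset of ASCII and GB2312, using Python's
# built-in codecs (an assumption; they are not the standard itself).

ascii_text = "GB18030"
chinese_text = "国家标准"  # characters that exist in GB2312

# ASCII text encodes to the same bytes in both encodings.
ascii_same = ascii_text.encode("gb18030") == ascii_text.encode("ascii")

# GB2312 text encodes to the same bytes in both encodings.
gb2312_same = chinese_text.encode("gb18030") == chinese_text.encode("gb2312")

# The euro sign takes a two-byte code in GB18030.
euro_bytes = "€".encode("gb18030")

print(ascii_same, gb2312_same, euro_bytes.hex())
```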

Part of the mapping data comes from a lookup table (as in GBK); the rest is calculated algorithmically. GB18030 also inherits the drawbacks of the legacy standards it is based on, most notably the need for special code to safely find ASCII characters in a GB18030 byte sequence.
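A minimal sketch of the ASCII search problem, assuming Python's built-in `gb18030` codec: bytes in the ASCII range can appear inside multi-byte sequences, so a raw byte search may report a match where no ASCII character exists. The sample character is chosen only for illustration.

```python
# Sketch of the hazard: bytes in the ASCII range can occur inside
# multi-byte GB18030 sequences, so a raw byte search may report
# matches that are not real ASCII characters.

data = "Þ".encode("gb18030")  # a single four-byte sequence

# Naive byte search: the byte 0x30 (ASCII "0") is found, although the
# text contains no digit at all.
naive_hit = b"0" in data

# Safer: decode first, then search the decoded text.
safe_hit = "0" in data.decode("gb18030")

print(data.hex(), naive_hit, safe_hit)
```

Decoding before searching is the simplest remedy; a byte-level scanner would instead have to track sequence boundaries.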

Most major computer companies had already standardised on some version of Unicode as the primary format for their binary formats and OS calls. However, most of them supported only code points in the BMP, the range originally defined in Unicode 1.0, which holds only 65,536 code points and was often encoded in 16 bits as UCS-2.

The mandatory part of GB 18030-2005 consists of 1 byte and 2 byte encoding, together with 4 byte encoding for CJK Unified Ideographs Extension A. The corresponding Unicode code points of this subset lie entirely in the BMP.

In a move of historic significance for software supporting Unicode, the PRC decided to mandate support of certain code points outside the BMP. This means that software can no longer get away with treating characters as 16 bit fixed width entities (UCS-2). Such software must either process the data in a variable width format (such as UTF-8 or UTF-16), which is the most common choice, or move to a larger fixed width format (such as UCS-4 or UTF-32). Microsoft made the change from UCS-2 to UTF-16 with Windows 2000.
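To illustrate why a 16 bit fixed width representation is not enough, the sketch below measures one code point outside the BMP in several encodings. The choice of U+20000 (the first code point of CJK Unified Ideographs Extension B) is only an example, and Python's built-in codecs are assumed.

```python
# Sketch: a code point outside the BMP does not fit in one 16-bit unit.
ch = "\U00020000"  # first code point of CJK Unified Ideographs Extension B

utf16_units = len(ch.encode("utf-16-be")) // 2  # number of 16-bit code units
utf8_bytes = len(ch.encode("utf-8"))
utf32_bytes = len(ch.encode("utf-32-be"))
gb18030_bytes = len(ch.encode("gb18030"))

print(utf16_units, utf8_bytes, utf32_bytes, gb18030_bytes)
```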

Encoding

Windows 2000 can support the GB18030 encoding if the GB18030 Support Package (http://www.microsoft.com/downloads/details.aspx?FamilyID=fc02e2e3-14bb-46c1-afee-3732d6249647&DisplayLang=en) is installed. Windows XP supports it natively. Microsoft SQL Server, including SQL Server 2008, cannot, as it can use UCS-2 but not UTF-16 (except through the use of varbinary 'blobs'). The open source PostgreSQL database supports GB18030 through its full support for UTF-8.

More specifically, supporting the GB18030 encoding on Windows means that Code Page 54936 is supported by MultiByteToWideChar and WideCharToMultiByte. Because the mapping is backward compatible, many GB18030 files can be opened successfully as the legacy Code Page 936 (GBK), even if Code Page 54936 is not supported. However, that is only true if the file in question contains only GBK characters. Loading will fail or produce corrupted results if the file contains characters that do not exist in GBK (see below for examples).
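The fallback behaviour can be sketched outside Windows as well. The example below uses Python's `gbk` codec as a stand-in for Code Page 936, which is an assumption: the two differ in a few code points. The sample strings are illustrative only.

```python
# Sketch: Python's "gbk" codec stands in for Code Page 936 here
# (an assumption; the two differ in a few code points).

gbk_only = "国家标准".encode("gb18030")  # two-byte characters only
beyond_gbk = "Þ".encode("gb18030")       # needs a four-byte sequence

# Text limited to GBK characters decodes the same way under both codecs.
same_result = gbk_only.decode("gbk") == gbk_only.decode("gb18030")

# Text containing a four-byte sequence cannot be decoded as GBK.
try:
    beyond_gbk.decode("gbk")
    gbk_failed = False
except UnicodeDecodeError:
    gbk_failed = True

print(same_result, gbk_failed)
```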

Glyphs

The GB18030 Support Package contains SimSun18030.ttc, a TrueType font collection file which combines two Chinese fonts, SimSun-18030 and NSimSun-18030.

The SimSun 18030 font includes all the characters in Unicode 2.1 plus the new characters found in the Unicode CJK Unified Ideographs Extension A section, but despite its name, it does not contain glyphs for all GB 18030 characters. Note that all (about a million) Unicode code points up to U+10FFFF can be encoded as GB 18030, hence “a font that fully supports GB 18030” would mean a font that contains glyphs for all Unicode characters, not only for CJK ones. HAN NOM A and HAN NOM B (http://sourceforge.net/project/showfiles.php?group_id=153105&package_id=172061) are free fonts that include all the characters in Extension A and Extension B. They are more exhaustive than SimSun-18030, or even than Simsun (Founder Extended), but they do not support all code points defined in Unicode 5.0.0 either.

Technical details

The four byte scheme can be thought of as consisting of two units, each of two bytes. Each unit has a similar format to a GBK two byte character but with a range of values for the second byte of 0x30–0x39 (the ASCII codes for decimal digits). The first byte has the range 0x81 to 0xFE, as before. This means that a string search routine that is safe for GBK should also be reasonably safe for GB18030 (in much the same way that a basic byte-oriented search routine is reasonably safe for EUC).
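The byte ranges described above are enough to split a byte string into its sequences. The following is a simplified sketch: the function name `split_gb18030` is hypothetical, only the first two bytes of each sequence are inspected, and the result is a classification rather than a full validation.

```python
def split_gb18030(data: bytes) -> list:
    """Split a byte string into 1-, 2- and 4-byte GB18030 sequences.

    Simplified: only the first two bytes of each sequence are inspected,
    so this classifies sequences rather than fully validating them.
    """
    sequences = []
    i = 0
    while i < len(data):
        first = data[i]
        if first <= 0x7F:
            length = 1  # single byte (ASCII range)
        elif 0x81 <= first <= 0xFE and i + 1 < len(data):
            second = data[i + 1]
            # A digit (0x30-0x39) in second position marks a four-byte sequence.
            length = 4 if 0x30 <= second <= 0x39 else 2
        else:
            raise ValueError(f"invalid or truncated sequence at offset {i}")
        if i + length > len(data):
            raise ValueError(f"truncated sequence at offset {i}")
        sequences.append(data[i:i + length])
        i += length
    return sequences


parts = split_gb18030("aÞà".encode("gb18030"))
print([p.hex() for p in parts])
```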

This gives a total of 1,587,600 (126×10×126×10) possible 4 byte sequences, which is easily sufficient to cover Unicode's 1,111,998 (17×65536 − 2048 surrogates − 66 noncharacters) assigned and reserved code points. (Surrogates and noncharacters are considered designated but not assigned.)
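The arithmetic behind these totals can be reproduced directly; the variable names below are illustrative only.

```python
# Four-byte sequences: bytes 1 and 3 take 126 values (0x81-0xFE),
# bytes 2 and 4 take 10 values (0x30-0x39).
values_lead = 0xFE - 0x81 + 1
values_digit = 0x39 - 0x30 + 1
four_byte_sequences = values_lead * values_digit * values_lead * values_digit

# Unicode code space (17 planes of 65,536 code points), minus the
# surrogates and noncharacters.
code_space = 17 * 65536
usable_code_points = code_space - 2048 - 66

print(four_byte_sequences, code_space, usable_code_points)
```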

Unfortunately, to further complicate matters, there are no simple rules to translate between a 4 byte sequence and its corresponding code point. Instead, codes are allocated sequentially (with the first byte containing the most significant part and the last the least significant part) only to Unicode code points that are not mapped in any other manner. For example:

U+00DE (Þ) → 81 30 89 37
U+00DF (ß) → 81 30 89 38
U+00E0 (à) → A8 A4
U+00E1 (á) → A8 A2
U+00E2 (â) → 81 30 89 39
U+00E3 (ã) → 81 30 8A 30
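The sequential allocation in the list above can be made visible by converting each four-byte sequence to a linear position. The sketch below assumes Python's built-in `gb18030` codec; the helper `four_byte_index` is a hypothetical name introduced for illustration, counting positions from the sequence 81 30 81 30.

```python
def four_byte_index(seq: bytes) -> int:
    """Linear position of a four-byte sequence, counting from 81 30 81 30."""
    b1, b2, b3, b4 = seq
    # First byte is most significant; bytes 2 and 4 are decimal digits.
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


for ch in "Þßàáâã":
    encoded = ch.encode("gb18030")
    hex_bytes = " ".join(f"{b:02X}" for b in encoded)
    if len(encoded) == 4:
        note = f"index {four_byte_index(encoded)}"
    else:
        note = "two-byte code from the lookup table"
    print(f"U+{ord(ch):04X} -> {hex_bytes} ({note})")
```

The four-byte characters receive consecutive positions even though U+00E0 and U+00E1 sit between them, because those two code points already have two-byte codes.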

See also

  • GBK
  • Guobiao code
  • CJK
  • Chinese character encoding
  • Comparison of Unicode encodings


The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 