GSM 03.38 - AbsoluteAstronomy.com

GSM 03.38 is a character set used in the Short Message Service (SMS) of GSM-based cell phones. It is defined in GSM recommendation 03.38. Messages sent via this encoding can be encoded in the default GSM 7-bit alphabet, the 8-bit data alphabet, or the 16-bit UCS-2 alphabet. Support of the GSM 7-bit alphabet is mandatory for GSM handsets and network elements, but text in languages such as Arabic, Chinese, Korean or Japanese, or in languages written in the Cyrillic alphabet, must be encoded using the 16-bit UCS-2 character encoding or an extended national language shift table.

GSM 7 bit default alphabet and extension table of 3GPP TS 23.038 / GSM 03.38

The standard encoding for GSM messages is the 7-bit default alphabet defined in 3GPP TS 23.038. Seven-bit characters must be packed into octets following one of three packing modes: SMS, CBS or USSD.

Using this encoding, it is possible to send up to 160 characters (140 octets) in one SMS message in the GSM network.

3GPP TS 23.038 / GSM 03.38
	x0	x1	x2	x3	x4	x5	x6	x7	x8	x9	xA	xB	xC	xD	xE	xF
0x	@	£	$	¥	è	é	ù	ì	ò	Ç	LF	Ø	ø	CR	Å	å
1x	Δ	_	Φ	Γ	Λ	Ω	Π	Ψ	Σ	Θ	Ξ	ESC	Æ	æ	ß	É
2x	SP	!	"	#	¤	%	&	'	(	)	*	+	,	-	.	/
3x	0	1	2	3	4	5	6	7	8	9	:	;	<	=	>	?
4x	¡	A	B	C	D	E	F	G	H	I	J	K	L	M	N	O
5x	P	Q	R	S	T	U	V	W	X	Y	Z	Ä	Ö	Ñ	Ü	§
6x	¿	a	b	c	d	e	f	g	h	i	j	k	l	m	n	o
7x	p	q	r	s	t	u	v	w	x	y	z	ä	ö	ñ	ü	à
1B 0x											FF
1B 1x					^							ESC2
1B 2x									{	}						\
1B 3x													[	~	]
1B 4x	|
1B 5x
1B 6x						€
1B 7x

Note that the second part of the table (rows 1B 0x to 1B 7x) is only accessible if the GSM device supports the 7-bit extension mechanism, which uses the ESC character as a prefix. Otherwise, the ESC code itself is interpreted as a space, and the following character is treated as if there were no leading ESC code.
Most positions in this extension part of the table are unused in the default character set, but the GSM standard defines language code indicators that allow the system to identify national variants of this part, to support more characters than those displayed in the above table.
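
The ESC prefix mechanism described above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the names `encode_septets` and `decode_septets` are hypothetical, and the handling of a dangling ESC or of an unassigned extension code is a choice made for this sketch.

```python
# GSM 7-bit default alphabet, indexed by code 0x00..0x7F (0x1B is ESC).
GSM_BASIC = (
    "@£$¥èéùìòÇ\nØø\rÅå"
    "Δ_ΦΓΛΩΠΨΣΘΞ\x1bÆæßÉ"
    " !\"#¤%&'()*+,-./"
    "0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNO"
    "PQRSTUVWXYZÄÖÑÜ§"
    "¿abcdefghijklmno"
    "pqrstuvwxyzäöñüà"
)
# Extension table: character -> code sent after the ESC prefix.
GSM_EXT = {"\f": 0x0A, "^": 0x14, "{": 0x28, "}": 0x29, "\\": 0x2F,
           "[": 0x3C, "~": 0x3D, "]": 0x3E, "|": 0x40, "€": 0x65}
ESC = 0x1B

BASIC_CODE = {ch: i for i, ch in enumerate(GSM_BASIC) if i != ESC}
EXT_CHAR = {code: ch for ch, code in GSM_EXT.items()}


def encode_septets(text):
    """Return the list of 7-bit codes for text (hypothetical helper)."""
    out = []
    for ch in text:
        if ch in BASIC_CODE:
            out.append(BASIC_CODE[ch])
        elif ch in GSM_EXT:
            out.extend((ESC, GSM_EXT[ch]))  # the ESC prefix costs one extra septet
        else:
            raise ValueError(f"{ch!r} is not in the GSM 7-bit tables")
    return out


def decode_septets(septets, supports_extension=True):
    """Decode 7-bit codes; without extension support ESC renders as a space."""
    out = []
    codes = iter(septets)
    for code in codes:
        if code != ESC:
            out.append(GSM_BASIC[code])
        elif not supports_extension:
            out.append(" ")  # the next code is then decoded from the basic table
        else:
            nxt = next(codes, None)
            if nxt is None:
                out.append(" ")  # dangling ESC: this sketch's own choice
            else:
                # Unassigned extension codes fall back to the basic table here.
                out.append(EXT_CHAR.get(nxt, GSM_BASIC[nxt]))
    return "".join(out)
```

With `supports_extension=False`, the codes for `a{b}` decode as `a (b )`: each ESC becomes a space and the following code is read from the basic table.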

In a standard GSM text message, all characters are encoded using 7-bit code units, packed together to fill all bits of octets. So, for example, the 140-octet envelope of an SMS, with no other language indicator but only the standard class prefix, can transport up to (140*8)/7 = 160 GSM 7-bit characters (but note that each character from the extension table counts for two of them, because its ESC prefix occupies one 7-bit code unit).
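
The packing of 7-bit code units into octets can be illustrated with a short Python sketch. It assumes the SMS packing order, in which the first septet occupies the least significant bits of the first octet; the function names `pack_septets` and `unpack_septets` are hypothetical.

```python
def pack_septets(septets):
    """Pack 7-bit codes into octets, first septet in the low bits (SMS mode)."""
    acc = 0      # bit accumulator
    nbits = 0    # number of valid bits currently in acc
    out = bytearray()
    for s in septets:
        acc |= (s & 0x7F) << nbits
        nbits += 7
        while nbits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc & 0xFF)  # remaining high bits stay zero
    return bytes(out)


def unpack_septets(data, count):
    """Recover `count` 7-bit codes from packed octets."""
    acc = 0
    nbits = 0
    out = []
    for octet in data:
        acc |= octet << nbits
        nbits += 8
        while nbits >= 7 and len(out) < count:
            out.append(acc & 0x7F)
            acc >>= 7
            nbits -= 7
    return out
```

Packing 160 septets yields exactly 140 octets, which is the (140*8)/7 = 160 relationship stated above. The septet count has to be carried separately, since the octet count alone does not always determine it.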

Longer messages may be sent, but they require a continuation prefix and a sequence number on subsequent SMS messages (these prefix bytes and the sequence number are counted within the maximum length of the 140-octet payload of the envelope format).

When there are 1 to 6 spare bits in the last octet of a message, these bits are set to zero (these bits do not count as a character but only as a filler). When there are 7 spare bits in the last octet of a message, these bits are set to the 7-bit code of the CR control (also used as a padding filler) instead of being set to zero (where they would be confused with the 7-bit code of an '@' character).
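
The padding rule can be expressed as a small Python sketch. The helper names `spare_bits` and `pad_septets` are hypothetical; the only facts used are that each character takes 7 bits and that the CR code is 0x0D in the table above.

```python
CR = 0x0D  # 7-bit code of the CR control character


def spare_bits(septet_count):
    """Number of unused bits left in the last octet after packing."""
    return (-7 * septet_count) % 8


def pad_septets(septets):
    """Append a CR filler when a full septet (7 bits) would be left over."""
    padded = list(septets)
    if spare_bits(len(padded)) == 7:
        padded.append(CR)  # avoids seven zero bits being read back as '@'
    return padded
```

Seven spare bits occur exactly when the number of characters is 7 modulo 8; in every other case the 0 to 6 leftover bits are simply left at zero by the packer.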

This 7-bit encoding allows the transport of texts encoded in the Basic Latin subset of ASCII, as well as some characters of the ISO Latin 1 character set. It also allows the encoding of texts written in the Greek script, but only capitals; for such use in Greek, the Latin capital letters that look like the Greek letters are reused with the same code, so that the above character set is complete only for modern monotonic Greek restricted to capital letters. Complete support for the Greek alphabet (including small letters) requires a national version of the shifted 7-bit table (using the ESC code for each national character encoded in this shifted table), or an unspecified proprietary 8-bit encoding, or the use of the UCS-2 encoding (see below).

Note that the special code marked ESC2 in the table above has also been assigned (and encoded as 0x1B,0x1B) to allow the use of another alternate 7-bit shift table. But this mechanism has never been used, and the UCS-2 encoding has been preferred.

GSM 8 bit data encoding

The 8-bit data encoding mode treats the information as raw data. According to the standard, the alphabet for this encoding is user-specific.

UCS-2 Encoding

This encoding allows use of a greater range of characters and languages. UCS-2 can represent the most commonly used Latin and Eastern characters, at the cost of more space per character: each character takes 16 bits instead of 7.

A single SMS GSM message using this encoding can have at most 70 characters (140 octets).

Note that on many GSM smartphones, there's no specific preselection of the UCS-2 encoding. The default is to use the 7-bit encoding above, until one enters a character that is not present in the GSM 7-bit table (for example the lowercase c with cedilla 'ç'). In that case, the whole message gets reencoded using the UCS-2 encoding, and the maximum length of the message sent in only 1 SMS is immediately reduced to 70 code units, instead of 160.
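
The fallback behaviour described above can be sketched as follows. This is an illustration of the idea, not any particular phone's logic: the function name `sms_encoding` is hypothetical, and counting UCS-2 length as UTF-16 code units (so a character outside the Basic Multilingual Plane counts as two) is an assumption of this sketch.

```python
# Characters of the GSM 7-bit default alphabet (ESC itself excluded).
GSM_BASIC_CHARS = set(
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ"
    " !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§"
    "¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
# Characters of the extension table; each costs two septets (ESC + code).
GSM_EXT_CHARS = set("\f^{}\\[~]|€")


def sms_encoding(text):
    """Return (encoding name, code units used, single-SMS limit)."""
    septets = 0
    for ch in text:
        if ch in GSM_BASIC_CHARS:
            septets += 1
        elif ch in GSM_EXT_CHARS:
            septets += 2
        else:
            # One unsupported character switches the whole message to UCS-2.
            units = len(text.encode("utf-16-be")) // 2
            return ("UCS-2", units, 70)
    return ("GSM 7-bit", septets, 160)
```

For instance, `sms_encoding("Ça va")` stays in the 7-bit alphabet because the capital Ç is in the table, while `sms_encoding("ça va")` falls back to UCS-2 and the limit drops from 160 to 70.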

To avoid unexpected costs for senders whose subscription covers a limited number of sent SMS, smartphones should display the number of characters used and the maximum number of characters in the composed SMS. When a message exceeds this maximum, it is sent as multiple successive SMS, each containing part of the message (and a sequence number, which also uses a few leading characters in each part); these parts are reassembled by the recipient.
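
A rough way to estimate how many SMS a message needs is sketched below. The per-part limits of 153 septets and 67 UCS-2 code units are an assumption: they correspond to the commonly used 6-octet concatenation header and are not stated in this article. The function name `sms_parts` is hypothetical.

```python
SINGLE_LIMIT = {"gsm7": 160, "ucs2": 70}
# Assumed per-part limits when a 6-octet concatenation header is present.
PART_LIMIT = {"gsm7": 153, "ucs2": 67}


def sms_parts(code_units, encoding="gsm7"):
    """Estimate the number of SMS needed for a message of the given length."""
    if code_units <= SINGLE_LIMIT[encoding]:
        return 1
    per_part = PART_LIMIT[encoding]
    return -(-code_units // per_part)  # ceiling division on integers
```

This is only an estimate for 7-bit text: an implementation would also avoid splitting an ESC prefix from its following code across two parts, which can cost one more septet per part.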

Some GSM smartphones will alert the user about the number of SMS messages needed to send the message, when it requires more than one.

National Language Shift Tables

Since release 8 of the 3GPP TS 23.038 standard, additional character sets can be accessed through the use of a National Language Shift Table.

These tables allow the use of different character sets according to the language in which the text is written. The table for a given message is selected in the User Data Header section of an SMS message and can apply to the whole text (a locking shift table) or to a single character (a single shift table).

Using a shift table, a message can still use 7-bit encoding for the characters, but a different set can be chosen to correctly show accented and language-specific characters. This allows up to 155 characters, encoded in 136 octets (140 octets, minus the 4 octets of User Data Header required to indicate the use of a shift table and the language code).
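
The 155-character figure follows from the same arithmetic as the 160-character limit, as this small sketch shows. The function name `septet_capacity` is hypothetical; the 140-octet payload and the 4-octet header are the values given in the text.

```python
SMS_PAYLOAD_OCTETS = 140


def septet_capacity(header_octets=0):
    """Whole 7-bit characters that fit after reserving header octets."""
    return (SMS_PAYLOAD_OCTETS - header_octets) * 8 // 7
```

Here `septet_capacity(0)` gives 160 and `septet_capacity(4)` gives 155, since 136 octets hold 1088 bits and only 155 complete 7-bit characters fit in them.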

Initially, shift tables were specified for the following languages: Spanish, Portuguese, Turkish, and 10 languages used in India (Bengali, Gujarati, Hindi, Kannada, Malayalam, Oriya, Punjabi, Tamil, Telugu and Urdu), most of them written with a Brahmic script.

Until recently, there was still no defined national language shift table for French, Greek, Russian, Bulgarian, Arabic, Hebrew and most Central European languages, which need better coverage than the default 7-bit standard character set and its default 7-bit extension character set provide. If any character is composed that cannot be represented in those default GSM 7-bit sets, the message is automatically re-encoded using UCS-2, which divides by more than two the maximum length in characters of messages that can be sent at the price of a single SMS. (When a message is split into multiple parts, a few other octets are needed in the User Data Header to indicate the sequence number of each part.)

But a revision of GSM 03.38 (in specification document CR 007, version 4.2.0 of September 2001) has added support for more languages using a 7-bit national shift table: English (extended), German, Dutch, Swedish, Danish, Finnish, Norwegian, French, Italian, Hungarian, Polish, Czech, Icelandic, Greek, Russian, Hebrew and Arabic, in addition to the previous languages. Unfortunately, many smartphones (and national operators) still don't support these extensions.

There's also no language shift table for Japanese written in basic kanas, or for Korean written in Hangul jamos, or for Chinese written in the Han script. This is often not a problem in Japan, because it uses other standards than GSM and WAP for messaging.


The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.