Frequency analysis
Encyclopedia
In cryptanalysis
Cryptanalysis
Cryptanalysis is the study of methods for obtaining the meaning of encrypted information, without access to the secret information that is normally required to do so. Typically, this involves knowing how the system works and finding a secret key...

, frequency analysis is the study of the frequency of letters
Letter frequencies
The frequency of letters in text has often been studied for use in cryptography, and frequency analysis in particular. No exact letter frequency distribution underlies a given language, since all writers write slightly differently. Linotype machines sorted the letters' frequencies as etaoin shrdlu...

 or groups of letters in a ciphertext
Ciphertext
In cryptography, ciphertext is the result of encryption performed on plaintext using an algorithm, called a cipher. Ciphertext is also known as encrypted or encoded information because it contains a form of the original plaintext that is unreadable by a human or computer without the proper cipher...

. The method is used as an aid to breaking classical cipher
Classical cipher
A cipher is a means of concealing a message, where letters of the message are substituted or transposed for other letters, letter pairs, and sometimes for many letters. In cryptography, a classical cipher is a type of cipher that was used historically but now has fallen, for the most part, into...

s.

Frequency analysis is based on the fact that, in any given stretch of written language, certain letters and combinations of letters occur with varying frequencies. Moreover, there is a characteristic distribution of letters that is roughly the same for almost all samples of that language. For instance, given a section of English language
English language
English is a West Germanic language that arose in the Anglo-Saxon kingdoms of England and spread into what was to become south-east Scotland under the influence of the Anglian medieval kingdom of Northumbria...

, E, T, A and O are the most common, while Z, Q and X are rare. Likewise, TH, ER, ON, and AN are the most common pairs of letters (termed bigram
Bigram
Bigrams or digrams are groups of two written letters, two syllables, or two words, and are very commonly used as the basis for simple statistical analysis of text. They are used in one of the most successful language models for speech recognition...

s
or digraphs), and SS , EE , TT , and FF are the most common repeats. The nonsense phrase "ETAOIN SHRDLU
ETAOIN SHRDLU
ETAOIN SHRDLU is a nonsense phrase that sometimes appeared in print in the days of "hot type" publishing because of a custom of Linotype machine operators. It appeared frequently enough that it became part of the lore of newspapers...

" represents the 12 most frequent letters in typical English language text.

In some ciphers, such properties of the natural language plaintext are preserved in the ciphertext, and these patterns have the potential to be exploited in a ciphertext-only attack
Ciphertext-only attack
In cryptography, a ciphertext-only attack or known ciphertext attack is an attack model for cryptanalysis where the attacker is assumed to have access only to a set of ciphertexts....

.

Frequency analysis for simple substitution ciphers

In a simple substitution cipher
Substitution cipher
In cryptography, a substitution cipher is a method of encryption by which units of plaintext are replaced with ciphertext according to a regular system; the "units" may be single letters , pairs of letters, triplets of letters, mixtures of the above, and so forth...

, each letter of the plaintext
Plaintext
In cryptography, plaintext is information a sender wishes to transmit to a receiver. Cleartext is often used as a synonym. Before the computer era, plaintext most commonly meant message text in the language of the communicating parties....

 is replaced with another, and any particular letter in the plaintext will always be transformed into the same letter in the ciphertext. For instance, if all occurrences of the letter e turn into the letter X, a ciphertext message containing numerous instances of the letter X would suggest to a cryptanalyst that X represents e.

The basic use of frequency analysis is to first count the frequency of ciphertext letters and then associate guessed plaintext letters with them. More X's in the ciphertext than anything else suggests that X corresponds to e in the plaintext, but this is not certain; t and a are also very common in English, so X might be either of them also. It is unlikely to be a plaintext z or q which are less common. Thus the cryptanalyst may need to try several combinations of mappings between ciphertext and plaintext letters.

More complex use of statistics can be conceived, such as considering counts of pairs of letters (digrams), triplets (trigrams), and so on. This is done to provide more information to the cryptanalyst, for instance, Q and U nearly always occur together in that order in English, even though Q itself is rare.

An example

Suppose Eve
Alice and Bob
The names Alice and Bob are commonly used placeholder names for archetypal characters in fields such as cryptography and physics. The names are used for convenience; for example, "Alice sends a message to Bob encrypted with his public key" is easier to follow than "Party A sends a message to Party...

 has intercepted the cryptogram
Cryptogram
A cryptogram is a type of puzzle which consists of a short piece of encrypted text. Generally the cipher used to encrypt the text is simple enough that cryptogram can be solved by hand. Frequently used are substitution ciphers where each letter is replaced by a different letter or number. To solve...

 below, and it is known to be encrypted using a simple substitution cipher:
LIVITCSWPIYVEWHEVSRIQMXLEYVEOIEWHRXEXIPFEMVEWHKVSTYLXZIXLIKIIXPIJVSZEYPERRGERIM
WQLMGLMXQERIWGPSRIHMXQEREKIETXMJTPRGEVEKEITREWHEXXLEXXMZITWAWSQWXSWEXTVEPMRXRSJ
GSTVRIEYVIEXCVMUIMWERGMIWXMJMGCSMWXSJOMIQXLIVIQIVIXQSVSTWHKPEGARCSXRWIEVSWIIBXV
IZMXFSJXLIKEGAEWHEPSWYSWIWIEVXLISXLIVXLIRGEPIRQIVIIBGIIHMWYPFLEVHEWHYPSRRFQMXLE
PPXLIECCIEVEWGISJKTVWMRLIHYSPHXLIQIMYLXSJXLIMWRIGXQEROIVFVIZEVAEKPIEWHXEAMWYEPP
XLMWYRMWXSGSWRMHIVEXMSWMGSTPHLEVHPFKPEZINTCMXIVJSVLMRSCMWMSWVIRCIGXMWYMX
For this example, uppercase letters are used to denote ciphertext, lowercase letters are used to denote plaintext (or guesses at such), and X~t is used to express a guess that ciphertext letter X represents the plaintext letter t.

Eve could use frequency analysis to help solve the message along the following lines: counts of the letters in the cryptogram show that I is the most common single letter, XL most common bigram
Bigram
Bigrams or digrams are groups of two written letters, two syllables, or two words, and are very commonly used as the basis for simple statistical analysis of text. They are used in one of the most successful language models for speech recognition...

, and XLI is the most common trigram
Trigram
Trigrams are a special case of the N-gram, where N is 3. They are often used in natural language processing for doing statistical analysis of texts.-Frequency:The 16 most common trigrams in English are:-Examples:...

. e is the most common letter in the English language, th is the most common bigram, and the the most common trigram. This strongly suggests that X~t, L~h and I~e. The second most common letter in the cryptogram is E; since the first and second most frequent letters in the English language, e and t are accounted for, Eve guesses that E~a, the third most frequent letter. Tentatively making these assumptions, the following partial decrypted message is obtained.

heVeTCSWPeYVaWHaVSReQMthaYVaOeaWHRtatePFaMVaWHKVSTYhtZetheKeetPeJVSZaYPaRRGaReM
WQhMGhMtQaReWGPSReHMtQaRaKeaTtMJTPRGaVaKaeTRaWHatthattMZeTWAWSQWtSWatTVaPMRtRSJ
GSTVReaYVeatCVMUeMWaRGMeWtMJMGCSMWtSJOMeQtheVeQeVetQSVSTWHKPaGARCStRWeaVSWeeBtV
eZMtFSJtheKaGAaWHaPSWYSWeWeaVtheStheVtheRGaPeRQeVeeBGeeHMWYPFhaVHaWHYPSRRFQMtha
PPtheaCCeaVaWGeSJKTVWMRheHYSPHtheQeMYhtSJtheMWReGtQaROeVFVeZaVAaKPeaWHtaAMWYaPP
thMWYRMWtSGSWRMHeVatMSWMGSTPHhaVHPFKPaZeNTCMteVJSVhMRSCMWMSWVeRCeGtMWYMt

Using these initial guesses, Eve can spot patterns that confirm her choices, such as "that". Moreover, other patterns suggest further guesses. "Rtate" might be "state", which would mean R~s. Similarly "atthattMZe" could be guessed as "atthattime", yielding M~i and Z~m. Furthermore, "heVe" might be "here", giving V~r. Filling in these guesses, Eve gets:

hereTCSWPeYraWHarSseQithaYraOeaWHstatePFairaWHKrSTYhtmetheKeetPeJrSmaYPassGasei
WQhiGhitQaseWGPSseHitQasaKeaTtiJTPsGaraKaeTsaWHatthattimeTWAWSQWtSWatTraPistsSJ
GSTrseaYreatCriUeiWasGieWtiJiGCSiWtSJOieQthereQeretQSrSTWHKPaGAsCStsWearSWeeBtr
emitFSJtheKaGAaWHaPSWYSWeWeartheStherthesGaPesQereeBGeeHiWYPFharHaWHYPSssFQitha
PPtheaCCearaWGeSJKTrWisheHYSPHtheQeiYhtSJtheiWseGtQasOerFremarAaKPeaWHtaAiWYaPP
thiWYsiWtSGSWsiHeratiSWiGSTPHharHPFKPameNTCiterJSrhisSCiWiSWresCeGtiWYit

In turn, these guesses suggest still others (for example, "remarA" could be "remark", implying A~k) and so on, and it is relatively straightforward to deduce the rest of the letters, eventually yielding the plaintext.

hereuponlegrandarosewithagraveandstatelyairandbroughtmethebeetlefromaglasscasei
nwhichitwasencloseditwasabeautifulscarabaeusandatthattimeunknowntonaturalistsof
courseagreatprizeinascientificpointofviewthereweretworoundblackspotsnearoneextr
emityofthebackandalongoneneartheotherthescaleswereexceedinglyhardandglossywitha
lltheappearanceofburnishedgoldtheweightoftheinsectwasveryremarkableandtakingall
thingsintoconsiderationicouldhardlyblamejupiterforhisopinionrespectingit

At this point, it would be a good idea for Eve to insert spaces and punctuation:

Hereupon Legrand arose, with a grave and stately air, and brought me the beetle
from a glass case in which it was enclosed. It was a beautiful scarabaeus, and, at
that time, unknown to naturalists—of course a great prize in a scientific point
of view. There were two round black spots near one extremity of the back, and a
long one near the other. The scales were exceedingly hard and glossy, with all the
appearance of burnished gold. The weight of the insect was very remarkable, and,
taking all things into consideration, I could hardly blame Jupiter for his opinion
respecting it.

In this example from The Gold-Bug
The Gold-Bug
"The Gold-Bug" is a short story by Edgar Allan Poe. Set on Sullivan's Island, South Carolina, the plot follows William Legrand, who was recently bitten by a gold-colored bug. His servant Jupiter fears him to be going insane and goes to Legrand's friend, an unnamed narrator who agrees to visit his...

, Eve's guesses were all correct. This would not always be the case, however; the variation in statistics for individual plaintexts can mean that initial guesses are incorrect. It may be necessary to backtrack
Backtracking
Backtracking is a general algorithm for finding all solutions to some computational problem, that incrementally builds candidates to the solutions, and abandons each partial candidate c as soon as it determines that c cannot possibly be completed to a valid solution.The classic textbook example...

 incorrect guesses or to analyze the available statistics in much more depth than the somewhat simplified justifications given in the above example.

It is also possible that the plaintext does not exhibit the expected distribution of letter frequencies. Shorter messages are likely to show more variation. It is also possible to construct artificially skewed texts. For example, entire novels have been written that omit the letter "e" altogether — a form of literature known as a lipogram
Lipogram
A lipogram is a kind of constrained writing or word game consisting of writing paragraphs or longer works in which a particular letter or group of letters is avoided — usually a common vowel, and frequently "E", the most common letter in the English language.Writing a lipogram is a trivial task...

.

History and usage


The first known recorded explanation of frequency analysis (indeed, of any kind of cryptanalysis) was given in the 9th century by Al-Kindi
Al-Kindi
' , known as "the Philosopher of the Arabs", was a Muslim Arab philosopher, mathematician, physician, and musician. Al-Kindi was the first of the Muslim peripatetic philosophers, and is unanimously hailed as the "father of Islamic or Arabic philosophy" for his synthesis, adaptation and promotion...

, an Arab
Arab
Arab people, also known as Arabs , are a panethnicity primarily living in the Arab world, which is located in Western Asia and North Africa. They are identified as such on one or more of genealogical, linguistic, or cultural grounds, with tribal affiliations, and intra-tribal relationships playing...

 polymath
Polymath
A polymath is a person whose expertise spans a significant number of different subject areas. In less formal terms, a polymath may simply be someone who is very knowledgeable...

, in A Manuscript on Deciphering Cryptographic Messages. It has been suggested that close textual study of the Qur'an
Qur'an
The Quran , also transliterated Qur'an, Koran, Alcoran, Qur’ān, Coran, Kuran, and al-Qur’ān, is the central religious text of Islam, which Muslims consider the verbatim word of God . It is regarded widely as the finest piece of literature in the Arabic language...

 first brought to light that Arabic
Arabic language
Arabic is a name applied to the descendants of the Classical Arabic language of the 6th century AD, used most prominently in the Quran, the Islamic Holy Book...

 has a characteristic letter frequency. Its use spread, and similar systems were widely used in European states by the time of the Renaissance
Renaissance
The Renaissance was a cultural movement that spanned roughly the 14th to the 17th century, beginning in Italy in the Late Middle Ages and later spreading to the rest of Europe. The term is also used more loosely to refer to the historical era, but since the changes of the Renaissance were not...

. By 1474 Cicco Simonetta
Cicco Simonetta
Francesco Simonetta was an Italian Renaissance statesman. He also is remembered for composing an early treatise on cryptography.- Biography :...

 had written a manual on deciphering encryptions of Latin and Italian
Italian language
Italian is a Romance language spoken mainly in Europe: Italy, Switzerland, San Marino, Vatican City, by minorities in Malta, Monaco, Croatia, Slovenia, France, Libya, Eritrea, and Somalia, and by immigrant communities in the Americas and Australia...

 text. Arabic Letter Frequency
Arabic Letter Frequency
- What gets counted in input Arabic text? :Chiefly, the Arabic alphabet consists of 28 primary letters, these are letters 1 to 28 in Table 1. However, when scripting in Arabic, the eight modified letters listed in positions 29 to 36 in the same table are used just the same...

 and a detailed study of letter and word frequency analysis of the entire book of Qur'an
Qur'an
The Quran , also transliterated Qur'an, Koran, Alcoran, Qur’ān, Coran, Kuran, and al-Qur’ān, is the central religious text of Islam, which Muslims consider the verbatim word of God . It is regarded widely as the finest piece of literature in the Arabic language...

 are provided by Intellaren Articles.

Several schemes were invented by cryptographers to defeat this weakness in simple substitution encryptions. These included:
  • Use of homophones — several alternatives to the most common letters in otherwise monoalphabetic substitution ciphers (for example, for English, both X and Y ciphertext might mean plaintext E).
  • Polyalphabetic substitution
    Polyalphabetic cipher
    A polyalphabetic cipher is any cipher based on substitution, using multiple substitution alphabets. The Vigenère cipher is probably the best-known example of a polyalphabetic cipher, though it is a simplified special case...

    , that is, the use of several alphabets — chosen in assorted, more or less devious, ways (Leone Alberti seems to have been the first to propose this); and
  • Polygraphic substitution, schemes where pairs or triplets of plaintext letters are treated as units for substitution, rather than single letters (for example, the Playfair cipher
    Playfair cipher
    The Playfair cipher or Playfair square is a manual symmetric encryption technique and was the first literal digraph substitution cipher. The scheme was invented in 1854 by Charles Wheatstone, but bears the name of Lord Playfair who promoted the use of the cipher.The technique encrypts pairs of...

     invented by Charles Wheatstone
    Charles Wheatstone
    Sir Charles Wheatstone FRS , was an English scientist and inventor of many scientific breakthroughs of the Victorian era, including the English concertina, the stereoscope , and the Playfair cipher...

     in the mid-19th century).


A disadvantage of all these attempts to defeat frequency counting attacks is that it increases complication of both enciphering and deciphering, leading to mistakes. Famously, a British Foreign Secretary is said to have rejected the Playfair cipher because, even if school boys could cope successfully as Wheatstone and Playfair had shown, 'our attachés could never learn it!'.

The rotor machine
Rotor machine
In cryptography, a rotor machine is an electro-mechanical device used for encrypting and decrypting secret messages. Rotor machines were the cryptographic state-of-the-art for a prominent period of history; they were in widespread use in the 1920s–1970s...

s of the first half of the 20th century (for example, the Enigma machine
Enigma machine
An Enigma machine is any of a family of related electro-mechanical rotor cipher machines used for the encryption and decryption of secret messages. Enigma was invented by German engineer Arthur Scherbius at the end of World War I...

) were essentially immune to straight­forward frequency analysis.
However, other kinds of analysis ("attacks") successfully decoded messages from some of those machines.

Frequency analysis requires only a basic understanding of the statistics of the plaintext language and some problem solving skills, and, if performed by hand, some tolerance for extensive letter bookkeeping. During World War II
World War II
World War II, or the Second World War , was a global conflict lasting from 1939 to 1945, involving most of the world's nations—including all of the great powers—eventually forming two opposing military alliances: the Allies and the Axis...

 (WWII), both the British
United Kingdom
The United Kingdom of Great Britain and Northern IrelandIn the United Kingdom and Dependencies, other languages have been officially recognised as legitimate autochthonous languages under the European Charter for Regional or Minority Languages...

 and the Americans
United States
The United States of America is a federal constitutional republic comprising fifty states and a federal district...

 recruited codebreakers by placing crossword
Crossword
A crossword is a word puzzle that normally takes the form of a square or rectangular grid of white and shaded squares. The goal is to fill the white squares with letters, forming words or phrases, by solving clues which lead to the answers. In languages that are written left-to-right, the answer...

 puzzles in major newspapers and running contests for who could solve them the fastest. Several of the ciphers used by the Axis powers
Axis Powers
The Axis powers , also known as the Axis alliance, Axis nations, Axis countries, or just the Axis, was an alignment of great powers during the mid-20th century that fought World War II against the Allies. It began in 1936 with treaties of friendship between Germany and Italy and between Germany and...

 were breakable using frequency analysis (for example, some of the consular ciphers used by the Japanese). Mechanical methods of letter counting and statistical analysis (generally IBM card type machinery) were first used in WWII, possibly by the US Army's SIS
Signals Intelligence Service
The Signals Intelligence Service was the United States Army codebreaking division, headquartered at Arlington Hall. It was a part of the Signal Corps so secret that outside the office of the Chief Signal officer, it did not officially exist. William Friedman began the division with three "junior...

. Today, the hard work of letter counting and analysis has been replaced by computer
Computer
A computer is a programmable machine designed to sequentially and automatically carry out a sequence of arithmetic or logical operations. The particular sequence of operations can be changed readily, allowing the computer to solve more than one kind of problem...

 software, which can carry out such analysis in seconds. With modern computing power, classical ciphers are unlikely to provide any real protection for confidential data.

Frequency analysis in fiction

Frequency analysis has been described in fiction. Edgar Allan Poe
Edgar Allan Poe
Edgar Allan Poe was an American author, poet, editor and literary critic, considered part of the American Romantic Movement. Best known for his tales of mystery and the macabre, Poe was one of the earliest American practitioners of the short story and is considered the inventor of the detective...

's "The Gold-Bug
The Gold-Bug
"The Gold-Bug" is a short story by Edgar Allan Poe. Set on Sullivan's Island, South Carolina, the plot follows William Legrand, who was recently bitten by a gold-colored bug. His servant Jupiter fears him to be going insane and goes to Legrand's friend, an unnamed narrator who agrees to visit his...

", and Sir Arthur Conan Doyle's
Arthur Conan Doyle
Sir Arthur Ignatius Conan Doyle DL was a Scottish physician and writer, most noted for his stories about the detective Sherlock Holmes, generally considered a milestone in the field of crime fiction, and for the adventures of Professor Challenger...

 Sherlock Holmes
Sherlock Holmes
Sherlock Holmes is a fictional detective created by Scottish author and physician Sir Arthur Conan Doyle. The fantastic London-based "consulting detective", Holmes is famous for his astute logical reasoning, his ability to take almost any disguise, and his use of forensic science skills to solve...

 tale "The Adventure of the Dancing Men
The Adventure of the Dancing Men
"The Adventure of the Dancing Men", one of the 56 Sherlock Holmes short stories written by British author Sir Arthur Conan Doyle, is one of 13 stories in the cycle collected as The Return of Sherlock Holmes....

" are examples of stories which describe the use of frequency analysis to attack simple substitution ciphers. The cipher in the Poe story is encrusted with several deception measures, but this is more a literary device than anything significant cryptographically.

See also

  • ETAOIN SHRDLU
    ETAOIN SHRDLU
    ETAOIN SHRDLU is a nonsense phrase that sometimes appeared in print in the days of "hot type" publishing because of a custom of Linotype machine operators. It appeared frequently enough that it became part of the lore of newspapers...

  • Letter frequencies
    Letter frequencies
    The frequency of letters in text has often been studied for use in cryptography, and frequency analysis in particular. No exact letter frequency distribution underlies a given language, since all writers write slightly differently. Linotype machines sorted the letters' frequencies as etaoin shrdlu...

  • Arabic Letter Frequency
    Arabic Letter Frequency
    - What gets counted in input Arabic text? :Chiefly, the Arabic alphabet consists of 28 primary letters, these are letters 1 to 28 in Table 1. However, when scripting in Arabic, the eight modified letters listed in positions 29 to 36 in the same table are used just the same...

  • Index of coincidence
    Index of coincidence
    In cryptography, coincidence counting is the technique of putting two texts side-by-side and counting the number of times that identical letters appear in the same position in both texts...

  • Topics in cryptography
  • Zipf's law
  • A Void, a novel by Georges Perec
    Georges Perec
    Georges Perec was a French novelist, filmmaker, documentalist and essayist. He is a member of the Oulipo group...

    . The original French text is written without the letter e, as is the English translation. The Spanish version contains no a.

Further reading

  • Helen Fouché Gaines, "Cryptanalysis", 1939, Dover. ISBN 0-486-20097-3
  • Abraham Sinkov
    Abraham Sinkov
    Dr. Abraham Sinkov was a US cryptanalyst.-Biography:Sinkov, the son of immigrants from Russia, was born in Philadelphia, but grew up in Brooklyn. After graduating from Boys High School—what today would be called a "magnet school" -- he took his B.S. in mathematics from City College of New York...

    , "Elementary Cryptanalysis: A Mathematical Approach", The Mathematical Association of America, 1966. ISBN 0-88385-622-0.

External links

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK