Letter frequencies
The frequency of letters in text has often been studied for use in cryptography, and in frequency analysis in particular. No exact distribution can be given, as each person writes slightly differently; however, an approximate ordering of English letters by frequency of use is ETAOIN SHRDL UCMFG YPWBV KXJQZ.
An analysis based on all the words in the Cambridge Encyclopedia gave a letter frequency list quite unlike those found in most sources. From most common to least common, it gave EATIN ORSLH DCMUF PGBYW VKXJZQ. Note that A appeared more often than T. The author stated that the deviation from standard lists could be due to the many foreign words repeated within articles. Note, too, that X is more frequent than J in this work.
This brings up an interesting point. Letter frequencies, like word frequencies, vary by writer and by subject. You cannot write about x-rays without using many Xs, and you cannot use a letter at all if its key is broken on your keyboard. Letter, digram, trigram and word frequencies can be used as evidence for or against the authorship of long texts, as can measures such as average word and sentence length. Everyone writes differently: Hemingway is not Faulkner, and so on. A representative average could only be estimated from a very large and varied body of input, for example usage across many different chat rooms or a large sample of e-mail.
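As a minimal sketch of how such an ordering might be computed from a sample, the Python below counts the letters a to z in a string and ranks them by relative frequency. The function names `letter_frequencies` and `frequency_order` are illustrative, and the choice to ignore case, digits, punctuation and accented characters is an assumption, not something the published lists specify.

```python
from collections import Counter
import string


def letter_frequencies(text):
    """Return {letter: relative frequency} for the letters a-z found in text."""
    # Assumption: case is ignored and anything outside a-z is skipped.
    letters = [ch for ch in text.lower() if ch in string.ascii_lowercase]
    counts = Counter(letters)
    total = len(letters)
    if total == 0:
        return {ch: 0.0 for ch in string.ascii_lowercase}
    return {ch: counts[ch] / total for ch in string.ascii_lowercase}


def frequency_order(text):
    """Return the letters that occur in text, most frequent first.

    Ties are broken alphabetically so the result is deterministic.
    """
    freqs = letter_frequencies(text)
    present = [ch for ch in freqs if freqs[ch] > 0]
    ranked = sorted(present, key=lambda ch: (-freqs[ch], ch))
    return "".join(ranked).upper()
```

Run over a large body of English text, `frequency_order` would be expected to give something resembling the orderings quoted above, although the exact result depends on the writer and the subject.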
Relative frequencies of letters
| Letter (alphabetical order) | Frequency | Letter (frequency order) | Frequency |
| a | 0.08167 | e | 0.12702 |
| b | 0.01492 | t | 0.09056 |
| c | 0.02782 | a | 0.08167 |
| d | 0.04253 | o | 0.07507 |
| e | 0.12702 | i | 0.06966 |
| f | 0.02228 | n | 0.06749 |
| g | 0.02015 | s | 0.06327 |
| h | 0.06094 | h | 0.06094 |
| i | 0.06966 | r | 0.05987 |
| j | 0.00153 | d | 0.04253 |
| k | 0.00772 | l | 0.04025 |
| l | 0.04025 | c | 0.02782 |
| m | 0.02406 | u | 0.02758 |
| n | 0.06749 | m | 0.02406 |
| o | 0.07507 | w | 0.02360 |
| p | 0.01929 | f | 0.02228 |
| q | 0.00095 | g | 0.02015 |
| r | 0.05987 | y | 0.01974 |
| s | 0.06327 | p | 0.01929 |
| t | 0.09056 | b | 0.01492 |
| u | 0.02758 | v | 0.00978 |
| v | 0.00978 | k | 0.00772 |
| w | 0.02360 | j | 0.00153 |
| x | 0.00150 | x | 0.00150 |
| y | 0.01974 | q | 0.00095 |
| z | 0.00074 | z | 0.00074 |
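One way a table like this might be used in frequency analysis is to break a Caesar (shift) cipher: try all 26 shifts and keep the one whose letter counts best match the expected frequencies. The sketch below scores each candidate with a chi-squared statistic, which is a common choice but an assumption here, not a method stated by the table's source. The names `ENGLISH_FREQ`, `shift_text`, `chi_squared` and `guess_caesar_shift` are illustrative.

```python
import string

# Relative frequencies copied from the table above, a through z.
ENGLISH_FREQ = dict(zip(string.ascii_lowercase, [
    0.08167, 0.01492, 0.02782, 0.04253, 0.12702, 0.02228, 0.02015,
    0.06094, 0.06966, 0.00153, 0.00772, 0.04025, 0.02406, 0.06749,
    0.07507, 0.01929, 0.00095, 0.05987, 0.06327, 0.09056, 0.02758,
    0.00978, 0.02360, 0.00150, 0.01974, 0.00074,
]))


def shift_text(text, shift):
    """Lowercase text and shift each letter a-z by `shift` places (may be negative)."""
    out = []
    for ch in text.lower():
        if ch in ENGLISH_FREQ:
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        else:
            out.append(ch)  # spaces, digits and punctuation pass through
    return "".join(out)


def chi_squared(text):
    """Chi-squared distance between the letter counts of text and the table."""
    letters = [ch for ch in text.lower() if ch in ENGLISH_FREQ]
    n = len(letters)
    if n == 0:
        return float("inf")
    score = 0.0
    for ch, p in ENGLISH_FREQ.items():
        expected = p * n
        observed = letters.count(ch)
        score += (observed - expected) ** 2 / expected
    return score


def guess_caesar_shift(ciphertext):
    """Return the shift (0-25) whose reversal looks most like English."""
    return min(range(26), key=lambda s: chi_squared(shift_text(ciphertext, -s)))
```

The guess becomes more reliable as the ciphertext grows; with only a handful of letters, or with text on an unusual subject, the best-scoring shift may well be wrong.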
Top 10 beginning of word letters
| Letter | Frequency |
| t | 0.1594 |
| a | 0.155 |
| i | 0.0823 |
| s | 0.0775 |
| o | 0.0712 |
| c | 0.0597 |
| m | 0.0426 |
| f | 0.0408 |
| p | 0.040 |
| w | 0.0382 |
Top 10 end of word letters
| Letter | Frequency |
| e | 0.1917 |
| s | 0.1435 |
| d | 0.0923 |
| t | 0.0864 |
| n | 0.0786 |
| y | 0.0730 |
| r | 0.0693 |
| o | 0.0467 |
| l | 0.0456 |
| f | 0.0408 |
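Tables like the two above could be produced by tallying the first or last letter of every word in a sample. The following sketch assumes that a "word" is any unbroken run of the letters a to z, so "don't" counts as two words; the function name `edge_letter_frequencies` and its `position` parameter are hypothetical, and the figures in the tables were not necessarily obtained this way.

```python
from collections import Counter
import re


def edge_letter_frequencies(text, position="first"):
    """Return {letter: share of words} for first or last letters, most common first.

    position: "first" for word-initial letters, anything else for word-final.
    """
    # Assumption: a word is a maximal run of a-z after lowercasing.
    words = re.findall(r"[a-z]+", text.lower())
    index = 0 if position == "first" else -1
    counts = Counter(word[index] for word in words)
    total = len(words)
    return {letter: count / total for letter, count in counts.most_common()}
```

Taking the first ten items of the returned dictionary would give a "top 10" list in the same shape as the tables.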
Most common digrams (in order)
th, he, in, en, nt, re, er, an, ti, es, on, at, se, nd, or, ar, al, te, co, de, to, ra, et, ed, it, sa, em, ro.
Most common trigrams (in order)
the, and, tha, ent, ing, ion, tio, for, nde, has, nce, edt, tis, oft, sth, men
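Digram and trigram lists of this kind can be built by sliding a window of two or three letters across a text and counting what appears. Entries such as edt, oft and sth look as if they span word boundaries (for instance "ed t" or "of t"), which suggests spaces were dropped before counting; that is an inference, not something the lists state. The sketch below therefore offers both modes. The function name `top_ngrams` and its parameters are illustrative.

```python
from collections import Counter
import re


def top_ngrams(text, n, k=10, across_words=False):
    """Return the k most common letter n-grams as (ngram, count) pairs.

    across_words=False counts n-grams inside each word only;
    across_words=True removes everything except a-z first, so n-grams
    may straddle what were originally separate words.
    """
    text = text.lower()
    if across_words:
        chunks = ["".join(re.findall(r"[a-z]", text))]
    else:
        chunks = re.findall(r"[a-z]+", text)
    counts = Counter(
        chunk[i:i + n]
        for chunk in chunks
        for i in range(len(chunk) - n + 1)
    )
    return counts.most_common(k)
```

Calling it with `n=2` gives digrams and with `n=3` gives trigrams; comparing the two modes on the same sample shows how much of a list is made up of boundary-spanning sequences.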
External link
- A site with content from Cryptographical Mathematics by Robert Edward Lewand