Character encoding

A character encoding is a code that pairs each character in a set (characters being representations of graphemes or grapheme-like units, such as might appear in an alphabet or syllabary used to write a natural language) with something else, such as a number or a sequence of electrical pulses, in order to facilitate the storage of text in computers and the transmission of text through telecommunication networks. Common examples include Morse code, which encodes letters of the Latin alphabet as series of long and short depressions of a telegraph key, and ASCII, which encodes letters, numerals, and other symbols both as integers and as 7-bit binary versions of those integers.

Character repertoire

In some contexts, especially computer storage and communication, it makes sense to distinguish a character repertoire (a full set of abstract characters that a system supports) from a coded character set or character encoding (which specifies how to represent characters from that set using a number of integer codes).

In the early days of computing, the introduction of character repertoires such as ASCII (1963) and EBCDIC (1964) began the process of standardisation. The limitations of such sets soon became apparent, and a number of ad hoc methods developed to extend them. The need to support multiple writing systems, including the CJK family of East Asian scripts, required support for a far larger number of characters and demanded a systematic approach to character encoding rather than the previous ad hoc approaches.

For example, the full repertoire of Unicode encompasses over 100,000 characters. Each of these characters has a unique integer code (its code point) in the range 0 to hexadecimal 10FFFF. That range holds 1,114,112 integers, a little over 1.1 million, so not all integers in it represent coded characters. Other common repertoires include ASCII and ISO 8859-1, which mirror exactly the first 128 and the first 256 coded characters of Unicode respectively.
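The relationship between characters and integer codes can be checked in a few lines of Python. This is only an illustrative sketch: the sample characters and the variable names are chosen here for the example, and it relies on Python's built-in "ascii" and "latin-1" codecs as stand-ins for ASCII and ISO 8859-1.

```python
# Sample characters, written as escapes so the source file stays pure ASCII.
samples = ["\u0041", "\u00e9", "\u4e2d", "\U0001F600"]

# ord() returns the integer code of a character; format it in the usual U+ notation.
labels = [f"U+{ord(ch):04X}" for ch in samples]

# The code range 0..0x10FFFF contains 1,114,112 integers in total.
MAX_CODE = 0x10FFFF
range_size = MAX_CODE + 1

# ASCII matches the first 128 Unicode codes, ISO 8859-1 the first 256:
# decoding byte value i must give the character whose code is i.
ascii_matches = all(bytes([i]).decode("ascii") == chr(i) for i in range(128))
latin1_matches = all(bytes([i]).decode("latin-1") == chr(i) for i in range(256))
```

Here chr() and ord() convert between a character and its integer code, so the two final checks amount to saying that the byte value and the Unicode code coincide throughout those ranges.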

Encoding forms and encoding schemes

Computer scientists sometimes overload the term character encoding to mean also how a specific sequence of bits represents characters. This involves an encoding form, which specifies the conversion of the integer code into a series of integer code values that facilitate storage in a system that uses fixed bit widths. For example, integers greater than 65535 (hex FFFF) will not fit in 16 bits, so the UTF-16 encoding form mandates representation of these integers as a surrogate pair of integers, each less than 65536 and not assigned to characters (for example, hex 10000 becomes the pair D800 DC00). An encoding scheme then converts code values to bit sequences, with attention given to things like platform-dependent byte order (for example, D800 DC00 might become the bytes 00 D8 00 DC on a little-endian architecture such as Intel x86). A character set, character map, or code page shortcuts this process by directly mapping abstract characters to specific bit patterns. Unicode Technical Report #17 explains this terminology in depth and provides further examples.
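The two steps described above, encoding form and encoding scheme, might be sketched in Python as follows. The function names to_surrogate_pair and serialize_utf16 are hypothetical helpers written for this illustration, not part of any library; the arithmetic follows the standard UTF-16 surrogate rule, and the last line compares the result with Python's built-in "utf-16-le" codec.

```python
def to_surrogate_pair(code):
    """Encoding form: split a code above 0xFFFF into two 16-bit code values."""
    if not 0x10000 <= code <= 0x10FFFF:
        raise ValueError("only codes from 0x10000 to 0x10FFFF need a surrogate pair")
    offset = code - 0x10000            # a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def serialize_utf16(code_values, byteorder):
    """Encoding scheme: write each 16-bit code value as two bytes in the given order."""
    return b"".join(value.to_bytes(2, byteorder) for value in code_values)


pair = to_surrogate_pair(0x10000)                 # the pair D800 DC00
little_endian = serialize_utf16(pair, "little")   # bytes 00 D8 00 DC
big_endian = serialize_utf16(pair, "big")         # bytes D8 00 DC 00

# Cross-check against the built-in codec.
builtin_matches = chr(0x10000).encode("utf-16-le") == little_endian
```

The same code values therefore yield different byte sequences depending on byte order, which is why the encoding scheme has to be specified separately from the encoding form.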

Since most applications use only a small subset of Unicode, encoding schemes (like UTF-8 and UTF-16) and character maps (like ASCII) provide efficient ways to represent Unicode characters in computer storage or communications by using short binary words. Some of these encodings are variable-width, spending fewer bits on the most commonly used characters so that a large repertoire can be represented compactly.
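As a rough illustration of how much storage different schemes need, the following Python sketch counts the bytes produced for a few sample characters. The sample set and the names samples, sizes, and ascii_is_utf8 are chosen here for the example; the codecs are Python's built-in ones.

```python
# Sample characters, written as escapes so the source file stays pure ASCII.
samples = ["\u0041", "\u00e9", "\u4e2d", "\U0001F600"]

# For each character, record (bytes in UTF-8, bytes in UTF-16 little-endian).
sizes = {
    f"U+{ord(ch):04X}": (len(ch.encode("utf-8")), len(ch.encode("utf-16-le")))
    for ch in samples
}

# Text limited to the ASCII repertoire has the same bytes in ASCII and UTF-8.
text = "plain text"
ascii_is_utf8 = text.encode("ascii") == text.encode("utf-8")
```

For these samples UTF-8 uses between one and four bytes per character, while UTF-16 uses two bytes, or four for a character that needs a surrogate pair.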

Popular character encodings
