Character encoding

(Or "character encoding scheme") A mapping between binary data values and character code positions (or "code points"). Early systems stored characters in a variety of ways, e.g. four six-bit characters in a 24-bit word, but around 1960, eight-bit bytes started to become the most common data storage layout, with each character stored in one byte, typically in the ASCII character set. In the case of ASCII, the character encoding is an identity mapping: code position 65 maps to the byte value 65. This is possible because ASCII uses only code positions representable as single bytes, i.e., values between 0 and 255. (US-ASCII only uses values 0 to 127, in fact.) From the late 1990s, there was increased use of larger character sets such as Unicode and many CJK coded character sets. These can represent characters from many languages and more symbols. Unicode uses many more than the 256 code positions that can be represented by one byte. It thus requires more complex mappings: sometimes the characters are mapped onto pairs of bytes (see DBCS). In many cases, this breaks programs that assume a one-to-one mapping of bytes to characters, and so, for example, treat any occurrance of the byte value 13 as a carriage return. To avoid this problem, character encodings such as UTF-8 were devised.

Free Online Dictionary of Computing