To translate bytes to characters, you need to know both what character code and what character encoding you’re using. A character code defines a mapping from positive integers to characters. Each number in the mapping is called a code point. For instance, ASCII is a character code that maps the numbers from 0-127 to particular characters used in the Latin alphabet. A character encoding, on the other hand, defines how the code points are represented as a sequence of bytes in a byte-oriented medium such as a file. For codes that use eight or fewer bits, such as ASCII and ISO-8859-1, the encoding is trivial—each numeric value is encoded as a single byte.
Nearly as straightforward are pure double-byte encodings, such as UCS-2, which map between 16-bit values and characters. The only reason double-byte encodings can be more complex than single-byte encodings is that you may also need to know whether the 16-bit values are supposed to be encoded in big-endian or little-endian format.
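For example, a 16-bit value could be read from a stream of octets with a sketch like this (the function name and keyword argument are illustrative, not part of any standard library):

```lisp
;; Sketch: read one UCS-2 code unit from a stream of (unsigned-byte 8),
;; combining the two octets according to the given byte order.
(defun read-ucs-2-value (in &key (endianness :big))
  (let ((b1 (read-byte in))
        (b2 (read-byte in)))
    (ecase endianness
      (:big    (logior (ash b1 8) b2))      ; most significant octet first
      (:little (logior (ash b2 8) b1)))))   ; least significant octet first
```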
Variable-width encodings use different numbers of octets for different numeric values, making them more complex but allowing them to be more compact in many cases. For instance, UTF-8, an encoding designed for use with the Unicode character code, uses a single octet to encode the values 0-127 while using up to four octets to encode values up to 1,114,111.
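To make that concrete, here's a rough sketch of decoding a single UTF-8 code point from a byte stream (my own illustrative function; validation of continuation octets and rejection of malformed sequences are omitted):

```lisp
;; The high bits of the leading octet determine how many continuation
;; octets follow; each continuation octet contributes six payload bits.
(defun read-utf-8-code-point (in)
  (flet ((continuation () (ldb (byte 6 0) (read-byte in))))
    (let ((b (read-byte in)))
      (cond
        ((< b #x80) b)                                  ; 0xxxxxxx: 1 octet
        ((< b #xE0)                                     ; 110xxxxx: 2 octets
         (logior (ash (ldb (byte 5 0) b) 6) (continuation)))
        ((< b #xF0)                                     ; 1110xxxx: 3 octets
         (let ((c1 (continuation)))
           (logior (ash (ldb (byte 4 0) b) 12) (ash c1 6) (continuation))))
        (t                                              ; 11110xxx: 4 octets
         (let* ((c1 (continuation)) (c2 (continuation)))
           (logior (ash (ldb (byte 3 0) b) 18) (ash c1 12)
                   (ash c2 6) (continuation))))))))
```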
Common Lisp provides two functions for translating between numeric character codes and character objects: **CODE-CHAR**, which takes a numeric code and returns the corresponding character, and **CHAR-CODE**, which takes a character and returns its numeric code. The language standard doesn't specify what character encoding an implementation must use, so there's no guarantee you can represent every character that can possibly be encoded in a given file format as a Lisp character. However, almost all contemporary Common Lisp implementations use ASCII, ISO-8859-1, or Unicode as their native character code. Because Unicode is a superset of ISO-8859-1, which is in turn a superset of ASCII, if you're using a Unicode Lisp, **CODE-CHAR** and **CHAR-CODE** can be used directly for translating any of those three character codes.
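For example, in a Lisp whose native character code includes ISO-8859-1:

```lisp
(code-char 65)    ; ==> #\A
(char-code #\A)   ; ==> 65
(code-char 233)   ; ==> #\é in an ISO-8859-1 or Unicode Lisp
```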
In addition to specifying a character encoding, a string encoding must also specify how to encode the length of the string. Three techniques are typically used in binary file formats.
The simplest is to not encode it at all but to let it be implicit in the position of the string in some larger structure: a particular element of a file may always be a string of a certain length, or a string may be the last element of a variable-length data structure whose overall size determines how many bytes are left to read as string data. Both these variants are used in ID3 tags, as you'll see in the next chapter. The other two techniques encode the length explicitly: either the string data is preceded by an integer giving the number of octets or characters to follow, or the string data is followed by a terminator, a value such as a null byte that can't otherwise appear in the string.
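For example, here's a sketch of reading a string stored with a one-octet length prefix (the function name is illustrative):

```lisp
;; Read the length from a single unsigned octet, then read that many
;; octets of character data into a fresh string.
(defun read-u1-length-ascii (in)
  (let* ((length (read-byte in))
         (string (make-string length)))
    (dotimes (i length string)
      (setf (char string i) (code-char (read-byte in))))))
```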
The different representations have different advantages and disadvantages, but when you’re dealing with already specified binary formats, you won’t have any control over which encoding is used. However, none of the encodings is particularly more difficult to read and write than any other. Here, as an example, is a function that reads a null-terminated ASCII string, assuming your Lisp implementation uses ASCII or one of its supersets such as ISO-8859-1 or full Unicode as its native character encoding:
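```lisp
;; A sketch of such a reader; +null+ names the terminating character.
;; Read octets one at a time, collecting the corresponding characters
;; until the terminator is seen.
(defconstant +null+ (code-char 0))

(defun read-null-terminated-ascii (in)
  (with-output-to-string (s)
    (loop for char = (code-char (read-byte in))
          until (char= char +null+) do (write-char char s))))
```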
The **WITH-OUTPUT-TO-STRING** macro, which I mentioned in Chapter 14, is an easy way to build up a string when you don't know how long it'll be. It creates a **STRING-STREAM** and binds it to the variable name specified, s in this case. All characters written to the stream are collected into a string, which is then returned as the value of the **WITH-OUTPUT-TO-STRING** form.
To write a string back out, you just need to translate the characters back to numeric values that can be written with **WRITE-BYTE** and then write the null terminator after the string contents.
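A matching writer, reusing the +null+ constant from the reader above, might look like this sketch:

```lisp
;; Write each character's code as a single octet, then the terminator.
(defun write-null-terminated-ascii (string out)
  (loop for char across string
        do (write-byte (char-code char) out))
  (write-byte (char-code +null+) out))
```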
Now you can turn to the issue of reading and writing more complex on-disk structures and how to map them to Lisp objects.