null program

Emacs Unicode Pitfalls

GNU Emacs is seven years older than Unicode. Support for Unicode had to be added relatively late in Emacs' existence. This means Emacs has existed longer without Unicode support (16 years) than with it (14 years). Despite this, Emacs has excellent Unicode support. It feels as if it was there the whole time.

However, as a natural result of Unicode covering all sorts of edge cases for every known human language, there are pitfalls and complications. As a user of Emacs, you're not particularly affected by these, but extension developers might run into trouble while handling Emacs character-oriented data structures: strings and buffers.

In this article I'll go over Elisp's Unicode surprises. I've been caught by some of these myself. In fact, as a result of writing this article, I've discovered subtle encoding bugs in some of my own extensions. None of these pitfalls are Emacs' fault. They're just the result of complexities of natural language.

Unicode and Code Points

First, there are excellent materials online for learning Unicode. I recommend starting with UTF-8 and Unicode FAQ for Unix/Linux. There's no reason for me to repeat all this information here, but I'll attempt to quickly summarize it.

Unicode maps code points (integers) to specific characters, along with a standard name. As of this writing, Unicode defines over 110,000 characters. For backwards compatibility, the first 128 code points are mapped to ASCII. This trend continues for other character standards, like Latin-1.

In Emacs, Unicode characters are entered into a buffer with C-x 8 RET (insert-char). You can enter either the official name of the character (e.g. "GREEK SMALL LETTER PI" for π) or the hexadecimal code point. Outside of Emacs it depends on the application, but C-S-u followed by the hexadecimal code works for most of the applications I care about.

Encodings

The Unicode standard also describes several methods for encoding sequences of code points into sequences of bytes. Obviously a selection of 110,000 characters cannot be encoded with one byte per letter, so these are multibyte encodings. The two most popular encodings are probably UTF-8 and UTF-16.

UTF-8 was designed to be backwards compatible with ASCII, Unix, and existing C APIs (null-terminated C strings). The first 128 code points are encoded directly as a single byte. Every other character is encoded with two to six bytes, with the highest bit of each byte set to 1. This ensures that no part of a multibyte character will be interpreted as ASCII, nor will it contain a null (0). The latter means that C programs and C APIs can handle UTF-8 strings with few or no changes. Most importantly, every ASCII encoded file is automatically a UTF-8 encoded file.

UTF-16 encodes all the characters from the Basic Multilingual Plane (BMP) with two bytes. Even the original ASCII characters get two bytes (16 bits). The BMP covers virtually all modern languages and is generally all you'll ever practically need. However, this doesn't include the important TROPICAL DRINK or PILE OF POO characters from the supplemental ("astral") plane. If you need to use these characters in UTF-16, you're going to run into problems: characters outside the BMP don't fit in two bytes. To accommodate these characters, UTF-16 uses surrogate pairs: these characters are encoded with two 16-bit units.

Because of this last point, UTF-16 offers no practical advantages over UTF-8. Its existence was probably a big mistake. You can't do constant-time character lookup because you have to scan for surrogate pairs. It's not backwards compatible and cannot be stored in null-terminated strings. In both Java and JavaScript, it leads to the awkward situation where the "length" of a string is not the number of characters, code points, or even bytes. Worst of all, it has serious security implications. New applications should avoid it whenever possible.

Emacs and UTF-8

Emacs internally stores all text as UTF-8. This was an excellent choice! When text leaves Emacs, such as writing to a file or to a process, Emacs automatically converts it to the coding system configured for that particular file or process. When it accepts text from a file or process, it either converts it to UTF-8 or preserves it as raw bytes.

There are two modes for this in Emacs: unibyte and multibyte. Unibyte strings/buffers are just raw bytes. They have constant access O(1) time but can only hold single-byte values. The byte-code compiler outputs unibyte strings.

Multibyte strings/buffers hold UTF-8 encoded code points. Character access is O(n) because the string/buffer has to be scanned to count characters.

The actual encoding is rarely relevant because there's little way (and need) to access it directly. Emacs automatically converts text as needed when it leaves Emacs and arrives in Emacs, so there's no need to know the internal encoding. If you really want to see it anyway, you can use string-as-unibyte to get a copy of a string with the exact same bytes, but as a byte-string.

(string-as-unibyte "π")
;; => "\317\200"

This can be reversed with string-as-multibyte), to change a unibyte string holding UTF-8 encoded text back into a multibyte string. Note that these functions are different than string-to-unibyte and string-to-multibyte, which will attempt a conversion rather than preserving the raw bytes.

The length and buffer-size functions always count characters in multibyte and bytes in unibyte. Being UTF-8, there are no surrogate pairs to worry about here. The string-bytes and position-bytes functions return byte information for both multibyte and unibyte.

To specify a Unicode character in a string literal without using the character directly, use \uXXXX. The XXXX is the hexadecimal code point for the character and is always 4 digits long. For characters outside the BMP, which won't fit in four digits, use a capital U with eight digits: \UXXXXXXXX.

"\u03C0"
;; => "π"

"\U0001F4A9"
;; => "💩"  (PILE OF POO)

Finally, Emacs extends Unicode with 256 additional "characters" representing raw bytes. This allows raw bytes to be embedded distinctly within UTF-8 sequences. For example, it's used to distinguish the code point U+0041 from the raw byte #x41. As far as I can tell, this isn't used very often.

Combining Characters

Some Unicode characters are defined as combining characters. These characters modify the non-combining character that appears before it, typically with accents or diacritical marks.

For example, the word "naïve" can be written as six characters as "nai\u0308ve". The fourth character, U+0308 (COMBINING DIAERESIS), is a combining character that changes the "i" (U+0069 LATIN SMALL LETTER I) into an umlaut character.

The most commonly accented characters have a code of their own. These are called precomposed characters. This includes ï (U+00EF LATIN SMALL LETTER I WITH DIAERESIS). This means "naïve" can also be written as five characters as "na\u00EFve".

Normalization

So what happens when comparing two different representations of the same text? They're not equal.

(string= "nai\u0308ve" "na\u00EFve")
;; => nil

To deal with situations like this, the Unicode standard defines four different kinds of normalization. The two most important ones are NFC (composition) and NFD (decomposition). The former uses precomposed characters whenever possible and the latter breaks them apart. The functions ucs-normalize-NFC-string and ucs-normalize-NFD-string perform this operation.

Pitfall #1: Proper string comparison requires normalization. It doesn't matter which normalization you use (though NFD should be slightly faster), you just need to use it consistently. Unfortunately this can get tricky when using equal to compare complex data structures with multiple strings.

(string= (ucs-normalize-NFD-string "nai\u0308ve")
         (ucs-normalize-NFD-string "na\u00EFve"))
;; => t

Emacs itself fails to do this. It doesn't normalize strings before interning them, which is probably a mistake. This means you can have differently defined variables and functions with the same canonical name.

(eq (intern "nai\u0308ve")
    (intern "na\u00EFve"))
;; => nil

(defun print-résumé ()
  "NFC-normalized form."
  (print "I'm going to sabotage your team."))

(defun print-résumé ()
  "NFD-normalized form."
  (print "I'd be a great asset to your team."))

(print-résumé)
;; => "I'm going to sabotage your team."

String Width

There are three ways to quantify multibyte text. These are often the same value, but in some circumstances they can each be different.

Most of the time, one character is one column (a width of one). Some characters, like combining characters, consume no columns. Many Asian characters consume two columns (U+4000, 䀀). Tabs consume tab-width columns, usually 8.

Generally, a string should have the same width regardless of which whether it's NFD or NFC. However, due to bugs and incomplete Unicode support, this isn't strictly true. For example, some combining characters, such as U+20DD ⃝, won't combine correctly in Emacs nor in other applications.

Pitfall #2: Always measure text by width, not length, when laying out a buffer. Width is measured with the string-width function. This comes up when laying out tables in a buffer. The number of characters that fit in a column depends on what those characters are.

Fortunately I accidentally got this right in Elfeed because I used the format function for layout. The %s directive operates on width, as would be expected. However, this has the side effect that the output of may format change depending on the current buffer! Pitfall #3: Be mindful of the current buffer when using the format function.

(let ((tab-width 4))
  (length (format "%.6s" "\t")))
;; => 1

(let ((tab-width 8))
  (length (format "%.6s" "\t")))
;; => 0

String Reversal

Say you want to reverse a multibyte string. Simple, right?

(defun reverse-string (string)
  (concat (reverse (string-to-list string))))

(reverse-string "abc")
;; => "cba"

Wrong! The combining characters will get flipped around to the wrong side of the character they're meant to modify.

(reverse-string "nai\u0308ve")
;; => "ev̈ian"

Pitfall #4: Reversing Unicode strings is non-trivial. The Rosetta Code page is full of incorrect examples, and I'm personally guilty of this, too. The other day I submitted a patch to s.el to correct its s-reverse function for Unicode. If it's accepted, you should never need to worry about this.

Regular Expressions

Regular expressions operate on code points. This means combining characters are counted separately and the match may change depending on how characters are composed. To avoid this, you might want to consider NFC normalization before performing some kinds of regular expressions.

;; Like string= from before:
(string-match-p  "na\u00EFve" "nai\u0308ve")
;; => nil

;; The . only matches part of the composition
(string-match-p "na.ve" "nai\u0308ve")
;; => nil

Pitfall #5: Be mindful of combining characters when using regular expressions. Prefer NFC normalization when dealing with regular expressions.

Another potential problem is ranges, though this is quite uncommon. Ranges of characters can be expressed in inside brackets, e.g. [a-zA-Z]. If the range begins or ends with a decomposed combining character you won't get the proper range because its parts are considered separately by the regular expression engine.

(defvar match-weird "[\u00E0-\u00F6]+")

(string-match-p match-weird "áâãäå")
;; => 0  (successful match)

(string-match-p (ucs-normalize-NFD-string match-weird) "áâãäå")
;; => nil

It's especially important to keep all of this in mind when sanitizing untrusted input, such as when using Emacs as a web server. An attacker might use a denormalized or strange grapheme cluster to bypass a filter.

Interacting with the World

Here's a mistake I've made twice now. Emacs uses UTF-8 internally, regardless of whatever encoding the original text came in. Pitfall #6: When working with bytes of text, the counts may be different than the original source of the text.

For example, HTTP/1.1 introduced persistent connections. Before this, a client connects to a server and asks for content. The server sends the content and then closes the connection to signal the end of the data. In HTTP/1.1, when Connection: close isn't specified, the server will instead send a Content-Length header indicating the length of the content in bytes. The connection can then be re-used for more requests, or, more importantly, pipelining requests.

The main problem is that HTTP headers usually have a different encoding than the content body. Emacs is not prepared to handle multiple encodings from a single source, so the only correct way to talk HTTP with a network process is raw. My mistake was allowing Emacs to do the UTF-8 conversion, then measuring the length of the content in its UTF-8 encoding. This just happens to work fine about 99.9% of the time since clients tend to speak UTF-8, or something like it, anyway, but it's not correct.

Further Reading

A lot of this investigation was inspired by JavaScript's and other languages' Unicode shortcomings.

Comparatively, Emacs Lisp has really great Unicode support. This isn't too surprising considering that it's primary purpose is for manipulating text.

tags: [ emacs elisp ]
blog comments powered by Disqus
Fork me on GitHub