Emacs Unicode Pitfalls
GNU Emacs is seven years older than Unicode. Support for Unicode had to be added relatively late in Emacs' existence. This means Emacs has existed longer without Unicode support (16 years) than with it (14 years). Despite this, Emacs has excellent Unicode support. It feels as if it was there the whole time.
However, as a natural result of Unicode covering all sorts of edge cases for every known human language, there are pitfalls and complications. As a user of Emacs, you're not particularly affected by these, but extension developers might run into trouble while handling Emacs character-oriented data structures: strings and buffers.
In this article I'll go over Elisp's Unicode surprises. I've been caught by some of these myself. In fact, as a result of writing this article, I've discovered subtle encoding bugs in some of my own extensions. None of these pitfalls are Emacs' fault. They're just the result of complexities of natural language.
Unicode and Code Points
First, there are excellent materials online for learning Unicode. I recommend starting with UTF-8 and Unicode FAQ for Unix/Linux. There's no reason for me to repeat all this information here, but I'll attempt to quickly summarize it.
Unicode maps code points (integers) to specific characters, along with a standard name. As of this writing, Unicode defines over 110,000 characters. For backwards compatibility, the first 128 code points are mapped to ASCII. The same approach extends to other character standards: the first 256 code points match Latin-1.
In Emacs, Unicode characters are entered into a buffer with C-x 8 RET (insert-char). You can enter either the official name of the character (e.g. "GREEK SMALL LETTER PI" for π) or the hexadecimal code point. Outside of Emacs it depends on the application, but Ctrl+Shift+U followed by the hexadecimal code works for most of the applications I use.
The Unicode standard also describes several methods for encoding sequences of code points into sequences of bytes. Obviously a selection of 110,000 characters cannot be encoded with one byte per letter, so these are multibyte encodings. The two most popular encodings are probably UTF-8 and UTF-16.
UTF-8 was designed to be backwards compatible with ASCII, Unix, and existing C APIs (null-terminated C strings). The first 128 code points are encoded directly as a single byte. Every other character is encoded with two to four bytes (the original design allowed up to six, but the current standard stops at four), with the highest bit of each byte set to 1. This ensures that no part of a multibyte character will be interpreted as ASCII, nor will it contain a null (0). The latter means that C programs and C APIs can handle UTF-8 strings with few or no changes. Most importantly, every ASCII encoded file is automatically a UTF-8 encoded file.
UTF-16 encodes all the characters from the Basic Multilingual Plane (BMP) with two bytes. Even the original ASCII characters get two bytes (16 bits). The BMP covers virtually all modern languages and is generally all you'll ever practically need. However, this doesn't include the important TROPICAL DRINK or PILE OF POO characters from the supplemental ("astral") plane. If you need to use these characters in UTF-16, you're going to run into problems: characters outside the BMP don't fit in two bytes. To accommodate these characters, UTF-16 uses surrogate pairs: these characters are encoded with two 16-bit units.
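The size differences between the two encodings can be checked from Elisp itself. This is a minimal sketch using encode-coding-string; my-encoded-size is a hypothetical helper name, not part of Emacs, and utf-16be is chosen here because it emits no byte order mark, which would otherwise add two bytes to every result.

```elisp
(defun my-encoded-size (string coding-system)
  "Return the number of bytes STRING occupies when encoded with CODING-SYSTEM."
  ;; `encode-coding-string' returns a unibyte string, so `length' counts bytes.
  (length (encode-coding-string string coding-system)))

(my-encoded-size "a" 'utf-8)              ;; => 1
(my-encoded-size "a" 'utf-16be)           ;; => 2
(my-encoded-size "\u03C0" 'utf-8)         ;; => 2 (GREEK SMALL LETTER PI)
(my-encoded-size "\u03C0" 'utf-16be)      ;; => 2
(my-encoded-size "\U0001F4A9" 'utf-8)     ;; => 4 (PILE OF POO)
(my-encoded-size "\U0001F4A9" 'utf-16be)  ;; => 4 (a surrogate pair)
```

The last line is the surrogate pair case: the character sits outside the BMP, so UTF-16 needs two 16-bit units for it.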
Emacs and UTF-8
Emacs internally stores all text as UTF-8. This was an excellent choice! When text leaves Emacs, such as writing to a file or to a process, Emacs automatically converts it to the coding system configured for that particular file or process. When it accepts text from a file or process, it either converts it to UTF-8 or preserves it as raw bytes.
There are two modes for this in Emacs: unibyte and multibyte. Unibyte strings/buffers are just raw bytes. They have constant-time, O(1), access but can only hold single-byte values. The byte-code compiler outputs unibyte strings.
Multibyte strings/buffers hold UTF-8 encoded code points. Character access is O(n) because the string/buffer has to be scanned to count characters.
The actual encoding is rarely relevant because there's little way (and
need) to access it directly. Emacs automatically converts text as
needed when it leaves Emacs and arrives in Emacs, so there's no need
to know the internal encoding. If you really want to see it anyway,
you can use
string-as-unibyte to get a copy of a string with the
exact same bytes, but as a byte-string.
(string-as-unibyte "π") ;; => "\317\200"
This can be reversed with string-as-multibyte, to change a unibyte string holding UTF-8 encoded text back into a multibyte string. Note that these functions are different than string-to-unibyte and string-to-multibyte, which will attempt a conversion rather than preserving the raw bytes.
The length and buffer-size functions always count characters in multibyte and bytes in unibyte. Being UTF-8, there are no surrogate pairs to worry about here. The string-bytes and position-bytes functions return byte information for both multibyte and unibyte.
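The difference between character counts and byte counts is easy to see with a single non-ASCII character. This is a small sketch that only uses functions built into Emacs; the strings are written with escapes so the result doesn't depend on how the source file itself is encoded.

```elisp
;; One character, two bytes in the internal UTF-8 representation.
(let ((pi-string "\u03C0"))
  (list (multibyte-string-p pi-string)
        (length pi-string)
        (string-bytes pi-string)))
;; => (t 1 2)

;; The same holds for buffers: `buffer-size' counts characters, while
;; `position-bytes' reports a 1-based byte position.
(with-temp-buffer
  (insert "\u03C0\u03C0")
  (list (buffer-size)
        (position-bytes (point-max))))
;; => (2 5)
```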
To specify a Unicode character in a string literal without using the character directly, use \uXXXX, where XXXX is the hexadecimal code point for the character and is always 4 digits long. For characters outside the BMP, which won't fit in four digits, use a capital U with eight digits: \UXXXXXXXX.
"\u03C0" ;; => "π"
"\U0001F4A9" ;; => "💩" (PILE OF POO)
Finally, Emacs extends Unicode with 128 additional "characters" representing the raw bytes #x80 through #xFF. This allows raw bytes to be embedded distinctly within UTF-8 sequences. For example, it's used to distinguish the code point U+00E9 from the raw byte #xE9. As far as I can tell, this isn't used very often.
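To make the raw-byte "characters" a little more concrete, here is a short sketch using the built-in conversion functions. It assumes a reasonably modern Emacs (23 or later), where bytes in the range #x80 to #xFF are mapped to character codes well above the end of the Unicode range at #x10FFFF.

```elisp
;; A raw byte gets its own "character" code, far outside Unicode.
(unibyte-char-to-multibyte #xE9)           ;; => 4194281 (#x3FFFE9)
(multibyte-char-to-unibyte #x3FFFE9)       ;; => 233 (#xE9)

;; So the raw byte #xE9 and the code point U+00E9 stay distinct.
(= (unibyte-char-to-multibyte #xE9) #xE9)  ;; => nil
```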
Some Unicode characters are defined as combining characters. These characters modify the non-combining character that appears before them, typically with accents or diacritical marks.
For example, the word "naïve" can be written as six characters as
"nai\u0308ve". The fourth character, U+0308 (COMBINING DIAERESIS),
is a combining character that changes the "i" (U+0069 LATIN SMALL
LETTER I) into an umlaut character.
The most commonly accented characters have a code of their own. These are called precomposed characters. This includes ï (U+00EF LATIN SMALL LETTER I WITH DIAERESIS). This means "naïve" can also be written as five characters as "na\u00EFve".
So what happens when comparing two different representations of the same text? They're not equal.
(string= "nai\u0308ve" "na\u00EFve") ;; => nil
To deal with situations like this, the Unicode standard defines four different kinds of normalization. The two most important ones are NFC (composition) and NFD (decomposition). The former uses precomposed characters whenever possible and the latter breaks them apart. The ucs-normalize-* functions (such as ucs-normalize-NFC-string and ucs-normalize-NFD-string) perform this operation.
Pitfall #1: Proper string comparison requires normalization. It
doesn't matter which normalization you use (though NFD should be
slightly faster), you just need to use it consistently. Unfortunately
this can get tricky when using
equal to compare complex data
structures with multiple strings.
(string= (ucs-normalize-NFD-string "nai\u0308ve") (ucs-normalize-NFD-string "na\u00EFve")) ;; => t
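For the case of equal on nested data, one option is to normalize every string in the structure before comparing. The following is only a sketch under stated assumptions: my-normalize-tree is a hypothetical helper, it picks NFC arbitrarily, and it only walks cons cells, so strings inside vectors or hash tables are left untouched.

```elisp
(require 'ucs-normalize)

(defun my-normalize-tree (object)
  "Return a copy of OBJECT with every string NFC-normalized.
Only strings and cons cells are handled in this sketch; anything
else (symbols, numbers, vectors) is returned unchanged."
  (cond ((stringp object) (ucs-normalize-NFC-string object))
        ((consp object) (cons (my-normalize-tree (car object))
                              (my-normalize-tree (cdr object))))
        (t object)))

;; The same data, once decomposed and once precomposed.
(let ((a (list "nai\u0308ve" (cons 'name "re\u0301sume\u0301")))
      (b (list "na\u00EFve" (cons 'name "r\u00E9sum\u00E9"))))
  (list (equal a b)
        (equal (my-normalize-tree a) (my-normalize-tree b))))
;; => (nil t)
```

Normalizing once at the boundary where text enters your extension is usually simpler than normalizing at every comparison.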
Emacs itself fails to do this. It doesn't normalize strings before interning them, which is probably a mistake. This means you can have differently defined variables and functions with the same canonical name.
(eq (intern "nai\u0308ve") (intern "na\u00EFve")) ;; => nil
(defun print-résumé ()
  "NFC-normalized form."
  (print "I'm going to sabotage your team."))
(defun print-résumé ()
  "NFD-normalized form."
  (print "I'd be a great asset to your team."))
(print-résumé) ;; => "I'm going to sabotage your team."
There are three ways to quantify multibyte text. These are often the same value, but in some circumstances they can each be different.
- length: number of characters, including combining characters
- bytes: number of bytes in its UTF-8 encoding
- width: number of columns it would occupy in the current buffer
Most of the time, one character is one column (a width of one). Some characters, like combining characters, consume no columns. Many Asian characters consume two columns (U+4000, 䀀). Tabs consume tab-width columns, usually 8.
Generally, a string should have the same width regardless of whether it's NFD or NFC. However, due to bugs and incomplete Unicode support, this isn't strictly true. For example, some combining characters, such as U+20DD ⃝, won't combine correctly in Emacs or in other applications.
Pitfall #2: Always measure text by width, not length, when laying out a buffer. Width is measured with the string-width function.
This comes up when laying out tables in a buffer. The number of characters that fit in a column depends on what those characters are. Fortunately I accidentally got this right in Elfeed because I used the format function for layout. The %s directive operates on width, as would be expected. However, this has the side effect that the output of format may change depending on the current buffer!
Pitfall #3: Be mindful of the current buffer when using the format function.
(let ((tab-width 4)) (length (format "%.6s" "\t"))) ;; => 1
(let ((tab-width 8)) (length (format "%.6s" "\t"))) ;; => 0
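For the table layout case described above, the gap between length and width can be seen directly, and the built-in truncate-string-to-width can pad a cell to a fixed number of columns. This is a rough sketch: my-pad-to-width is a hypothetical helper name, and the widths shown assume the default character width table.

```elisp
(length "nai\u0308ve")        ;; => 6 characters
(string-width "nai\u0308ve")  ;; => 5 columns (the combining mark takes none)
(length "\u4000")             ;; => 1 character
(string-width "\u4000")       ;; => 2 columns

(defun my-pad-to-width (string width)
  "Truncate or pad STRING with spaces so it occupies exactly WIDTH columns."
  ;; The fourth argument is the padding character.
  (truncate-string-to-width string width nil ?\s))

(string-width (my-pad-to-width "\u4000" 4))       ;; => 4
(string-width (my-pad-to-width "nai\u0308ve" 8))  ;; => 8
(length (my-pad-to-width "nai\u0308ve" 8))        ;; => 9
```

The last two results are the point: the padded cell is eight columns wide but nine characters long, so any layout code that counts characters will drift out of alignment.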
Say you want to reverse a multibyte string. Simple, right?
(defun reverse-string (string)
  (concat (reverse (string-to-list string))))
(reverse-string "abc") ;; => "cba"
Wrong! The combining characters will get flipped around to the wrong side of the character they're meant to modify.
(reverse-string "nai\u0308ve") ;; => "ev̈ian"
Pitfall #4: Reversing Unicode strings is non-trivial.
The Rosetta Code page is full of incorrect examples, and I'm personally guilty of this, too. The other day I submitted a patch to s.el to correct its s-reverse function for Unicode. If it's accepted, you should never need to worry about this.
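One way to reverse a string without detaching combining characters is to group each base character with the marks that follow it and reverse the groups instead of the individual characters. The sketch below is an illustration of that idea, not the actual s.el patch: my-combining-char-p and my-reverse-string are hypothetical names, and the code only looks at the Unicode general category (Mn, Mc, Me), so it does not handle every grapheme cluster (emoji joined with ZWJ, for example).

```elisp
(defun my-combining-char-p (char)
  "Return non-nil if CHAR is a combining mark (general category Mn, Mc, or Me)."
  (memq (get-char-code-property char 'general-category) '(Mn Mc Me)))

(defun my-reverse-string (string)
  "Reverse STRING, keeping combining marks after their base character."
  (let ((clusters '())   ; finished clusters, most recent first
        (current '()))   ; characters of the cluster in progress, reversed
    (dolist (char (string-to-list string))
      (if (and current (my-combining-char-p char))
          ;; Attach the mark to the cluster in progress.
          (push char current)
        ;; A new base character: finish the previous cluster, start a new one.
        (when current
          (push (concat (nreverse current)) clusters))
        (setq current (list char))))
    (when current
      (push (concat (nreverse current)) clusters))
    ;; Pushing put the clusters in reverse order already.
    (apply #'concat clusters)))

(my-reverse-string "abc")          ;; => "cba"
(my-reverse-string "nai\u0308ve")  ;; => "evi\u0308an"
```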
Regular expressions operate on code points. This means combining characters are counted separately and the match may change depending on how characters are composed. To avoid this, you might want to consider NFC normalization before performing some kinds of regular expressions.
;; Like string= from before:
(string-match-p "na\u00EFve" "nai\u0308ve") ;; => nil
;; The . only matches part of the composition
(string-match-p "na.ve" "nai\u0308ve") ;; => nil
Pitfall #5: Be mindful of combining characters when using regular expressions. Prefer NFC normalization when dealing with regular expressions.
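As a sketch of what NFC normalization before matching could look like, the hypothetical wrapper below normalizes both the pattern and the subject string. Two caveats apply: the returned index refers to the normalized string rather than the original, and normalizing a pattern is only safe when no combining character directly follows a regexp metacharacter.

```elisp
(require 'ucs-normalize)

(defun my-normalized-match-p (regexp string)
  "Like `string-match-p', but NFC-normalize both REGEXP and STRING first."
  (string-match-p (ucs-normalize-NFC-string regexp)
                  (ucs-normalize-NFC-string string)))

(my-normalized-match-p "na\u00EFve" "nai\u0308ve")  ;; => 0
(my-normalized-match-p "na.ve" "nai\u0308ve")       ;; => 0
```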
Another potential problem is ranges, though this is quite uncommon. Ranges of characters can be expressed inside brackets, e.g. [a-zA-Z]. If the range begins or ends with a decomposed combining character you won't get the proper range because its parts are considered separately by the regular expression engine.
(defvar match-weird "[\u00E0-\u00F6]+")
(string-match-p match-weird "áâãäå") ;; => 0 (successful match)
(string-match-p (ucs-normalize-NFD-string match-weird) "áâãäå") ;; => nil
It's especially important to keep all of this in mind when sanitizing untrusted input, such as when using Emacs as a web server. An attacker might use a denormalized or strange grapheme cluster to bypass a filter.
Interacting with the World
Here's a mistake I've made twice now. Emacs uses UTF-8 internally, regardless of whatever encoding the original text came in. Pitfall #6: When working with bytes of text, the counts may be different than the original source of the text.
For example, HTTP/1.1 introduced persistent connections. Before this,
a client connects to a server and asks for content. The server sends
the content and then closes the connection to signal the end of the
data. In HTTP/1.1, when
Connection: close isn't specified, the
server will instead send a
Content-Length header indicating the
length of the content in bytes. The connection can then be re-used for
more requests, or, more importantly, pipelining requests.
The main problem is that HTTP headers usually have a different encoding than the content body. Emacs is not prepared to handle multiple encodings from a single source, so the only correct way to talk HTTP with a network process is raw. My mistake was allowing Emacs to do the UTF-8 conversion, then measuring the length of the content in its UTF-8 encoding. This just happens to work fine about 99.9% of the time since clients tend to speak UTF-8, or something like it, anyway, but it's not correct.
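To illustrate why the byte count has to come from the encoding that actually goes over the wire, here is a minimal sketch. my-content-length is a hypothetical helper, and it assumes you already know which coding system the body will be sent in; the process itself would be switched to raw bytes with something like (set-process-coding-system process 'binary 'binary).

```elisp
(defun my-content-length (body coding-system)
  "Return the byte count of BODY once encoded with CODING-SYSTEM."
  ;; Encoding yields a unibyte string, so `length' counts bytes here.
  (length (encode-coding-string body coding-system)))

;; The same five-character body has different byte lengths per encoding.
(let ((body "na\u00EFve"))
  (list (length body)
        (my-content-length body 'utf-8)
        (my-content-length body 'latin-1)))
;; => (5 6 5)
```

Measuring the internal UTF-8 representation would report 6 here even if the body were sent as Latin-1, which is exactly the kind of mismatch described above.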
- UTF-8 and Unicode FAQ for Unix/Linux
- Hacking with Unicode
- java.lang.Character Unicode Character Representations
- GNU Emacs Lisp Reference Manual: Strings and Characters
Comparatively, Emacs Lisp has really great Unicode support. This isn't too surprising considering that its primary purpose is manipulating text.