GNU Emacs is seven years older than Unicode. Support for Unicode had
to be added relatively late in Emacs’ existence. This means Emacs has
existed longer without Unicode support (16 years) than with it (14
years). Despite this, Emacs has excellent Unicode support. It feels as
if it had been there the whole time.
However, as a natural result of Unicode covering all sorts of edge
cases for every known human language, there are pitfalls and
complications. As a user of Emacs, you’re not particularly affected
by these, but extension developers might run into trouble while
handling Emacs’ character-oriented data structures: strings and
buffers.
In this article I’ll go over Elisp’s Unicode surprises. I’ve been
caught by some of these myself. In fact, as a result of writing this
article, I’ve discovered subtle encoding bugs in some of my own
extensions. None of these pitfalls are Emacs’ fault. They’re just the
result of complexities of natural language.
Unicode and Code Points
First, there are excellent materials online for learning Unicode. I
recommend starting with UTF-8 and Unicode FAQ for Unix/Linux.
There’s no reason for me to repeat all this information here, but I’ll
attempt to quickly summarize it.
Unicode maps code points (integers) to specific characters, along
with a standard name. As of this writing, Unicode defines over 110,000
characters. For backwards compatibility, the first 128 code points are
mapped to ASCII. This trend continues with Latin-1 (ISO 8859-1),
whose 256 characters map directly onto the first 256 Unicode code
points.
In Emacs, Unicode characters are entered into a buffer with C-x 8 RET
(insert-char). You can enter either the official name of the
character (e.g. “GREEK SMALL LETTER PI” for π) or the hexadecimal code
point. Outside of Emacs it depends on the application, but
Ctrl+Shift+u followed by the hexadecimal code works for most of the
applications I use.
The Unicode standard also describes several methods for encoding
sequences of code points into sequences of bytes. Obviously a selection
of 110,000 characters cannot be encoded with one byte per letter, so
these are multibyte encodings. The two most popular encodings are
probably UTF-8 and UTF-16.
UTF-8 was designed to be backwards compatible with ASCII, Unix, and
existing C APIs (null-terminated C strings). The first 128 code points
are encoded directly as a single byte. Every other character is
encoded with two to six bytes, with the highest bit of each byte set
to 1. This ensures that no part of a multibyte character will be
interpreted as ASCII, nor will it contain a null (0). The latter means
that C programs and C APIs can handle UTF-8 strings with few or no
changes. Most importantly, every ASCII encoded file is automatically a
UTF-8 encoded file.
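The encoding is easy to inspect from Elisp with encode-coding-string, which returns the encoded bytes as a unibyte string:

```elisp
;; π is code point U+03C0; UTF-8 encodes it as two bytes, #xCF #x80:
(encode-coding-string "π" 'utf-8)
;; => "\317\200"
```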
UTF-16 encodes all the characters from the Basic Multilingual Plane
(BMP) with two bytes. Even the original ASCII characters get two bytes
(16 bits). The BMP covers virtually all modern languages and is
generally all you’ll ever practically need. However, this doesn’t
include the important TROPICAL DRINK (U+1F379) or PILE OF POO
(U+1F4A9) characters from the supplementary (“astral”) planes. If you need to use
these characters in UTF-16, you’re going to run into problems:
characters outside the BMP don’t fit in two bytes. To accommodate
these characters, UTF-16 uses surrogate pairs: these characters are
encoded with two 16-bit units.
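You can see a surrogate pair from Elisp by asking for a UTF-16 encoding directly. U+1F4A9 lies outside the BMP, so it becomes the pair D83D DCA9, four bytes in all:

```elisp
;; Big-endian UTF-16: the four bytes #xD8 #x3D #xDC #xA9.
(encode-coding-string "\U0001F4A9" 'utf-16be)
;; => "\330=\334\251"
```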
Because of this last point, UTF-16 offers no practical advantages
over UTF-8. Its existence was probably a big mistake. You
can’t do constant-time character lookup because you have to scan for
surrogate pairs. It’s not backwards compatible with ASCII and cannot
be stored in null-terminated C strings. Surrogate pairs also lead to
the awkward situation where the “length” of a string is not the number
of characters, code points, or even bytes. Worst of all, it has
serious security implications. New applications should avoid it
whenever possible.
Emacs and UTF-8
Emacs internally stores all text as UTF-8. This was an excellent
choice! When text leaves Emacs, such as writing to a file or to a
process, Emacs automatically converts it to the coding system
configured for that particular file or process. When it accepts text
from a file or process, it either converts it to UTF-8 or preserves it
as raw bytes.
There are two modes for this in Emacs: unibyte and multibyte. Unibyte
strings/buffers are just raw bytes. They have constant-time (O(1))
access but can only hold single-byte values. The byte-code compiler
outputs unibyte strings.
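You can check which representation a string uses with multibyte-string-p. Note that escape sequences above #x7F in a string literal produce a unibyte string of raw bytes:

```elisp
(multibyte-string-p "π")         ;; => t, a multibyte string
(multibyte-string-p "\317\200")  ;; => nil, two raw bytes
(multibyte-string-p "abc")       ;; => nil, pure ASCII literals are unibyte
```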
Multibyte strings/buffers hold UTF-8 encoded code points. Character
access is O(n) because the string/buffer has to be scanned to count
code points.
The actual encoding is rarely relevant because there’s little way (and
need) to access it directly. Emacs automatically converts text as
needed when it leaves Emacs and arrives in Emacs, so there’s no need
to know the internal encoding. If you really want to see it anyway,
you can use string-as-unibyte to get a copy of a string with the
exact same bytes, but as a byte-string.
(string-as-unibyte "π")
;; => "\317\200"
This can be reversed with string-as-multibyte, to change a unibyte
string holding UTF-8 encoded text back into a multibyte string. Note
that these functions are different from string-to-unibyte and
string-to-multibyte, which attempt a conversion rather than
preserving the raw bytes.
The length and buffer-size functions always count characters in
multibyte and bytes in unibyte. Being UTF-8, there are no surrogate
pairs to worry about here. The string-bytes and position-bytes
functions return byte information for both multibyte and unibyte.
To specify a Unicode character in a string literal without using the
character directly, use \uXXXX, where XXXX is the hexadecimal code
point for the character and is always 4 digits long. For characters
outside the BMP, which won’t fit in four digits, use a capital U with
eight digits: \UXXXXXXXX.
"\u03C0"
;; => "π"
"\U0001F4A9"
;; => "💩" (PILE OF POO)
Finally, Emacs extends Unicode with 256 additional “characters”
representing raw bytes. This allows raw bytes to be embedded
distinctly within UTF-8 sequences. For example, it’s used to
distinguish the code point U+0041 from the raw byte #x41. As far as I
can tell, this isn’t used very often.
Combining Characters
Some Unicode characters are defined as combining characters. These
characters modify the non-combining character that appears before
them, typically adding accents or diacritical marks.
For example, the word “naïve” can be written as six characters as
"nai\u0308ve". The fourth character, U+0308 (COMBINING DIAERESIS),
is a combining character that adds a diaeresis to the “i” (U+0069
LATIN SMALL LETTER I).
The most commonly accented characters have a code of their own. These
are called precomposed characters. This includes ï (U+00EF LATIN
SMALL LETTER I WITH DIAERESIS). This means “naïve” can also be written
as five characters as "na\u00EFve".
So what happens when comparing two different representations of the
same text? They’re not equal.
(string= "nai\u0308ve" "na\u00EFve")
;; => nil
To deal with situations like this, the Unicode standard defines four
different kinds of normalization. The two most important ones are NFC
(composition) and NFD (decomposition). The former uses precomposed
characters whenever possible and the latter breaks them apart. The
ucs-normalize-NFC-string and ucs-normalize-NFD-string functions
perform this operation.
Pitfall #1: Proper string comparison requires normalization. It
doesn’t matter which normalization you use (though NFD should be
slightly faster), you just need to use it consistently. Unfortunately
this can get tricky when using
equal to compare complex data
structures with multiple strings.
(string= (ucs-normalize-NFD-string "nai\u0308ve")
         (ucs-normalize-NFD-string "na\u00EFve"))
;; => t
Emacs itself fails to do this. It doesn’t normalize strings before
interning them, which is probably a mistake. This means you can have
differently defined variables and functions with the same canonical
name.
(eq (intern "nai\u0308ve")
    (intern "na\u00EFve"))
;; => nil
;; Defined with a decomposed name (e + U+0301 COMBINING ACUTE ACCENT):
(defun print-résumé ()
  (print "I'm going to sabotage your team."))

;; Defined with a precomposed name (U+00E9); it looks identical but
;; interns to a different symbol:
(defun print-résumé ()
  (print "I'd be a great asset to your team."))

;; Calling with the decomposed name:
(print-résumé)
;; => "I'm going to sabotage your team."
Measuring Text
There are three ways to quantify multibyte text. These are often the
same value, but in some circumstances they can each be different.
- length: number of characters, including combining characters
- bytes: number of bytes in its UTF-8 encoding
- width: number of columns it would occupy in the current buffer
Most of the time, one character is one column (a width of one). Some
characters, like combining characters, consume no columns. Many Asian
characters consume two columns (U+4000, 䀀). Tabs consume tab-width
columns, usually 8.
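These three measures are exposed as length, string-bytes, and string-width. A decomposed “naïve” shows all three disagreeing at once:

```elisp
(let ((s "nai\u0308ve"))
  (list (length s)         ; 6 characters, counting the combining mark
        (string-bytes s)   ; 7 bytes in the internal UTF-8 encoding
        (string-width s))) ; 5 columns, the mark is zero-width
;; => (6 7 5)
```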
Generally, a string should have the same width regardless of its
normalization, whether it’s NFD or NFC. However, due to bugs and
incomplete Unicode support, this isn’t strictly true. For example,
some combining characters, such as U+20DD (COMBINING ENCLOSING
CIRCLE), won’t combine correctly in Emacs, nor in many other
applications.
Pitfall #2: Always measure text by width, not length, when laying
out a buffer. Width is measured with the string-width function.
This comes up when laying out tables in a buffer. The number of
characters that fit in a column depends on what those characters are.
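One way to handle this is truncate-string-to-width (autoloaded from mule-util), which truncates and pads by columns rather than by characters:

```elisp
;; Each of these ideographs is two columns wide, so only two fit in
;; four columns:
(truncate-string-to-width "䀀䀀䀀" 4)
;; => "䀀䀀"

;; With a padding character it also pads short strings out to the
;; requested width:
(truncate-string-to-width "ab" 6 nil ?\s)
;; => "ab    "
```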
Fortunately I accidentally got this right in Elfeed because
I used the
format function for layout. The
%s directive operates
on width, as would be expected. However, this has the side effect that
the output of format may change depending on the current buffer!
Pitfall #3: Be mindful of the current buffer when using the format
function.
(let ((tab-width 4))
(length (format "%.6s" "\t")))
;; => 1
(let ((tab-width 8))
(length (format "%.6s" "\t")))
;; => 0
Reversing Strings
Say you want to reverse a multibyte string. Simple, right?
(defun reverse-string (string)
(concat (reverse (string-to-list string))))
(reverse-string "abc")
;; => "cba"
Wrong! The combining characters will get flipped around to the wrong
side of the character they’re meant to modify.
(reverse-string "nai\u0308ve")
;; => "ev̈ian"
Pitfall #4: Reversing Unicode strings is non-trivial.
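For reference, here’s a sketch of a combining-aware reversal. The helper name and the reliance on Unicode general categories (Mn/Mc/Me) are my own choices here; full grapheme-cluster segmentation is more involved than this:

```elisp
(defun my/reverse-string (string)
  "Reverse STRING, keeping combining marks attached to their base.
A sketch only: it groups marks (general categories Mn, Mc, Me) with
the preceding character, then reverses the groups."
  (let ((groups ()))
    (dolist (char (string-to-list string))
      (if (and groups
               (memq (get-char-code-property char 'general-category)
                     '(Mn Mc Me)))
          ;; Attach the mark to the most recent group.
          (setcar groups (append (car groups) (list char)))
        (push (list char) groups)))
    ;; `groups' is already in reverse order thanks to `push'.
    (mapconcat (lambda (g) (apply #'string g)) groups "")))

(my/reverse-string "nai\u0308ve")
;; => "evïan" (the diaeresis stays on the i)
```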
The Rosetta Code page is full of incorrect examples, and
I’m personally guilty of this, too. The other day I
submitted a patch to s.el to correct its s-reverse function
for Unicode. If it’s accepted, you should never need to worry about
this again.
Regular Expressions
Regular expressions operate on code points. This means combining
characters are counted separately and the match may change depending
on how characters are composed. To avoid this, you might want to
consider NFC normalization before performing some kinds of regular
expression matching.
;; Like string= from before:
(string-match-p "na\u00EFve" "nai\u0308ve")
;; => nil
;; The . only matches part of the composition
(string-match-p "na.ve" "nai\u0308ve")
;; => nil
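Normalizing both strings to NFC first (via the ucs-normalize library that ships with Emacs) makes these matches behave as expected:

```elisp
(require 'ucs-normalize)

(string-match-p (ucs-normalize-NFC-string "na.ve")
                (ucs-normalize-NFC-string "nai\u0308ve"))
;; => 0 (the . now matches the single precomposed ï)
```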
Pitfall #5: Be mindful of combining characters when using regular
expressions. Prefer NFC normalization when working with regular
expressions.
Another potential problem is ranges, though this is quite uncommon.
Ranges of characters can be expressed inside brackets, e.g.
[a-zA-Z]. If the range begins or ends with a decomposed combining
character you won’t get the proper range because its parts are
considered separately by the regular expression engine.
(defvar match-weird "[\u00E0-\u00F6]+")
(string-match-p match-weird "áâãäå")
;; => 0 (successful match)
(string-match-p (ucs-normalize-NFD-string match-weird) "áâãäå")
;; => nil
It’s especially important to keep all of this in mind when
sanitizing untrusted input, such as when using Emacs as a web server.
An attacker might use a denormalized or strange grapheme cluster to
bypass a filter.
Interacting with the World
Here’s a mistake I’ve made twice now. Emacs uses UTF-8 internally,
regardless of whatever encoding the original text came in. Pitfall #6:
When working with the bytes of text, the counts may differ from those
of the original source of the text.
For example, HTTP/1.1 introduced persistent connections. Before this,
a client connects to a server and asks for content. The server sends
the content and then closes the connection to signal the end of the
data. In HTTP/1.1, when
Connection: close isn’t specified, the
server will instead send a
Content-Length header indicating the
length of the content in bytes. The connection can then be re-used for
more requests, or, more importantly, pipelining requests.
The main problem is that HTTP headers usually have a different
encoding than the content body. Emacs is not prepared to handle
multiple encodings from a single source, so the only correct way to
talk HTTP with a network process is raw. My mistake was allowing Emacs
to do the UTF-8 conversion, then measuring the length of the content
in its UTF-8 encoding. This just happens to work fine about 99.9% of
the time since clients tend to speak UTF-8, or something like it,
anyway, but it’s not correct.
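The upshot: compute Content-Length from the encoded bytes, never from Emacs’ internal character counts. A minimal sketch, using a made-up response body:

```elisp
;; "naïve" is 5 characters, but its UTF-8 encoding is 6 bytes:
(let* ((body "na\u00EFve")
       (bytes (encode-coding-string body 'utf-8)))
  (list (length body)    ; 5 characters
        (length bytes))) ; 6 bytes, the correct Content-Length
;; => (5 6)
```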
Many other languages have their own Unicode shortcomings.
Comparatively, Emacs Lisp has really great Unicode support. This
isn’t too surprising considering that its primary purpose is
manipulating text.