The Billion Pi Challenge
The challenge: As quickly as possible, find all occurrences of a given sequence of digits in the first one billion digits of pi. You don’t have to compute pi yourself for this challenge. For example, “141592653” appears 4 times: at positions 1, 427,238,911, 570,434,346, and 678,096,434.
To my surprise, this turned out to be harder than I expected. A straightforward scan with Boyer-Moore-Horspool across the entire text file is already pretty fast. On modern, COTS hardware it takes about 6 seconds. Comparing bytes is cheap and it’s largely an I/O-bound problem. This means building fancy indexes tends to make it slower because it’s more I/O demanding.
The challenge was inspired by The Pi-Search Page, which offers a search on the first 200 million digits. There’s also a little write-up about how their pi search works. I wanted to try to invent my own solution. I did eventually come up with something that worked, which can be found here. It’s written in plain old C.
You might want to give the challenge a shot on your own before continuing!
The first thing I tried was SQLite. I thought an index (B-tree) over
fixed-length substrings would be efficient. A
LIKE condition with a
right-hand wildcard is sargable and would work well with the
index. Here’s the schema I tried.
CREATE TABLE digits (position INTEGER PRIMARY KEY, sequence TEXT NOT NULL)
There will be 1 row for each position, i.e. 1 billion rows. Using
INTEGER PRIMARY KEY means
position will be used directly for row
IDs, saving some database space.
After the data has been inserted by sliding a window along pi, I build an index. It’s better to build an index after data is in the database than before.
CREATE INDEX sequence_index ON digits (sequence, position)
This takes several hours to complete. When it’s done the database is a whopping 60GB! Remember I said that this is very much an I/O-bound problem? I wasn’t kidding. This doesn’t work well at all. Here’s the a search for the example sequence.
SELECT position, sequence FROM digits WHERE sequence LIKE '141592653%'
You get your answers after about 15 minutes of hammering on the disk.
Sometime later I realized that up to 18-digits sequences could be
encoded into an integer, so that
TEXT column could be a much simpler
INTEGER. Unfortunately this doesn’t really improve anything. I also
tried this in PostgreSQL but it was even worse. I gave up after 24
hours of waiting on it. These databases are not built for such long,
skinny tables, at least not without beefy hardware.
A couple weeks later I had another idea. A query is just a sequence of
digits, so it can be trivially converted into a unique number. As
before, pick a fixed length for sequences (
n) for the index and an
appropriate stride. The database would be one big file. To look up a
sequence, treat that sequence as an offset into the database and seek
into the database file to that offset times the stride. The total size
of the database is
10^n * stride.
In this quick and dirty illustration, n=4 and stride=4 (far too small for that n).
For example, if the fixed-length for sequences is 6 and the stride is
4,000 bytes, looking up “141592” is just a matter of seeking to byte
141,592 * 4,000 and reading in positions until some sort of
sentinel. The stride must be long enough to store all the positions
for any indexed sequence.
For this purpose, the digits of pi are practically random numbers. The
good news is that it means a fixed stride will work well. Any
particular sequence appears just as often as any other. The chance a
specific n-length sequence begins at a specific position is
10^n. There are 1 billion positions, so a particular sequence will
1e9 / 10^n positions associated with it, which is a good place
to start for picking a stride.
The bad news is that building the index means jumping around the database essentially at random for each write. This will break any sort of cache between the program and the hard drive. It’s incredibly slow, even mmap()ed. The workaround is to either do it entirely in RAM (needs at least 6GB of RAM for 1 billion digits!) or to build it up over many passes. I didn’t try it on an SSD but maybe the random access is more tolerable there.
Adding an Index
Doing all the work in memory makes it easier to improve the database
format anyway. It can be broken into an index section and a tables
section. Instead of a fixed stride for the data, front-load the
database with a similar index that points to the section (table) of
the database file that holds that sequence’s pi positions. Each of the
10^n positions gets a single integer in the index at the front of
the file. Looking up the positions for a sequence means parsing the
sequence as a number, seeking to that offset into the beginning of the
database, reading in another offset integer, and then seeking to that
new offset. Now the database is compact and there are no concerns
No sentinel mark is needed either. The tables are concatenated in order in the table part of the database. To determine where to stop, take a peek at the next sequence’s start offset in the index. Its table immediately follows, so this doubles as an end offset. For convenience, one final integer in the index will point just beyond the end of the database, so the last sequence (99999…) doesn’t require special handling.
Searching Shorter and Longer
If the database built for fixed length sequences, how is a sequence of a different length searched? The two cases, shorter and longer, are handled differently.
If the sequence is shorter, fill in the remaining digits, …000 to …999, and look up each sequence. For example, if n=6 and we’re searching for “1415”, get all the positions for “141500”, “141501”, “141502”, …, “141599” and concatenate them. Fortunately the database already has them stored this way! Look up the offsets for “141500” and “141600” and grab everything in between. The downside is that the pi positions are only partially sorted, so they may require sorting before presenting to the user.
If the sequence is longer, the original digits file will be needed. Get the table for the subsequence fixed-length prefix, then seek into the digits file checking each of the pi positions for a full match. This requires lots of extra seeking, but a long sequence will naturally have fewer positions to test. For example, if n=7 and we’re looking for “141592653”, look up the “1415926” table in the database and check each of its 106 positions.
With this database searches are only a few milliseconds (though very much subject to cache misses). Here’s my program in action, from the repository linked above.
$ time ./pipattern 141592653 1: 14159265358979323 427238911: 14159265303126685 570434346: 14159265337906537 678096434: 14159265360713718 real 0m0.004s user 0m0.000s sys 0m0.000s
I call that challenge completed!