<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

  <title>Articles tagged ai at null program</title>
  <link rel="alternate" type="text/html"
        href="https://nullprogram.com/tags/ai/"/>
  <link rel="self" type="application/atom+xml"
        href="https://nullprogram.com/tags/ai/feed/"/>
  <updated>2026-04-07T03:24:16Z</updated>
  <id>urn:uuid:bbcac281-4091-4371-aaed-1061fd43ac26</id>

  <author>
    <name>Christopher Wellons</name>
    <uri>https://nullprogram.com</uri>
    <email>wellons@nullprogram.com</email>
  </author>

  <entry>
    <title>2026 has been the most pivotal year in my career… and it's only March</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2026/03/29/"/>
    <id>urn:uuid:91d679b3-4f07-4b61-b359-5890695ad621</id>
    <updated>2026-03-29T21:38:22Z</updated>
    <category term="ai"/><category term="c"/><category term="cpp"/>
    <content type="html">
      <![CDATA[<p>In February I left my employer after nearly two decades of service. In the
moment I was optimistic, yet unsure I’d made the right choice. Now that
the dust has settled, I’m absolutely sure I chose correctly. I’m happier
and better for it. There were multiple factors, but it’s not mere chance
that it coincides with
these early months of <a href="https://shumer.dev/something-big-is-happening">the automation of software engineering</a>. I
left an employer that is <em>years behind</em> adopting AI to one actively
supporting and encouraging it. As of March, in my professional capacity
<strong>I no longer write code myself</strong>. My current situation was unimaginable
to me only a year ago. Like it or not, this is the future of software
engineering. Turns out I like it, and having tasted the future I don’t
want to go back to the old ways.</p>

<p>In case you’re worried, this is still me. These are my own words. <a href="https://paulgraham.com/writes.html">Writing
is thinking</a>, and it would defeat the purpose for an AI to write
in my place on my personal blog. That’s not going to change.</p>

<p>I still spend much time reading and understanding code, and I still use
most of the same development tools. It’s more like being a manager, orchestrating
a nebulous team of inhumanly-fast, nameless assistants. Instead of dicing
the vegetables, I conjure a helper to do it while I continue to run the
kitchen. I haven’t managed people in some 20 years now, but I can feel
those old muscles being put to use again as I improve at this new role.
Will these kitchens still need human chefs like me by the end of the
decade? Unclear, and it’s something we all need to prepare for.</p>

<p>My situation gave me an experience onboarding with AI assistance — a fast
process given a near-instant, infinitely-patient helper answering any
question about the code. By the second week I was making substantial,
wide-ranging contributions to the large C++ code base. It’s difficult to attach a
quantifiable factor like 2x, 5x, 10x, etc. faster, but I can say for
certain this wouldn’t have been possible without AI. The bottlenecks have
shifted from producing code, which now takes relatively no time at all, to
other points, and we’re all still trying to figure it out.</p>

<p>My personal programming has transformed as well. Everything <a href="/blog/2024/11/10/">I said about
AI in late 2024</a> is, as I predicted, utterly obsolete. There’s a
huge, growing gap between open weight models and the frontier. Models you
can run yourself are toys. In general, almost any AI product or service
worth your attention costs money. The free stuff is, at minimum, months
behind. Most people only use limited, free services, so there’s a broad
unawareness of just how far AI has advanced. AI is <em>now highly skilled at
programming</em>, and better than me at almost every programming task, with
inhumanly-low defect rates. The remaining issues are mainly steering
problems: If AI code doesn’t do what I need, likely the AI writing it
didn’t understand what I needed.</p>

<p>I’ll still write code myself from time to time for fun — <a href="/blog/2018/06/10/">minimalist</a>,
with my <a href="/blog/2023/10/08/">style</a> and <a href="/blog/2025/01/19/">techniques</a> — the same way I play <a href="https://en.wikipedia.org/wiki/Shogi">shogi</a> on
the weekends for fun. However, artisan production is uneconomical in the
presence of industrialization. AI makes programming so cheap that only the
rich will write code by hand.</p>

<p>A small part of me is sad at what is lost. A bigger part is excited about
the possibilities of the future. I’ve always had more ideas than time or
energy to pursue them. With AI at my command, the problem changes shape. I
can comfortably take on complexity from which I previously shied away, and
I can take a shot at any idea sufficiently formed in my mind to prompt an
AI — a whole skill of its own that I’m actively developing.</p>

<p>For instance, a couple weeks ago I <a href="https://github.com/skeeto/w64devkit/pull/357">put AI to work on a problem</a>,
and it produced a working solution for me after ~12 hours of continuous,
autonomous work, literally while I slept. The past month <a href="https://github.com/skeeto/w64devkit">w64devkit</a> has
burst with activity, almost entirely AI-driven. Some of it is architectural
changes I’ve wanted for years, but they would have required hours of tedious
work, so I never got around to them. AI knocked them out in minutes, with the
new architecture opening new opportunities. It’s also taken on most of the
cognitive load of maintenance.</p>

<h3 id="quiltcpp">Quilt.cpp</h3>

<p>So far my biggest successful undertaking is <strong><a href="https://github.com/skeeto/quilt.cpp">Quilt.cpp</a></strong>, a C++
clone of <a href="https://savannah.nongnu.org/projects/quilt">Quilt</a>, an early, actively-used source control system for
patch management. Git is a glaring omission from the <a href="/blog/2020/09/25/">almost</a> complete
w64devkit, due to platform and build issues. I’ve long thought Quilt could fill
<em>some</em> of that source control hole, except the original is written in
Bash, Perl, and GNU Coreutils — even more of a challenge than Git. Since
Quilt is conceptually simple, and I could lean on <a href="https://frippery.org/busybox/">busybox-w32</a> <code class="language-plaintext highlighter-rouge">diff</code>
and <code class="language-plaintext highlighter-rouge">patch</code>, I’ve considered writing my own implementation, just <a href="/blog/2023/01/18/">as I did
pkg-config</a>, but I never found the energy to do it.</p>

<p>Then I got good enough with AI to knock out a near feature-complete clone
in about four days, including a built-in <code class="language-plaintext highlighter-rouge">diff</code> and <code class="language-plaintext highlighter-rouge">patch</code> so it doesn’t
actually depend on external tools (except invoking <code class="language-plaintext highlighter-rouge">$EDITOR</code>). On Windows
it’s a ~1.6MB standalone EXE, to be included in future w64devkit releases.
The source is distributed as an amalgamation, a single file <code class="language-plaintext highlighter-rouge">quilt.cpp</code>
per its namesake:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ c++ -std=c++20 -O2 -s -o quilt.exe quilt.cpp
$ ./quilt.exe --help
Usage: quilt [--quiltrc file] &lt;command&gt; [options] [args]

Commands:
  new        Create a new empty patch
  add        Add files to the topmost patch
  push       Apply patches to the source tree
  pop        Remove applied patches from the stack
  refresh    Regenerate a patch from working tree changes
  diff       Show the diff of the topmost or a specified patch
  series     List all patches in the series
  applied    List applied patches
  unapplied  List patches not yet applied
  top        Show the topmost applied patch
  next       Show the next patch after the top or a given patch
  previous   Show the patch before the top or a given patch
  delete     Remove a patch from the series
  rename     Rename a patch
  import     Import an external patch into the series
  header     Print or modify a patch header
  files      List files modified by a patch
  patches    List patches that modify a given file
  edit       Add files to the topmost patch and open an editor
  revert     Discard working tree changes to files in a patch
  remove     Remove files from the topmost patch
  fold       Fold a diff from stdin into the topmost patch
  fork       Create a copy of the topmost patch under a new name
  annotate   Show which patch modified each line of a file
  graph      Print a dot dependency graph of applied patches
  mail       Generate an mbox file from a range of patches
  grep       Search source files (not implemented)
  setup      Set up a source tree from a series file (not implemented)
  shell      Open a subshell (not implemented)
  snapshot   Save a snapshot of the working tree for later diff
  upgrade    Upgrade quilt metadata to the current format
  init       Initialize quilt metadata in the current directory

Use "quilt &lt;command&gt; --help" for details on a specific command.
</code></pre></div></div>

<p>It supports Windows and POSIX, and runs ~5x faster than the original. AI
developed it on Windows, Linux, and macOS: It’s best when the AI can close
the debug loop and tackle problems autonomously without involving a human
slowpoke. The handful of “not implemented” parts aren’t because they’re
too hard — each would probably take an AI ~10 minutes — but deliberate
decisions of taste.</p>

<p>There’s an irony that the reason I could produce Quilt.cpp with such ease
is also a reason I don’t really need it anymore.</p>

<p>I changed the output of <code class="language-plaintext highlighter-rouge">quilt mail</code> to be more Git-compatible. The mbox
produced by Quilt.cpp can be imported into Git with a plain <code class="language-plaintext highlighter-rouge">git am</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ quilt mail --mbox feature-branch.mbox
$ git am feature-branch.mbox
</code></pre></div></div>

<p>The idea being that I could work on a machine without Git (e.g. Windows
XP), and copy/mail the mbox to another machine where Git can absorb it as
though it were in Git the whole time. <code class="language-plaintext highlighter-rouge">git format-patch</code> to <code class="language-plaintext highlighter-rouge">quilt import</code>
sends commits in the opposite direction, useful for manually testing
Quilt.cpp on real change sets.</p>

<p>To be clear, I could not have done this if the original Quilt did not
exist as a working program. I began with an AI generating a <a href="https://eli.thegreenplace.net/2026/rewriting-pycparser-with-the-help-of-an-llm/">conformance
suite</a> based on the original, its documentation, and other online
documentation, validating that suite against the original implementation
(see <code class="language-plaintext highlighter-rouge">-DQUILT_TEST_EXECUTABLE</code>). Then I had another AI code to the tests,
under architectural guidance from me, with <code class="language-plaintext highlighter-rouge">-D_GLIBCXX_DEBUG</code> and
sanitizers as guardrails. That was day one. The next three days were lots
of refining and iterating as I discovered the gaps in the test suite. I’d
prompt AI to
compare Quilt.cpp to the original Quilt man page, add tests for missing
features, validate the new tests against the original Quilt, then run
several agents to fix the tests. While they worked I’d try the latest
build and note any bugs. As of this writing, the result is about equal
parts test and non-test, ~9KLoC each.</p>

<p>I’m likely to use this technique to clone other tools with implementations
unsuitable for my purposes. I learned quite a bit from this first attempt.</p>

<p>Why C++ instead of my usual choice of C? As we know, <a href="/blog/2023/02/11/">conventional C is
highly error-prone</a>. Even AI has trouble with it. In the ~9k lines
of C++ that is Quilt.cpp, I am only aware of three memory safety errors by
the AI. Two were null-terminated string issues with <code class="language-plaintext highlighter-rouge">strtol</code>, where the AI
was essentially writing C instead of C++, after which I directed the AI to
use <code class="language-plaintext highlighter-rouge">std::from_chars</code> and drop as much direct libc use as possible. (The
other was an unlikely branch with <code class="language-plaintext highlighter-rouge">std::vector::back</code> on an empty vector.)
We can rescue C with better techniques like arena allocation, counted
strings, and slices, but while (current) state of the art AI understands
these things, it cannot work effectively with them in C. I’ve tried. So I
picked C++, and from my professional work I know AI is better at C++ than
me.</p>

<p>Also like a manager, I have not read most of the code, and instead focused
on results, so you might say this was “vibe-coded.” It <em>is</em> thoroughly
tested, though I’m sure there are still bugs to be ironed out, especially
on the more esoteric features I haven’t tried by hand yet.</p>

<h3 id="lets-discuss-tools">Let’s discuss tools</h3>

<p>I opposed CMake for years, so you may have noticed that the latest
w64devkit now includes CMake and Ninja. What happened? Preparing for my anticipated
employment change, this past December I read <a href="https://crascit.com/professional-cmake/"><em>Professional CMake</em></a>.
I realized that my practical problems with CMake stemmed from nearly
everyone using it incorrectly. Most CMake builds are a disaster, but my new-found
knowledge allows me to navigate the common mistakes. Only high profile
open source projects manage to put together proper CMake builds. Otherwise
the internet is loaded with CMake misinformation. Similar to AI, if you’re
not paying for CMake knowledge then it’s likely wrong or misleading. So I
highly recommend that book!</p>

<p>Frontier AI is <em>very good</em> with CMake. When a project has a CMake build
that isn’t <em>too</em> badly broken, just tell AI to fix it, <em>without any
specifics</em>, and build problems disappear in mere minutes without having to
think about it. It’s awesome. Combine it with the previous discussion
about tests making AI so much more effective, and that it <em>also</em> knows
CTest well, and you’ve got a killer formula. I’m more effective with CTest
myself merely from observing how AI uses it. AI (currently) cannot use
debuggers, so putting powerful, familiar testing tools in its hands helps
a lot, versus the usual bespoke, debugger-friendly solutions I prefer.</p>
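<p>A minimal sketch of that CTest wiring (the target and test names here are
hypothetical, not from any real project):</p>

```cmake
# After this, a single `ctest --output-on-failure` in the build directory
# runs the whole suite: one command an AI agent can invoke and parse.
cmake_minimum_required(VERSION 3.20)
project(example CXX)

add_executable(example main.cpp)

enable_testing()
# Hypothetical self-test flag on the binary; wire in whatever runner you have.
add_test(NAME conformance COMMAND example --run-tests)
```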

<p>Similar to solving CMake problems: Have a hairy merge conflict? Just ask
AI to resolve it. It’s like magic. I no longer fear merge conflicts.</p>

<p>So part of my motivation for adding CMake to w64devkit was anticipation of
projects like Quilt.cpp, where they’d be available to AI, or at least so I
could use the tools the AI used to build/test myself. It’s already paid
for itself, and there’s more to come.</p>

<p>For agent software, on personal projects I’m using Claude Code. It’s a
great value, cheaper than paying API rates, though it requires working around
5-hour limit windows. I started with Pro (US$20/mo), but I’m getting so
much out of it that as of this writing I’m on 5x Max (US$100/mo) simply to
have enough to explore all my ideas. Be warned: <strong>Anthropic software is
quite buggy, more so than industry average</strong>, and it’s obvious that they
never even <em>start</em>, let alone test, some of their released software on
disfavored platforms (Windows, Android). Don’t expect to use Claude Code
effectively for native Windows platform development, which sadly includes
w64devkit. Hopefully that’s fixed someday. I suspect Anthropic hit a
bottleneck on QA, and, unable to fit AI into that role, they don’t bother. You
can theoretically report bugs on GitHub, but they’re just ignored and
closed. (Why don’t they have AI agents jumping on this wealth of bug
reports?)</p>

<p>At work I’m using Cursor where I get a choice of models. My favorite for
March has been GPT-5.4, which in my experience beats Opus 4.6 on Claude
Code by a small margin. It’s immediately obvious that Cursor is better
agent software than Claude Code. It’s more robust, more featureful, and
with a clearer UI than Claude Code. It has no trouble on Windows and can
drive w64devkit flawlessly. It’s also more expensive than Claude Code. My
employer currently spends ~US$250/mo on my AI tokens, dirt cheap
considering what they’re getting out of it. I have bottlenecks elsewhere
that keep me from spending even more.</p>

<p>Neither Cursor nor Claude Code is open source, so what are the purists to
do, even if they’re willing to pay API rates for tokens? Sadly I have no
answers for you. I haven’t gotten any open source agent software actually
working, and it seems they may lack the necessary secret sauce.</p>

<p>Update: Several folks suggested I give <a href="https://opencode.ai/">OpenCode</a> another shot, and this
time I got over the configuration hurdle. Single executable, slick
interface, and unlike Claude Code, I observed no bugs in my brief trial.
Give that a shot if you’re looking for an open source client.</p>

<p>The future is going to be weird. My experience is only a peek at what’s to
come, and my head is still spinning. However, the more I adapt to the
changes, the better I feel. If you’re feeling anxious like I was, don’t
flinch from improving your own AI knowledge and experience.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>Everything I've learned so far about running local LLMs</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2024/11/10/"/>
    <id>urn:uuid:975c2748-2c8f-4bb8-a108-b2be68a10fc5</id>
    <updated>2024-11-10T05:05:20Z</updated>
    <category term="ai"/><category term="rant"/><category term="tutorial"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=42100560">on Hacker News</a>.</em></p>

<p>Over the past month I’ve been exploring the rapidly evolving world of
Large Language Models (LLM). It’s now accessible enough to run, on a
Raspberry Pi, a LLM smarter than the original ChatGPT (November 2022). A modest
desktop or laptop supports even smarter AI. It’s also private, offline,
unlimited, and registration-free. The technology is improving at breakneck
speed, and information is outdated in a matter of months. This article
snapshots my practical, hands-on knowledge and experiences — information I
wish I had when starting. Keep in mind that I’m a LLM layman, I have no
novel insights to share, and it’s likely I’ve misunderstood certain
aspects. In a year this article will mostly be a historical footnote,
which is simultaneously exciting and scary.</p>

<!--more-->

<p>In case you’ve been living under a rock — as an under-the-rock inhabitant
myself, welcome! — LLMs are neural networks that underwent a breakthrough
in 2022 when trained for conversational “chat.” Through it, users converse
with an artificial intelligence indistinguishable from a human, which
smashes the Turing test and can be wickedly creative.
Interacting with one for the first time is unsettling, a feeling which
will last for days. When you bought your most recent home computer, you
probably did not expect to have a meaningful conversation with it.</p>

<p>I’ve found this experience reminiscent of the desktop computing revolution
of the 1990s, where your newly purchased computer seemed obsolete by the
time you got it home from the store. There are new developments each week,
and as a rule I ignore almost any information more than a year old. The
best way to keep up has been <a href="https://old.reddit.com/r/LocalLLaMA">r/LocalLLaMa</a>. Everything is hyped to the
stratosphere, so take claims with a grain of salt.</p>

<p>I’m wary of vendor lock-in, having experienced the rug pulled out from
under me by services shutting down, changing, or otherwise dropping my use
case. I want the option to continue, even if it means changing providers.
So for a couple of years I’d ignored LLMs. The “closed” models, accessible
only as a service, have the classic lock-in problem, including <a href="https://arxiv.org/pdf/2307.09009">silent
degradation</a>. That changed when I learned I can run models close
to the state-of-the-art on my own hardware — the exact opposite of vendor
lock-in.</p>

<p>This article is about running LLMs, not fine-tuning, and definitely not
training. It’s also only about <em>text</em>, and not vision, voice, or other
“multimodal” capabilities, which aren’t nearly so useful to me personally.</p>

<p>To run a LLM on your own hardware you need <strong>software</strong> and a <strong>model</strong>.</p>

<h3 id="the-software">The software</h3>

<p>I’ve exclusively used the <em>astounding</em> <a href="https://github.com/ggerganov/llama.cpp">llama.cpp</a>. Other options exist,
but for basic CPU inference — that is, generating tokens using a CPU
rather than a GPU — llama.cpp requires nothing beyond a C++ toolchain. In
particular, no Python fiddling that plagues much of the ecosystem. On
Windows it will be a 5MB <code class="language-plaintext highlighter-rouge">llama-server.exe</code> with no runtime dependencies.
From just two files, EXE and GGUF (model), both designed to <a href="https://justine.lol/mmap/">load via
memory map</a>, you could likely still run the same LLM 25 years from
now, in exactly the same way, out-of-the-box on some future Windows OS.</p>

<p>Full disclosure: I’m biased because <a href="https://github.com/ggerganov/llama.cpp/blob/ec450d3b/docs/build.md">the official Windows build process is
w64devkit</a>. What can I say? These folks have good taste! That being
said, you should only do CPU inference if GPU inference is impractical. It
works reasonably up to ~10B parameter models on a desktop or laptop, but
it’s slower. My primary use case is not built with w64devkit because I’m
using CUDA for inference, which requires a MSVC toolchain. Just for fun, I
ported llama.cpp to Windows XP and ran <a href="https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct">a 360M model</a> on a 2008-era
laptop. It was magical to load that old laptop with technology that, at
the time it was new, would have been worth billions of dollars.</p>

<p>The bottleneck for GPU inference is video RAM, or VRAM. These models are,
well, <em>large</em>. The more RAM you have, the larger the model and the longer
the context window. Larger models are smarter, and longer contexts let you
process more information at once. <strong>GPU inference is not worth it below
8GB of VRAM</strong>. If <a href="https://huggingface.co/spaces/k-mktr/gpu-poor-llm-arena">“GPU poor”</a>, stick with CPU inference. On the
plus side, it’s simpler and easier to get started with CPU inference.</p>

<p>There are many utilities in llama.cpp, but this article is concerned with
just one: <strong><code class="language-plaintext highlighter-rouge">llama-server</code> is the program you want to run.</strong> It’s an HTTP
server (default port 8080) with a chat UI at its root, and <a href="https://github.com/ggerganov/llama.cpp/blob/ec450d3b/examples/server/README.md#api-endpoints">APIs for use
by programs</a>, including other user interfaces. A typical invocation:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ llama-server --flash-attn --ctx-size 0 --model MODEL.gguf
</code></pre></div></div>

<p>The context size is the largest number of tokens the LLM can handle at
once, input plus output. Contexts typically range from 8K to 128K tokens,
and depending on the model’s tokenizer, normal English text is ~1.6 tokens
per word as counted by <code class="language-plaintext highlighter-rouge">wc -w</code>. If the model supports a large context you
may run out of memory. If so, set a smaller context size, like <code class="language-plaintext highlighter-rouge">--ctx-size
$((1&lt;&lt;13))</code> (i.e. 8K tokens).</p>
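<p>That rule of thumb makes rough context budgeting scriptable (the 1.6
factor is approximate and varies by tokenizer):</p>

```shell
# Estimate tokens from a word count using the ~1.6 tokens/word heuristic.
# The sample sentence stands in for a real document fed to the model.
words=$(echo "the quick brown fox jumps over the lazy dog" | wc -w)
echo "words: $words, approx tokens: $((words * 16 / 10))"
```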

<p>I do not yet understand what flash attention is about, and I don’t know
why <code class="language-plaintext highlighter-rouge">--flash-attn</code>/<code class="language-plaintext highlighter-rouge">-fa</code> is not the default (lower accuracy?), but you
should always request it because it reduces memory requirements when
active and is well worth the cost.</p>

<p>If the server started successfully, visit it (<a href="http://localhost:8080/">http://localhost:8080/</a>) to
try it out. Though of course you’ll need a model first.</p>

<h3 id="the-models">The models</h3>

<p><a href="https://huggingface.co/">Hugging Face</a> (HF) is “the GitHub of LLMs.” It’s an incredible
service that has earned that title. “Small” models are around a few GBs,
large models are hundreds of GBs, and HF <em>hosts it all for free</em>. With a
few exceptions that do not matter in practice, you don’t even need to sign
up to download models! (I’ve been so impressed that after a few days they
got a penny-pincher like me to pay for a pro account.) That means you can
immediately download and try any of the stuff I’m about to discuss.</p>

<p>If you look now, you’ll wonder, “There’s a lot of stuff here, so what the
heck am I supposed to download?” That was me one month ago. For llama.cpp,
the answer is <a href="https://github.com/ggerganov/ggml/blob/8a3d7994/docs/gguf.md">GGUF</a>. None of the models are natively in GGUF.
Instead GGUFs are in a repository with “GGUF” in the name, usually by a
third party: one of the heroic, prolific GGUF quantizers.</p>

<p>(Note how nowhere does the official documentation define what “GGUF”
stands for. Get used to that. This is a technological frontier, and if the
information exists at all, it’s not in the obvious place. If you’re
considering asking your LLM about this once it’s running: Sweet summer
child, we’ll soon talk about why that doesn’t work. As far as I can tell,
“GGUF” has no authoritative definition (<strong>update</strong>: <a href="https://github.com/ggerganov/ggml/issues/220">the U stands for
“Unified”</a>, but the rest is still ambiguous).)</p>

<p>Since llama.cpp is named after Meta’s flagship model, their model is a
reasonable start, though it’s not my personal favorite. The latest is
Llama 3.2, but at the moment only the 1B and 3B models — that is, ~1
billion and ~3 billion parameters — work in llama.cpp. Those are a little
<em>too</em> small to be of much use, and your computer can likely do better if
it’s not a Raspberry Pi, even with CPU inference. Llama 3.1 8B is a better
option. (If you’ve got at least 24GB of VRAM then maybe you can even do
Llama 3.1 70B.)</p>

<p>If you search for Llama 3.1 8B you’ll find two options, one qualified
“instruct” and one with no qualifier. Instruct means it was trained to
follow instructions, i.e. to chat, and that’s nearly always what you want.
The other is the “base” model which can only continue a text. (Technically
the instruct model is still just completion, but we’ll get to that later.)
It would be great if base models were qualified “Base” but, for dumb path
dependency reasons, they’re usually not.</p>

<p>You will not find GGUF in the “Files” for the instruct model, nor can you
download the model without signing up in order to agree to the community
license. Go back to the search, add GGUF, and look for the matching GGUF
model: <a href="https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF">bartowski/Meta-Llama-3.1-8B-Instruct-GGUF</a>. bartowski is
one of the prolific and well-regarded GGUF quantizers. Not only will this
be in the right format for llama.cpp, you won’t need to sign up.</p>

<p>In “Files” you will now see many GGUFs. These are different quantizations
of the same model. The original model has <a href="https://en.wikipedia.org/wiki/Bfloat16_floating-point_format">bfloat16</a> tensors, but for
merely running the model we can throw away most of that precision with
minimal damage. It will be a tiny bit dumber and less knowledgeable, but
will require substantially fewer resources. <strong>The general recommendation,
which fits my experience, is to use <code class="language-plaintext highlighter-rouge">Q4_K_M</code></strong>, a 4-bit quantization. In
general, it’s better to run a 4-bit quant of a larger model than an 8-bit quant
of a smaller model. Once you’ve got the basics understood, experiment with
different quants and see what you like!</p>

<h3 id="my-favorite-models">My favorite models</h3>

<p>Models are trained for different trade-offs and differ in strengths and
weaknesses, so no model is best at everything — especially on “GPU-poor”
configurations. My desktop system has an RTX 3050 Ti with 8GB VRAM, and
its limitations have shaped my choices. I can comfortably run ~10B models,
and ~30B models just barely, enough to test their capabilities. For ~70B I
rely on third-party hosts. My “t/s” numbers are all on this system running
4-bit quants.</p>

<p>This list omits “instruct” from the model name, but assume the instruct
model unless I say otherwise. A few are <em>bona fide</em> open source, at least
as far as LLMs practically can be, and I’ve noted the license when that’s
the case. The rest place restrictions on both use and distribution.</p>

<ul>
  <li>
    <p>Mistral-Nemo-2407 (12B) [Apache 2.0]</p>

    <p>A collaboration between <a href="https://mistral.ai/">Mistral AI</a> and Nvidia (“Nemo”), the
most well-rounded ~10B model I’ve used, and my default. Inference starts
at a comfortable 30 t/s. Its strengths are writing and proofreading,
and it can review code nearly as well as ~70B models. It was trained for
a context length of 128K, but its <a href="https://github.com/NVIDIA/RULER">effective context length is closer to
16K</a> — a limitation I’ve personally observed.</p>

    <p>The “2407” is a date (July 2024) as version number, a versioning scheme
I wholeheartedly support. A date tells you about its knowledge cut-off
and tech level. It sorts well. Otherwise LLM versioning is a mess. Just
as open source is bad with naming, AI companies do not comprehend
versioning.</p>
  </li>
  <li>
    <p>Qwen2.5-14B [Apache 2.0]</p>

    <p>Qwen models, by Alibaba Cloud, impressively punch above their weight at
all sizes. 14B inference starts at 11 t/s, with capabilities on par with
Mistral Nemo. If I could run 72B on my own hardware, it would probably
be my default. I’ve been trying it through Hugging Face’s inference API.
There’s a 32B model, but it’s impractical for my hardware, so I haven’t
spent much time with it.</p>
  </li>
  <li>
    <p>Gemma-2-2B</p>

    <p>Google’s model is popular, perhaps due to its playful demeanor. For me,
the 2B model <a href="https://github.com/skeeto/scratch/blob/master/userscript/reddit-llm-translate.user.js">is great for fast translation</a>. It’s amazing that LLMs
have nearly obsoleted Google Translate, and you can run it on your home
computer. Though it’s more resource-intensive, and refuses to translate
texts it finds offensive, which sounds like a plot element from a sci-fi
story. In my translation script, I send it text marked up with HTML.
Simply <em>asking</em> Gemma to preserve the markup Just Works! The 9B model is
even better, but slower, and I’d use it instead of 2B for translating my
own messages into another language.</p>
  </li>
  <li>
    <p>Phi3.5-Mini (4B) [MIT]</p>

    <p>Microsoft’s niche is training on synthetic data. The result is a model
that does well in tests, but doesn’t work so well in practice. For me,
its strength is document evaluation. I’ve loaded the context with up to
40K-token documents — it helps that it’s a 4B model — and successfully
queried accurate summaries and data listings.</p>
  </li>
  <li>
    <p>SmolLM2-360M [Apache 2.0]</p>

    <p>Hugging Face doesn’t just host models; their 360M model is unusually
good for its size. It fits on my 2008-era, 1G RAM, Celeron, and 32-bit
operating system laptop. It also runs well on older Raspberry Pis. It’s
creative, fast, converses competently, can write poetry, and is a fun toy
in cramped spaces.</p>
  </li>
  <li>
    <p>Mixtral-8x7B (48B) [Apache 2.0]</p>

    <p>Another Mistral AI model, and more of a runner-up. 48B seems too large,
but this is a <a href="https://mistral.ai/news/mixtral-of-experts/">Mixture of Experts</a> (MoE) model. Inference uses only
13B parameters at a time. It’s reasonably-suited to CPU inference on a
machine with at least 32G of RAM. The model retains more of its training
inputs, more like a database, but for reasons we’ll see soon, it isn’t
as useful as it might seem.</p>
  </li>
  <li>
    <p>Llama-3.1-70B and Llama-3.1-Nemotron-70B</p>

    <p>More models I cannot run myself, but which I access remotely. The latter
bears “Nemo” because it’s an Nvidia fine-tune. If I could run 70B models
myself, Nemotron might just be my default. I’d need to spend more time
evaluating it against Qwen2.5-72B.</p>
  </li>
</ul>

<p>Most of these models have <a href="https://huggingface.co/blog/mlabonne/abliteration">abliterated</a> or “uncensored” versions, in
which refusal is partially fine-tuned out at a cost of model degradation.
Refusals are annoying — such as Gemma refusing to translate texts it
dislikes — but they don’t happen often enough for me to make that
trade-off. Maybe
I’m just boring. Also refusals seem to decrease with larger contexts, as
though “in for a penny, in for a pound.”</p>

<p>The next group are “coder” models trained for programming. In particular,
they have <em>fill-in-the-middle</em> (FIM) training for generating code inside
an existing program. I’ll discuss what that entails in a moment. As far as
I can tell, they’re no better at code review or other instruct-oriented
tasks. It’s the opposite: FIM training is done in the base model, with
instruct training applied later on top, so instruct works <em>against</em> FIM!
In other words, <strong>base model FIM outputs are markedly better</strong>, though you
lose the ability to converse with them.</p>

<p>There will be a section on evaluation later, but I want to note now that
<em>LLMs produce mediocre code</em>, even at the state-of-the-art. The rankings
here are relative to other models, not about overall capability.</p>

<ul>
  <li>
    <p>DeepSeek-Coder-V2-Lite (16B)</p>

    <p>A self-titled MoE model from <a href="https://www.deepseek.com/">DeepSeek</a>. It uses 2B parameters
during inference, making it as fast as Gemma 2 2B but as smart as
Mistral Nemo, striking a great balance, especially because it
out-competes ~30B models at code generation. If I’m playing around with
FIM, this is my default choice.</p>
  </li>
  <li>
    <p>Qwen2.5-Coder-7B [Apache 2.0]</p>

    <p>Qwen Coder is a close second. Output is nearly as good, but slightly
slower since it’s not MoE. It’s a better choice than DeepSeek if you’re
memory-constrained. While writing this article, Alibaba Cloud released a
new Qwen2.5-Coder-7B but failed to increment the version number, which
is horribly confusing. The community has taken to calling it Qwen2.5.1.
Remember what I said about AI companies and versions? (<strong>Update</strong>: One
day after publication, 14B and 32B coder models were released. I tried both,
and neither are quite as good as DeepSeek-Coder-V2-Lite, so my rankings
are unchanged.)</p>
  </li>
  <li>
    <p>Granite-8B-Code [Apache 2.0]</p>

    <p>IBM’s line of models is named Granite. In general Granite models are
disappointing, <em>except</em> that they’re unusually good at FIM. It’s tied
in second place with Qwen2.5 7B in my experience.</p>
  </li>
</ul>

<p>I also evaluated CodeLlama, CodeGemma, Codestral, and StarCoder. Their FIM
outputs were so poor as to be effectively worthless at that task, and I
found no reason to use these models. The negative effects of instruct
training were most pronounced for CodeLlama.</p>

<h3 id="the-user-interfaces">The user interfaces</h3>

<p>I pointed out llama.cpp’s built-in UI, and I’d used similar UIs with other
LLM software. As is typical, no UI is to my liking, especially in matters
of productivity, so I built my own, <strong><a href="https://github.com/skeeto/illume">Illume</a></strong>. This command
line program converts standard input into an API query, makes the query,
and streams the response to standard output. Should be simple enough to
integrate into any extensible text editor, but I only needed it for Vim.
Vimscript is miserable, probably the second worst programming language
I’ve ever touched, so my goal was to write as little as possible.</p>

<p>I created Illume to scratch my own itch, to support my exploration of the
LLM ecosystem. I actively break things and add features as needed, and I
make no promises about interface stability. <em>You probably don’t want to
use it.</em></p>

<p>Lines that begin with <code class="language-plaintext highlighter-rouge">!</code> are directives interpreted by Illume, chosen
because it’s unlikely to appear in normal text. A conversation alternates
between <code class="language-plaintext highlighter-rouge">!user</code> and <code class="language-plaintext highlighter-rouge">!assistant</code> in a buffer.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>!user
Write a Haiku about time travelers disguised as frogs.

!assistant
Green, leaping through time,
Frog tongues lick the future's rim,
Disguised in pond's guise.
</code></pre></div></div>

<p>It’s still a text editor buffer, so I can edit the assistant response,
reword my original request, etc. before continuing the conversation. For
composing fiction, I can request it to continue some text (which does not
require instruct training):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>!completion
Din the Wizard stalked the dim castle
</code></pre></div></div>

<p>I can stop it, make changes, add my own writing, and keep going. I ought
to spend more time practicing with it. If you introduce out-of-story note
syntax, the LLM will pick up on it, and then you can use notes to guide
the LLM’s writing.</p>

<p>While the main target is llama.cpp, I query different APIs, implemented by
different LLM software, with incompatibilities across APIs (a parameter
required by one API is forbidden by another), so directives must be
flexible and powerful: they can set arbitrary HTTP and JSON
parameters. Illume doesn’t try to abstract the API, but exposes it at a
low level, so effective use requires knowing the remote API. For example,
the “profile” for talking to llama.cpp looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>!api http://localhost:8080/v1
!:cache_prompt true
</code></pre></div></div>

<p>Where <code class="language-plaintext highlighter-rouge">cache_prompt</code> is a llama.cpp-specific JSON parameter (<code class="language-plaintext highlighter-rouge">!:</code>). Prompt
caching is nearly always better enabled, yet for some reason it’s disabled by
default. Other APIs refuse requests with this parameter, so then I must
omit or otherwise disable it. The Hugging Face “profile” looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>!api https://api-inference.huggingface.co/models/{model}/v1
!:model Qwen/Qwen2.5-72B-Instruct
!&gt;x-use-cache false
</code></pre></div></div>

<p>For the sake of HF, Illume can interpolate JSON parameters into the URL.
The HF API also caches aggressively. I never want this, so I supply
an HTTP parameter (<code class="language-plaintext highlighter-rouge">!&gt;</code>) to turn it off.</p>
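
<p>As a concrete illustration, the directive grammar reduces to a few
lines of code. Here’s a minimal Python sketch of a parser for the three
directives shown above (a hypothetical reimplementation for illustration,
not Illume’s actual source):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json

def parse(text):
    # "!api" sets the endpoint, "!:" a JSON body parameter, "!&gt;" an
    # HTTP header; everything else accumulates as the prompt.
    api, body, headers, prompt = None, {}, {}, []
    for line in text.splitlines():
        if line.startswith("!api "):
            api = line[5:].strip()
        elif line.startswith("!:"):
            key, _, value = line[2:].partition(" ")
            body[key] = scalar(value)
        elif line.startswith("!&gt;"):
            key, _, value = line[2:].partition(" ")
            headers[key] = value.strip()
        else:
            prompt.append(line)
    return api, body, headers, "\n".join(prompt)

def scalar(s):
    try:
        return json.loads(s)  # true, false, numbers, quoted strings
    except ValueError:
        return s.strip()      # bare strings pass through as-is
</code></pre></div></div>

<p>Fed either profile above, it yields the endpoint, a JSON body, and the
extra headers, ready to hand to any HTTP client.</p>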

<p>Unique to llama.cpp is an <code class="language-plaintext highlighter-rouge">/infill</code> endpoint for FIM. It requires a model
with extra metadata, trained a certain way, but such metadata is usually
missing. So while Illume can use <code class="language-plaintext highlighter-rouge">/infill</code>, I also added FIM configuration
so, after reading the model’s documentation and configuring Illume for
that model’s FIM behavior, I can do FIM completion through the normal
completion API on any FIM-trained model, even on non-llama.cpp APIs.</p>

<h3 id="fill-in-the-middle-fim-tokens">Fill-in-the-Middle (FIM) tokens</h3>

<p>It’s time to discuss FIM. To get to the bottom of FIM I needed to go to
the source of truth, the original FIM paper: <a href="https://arxiv.org/abs/2207.14255">Efficient Training of
Language Models to Fill in the Middle</a>. This allowed me to understand
how these models are FIM-trained, at least enough to put that training to
use. Even so, model documentation tends to be thin on FIM because they
expect you to run their code.</p>

<p>Ultimately an LLM can only predict the next token. So pick some special
tokens that don’t appear in inputs, use them to delimit a prefix, suffix,
and middle — ordered prefix-suffix-middle (PSM), or sometimes
suffix-prefix-middle (SPM)
— in a large training corpus. Later in inference we can use those tokens
to provide a prefix and suffix, and let it “predict” the middle. Crazy, but
<em>this actually works!</em></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;PRE&gt;{prefix}&lt;SUF&gt;{suffix}&lt;MID&gt;
</code></pre></div></div>

<p>For example when filling the parentheses of <code class="language-plaintext highlighter-rouge">dist = sqrt(x*x + y*y)</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;PRE&gt;dist = sqrt(&lt;SUF&gt;)&lt;MID&gt;x*x + y*y
</code></pre></div></div>

<p>To have the LLM fill in the parentheses, we’d stop at <code class="language-plaintext highlighter-rouge">&lt;MID&gt;</code> and let the
LLM predict from there. Note how <code class="language-plaintext highlighter-rouge">&lt;SUF&gt;</code> is essentially the cursor. By the
way, this is basically how instruct training works, but instead of prefix
and suffix, special tokens delimit instructions and conversation.</p>
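
<p>For comparison, here is the ChatML-style chat template used by several
instruct models (Qwen among them); the special tokens delimit conversation
turns the same way FIM tokens delimit code regions:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;|im_start|&gt;user
Write a haiku.&lt;|im_end|&gt;
&lt;|im_start|&gt;assistant
</code></pre></div></div>

<p>The model’s reply follows, terminated by its end-of-turn token.</p>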

<p>Some LLM folks interpret the paper quite literally and use <code class="language-plaintext highlighter-rouge">&lt;PRE&gt;</code>, etc.
for their FIM tokens, although these look nothing like their other special
tokens. More thoughtful trainers picked <code class="language-plaintext highlighter-rouge">&lt;|fim_prefix|&gt;</code>, etc. Illume
accepts FIM templates, and I wrote templates for the popular models. For
example, here’s Qwen (PSM):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;|fim_prefix|&gt;{prefix}&lt;|fim_suffix|&gt;{suffix}&lt;|fim_middle|&gt;
</code></pre></div></div>

<p>Mistral AI prefers square brackets, SPM, and no “middle” token:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[SUFFIX]{suffix}[PREFIX]{prefix}
</code></pre></div></div>

<p>With these templates I could access the FIM training in models unsupported
by llama.cpp’s <code class="language-plaintext highlighter-rouge">/infill</code> API.</p>
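
<p>Assembling the prompt from a template is plain string substitution. A
short sketch (the template strings are verbatim from above; the function
and its names are hypothetical):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Map a model family to its FIM template: PSM for Qwen, SPM for Mistral.
TEMPLATES = {
    "qwen":    "&lt;|fim_prefix|&gt;{prefix}&lt;|fim_suffix|&gt;{suffix}&lt;|fim_middle|&gt;",
    "mistral": "[SUFFIX]{suffix}[PREFIX]{prefix}",
}

def fim_prompt(family, before_cursor, after_cursor):
    # Text before the cursor is the prefix, text after is the suffix;
    # the model predicts the middle from where the template ends.
    return TEMPLATES[family].format(prefix=before_cursor, suffix=after_cursor)
</code></pre></div></div>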

<p>Besides just failing the prompt, the biggest problem I’ve had with FIM is
LLMs not knowing when to stop. For example, if I ask it to fill out this
function (i.e. assign something to <code class="language-plaintext highlighter-rouge">r</code>):</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">norm</span><span class="p">(</span><span class="n">x</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="n">y</span><span class="p">:</span> <span class="nb">float</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">float</span><span class="p">:</span>
    <span class="k">return</span> <span class="n">r</span>
</code></pre></div></div>

<p>(Side note: Static types, including the hints here, produce better results
from LLMs, acting as guardrails.) It’s not unusual to get something like:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">norm</span><span class="p">(</span><span class="n">x</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="n">y</span><span class="p">:</span> <span class="nb">float</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">float</span><span class="p">:</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">sqrt</span><span class="p">(</span><span class="n">x</span><span class="o">*</span><span class="n">x</span> <span class="o">+</span> <span class="n">y</span><span class="o">*</span><span class="n">y</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">r</span>

<span class="k">def</span> <span class="nf">norm3</span><span class="p">(</span><span class="n">x</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="n">y</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="n">z</span><span class="p">:</span> <span class="nb">float</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">float</span><span class="p">:</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">sqrt</span><span class="p">(</span><span class="n">x</span><span class="o">*</span><span class="n">x</span> <span class="o">+</span> <span class="n">y</span><span class="o">*</span><span class="n">y</span> <span class="o">+</span> <span class="n">z</span><span class="o">*</span><span class="n">z</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">r</span>

<span class="k">def</span> <span class="nf">norm4</span><span class="p">(</span><span class="n">x</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="n">y</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="n">z</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="n">w</span><span class="p">:</span> <span class="nb">float</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">float</span><span class="p">:</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">sqrt</span><span class="p">(</span><span class="n">x</span><span class="o">*</span><span class="n">x</span> <span class="o">+</span> <span class="n">y</span><span class="o">*</span><span class="n">y</span> <span class="o">+</span> <span class="n">z</span><span class="o">*</span><span class="n">z</span> <span class="o">+</span> <span class="n">w</span><span class="o">*</span><span class="n">w</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">r</span>
</code></pre></div></div>

<p>Where the original <code class="language-plaintext highlighter-rouge">return r</code> became the return for <code class="language-plaintext highlighter-rouge">norm4</code>. Technically
it fits the prompt, but it’s obviously not what I want. So be ready to
mash the “stop” button when it gets out of control. The three coder models
I recommended exhibit this behavior less often. It might be more robust to
combine it with a non-LLM system that understands the code semantically
and automatically stops generation when the LLM begins generating tokens
in a higher scope. That would make more coder models viable, but this goes
beyond my own fiddling.</p>
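
<p>For an indentation-structured language like Python, even a crude
version of that stop condition helps: halt once the model emits a
non-blank line indented to the left of the fill point, since it has
escaped into an enclosing scope. A hypothetical sketch, not something
Illume implements:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def escaped_scope(line, fill_indent):
    # True when a generated line has left the scope where the fill
    # began: non-blank and indented left of the fill point. Blank
    # lines prove nothing and are ignored.
    stripped = line.lstrip(" ")
    if not stripped:
        return False
    return len(line) - len(stripped) &lt; fill_indent
</code></pre></div></div>

<p>Against the runaway completion above, with a fill indent of four, this
fires at the second <code class="language-plaintext highlighter-rouge">def</code> and generation stops there.</p>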

<p>Figuring out FIM and putting it into action revealed to me that FIM is
still in its early stages, and hardly anyone is generating code via FIM. I
guess everyone’s just using plain old completion?</p>

<h3 id="so-what-are-llms-good-for">So what are LLMs good for?</h3>

<p>LLMs are fun, but what productive uses do they have? That’s a question
I’ve been trying to answer this past month, and it’s come up shorter than
I hoped. It might be useful to establish boundaries — tasks that LLMs
definitely cannot do.</p>

<p>First, <strong>LLMs are no good if correctness cannot be readily verified</strong>.
They are untrustworthy hallucinators. Often, if you’re in a position to
verify LLM output, you didn’t need it in the first place. This is why
Mixtral, with its large “database” of knowledge, isn’t so useful. It also
means it’s <em>reckless and irresponsible to inject LLM output into search
results</em> — just shameful.</p>

<p>LLM enthusiasts, who ought to know better, fall into this trap anyway and
propagate hallucinations. It makes discourse around LLMs less trustworthy
than normal, and I need to approach LLM information with extra skepticism.
Case in point: Recall how “GGUF” doesn’t have an authoritative definition.
Search for one and you’ll find an obvious hallucination that made it all
the way into official IBM documentation. I won’t repeat it here so as not to
make things worse.</p>

<p>Second, <strong>LLMs have goldfish-sized working memory</strong>. That is, they’re held
back by small context lengths. Some models are trained on larger contexts,
but their <a href="https://github.com/NVIDIA/RULER">effective context length</a> is usually much smaller. In
practice, an LLM can hold several book chapters worth of comprehension “in
its head” at a time. For code it’s 2k or 3k lines (code is token-dense).
That’s the most you can work with at once. Compared to a human, it’s tiny.
There are tools like <a href="https://en.wikipedia.org/wiki/Retrieval-augmented_generation">retrieval-augmented generation</a> and fine-tuning
to mitigate it… <em>slightly</em>.</p>

<p>Third, <strong>LLMs are poor programmers</strong>. At best they write code like an
undergraduate student who’s read a lot of documentation. That sounds
better than it is. The typical fresh graduate enters the workforce knowing
practically nothing about software engineering. Day one on the job is the
first day of their <a href="/blog/2016/09/02/">real education</a>. In that sense, LLMs today
haven’t even begun their education.</p>

<p>To be fair, that LLMs work as well as they do is amazing! Thrown into the
middle of a program in <a href="/blog/2023/10/08/">my unconventional style</a>, LLMs figure it out
and make use of the custom interfaces. (Caveat: My code and writing are in
the training data of most of these LLMs.) So the more context, the better,
within the effective context length. The challenge is getting something
useful out of an LLM in less time than writing it myself.</p>

<p><em>Writing new code is the easy part</em>. The hard part is maintaining code,
and writing new code with that maintenance in mind. Even when an LLM
produces code that works, there’s no thought to maintenance, nor could
there be. In general the reliability of generated code follows the inverse
square law by length, and generating more than a dozen lines at a time is
fraught. I really tried, but never saw LLM output beyond 2–3 lines of code
which I would consider acceptable.</p>

<p>Quality varies substantially by language. LLMs are better at Python than
C, and better at C than assembly. I suspect it’s related to the difficulty
of the language and the quality of the input. It’s trained on lots of
terrible C — the internet is loaded with it after all — and probably the
only labeled x86 assembly it’s seen is crummy beginner tutorials. Ask it
to use SDL2 and it <a href="/blog/2023/01/08/">reliably produces the common mistakes</a> because
it’s been trained to do so.</p>

<p>What about boilerplate? That’s something an LLM could probably do with a
low error rate, and perhaps there’s merit to it. Though the fastest way to
deal with boilerplate is to not write it at all. Change your problem to
not require boilerplate.</p>

<p>Without taking my word for it, consider how it shows up in the economics:
If AI companies could deliver the productivity gains they claim, they
wouldn’t sell AI. They’d keep it to themselves and gobble up the software
industry. Or consider the software products produced by companies on the
bleeding edge of AI. It’s still the same old, bloated web garbage everyone
else is building. (My LLM research has involved navigating their awful web
sites, and it’s made me bitter.)</p>

<p>In code generation, hallucinations are less concerning. You already knew
what you wanted when you asked, so you can review it, and your compiler
will help catch problems you miss (e.g. calling a hallucinated method).
However, small context and poor code generation remain roadblocks, and I
haven’t yet made this work effectively.</p>

<p>So then, what can I do with LLMs? A list is apt because LLMs love lists:</p>

<ul>
  <li>
    <p>Proofreading has been most useful for me. I give it a document such as
an email or this article (~8,000 tokens), tell it to look over grammar,
call out passive voice, and so on, and suggest changes. I accept or
reject its suggestions and move on. Most suggestions will be poor, and
this very article was long enough that even ~70B models suggested
changes to hallucinated sentences. Regardless, there’s signal in the
noise, and it fits within the limitations outlined above. I’m still
trying to apply this technique (“find bugs, please”) to code review, but
so far success is elusive.</p>
  </li>
  <li>
    <p>Writing short fiction. Hallucinations are not a problem; they’re a
feature! Context lengths are the limiting factor, though perhaps you can
stretch it by supplying chapter summaries, also written by LLM. I’m
still exploring this. If you’re feeling lazy, tell it to offer you three
possible story branches at each turn, and you pick the most interesting.
Or even tell it to combine two of them! LLMs are clever and will figure
it out. Some genres work better than others, and concrete works better
than abstract. (I wonder if professional writers judge its writing as
poor as I judge its programming.)</p>
  </li>
  <li>
    <p>Generative fun. Have an argument with Benjamin Franklin (note: this
probably violates the <a href="https://ai.meta.com/llama/use-policy/">Acceptable Use Policy</a> of some models), hang
out with a character from your favorite book, or generate a new scene of
<a href="/blog/2023/06/22/#76-henry-iv">Falstaff’s blustering antics</a>. Talking to historical figures
has been educational: The character says something unexpected, I look it
up the old-fashioned way to see what it’s about, then learn something
new.</p>
  </li>
  <li>
    <p>Language translation. I’ve been browsing foreign language subreddits
through Gemma-2-2B translation, and it’s been insightful. (I had no idea
German speakers were so distrustful of artificial sweeteners.)</p>
  </li>
</ul>

<p>Despite the short list of useful applications, this is the most excited
I’ve been about a new technology in years!</p>

]]>
    </content>
  </entry>
  <entry>
    <title>I solved the Dandelions paper-and-pencil game</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/10/12/"/>
    <id>urn:uuid:14edf491-dcdd-4c2f-a75f-5e89838e6b40</id>
    <updated>2022-10-12T03:02:27Z</updated>
    <category term="c"/><category term="game"/><category term="ai"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>I’ve been reading <a href="https://mathwithbaddrawings.com/2022/01/19/math-games-with-bad-drawings-2/"><em>Math Games with Bad Drawings</em></a>, a great book
well-aligned to my interests. It’s given me a lot of new, interesting
programming puzzles to consider. The first to truly nerd snipe me was
<a href="https://mathwithbaddrawings.com/dandelions/">Dandelions</a> (<a href="https://mathwithbaddrawings.com/wp-content/uploads/2020/06/game-5-dandelions-1.pdf">full rules</a>), an asymmetric paper-and-pencil game
invented by the book’s author, Ben Orlin. Just as with <a href="/blog/2020/10/19/">British Square two
years ago</a> — and essentially following the same technique — I wrote a
program that explores the game tree sufficiently to play either side
perfectly, “solving” the game in its standard 5-by-5 configuration.</p>

<p>The source: <strong><a href="https://github.com/skeeto/scratch/blob/master/misc/dandelions.c"><code class="language-plaintext highlighter-rouge">dandelions.c</code></a></strong></p>

<p>The game is played on a 5-by-5 grid where one player plays the dandelions,
the other plays the wind. Players alternate, dandelions placing flowers
and wind blowing in one of the eight directions, spreading seeds from all
flowers along the direction of the wind. Each side gets seven moves, and
the wind cannot blow in the same direction twice. The dandelions’ goal is
to fill the grid with seeds, and the wind’s goal is to prevent this.</p>

<p>Try playing a few rounds with a friend, and you will probably find that
dandelions is difficult, at least in your first games, as though it cannot
be won. However, my engine proves the opposite: <strong>The dandelions always
win with perfect play.</strong> In fact, it’s so lopsided that the dandelions’
first move is irrelevant. Every first move is winnable. If the dandelions
blunder, typically wind has one narrow chance to seize control, after
which wind probably wins with any (or almost any) move.</p>

<p>For reasons I’ll discuss later, I only solved the 5-by-5 game, and the
situation may be different for the 6-by-6 variant. Also, unlike British
Square, my engine does not exhaustively explore the entire game tree
because it’s far too large. Instead it does a minimax search to the bottom
of the tree and stops when it finds a branch where all leaves are wins for
the current player. Because of this, it cannot maximize the outcome —
winning as early as possible as dandelions or maximizing the number of
empty grid spaces as wind. I also can’t quantify the exact size of the tree.</p>

<p>Like with British Square, my game engine only has a crude user interface
for interactively exploring the game tree. While you can “play” it in a
sense, it’s not intended to be played. It also takes a few seconds to
initially explore the game tree, so wait for the <code class="language-plaintext highlighter-rouge">&gt;&gt;</code> prompt.</p>

<h3 id="bitboard-seeding">Bitboard seeding</h3>

<p>I used <a href="https://www.chessprogramming.org/Bitboards">bitboards</a> of course: a 25-bit bitboard for flowers, a 25-bit
bitboard for seeds, and an 8-bit set to track which directions the wind
has blown. It’s especially well-suited for this game since seeds can be
spread in parallel using bitwise operations. Shift the flower bitboard in
the direction of the wind four times, ORing it into the seeds bitboard
on each shift:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int wind;
uint32_t seeds, flowers;

flowers &gt;&gt;= wind;  seeds |= flowers;
flowers &gt;&gt;= wind;  seeds |= flowers;
flowers &gt;&gt;= wind;  seeds |= flowers;
flowers &gt;&gt;= wind;  seeds |= flowers;
</code></pre></div></div>

<p>Of course it’s a little more complicated than this. The flowers must be
masked to keep them from wrapping around the grid, and wind may require
shifting in the other direction. In order to “negative shift” I actually
use a rotation (notated with <code class="language-plaintext highlighter-rouge">&gt;&gt;&gt;</code> below). Consider, to rotate an N-bit
integer <em>left</em> by R, one can <em>right</em>-rotate it by <code class="language-plaintext highlighter-rouge">N-R</code> — ex. on a 32-bit
integer, a left-rotate by 1 is the same as a right-rotate by 31. So for a
negative <code class="language-plaintext highlighter-rouge">wind</code> that goes in the other direction:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>flowers &gt;&gt;&gt; (wind &amp; 31);
</code></pre></div></div>

<p>With such a “programmable shift” I can implement the bulk of the game
rules using a couple of tables and no branches:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// clockwise, east is zero
static int8_t rot[] = {-1, -6, -5, -4, +1, +6, +5, +4};
static uint32_t mask[] = {
    0x0f7bdef, 0x007bdef, 0x00fffff, 0x00f7bde,
    0x1ef7bde, 0x1ef7bc0, 0x1ffffe0, 0x0f7bde0
};
f &amp;= mask[dir];  f &gt;&gt;&gt;= rot[dir] &amp; 31;  s |= f;
f &amp;= mask[dir];  f &gt;&gt;&gt;= rot[dir] &amp; 31;  s |= f;
f &amp;= mask[dir];  f &gt;&gt;&gt;= rot[dir] &amp; 31;  s |= f;
f &amp;= mask[dir];  f &gt;&gt;&gt;= rot[dir] &amp; 31;  s |= f;
</code></pre></div></div>

<p>The masks clear out the column/row about to be shifted “out” so that it
doesn’t wrap around. Viewed in base-2, they’re 5-bit patterns repeated 5
times.</p>
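
<p>Those constants can be derived rather than hand-checked. A Python
sketch of where they come from: multiply a 5-bit column pattern by a
“repeat” constant to tile it across the five rows, and AND with a row
mask for the vertical component. (Which physical edge each mask guards
depends on the bit layout; the table above is the source of truth.)</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># REP tiles a 5-bit pattern across all five rows of the 25-bit board:
# REP = 1 + 32 + 32**2 + 32**3 + 32**4
REP = 0x108421
ALL = (1 &lt;&lt; 25) - 1

no_col_hi = 0b01111 * REP   # clear bit 4 of every 5-bit group
no_col_lo = 0b11110 * REP   # clear bit 0 of every 5-bit group
no_row_hi = ALL &gt;&gt; 5        # clear the group in bits 20-24
no_row_lo = ALL ^ 0b11111   # clear the group in bits 0-4

# Clockwise from east (index 0), matching the rot[] table:
masks = [no_col_hi, no_col_hi &amp; no_row_hi, no_row_hi,
         no_col_lo &amp; no_row_hi, no_col_lo, no_col_lo &amp; no_row_lo,
         no_row_lo, no_col_hi &amp; no_row_lo]
assert masks == [0x0f7bdef, 0x007bdef, 0x00fffff, 0x00f7bde,
                 0x1ef7bde, 0x1ef7bc0, 0x1ffffe0, 0x0f7bde0]
</code></pre></div></div>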

<h3 id="bitboard-packing-and-canonicalization">Bitboard packing and canonicalization</h3>

<p>The entire game state is two 25-bit bitboards and an 8-bit set. That’s 58
bits, which fits in a 64-bit integer with bits to spare. How incredibly
convenient! So I represent the game state using a 64-bit integer, using a
packing like I did with British Square. The bottom 25 bits are the seeds,
the next 25 bits are the flowers, and the next 8 bits are the wind set.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>000000 WWWWWWWW FFFFFFFFFFFFFFFFFFFFFFFFF SSSSSSSSSSSSSSSSSSSSSSSSS
</code></pre></div></div>

<p>Even more convenient, I could reuse my bitboard canonicalization code from
British Square, also a 5-by-5 grid packed in the same way, saving me the
trouble of working out all the bit sieves. I only had to figure out how to
transpose and flip the wind bitset. Turns out that’s pretty easy, too.
Here’s how I represent the 8 wind directions:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>567
4 0
321
</code></pre></div></div>

<p>Flipping this vertically I get:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>321
4 0
567
</code></pre></div></div>

<p>Unroll these to show how old maps onto new:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>old: 01234567
new: 07654321
</code></pre></div></div>

<p>The new is just the old rotated and reversed. Transposition is the same
story, just a different rotation. I use a small lookup table to reverse
the bits, and then an 8-bit rotation. (See <code class="language-plaintext highlighter-rouge">revrot</code>.)</p>
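
<p>The flip itself is tiny. A hypothetical Python sketch of the
reverse-then-rotate trick (the real <code class="language-plaintext highlighter-rouge">revrot</code> reverses via a lookup
table):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def rev8(b):
    # Reverse the 8 bits of b: direction d moves to 7 - d.
    return int(f"{b:08b}"[::-1], 2)

def flip_vertical(wind):
    # A vertical flip maps direction d to (8 - d) % 8: reverse the
    # bits, then rotate left by one.
    r = rev8(wind)
    return ((r &lt;&lt; 1) | (r &gt;&gt; 7)) &amp; 0xff
</code></pre></div></div>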

<p>To determine how many moves have been made, popcount the flower bitboard
and wind bitset.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int moves = POPCOUNT64(g &amp; 0x3fffffffe000000);
</code></pre></div></div>

<p>To test if dandelions have won:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int win = (g&amp;0x1ffffff) == 0x1ffffff;
</code></pre></div></div>

<p>Since the plan is to store all the game states in a big hash table — an
<a href="/blog/2022/08/08/">MSI double hash</a> in this case — I’d like to reserve the zero value
as a “null” board state. This lets me zero-initialize the hash table. To
do this, I invert the wind bitset such that a 1 indicates the direction is
still available. So the initial game state looks like this (in the real
program this is accounted for in the previously-discussed turn popcount):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define GAME_INIT ((uint64_t)255 &lt;&lt; 50)
</span></code></pre></div></div>
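<p>With the wind bits inverted, the earlier popcount would over-count by the number of directions still unblown, so the adjusted move count looks something like this sketch (bit layout per the masks above; the helper itself is my illustration):</p>

```c
#include <stdint.h>

#define GAME_INIT ((uint64_t)255 << 50)

// Sketch of the adjusted turn count: flowers occupy bits 25-49, and the
// inverted wind bitset bits 50-57 (1 = direction still available).
// Moves = flowers placed + winds already blown.
static int moves(uint64_t g)
{
    int flowers = __builtin_popcountll(g & 0x3fffffe000000);
    int blown   = 8 - __builtin_popcountll(g & ((uint64_t)255 << 50));
    return flowers + blown;
}
```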

<p>The remaining 6 bits can be used to cache information about the rest of
the tree under this game state, namely who wins from this position, and this
serves as the “value” in the hash table. Turns out the bitboards are
already noisy enough that a <a href="/blog/2018/07/31/">single xorshift</a> makes for a great hash
function. The hash table, including hash function, is under a dozen lines
of code.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Find the hash table slot for the given game state.</span>
<span class="kt">uint64_t</span> <span class="o">*</span><span class="nf">lookup</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="o">*</span><span class="n">ht</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">g</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">hash</span> <span class="o">=</span> <span class="n">g</span> <span class="o">^</span> <span class="n">g</span><span class="o">&gt;&gt;</span><span class="mi">32</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">mask</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1L</span> <span class="o">&lt;&lt;</span> <span class="n">HASHTAB_EXP</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">step</span> <span class="o">=</span> <span class="n">hash</span><span class="o">&gt;&gt;</span><span class="p">(</span><span class="mi">64</span> <span class="o">-</span> <span class="n">HASHTAB_EXP</span><span class="p">)</span> <span class="o">|</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">hash</span><span class="p">;;)</span> <span class="p">{</span>
        <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="n">step</span><span class="p">)</span><span class="o">&amp;</span><span class="n">mask</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">||</span> <span class="p">(</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">&amp;</span><span class="mh">0x3ffffffffffffff</span><span class="p">)</span> <span class="o">==</span> <span class="n">g</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">ht</span> <span class="o">+</span> <span class="n">i</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>To explore a 6-by-6 grid I’d need to change my representation, which is
part of why I didn’t do it. I can’t fit two 36-bit bitboards in a 64-bit
integer, so I’d need to double my storage requirements, which are already
strained.</p>

<h3 id="computational-limitations">Computational limitations</h3>

<p>Due to the way seeds spread, game states resulting from different moves
rarely converge back to a common state later in the tree, so the hash
table isn’t doing much deduplication. Exhaustively exploring the entire
game tree, even cutting it down to an 8th using canonicalization, requires
substantial computing resources, more than I personally have available for
this project. So I had to settle for a slightly weaker result: finding a
winning branch rather than maximizing a “score.”</p>

<p>I configure the program to allocate 2GiB for the hash table, but if you
run just a few dozen games off the same table (same program instance),
each exploring different parts of the game tree, you’ll exhaust this
table. A 6-by-6 grid doubles the memory requirements just to represent the
game, but it also slows the search and substantially increases the width
of the tree, which grows 44% faster. I’m sure it can be done, but it’s
just beyond the resources available to me.</p>

<h3 id="dandelion-puzzles">Dandelion Puzzles</h3>

<p>As a side effect, I wrote a small routine to randomly play out games in
search for “mate-in-two”-style puzzles. The dandelions have two flowers to
place and can force a win with two specific placements — and only those
two placements — regardless of how the wind blows. Here are two of the
better ones, each involving a small trick that I won’t give away here
(note: arrowheads indicate directions wind can still blow):</p>

<p><img src="/img/dandelions/puzzle1.svg" alt="" /></p>

<p><img src="/img/dandelions/puzzle2.svg" alt="" /></p>

<p>There are a variety of potential single-player puzzles of this form.</p>

<ul>
  <li>Cooperative: place a dandelion <em>and</em> pick the wind direction</li>
  <li>Avoidance: <em>don’t</em> seed a particular tile</li>
  <li>Hard ground: certain tiles can’t grow flowers (but still get seeded)</li>
  <li>Weeding: as wind, figure out which flower to remove before blowing</li>
</ul>

<p>There could be a whole “crossword book” of such dandelion puzzles.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>You might not need machine learning</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2020/11/24/"/>
    <id>urn:uuid:91aa121d-c796-4c11-99d4-41c707637672</id>
    <updated>2020-11-24T04:04:36Z</updated>
    <category term="ai"/><category term="c"/><category term="media"/><category term="compsci"/><category term="video"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=25196574">on Hacker News</a>.</em></p>

<p>Machine learning is a trendy topic, so naturally it’s often used for
inappropriate purposes where a simpler, more efficient, and more reliable
solution suffices. The other day I saw an illustrative and fun example of
this: <a href="https://www.youtube.com/watch?v=-sg-GgoFCP0">Neural Network Cars and Genetic Algorithms</a>. The video
demonstrates 2D cars driven by a neural network with weights determined by
a genetic algorithm. However, the entire scheme can be replaced by a
first-degree polynomial without any loss in capability. The machine
learning part is overkill.</p>

<p><a href="https://nullprogram.com/video/?v=racetrack"><img src="/img/screenshot/racetrack.jpg" alt="" /></a></p>

<!--more-->

<p>Above demonstrates my implementation using a polynomial to drive the cars.
My wife drew the background. There’s no path-finding; these cars are just
feeling their way along the track, “following the rails” so to speak.</p>

<p>My intention is not to pick on this project in particular. The likely
motivation in the first place was a desire to apply a neural network to
<em>something</em>. Many of my own projects are little more than a vehicle to try
something new, so I can sympathize. A professional setting is different,
though: there, machine learning deserves a more skeptical eye than it
usually gets. For instance, don’t use active learning to
select sample distribution when a <a href="http://extremelearning.com.au/unreasonable-effectiveness-of-quasirandom-sequences/">quasirandom sequence</a> will do.</p>

<p>In the video, the car has a limited turn radius, and minimum and maximum
speeds. (I’ve retained these constraints in my own simulation.) There are
five sensors — forward, forward-diagonals, and sides — each sensing the
distance to the nearest wall. These are fed into a 3-layer neural network,
and the outputs determine throttle and steering. Sounds pretty cool!</p>

<p><img src="/img/diagram/racecar.svg" alt="" /></p>

<p>A key feature of neural networks is that the outputs are a nonlinear
function of the inputs. However, steering a 2D car is simple enough that
<strong>a linear function is more than sufficient</strong>, and neural networks are
unnecessary. Here are my equations:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>steering = C0*input1 - C0*input3
throttle = C1*input2
</code></pre></div></div>

<p>I only need three of the original inputs — forward for throttle, and
diagonals for steering — and the driver has just two parameters, <code class="language-plaintext highlighter-rouge">C0</code> and
<code class="language-plaintext highlighter-rouge">C1</code>, the polynomial coefficients. Optimal values depend on the track
layout and car configuration, but in my simulation most values between 0
and 1 are good enough. It’s less a matter of crashing and more a matter
of navigating the course quickly.</p>
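<p>To emphasize just how small this driver is, here it is as a single function (a sketch with hypothetical names; only the two-coefficient rule comes from the equations above):</p>

```c
// The entire "driver" as a function: inputs are the three sensor
// distances, output is steering and throttle per the first-degree
// polynomial above. Names are hypothetical.
struct control { float steering, throttle; };

static struct control
drive(float left_diag, float forward, float right_diag, float c0, float c1)
{
    struct control c;
    c.steering = c0*left_diag - c0*right_diag; // steer toward the open side
    c.throttle = c1*forward;                   // faster when the way is clear
    return c;
}
```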

<p>The lengths of the red lines below are the driver’s three inputs:</p>

<video src="/vid/racecar.mp4" width="530" height="330" loop="" muted="" autoplay="" controls="">
</video>

<p>These polynomials are obviously much faster than a neural network, but
they’re also easy to understand and debug. I can confidently reason about
the entire range of possible inputs rather than worry about a trained
neural network <a href="https://arxiv.org/abs/1903.06638">responding strangely</a> to untested inputs.</p>

<p>Instead of doing anything fancy, my program generates the coefficients at
random to explore the space. If I wanted to generate a good driver for a
course, I’d run a few thousand of these and pick the coefficients that
complete the course in the shortest time. For instance, these coefficients
make for a fast, capable driver for the course featured at the top of the
article:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>C0 = 0.896336973, C1 = 0.0354805067
</code></pre></div></div>

<p>Many constants can complete the track, but some will be faster than
others. If I were developing a racing game using this as the AI, I’d not
just pick constants that successfully complete the track, but the ones
that do it quickly. Here’s what the spread can look like:</p>

<video src="/vid/racecars.mp4" width="530" height="330" loop="" muted="" autoplay="" controls="">
</video>
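<p>The selection loop itself fits in a few lines. In this sketch, <code>run_course</code> is a hypothetical stand-in for the real simulation and merely scores proximity to a target pair, so the loop can be exercised on its own:</p>

```c
#include <stdlib.h>

// Stand-in for the real simulation: a "lap time" where lower is better.
// Here it just measures distance to an arbitrary target pair.
static double run_course(double c0, double c1)
{
    double d0 = c0 - 0.9, d1 = c1 - 0.04;
    return d0*d0 + d1*d1;
}

struct best { double c0, c1, time; };

// Sample random coefficient pairs in (0, 1) and keep the fastest.
static struct best search(int trials, unsigned seed)
{
    struct best b = {0, 0, 1e30};
    srand(seed);
    for (int i = 0; i < trials; i++) {
        double c0 = rand() / (RAND_MAX + 1.0);
        double c1 = rand() / (RAND_MAX + 1.0);
        double t = run_course(c0, c1);
        if (t < b.time) {
            b.c0 = c0;
            b.c1 = c1;
            b.time = t;
        }
    }
    return b;
}
```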

<p>If you want to play around with this yourself, here’s my C source code
that implements this driving AI and <a href="/blog/2017/11/03/">generates the videos and images
above</a>:</p>

<p><strong><a href="https://github.com/skeeto/scratch/blob/master/aidrivers/aidrivers.c">aidrivers.c</a></strong></p>

<p>Racetracks are just images drawn in your favorite image editing program
using the colors documented in the source header.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>I Solved British Square</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2020/10/19/"/>
    <id>urn:uuid:c500b91a-046f-4320-8eff-9bc8f8443ef3</id>
    <updated>2020-10-19T19:32:52Z</updated>
    <category term="c"/><category term="game"/><category term="ai"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>Update</em>: I <a href="/blog/2022/10/12/">solved another game</a> using essentially the same
technique.</p>

<p><a href="https://boardgamegeek.com/boardgame/3719/british-square">British Square</a> is a 1978 abstract strategy board game which I
recently discovered <a href="https://www.youtube.com/watch?v=PChKZbut3lM&amp;t=10m">from a YouTube video</a>. It’s well-suited to play
by pencil-and-paper, so my wife and I played a few rounds to try it out.
Curious about strategies, I searched online for analysis and found
nothing whatsoever, meaning I’d have to discover strategies for myself.
This is <em>exactly</em> the sort of problem that <a href="https://xkcd.com/356/">nerd snipes</a>, and so I
sunk a couple of evenings building an analysis engine in C — enough to
fully solve the game and play <em>perfectly</em>.</p>

<p><strong>Repository</strong>: <a href="https://github.com/skeeto/british-square"><strong>British Square Analysis Engine</strong></a>
(and <a href="https://github.com/skeeto/british-square/releases">prebuilt binaries</a>)</p>

<p><a href="/img/british-square/british-square.jpg"><img src="/img/british-square/british-square-thumb.jpg" alt="" /></a>
<!-- Photo credit: Kelsey Wellons --></p>

<!--more-->

<p>The game is played on a 5-by-5 grid with two players taking turns
placing pieces of their color. Pieces may not be placed on tiles
4-adjacent to an opposing piece, and as a special rule, the first player
may not play the center tile on the first turn. Players pass when they
have no legal moves, and the game ends when both players pass. The score
is the difference between the piece counts for each player.</p>
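<p>Under the bitboard representation described later in this post, that score reduces to popcounting each player’s 25-bit board and subtracting. A sketch (the layout is from the implementation section; the helper itself is my illustration):</p>

```c
#include <stdint.h>

// Score sketch under the engine's layout: first player in bits 0-24,
// second player in bits 25-49. Positive favors the first player.
static int score(uint64_t b)
{
    return __builtin_popcountll(b & 0x1ffffff)
         - __builtin_popcountll((b >> 25) & 0x1ffffff);
}
```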

<p>In the default configuration, my engine takes a few seconds to explore
the full game tree, then presents the <a href="https://en.wikipedia.org/wiki/Minimax">minimax</a> values for the
current game state along with the list of perfect moves. The UI allows
manually exploring down the game tree. It’s intended for analysis, but
there’s enough UI present to “play” against the AI should you so wish.
For some of my analysis I made small modifications to the program to
print or count game states matching certain conditions.</p>

<h3 id="game-analysis">Game analysis</h3>

<p>Not accounting for symmetries, there are 4,233,789,642,926,592 possible
playouts. In these playouts, the first player wins 2,179,847,574,830,592
(~51%), the second player wins 1,174,071,341,606,400 (~28%), and the
remaining 879,870,726,489,600 (~21%) are ties. It’s immediately obvious
the first player has a huge advantage.</p>

<p>Accounting for symmetries, there are 8,659,987 total game states. Of
these, 6,955 are terminal states, of which the first player wins 3,599
(~52%) and the second player wins 2,506 (~36%). This small number of
states is what allows the engine to fully explore the game tree in a few
seconds.</p>

<p>Most importantly: <strong>The first player can always win by two points.</strong> In
other words, it’s <em>not</em> like Tic-Tac-Toe where perfect play by both
players results in a tie. Due to the two-point margin, the first player
also has more room for mistakes and usually wins even without perfect
play. There are fewer opportunities to blunder, and a single blunder
usually results in a lower win score. The second player has a narrow
lane of perfect play, making it easy to blunder.</p>

<p>Below is the minimax analysis for the first player’s options. The number
is the first player’s score given perfect play from that point — i.e.
perfect play starts on the tiles marked “2”, and the tiles marked “0”
are blunders that lead to ties.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>11111
12021
10-01
12021
11111
</code></pre></div></div>

<p>The special center rule probably exists to reduce the first player’s
obvious advantage, but in practice it makes little difference. Without
the rule, the first player has an additional (fifth) branch for a win by
two points:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>11111
12021
10201
12021
11111
</code></pre></div></div>

<p>Improved alternative special rule: <strong>Bias the score by two in favor of
the second player.</strong> This fully eliminates the first player’s advantage,
perfect play by both sides results in a tie, and both players have a
narrow lane of perfect play.</p>

<p>The four tie openers are interesting because the reasoning does not
require computer assistance. If the first player opens on any of those
tiles, the second player can mirror each of the first player’s moves,
guaranteeing a tie. Note: The first player can still make mistakes that
result in a second-player win <em>if</em> the second player knows when to stop
mirroring.</p>

<p>One of my goals was to develop a heuristic so that even human players
can play perfectly from memory, as in Tic-Tac-Toe. Unfortunately I was
not able to develop any such heuristic, though I <em>was</em> able to prove
that <strong>a greedy heuristic — always claim as much territory as possible —
is often incorrect</strong> and, in some cases, leads to blunders.</p>

<h3 id="engine-implementation">Engine implementation</h3>

<p>As <a href="/blog/2017/04/27/">I’ve done before</a>, my engine represents the game using
<a href="https://www.chessprogramming.org/Bitboards">bitboards</a>. Each player has a 25-bit bitboard representing their
pieces. To make move validation more efficient, it also sometimes tracks
a “mask” bitboard where invalid moves have been masked. Updating all
bitboards is cheap (<code class="language-plaintext highlighter-rouge">place()</code>, <code class="language-plaintext highlighter-rouge">mask()</code>), as is validating moves
against the mask (<code class="language-plaintext highlighter-rouge">valid()</code>).</p>
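<p>A hedged sketch of what <code>place()</code> might look like under the packed layout described below (everything beyond that layout is an assumption):</p>

```c
#include <stdint.h>

// Sketch of place(): first player in bits 0-24, second in bits 25-49,
// 1-based turn counter in bits 50-55. Odd turns belong to the first
// player here; the real function's details may differ.
static uint64_t place(uint64_t b, int tile)
{
    int turn  = (int)(b >> 50);
    int shift = (turn & 1) ? 0 : 25;
    b |= (uint64_t)1 << (tile + shift);  // drop the piece
    return b + ((uint64_t)1 << 50);      // advance the turn counter
}
```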

<p>The longest possible game is 32 moves. This would <em>just</em> fit in 5 bits,
except that I needed a special “invalid” turn, making for a total of 33
distinct values. So I use 6 bits to store the turn counter.</p>

<p>Besides generally being unnecessary, the validation masks can be derived
from the main bitboards, so I don’t need to store them in the game tree.
That means I need 25 bits per player, and 6 bits for the counter: <strong>56
bits total</strong>. I pack these into a 64-bit integer. The first player’s
bitboard goes in the bottom 25 bits, the second player in the next 25
bits, and the turn counter in the 6 bits above those (bits 50-55). The
turn counter
starts at 1, so an all zero state is invalid. I exploit this in the hash
table so that zeroed slots are empty (more on this later).</p>

<p>In other words, the <em>empty</em> state is <code class="language-plaintext highlighter-rouge">0x4000000000000</code> (<code class="language-plaintext highlighter-rouge">INIT</code>) and zero
is the null (invalid) state.</p>

<p>Since the state is so small, rather than passing a pointer to a state to
be acted upon, bitboard functions return a new bitboard with the
requested changes… functional style.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="c1">// Compute bitboard+mask where first play is tile 6</span>
    <span class="c1">// -----</span>
    <span class="c1">// -X---</span>
    <span class="c1">// -----</span>
    <span class="c1">// -----</span>
    <span class="c1">// -----</span>
    <span class="kt">uint64_t</span> <span class="n">b</span> <span class="o">=</span> <span class="n">INIT</span><span class="p">;</span>
    <span class="kt">uint64_t</span> <span class="n">m</span> <span class="o">=</span> <span class="n">INIT</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">place</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="mi">6</span><span class="p">);</span>
    <span class="n">m</span> <span class="o">=</span> <span class="n">mask</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="mi">6</span><span class="p">);</span>
</code></pre></div></div>

<h4 id="minimax-costs">Minimax costs</h4>

<p>The engine uses minimax to propagate information up the tree. Since the
search extends to the very bottom of the tree, the minimax “heuristic”
evaluation function is the actual score, not an approximation, which is
why it’s able to play perfectly.</p>

<p>When <a href="/blog/2010/10/17/">I’ve used minimax before</a>, I built an actual tree data
structure in memory, linking states by pointer / reference. In this
engine there is no such linkage, and instead the links are computed
dynamically via the validation masks. Storing the pointers is more
expensive than computing their equivalents on the fly, <em>so I don’t store
them</em>. Therefore my game tree only requires 56 bits per node — or 64
bits in practice since I’m using a 64-bit integer. With only 8,659,987
nodes to store, that’s a mere 66MiB of memory! This analysis could have
easily been done on commodity hardware two decades ago.</p>

<p>What about the minimax values? Game scores range from -10 to 11: 22
distinct values. (That the first player can score up to 11 and the
second player at most 10 is another advantage to going first.) That’s 5
bits of information. However, I didn’t have this information up front,
and so I assumed a range from -25 to 25, which requires 6 bits.</p>

<p>There are still 8 spare bits left in the 64-bit integer, so I use 6 of
them for the minimax score. Rather than worry about two’s complement, I
bias the score to eliminate negative values before storing it. So the
minimax score rides along for free above the state bits.</p>
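<p>Packing and unpacking the biased score might look like this sketch (the exact field position within the spare bits is an assumption):</p>

```c
#include <stdint.h>

// Sketch: bias the minimax score by +25 so it is never negative, then
// store it in 6 of the 8 spare bits above the 56 state bits. The low
// 56 bits of the slot remain the game state itself.
static uint64_t set_score(uint64_t state, int score)
{
    return (state & 0xffffffffffffff) | (uint64_t)(score + 25) << 56;
}

static int get_score(uint64_t state)
{
    return (int)(state >> 56) - 25;
}
```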

<h4 id="hash-table-memoization">Hash table (memoization)</h4>

<p>The vast majority of game tree branches are redundant. Even without
taking symmetries into account, nearly all states are reachable from
multiple branches. Exploring all these redundant branches would take
centuries. If I run into a state I’ve seen before, I don’t want to
recompute it.</p>

<p>Once I’ve computed a result, I store it in a hash table so that I can
find it later. Since the state is just a 64-bit integer, I use <a href="/blog/2018/07/31/">an
integer hash function</a> to compute a starting index from which to
linearly probe an open addressing hash table. The <em>entire</em> hash table
implementation is literally a dozen lines of code:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span> <span class="o">*</span>
<span class="nf">lookup</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">bitboard</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="kt">uint64_t</span> <span class="n">table</span><span class="p">[</span><span class="n">N</span><span class="p">];</span>
    <span class="kt">uint64_t</span> <span class="n">mask</span> <span class="o">=</span> <span class="mh">0xffffffffffffff</span><span class="p">;</span> <span class="c1">// sans minimax</span>
    <span class="kt">uint64_t</span> <span class="n">hash</span> <span class="o">=</span> <span class="n">bitboard</span><span class="p">;</span>
    <span class="n">hash</span> <span class="o">*=</span> <span class="mh">0xcca1cee435c5048f</span><span class="p">;</span>
    <span class="n">hash</span> <span class="o">^=</span> <span class="n">hash</span> <span class="o">&gt;&gt;</span> <span class="mi">32</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">hash</span> <span class="o">%</span> <span class="n">N</span><span class="p">;</span> <span class="p">;</span> <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">N</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">table</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">||</span> <span class="p">(</span><span class="n">table</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">&amp;</span><span class="n">mask</span><span class="p">)</span> <span class="o">==</span> <span class="n">bitboard</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="o">&amp;</span><span class="n">table</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If the bitboard is not found, it returns a pointer to the (zero-valued)
slot where it should go so that the caller can fill it in.</p>

<h4 id="canonicalization">Canonicalization</h4>

<p>Memoization eliminates nearly all redundancy, but there’s still a major
optimization left. Many states are equivalent under rotation or reflection.
Taking that into account, about 7/8th of the remaining work can still be
eliminated.</p>

<p>Multiple different states that are identical by symmetry must be
somehow “folded” into a single, <em>canonical</em> state to represent them all.
I do this by visiting all 8 rotations and reflections and choosing the
one with the smallest 64-bit integer representation.</p>

<p>I only need two operations to visit all 8 symmetries, and I chose
transpose (flip around the diagonal) and vertical flip. Alternating
between these operations visits each symmetry. Since they’re bitboards,
transforms can be implemented using <a href="https://www.chessprogramming.org/Flipping_Mirroring_and_Rotating">fancy bit-twiddling hacks</a>.
Chess boards, with their power-of-two dimensions, have useful properties
which these British Square boards lack, so this is the best I could come
up with:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Transpose a board or mask (flip along the diagonal).</span>
<span class="kt">uint64_t</span>
<span class="nf">transpose</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00000020000010</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span> <span class="mi">12</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00000410000208</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span>  <span class="mi">8</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00008208004104</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span>  <span class="mi">4</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00104104082082</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span>  <span class="mi">0</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xfe082083041041</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&lt;&lt;</span>  <span class="mi">4</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x01041040820820</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&lt;&lt;</span>  <span class="mi">8</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00820800410400</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&lt;&lt;</span> <span class="mi">12</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00410000208000</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&lt;&lt;</span> <span class="mi">16</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00200000100000</span><span class="p">);</span>
<span class="p">}</span>

<span class="c1">// Flip a board or mask vertically.</span>
<span class="kt">uint64_t</span>
<span class="nf">flipv</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span> <span class="mi">20</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x0000003e00001f</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span> <span class="mi">10</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x000007c00003e0</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span>  <span class="mi">0</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xfc00f800007c00</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&lt;&lt;</span> <span class="mi">10</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x001f00000f8000</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&lt;&lt;</span> <span class="mi">20</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x03e00001f00000</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>These transform both players’ bitboards in parallel while leaving the
turn counter intact. The logic here is quite simple: Shift the bitboard
a little bit at a time while using a mask to deposit bits in their new
home once they’re lined up. It’s like a coin sorter. Vertical flip is
analogous to byte-swapping, though with 5-bit “bytes”.</p>

<p>Canonicalizing a bitboard now looks like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span>
<span class="nf">canonicalize</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">c</span> <span class="o">=</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">transpose</span><span class="p">(</span><span class="n">b</span><span class="p">);</span> <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">flipv</span><span class="p">(</span><span class="n">b</span><span class="p">);</span>     <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">transpose</span><span class="p">(</span><span class="n">b</span><span class="p">);</span> <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">flipv</span><span class="p">(</span><span class="n">b</span><span class="p">);</span>     <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">transpose</span><span class="p">(</span><span class="n">b</span><span class="p">);</span> <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">flipv</span><span class="p">(</span><span class="n">b</span><span class="p">);</span>     <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">transpose</span><span class="p">(</span><span class="n">b</span><span class="p">);</span> <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">c</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Callers need only use <code class="language-plaintext highlighter-rouge">canonicalize()</code> on values they pass to <code class="language-plaintext highlighter-rouge">lookup()</code>
or store in the table (via the returned pointer).</p>

<h3 id="developing-a-heuristic">Developing a heuristic</h3>

<p>If you can come up with a perfect play heuristic, especially one that
can be reasonably performed by humans, I’d like to hear it. My engine
has a built-in heuristic tester, so I can test it against perfect play
at all possible game positions to check that it actually works. It’s
currently programmed to test the greedy heuristic and print out the
millions of cases where it fails. Even a heuristic that fails in only a
small number of cases would be pretty reasonable.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>When the Compiler Bites</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/05/01/"/>
    <id>urn:uuid:02b974e1-e25b-397d-a16f-c754338e9c1e</id>
    <updated>2018-05-01T23:28:06Z</updated>
    <category term="c"/><category term="x86"/><category term="optimization"/><category term="ai"/><category term="netsec"/>
    <content type="html">
      <![CDATA[<p><em>Update: There are discussions <a href="https://old.reddit.com/r/cpp/comments/8gfhq3/when_the_compiler_bites/">on Reddit</a> and <a href="https://news.ycombinator.com/item?id=16974770">on Hacker
News</a>.</em></p>

<p>So far this year I’ve been bitten three times by compiler edge cases
in GCC and Clang, each time catching me totally by surprise. Two were
caused by historical artifacts, where an ambiguous specification led
to diverging implementations. The third was a compiler optimization
being far more clever than I expected, behaving almost like an
artificial intelligence.</p>

<p>In all examples I’ll be using GCC 7.3.0 and Clang 6.0.0 on Linux.</p>

<h3 id="x86-64-abi-ambiguity">x86-64 ABI ambiguity</h3>

<p>The first time I was bit — or, well, narrowly avoided being bit — was
when I examined a missed floating point optimization in both Clang and
GCC. Consider this function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">double</span>
<span class="nf">zero_multiply</span><span class="p">(</span><span class="kt">double</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">*</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The function multiplies its argument by zero and returns the result. Any
number multiplied by zero is zero, so this should always return zero,
right? Unfortunately, no. IEEE 754 floating point arithmetic supports
NaN, infinities, and signed zeros. This function can return NaN,
positive zero, or negative zero. (In some cases, the operation could
also potentially produce a hardware exception.)</p>
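
<p>Those special cases are easy to verify directly. A minimal
demonstration of my own (not from the original examples) of why the
compiler cannot fold the multiply to zero:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;assert.h&gt;
#include &lt;math.h&gt;

int
main(void)
{
    assert(isnan(NAN * 0.0));     /* NaN in, NaN out */
    assert(signbit(-1.0 * 0.0));  /* result is negative zero */
    assert(!signbit(1.0 * 0.0));  /* result is positive zero */
    return 0;
}
</code></pre></div></div>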

<p>As a result, both GCC and Clang perform the multiply:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">zero_multiply:</span>
    <span class="nf">xorpd</span>  <span class="nv">xmm1</span><span class="p">,</span> <span class="nv">xmm1</span>
    <span class="nf">mulsd</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm1</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">-ffast-math</code> option relaxes the C standard floating point rules,
permitting an optimization at the cost of conformance and
<a href="https://possiblywrong.wordpress.com/2017/09/12/floating-point-agreement-between-matlab-and-c/">consistency</a>:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">zero_multiply:</span>
    <span class="nf">xorps</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm0</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>Side note: <code class="language-plaintext highlighter-rouge">-ffast-math</code> doesn’t necessarily mean “less precise.”
Sometimes it will actually <a href="https://en.wikipedia.org/wiki/Multiply–accumulate_operation#Fused_multiply–add">improve precision</a>.</p>

<p>Here’s a modified version of the function that’s a little more
interesting. I’ve changed the argument to a <code class="language-plaintext highlighter-rouge">short</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">double</span>
<span class="nf">zero_multiply_short</span><span class="p">(</span><span class="kt">short</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">*</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s no longer possible for the argument to be one of those special
values. The <code class="language-plaintext highlighter-rouge">short</code> will be promoted to one of 65,536 possible <code class="language-plaintext highlighter-rouge">double</code>
values, each of which results in 0.0 when multiplied by 0.0. GCC misses
this optimization (<code class="language-plaintext highlighter-rouge">-Os</code>):</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">zero_multiply_short:</span>
    <span class="nf">movsx</span>     <span class="nb">edi</span><span class="p">,</span> <span class="nb">di</span>       <span class="c1">; sign-extend 16-bit argument</span>
    <span class="nf">xorps</span>     <span class="nv">xmm1</span><span class="p">,</span> <span class="nv">xmm1</span>    <span class="c1">; xmm1 = 0.0</span>
    <span class="nf">cvtsi2sd</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="nb">edi</span>     <span class="c1">; convert int to double</span>
    <span class="nf">mulsd</span>     <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm1</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>Clang also misses this optimization:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">zero_multiply_short:</span>
    <span class="nf">cvtsi2sd</span> <span class="nv">xmm1</span><span class="p">,</span> <span class="nb">edi</span>
    <span class="nf">xorpd</span>    <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm0</span>
    <span class="nf">mulsd</span>    <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm1</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>But hang on a minute. This is shorter by one instruction. What
happened to the sign-extension (<code class="language-plaintext highlighter-rouge">movsx</code>)? Clang is treating that
<code class="language-plaintext highlighter-rouge">short</code> argument as if it were a 32-bit value. Why do GCC and Clang
differ? Is GCC doing something unnecessary?</p>

<p>It turns out that the <a href="https://www.uclibc.org/docs/psABI-x86_64.pdf">x86-64 ABI</a> didn’t specify what happens with
the upper bits in argument registers. Are they garbage? Are they zeroed?
GCC takes the conservative position of assuming the upper bits are
arbitrary garbage. Clang takes the boldest position of assuming
arguments smaller than 32 bits have been promoted to 32 bits by the
caller. This is what the ABI specification <em>should</em> have said, but
currently it does not.</p>

<p>Fortunately GCC is also conservative when passing arguments. It promotes
arguments to 32 bits as necessary, so there are no conflicts when
linking against Clang-compiled code. However, this is not true for
Intel’s ICC compiler: <a href="https://web.archive.org/web/20180908113552/https://stackoverflow.com/a/36760539"><strong>Clang and ICC are not ABI-compatible on
x86-64</strong></a>.</p>

<p>I don’t use ICC, so that particular issue wouldn’t bite me, <em>but</em> if I
was ever writing assembly routines that called Clang-compiled code, I’d
eventually get bit by this.</p>

<h3 id="floating-point-precision">Floating point precision</h3>

<p>Without looking it up or trying it, what does this function return?
Think carefully.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">float_compare</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">float</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">1</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">==</span> <span class="mi">1</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Confident in your answer? This is a trick question, because it can
return either 0 or 1 depending on the compiler. Boy was I confused when
this comparison returned 0 in my real world code.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc   -std=c99 -m32 cmp.c  # float_compare() == 0
$ clang -std=c99 -m32 cmp.c  # float_compare() == 1
</code></pre></div></div>

<p>So what’s going on here? The original ANSI C specification wasn’t
clear about how intermediate floating point values get rounded, and
implementations <a href="https://news.ycombinator.com/item?id=16974770">all did it differently</a>. The C99 specification
cleaned this all up and introduced <a href="https://en.wikipedia.org/wiki/C99#IEEE_754_floating_point_support"><code class="language-plaintext highlighter-rouge">FLT_EVAL_METHOD</code></a>.
Implementations can still differ, but at least you can now determine
at compile-time what the compiler would do by inspecting that macro.</p>
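
<p>For example, a minimal sketch that reports the evaluation method
using only the standard macro:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;float.h&gt;
#include &lt;stdio.h&gt;

int
main(void)
{
    /* FLT_EVAL_METHOD is a compile-time constant from float.h (C99). */
    switch (FLT_EVAL_METHOD) {
    case 0:  puts("evaluate at each type's own precision");   break;
    case 1:  puts("evaluate float and double as double");     break;
    case 2:  puts("evaluate everything as long double");      break;
    default: puts("indeterminable (implementation-defined)"); break;
    }
    return 0;
}
</code></pre></div></div>

<p>GCC with <code class="language-plaintext highlighter-rouge">-m32</code> lands in the <code class="language-plaintext highlighter-rouge">long double</code> case; on x86-64 both
compilers report case 0.</p>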

<p>Back in the late 1980s or early 1990s, when the GCC developers were
deciding how GCC should implement floating point arithmetic, the trend
at the time was to use as much precision as possible. On the x86 this
meant using its support for 80-bit extended precision floating point
arithmetic. Floating point operations are performed in <code class="language-plaintext highlighter-rouge">long double</code>
precision and truncated afterward (<code class="language-plaintext highlighter-rouge">FLT_EVAL_METHOD == 2</code>).</p>

<p>In <code class="language-plaintext highlighter-rouge">float_compare()</code> the left-hand side is truncated to a <code class="language-plaintext highlighter-rouge">float</code> by the
assignment, but the right-hand side, <em>despite being a <code class="language-plaintext highlighter-rouge">float</code> literal</em>,
is actually “1.3” at 80 bits of precision as far as GCC is concerned.
That’s pretty unintuitive!</p>
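
<p>A defensive workaround I can sketch (my own, not from the code
above): round-trip both operands through <code class="language-plaintext highlighter-rouge">float</code>-typed storage so any
excess precision is discarded before the comparison. The <code class="language-plaintext highlighter-rouge">volatile</code>
keeps the compiler from holding either value in a wider register:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;assert.h&gt;

/* Both operands pass through memory as genuine 32-bit floats, so the
 * comparison sees float precision regardless of FLT_EVAL_METHOD. */
int
float_compare_fixed(void)
{
    volatile float x = 1.3f;
    volatile float y = 1.3f;
    return x == y;
}

int
main(void)
{
    assert(float_compare_fixed() == 1);  /* same compiler, any target */
    return 0;
}
</code></pre></div></div>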

<p>The remnants of this high precision trend are still in JavaScript, where
all arithmetic is double precision (even if <a href="http://thibaultlaurens.github.io/javascript/2013/04/29/how-the-v8-engine-works/#more-example-on-how-v8-optimized-javascript-code">simulated using
integers</a>), and great pains have been taken <a href="https://blog.mozilla.org/javascript/2013/11/07/efficient-float32-arithmetic-in-javascript/">to work around</a>
the performance consequences of this. <a href="http://tirania.org/blog/archive/2018/Apr-11.html">Until recently</a>, Mono had
similar issues.</p>

<p>The trend reversed once SIMD hardware became widely available and
there were huge performance gains to be had. Multiple values could be
computed at once, side by side, at lower precision. So on x86-64, this
became the default (<code class="language-plaintext highlighter-rouge">FLT_EVAL_METHOD == 0</code>). The young Clang compiler
wasn’t around until well after this trend reversed, so it behaves
differently than the <a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=323">backwards compatible</a> GCC on the old x86.</p>

<p>I’m a little ashamed that I’m only finding out about this now. However,
by the time I was competent enough to notice and understand this issue,
I was already doing nearly all my programming on the x86-64.</p>

<h3 id="built-in-function-elimination">Built-in Function Elimination</h3>

<p>I’ve saved this one for last since it’s my favorite. Suppose we have
this little function, <code class="language-plaintext highlighter-rouge">new_image()</code>, that allocates a greyscale image
for, say, <a href="/blog/2017/11/03/">some multimedia library</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span>
<span class="nf">new_image</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">w</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">h</span><span class="p">,</span> <span class="kt">int</span> <span class="n">shade</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">w</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">h</span> <span class="o">&lt;=</span> <span class="n">SIZE_MAX</span> <span class="o">/</span> <span class="n">w</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// overflow?</span>
        <span class="n">p</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">w</span> <span class="o">*</span> <span class="n">h</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">p</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">memset</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">shade</span><span class="p">,</span> <span class="n">w</span> <span class="o">*</span> <span class="n">h</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s a static function because this would be part of some <a href="https://github.com/nothings/stb">slick
header library</a> (and, secretly, because it’s necessary for
illustrating the issue). Being a responsible citizen, the function
even <a href="/blog/2017/07/19/">checks for integer overflow</a> before allocating anything.</p>

<p>I write a unit test to make sure it detects overflow. This function
should return 0.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* expected return == 0 */</span>
<span class="kt">int</span>
<span class="nf">test_new_image_overflow</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">new_image</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">SIZE_MAX</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="k">return</span> <span class="o">!!</span><span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>So far my test passes. Good.</p>

<p>I’d also like to make sure it correctly returns NULL — or, more
specifically, that it doesn’t crash — if the allocation fails. But how
can I make <code class="language-plaintext highlighter-rouge">malloc()</code> fail? As a hack I can pass image dimensions that
I know cannot ever practically be allocated. Essentially I want to
force a <code class="language-plaintext highlighter-rouge">malloc(SIZE_MAX)</code>, i.e. allocate every available byte in my
virtual address space. For a conventional 64-bit machine, that’s 16
exbibytes of memory, and it leaves space for nothing else, including
the program itself.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* expected return == 0 */</span>
<span class="kt">int</span>
<span class="nf">test_new_image_oom</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">new_image</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">SIZE_MAX</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">);</span>
    <span class="k">return</span> <span class="o">!!</span><span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I compile with GCC, test passes. I compile with Clang and the test
fails. That is, <strong>the test somehow managed to allocate 16 exbibytes of
memory, <em>and</em> initialize it</strong>. Wat?</p>

<p>Disassembling the tests reveals what’s going on:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">test_new_image_overflow:</span>
    <span class="nf">xor</span>  <span class="nb">eax</span><span class="p">,</span> <span class="nb">eax</span>
    <span class="nf">ret</span>

<span class="nl">test_new_image_oom:</span>
    <span class="nf">mov</span>  <span class="nb">eax</span><span class="p">,</span> <span class="mi">1</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>The first test is actually being evaluated at compile time by the
compiler. The function being tested was inlined into the unit test
itself. This permits the compiler to collapse the whole thing down to
a single instruction. The path with <code class="language-plaintext highlighter-rouge">malloc()</code> became dead code and
was trivially eliminated.</p>

<p>In the second test, Clang correctly determined that the image buffer is
not actually being used, despite the <code class="language-plaintext highlighter-rouge">memset()</code>, so it eliminated the
allocation altogether and then <em>simulated</em> a successful allocation
despite it being absurdly large. Allocating memory is not an observable
side effect as far as the language specification is concerned, so it’s
allowed to do this. My thinking was wrong, and the compiler outsmarted
me.</p>
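
<p>One way to keep such a test honest (a sketch of a workaround, not
the library’s actual code; the <code class="language-plaintext highlighter-rouge">sink</code> name is my own) is to launder
the pointer through a volatile object. A volatile store is an
observable side effect, so the compiler must assume the allocation is
used and cannot elide it:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;

static unsigned char *
new_image(size_t w, size_t h, int shade)
{
    unsigned char *p = 0;
    if (w == 0 || h &lt;= SIZE_MAX / w) { /* overflow? */
        p = malloc(w * h);
        if (p) {
            memset(p, shade, w * h);
        }
    }
    return p;
}

/* The volatile store below makes the pointer observably escape,
 * so the malloc() can no longer be optimized away. */
static void *volatile sink;

/* expected return == 0 */
int
test_new_image_oom(void)
{
    void *p = new_image(1, SIZE_MAX, 0xff);
    sink = p;
    return !!p;
}

int
main(void)
{
    return test_new_image_oom();
}
</code></pre></div></div>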

<p>I soon realized I can take this further and trick Clang into
performing an invalid optimization, <a href="https://bugs.llvm.org/show_bug.cgi?id=37304">revealing a bug</a>. Consider
this slightly-optimized version that uses <code class="language-plaintext highlighter-rouge">calloc()</code> when the shade is
zero (black). The <code class="language-plaintext highlighter-rouge">calloc()</code> function does its own overflow check, so
<code class="language-plaintext highlighter-rouge">new_image()</code> doesn’t need to do it.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="o">*</span>
<span class="nf">new_image</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">w</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">h</span><span class="p">,</span> <span class="kt">int</span> <span class="n">shade</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">shade</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// shortcut</span>
        <span class="n">p</span> <span class="o">=</span> <span class="n">calloc</span><span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="n">h</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">w</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">h</span> <span class="o">&lt;=</span> <span class="n">SIZE_MAX</span> <span class="o">/</span> <span class="n">w</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// overflow?</span>
        <span class="n">p</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">w</span> <span class="o">*</span> <span class="n">h</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">p</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">memset</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">shade</span><span class="p">,</span> <span class="n">w</span> <span class="o">*</span> <span class="n">h</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>With this change, my overflow unit test is now also failing. The
situation is even worse than before. The <code class="language-plaintext highlighter-rouge">calloc()</code> is being
eliminated <em>despite the overflow</em>, and replaced with a simulated
success. This time it’s actually a bug in Clang. While failing a unit
test is mostly harmless, <strong>this could introduce a vulnerability in a
real program</strong>. The OpenBSD folks are so worried about this sort of
thing that <a href="https://marc.info/?l=openbsd-cvs&amp;m=150125592126437&amp;w=2">they’ve disabled this optimization</a>.</p>

<p>Here’s a slightly-contrived example of this. Imagine a program that
maintains a table of unsigned integers, and we want to keep track of
how many times the program has accessed each table entry. The “access
counter” table is initialized to zero, but the table of values need
not be initialized, since they’ll be written before first access (or
something like that).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">table</span> <span class="p">{</span>
    <span class="kt">unsigned</span> <span class="o">*</span><span class="n">counter</span><span class="p">;</span>
    <span class="kt">unsigned</span> <span class="o">*</span><span class="n">values</span><span class="p">;</span>
<span class="p">};</span>

<span class="k">static</span> <span class="kt">int</span>
<span class="nf">table_init</span><span class="p">(</span><span class="k">struct</span> <span class="n">table</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">t</span><span class="o">-&gt;</span><span class="n">counter</span> <span class="o">=</span> <span class="n">calloc</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">counter</span><span class="p">));</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">counter</span><span class="p">)</span> <span class="p">{</span>
        <span class="cm">/* Overflow already tested above */</span>
        <span class="n">t</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">n</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">));</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">free</span><span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">counter</span><span class="p">);</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// fail</span>
        <span class="p">}</span>
        <span class="k">return</span> <span class="mi">1</span><span class="p">;</span> <span class="c1">// success</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// fail</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This function relies on the overflow test in <code class="language-plaintext highlighter-rouge">calloc()</code> for the second
<code class="language-plaintext highlighter-rouge">malloc()</code> allocation. However, this is a static function that’s
likely to get inlined, as we saw before. If the program doesn’t
actually make use of the <code class="language-plaintext highlighter-rouge">counter</code> table, and Clang is able to
statically determine this fact, it may eliminate the <code class="language-plaintext highlighter-rouge">calloc()</code>. This
would also <strong>eliminate the overflow test, introducing a
vulnerability</strong>. If an attacker can control <code class="language-plaintext highlighter-rouge">n</code>, then they can
overwrite arbitrary memory through that <code class="language-plaintext highlighter-rouge">values</code> pointer.</p>

<h3 id="the-takeaway">The takeaway</h3>

<p>Besides this surprising little bug, the main lesson for me is that I
should probably isolate unit tests from the code being tested. The
easiest solution is to put them in separate translation units and don’t
use link-time optimization (LTO). Allowing tested functions to be
inlined into the unit tests is probably a bad idea.</p>

<p>The unit test issues in my <em>real</em> program, which was <a href="https://github.com/skeeto/growable-buf">a bit more
sophisticated</a> than what was presented here, gave me artificial
intelligence vibes. It’s that situation where a computer algorithm did
something really clever and I felt it outsmarted me. It’s creepy to
consider <a href="https://wiki.lesswrong.com/wiki/Paperclip_maximizer">how far that can go</a>. I’ve gotten that even from
observing <a href="/blog/2017/04/27/">AI I’ve written myself</a>, and I know for sure no human
taught it some particularly clever trick.</p>

<p>My favorite AI story along these lines is about <a href="https://www.youtube.com/watch?v=xOCurBYI_gY">an AI that learned
how to play games on the Nintendo Entertainment System</a>. It
didn’t understand the games it was playing. Its optimization task was
simply to choose controller inputs that maximized memory values,
because that’s generally associated with doing well — higher scores,
more progress, etc. The most unexpected part came when playing Tetris.
Eventually the screen would fill up with blocks, and the AI would face
the inevitable situation of losing the game, with all that memory
being reinitialized to low values. So what did it do?</p>

<p>Just before the end it would pause the game and wait… forever.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>Two Games with Monte Carlo Tree Search</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/04/27/"/>
    <id>urn:uuid:b6f77cb1-01df-3714-4ba0-1859614364da</id>
    <updated>2017-04-27T21:27:50Z</updated>
    <category term="c"/><category term="ai"/><category term="game"/>
    <content type="html">
      <![CDATA[<p><em>Update 2020: A DOS build of Connect Four <a href="https://www.youtube.com/watch?v=K00BylbOQUo">was featured on GET OFF MY
LAWN</a>.</em></p>

<p><a href="https://jeffbradberry.com/posts/2015/09/intro-to-monte-carlo-tree-search/">Monte Carlo tree search</a> (MCTS) is the most impressive game
artificial intelligence I’ve ever used. At its core it simulates a
large number of games (<em>playouts</em>), starting from the current game
state, using random moves for each player. Then it simply picks the
move where it won most often. This description is sufficient to spot
one of its most valuable features: <strong>MCTS requires no knowledge of
strategy or effective play</strong>. The game’s rules — enough to simulate
the game — are all that’s needed to allow the AI to make decent moves.
Expert knowledge still makes for a stronger AI, but, for many games,
it’s unnecessary to construct a decent opponent.</p>
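<p>The playout loop can be made concrete with a small C sketch against a toy
game of my own invention (take 1 or 2 stones, last stone wins): run random
playouts behind each candidate move, then keep the move that won most often.
This is only pure Monte Carlo; MCTS adds the tree and smarter in-tree
selection on top of it.</p>

```c
#include <assert.h>

/* Toy stand-in for a real game (not from the engines below): players
 * alternately take 1 or 2 stones; whoever takes the last stone wins. */

static unsigned xorshift(unsigned *s)
{
    *s ^= *s << 13;
    *s ^= *s >> 17;
    *s ^= *s << 5;
    return *s;
}

/* Finish one game with random moves; returns the winner (0 or 1). */
static int playout(int pile, int player, unsigned *rng)
{
    for (;;) {
        int take = 1 + xorshift(rng) % 2;
        if (take >= pile) {
            return player;  /* took the last stone */
        }
        pile -= take;
        player = !player;
    }
}

/* Pure Monte Carlo: n random playouts per candidate move, then pick
 * the move that won most often for the player to move. */
static int best_move(int pile, int player, int n, unsigned *rng)
{
    int best = 1;
    double best_rate = -1;
    for (int take = 1; take <= 2 && take <= pile; take++) {
        int wins = 0;
        for (int i = 0; i < n; i++) {
            if (take == pile) {
                wins++;  /* taking everything wins immediately */
            } else if (playout(pile - take, !player, rng) == player) {
                wins++;
            }
        }
        double rate = (double)wins / n;
        if (rate > best_rate) {
            best_rate = rate;
            best = take;
        }
    }
    return best;
}
```

<p>From a pile of 4 the only winning move is to take 1, and random playouts
find it with no knowledge of the game’s strategy.</p>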

<p>A second valuable feature is that it’s easy to parallelize. Unlike
<a href="/blog/2011/08/24/">alpha-beta pruning</a>, which doesn’t mix well with parallel
searches of a Minimax tree, Monte Carlo simulations are practically
independent and can be run in parallel.</p>

<p>Finally, the third valuable feature is that the search can be stopped
at any time. The completion of any single simulation is as good a
stopping point as any. It could be due to a time limit, a memory
limit, or both. In general, the algorithm <em>converges</em> to a best move
rather than suddenly discovering it. The good moves are identified
quickly, and further simulations work to choose among them. More
simulations make for better moves, with exponentially diminishing
returns. Minimax, by contrast, risks never having explored the good
moves at all when stopped early.</p>

<p>To try out MCTS myself, I wrote two games employing it:</p>

<ul>
  <li><a href="https://github.com/skeeto/connect4"><strong>Connect Four</strong></a> [<a href="https://github.com/skeeto/connect4/releases/download/1.0/connect4.exe">.exe x64</a>, 173kB]</li>
  <li><a href="https://github.com/skeeto/yavalath"><strong>Yavalath</strong></a>      [<a href="https://github.com/skeeto/yavalath/releases/download/1.0/yavalath.exe">.exe x64</a>, 174kB]</li>
</ul>

<p>They’re both written in C, for both unix-like and Windows, and should
be <a href="/blog/2017/03/30/">easy to build</a>. <strong>I challenge you to beat them both.</strong> The
Yavalath AI is easier to beat due to having blind spots, which I’ll
discuss below. The Connect Four AI is more difficult and will likely
take a number of tries.</p>

<h3 id="connect-four">Connect Four</h3>

<p><a href="/img/mcts/connect4.png"><img src="/img/mcts/connect4-thumb.png" alt="" /></a></p>

<p>MCTS works very well with Connect Four, and only requires modest
resources: 32MB of memory to store the results of random playouts, and
500,000 game simulations. With a few tweaks, it can even be run in
DOSBox. It stops when it hits either of those limits. In theory,
increasing both would make for stronger moves, but in practice I can’t
detect any difference. It’s like <a href="https://curiosity-driven.org/pi-approximation">computing pi with Monte Carlo</a>,
where eventually it just runs out of precision to make any more
progress.</p>

<p>Based on my simplified description above, you might wonder why it needs
all that memory. Not only does MCTS need to track its win/loss ratio for
each available move from the current state, it tracks the win/loss ratio
for moves in the states behind those moves. A large chunk of the game
tree is kept in memory to track all of the playout results. This is why
MCTS needs a lot more memory than Minimax, which can discard branches
that have been searched.</p>

<p><img src="/img/mcts/tree.svg" alt="" /></p>

<p>A convenient property of this tree is that the branch taken in the
actual game can be re-used in a future search. The root of the tree
becomes the node representing the taken game state, which has already
seen a number of playouts. Even better, MCTS is weighted towards
exploring good moves over bad moves, and good moves are more likely to
be taken in the real game. In general, a significant portion of the tree
gets to be reused in a future search.</p>

<p>I’m going to skip most of the details of the algorithm itself and focus
on my implementation. Other articles do a better job at detailing the
algorithm than I could.</p>

<p>My Connect Four engine doesn’t use dynamic allocation for this tree (or
at all). Instead it manages a static buffer — an array of tree nodes,
each representing a game state. All nodes are initially chained together
into a linked list of free nodes. As the tree is built, nodes are pulled
off the free list and linked together into a tree. When the game
advances to the next state, nodes on unreachable branches are added back
to the free list.</p>

<p>If at any point the free list is empty when a new node is needed, the
current search aborts. This is the out-of-memory condition, and no more
searching can be performed.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* Connect Four is normally a 7 by 6 grid. */</span>
<span class="cp">#define CONNECT4_WIDTH  7
#define CONNECT4_HEIGHT 6
</span>
<span class="k">struct</span> <span class="n">connect4_node</span> <span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">next</span><span class="p">[</span><span class="n">CONNECT4_WIDTH</span><span class="p">];</span>      <span class="c1">// "pointer" to next node</span>
    <span class="kt">uint32_t</span> <span class="n">playouts</span><span class="p">[</span><span class="n">CONNECT4_WIDTH</span><span class="p">];</span>  <span class="c1">// number of playouts</span>
    <span class="kt">float</span>    <span class="n">score</span><span class="p">[</span><span class="n">CONNECT4_WIDTH</span><span class="p">];</span>     <span class="c1">// pseudo win/loss ratio</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Rather than native C pointers, the structure uses 32-bit indexes into
the master array. This saves a lot of memory on 64-bit systems, and the
structure is the same size no matter the pointer size of the host. The
<code class="language-plaintext highlighter-rouge">next</code> field points to the next state for the nth move. Since 0 is a
valid index, -1 represents null (<code class="language-plaintext highlighter-rouge">CONNECT4_NULL</code>).</p>
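<p>Here is a minimal sketch of how such an index-based pool can work. It is my
own illustration rather than the engine’s actual code: free nodes chain
through their first <code class="language-plaintext highlighter-rouge">next</code> slot, and allocation pops from that chain.</p>

```c
#include <assert.h>
#include <stdint.h>

#define POOL_SIZE     1024
#define CONNECT4_NULL ((uint32_t)-1)

struct node {
    uint32_t next[7];  /* "pointers" are indexes into the pool */
};

static struct node pool[POOL_SIZE];
static uint32_t free_head = CONNECT4_NULL;

/* Chain every node into the free list through next[0]. */
static void pool_init(void)
{
    for (uint32_t i = 0; i < POOL_SIZE; i++) {
        pool[i].next[0] = i + 1 < POOL_SIZE ? i + 1 : CONNECT4_NULL;
    }
    free_head = 0;
}

/* Pop a node; CONNECT4_NULL means out of memory (abort the search). */
static uint32_t node_alloc(void)
{
    uint32_t i = free_head;
    if (i != CONNECT4_NULL) {
        free_head = pool[i].next[0];
        for (int m = 0; m < 7; m++) {
            pool[i].next[m] = CONNECT4_NULL;
        }
    }
    return i;
}

/* Return an unreachable node to the free list. */
static void node_free(uint32_t i)
{
    pool[i].next[0] = free_head;
    free_head = i;
}
```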

<p>Each column is a potential move, so there are <code class="language-plaintext highlighter-rouge">CONNECT4_WIDTH</code>
possible moves at any given state. Each move has a floating point
score and a total number of playouts through that move. In my
implementation, <strong>the search can also halt due to an overflow in a
playout counter</strong>. The search can no longer be tracked in this
representation, so it has to stop. This generally only happens when
the game is nearly over and it’s grinding away on a small number of
possibilities.</p>

<p>Note that the actual game state (piece positions) is not tracked in the
node structure. That’s because it’s implicit. We know the state of the
game at the root, and simulating the moves while descending the tree
will keep track of the board state at the current node. That’s more
memory savings.</p>

<p>The state itself is a pair of bitboards, one for each player. Each
position on the grid gets a bit on each bitboard. The bitboard is very
fast to manipulate, and win states are checked with just a handful of
bit operations. My intention was to make playouts as fast as possible.</p>
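<p>For illustration, here is a win check over one player’s bitboard. The
layout is an assumption on my part (a common Connect Four encoding, not
necessarily this engine’s): 7 bits per column with the top bit of each column
left empty as a sentinel, so one shift pair per direction detects four in a
row.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Bit index = column * 7 + row, bit 0 at the bottom of column 0.
 * Each pair of shifts folds runs in one direction: a surviving bit
 * after both folds means four in a row. */
static int connect4_win(uint64_t b)
{
    uint64_t m;
    m = b & (b >> 1);           /* vertical   */
    if (m & (m >> 2)) return 1;
    m = b & (b >> 7);           /* horizontal */
    if (m & (m >> 14)) return 1;
    m = b & (b >> 6);           /* diagonal \ */
    if (m & (m >> 12)) return 1;
    m = b & (b >> 8);           /* diagonal / */
    if (m & (m >> 16)) return 1;
    return 0;
}
```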

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">connect4_ai</span> <span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">state</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>         <span class="c1">// game state at root (bitboard)</span>
    <span class="kt">uint64_t</span> <span class="n">rng</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>           <span class="c1">// random number generator state</span>
    <span class="kt">uint32_t</span> <span class="n">nodes_available</span><span class="p">;</span>  <span class="c1">// total number of nodes available</span>
    <span class="kt">uint32_t</span> <span class="n">nodes_allocated</span><span class="p">;</span>  <span class="c1">// number of nodes in the tree</span>
    <span class="kt">uint32_t</span> <span class="n">root</span><span class="p">;</span>             <span class="c1">// "pointer" to root node</span>
    <span class="kt">uint32_t</span> <span class="n">free</span><span class="p">;</span>             <span class="c1">// "pointer" to free list</span>
    <span class="kt">int</span> <span class="n">turn</span><span class="p">;</span>                  <span class="c1">// whose turn (0 or 1) at the root?</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">nodes_available</code> and <code class="language-plaintext highlighter-rouge">nodes_allocated</code> fields are necessary
for neither correctness nor speed. They’re useful for diagnostics and debugging.</p>

<p>All the functions that operate on these two structures are
straightforward, except for <code class="language-plaintext highlighter-rouge">connect4_playout</code>, a recursive function
which implements the bulk of MCTS. Depending on the state of the node
it’s at, it does one of two things:</p>

<ul>
  <li>
    <p>If there are unexplored moves (<code class="language-plaintext highlighter-rouge">playouts == 0</code>), it randomly chooses
an unplayed move, allocates exactly one node for the state behind that
move, and simulates the rest of the game in a loop, without recursion
or allocating any more nodes.</p>
  </li>
  <li>
    <p>If all moves have been explored at least once, it uses an upper
confidence bound (UCB1) to randomly choose a move, weighted towards
both moves that are under-explored and moves which are strongest.
Striking that balance is one of the challenges. It recurses into that
next state, then updates the node with the result as it propagates
back to the root.</p>
  </li>
</ul>

<p>That’s pretty much all there is to it.</p>

<h3 id="yavalath">Yavalath</h3>

<p><a href="/img/mcts/yavalath.png"><img src="/img/mcts/yavalath-thumb.png" alt="" /></a></p>

<p><a href="http://cambolbro.com/games/yavalath/">Yavalath</a> is a <a href="http://www.genetic-programming.org/hc2012/Browne-Paper-3-Yavalath-07.pdf">board game invented by a computer
program</a>. It’s a pretty fascinating story. The depth and strategy
are disproportionate to its dead simple rules: Get four
in a row without first getting three in a row. The game revolves around
forced moves.</p>

<p>The engine is structured almost identically to the Connect Four engine.
It uses 32-bit indexes instead of pointers. The game state is a pair of
bitboards, with end-game masks <a href="/blog/2016/11/15/">computed at compile time via
metaprogramming</a>. The AI allocates the tree from a single, massive
buffer — multiple GBs in this case, dynamically scaled to the available
physical memory. And the core MCTS function is nearly identical.</p>

<p>One important difference is that identical game states — states where
the pieces on the board are the same, but the node was reached through
a different series of moves — are coalesced into a single state in the
tree. This state deduplication is done through a hash table. This
saves on memory and allows multiple different paths through the game
tree to share playouts. It comes at a cost of including the game state
in the node (so it can be identified in the hash table) and reference
counting the nodes (since they might have more than one parent).</p>
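<p>A miniature of the idea, with reference counting omitted (this is my own
sketch, not the engine’s table): hash the bitboard pair into an
open-addressing table, so a state reached by a second move order resolves to
the node already in the tree.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 64  /* power of two */

struct slot {
    uint64_t state[2];  /* the bitboard pair identifies the state */
    uint32_t node;
    int used;
};
static struct slot table[TABLE_SIZE];

static uint64_t state_hash(const uint64_t s[2])
{
    uint64_t h = 0x100;
    h ^= s[0]; h *= 1111111111111111111ULL;
    h ^= s[1]; h *= 1111111111111111111ULL;
    return h;
}

/* Return the node for a state, inserting `fresh` if it's new. */
static uint32_t intern(const uint64_t s[2], uint32_t fresh)
{
    for (uint64_t i = state_hash(s);; i++) {  /* linear probing */
        struct slot *e = &table[i % TABLE_SIZE];
        if (!e->used) {
            e->used = 1;
            memcpy(e->state, s, sizeof(e->state));
            e->node = fresh;
            return fresh;
        }
        if (!memcmp(e->state, s, sizeof(e->state))) {
            return e->node;  /* coalesce into the existing node */
        }
    }
}
```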

<p>Unfortunately the AI has blind spots, and once you learn to spot them it
becomes easy to beat consistently. It can’t spot certain kinds of forced
moves, so it always falls for the same tricks. The <em>official</em> Yavalath
AI is slightly stronger than mine, but has a similar blindness. I think
MCTS just isn’t quite a good fit for Yavalath.</p>

<p><strong>The AI’s blindness is caused by <em>shallow traps</em></strong>, a common problem
for MCTS. It’s what makes MCTS a poor fit for Chess. A shallow trap is
a branch in the game tree where the game will abruptly end in a small
number of turns. If the tree search doesn’t happen to stumble
upon a trap during its random traversal, it can’t take it into account
in its final decision. A skilled player will lead the game towards one
of these traps, and the AI will blunder along, not realizing what’s
happened until it’s too late.</p>

<p>I almost feel bad for it when this happens. If you watch the memory
usage and number of playouts, once it falls into a trap, you’ll see it
using almost no memory while performing a ton of playouts. It’s
desperately, frantically searching for a way out of the trap. But it’s
too late, little AI.</p>

<h3 id="another-tool-in-the-toolbelt">Another Tool in the Toolbelt</h3>

<p>I’m really happy to have sunk a couple weekends into playing with MCTS.
It’s not always a great fit, as seen with Yavalath, but it’s a really
neat algorithm. Now that I’ve wrapped my head around it, I’ll be ready
to use it should I run into an appropriate problem in the future.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>A GPU Approach to Path Finding</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2014/06/22/"/>
    <id>urn:uuid:29de5cb3-f93a-3e6e-9adc-ff689e736877</id>
    <updated>2014-06-22T22:51:46Z</updated>
    <category term="ai"/><category term="webgl"/><category term="javascript"/><category term="gpgpu"/><category term="opengl"/>
    <content type="html">
      <![CDATA[<p>Last time <a href="/blog/2014/06/10/">I demonstrated how to run Conway’s Game of Life</a>
entirely on a graphics card. This concept can be generalized to <em>any</em>
cellular automaton, including automata with more than two states. In
this article I’m going to exploit this to solve the <a href="http://en.wikipedia.org/wiki/Shortest_path_problem">shortest path
problem</a> for two-dimensional grids entirely on a GPU. It will be
just as fast as traditional searches on a CPU.</p>

<p>The JavaScript side of things is essentially the same as before — two
textures with a fragment shader in between that steps the automaton
forward — so I won’t be repeating myself. The only parts that have
changed are the cell state encoding (to express all automaton states)
and the fragment shader (to code the new rules).</p>

<ul>
  <li><a href="https://skeeto.github.io/webgl-path-solver/">Online Demo</a>
(<a href="https://github.com/skeeto/webgl-path-solver">source</a>)</li>
</ul>

<p>Included is a pure JavaScript implementation of the cellular
automaton (State.js) that I used for debugging and experimentation,
but it doesn’t actually get used in the demo. A fragment shader
(12state.frag) encodes the full automaton rules for the GPU.</p>

<h3 id="maze-solving-cellular-automaton">Maze-solving Cellular Automaton</h3>

<p>There’s a dead simple 2-state cellular automaton that can solve any
<em>perfect</em> maze of arbitrary dimension. Each cell is either OPEN or a
WALL, only 4-connected neighbors are considered, and there’s only one
rule: if an OPEN cell has only one OPEN neighbor, it becomes a WALL.</p>
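<p>The rule is easy to model on the CPU. Here is a sketch of one synchronous
step over a char grid, using <code class="language-plaintext highlighter-rouge">.</code> for OPEN and <code class="language-plaintext highlighter-rouge">#</code> for WALL (the third,
held-open state is omitted):</p>

```c
#include <assert.h>
#include <string.h>

#define MW 7  /* maze width  */
#define MH 3  /* maze height */

/* An OPEN cell with only one OPEN 4-connected neighbor is a dead end
 * and becomes a WALL. All cells update from the same source grid, as
 * a cellular automaton requires. */
static void step(const char src[MH][MW + 1], char dst[MH][MW + 1])
{
    memcpy(dst, src, MH * (MW + 1));
    for (int y = 0; y < MH; y++) {
        for (int x = 0; x < MW; x++) {
            if (src[y][x] != '.') {
                continue;
            }
            int open = 0;
            if (y > 0      && src[y - 1][x] == '.') open++;
            if (y < MH - 1 && src[y + 1][x] == '.') open++;
            if (x > 0      && src[y][x - 1] == '.') open++;
            if (x < MW - 1 && src[y][x + 1] == '.') open++;
            if (open == 1) {
                dst[y][x] = '#';  /* dead end collapses */
            }
        }
    }
}
```

<p>On a straight corridor, each step turns both ends into walls, collapsing
one cell per step exactly as in the GIF above.</p>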

<p><img src="/img/path/simple.gif" alt="" /></p>

<p>On each step the dead ends collapse towards the solution. In the above
GIF, in order to keep the start and finish from collapsing, I’ve added
a third state (red) that holds them open. On a GPU, you’d have to do
as many draws as the length of the longest dead end.</p>

<p>A perfect maze is a maze where there is exactly one solution. This
technique doesn’t work for mazes with multiple solutions, loops, or
open spaces. The extra solutions won’t collapse into one, let alone
the shortest one.</p>

<p><img src="/img/path/simple-loop.gif" alt="" /></p>

<p>To fix this we need a more advanced cellular automaton.</p>

<h3 id="path-solving-cellular-automaton">Path-solving Cellular Automaton</h3>

<p>I came up with a 12-state cellular automaton that can not only solve
mazes, but will specifically find the shortest path. Like above, it
only considers 4-connected neighbors.</p>

<ul>
  <li>OPEN (white): passable space in the maze</li>
  <li>WALL (black): impassable space in the maze</li>
  <li>BEGIN (red): starting position</li>
  <li>END (red): goal position</li>
  <li>FLOW (green): flood fill that comes in four flavors: north, east, south, west</li>
  <li>ROUTE (blue): shortest path solution, also comes in four flavors</li>
</ul>

<p>If we wanted to consider 8-connected neighbors, everything would be
the same, but it would require 20 states (n, ne, e, se, s, sw, w, nw)
instead of 12. The rules are still pretty simple.</p>

<ul>
  <li>WALL and ROUTE cells never change state.</li>
  <li>OPEN becomes FLOW if it has any adjacent FLOW cells. It points
towards the neighboring FLOW cell (n, e, s, w).</li>
  <li>END becomes ROUTE if adjacent to a FLOW cell. It points towards the
FLOW cell (n, e, s, w). This rule is important for preventing
multiple solutions from appearing.</li>
  <li>FLOW becomes ROUTE if adjacent to a ROUTE cell that points towards
it. Combined with the above rule, it means when a FLOW cell touches
a ROUTE cell, there’s a cascade.</li>
  <li>BEGIN becomes ROUTE when adjacent to a ROUTE cell. The direction is
unimportant. This rule isn’t strictly necessary but will come in
handy later.</li>
</ul>

<p>This can be generalized for cellular grids of any arbitrary dimension,
and it could even run on a GPU for higher dimensions, limited
primarily by the number of texture uniform bindings (2D needs 1
texture binding, 3D needs 2 texture bindings, 4D needs 8 texture
bindings … I think). But if you need to find the shortest path along
a five-dimensional grid, I’d like to know why!</p>

<p>So what does it look like?</p>

<p><img src="/img/path/maze.gif" alt="" /></p>

<p>FLOW cells flood the entire maze. Branches of the maze are searched
in parallel as they’re discovered. As soon as an END cell is touched, a
ROUTE is traced backwards along the flow to the BEGIN cell. It takes
twice as many steps as the length of the shortest path.</p>

<p>Note that the FLOW cells keep flooding the maze even after the END was
found. It’s a cellular automaton, so there’s no way to communicate to
these other cells that the solution was discovered. However, when
running on a GPU this wouldn’t matter anyway. There’s no bailing out
early before all the fragment shaders have run.</p>

<p>What’s great about this is that we’re not limited to mazes whatsoever.
Here’s a path through a few connected rooms with open space.</p>

<p><img src="/img/path/flood.gif" alt="" /></p>

<h4 id="maze-types">Maze Types</h4>

<p>The worst-case solution is the longest possible shortest path. There’s
only one frontier and running the entire automaton to push it forward
by one cell is inefficient, even for a GPU.</p>

<p><img src="/img/path/spiral.gif" alt="" /></p>

<p>The way a maze is generated plays a large role in how quickly the
cellular automaton can solve it. A common maze generation algorithm
is a random depth-first search (DFS). The entire maze starts out
entirely walled in and the algorithm wanders around at random plowing
down walls, but never breaking into open space. When it comes to a
dead end, it unwinds looking for new walls to knock down. This method
tends towards long, winding paths with a low branching factor.</p>

<p>The mazes you see in the demo are Kruskal’s algorithm mazes. Walls are
knocked out at random anywhere in the maze, without breaking the
perfect maze rule. It has a much higher branching factor and makes for
a much more interesting demo.</p>

<h4 id="skipping-the-route-step">Skipping the Route Step</h4>

<p>On my computers, with a 1023x1023 Kruskal maze <del>it’s about an
order of magnitude slower</del> (see update below) than <a href="http://en.wikipedia.org/wiki/A*_search_algorithm">A*</a>
(<a href="http://ondras.github.io/rot.js/hp/">rot.js’s version</a>) for the same maze. <del>Not very
impressive!</del> I <em>believe</em> this gap will close with time, as
GPUs gain parallelism faster than CPUs gain raw speed. However, there’s
something important to consider: it’s not only solving the shortest
path between source and goal, <strong>it’s finding the shortest path between
the source and any other point</strong>. At its core it’s a <a href="http://www.redblobgames.com/pathfinding/tower-defense/">breadth-first
grid search</a>.</p>

<p><em>Update</em>: One day after writing this article I realized that
<code class="language-plaintext highlighter-rouge">glReadPixels</code> was causing a gigantic bottleneck. By only checking for
the end conditions once every 500 iterations, this method is now as
fast as A* on modern graphics cards, despite taking up to an
extra 499 iterations. <strong>In just a few more years, this technique
should be faster than A*.</strong></p>

<p>Really, there’s little use in the ROUTE step. It’s a poor fit for the GPU.
It has no use in any real application. I’m using it here mainly for
demonstration purposes. If dropped, the cellular automaton would
become 6 states: OPEN, WALL, and four flavors of FLOW. Seed the source
point with a FLOW cell (arbitrary direction) and run the automaton
until all of the OPEN cells are gone.</p>

<h3 id="detecting-end-state">Detecting End State</h3>

<p>The ROUTE cells do have a useful purpose, though. How do we know when
we’re done? We can poll the BEGIN cell to check for when it becomes a
ROUTE cell. Then we know we’ve found the solution. This doesn’t
necessarily mean all of the FLOW cells have finished propagating,
though, especially in the case of a DFS-maze.</p>

<p>In a CPU-based solution, I’d keep a counter and increment it every
time an OPEN cell changes state. If the counter doesn’t change after
an iteration, I’m done. OpenGL 4.2 introduces an <a href="http://www.opengl.org/wiki/Atomic_Counter">atomic
counter</a> that could serve this role, but this isn’t available in
OpenGL ES / WebGL. The only thing left to do is use <code class="language-plaintext highlighter-rouge">glReadPixels</code> to
pull down the entire thing and check for end state on the CPU.</p>

<p>The original 2-state automaton above also suffers from this problem.</p>

<h3 id="encoding-cell-state">Encoding Cell State</h3>

<p>Cells are stored per pixel in a GPU texture. I spent quite some time
trying to brainstorm a clever way to encode the twelve cell states
into a vec4 color. Perhaps there’s some way to <a href="/blog/2014/06/21/">exploit
blending</a> to update cell states, or make use of some other kind
of built-in pixel math. I couldn’t think of anything better than a
straightforward encoding of 0 to 11 into a single color channel (red
in my case).</p>

<div class="language-glsl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">state</span><span class="p">(</span><span class="kt">vec2</span> <span class="n">offset</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">vec2</span> <span class="n">coord</span> <span class="o">=</span> <span class="p">(</span><span class="nb">gl_FragCoord</span><span class="p">.</span><span class="n">xy</span> <span class="o">+</span> <span class="n">offset</span><span class="p">)</span> <span class="o">/</span> <span class="n">scale</span><span class="p">;</span>
    <span class="kt">vec4</span> <span class="n">color</span> <span class="o">=</span> <span class="n">texture2D</span><span class="p">(</span><span class="n">maze</span><span class="p">,</span> <span class="n">coord</span><span class="p">);</span>
    <span class="k">return</span> <span class="kt">int</span><span class="p">(</span><span class="n">color</span><span class="p">.</span><span class="n">r</span> <span class="o">*</span> <span class="mi">11</span><span class="p">.</span><span class="mi">0</span> <span class="o">+</span> <span class="mi">0</span><span class="p">.</span><span class="mi">5</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
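<p>A CPU model of this encode/decode pair shows why the + 0.5 matters: it
absorbs the worst-case 8-bit quantization error, about 0.5/255 in the red
channel, which scales to only about 0.02 after multiplying by 11.</p>

```c
#include <assert.h>

/* State k is stored in the red channel as k/11 and decoded by
 * rounding red*11 back to an integer. */
static float encode(int state)
{
    return state / 11.0f;
}

static int decode(float red)
{
    return (int)(red * 11.0f + 0.5f);
}
```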

<p>This leaves three untouched channels for other useful information. I
experimented (uncommitted) with writing distance in the green channel.
When an OPEN cell becomes a FLOW cell, it stores its adjacent FLOW
cell’s distance plus one. I imagine this could be really useful in a real
application: put your map on the GPU, run the cellular automaton a
sufficient number of times, pull the map back off (<code class="language-plaintext highlighter-rouge">glReadPixels</code>),
and for every point you know both the path and total distance to the
source point.</p>
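<p>A CPU sketch of that experiment, one automaton step per call: OPEN cells
take the smallest neighboring distance plus one, so when the sweep stops
changing, every reachable cell holds its shortest distance to the seed.</p>

```c
#include <assert.h>
#include <string.h>

#define GW 5
#define GH 5

/* -1 is OPEN, -2 is WALL, anything else is a flooded cell's distance
 * from the seed. Returns nonzero while cells are still changing. */
static int flood_step(int d[GH][GW])
{
    int next[GH][GW];
    memcpy(next, d, sizeof(next));
    int changed = 0;
    for (int y = 0; y < GH; y++) {
        for (int x = 0; x < GW; x++) {
            if (d[y][x] != -1) {
                continue;  /* only OPEN cells change */
            }
            int best = -1;
            if (y > 0      && d[y-1][x] >= 0 && (best < 0 || d[y-1][x] < best)) best = d[y-1][x];
            if (y < GH - 1 && d[y+1][x] >= 0 && (best < 0 || d[y+1][x] < best)) best = d[y+1][x];
            if (x > 0      && d[y][x-1] >= 0 && (best < 0 || d[y][x-1] < best)) best = d[y][x-1];
            if (x < GW - 1 && d[y][x+1] >= 0 && (best < 0 || d[y][x+1] < best)) best = d[y][x+1];
            if (best >= 0) {
                next[y][x] = best + 1;  /* neighbor's distance plus one */
                changed = 1;
            }
        }
    }
    memcpy(d, next, sizeof(next));
    return changed;
}
```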

<h3 id="performance">Performance</h3>

<p>As mentioned above, I ran the GPU maze-solver against A* to test its
performance. I didn’t yet try running it against Dijkstra’s algorithm
on a CPU over the entire grid (one source, many destinations). If I
had to guess, I’d bet the GPU would come out on top for grids with a
high branching factor (open spaces, etc.) so that its parallelism is
most effectively exploited, but Dijkstra’s algorithm would win in all
other cases.</p>

<p>Overall this is more of a proof of concept than a practical
application. It’s proof that we can trick OpenGL into solving mazes
for us!</p>

]]>
    </content>
  </entry>
  <entry>
    <title>Markov Chain Text Generation</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2012/09/05/"/>
    <id>urn:uuid:3f808165-be65-3f4b-f485-8df6aacccd04</id>
    <updated>2012-09-05T00:00:00Z</updated>
    <category term="emacs"/><category term="elisp"/><category term="ai"/>
    <content type="html">
      <![CDATA[<p>You may have been confused by
<a href="/blog/2012/09/04/">yesterday’s nonsense post</a>. That’s because it was
generated by a few
<a href="https://github.com/skeeto/markov-text">Elisp Markov chain functions</a>. It
was fed my entire blog and used to generate a ~1500 word post.  I
tidied it up a bit to make sure the markup was valid and parentheses
were balanced, but that’s about it.</p>

<p>The algorithm is really simple and I was quite surprised by the
quality of the output. After feeding it <em>Great Expectations</em> and <em>A
Princess of Mars</em> (easily obtainable from
<a href="http://www.gutenberg.org/">Project Gutenberg</a>) I had a good laugh at
some of the output. Some choice quotes,</p>

<blockquote>
  <p>He wiped himself again, as if he didn’t marry her by hand.</p>
</blockquote>

<blockquote>
  <p>I admit having done so, and the summer afternoon toned down into the
house.</p>
</blockquote>

<p>My favorite of yesterday’s post was this one,</p>

<blockquote>
  <p>Suppose you want to read a great story, I recommend it.</p>
</blockquote>

<p>The output also looks like some types of spam, so this may be how some
spammers generate content in order to get around spam filters.</p>

<p>To build a Markov chain from input, the program looks at
<code class="language-plaintext highlighter-rouge">markov-text-state-size</code> words (default 3) and makes note of what word
follows. Then it slides the window forward one word and repeats. To
generate text, the last <code class="language-plaintext highlighter-rouge">markov-text-state-size</code> words of output
form the state, and the next word is selected from these notes at random,
weighted by the frequency of its appearance in the input text. Smaller
state sizes generate more random output and larger state sizes
generate better-structured output. Too large and the output is the
input verbatim.</p>

<p>For example, given this sentence and a state size of <em>two</em> words,</p>

<blockquote>
  <p>Quickly, he ran and he ran until he couldn’t.</p>
</blockquote>

<p>The produced chain looks like this in alist form,</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>((("Quickly," "he") "ran")
 (("he" "ran") "and" "until")
 (("ran" "and") "he")
 (("and" "he") "ran")
 (("ran" "until") "he")
 (("until" "he") "couldn't.")
 (("he" "couldn't.")))
</code></pre></div></div>
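<p>The generator’s only nontrivial operation is the weighted choice among a
state’s successors, such as choosing between “and” and “until” for
(“he” “ran”) above. Here is a sketch of that draw in C rather than Elisp
(the counts are hypothetical, but the idea is the same):</p>

```c
#include <assert.h>

static unsigned xorshift32(unsigned *s)
{
    *s ^= *s << 13;
    *s ^= *s >> 17;
    *s ^= *s << 5;
    return *s;
}

/* Choose index i with probability counts[i] / total, matching how
 * often each successor word followed the state in the input. */
static int pick(const int counts[], int n, unsigned *rng)
{
    int total = 0;
    for (int i = 0; i < n; i++) {
        total += counts[i];
    }
    int r = (int)(xorshift32(rng) % (unsigned)total);
    for (int i = 0; i < n; i++) {
        if (r < counts[i]) {
            return i;
        }
        r -= counts[i];
    }
    return n - 1;
}
```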

<p><a href="/img/diagram/markov-chain.gv"><img src="/img/diagram/markov-chain.png" alt="" /></a></p>

<p>Because there are two options for (“he” “ran”), the generator might
loop around that state for a while like so,

<blockquote>
  <p>Quickly, he ran and he ran and he ran and he ran until he couldn’t.</p>
</blockquote>

<p>Or it might skip the section altogether,</p>

<blockquote>
  <p>Quickly, he ran until he couldn’t.</p>
</blockquote>

<p>Also notice that the punctuation is part of the word. This makes the
output more natural, automatically forming sentences. What’s more, my
program also holds onto all newlines. This breaks the output into nice
paragraphs without any extra effort. Since I wrote it in Elisp, I use
<code class="language-plaintext highlighter-rouge">fill-paragraph</code> to properly wrap the paragraphs as I generate them,
so superfluous single newlines don’t hurt anything.</p>

<p>One problem I did run into with my input text was quotes. I was using
novels so there is a lot of quoted text (character dialog). The
generated text tends to balance quotes poorly. My solution for the
moment is to strip these out along with spaces when forming
words. That’s still not ideal.</p>

<p>I’m going to play with this a bit more, using it as a tool for other
project ideas (ERC bot, etc.). I already did this by including a
<a href="http://en.wikipedia.org/wiki/Lorem_ipsum"><em>lorem ipsum</em></a> generator
alongside the <code class="language-plaintext highlighter-rouge">markov-text</code> package. The input text is Cicero’s <em>De
finibus bonorum et malorum</em>, the original source of <em>lorem
ipsum</em>. This was actually the original inspiration for this project,
after I saw <code class="language-plaintext highlighter-rouge">lorem-ipsum.el</code> on EmacsWiki and decided I could do
better.</p>
]]>
    </content>
  </entry>
  <entry>
    <title>Implemented Is Simple Data Compression</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2012/09/04/"/>
    <id>urn:uuid:2dd8265c-e498-333f-7f07-4fd93d873975</id>
    <updated>2012-09-04T00:00:00Z</updated>
    <category term="emacs"/><category term="lisp"/><category term="ai"/>
    <content type="html">
      <![CDATA[<p><em>Update</em>: This post shouldn’t make sense to anyone
(hopefully). <a href="/blog/2012/09/05/">Read the follow-up</a> for an
explanation.</p>

<hr />

<p>When a branch of my posts remains simple.</p>

<p>This is necessary when one will assume Alan is more important than
number 12. By using numbers to repeat them, but this won’t work with
any sort of thing you want to load what’s needed. This includes
reimplementing the reader as it seems you still need to specify any
video-specific parameters, <code class="language-plaintext highlighter-rouge">ppmtoy4m</code> is the whole thing is just that,
decorated with some tips on how the current space as visited, then
recurse from the client to read a great story, I recommend you use to
launch a daemon process and prints the variable information to
stdout. As an added bonus, when a second variable for accumulation and
a second argument is relevant.</p>

<p>Suppose you want to read a great story, I recommend it.</p>

<p>This servlet uses the Term::ProgressBar, if it’s any good, but it’s
funny. As anyone with cats knows, it’s not <em>too</em> stupid to call
<code class="language-plaintext highlighter-rouge">fsync()</code> to force the write to the snapshot and uninterns any new
symbols. These symbols will be added to the the second experiment.</p>

<p>At this line, you can perform a number from a couple of these and give
them back any other language that can turn out even from a large
header comment in the logs, so getting someone into my honeypot
wouldn’t take long at all. The only proof I could then
cherry-pick/pull the issues from that repository and see the
polynomial interpolation at that time, presented in order. This makes
so much of web development (I think that’s his name). I am an Emacs
person myself, which I use branches all the time, now that they can be
written.</p>

<p>We will run your build system in a web front-end to it, and made a
couple of seconds.</p>

<p>You should also be a good head start, though. The SPARC is big-endian
and the results to seed their program accordingly. You could do this
is by mounting the compromised filesystem in a list. In the
decentralized model, everyone has their own solutions in parallel when
it comes across 10 it emits 0.</p>

<p>Here’s an example of some of the fire gem activated and exploded,
causing no blindness to me. They take a look at the same level as the
printed string. You can grab my source code in response to abuse by
spammers who hide fraudulent URLs behind shortened ones. If these
services ever went down all at once, these shortened URLs would rot,
destroying many of the image, with the FFI.</p>

<p>Because I wrote a shell script that will also remove the execs and
live with nested shells because the zeros cancel out everything else?
Here is the protocol.</p>

<p>Generate a 10-byte random IV. This need not implement this.</p>

<p>Note that the shell script, and the arcfour key scheduler at least n
days.</p>

<p>However, generating a series of commits to all other encounters
nothing changes.</p>

<p>Your program should simulate this by having the user to reseed
somewhere. There’s no direct way to install it to dominate for
awhile. It is strange that Matlab itself doesn’t have any sort of
syntax highlighting. Boring! I finally ran into this image. After each
paste, make a saving throw to prevent an explosion.</p>

<p>Because Gnohkk would also suffer from the bottom are arranged around
the cats in the logs, so getting someone into my honeypot wouldn’t
take long at the link in the block. Another was going to used a
stationary magnet.</p>

<p>Our team went with this array (and replaced the current layer 5). Now,
duplicate the work was done just once by freeing the entire number, it
can perform both compression and decompression on both sides don’t pay
attention to the development loop is just an ordered list of 50 H’s
and T’s. If you implement this in the same time. This is along the
way, clone my repository right into the official website so I had to
do this for any long-blocking function that I use <code class="language-plaintext highlighter-rouge">ppmtoy4m</code> to pipe
the new frames to keep, such as n^p mod M, which this will handle
efficiently. For example, to add a new compression algorithm in terms
of brute-force attacks it requires using numbers long enough to fit
three Emacs’ windows side-by-side at 78 columns each. The leftmost one
contains my active work buffer where I do most useful things, a fresh
array every time it sees a free musical.  Unfortunately, my writing
skills are even worse. I have gotten good mileage out of a file based
on their website demonstrating how to increment the iterator. I have
to type a negative comment about zip archives and moved on. I <em>am</em>
using a constant amount of memory.</p>

<p>It turns out that everyone is free to share his source code samples,
particularly more recent entries, was that producing the relief
surface was an e-mail address, I get home from work I don’t recommend
doing this with secret Java applets.</p>

<p>There are a few weeks since I last used KOffice, so I could easily
plug it into Emacs and run the test above, I would rather not do
damage, but rather a patient human being. Getting tired of manually
synchronizing them. It was finally time to document the effort as a
single mine is destroyed, the neighboring mines will replicate a
replacement. The minefield itself could therefore hold no secrets
whatsoever. This leaves out any possibility of a rumor among a group
of people. At any given time, each person in the background. My shell
habits looked like the ones you’re seeing after <code class="language-plaintext highlighter-rouge">end-package</code>.</p>

<p>It’s really simple way to detect edges all over the weekend I came up
with some rough edges. So I got it right while IE, Opera, Safari, and
Chrome all do it again.</p>

<p>Numbers can be found inside the fake closure provided by
lexical-let. In a previous post about Lua, another about a third of my
name generation code.</p>

<p>S-expressions are handy anywhere.</p>

<p>Two months ago I was so happy when I run the program with the proper
Perl regular expression contains quotes and these will not be worth
it.</p>

<p>I can’t help but think that a knight moving according to the current
symbol table to the existing mountain of elisp code out there,
requiring a massive increase in speed when using OpenCL. In fact,
there is virtually no computation involved. So what I want to look
like SBCL. Fortunately, that’s not all!!! There is a fake service or
computer on a chess board such that it’s somewhat easier to tell when
the handler can present any contents it wants. In this case, rather
than just one, even though I don’t know what it looks good, except you
want to italicize a few bits smaller than a minute. All the other day
I will probably be ordered by their own directory. Modern applications
have moved into a directory under <code class="language-plaintext highlighter-rouge">~/.config/</code>. Your script needs to
be broken into small computation units, because Emacs lacked network
functionality until recently was the package manager, <code class="language-plaintext highlighter-rouge">package</code>, and
the Emacs Lisp Package Archive.</p>

<p>One of the info field in the list, which sounds like a .emacs file in
your program. If the slot is already taken, the symbol was in an
external system.</p>

<p>After all this, I thought I’d give it a YouTube URL and a single
password if the required artifacts, digitally signs them, and bundles
them up.</p>

<p>The demo at the same length as the variable declarations are exactly
the right magical string of, say, 31 fractions.</p>

<p>The story is really happening. Optimizing away variables that point to
it.</p>

<p>Oh, and I was just a tiny subset of the memory at once became a lot of
memory. For example, here’s my laptop’s /bin/ls, very roughly
labeled.</p>

<p>The different segments of the game area was a mistake on my rolls and
had some wires, connected to some sort of bad things this may happen
subconsciously, which is given in ImageMagick’s montage tool, which
made the final montage out of the image functions described below.</p>

<p>You can write a lexer or tokenizer without one. Because of this tool,
Samuel Stoddard, gives some in-game context to the light of day. I
just use your own program, the script in your load-path somewhere.</p>

<p>I’ve frequently thought that a Lisp-based shell would be produced by
first individually gzipping each file in first.</p>

<p>For a long ways away from a simple double-click shortcut. If you just
want to duplicate the remaining canines. Her reward for victory was a
very similar process, but without any sort of thing is
transparent. I’ve already used it with a degree in, say, a few
months. I’ve used POSIX threads, Pthreads, before, so it suits my
needs for the first two arguments from filter2, as well as some more
to see my changes, but I don’t know much about it, user AJR spoiled it
with ssh-add and it queries for your passphrase, storing it in two
obarrays at once, these shortened URLs would rot, destroying many of
its input. For example, this is what registration looks like,</p>

<p>Unfortunately, the HTML output is a Harsh Mistress. If you know that
the opposite way that the adventures and characters are riddled with
mistakes and very unbalanced. For an easier way to set up properly in
your configuration.</p>

<p>I strongly recommend that you generally want to have a master pad, K,
that you often generate very improbable series of commits.</p>

<p>To all other encounters nothing changes.</p>

<p>And that’s it! I put this line in your program. If you are subscribed
to the rescue!</p>
]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Traveling Salesman Problem by Genetic Algorithm</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2007/11/30/"/>
    <id>urn:uuid:2a90302c-08f3-3b32-d41d-7d41d26cd9fc</id>
    <updated>2007-11-30T00:00:00Z</updated>
    <category term="ai"/>
    <content type="html">
      <![CDATA[<ul>
  <li><a href="/download/genetic.tar.gz">/download/genetic.tar.gz</a> (6.19KB)</li>
</ul>

<p>Here is another project for my artificial intelligence class. I wrote
a generic <a href="http://en.wikipedia.org/wiki/Genetic_algorithm">genetic algorithm</a> class in C++ and then applied that
class to the traveling salesman problem. A genetic algorithm more
tuned to the traveling salesman problem would work better.</p>

<p>This particular implementation can use up to 16 points defined in the
weight matrix stored in travel.dat. This weight matrix can either be
defined by hand or generated using gendat.m from a list of points
stored in points.txt. A chromosome is 64 bits wide, which is 16 points
with 4 bits each. To make sure that every possible chromosome is a
valid solution, the points are selected out of a circular queue. Every
4 bits describes how far along the queue to walk before pulling out a
point. With the circular queue, the chromosome could be as short as 50
bits, but I was trying different things and 64 bits is the simplest
way to represent a solution. Here are some sample chromosomes:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1101100010100000011001010101010111000110101100111111000010111010, 9315
0000110100000001000010010101010111011000101100010110000010111010, 9339
1010001100010001010010011001010010100000101100011111000010111110, 9410
0101001000000011010010011001010011010100101011100111000111011110, 9355
0101001001000000010110100001010011010110101101000001000101000110, 9349
0101011100010011000010011011011011010101101100011111010010111110, 9311
0000111101000001000010011001010010110100101100011111100010000010, 9350
1000001111100000010110101011011011100000111101010111000010111010, 9428
0000111011000001000010010111011111111110101100010111000001000111, 9448
</code></pre></div></div>

<p>The second number is the fitness value of the chromosome (10000 - path
length). Below is a path found after 20000 iterations:</p>

<p><img src="/img/diagram/salesman.png" alt="" /></p>

<p>It takes many iterations to find a reasonable solution and it never
finds a <em>really</em> good solution. This is because each node in the
chromosome depends on every single node before it. This is terrible
for a genetic algorithm, but it’s really the best I could think of
when using a generic genetic algorithm class like this.</p>

<p>A much better method would have the genetic algorithm actually know
about the problem at hand, working with nodes rather than bits.
Breeding would make cuts on nodes. Mutations would swap single nodes.
Perhaps this can be written another time.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Neural Network Blackjack Game</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2007/11/13/"/>
    <id>urn:uuid:d9bb1ed9-9d44-3747-6f94-a580912e68e5</id>
    <updated>2007-11-13T00:00:00Z</updated>
    <category term="ai"/>
    <content type="html">
      <![CDATA[<p>Get the source code (C++):</p>

<ul>
  <li><a href="/download/neural.tar.gz">/download/neural.tar.gz</a> (16.27KB)</li>
</ul>

<p>This is a neural network I wrote for an artificial intelligence class
I took about a year ago. It includes a stand-alone neural network
class you can easily use in your own C++ program. Built around this
neural network is a simple version of Blackjack (hit or stand only).
You can play the neural network at Blackjack after it has finished
training, which can take up to a minute or so.</p>

<p>My implementation of the neural network is a really simple one, using
only back propagation, but it still has some neat surprises in it.
When I was working with it, I would use a simple GNU Octave script to
watch what was going on in real time.</p>

<p><img src="/img/diagram/neural.png" alt="" /></p>

<p>The x-axis is the number of iterations in tens of thousands and the
y-axis describes how often the neural network plays exactly the same
way a cheater would at the same seat. A cheater is defined as someone
who knows what card is next and can play perfectly. Note: the neural
network itself is not cheating. At the end, it agrees with the cheater
about 83% of the time. This is the script that reads the neural
network output. For those who don’t recognize the
<a href="http://en.wikipedia.org/wiki/She-Bang">she-bang</a>, save this to the file <code>plotdat</code> and
set the execution permission (<code class="language-plaintext highlighter-rouge">chmod +x plotdat</code>).</p>

<div class="language-matlab highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">#!</span><span class="p">/</span><span class="n">usr</span><span class="p">/</span><span class="n">bin</span><span class="p">/</span><span class="n">octave</span> <span class="o">-</span><span class="n">qf</span>
<span class="err">#</span> <span class="n">Usage</span><span class="p">:</span>
<span class="err">#</span> <span class="n">plotdat</span> <span class="n">filename</span> <span class="p">[</span><span class="n">loop</span><span class="p">]</span> <span class="p">[</span><span class="n">last</span> <span class="n">index</span><span class="p">]</span>

<span class="n">dat</span> <span class="o">=</span> <span class="nb">dlmread</span><span class="p">(</span><span class="n">argv</span><span class="p">{</span><span class="mi">1</span><span class="p">});</span>
<span class="n">x</span> <span class="o">=</span> <span class="nb">length</span><span class="p">(</span><span class="n">dat</span><span class="p">);</span>
<span class="n">loop</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

<span class="k">if</span> <span class="p">(</span><span class="nb">length</span><span class="p">(</span><span class="n">argv</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">1</span><span class="p">)</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">argv</span><span class="p">{</span><span class="mi">2</span><span class="p">}</span> <span class="o">==</span> <span class="s1">'loop'</span><span class="p">)</span>
    <span class="n">loop</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
  <span class="k">else</span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">argv</span><span class="p">{</span><span class="mi">2</span><span class="p">};</span>
  <span class="k">end</span>
<span class="k">end</span>

<span class="nb">plot</span><span class="p">(</span><span class="n">dat</span><span class="p">(</span><span class="mi">1</span><span class="p">:</span><span class="n">x</span><span class="p">));</span>

<span class="k">while</span><span class="p">(</span><span class="n">loop</span><span class="p">)</span>
  <span class="n">dat</span> <span class="o">=</span> <span class="nb">dlmread</span><span class="p">(</span><span class="n">argv</span><span class="p">{</span><span class="mi">1</span><span class="p">});</span>
  <span class="nb">plot</span><span class="p">(</span><span class="n">dat</span><span class="p">);</span>
  <span class="n">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
<span class="k">end</span>

<span class="nb">pause</span><span class="p">;</span>
</code></pre></div></div>

<p>When you run the program, a blank, dumb neural network is created. It
doesn’t know how to play blackjack. In fact, just as it seems with my
sister at times, it doesn’t know how to do anything at all. It’s like
a newborn baby, but without any instincts. To make this neural network
useful, it is taught to play blackjack by running it very quickly
through about one million games, all tutored by a teacher who gets to
cheat by looking ahead in the deck. This allows the teacher to always
know the proper move.</p>

<p>The training works by giving the network an input and the desired
output (determined by the cheating). The network adjusts its internal
weights depending on the error between its current output for the
input and the desired output. In doing this, the neural network picks
up on the statistical nature of its inputs and learns their
patterns.</p>

<p>When setting up the neural network in your own program, the tricky
part is determining how to encode the inputs so that patterns will be
found. In this case, I provide 3 integers to the network: its lower
bound score, its best score, and the visible opponent score. By lower
bound and best score, I am talking about Aces. All aces are treated as
1’s in the lower bound and ideal (highest without bust) values in the
best score. Scores never exceed 31, so we can encode this in 5 bits
each for a total of 15 bits. We only need 1 output bit: hit (0) or
stand (1). The network looks something like this, but with 30 hidden
layer neurons and a lot more connections.</p>

<p><img src="/img/diagram/network.png" alt="" /></p>

<p>Here is an example run of the blackjack game,</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Creating network ... created!
Training ...
Win, lose, total, c%, win%: 1536, 7640, 10000, 42.1168% , 15.36%
Win, lose, total, c%, win%: 3728, 14438, 20000, 55.8875% , 21.92%

... (lots of output for a few seconds) ...

Win, lose, total, c%, win%: 203744, 700338, 980000, 82.0455% , 20.3%
Win, lose, total, c%, win%: 205849, 707461, 990000, 82.5488% , 21.05%
Win, lose, total, c%, win%: 207981, 714515, 1000000, 82.3059% , 21.32%

Begin game:
Computer is dealt: (hole) 5
Computer is dealt: (hole) 5 4
Computer stands.
Your hand: 2 8
Hit? (y/n): y
Your hand: 2 8 3
Hit? (y/n): y
Your hand: 2 8 3 10
You bust with 23
Computer wins with 19 against your 23
Play again? (y/n) : y

Begin game:
Computer is dealt: (hole) 4
Computer is dealt: (hole) 4 10
Computer stands.
Your hand: 10 A
Hit? (y/n): n
Your hand: 10 A
You win with 21 against computer's 20
Play again? (y/n) : n

Computer wins, losses, push: 1 (50%), 1 (50%), 0 (0%)
</code></pre></div></div>

<p>If you decide to try using the neural network in your own program, be
sure to play with different sized networks to see what works best. In
my implementation, a small network would be 1 layer with 3 nodes and a
large network would be 3 or 4 layers with hundreds of nodes. Bigger
networks may not work well at all, and there is no way to know what
size is best other than trial and error.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Iterated Prisoner's Dilemma</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2007/11/06/"/>
    <id>urn:uuid:4068d624-311a-30f4-a291-63713ffbc932</id>
    <updated>2007-11-06T00:00:00Z</updated>
    <category term="ai"/><category term="video"/><category term="lisp"/>
    <content type="html">
      <![CDATA[<p><img src="/img/prison/top.gif" alt="" /></p>

<p>I was reading about the <a href="http://en.wikipedia.org/wiki/Prisoner's_dilemma">prisoner’s dilemma</a> game the other day
and was inspired to simulate it myself. It would also be a good
project to start learning Common Lisp. All of the source code is
available in its original source file here:</p>

<ul>
  <li><a href="/download/prison/prison.lisp">/download/prison/prison.lisp</a></li>
</ul>

<p>I have only tried this code in my favorite Common Lisp implementation,
<a href="http://clisp.cons.org/">CLISP</a>, as well as <a href="http://www.cons.org/cmucl/">CMUCL</a>.</p>

<p>In prisoner’s dilemma, two players acting as prisoners are given the
option of cooperating with or betraying (defecting) the other player.
Each player’s decision along with his opponent’s decision determines
the length of his prison sentence. It is bad news for the cooperating
player when the other player is defecting.</p>

<p>Prisoner’s dilemma becomes more interesting in the iterated version of
the game, where the same two players play repeatedly. This allows
players to “punish” each other for uncooperative play. Scoring
generally works like so (higher is better),</p>

<table>
<tr><td colspan="2"></td><th colspan="2">Player A</th></tr>
<tr><td colspan="2"></td><td>coop</td><td>defect</td></tr>
<tr><th rowspan="2">Player B</th><td>coop</td>
<td>(3,3)</td><td>(0,5)</td></tr>
<tr><td>defect</td><td>(5,0)</td><td>(1,1)</td></tr>
</table>

<p>The most famous, and strongest, individual strategy is tit-for-tat.
This player begins by playing cooperatively, then does whatever its
opponent did last. Here is the Common Lisp code to run a
tit-for-tat strategy,</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">tit-for-tat</span> <span class="p">()</span>
  <span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nv">x</span><span class="p">)</span>
    <span class="p">(</span><span class="k">if</span> <span class="p">(</span><span class="nb">null</span> <span class="nv">x</span><span class="p">)</span> <span class="ss">:coop</span> <span class="nv">x</span><span class="p">)))</span>
</code></pre></div></div>

<p>If you are unfamiliar with Common Lisp, the <code class="language-plaintext highlighter-rouge">lambda</code> part is returning
an anonymous function that actually plays the tit-for-tat strategy.
The <code class="language-plaintext highlighter-rouge">tit-for-tat</code> function generates a tit-for-tat player along with
its own closure. The argument to the anonymous function supplies the
opponent’s last move, which is one of the symbols <code class="language-plaintext highlighter-rouge">:coop</code> or
<code class="language-plaintext highlighter-rouge">:defect</code>. In the case of the first move, <code class="language-plaintext highlighter-rouge">nil</code> is passed. These are
some really simple strategies that ignore their arguments,</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">rand-play</span> <span class="p">()</span>
  <span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nv">x</span><span class="p">)</span>
    <span class="p">(</span><span class="k">declare</span> <span class="p">(</span><span class="k">ignore</span> <span class="nv">x</span><span class="p">))</span>
    <span class="p">(</span><span class="k">if</span> <span class="p">(</span><span class="nb">&gt;</span> <span class="p">(</span><span class="nb">random</span> <span class="mi">2</span><span class="p">)</span> <span class="mi">0</span><span class="p">)</span> <span class="ss">:coop</span> <span class="ss">:defect</span><span class="p">)))</span>

<span class="p">(</span><span class="nb">defun</span> <span class="nv">switcher-coop</span> <span class="p">()</span>
  <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nb">last</span> <span class="ss">:coop</span><span class="p">))</span>
    <span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nv">x</span><span class="p">)</span>
      <span class="p">(</span><span class="k">declare</span> <span class="p">(</span><span class="k">ignore</span> <span class="nv">x</span><span class="p">))</span>
      <span class="p">(</span><span class="k">if</span> <span class="p">(</span><span class="nb">eq</span> <span class="nb">last</span> <span class="ss">:coop</span><span class="p">)</span>
          <span class="p">(</span><span class="nb">setf</span> <span class="nb">last</span> <span class="ss">:defect</span><span class="p">)</span>
          <span class="p">(</span><span class="nb">setf</span> <span class="nb">last</span> <span class="ss">:coop</span><span class="p">)))))</span>

<span class="p">(</span><span class="nb">defun</span> <span class="nv">switcher-defect</span> <span class="p">()</span>
  <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nb">last</span> <span class="ss">:defect</span><span class="p">))</span>
    <span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nv">x</span><span class="p">)</span>
      <span class="p">(</span><span class="k">declare</span> <span class="p">(</span><span class="k">ignore</span> <span class="nv">x</span><span class="p">))</span>
      <span class="p">(</span><span class="k">if</span> <span class="p">(</span><span class="nb">eq</span> <span class="nb">last</span> <span class="ss">:coop</span><span class="p">)</span>
          <span class="p">(</span><span class="nb">setf</span> <span class="nb">last</span> <span class="ss">:defect</span><span class="p">)</span>
          <span class="p">(</span><span class="nb">setf</span> <span class="nb">last</span> <span class="ss">:coop</span><span class="p">)))))</span>

<span class="p">(</span><span class="nb">defun</span> <span class="nv">always-coop</span> <span class="p">()</span>
  <span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nv">x</span><span class="p">)</span>
    <span class="p">(</span><span class="k">declare</span> <span class="p">(</span><span class="k">ignore</span> <span class="nv">x</span><span class="p">))</span>
    <span class="ss">:coop</span><span class="p">))</span>

<span class="p">(</span><span class="nb">defun</span> <span class="nv">always-defect</span> <span class="p">()</span>
  <span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nv">x</span><span class="p">)</span>
    <span class="p">(</span><span class="k">declare</span> <span class="p">(</span><span class="k">ignore</span> <span class="nv">x</span><span class="p">))</span>
    <span class="ss">:defect</span><span class="p">))</span>
</code></pre></div></div>

<p>Patrick Grim did an interesting study about ten years ago on iterated
prisoner’s dilemma involving competing strategies in a 2-dimensional
area: <a href="http://www.sunysb.edu/philosophy/faculty/pgrim/SPATIALP.HTM">Undecidability in the Spatialized Prisoner’s Dilemma: Some
Philosophical Implications</a>. It is very interesting, but I really
wanted to play around with some different configurations myself. So
what I did was extend my iterated prisoner’s dilemma engine above to
run over a 2-dimensional grid.</p>

<p>Grim’s idea was this: place different strategies in a 2-dimensional
grid. Each strategy competes against its immediate neighbors. (The
paper doesn’t specify which kind of neighbor, 4-connected or
8-connected, so I went with 4-connected.) The scores from these
competitions are added up to make that cell’s final score. After
scoring, each cell takes on the strategy of its highest-scoring
neighbor, if any of its neighbors scored higher than itself. Repeat.</p>

<p>The paper showed some interesting results, where the tit-for-tat
strategy would sometimes dominate, and, in other cases, be quickly
wiped out, depending on starting conditions. Here was my first real
test of my simulation. Three strategies were placed randomly in a
50x50 grid: tit-for-tat, always-cooperate, and always-defect. This is
the first twenty iterations. It stabilizes after 16 iterations.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">run-random-matrix</span> <span class="mi">50</span> <span class="mi">100</span> <span class="mi">20</span> <span class="o">'</span><span class="p">(</span><span class="nv">tit-for-tat</span> <span class="nv">always-coop</span> <span class="nv">always-defect</span><span class="p">))</span>
</code></pre></div></div>

<p><img src="/img/prison/random.gif" alt="" /></p>

<p>White is always-cooperate, black is always-defect, and cyan is
tit-for-tat. Notice how the always-defect quickly exploits the
always-cooperate and dominates the first few iterations. However, as
the always-cooperate resource becomes exhausted, the tit-for-tat
cooperative strategy works together with itself, as well as the
remaining always-cooperate, to eliminate the always-defect invaders,
who have no one left to exploit. In the end, a few always-defect cells
are left in equilibrium, feeding off of always-cooperate neighbors,
who themselves have enough cooperating neighbors to hold their ground.</p>

<p>The effect can be seen more easily here. Around the outside is
tit-for-tat, in the middle is always-cooperate, and a single
always-defect cell is placed in the middle.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">run-matrix</span> <span class="p">(</span><span class="nv">create-three-box</span><span class="p">)</span> <span class="mi">100</span> <span class="mi">30</span><span class="p">)</span>
</code></pre></div></div>

<p><img src="/img/prison/boxes.gif" alt="" /></p>

<p>The asymmetric pattern is due to the way that ties are broken.</p>

<p>The Lisp code only spits out text, which makes it hard to follow
what’s going on. To generate these GIFs, I first used this Octave
script to convert the text into images. Just dump the lisp output to a
text file and remove the hash table dump at the end. Then run this
script on that file:</p>

<ul>
  <li><a href="/download/prison/pd_plot.m">/download/prison/pd_plot.m</a></li>
</ul>

<p>The text file input should look like this:</p>

<ul>
  <li><a href="/download/prison/example.txt">/download/prison/example.txt</a></li>
</ul>

<p><del>You will need Octave-Forge.</del></p>

<p>The script will make PNGs. You can either change the script to make
GIFs (didn’t try this myself), or use something like
<a href="http://www.imagemagick.org/">ImageMagick</a> to convert the images afterward. Then, you
compile the frames into an animated GIF using <a href="http://www.lcdf.org/gifsicle/">Gifsicle</a>.</p>

<p>See if you can come up with some different strategies and make some
special patterns for them. You may be able to observe some interesting
interactions. The image at the beginning of the article uses all of
the listed strategies in a random matrix.</p>

<p>I will continue to try out more strategies to see if I can find
something particularly interesting.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Chess AI Idea</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2007/10/24/"/>
    <id>urn:uuid:2e66347a-fe5d-3b70-58ea-9a9e9bd6248a</id>
    <updated>2007-10-24T00:00:00Z</updated>
    <category term="ai"/><category term="game"/>
    <content type="html">
      <![CDATA[<p>So, I had this idea of using a genetic algorithm to optimize the
parameters of a program that plays chess. Now, the genetic algorithm
wouldn’t be used at all during a game, but rather to optimize the
board evaluation parameters beforehand. I don’t know much about
writing board game AI programs, as I have only written a few of them
for fun (tic-tac-toe, connect 4, Pente). For this chess program, I am
taking a simple approach because I am more interested in seeing the
genetic algorithm at work than seeing the chess playing AI do well
against other chess AI or people.</p>

<p>The program would search the game tree using the <a href="http://en.wikipedia.org/wiki/Minimax">minimax</a>
algorithm, with some <a href="http://en.wikipedia.org/wiki/Alpha-beta_pruning">possible optimizations</a> added
afterward. Tree searching is just a matter of generating all possible
moves and looking at them. The hard part is the board evaluation
function, which evaluates a particular board’s score based on the
arrangement of the pieces. Parameters to this evaluation function
would be, for example, the piece values. The pawn would be locked in
at a value of 1, which anchors the other values and provides a base
unit to work from.</p>
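<p>To make the shape of this concrete, here is a minimal sketch. The piece letters, the <code>(piece, owner)</code> board encoding, and the callback-style game tree are all illustrative interfaces of my own, not a real engine: the evaluation is parameterized by a chromosome of piece values with the pawn anchored at 1, and a plain minimax applies it at the leaves.</p>

```python
# Sketch only: chromosome-parameterized evaluation plus plain minimax.
# The board encoding and tree callbacks are illustrative assumptions.

def make_evaluator(chromosome):
    # chromosome = (knight, bishop, rook, queen); the pawn is locked
    # at 1 as the base unit that anchors the other values
    knight, bishop, rook, queen = chromosome
    values = {'P': 1, 'N': knight, 'B': bishop, 'R': rook, 'Q': queen}

    def evaluate(board, side):
        """Material balance from `side`'s point of view, ignoring piece
        positions (board = iterable of (piece, owner) pairs)."""
        return sum(values[p] if owner == side else -values[p]
                   for p, owner in board)
    return evaluate

def minimax(node, depth, maximizing, children, score):
    """Plain minimax over an abstract game tree. `children(node)` yields
    successors (empty at terminal nodes); `score(node)` is the board
    evaluation applied at the search horizon."""
    kids = list(children(node))
    if depth == 0 or not kids:
        return score(node)
    results = [minimax(k, depth - 1, not maximizing, children, score)
               for k in kids]
    return max(results) if maximizing else min(results)
```

<p>At the horizon, <code>score</code> would be the evaluator above applied to the node’s board from the searching side’s point of view.</p>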

<p>Again, the parameters would not change during a game. We use the
genetic algorithm ahead of time to determine the parameters.</p>

<p>For the genetic algorithm, the set of parameters strung together makes
up a single chromosome. We maintain a pool of different chromosomes,
i.e. different sets of parameters, and breed these chromosomes
together to improve our parameter sets. We start out with a random
pool made of parameters that are most likely pretty terrible.</p>

<p>To evaluate the chromosomes, we need a fitness function, which
evaluates each chromosome for its level of “fitness,” deciding whether
it breeds or not. To do this we simply play the chromosome we are
evaluating against some base chromosome, which may just be parameters
chosen intuitively by the programmer. Or, the base chromosome could be
random too. Starting with a better base chromosome would be a good
head start, though. The fitness of the chromosome is how often it wins
against the base chromosome in, say, a few hundred games.</p>
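<p>As a sketch, the fitness function reduces to a win rate. <code>play_game</code> here is a hypothetical stand-in for a full match between two parameter sets, not anything from a real engine:</p>

```python
def fitness(chromosome, base, play_game, games=200):
    """Fitness = fraction of games won against the base chromosome.
    `play_game(a, b)` is a stand-in for a full chess match between two
    parameter sets; it returns 1 if `a` wins and 0 otherwise."""
    return sum(play_game(chromosome, base) for _ in range(games)) / games
```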

<p>The most fit chromosomes are bred by taking a few parameters from each
to make a new chromosome. Mutations are occasionally added in order to
keep the chromosome pool from getting stuck in a local maximum. A
mutation involves changing one or more parameters in a chromosome
slightly in some random way. Mutations are rare and will usually be
detrimental to the chromosome, quickly killing it off, but will
occasionally cause a good change that will be spread to other
chromosomes in the next generation, improving the gene pool.</p>
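<p>The breeding step might look like the following sketch. The crossover scheme, mutation rate, and survivor fraction are all assumptions on my part:</p>

```python
import random

def crossover(a, b):
    """Child takes each parameter from one parent or the other."""
    return tuple(random.choice(pair) for pair in zip(a, b))

def mutate(chromosome, rate=0.05, scale=0.5):
    """Occasionally nudge a parameter by a small random amount."""
    return tuple(v + random.uniform(-scale, scale)
                 if random.random() < rate else v
                 for v in chromosome)

def next_generation(pool, fitness, keep=0.25):
    """Breed the fittest fraction of the pool into a full new pool."""
    ranked = sorted(pool, key=fitness, reverse=True)
    parents = ranked[:max(2, int(len(pool) * keep))]
    return [mutate(crossover(*random.sample(parents, 2)))
            for _ in range(len(pool))]
```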

<p>We iterate this until either the maximum fitness level in the pool is
stuck for several iterations (we aren’t getting anywhere and mutations
aren’t helping), or the chromosomes are so good they always beat the
base chromosome, making the fitness algorithm meaningless. When this
happens, we replace the base chromosome with the best chromosome in
the pool and start over from scratch again with a random, or mostly
random, pool.</p>

<p>As you would expect, I have looked into parallelizing this process to
take advantage of a cluster. This is easy for several reasons. First,
evaluating chromosomes can be done simultaneously. No evaluation
depends on another chromosome’s evaluation. Second, the minimax game
tree search can be parallelized so that several different processes
search the game tree and give their results back to the parent
process. This works very well because the data being sent back to the
parent will be a single integer. No need to send large amounts of data
around the network.</p>
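<p>Because no evaluation depends on another, the fan-out is just a parallel map. A sketch using Python’s standard library (a single machine rather than a cluster, but the shape is the same):</p>

```python
from concurrent.futures import ProcessPoolExecutor

def evaluate_pool(pool, fitness, pmap=map):
    """Score every chromosome independently. Any parallel map drops in
    for `pmap`, e.g.:
        with ProcessPoolExecutor() as ex:
            scores = evaluate_pool(pool, fitness, ex.map)
    A cluster version would only need a distributed map instead."""
    return list(pmap(fitness, pool))
```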

<p>I spent an afternoon hacking at this, but it’s still too crude to
share yet. I got the non-parallel version of the chess engine built,
but I am
still working on the evaluation function. The genetic algorithm hasn’t
been started. The only parameters at the moment are piece values. The
board evaluation function just adds up the piece values on the board
completely ignoring their positions. This makes the computer play
extremely aggressively, capturing the opponent’s pieces whenever it
can. This makes for a somewhat interesting bloodbath where the board
goes empty after just a few moves.</p>

<p>My problem right now is finding a good way to represent piece
movements so that the representation can be reused. That is, I want to
represent movements the same way when generating my search tree,
verifying the legality of a move, and evaluating the board, so that I
don’t have to program in piece movements several times.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  

</feed>
