<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

  <title>Articles tagged ai at null program</title>
  <link rel="alternate" type="text/html"
        href="https://nullprogram.com/tags/ai/"/>
  <link rel="self" type="application/atom+xml"
        href="https://nullprogram.com/tags/ai/feed/"/>
  <updated>2026-04-07T03:24:16Z</updated>
  <id>urn:uuid:bbcac281-4091-4371-aaed-1061fd43ac26</id>

  <author>
    <name>Christopher Wellons</name>
    <uri>https://nullprogram.com</uri>
    <email>wellons@nullprogram.com</email>
  </author>

  <entry>
    <title>2026 has been the most pivotal year in my career… and it's only March</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2026/03/29/"/>
    <id>urn:uuid:91d679b3-4f07-4b61-b359-5890695ad621</id>
    <updated>2026-03-29T21:38:22Z</updated>
    <category term="ai"/><category term="c"/><category term="cpp"/>
    <content type="html">
      <![CDATA[<p>In February I left my employer after nearly two decades of service. In the
moment I was optimistic, yet unsure I’d made the right choice. Now that
the dust has settled, I’m absolutely sure I chose correctly. I’m happier
and better for it. There were multiple factors, but it’s not mere chance
that it coincides with
these early months of <a href="https://shumer.dev/something-big-is-happening">the automation of software engineering</a>. I
left an employer that is <em>years behind</em> adopting AI to one actively
supporting and encouraging it. As of March, in my professional capacity
<strong>I no longer write code myself</strong>. My current situation was unimaginable
to me only a year ago. Like it or not, this is the future of software
engineering. Turns out I like it, and having tasted the future I don’t
want to go back to the old ways.</p>

<p>In case you’re worried, this is still me. These are my own words. <a href="https://paulgraham.com/writes.html">Writing
is thinking</a>, and it would defeat the purpose for an AI to write
in my place on my personal blog. That’s not going to change.</p>

<p>I still spend much time reading and understanding code, and I still use
most of the same development tools. It’s more like being a manager, orchestrating
a nebulous team of inhumanly-fast, nameless assistants. Instead of dicing
the vegetables, I conjure a helper to do it while I continue to run the
kitchen. I haven’t managed people in some 20 years now, but I can feel
those old muscles being put to use again as I improve at this new role.
Will these kitchens still need human chefs like me by the end of the
decade? Unclear, and it’s something we all need to prepare for.</p>

<p>My situation gave me an experience onboarding with AI assistance — a fast
process given a near-instant, infinitely-patient helper answering any
question about the code. By the second week I was making substantial,
wide-ranging contributions to the large C++ code base. It’s difficult to attach a
quantifiable factor like 2x, 5x, 10x, etc. faster, but I can say for
certain this wouldn’t have been possible without AI. The bottlenecks have
shifted from producing code, which now takes relatively no time at all, to
other points, and we’re all still trying to figure it out.</p>

<p>My personal programming has transformed as well. Everything <a href="/blog/2024/11/10/">I said about
AI in late 2024</a> is, as I predicted, utterly obsolete. There’s a
huge, growing gap between open weight models and the frontier. Models you
can run yourself are toys. In general, almost any AI product or service
worth your attention costs money. The free stuff is, at minimum, months
behind. Most people only use limited, free services, so there’s a broad
unawareness of just how far AI has advanced. AI is <em>now highly skilled at
programming</em>, and better than me at almost every programming task, with
inhumanly-low defect rates. The remaining issues are mainly steering
problems: If AI code doesn’t do what I need, likely the AI writing it
didn’t understand what I needed.</p>

<p>I’ll still write code myself from time to time for fun — <a href="/blog/2018/06/10/">minimalist</a>,
with my <a href="/blog/2023/10/08/">style</a> and <a href="/blog/2025/01/19/">techniques</a> — the same way I play <a href="https://en.wikipedia.org/wiki/Shogi">shogi</a> on
the weekends for fun. However, artisan production is uneconomical in the
presence of industrialization. AI makes programming so cheap that only the
rich will write code by hand.</p>

<p>A small part of me is sad at what is lost. A bigger part is excited about
the possibilities of the future. I’ve always had more ideas than time or
energy to pursue them. With AI at my command, the problem changes shape. I
can comfortably take on complexity from which I previously shied away, and
I can take a shot at any idea sufficiently formed in my mind to prompt an
AI — a whole skill of its own that I’m actively developing.</p>

<p>For instance, a couple weeks ago I <a href="https://github.com/skeeto/w64devkit/pull/357">put AI to work on a problem</a>,
and it produced a working solution for me after ~12 hours of continuous,
autonomous work, literally while I slept. The past month <a href="https://github.com/skeeto/w64devkit">w64devkit</a> has
burst with activity, almost entirely AI-driven. Some of it is architectural
changes I’ve wanted for years, but they would have required hours of tedious
work, so I never got around to them. AI knocked them out in minutes, with the
new architecture opening new opportunities. It’s also taken on most of the
cognitive load of maintenance.</p>

<h3 id="quiltcpp">Quilt.cpp</h3>

<p>So far my biggest successful undertaking is <strong><a href="https://github.com/skeeto/quilt.cpp">Quilt.cpp</a></strong>, a C++
clone of <a href="https://savannah.nongnu.org/projects/quilt">Quilt</a>, an early, actively-used source control system for
patch management. Git is a glaring omission from the <a href="/blog/2020/09/25/">almost</a> complete
w64devkit, due to platform and build issues. I’ve long thought Quilt could fill
<em>some</em> of that source control hole, except the original is written in
Bash, Perl, and GNU Coreutils — even more of a challenge than Git. Since
Quilt is conceptually simple, and I could lean on <a href="https://frippery.org/busybox/">busybox-w32</a> <code class="language-plaintext highlighter-rouge">diff</code>
and <code class="language-plaintext highlighter-rouge">patch</code>, I’ve considered writing my own implementation, just <a href="/blog/2023/01/18/">as I did
pkg-config</a>, but I never found the energy to do it.</p>

<p>Then I got good enough with AI to knock out a near feature-complete clone
in about four days, including a built-in <code class="language-plaintext highlighter-rouge">diff</code> and <code class="language-plaintext highlighter-rouge">patch</code> so it doesn’t
actually depend on external tools (except invoking <code class="language-plaintext highlighter-rouge">$EDITOR</code>). On Windows
it’s a ~1.6MB standalone EXE, to be included in future w64devkit releases.
The source is distributed as an amalgamation, a single file <code class="language-plaintext highlighter-rouge">quilt.cpp</code>
per its namesake:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ c++ -std=c++20 -O2 -s -o quilt.exe quilt.cpp
$ ./quilt.exe --help
Usage: quilt [--quiltrc file] &lt;command&gt; [options] [args]

Commands:
  new        Create a new empty patch
  add        Add files to the topmost patch
  push       Apply patches to the source tree
  pop        Remove applied patches from the stack
  refresh    Regenerate a patch from working tree changes
  diff       Show the diff of the topmost or a specified patch
  series     List all patches in the series
  applied    List applied patches
  unapplied  List patches not yet applied
  top        Show the topmost applied patch
  next       Show the next patch after the top or a given patch
  previous   Show the patch before the top or a given patch
  delete     Remove a patch from the series
  rename     Rename a patch
  import     Import an external patch into the series
  header     Print or modify a patch header
  files      List files modified by a patch
  patches    List patches that modify a given file
  edit       Add files to the topmost patch and open an editor
  revert     Discard working tree changes to files in a patch
  remove     Remove files from the topmost patch
  fold       Fold a diff from stdin into the topmost patch
  fork       Create a copy of the topmost patch under a new name
  annotate   Show which patch modified each line of a file
  graph      Print a dot dependency graph of applied patches
  mail       Generate an mbox file from a range of patches
  grep       Search source files (not implemented)
  setup      Set up a source tree from a series file (not implemented)
  shell      Open a subshell (not implemented)
  snapshot   Save a snapshot of the working tree for later diff
  upgrade    Upgrade quilt metadata to the current format
  init       Initialize quilt metadata in the current directory

Use "quilt &lt;command&gt; --help" for details on a specific command.
</code></pre></div></div>

<p>It supports Windows and POSIX, and runs ~5x faster than the original. AI
developed it on Windows, Linux, and macOS: It’s best when the AI can close
the debug loop and tackle problems autonomously without involving a human
slowpoke. The handful of “not implemented” parts aren’t because they’re
too hard — each would probably take an AI ~10 minutes — but deliberate
decisions of taste.</p>

<p>There’s an irony that the reason I could produce Quilt.cpp with such ease
is also a reason I don’t really need it anymore.</p>

<p>I changed the output of <code class="language-plaintext highlighter-rouge">quilt mail</code> to be more Git-compatible. The mbox
produced by Quilt.cpp can be imported into Git with a plain <code class="language-plaintext highlighter-rouge">git am</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ quilt mail --mbox feature-branch.mbox
$ git am feature-branch.mbox
</code></pre></div></div>

<p>The idea being that I could work on a machine without Git (e.g. Windows
XP), and copy/mail the mbox to another machine where Git can absorb it as
though it were in Git the whole time. <code class="language-plaintext highlighter-rouge">git format-patch</code> to <code class="language-plaintext highlighter-rouge">quilt import</code>
sends commits in the opposite direction, useful for manually testing
Quilt.cpp on real change sets.</p>

<p>To be clear, I could not have done this if the original Quilt did not
exist as a working program. I began with an AI generating a <a href="https://eli.thegreenplace.net/2026/rewriting-pycparser-with-the-help-of-an-llm/">conformance
suite</a> based on the original, its documentation, and other online
documentation, validating that suite against the original implementation
(see <code class="language-plaintext highlighter-rouge">-DQUILT_TEST_EXECUTABLE</code>). Then I had another AI code to the tests,
under architectural guidance from me, with <code class="language-plaintext highlighter-rouge">-D_GLIBCXX_DEBUG</code> and
sanitizers as guardrails. That was day one. The next three days were lots
of refining and iterating as I discovered the gaps in the test suite. I’d
prompt AI to
compare Quilt.cpp to the original Quilt man page, add tests for missing
features, validate the new tests against the original Quilt, then run
several agents to fix the tests. While they worked I’d try the latest
build and note any bugs. As of this writing, the result is about equal
parts test and non-test, ~9KLoC each.</p>

<p>I’m likely to use this technique to clone other tools with implementations
unsuitable for my purposes. I learned quite a bit from this first attempt.</p>

<p>Why C++ instead of my usual choice of C? As we know, <a href="/blog/2023/02/11/">conventional C is
highly error-prone</a>. Even AI has trouble with it. In the ~9k lines
of C++ that is Quilt.cpp, I am only aware of three memory safety errors by
the AI. Two were null-terminated string issues with <code class="language-plaintext highlighter-rouge">strtol</code>, where the AI
was essentially writing C instead of C++, after which I directed the AI to
use <code class="language-plaintext highlighter-rouge">std::from_chars</code> and drop as much direct libc use as possible. (The
other was an unlikely branch with <code class="language-plaintext highlighter-rouge">std::vector::back</code> on an empty vector.)
We can rescue C with better techniques like arena allocation, counted
strings, and slices, but while (current) state of the art AI understands
these things, it cannot work effectively with them in C. I’ve tried. So I
picked C++, and from my professional work I know AI is better at C++ than
me.</p>

<p>Also like a manager, I have not read most of the code, and instead focused
on results, so you might say this was “vibe-coded.” It <em>is</em> thoroughly
tested, though I’m sure there are still bugs to be ironed out, especially
on the more esoteric features I haven’t tried by hand yet.</p>

<h3 id="lets-discuss-tools">Let’s discuss tools</h3>

<p>I opposed CMake for years, so you may have noticed that the latest
w64devkit now includes CMake and Ninja. What happened? Preparing for my anticipated
employment change, this past December I read <a href="https://crascit.com/professional-cmake/"><em>Professional CMake</em></a>.
I realized that my practical problems with CMake stemmed from nearly
everyone using it incorrectly. Most CMake builds are a disaster, but my new-found
knowledge allows me to navigate the common mistakes. Only high profile
open source projects manage to put together proper CMake builds. Otherwise
the internet is loaded with CMake misinformation. Similar to AI, if you’re
not paying for CMake knowledge then it’s likely wrong or misleading. So I
highly recommend that book!</p>

<p>Frontier AI is <em>very good</em> with CMake. When a project has a CMake build
that isn’t <em>too</em> badly broken, just tell AI to fix it, <em>without any
specifics</em>, and build problems disappear in mere minutes without having to
think about it. It’s awesome. Combine it with the previous discussion
about tests making AI so much more effective, and that it <em>also</em> knows
CTest well, and you’ve got a killer formula. I’m more effective with CTest
myself merely from observing how AI uses it. AI (currently) cannot use
debuggers, so putting powerful, familiar testing tools in its hands helps
a lot, versus the usual bespoke, debugger-friendly solutions I prefer.</p>
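<p>A minimal sketch of that CTest wiring (the target and test names here are
hypothetical, not from any real project):</p>

```cmake
# After this, a single `ctest --output-on-failure` in the build directory
# runs the whole suite: one command an AI agent can invoke and parse.
cmake_minimum_required(VERSION 3.20)
project(example CXX)

add_executable(example main.cpp)

enable_testing()
# Hypothetical self-test flag on the binary; wire in whatever runner you have.
add_test(NAME conformance COMMAND example --run-tests)
```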

<p>Similar to solving CMake problems: Have a hairy merge conflict? Just ask
AI to resolve it. It’s like magic. I no longer fear merge conflicts.</p>

<p>So part of my motivation for adding CMake to w64devkit was anticipation of
projects like Quilt.cpp, where they’d be available to AI, or at least so I
could use the tools the AI used to build/test myself. It’s already paid
for itself, and there’s more to come.</p>

<p>For agent software, on personal projects I’m using Claude Code. It’s a
great value, cheaper than paying API rates, though it requires working around
5-hour limit windows. I started with Pro (US$20/mo), but I’m getting so
much out of it that as of this writing I’m on 5x Max (US$100/mo) simply to
have enough to explore all my ideas. Be warned: <strong>Anthropic software is
quite buggy, more so than industry average</strong>, and it’s obvious that they
never even <em>start</em>, let alone test, some of their released software on
disfavored platforms (Windows, Android). Don’t expect to use Claude Code
effectively for native Windows platform development, which sadly includes
w64devkit. Hopefully that’s fixed someday. I suspect Anthropic hit a
bottleneck on QA, and, unable to fit AI into that role, they don’t bother. You
can theoretically report bugs on GitHub, but they’re just ignored and
closed. (Why don’t they have AI agents jumping on this wealth of bug
reports?)</p>

<p>At work I’m using Cursor where I get a choice of models. My favorite for
March has been GPT-5.4, which in my experience beats Opus 4.6 on Claude
Code by a small margin. It’s immediately obvious that Cursor is better
agent software than Claude Code. It’s more robust, more featureful, and
with a clearer UI than Claude Code. It has no trouble on Windows and can
drive w64devkit flawlessly. It’s also more expensive than Claude Code. My
employer currently spends ~US$250/mo on my AI tokens, dirt cheap
considering what they’re getting out of it. I have bottlenecks elsewhere
that keep me from spending even more.</p>

<p>Neither Cursor nor Claude Code is open source, so what are the purists to
do, even if they’re willing to pay API rates for tokens? Sadly I have no
answers for you. I haven’t gotten any open source agent software actually
working, and it seems they may lack the necessary secret sauce.</p>

<p>Update: Several folks suggested I give <a href="https://opencode.ai/">OpenCode</a> another shot, and this
time I got over the configuration hurdle. Single executable, slick
interface, and unlike Claude Code, I observed no bugs in my brief trial.
Give that a shot if you’re looking for an open source client.</p>

<p>The future is going to be weird. My experience is only a peek at what’s to
come, and my head is still spinning. However, the more I adapt to the
changes, the better I feel. If you’re feeling anxious like I was, don’t
flinch from improving your own AI knowledge and experience.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>Everything I've learned so far about running local LLMs</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2024/11/10/"/>
    <id>urn:uuid:975c2748-2c8f-4bb8-a108-b2be68a10fc5</id>
    <updated>2024-11-10T05:05:20Z</updated>
    <category term="ai"/><category term="rant"/><category term="tutorial"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=42100560">on Hacker News</a>.</em></p>

<p>Over the past month I’ve been exploring the rapidly evolving world of
Large Language Models (LLM). It’s now accessible enough to run, on a
Raspberry Pi, a LLM smarter than the original ChatGPT (November 2022). A modest
desktop or laptop supports even smarter AI. It’s also private, offline,
unlimited, and registration-free. The technology is improving at breakneck
speed, and information is outdated in a matter of months. This article
snapshots my practical, hands-on knowledge and experiences — information I
wish I had when starting. Keep in mind that I’m a LLM layman, I have no
novel insights to share, and it’s likely I’ve misunderstood certain
aspects. In a year this article will mostly be a historical footnote,
which is simultaneously exciting and scary.</p>

<!--more-->

<p>In case you’ve been living under a rock — as an under-the-rock inhabitant
myself, welcome! — LLMs are neural networks that underwent a breakthrough
in 2022 when trained for conversational “chat.” Through it, users converse
with an artificial intelligence indistinguishable from a human, which
smashes the Turing test and can be wickedly creative.
Interacting with one for the first time is unsettling, a feeling which
will last for days. When you bought your most recent home computer, you
probably did not expect to have a meaningful conversation with it.</p>

<p>I’ve found this experience reminiscent of the desktop computing revolution
of the 1990s, where your newly purchased computer seemed obsolete by the
time you got it home from the store. There are new developments each week,
and as a rule I ignore almost any information more than a year old. The
best way to keep up has been <a href="https://old.reddit.com/r/LocalLLaMA">r/LocalLLaMa</a>. Everything is hyped to the
stratosphere, so take claims with a grain of salt.</p>

<p>I’m wary of vendor lock-in, having experienced the rug pulled out from
under me by services shutting down, changing, or otherwise dropping my use
case. I want the option to continue, even if it means changing providers.
So for a couple of years I’d ignored LLMs. The “closed” models, accessible
only as a service, have the classic lock-in problem, including <a href="https://arxiv.org/pdf/2307.09009">silent
degradation</a>. That changed when I learned I can run models close
to the state-of-the-art on my own hardware — the exact opposite of vendor
lock-in.</p>

<p>This article is about running LLMs, not fine-tuning, and definitely not
training. It’s also only about <em>text</em>, and not vision, voice, or other
“multimodal” capabilities, which aren’t nearly so useful to me personally.</p>

<p>To run a LLM on your own hardware you need <strong>software</strong> and a <strong>model</strong>.</p>

<h3 id="the-software">The software</h3>

<p>I’ve exclusively used the <em>astounding</em> <a href="https://github.com/ggerganov/llama.cpp">llama.cpp</a>. Other options exist,
but for basic CPU inference — that is, generating tokens using a CPU
rather than a GPU — llama.cpp requires nothing beyond a C++ toolchain. In
particular, no Python fiddling that plagues much of the ecosystem. On
Windows it will be a 5MB <code class="language-plaintext highlighter-rouge">llama-server.exe</code> with no runtime dependencies.
From just two files, EXE and GGUF (model), both designed to <a href="https://justine.lol/mmap/">load via
memory map</a>, you could likely still run the same LLM 25 years from
now, in exactly the same way, out-of-the-box on some future Windows OS.</p>

<p>Full disclosure: I’m biased because <a href="https://github.com/ggerganov/llama.cpp/blob/ec450d3b/docs/build.md">the official Windows build process is
w64devkit</a>. What can I say? These folks have good taste! That being
said, you should only do CPU inference if GPU inference is impractical. It
works reasonably up to ~10B parameter models on a desktop or laptop, but
it’s slower. My primary use case is not built with w64devkit because I’m
using CUDA for inference, which requires a MSVC toolchain. Just for fun, I
ported llama.cpp to Windows XP and ran <a href="https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct">a 360M model</a> on a 2008-era
laptop. It was magical to load that old laptop with technology that, at
the time it was new, would have been worth billions of dollars.</p>

<p>The bottleneck for GPU inference is video RAM, or VRAM. These models are,
well, <em>large</em>. The more RAM you have, the larger the model and the longer
the context window. Larger models are smarter, and longer contexts let you
process more information at once. <strong>GPU inference is not worth it below
8GB of VRAM</strong>. If <a href="https://huggingface.co/spaces/k-mktr/gpu-poor-llm-arena">“GPU poor”</a>, stick with CPU inference. On the
plus side, it’s simpler and easier to get started with CPU inference.</p>

<p>There are many utilities in llama.cpp, but this article is concerned with
just one: <strong><code class="language-plaintext highlighter-rouge">llama-server</code> is the program you want to run.</strong> It’s an HTTP
server (default port 8080) with a chat UI at its root, and <a href="https://github.com/ggerganov/llama.cpp/blob/ec450d3b/examples/server/README.md#api-endpoints">APIs for use
by programs</a>, including other user interfaces. A typical invocation:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ llama-server --flash-attn --ctx-size 0 --model MODEL.gguf
</code></pre></div></div>

<p>The context size is the largest number of tokens the LLM can handle at
once, input plus output. Contexts typically range from 8K to 128K tokens,
and depending on the model’s tokenizer, normal English text is ~1.6 tokens
per word as counted by <code class="language-plaintext highlighter-rouge">wc -w</code>. If the model supports a large context you
may run out of memory. If so, set a smaller context size, like <code class="language-plaintext highlighter-rouge">--ctx-size
$((1&lt;&lt;13))</code> (i.e. 8K tokens).</p>
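<p>That rule of thumb makes rough context budgeting scriptable (the 1.6
factor is approximate and varies by tokenizer):</p>

```shell
# Estimate tokens from a word count using the ~1.6 tokens/word heuristic.
# The sample sentence stands in for a real document fed to the model.
words=$(echo "the quick brown fox jumps over the lazy dog" | wc -w)
echo "words: $words, approx tokens: $((words * 16 / 10))"
```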

<p>I do not yet understand what flash attention is about, and I don’t know
why <code class="language-plaintext highlighter-rouge">--flash-attn</code>/<code class="language-plaintext highlighter-rouge">-fa</code> is not the default (lower accuracy?), but you
should always request it because it reduces memory requirements when
active and is well worth the cost.</p>

<p>If the server started successfully, visit it (<a href="http://localhost:8080/">http://localhost:8080/</a>) to
try it out. Though of course you’ll need a model first.</p>

<h3 id="the-models">The models</h3>

<p><a href="https://huggingface.co/">Hugging Face</a> (HF) is “the GitHub of LLMs.” It’s an incredible
service that has earned that title. “Small” models are around a few GBs,
large models are hundreds of GBs, and HF <em>hosts it all for free</em>. With a
few exceptions that do not matter in practice, you don’t even need to sign
up to download models! (I’ve been so impressed that after a few days they
got a penny-pincher like me to pay for a pro account.) That means you can
immediately download and try any of the stuff I’m about to discuss.</p>

<p>If you look now, you’ll wonder, “There’s a lot of stuff here, so what the
heck am I supposed to download?” That was me one month ago. For llama.cpp,
the answer is <a href="https://github.com/ggerganov/ggml/blob/8a3d7994/docs/gguf.md">GGUF</a>. None of the models are natively in GGUF.
Instead GGUFs are in a repository with “GGUF” in the name, usually by a
third party: one of the heroic, prolific GGUF quantizers.</p>

<p>(Note how nowhere does the official documentation define what “GGUF”
stands for. Get used to that. This is a technological frontier, and if the
information exists at all, it’s not in the obvious place. If you’re
considering asking your LLM about this once it’s running: Sweet summer
child, we’ll soon talk about why that doesn’t work. As far as I can tell,
“GGUF” has no authoritative definition (<strong>update</strong>: <a href="https://github.com/ggerganov/ggml/issues/220">the U stands for
“Unified”</a>, but the rest is still ambiguous).)</p>

<p>Since llama.cpp is named after Meta’s flagship model, their model is a
reasonable start, though it’s not my personal favorite. The latest is
Llama 3.2, but at the moment only the 1B and 3B models — that is, ~1
billion and ~3 billion parameters — work in llama.cpp. Those are a little
<em>too</em> small to be of much use, and your computer can likely do better if
it’s not a Raspberry Pi, even with CPU inference. Llama 3.1 8B is a better
option. (If you’ve got at least 24GB of VRAM then maybe you can even do
Llama 3.1 70B.)</p>

<p>If you search for Llama 3.1 8B you’ll find two options, one qualified
“instruct” and one with no qualifier. Instruct means it was trained to
follow instructions, i.e. to chat, and that’s nearly always what you want.
The other is the “base” model which can only continue a text. (Technically
the instruct model is still just completion, but we’ll get to that later.)
It would be great if base models were qualified “Base” but, for dumb path
dependency reasons, they’re usually not.</p>

<p>You will not find GGUF in the “Files” for the instruct model, nor can you
download the model without signing up in order to agree to the community
license. Go back to the search, add GGUF, and look for the matching GGUF
model: <a href="https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF">bartowski/Meta-Llama-3.1-8B-Instruct-GGUF</a>. bartowski is
one of the prolific and well-regarded GGUF quantizers. Not only will this
be in the right format for llama.cpp, you won’t need to sign up.</p>

<p>In “Files” you will now see many GGUFs. These are different quantizations
of the same model. The original model has <a href="https://en.wikipedia.org/wiki/Bfloat16_floating-point_format">bfloat16</a> tensors, but for
merely running the model we can throw away most of that precision with
minimal damage. It will be a tiny bit dumber and less knowledgeable, but
will require substantially fewer resources. <strong>The general recommendation,
which fits my experience, is to use <code class="language-plaintext highlighter-rouge">Q4_K_M</code></strong>, a 4-bit quantization. In
general, it’s better to run a 4-bit quant of a larger model than an 8-bit quant
of a smaller model. Once you’ve got the basics understood, experiment with
different quants and see what you like!</p>

<h3 id="my-favorite-models">My favorite models</h3>

<p>Models are trained for different trade-offs and differ in strengths and
weaknesses, so no model is best at everything — especially on “GPU-poor”
configurations. My desktop system has an RTX 3050 Ti with 8GB VRAM, and
its limitations have shaped my choices. I can comfortably run ~10B models,
and ~30B models just barely, enough to test their capabilities. For ~70B I
rely on third-party hosts. My “t/s” numbers are all on this system running
4-bit quants.</p>

<p>This list omits “instruct” from the model name, but assume the instruct
model unless I say otherwise. A few are <em>bona fide</em> open source, at least
as far as LLMs practically can be, and I’ve noted the license when that’s
the case. The rest place restrictions on both use and distribution.</p>

<ul>
  <li>
    <p>Mistral-Nemo-2407 (12B) [Apache 2.0]</p>

    <p>A collaboration between <a href="https://mistral.ai/">Mistral AI</a> and Nvidia (“Nemo”), the
most well-rounded ~10B model I’ve used, and my default. Inference starts
at a comfortable 30 t/s. Its strengths are writing and proofreading,
and it can review code nearly as well as ~70B models. It was trained for
a context length of 128K, but its <a href="https://github.com/NVIDIA/RULER">effective context length is closer to
16K</a> — a limitation I’ve personally observed.</p>

    <p>The “2407” is a date (July 2024) as version number, a versioning scheme
I wholeheartedly support. A date tells you about its knowledge cut-off
and tech level. It sorts well. Otherwise LLM versioning is a mess. Just
as open source is bad with naming, AI companies do not comprehend
versioning.</p>
  </li>
  <li>
    <p>Qwen2.5-14B [Apache 2.0]</p>

    <p>Qwen models, by Alibaba Cloud, impressively punch above their weight at
all sizes. 14B inference starts at 11 t/s, with capabilities on par with
Mistral Nemo. If I could run 72B on my own hardware, it would probably
be my default. I’ve been trying it through Hugging Face’s inference API.
There’s a 32B model, but it’s impractical for my hardware, so I haven’t
spent much time with it.</p>
  </li>
  <li>
    <p>Gemma-2-2B</p>

    <p>Google’s model is popular, perhaps due to its playful demeanor. For me,
the 2B model <a href="https://github.com/skeeto/scratch/blob/master/userscript/reddit-llm-translate.user.js">is great for fast translation</a>. It’s amazing that LLMs
have nearly obsoleted Google Translate, and you can run it on your home
computer. Though it’s more resource-intensive, and refuses to translate
texts it finds offensive, which sounds like a plot element from a sci-fi
story. In my translation script, I send it text marked up with HTML.
Simply <em>asking</em> Gemma to preserve the markup Just Works! The 9B model is
even better, but slower, and I’d use it instead of 2B for translating my
own messages into another language.</p>
  </li>
  <li>
    <p>Phi3.5-Mini (4B) [MIT]</p>

    <p>Microsoft’s niche is training on synthetic data. The result is a model
that does well in tests, but doesn’t work so well in practice. For me,
its strength is document evaluation. I’ve loaded the context with up to
40K-token documents — it helps that it’s a 4B model — and successfully
queried accurate summaries and data listings.</p>
  </li>
  <li>
    <p>SmolLM2-360M [Apache 2.0]</p>

    <p>Hugging Face doesn’t just host models; their 360M model is unusually
good for its size. It fits on my 2008-era, 1G RAM, Celeron, and 32-bit
operating system laptop. It also runs well on older Raspberry Pis. It’s
creative, fast, converses competently, can write poetry, and is a fun toy
in cramped spaces.</p>
  </li>
  <li>
    <p>Mixtral-8x7B (48B) [Apache 2.0]</p>

    <p>Another Mistral AI model, and more of a runner-up. 48B seems too large,
but this is a <a href="https://mistral.ai/news/mixtral-of-experts/">Mixture of Experts</a> (MoE) model. Inference uses only
13B parameters at a time. It’s reasonably-suited to CPU inference on a
machine with at least 32G of RAM. The model retains more of its training
inputs, more like a database, but for reasons we’ll see soon, it isn’t
as useful as it might seem.</p>
  </li>
  <li>
    <p>Llama-3.1-70B and Llama-3.1-Nemotron-70B</p>

    <p>More models I cannot run myself, but which I access remotely. The latter
bears “Nemo” because it’s an Nvidia fine-tune. If I could run 70B models
myself, Nemotron might just be my default. I’d need to spend more time
evaluating it against Qwen2.5-72B.</p>
  </li>
</ul>

<p>Most of these models have <a href="https://huggingface.co/blog/mlabonne/abliteration">abliterated</a> or “uncensored” versions, in
which refusal is partially fine-tuned out at a cost of model degradation.
Refusals are annoying — such as Gemma refusing to translate texts it
dislikes — but they don’t happen often enough for me to make that
trade-off. Maybe
I’m just boring. Also refusals seem to decrease with larger contexts, as
though “in for a penny, in for a pound.”</p>

<p>The next group are “coder” models trained for programming. In particular,
they have <em>fill-in-the-middle</em> (FIM) training for generating code inside
an existing program. I’ll discuss what that entails in a moment. As far as
I can tell, they’re no better at code review or other instruct-oriented
tasks. It’s the opposite: FIM training is done in the base model, with
instruct training applied later on top, so instruct works <em>against</em> FIM!
In other words, <strong>base model FIM outputs are markedly better</strong>, though you
lose the ability to converse with them.</p>

<p>There will be a section on evaluation later, but I want to note now that
<em>LLMs produce mediocre code</em>, even at the state-of-the-art. The rankings
here are relative to other models, not about overall capability.</p>

<ul>
  <li>
    <p>DeepSeek-Coder-V2-Lite (16B)</p>

    <p>A self-titled MoE model from <a href="https://www.deepseek.com/">DeepSeek</a>. It uses 2B parameters
during inference, making it as fast as Gemma 2 2B but as smart as
Mistral Nemo, striking a great balance, especially because it
out-competes ~30B models at code generation. If I’m playing around with
FIM, this is my default choice.</p>
  </li>
  <li>
    <p>Qwen2.5-Coder-7B [Apache 2.0]</p>

    <p>Qwen Coder is a close second. Output is nearly as good, but slightly
slower since it’s not MoE. It’s a better choice than DeepSeek if you’re
memory-constrained. While writing this article, Alibaba Cloud released a
new Qwen2.5-Coder-7B but failed to increment the version number, which
is horribly confusing. The community has taken to calling it Qwen2.5.1.
Remember what I said about AI companies and versions? (<strong>Update</strong>: One
day after publication, 14B and 32B coder models were released. I tried both,
and neither are quite as good as DeepSeek-Coder-V2-Lite, so my rankings
are unchanged.)</p>
  </li>
  <li>
    <p>Granite-8B-Code [Apache 2.0]</p>

    <p>IBM’s line of models is named Granite. In general Granite models are
disappointing, <em>except</em> that they’re unusually good at FIM. It’s tied
in second place with Qwen2.5 7B in my experience.</p>
  </li>
</ul>

<p>I also evaluated CodeLlama, CodeGemma, Codestral, and StarCoder. Their FIM
outputs were so poor as to be effectively worthless at that task, and I
found no reason to use these models. The negative effects of instruct
training were most pronounced for CodeLlama.</p>

<h3 id="the-user-interfaces">The user interfaces</h3>

<p>I pointed out llama.cpp’s built-in UI, and I’d used similar UIs with other
LLM software. As is typical, no UI is to my liking, especially in matters
of productivity, so I built my own, <strong><a href="https://github.com/skeeto/illume">Illume</a></strong>. This command
line program converts standard input into an API query, makes the query,
and streams the response to standard output. Should be simple enough to
integrate into any extensible text editor, but I only needed it for Vim.
Vimscript is miserable, probably the second worst programming language
I’ve ever touched, so my goal was to write as little as possible.</p>

<p>I created Illume to scratch my own itch, to support my exploration of the
LLM ecosystem. I actively break things and add features as needed, and I
make no promises about interface stability. <em>You probably don’t want to
use it.</em></p>

<p>Lines that begin with <code class="language-plaintext highlighter-rouge">!</code> are directives interpreted by Illume, chosen
because it’s unlikely to appear in normal text. A conversation alternates
between <code class="language-plaintext highlighter-rouge">!user</code> and <code class="language-plaintext highlighter-rouge">!assistant</code> in a buffer.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>!user
Write a Haiku about time travelers disguised as frogs.

!assistant
Green, leaping through time,
Frog tongues lick the future's rim,
Disguised in pond's guise.
</code></pre></div></div>

<p>It’s still a text editor buffer, so I can edit the assistant response,
reword my original request, etc. before continuing the conversation. For
composing fiction, I can request it to continue some text (which does not
require instruct training):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>!completion
Din the Wizard stalked the dim castle
</code></pre></div></div>

<p>I can stop it, make changes, add my own writing, and keep going. I ought
to spend more time practicing with it. If you introduce out-of-story note
syntax, the LLM will pick up on it, and then you can use notes to guide
the LLM’s writing.</p>

<p>While the main target is llama.cpp, I query different APIs, implemented by
different LLM software, with incompatibilities across APIs (a parameter
required by one API is forbidden by another), so directives must be
flexible and powerful: they can set arbitrary HTTP and JSON
parameters. Illume doesn’t try to abstract the API, but exposes it at a
low level, so effective use requires knowing the remote API. For example,
the “profile” for talking to llama.cpp looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>!api http://localhost:8080/v1
!:cache_prompt true
</code></pre></div></div>

<p>Where <code class="language-plaintext highlighter-rouge">cache_prompt</code> is a llama.cpp-specific JSON parameter (<code class="language-plaintext highlighter-rouge">!:</code>). Prompt
caching is nearly always better enabled, yet for some reason it’s disabled by
default. Other APIs refuse requests with this parameter, so then I must
omit or otherwise disable it. The Hugging Face “profile” looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>!api https://api-inference.huggingface.co/models/{model}/v1
!:model Qwen/Qwen2.5-72B-Instruct
!&gt;x-use-cache false
</code></pre></div></div>

<p>For the sake of HF, Illume can interpolate JSON parameters into the URL.
The HF API also caches aggressively. I never want this, so I supply
an HTTP parameter (<code class="language-plaintext highlighter-rouge">!&gt;</code>) to turn it off.</p>
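
<p>As a concrete illustration, the directive grammar reduces to a few
lines of code. Here’s a minimal Python sketch of a parser for the three
directives shown above (a hypothetical reimplementation for illustration,
not Illume’s actual source):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json

def parse(text):
    # "!api" sets the endpoint, "!:" a JSON body parameter, "!&gt;" an
    # HTTP header; everything else accumulates as the prompt.
    api, body, headers, prompt = None, {}, {}, []
    for line in text.splitlines():
        if line.startswith("!api "):
            api = line[5:].strip()
        elif line.startswith("!:"):
            key, _, value = line[2:].partition(" ")
            body[key] = scalar(value)
        elif line.startswith("!&gt;"):
            key, _, value = line[2:].partition(" ")
            headers[key] = value.strip()
        else:
            prompt.append(line)
    return api, body, headers, "\n".join(prompt)

def scalar(s):
    try:
        return json.loads(s)  # true, false, numbers, quoted strings
    except ValueError:
        return s.strip()      # bare strings pass through as-is
</code></pre></div></div>

<p>Fed either profile above, it yields the endpoint, a JSON body, and the
extra headers, ready to hand to any HTTP client.</p>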

<p>Unique to llama.cpp is an <code class="language-plaintext highlighter-rouge">/infill</code> endpoint for FIM. It requires a model
with extra metadata, trained a certain way, but such metadata is usually
missing. So while Illume can use <code class="language-plaintext highlighter-rouge">/infill</code>, I also added FIM configuration
so, after reading the model’s documentation and configuring Illume for
that model’s FIM behavior, I can do FIM completion through the normal
completion API on any FIM-trained model, even on non-llama.cpp APIs.</p>

<h3 id="fill-in-the-middle-fim-tokens">Fill-in-the-Middle (FIM) tokens</h3>

<p>It’s time to discuss FIM. To get to the bottom of FIM I needed to go to
the source of truth, the original FIM paper: <a href="https://arxiv.org/abs/2207.14255">Efficient Training of
Language Models to Fill in the Middle</a>. This allowed me to understand
how these models are FIM-trained, at least enough to put that training to
use. Even so, model documentation tends to be thin on FIM because they
expect you to run their code.</p>

<p>Ultimately an LLM can only predict the next token. So pick some special
tokens that don’t appear in inputs, use them to delimit a prefix, suffix,
and middle — ordered prefix-suffix-middle (PSM), or sometimes
suffix-prefix-middle (SPM)
— in a large training corpus. Later in inference we can use those tokens
to provide a prefix and suffix, and let it “predict” the middle. Crazy, but
<em>this actually works!</em></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;PRE&gt;{prefix}&lt;SUF&gt;{suffix}&lt;MID&gt;
</code></pre></div></div>

<p>For example when filling the parentheses of <code class="language-plaintext highlighter-rouge">dist = sqrt(x*x + y*y)</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;PRE&gt;dist = sqrt(&lt;SUF&gt;)&lt;MID&gt;x*x + y*y
</code></pre></div></div>

<p>To have the LLM fill in the parentheses, we’d stop at <code class="language-plaintext highlighter-rouge">&lt;MID&gt;</code> and let the
LLM predict from there. Note how <code class="language-plaintext highlighter-rouge">&lt;SUF&gt;</code> is essentially the cursor. By the
way, this is basically how instruct training works, but instead of prefix
and suffix, special tokens delimit instructions and conversation.</p>
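
<p>For comparison, here is the ChatML-style chat template used by several
instruct models (Qwen among them); the special tokens delimit conversation
turns the same way FIM tokens delimit code regions:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;|im_start|&gt;user
Write a haiku.&lt;|im_end|&gt;
&lt;|im_start|&gt;assistant
</code></pre></div></div>

<p>The model’s reply follows, terminated by its end-of-turn token.</p>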

<p>Some LLM folks interpret the paper quite literally and use <code class="language-plaintext highlighter-rouge">&lt;PRE&gt;</code>, etc.
for their FIM tokens, although these look nothing like their other special
tokens. More thoughtful trainers picked <code class="language-plaintext highlighter-rouge">&lt;|fim_prefix|&gt;</code>, etc. Illume
accepts FIM templates, and I wrote templates for the popular models. For
example, here’s Qwen (PSM):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;|fim_prefix|&gt;{prefix}&lt;|fim_suffix|&gt;{suffix}&lt;|fim_middle|&gt;
</code></pre></div></div>

<p>Mistral AI prefers square brackets, SPM, and no “middle” token:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[SUFFIX]{suffix}[PREFIX]{prefix}
</code></pre></div></div>

<p>With these templates I could access the FIM training in models unsupported
by llama.cpp’s <code class="language-plaintext highlighter-rouge">/infill</code> API.</p>
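
<p>Assembling the prompt from a template is plain string substitution. A
short sketch (the template strings are verbatim from above; the function
and its names are hypothetical):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Map a model family to its FIM template: PSM for Qwen, SPM for Mistral.
TEMPLATES = {
    "qwen":    "&lt;|fim_prefix|&gt;{prefix}&lt;|fim_suffix|&gt;{suffix}&lt;|fim_middle|&gt;",
    "mistral": "[SUFFIX]{suffix}[PREFIX]{prefix}",
}

def fim_prompt(family, before_cursor, after_cursor):
    # Text before the cursor is the prefix, text after is the suffix;
    # the model predicts the middle from where the template ends.
    return TEMPLATES[family].format(prefix=before_cursor, suffix=after_cursor)
</code></pre></div></div>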

<p>Besides just failing the prompt, the biggest problem I’ve had with FIM is
LLMs not knowing when to stop. For example, if I ask it to fill out this
function (i.e. assign something to <code class="language-plaintext highlighter-rouge">r</code>):</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">norm</span><span class="p">(</span><span class="n">x</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="n">y</span><span class="p">:</span> <span class="nb">float</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">float</span><span class="p">:</span>
    <span class="k">return</span> <span class="n">r</span>
</code></pre></div></div>

<p>(Side note: Static types, including the hints here, produce better results
from LLMs, acting as guardrails.) It’s not unusual to get something like:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">norm</span><span class="p">(</span><span class="n">x</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="n">y</span><span class="p">:</span> <span class="nb">float</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">float</span><span class="p">:</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">sqrt</span><span class="p">(</span><span class="n">x</span><span class="o">*</span><span class="n">x</span> <span class="o">+</span> <span class="n">y</span><span class="o">*</span><span class="n">y</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">r</span>

<span class="k">def</span> <span class="nf">norm3</span><span class="p">(</span><span class="n">x</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="n">y</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="n">z</span><span class="p">:</span> <span class="nb">float</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">float</span><span class="p">:</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">sqrt</span><span class="p">(</span><span class="n">x</span><span class="o">*</span><span class="n">x</span> <span class="o">+</span> <span class="n">y</span><span class="o">*</span><span class="n">y</span> <span class="o">+</span> <span class="n">z</span><span class="o">*</span><span class="n">z</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">r</span>

<span class="k">def</span> <span class="nf">norm4</span><span class="p">(</span><span class="n">x</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="n">y</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="n">z</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="n">w</span><span class="p">:</span> <span class="nb">float</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">float</span><span class="p">:</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">sqrt</span><span class="p">(</span><span class="n">x</span><span class="o">*</span><span class="n">x</span> <span class="o">+</span> <span class="n">y</span><span class="o">*</span><span class="n">y</span> <span class="o">+</span> <span class="n">z</span><span class="o">*</span><span class="n">z</span> <span class="o">+</span> <span class="n">w</span><span class="o">*</span><span class="n">w</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">r</span>
</code></pre></div></div>

<p>Where the original <code class="language-plaintext highlighter-rouge">return r</code> became the return for <code class="language-plaintext highlighter-rouge">norm4</code>. Technically
it fits the prompt, but it’s obviously not what I want. So be ready to
mash the “stop” button when it gets out of control. The three coder models
I recommended exhibit this behavior less often. It might be more robust to
combine it with a non-LLM system that understands the code semantically
and automatically stops generation when the LLM begins generating tokens
in a higher scope. That would make more coder models viable, but this goes
beyond my own fiddling.</p>
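
<p>For an indentation-structured language like Python, even a crude
version of that stop condition helps: halt once the model emits a
non-blank line indented to the left of the fill point, since it has
escaped into an enclosing scope. A hypothetical sketch, not something
Illume implements:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def escaped_scope(line, fill_indent):
    # True when a generated line has left the scope where the fill
    # began: non-blank and indented left of the fill point. Blank
    # lines prove nothing and are ignored.
    stripped = line.lstrip(" ")
    if not stripped:
        return False
    return len(line) - len(stripped) &lt; fill_indent
</code></pre></div></div>

<p>Against the runaway completion above, with a fill indent of four, this
fires at the second <code class="language-plaintext highlighter-rouge">def</code> and generation stops there.</p>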

<p>Figuring out FIM and putting it into action revealed to me that FIM is
still in its early stages, and hardly anyone is generating code via FIM. I
guess everyone’s just using plain old completion?</p>

<h3 id="so-what-are-llms-good-for">So what are LLMs good for?</h3>

<p>LLMs are fun, but what productive uses do they have? That’s a question
I’ve been trying to answer this past month, and it’s come up shorter than
I hoped. It might be useful to establish boundaries — tasks that LLMs
definitely cannot do.</p>

<p>First, <strong>LLMs are no good if correctness cannot be readily verified</strong>.
They are untrustworthy hallucinators. Often, if you’re in a position to
verify LLM output, you didn’t need it in the first place. This is why
Mixtral, with its large “database” of knowledge, isn’t so useful. It also
means it’s <em>reckless and irresponsible to inject LLM output into search
results</em> — just shameful.</p>

<p>LLM enthusiasts, who ought to know better, fall into this trap anyway and
propagate hallucinations. It makes discourse around LLMs less trustworthy
than normal, and I need to approach LLM information with extra skepticism.
Case in point: Recall how “GGUF” doesn’t have an authoritative definition.
Search for one and you’ll find an obvious hallucination that made it all
the way into official IBM documentation. I won’t repeat it here so as not to
make things worse.</p>

<p>Second, <strong>LLMs have goldfish-sized working memory</strong>. That is, they’re held
back by small context lengths. Some models are trained on larger contexts,
but their <a href="https://github.com/NVIDIA/RULER">effective context length</a> is usually much smaller. In
practice, an LLM can hold several book chapters worth of comprehension “in
its head” at a time. For code it’s 2k or 3k lines (code is token-dense).
That’s the most you can work with at once. Compared to a human, it’s tiny.
There are tools like <a href="https://en.wikipedia.org/wiki/Retrieval-augmented_generation">retrieval-augmented generation</a> and fine-tuning
to mitigate it… <em>slightly</em>.</p>

<p>Third, <strong>LLMs are poor programmers</strong>. At best they write code like an
undergraduate student who’s read a lot of documentation. That sounds
better than it is. The typical fresh graduate enters the workforce knowing
practically nothing about software engineering. Day one on the job is the
first day of their <a href="/blog/2016/09/02/">real education</a>. In that sense, LLMs today
haven’t even begun their education.</p>

<p>To be fair, that LLMs work as well as they do is amazing! Thrown into the
middle of a program in <a href="/blog/2023/10/08/">my unconventional style</a>, LLMs figure it out
and make use of the custom interfaces. (Caveat: My code and writing are in
the training data of most of these LLMs.) So the more context, the better,
within the effective context length. The challenge is getting something
useful out of an LLM in less time than writing it myself.</p>

<p><em>Writing new code is the easy part</em>. The hard part is maintaining code,
and writing new code with that maintenance in mind. Even when an LLM
produces code that works, there’s no thought to maintenance, nor could
there be. In general the reliability of generated code follows the inverse
square law by length, and generating more than a dozen lines at a time is
fraught. I really tried, but never saw LLM output beyond 2–3 lines of code
which I would consider acceptable.</p>

<p>Quality varies substantially by language. LLMs are better at Python than
C, and better at C than assembly. I suspect it’s related to the difficulty
of the language and the quality of the input. It’s trained on lots of
terrible C — the internet is loaded with it after all — and probably the
only labeled x86 assembly it’s seen is crummy beginner tutorials. Ask it
to use SDL2 and it <a href="/blog/2023/01/08/">reliably produces the common mistakes</a> because
it’s been trained to do so.</p>

<p>What about boilerplate? That’s something an LLM could probably do with a
low error rate, and perhaps there’s merit to it. Though the fastest way to
deal with boilerplate is to not write it at all. Change your problem to
not require boilerplate.</p>

<p>Without taking my word for it, consider how it shows up in the economics:
If AI companies could deliver the productivity gains they claim, they
wouldn’t sell AI. They’d keep it to themselves and gobble up the software
industry. Or consider the software products produced by companies on the
bleeding edge of AI. It’s still the same old, bloated web garbage everyone
else is building. (My LLM research has involved navigating their awful web
sites, and it’s made me bitter.)</p>

<p>In code generation, hallucinations are less concerning. You already knew
what you wanted when you asked, so you can review it, and your compiler
will help catch problems you miss (e.g. calling a hallucinated method).
However, small context and poor code generation remain roadblocks, and I
haven’t yet made this work effectively.</p>

<p>So then, what can I do with LLMs? A list is apt because LLMs love lists:</p>

<ul>
  <li>
    <p>Proofreading has been most useful for me. I give it a document such as
an email or this article (~8,000 tokens), tell it to look over grammar,
call out passive voice, and so on, and suggest changes. I accept or
reject its suggestions and move on. Most suggestions will be poor, and
this very article was long enough that even ~70B models suggested
changes to hallucinated sentences. Regardless, there’s signal in the
noise, and it fits within the limitations outlined above. I’m still
trying to apply this technique (“find bugs, please”) to code review, but
so far success is elusive.</p>
  </li>
  <li>
    <p>Writing short fiction. Hallucinations are not a problem; they’re a
feature! Context lengths are the limiting factor, though perhaps you can
stretch it by supplying chapter summaries, also written by LLM. I’m
still exploring this. If you’re feeling lazy, tell it to offer you three
possible story branches at each turn, and you pick the most interesting.
Or even tell it to combine two of them! LLMs are clever and will figure
it out. Some genres work better than others, and concrete works better
than abstract. (I wonder if professional writers judge its writing as
poor as I judge its programming.)</p>
  </li>
  <li>
    <p>Generative fun. Have an argument with Benjamin Franklin (note: this
probably violates the <a href="https://ai.meta.com/llama/use-policy/">Acceptable Use Policy</a> of some models), hang
out with a character from your favorite book, or generate a new scene of
<a href="/blog/2023/06/22/#76-henry-iv">Falstaff’s blustering antics</a>. Talking to historical figures
has been educational: The character says something unexpected, I look it
up the old-fashioned way to see what it’s about, then learn something
new.</p>
  </li>
  <li>
    <p>Language translation. I’ve been browsing foreign language subreddits
through Gemma-2-2B translation, and it’s been insightful. (I had no idea
German speakers were so distrustful of artificial sweeteners.)</p>
  </li>
</ul>

<p>Despite the short list of useful applications, this is the most excited
I’ve been about a new technology in years!</p>

]]>
    </content>
  </entry>
  <entry>
    <title>I solved the Dandelions paper-and-pencil game</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2022/10/12/"/>
    <id>urn:uuid:14edf491-dcdd-4c2f-a75f-5e89838e6b40</id>
    <updated>2022-10-12T03:02:27Z</updated>
    <category term="c"/><category term="game"/><category term="ai"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p>I’ve been reading <a href="https://mathwithbaddrawings.com/2022/01/19/math-games-with-bad-drawings-2/"><em>Math Games with Bad Drawings</em></a>, a great book
well-aligned to my interests. It’s given me a lot of new, interesting
programming puzzles to consider. The first to truly nerd snipe me was
<a href="https://mathwithbaddrawings.com/dandelions/">Dandelions</a> (<a href="https://mathwithbaddrawings.com/wp-content/uploads/2020/06/game-5-dandelions-1.pdf">full rules</a>), an asymmetric paper-and-pencil game
invented by the book’s author, Ben Orlin. Just as with <a href="/blog/2020/10/19/">British Square two
years ago</a> — and essentially following the same technique — I wrote a
program that explores the game tree sufficiently to play either side
perfectly, “solving” the game in its standard 5-by-5 configuration.</p>

<p>The source: <strong><a href="https://github.com/skeeto/scratch/blob/master/misc/dandelions.c"><code class="language-plaintext highlighter-rouge">dandelions.c</code></a></strong></p>

<p>The game is played on a 5-by-5 grid where one player plays the dandelions,
the other plays the wind. Players alternate, dandelions placing flowers
and wind blowing in one of the eight directions, spreading seeds from all
flowers along the direction of the wind. Each side gets seven moves, and
the wind cannot blow in the same direction twice. The dandelions’ goal is
to fill the grid with seeds, and the wind’s goal is to prevent this.</p>

<p>Try playing a few rounds with a friend, and you will probably find that
dandelions is difficult, at least in your first games, as though it cannot
be won. However, my engine proves the opposite: <strong>The dandelions always
win with perfect play.</strong> In fact, it’s so lopsided that the dandelions’
first move is irrelevant. Every first move is winnable. If the dandelions
blunder, typically wind has one narrow chance to seize control, after
which wind probably wins with any (or almost any) move.</p>

<p>For reasons I’ll discuss later, I only solved the 5-by-5 game, and the
situation may be different for the 6-by-6 variant. Also, unlike British
Square, my engine does not exhaustively explore the entire game tree
because it’s far too large. Instead it does a minimax search to the bottom
of the tree and stops when it finds a branch where all leaves are wins for
the current player. Because of this, it cannot maximize the outcome —
winning as early as possible as dandelions or maximizing the number of
empty grid spaces as wind. I also can’t quantify the exact size of the tree.</p>

<p>Like with British Square, my game engine only has a crude user interface
for interactively exploring the game tree. While you can “play” it in a
sense, it’s not intended to be played. It also takes a few seconds to
initially explore the game tree, so wait for the <code class="language-plaintext highlighter-rouge">&gt;&gt;</code> prompt.</p>

<h3 id="bitboard-seeding">Bitboard seeding</h3>

<p>I used <a href="https://www.chessprogramming.org/Bitboards">bitboards</a> of course: a 25-bit bitboard for flowers, a 25-bit
bitboard for seeds, and an 8-bit set to track which directions the wind
has blown. It’s especially well-suited for this game since seeds can be
spread in parallel using bitwise operations. Shift the flower bitboard in
the direction of the wind four times, ORing it into the seeds bitboard
on each shift:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int wind;
uint32_t seeds, flowers;

flowers &gt;&gt;= wind;  seeds |= flowers;
flowers &gt;&gt;= wind;  seeds |= flowers;
flowers &gt;&gt;= wind;  seeds |= flowers;
flowers &gt;&gt;= wind;  seeds |= flowers;
</code></pre></div></div>

<p>Of course it’s a little more complicated than this. The flowers must be
masked to keep them from wrapping around the grid, and wind may require
shifting in the other direction. In order to “negative shift” I actually
use a rotation (notated with <code class="language-plaintext highlighter-rouge">&gt;&gt;&gt;</code> below). Consider, to rotate an N-bit
integer <em>left</em> by R, one can <em>right</em>-rotate it by <code class="language-plaintext highlighter-rouge">N-R</code> — ex. on a 32-bit
integer, a left-rotate by 1 is the same as a right-rotate by 31. So for a
negative <code class="language-plaintext highlighter-rouge">wind</code> that goes in the other direction:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>flowers &gt;&gt;&gt; (wind &amp; 31);
</code></pre></div></div>

<p>With such a “programmable shift” I can implement the bulk of the game
rules using a couple of tables and no branches:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// clockwise, east is zero
static int8_t rot[] = {-1, -6, -5, -4, +1, +6, +5, +4};
static uint32_t mask[] = {
    0x0f7bdef, 0x007bdef, 0x00fffff, 0x00f7bde,
    0x1ef7bde, 0x1ef7bc0, 0x1ffffe0, 0x0f7bde0
};
f &amp;= mask[dir];  f &gt;&gt;&gt;= rot[dir] &amp; 31;  s |= f;
f &amp;= mask[dir];  f &gt;&gt;&gt;= rot[dir] &amp; 31;  s |= f;
f &amp;= mask[dir];  f &gt;&gt;&gt;= rot[dir] &amp; 31;  s |= f;
f &amp;= mask[dir];  f &gt;&gt;&gt;= rot[dir] &amp; 31;  s |= f;
</code></pre></div></div>

<p>The masks clear out the column/row about to be shifted “out” so that it
doesn’t wrap around. Viewed in base-2, they’re 5-bit patterns repeated 5
times.</p>
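
<p>Those constants can be derived rather than hand-checked. A Python
sketch of where they come from: multiply a 5-bit column pattern by a
“repeat” constant to tile it across the five rows, and AND with a row
mask for the vertical component. (Which physical edge each mask guards
depends on the bit layout; the table above is the source of truth.)</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># REP tiles a 5-bit pattern across all five rows of the 25-bit board:
# REP = 1 + 32 + 32**2 + 32**3 + 32**4
REP = 0x108421
ALL = (1 &lt;&lt; 25) - 1

no_col_hi = 0b01111 * REP   # clear bit 4 of every 5-bit group
no_col_lo = 0b11110 * REP   # clear bit 0 of every 5-bit group
no_row_hi = ALL &gt;&gt; 5        # clear the group in bits 20-24
no_row_lo = ALL ^ 0b11111   # clear the group in bits 0-4

# Clockwise from east (index 0), matching the rot[] table:
masks = [no_col_hi, no_col_hi &amp; no_row_hi, no_row_hi,
         no_col_lo &amp; no_row_hi, no_col_lo, no_col_lo &amp; no_row_lo,
         no_row_lo, no_col_hi &amp; no_row_lo]
assert masks == [0x0f7bdef, 0x007bdef, 0x00fffff, 0x00f7bde,
                 0x1ef7bde, 0x1ef7bc0, 0x1ffffe0, 0x0f7bde0]
</code></pre></div></div>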

<h3 id="bitboard-packing-and-canonicalization">Bitboard packing and canonicalization</h3>

<p>The entire game state is two 25-bit bitboards and an 8-bit set. That’s 58
bits, which fits in a 64-bit integer with bits to spare. How incredibly
convenient! So I represent the game state using a 64-bit integer, using a
packing like I did with British Square. The bottom 25 bits are the seeds,
the next 25 bits are the flowers, and the next 8 bits are the wind set.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>000000 WWWWWWWW FFFFFFFFFFFFFFFFFFFFFFFFF SSSSSSSSSSSSSSSSSSSSSSSSS
</code></pre></div></div>

<p>Even more convenient, I could reuse my bitboard canonicalization code from
British Square, also a 5-by-5 grid packed in the same way, saving me the
trouble of working out all the bit sieves. I only had to figure out how to
transpose and flip the wind bitset. Turns out that’s pretty easy, too.
Here’s how I represent the 8 wind directions:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>567
4 0
321
</code></pre></div></div>

<p>Flipping this vertically I get:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>321
4 0
567
</code></pre></div></div>

<p>Unroll these to show how old maps onto new:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>old: 01234567
new: 07654321
</code></pre></div></div>

<p>The new is just the old rotated and reversed. Transposition is the same
story, just a different rotation. I use a small lookup table to reverse
the bits, and then an 8-bit rotation. (See <code class="language-plaintext highlighter-rouge">revrot</code>.)</p>
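
<p>The flip itself is tiny. A hypothetical Python sketch of the
reverse-then-rotate trick (the real <code class="language-plaintext highlighter-rouge">revrot</code> reverses via a lookup
table):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def rev8(b):
    # Reverse the 8 bits of b: direction d moves to 7 - d.
    return int(f"{b:08b}"[::-1], 2)

def flip_vertical(wind):
    # A vertical flip maps direction d to (8 - d) % 8: reverse the
    # bits, then rotate left by one.
    r = rev8(wind)
    return ((r &lt;&lt; 1) | (r &gt;&gt; 7)) &amp; 0xff
</code></pre></div></div>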

<p>To determine how many moves have been made, popcount the flower bitboard
and wind bitset.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int moves = POPCOUNT64(g &amp; 0x3fffffffe000000);
</code></pre></div></div>

<p>To test if dandelions have won:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int win = (g&amp;0x1ffffff) == 0x1ffffff;
</code></pre></div></div>

<p>Since the plan is to store all the game states in a big hash table — an
<a href="/blog/2022/08/08/">MSI double hash</a> in this case — I’d like to reserve the zero value
as a “null” board state. This lets me zero-initialize the hash table. To
do this, I invert the wind bitset such that a 1 indicates the direction is
still available. So the initial game state looks like this (in the real
program this is accounted for in the previously-discussed turn popcount):</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define GAME_INIT ((uint64_t)255 &lt;&lt; 50)
</span></code></pre></div></div>
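<p>With the wind bits inverted, the earlier popcount would over-count by the number of directions still unblown, so the adjusted move count looks something like this sketch (bit layout per the masks above; the helper itself is my illustration):</p>

```c
#include <stdint.h>

#define GAME_INIT ((uint64_t)255 << 50)

// Sketch of the adjusted turn count: flowers occupy bits 25-49, and the
// inverted wind bitset bits 50-57 (1 = direction still available).
// Moves = flowers placed + winds already blown.
static int moves(uint64_t g)
{
    int flowers = __builtin_popcountll(g & 0x3fffffe000000);
    int blown   = 8 - __builtin_popcountll(g & ((uint64_t)255 << 50));
    return flowers + blown;
}
```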

<p>The remaining 6 bits can be used to cache information about the rest of
the tree under this game state, namely who wins from this position, and this
serves as the “value” in the hash table. Turns out the bitboards are
already noisy enough that a <a href="/blog/2018/07/31/">single xorshift</a> makes for a great hash
function. The hash table, including hash function, is under a dozen lines
of code.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Find the hash table slot for the given game state.</span>
<span class="kt">uint64_t</span> <span class="o">*</span><span class="nf">lookup</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="o">*</span><span class="n">ht</span><span class="p">,</span> <span class="kt">uint64_t</span> <span class="n">g</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">hash</span> <span class="o">=</span> <span class="n">g</span> <span class="o">^</span> <span class="n">g</span><span class="o">&gt;&gt;</span><span class="mi">32</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">mask</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1L</span> <span class="o">&lt;&lt;</span> <span class="n">HASHTAB_EXP</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
    <span class="kt">size_t</span> <span class="n">step</span> <span class="o">=</span> <span class="n">hash</span><span class="o">&gt;&gt;</span><span class="p">(</span><span class="mi">64</span> <span class="o">-</span> <span class="n">HASHTAB_EXP</span><span class="p">)</span> <span class="o">|</span> <span class="mi">1</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">hash</span><span class="p">;;)</span> <span class="p">{</span>
        <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="n">step</span><span class="p">)</span><span class="o">&amp;</span><span class="n">mask</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">||</span> <span class="p">(</span><span class="n">ht</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">&amp;</span><span class="mh">0x3ffffffffffffff</span><span class="p">)</span> <span class="o">==</span> <span class="n">g</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="n">ht</span> <span class="o">+</span> <span class="n">i</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>To explore a 6-by-6 grid I’d need to change my representation, which is
part of why I didn’t do it. I can’t fit two 36-bit bitboards in a 64-bit
integer, so I’d need to double my storage requirements, which are already
strained.</p>

<h3 id="computational-limitations">Computational limitations</h3>

<p>Due to the way seeds spread, game states resulting from different moves
rarely converge back to a common state later in the tree, so the hash
table isn’t doing much deduplication. Exhaustively exploring the entire
game tree, even cutting it down to an 8th using canonicalization, requires
substantial computing resources, more than I personally have available for
this project. So I had to settle for a slightly weaker result: finding a
winning branch rather than maximizing a “score.”</p>

<p>I configure the program to allocate 2GiB for the hash table, but if you
run just a few dozen games off the same table (same program instance),
each exploring different parts of the game tree, you’ll exhaust this
table. A 6-by-6 grid doubles the memory requirements just to represent the
game, but it also slows the search and substantially increases the width
of the tree, which grows 44% faster. I’m sure it can be done, but it’s
just beyond the resources available to me.</p>

<h3 id="dandelion-puzzles">Dandelion Puzzles</h3>

<p>As a side effect, I wrote a small routine to randomly play out games in
search for “mate-in-two”-style puzzles. The dandelions have two flowers to
place and can force a win with two specific placements — and only those
two placements — regardless of how the wind blows. Here are two of the
better ones, each involving a small trick that I won’t give away here
(note: arrowheads indicate directions wind can still blow):</p>

<p><img src="/img/dandelions/puzzle1.svg" alt="" /></p>

<p><img src="/img/dandelions/puzzle2.svg" alt="" /></p>

<p>There are a variety of potential single-player puzzles of this form.</p>

<ul>
  <li>Cooperative: place a dandelion <em>and</em> pick the wind direction</li>
  <li>Avoidance: <em>don’t</em> seed a particular tile</li>
  <li>Hard ground: certain tiles can’t grow flowers (but still get seeded)</li>
  <li>Weeding: as wind, figure out which flower to remove before blowing</li>
</ul>

<p>There could be a whole “crossword book” of such dandelion puzzles.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>You might not need machine learning</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2020/11/24/"/>
    <id>urn:uuid:91aa121d-c796-4c11-99d4-41c707637672</id>
    <updated>2020-11-24T04:04:36Z</updated>
    <category term="ai"/><category term="c"/><category term="media"/><category term="compsci"/><category term="video"/>
    <content type="html">
      <![CDATA[<p><em>This article was discussed <a href="https://news.ycombinator.com/item?id=25196574">on Hacker News</a>.</em></p>

<p>Machine learning is a trendy topic, so naturally it’s often used for
inappropriate purposes where a simpler, more efficient, and more reliable
solution suffices. The other day I saw an illustrative and fun example of
this: <a href="https://www.youtube.com/watch?v=-sg-GgoFCP0">Neural Network Cars and Genetic Algorithms</a>. The video
demonstrates 2D cars driven by a neural network with weights determined by
a genetic algorithm. However, the entire scheme can be replaced by a
first-degree polynomial without any loss in capability. The machine
learning part is overkill.</p>

<p><a href="https://nullprogram.com/video/?v=racetrack"><img src="/img/screenshot/racetrack.jpg" alt="" /></a></p>

<!--more-->

<p>Above demonstrates my implementation using a polynomial to drive the cars.
My wife drew the background. There’s no path-finding; these cars are just
feeling their way along the track, “following the rails” so to speak.</p>

<p>My intention is not to pick on this project in particular. The likely
motivation in the first place was a desire to apply a neural network to
<em>something</em>. Many of my own projects are little more than a vehicle to try
something new, so I can sympathize. A professional setting is different,
though: there, machine learning deserves a more skeptical eye than it
usually gets. For instance, don’t use active learning to
select sample distribution when a <a href="http://extremelearning.com.au/unreasonable-effectiveness-of-quasirandom-sequences/">quasirandom sequence</a> will do.</p>

<p>In the video, the car has a limited turn radius, and minimum and maximum
speeds. (I’ve retained these constraints in my own simulation.) There are
five sensors — forward, forward-diagonals, and sides — each sensing the
distance to the nearest wall. These are fed into a 3-layer neural network,
and the outputs determine throttle and steering. Sounds pretty cool!</p>

<p><img src="/img/diagram/racecar.svg" alt="" /></p>

<p>A key feature of neural networks is that the outputs are a nonlinear
function of the inputs. However, steering a 2D car is simple enough that
<strong>a linear function is more than sufficient</strong>, and neural networks are
unnecessary. Here are my equations:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>steering = C0*input1 - C0*input3
throttle = C1*input2
</code></pre></div></div>

<p>I only need three of the original inputs — forward for throttle, and
diagonals for steering — and the driver has just two parameters, <code class="language-plaintext highlighter-rouge">C0</code> and
<code class="language-plaintext highlighter-rouge">C1</code>, the polynomial coefficients. Optimal values depend on the track
layout and car configuration, but in my simulation most values between 0
and 1 are good enough. It’s less a matter of crashing and more a matter
of navigating the course quickly.</p>
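<p>To emphasize just how small this driver is, here it is as a single function (a sketch with hypothetical names; only the two-coefficient rule comes from the equations above):</p>

```c
// The entire "driver" as a function: inputs are the three sensor
// distances, output is steering and throttle per the first-degree
// polynomial above. Names are hypothetical.
struct control { float steering, throttle; };

static struct control
drive(float left_diag, float forward, float right_diag, float c0, float c1)
{
    struct control c;
    c.steering = c0*left_diag - c0*right_diag; // steer toward the open side
    c.throttle = c1*forward;                   // faster when the way is clear
    return c;
}
```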

<p>The lengths of the red lines below are the driver’s three inputs:</p>

<video src="/vid/racecar.mp4" width="530" height="330" loop="" muted="" autoplay="" controls="">
</video>

<p>These polynomials are obviously much faster than a neural network, but
they’re also easy to understand and debug. I can confidently reason about
the entire range of possible inputs rather than worry about a trained
neural network <a href="https://arxiv.org/abs/1903.06638">responding strangely</a> to untested inputs.</p>

<p>Instead of doing anything fancy, my program generates the coefficients at
random to explore the space. If I wanted to generate a good driver for a
course, I’d run a few thousand of these and pick the coefficients that
complete the course in the shortest time. For instance, these coefficients
make for a fast, capable driver for the course featured at the top of the
article:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>C0 = 0.896336973, C1 = 0.0354805067
</code></pre></div></div>

<p>Many constants can complete the track, but some will be faster than
others. If I were developing a racing game using this as the AI, I’d not
just pick constants that successfully complete the track, but the ones
that do it quickly. Here’s what the spread can look like:</p>

<video src="/vid/racecars.mp4" width="530" height="330" loop="" muted="" autoplay="" controls="">
</video>
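<p>The selection loop itself fits in a few lines. In this sketch, <code>run_course</code> is a hypothetical stand-in for the real simulation and merely scores proximity to a target pair, so the loop can be exercised on its own:</p>

```c
#include <stdlib.h>

// Stand-in for the real simulation: a "lap time" where lower is better.
// Here it just measures distance to an arbitrary target pair.
static double run_course(double c0, double c1)
{
    double d0 = c0 - 0.9, d1 = c1 - 0.04;
    return d0*d0 + d1*d1;
}

struct best { double c0, c1, time; };

// Sample random coefficient pairs in (0, 1) and keep the fastest.
static struct best search(int trials, unsigned seed)
{
    struct best b = {0, 0, 1e30};
    srand(seed);
    for (int i = 0; i < trials; i++) {
        double c0 = rand() / (RAND_MAX + 1.0);
        double c1 = rand() / (RAND_MAX + 1.0);
        double t = run_course(c0, c1);
        if (t < b.time) {
            b.c0 = c0;
            b.c1 = c1;
            b.time = t;
        }
    }
    return b;
}
```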

<p>If you want to play around with this yourself, here’s my C source code
that implements this driving AI and <a href="/blog/2017/11/03/">generates the videos and images
above</a>:</p>

<p><strong><a href="https://github.com/skeeto/scratch/blob/master/aidrivers/aidrivers.c">aidrivers.c</a></strong></p>

<p>Racetracks are just images drawn in your favorite image editing program
using the colors documented in the source header.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>I Solved British Square</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2020/10/19/"/>
    <id>urn:uuid:c500b91a-046f-4320-8eff-9bc8f8443ef3</id>
    <updated>2020-10-19T19:32:52Z</updated>
    <category term="c"/><category term="game"/><category term="ai"/><category term="optimization"/>
    <content type="html">
      <![CDATA[<p><em>Update</em>: I <a href="/blog/2022/10/12/">solved another game</a> using essentially the same
technique.</p>

<p><a href="https://boardgamegeek.com/boardgame/3719/british-square">British Square</a> is a 1978 abstract strategy board game which I
recently discovered <a href="https://www.youtube.com/watch?v=PChKZbut3lM&amp;t=10m">from a YouTube video</a>. It’s well-suited to play
by pencil-and-paper, so my wife and I played a few rounds to try it out.
Curious about strategies, I searched online for analysis and found
nothing whatsoever, meaning I’d have to discover strategies for myself.
This is <em>exactly</em> the sort of problem that <a href="https://xkcd.com/356/">nerd snipes</a>, and so I
sunk a couple of evenings building an analysis engine in C — enough to
fully solve the game and play <em>perfectly</em>.</p>

<p><strong>Repository</strong>: <a href="https://github.com/skeeto/british-square"><strong>British Square Analysis Engine</strong></a>
(and <a href="https://github.com/skeeto/british-square/releases">prebuilt binaries</a>)</p>

<p><a href="/img/british-square/british-square.jpg"><img src="/img/british-square/british-square-thumb.jpg" alt="" /></a>
<!-- Photo credit: Kelsey Wellons --></p>

<!--more-->

<p>The game is played on a 5-by-5 grid with two players taking turns
placing pieces of their color. Pieces may not be placed on tiles
4-adjacent to an opposing piece, and as a special rule, the first player
may not play the center tile on the first turn. Players pass when they
have no legal moves, and the game ends when both players pass. The score
is the difference between the piece counts for each player.</p>
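<p>Under the bitboard representation described later in this post, that score reduces to popcounting each player’s 25-bit board and subtracting. A sketch (the layout is from the implementation section; the helper itself is my illustration):</p>

```c
#include <stdint.h>

// Score sketch under the engine's layout: first player in bits 0-24,
// second player in bits 25-49. Positive favors the first player.
static int score(uint64_t b)
{
    return __builtin_popcountll(b & 0x1ffffff)
         - __builtin_popcountll((b >> 25) & 0x1ffffff);
}
```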

<p>In the default configuration, my engine takes a few seconds to explore
the full game tree, then presents the <a href="https://en.wikipedia.org/wiki/Minimax">minimax</a> values for the
current game state along with the list of perfect moves. The UI allows
manually exploring down the game tree. It’s intended for analysis, but
there’s enough UI present to “play” against the AI should you so wish.
For some of my analysis I made small modifications to the program to
print or count game states matching certain conditions.</p>

<h3 id="game-analysis">Game analysis</h3>

<p>Not accounting for symmetries, there are 4,233,789,642,926,592 possible
playouts. In these playouts, the first player wins 2,179,847,574,830,592
(~51%), the second player wins 1,174,071,341,606,400 (~28%), and the
remaining 879,870,726,489,600 (~21%) are ties. It’s immediately obvious
the first player has a huge advantage.</p>

<p>Accounting for symmetries, there are 8,659,987 total game states. Of
these, 6,955 are terminal states, of which the first player wins 3,599
(~52%) and the second player wins 2,506 (~36%). This small number of
states is what allows the engine to fully explore the game tree in a few
seconds.</p>

<p>Most importantly: <strong>The first player can always win by two points.</strong> In
other words, it’s <em>not</em> like Tic-Tac-Toe where perfect play by both
players results in a tie. Due to the two-point margin, the first player
also has more room for mistakes and usually wins even without perfect
play. There are fewer opportunities to blunder, and a single blunder
usually results in a lower win score. The second player has a narrow
lane of perfect play, making it easy to blunder.</p>

<p>Below is the minimax analysis for the first player’s options. The number
is the first player’s score given perfect play from that point — i.e.
perfect play starts on the tiles marked “2”, and the tiles marked “0”
are blunders that lead to ties.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>11111
12021
10-01
12021
11111
</code></pre></div></div>

<p>The special center rule probably exists to reduce the first player’s
obvious advantage, but in practice it makes little difference. Without
the rule, the first player has an additional (fifth) branch for a win by
two points:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>11111
12021
10201
12021
11111
</code></pre></div></div>

<p>Improved alternative special rule: <strong>Bias the score by two in favor of
the second player.</strong> This fully eliminates the first player’s advantage,
perfect play by both sides results in a tie, and both players have a
narrow lane of perfect play.</p>

<p>The four tie openers are interesting because the reasoning does not
require computer assistance. If the first player opens on any of those
tiles, the second player can mirror each of the first player’s moves,
guaranteeing a tie. Note: The first player can still make mistakes that
result in a second-player win <em>if</em> the second player knows when to stop
mirroring.</p>

<p>One of my goals was to develop a heuristic so that even human players
can play perfectly from memory, as in Tic-Tac-Toe. Unfortunately I was
not able to develop any such heuristic, though I <em>was</em> able to prove
that <strong>a greedy heuristic — always claim as much territory as possible —
is often incorrect</strong> and, in some cases, leads to blunders.</p>

<h3 id="engine-implementation">Engine implementation</h3>

<p>As <a href="/blog/2017/04/27/">I’ve done before</a>, my engine represents the game using
<a href="https://www.chessprogramming.org/Bitboards">bitboards</a>. Each player has a 25-bit bitboard representing their
pieces. To make move validation more efficient, it also sometimes tracks
a “mask” bitboard where invalid moves have been masked. Updating all
bitboards is cheap (<code class="language-plaintext highlighter-rouge">place()</code>, <code class="language-plaintext highlighter-rouge">mask()</code>), as is validating moves
against the mask (<code class="language-plaintext highlighter-rouge">valid()</code>).</p>
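<p>A hedged sketch of what <code>place()</code> might look like under the packed layout described below (everything beyond that layout is an assumption):</p>

```c
#include <stdint.h>

// Sketch of place(): first player in bits 0-24, second in bits 25-49,
// 1-based turn counter in bits 50-55. Odd turns belong to the first
// player here; the real function's details may differ.
static uint64_t place(uint64_t b, int tile)
{
    int turn  = (int)(b >> 50);
    int shift = (turn & 1) ? 0 : 25;
    b |= (uint64_t)1 << (tile + shift);  // drop the piece
    return b + ((uint64_t)1 << 50);      // advance the turn counter
}
```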

<p>The longest possible game is 32 moves. This would <em>just</em> fit in 5 bits,
except that I needed a special “invalid” turn, making for a total of 33
distinct values. So I use 6 bits to store the turn counter.</p>

<p>Besides generally being unnecessary, the validation masks can be derived
from the main bitboards, so I don’t need to store them in the game tree.
That means I need 25 bits per player, and 6 bits for the counter: <strong>56
bits total</strong>. I pack these into a 64-bit integer. The first player’s
bitboard goes in the bottom 25 bits, the second player in the next 25
bits, and the turn counter in the 6 bits above those (bits 50-55). The
turn counter
starts at 1, so an all zero state is invalid. I exploit this in the hash
table so that zeroed slots are empty (more on this later).</p>

<p>In other words, the <em>empty</em> state is <code class="language-plaintext highlighter-rouge">0x4000000000000</code> (<code class="language-plaintext highlighter-rouge">INIT</code>) and zero
is the null (invalid) state.</p>

<p>Since the state is so small, rather than passing a pointer to a state to
be acted upon, bitboard functions return a new bitboard with the
requested changes… functional style.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="c1">// Compute bitboard+mask where first play is tile 6</span>
    <span class="c1">// -----</span>
    <span class="c1">// -X---</span>
    <span class="c1">// -----</span>
    <span class="c1">// -----</span>
    <span class="c1">// -----</span>
    <span class="kt">uint64_t</span> <span class="n">b</span> <span class="o">=</span> <span class="n">INIT</span><span class="p">;</span>
    <span class="kt">uint64_t</span> <span class="n">m</span> <span class="o">=</span> <span class="n">INIT</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">place</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="mi">6</span><span class="p">);</span>
    <span class="n">m</span> <span class="o">=</span> <span class="n">mask</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="mi">6</span><span class="p">);</span>
</code></pre></div></div>

<h4 id="minimax-costs">Minimax costs</h4>

<p>The engine uses minimax to propagate information up the tree. Since the
search extends to the very bottom of the tree, the minimax “heuristic”
evaluation function is the actual score, not an approximation, which is
why it’s able to play perfectly.</p>

<p>When <a href="/blog/2010/10/17/">I’ve used minimax before</a>, I built an actual tree data
structure in memory, linking states by pointer / reference. In this
engine there is no such linkage, and instead the links are computed
dynamically via the validation masks. Storing the pointers is more
expensive than computing their equivalents on the fly, <em>so I don’t store
them</em>. Therefore my game tree only requires 56 bits per node — or 64
bits in practice since I’m using a 64-bit integer. With only 8,659,987
nodes to store, that’s a mere 66MiB of memory! This analysis could have
easily been done on commodity hardware two decades ago.</p>

<p>What about the minimax values? Game scores range from -10 to 11: 22
distinct values. (That the first player can score up to 11 and the
second player at most 10 is another advantage to going first.) That’s 5
bits of information. However, I didn’t have this information up front,
and so I assumed a range from -25 to 25, which requires 6 bits.</p>

<p>There are still 8 spare bits left in the 64-bit integer, so I use 6 of
them for the minimax score. Rather than worry about two’s complement, I
bias the score to eliminate negative values before storing it. So the
minimax score rides along for free above the state bits.</p>
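<p>Packing and unpacking the biased score might look like this sketch (the exact field position within the spare bits is an assumption):</p>

```c
#include <stdint.h>

// Sketch: bias the minimax score by +25 so it is never negative, then
// store it in 6 of the 8 spare bits above the 56 state bits. The low
// 56 bits of the slot remain the game state itself.
static uint64_t set_score(uint64_t state, int score)
{
    return (state & 0xffffffffffffff) | (uint64_t)(score + 25) << 56;
}

static int get_score(uint64_t state)
{
    return (int)(state >> 56) - 25;
}
```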

<h4 id="hash-table-memoization">Hash table (memoization)</h4>

<p>The vast majority of game tree branches are redundant. Even without
taking symmetries into account, nearly all states are reachable from
multiple branches. Exploring all these redundant branches would take
centuries. If I run into a state I’ve seen before, I don’t want to
recompute it.</p>

<p>Once I’ve computed a result, I store it in a hash table so that I can
find it later. Since the state is just a 64-bit integer, I use <a href="/blog/2018/07/31/">an
integer hash function</a> to compute a starting index from which to
linearly probe an open addressing hash table. The <em>entire</em> hash table
implementation is literally a dozen lines of code:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span> <span class="o">*</span>
<span class="nf">lookup</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">bitboard</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="kt">uint64_t</span> <span class="n">table</span><span class="p">[</span><span class="n">N</span><span class="p">];</span>
    <span class="kt">uint64_t</span> <span class="n">mask</span> <span class="o">=</span> <span class="mh">0xffffffffffffff</span><span class="p">;</span> <span class="c1">// sans minimax</span>
    <span class="kt">uint64_t</span> <span class="n">hash</span> <span class="o">=</span> <span class="n">bitboard</span><span class="p">;</span>
    <span class="n">hash</span> <span class="o">*=</span> <span class="mh">0xcca1cee435c5048f</span><span class="p">;</span>
    <span class="n">hash</span> <span class="o">^=</span> <span class="n">hash</span> <span class="o">&gt;&gt;</span> <span class="mi">32</span><span class="p">;</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">hash</span> <span class="o">%</span> <span class="n">N</span><span class="p">;</span> <span class="p">;</span> <span class="n">i</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">N</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">table</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">||</span> <span class="p">(</span><span class="n">table</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">&amp;</span><span class="n">mask</span><span class="p">)</span> <span class="o">==</span> <span class="n">bitboard</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="o">&amp;</span><span class="n">table</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If the bitboard is not found, it returns a pointer to the (zero-valued)
slot where it should go so that the caller can fill it in.</p>

<h4 id="canonicalization">Canonicalization</h4>

<p>Memoization eliminates nearly all redundancy, but there’s still a major
optimization left. Many states are equivalent under rotation or reflection.
Taking that into account, about 7/8th of the remaining work can still be
eliminated.</p>

<p>Multiple different states that are identical by symmetry must be
somehow “folded” into a single, <em>canonical</em> state to represent them all.
I do this by visiting all 8 rotations and reflections and choosing the
one with the smallest 64-bit integer representation.</p>

<p>I only need two operations to visit all 8 symmetries, and I chose
transpose (flip around the diagonal) and vertical flip. Alternating
between these operations visits each symmetry. Since they’re bitboards,
transforms can be implemented using <a href="https://www.chessprogramming.org/Flipping_Mirroring_and_Rotating">fancy bit-twiddling hacks</a>.
Chess boards, with their power-of-two dimensions, have useful properties
which these British Square boards lack, so this is the best I could come
up with:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Transpose a board or mask (flip along the diagonal).</span>
<span class="kt">uint64_t</span>
<span class="nf">transpose</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span> <span class="mi">16</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00000020000010</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span> <span class="mi">12</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00000410000208</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span>  <span class="mi">8</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00008208004104</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span>  <span class="mi">4</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00104104082082</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span>  <span class="mi">0</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xfe082083041041</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&lt;&lt;</span>  <span class="mi">4</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x01041040820820</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&lt;&lt;</span>  <span class="mi">8</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00820800410400</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&lt;&lt;</span> <span class="mi">12</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00410000208000</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&lt;&lt;</span> <span class="mi">16</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x00200000100000</span><span class="p">);</span>
<span class="p">}</span>

<span class="c1">// Flip a board or mask vertically.</span>
<span class="kt">uint64_t</span>
<span class="nf">flipv</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span> <span class="mi">20</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x0000003e00001f</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span> <span class="mi">10</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x000007c00003e0</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&gt;&gt;</span>  <span class="mi">0</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0xfc00f800007c00</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&lt;&lt;</span> <span class="mi">10</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x001f00000f8000</span><span class="p">)</span> <span class="o">|</span>
           <span class="p">((</span><span class="n">b</span> <span class="o">&lt;&lt;</span> <span class="mi">20</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mh">0x03e00001f00000</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>These transform both players’ bitboards in parallel while leaving the
turn counter intact. The logic here is quite simple: Shift the bitboard
a little bit at a time while using a mask to deposit bits in their new
home once they’re lined up. It’s like a coin sorter. Vertical flip is
analogous to byte-swapping, though with 5-bit “bytes”.</p>

<p>Canonicalizing a bitboard now looks like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span>
<span class="nf">canonicalize</span><span class="p">(</span><span class="kt">uint64_t</span> <span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">c</span> <span class="o">=</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">transpose</span><span class="p">(</span><span class="n">b</span><span class="p">);</span> <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">flipv</span><span class="p">(</span><span class="n">b</span><span class="p">);</span>     <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">transpose</span><span class="p">(</span><span class="n">b</span><span class="p">);</span> <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">flipv</span><span class="p">(</span><span class="n">b</span><span class="p">);</span>     <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">transpose</span><span class="p">(</span><span class="n">b</span><span class="p">);</span> <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">flipv</span><span class="p">(</span><span class="n">b</span><span class="p">);</span>     <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">transpose</span><span class="p">(</span><span class="n">b</span><span class="p">);</span> <span class="n">c</span> <span class="o">=</span> <span class="n">c</span> <span class="o">&lt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">c</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">c</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Callers need only use <code class="language-plaintext highlighter-rouge">canonicalize()</code> on values they pass to <code class="language-plaintext highlighter-rouge">lookup()</code>
or store in the table (via the returned pointer).</p>

<h3 id="developing-a-heuristic">Developing a heuristic</h3>

<p>If you can come up with a perfect play heuristic, especially one that
can be reasonably performed by humans, I’d like to hear it. My engine
has a built-in heuristic tester, so I can test it against perfect play
at all possible game positions to check that it actually works. It’s
currently programmed to test the greedy heuristic and print out the
millions of cases where it fails. Even a heuristic that fails in only a
small number of cases would be pretty reasonable.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>When the Compiler Bites</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2018/05/01/"/>
    <id>urn:uuid:02b974e1-e25b-397d-a16f-c754338e9c1e</id>
    <updated>2018-05-01T23:28:06Z</updated>
    <category term="c"/><category term="x86"/><category term="optimization"/><category term="ai"/><category term="netsec"/>
    <content type="html">
      <![CDATA[<p><em>Update: There are discussions <a href="https://old.reddit.com/r/cpp/comments/8gfhq3/when_the_compiler_bites/">on Reddit</a> and <a href="https://news.ycombinator.com/item?id=16974770">on Hacker
News</a>.</em></p>

<p>So far this year I’ve been bitten three times by compiler edge cases
in GCC and Clang, each time catching me totally by surprise. Two were
caused by historical artifacts, where an ambiguous specification led
to diverging implementations. The third was a compiler optimization
being far more clever than I expected, behaving almost like an
artificial intelligence.</p>

<p>In all examples I’ll be using GCC 7.3.0 and Clang 6.0.0 on Linux.</p>

<h3 id="x86-64-abi-ambiguity">x86-64 ABI ambiguity</h3>

<p>The first time I was bit — or, well, narrowly avoided being bit — was
when I examined a missed floating point optimization in both Clang and
GCC. Consider this function:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">double</span>
<span class="nf">zero_multiply</span><span class="p">(</span><span class="kt">double</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">*</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The function multiplies its argument by zero and returns the result. Any
number multiplied by zero is zero, so this should always return zero,
right? Unfortunately, no. IEEE 754 floating point arithmetic supports
NaN, infinities, and signed zeros. This function can return NaN,
positive zero, or negative zero. (In some cases, the operation could
also potentially produce a hardware exception.)</p>
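
<p>Those special cases are easy to verify directly. A minimal
demonstration of my own (not from the original examples) of why the
compiler cannot fold the multiply to zero:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;assert.h&gt;
#include &lt;math.h&gt;

int
main(void)
{
    assert(isnan(NAN * 0.0));     /* NaN in, NaN out */
    assert(signbit(-1.0 * 0.0));  /* result is negative zero */
    assert(!signbit(1.0 * 0.0));  /* result is positive zero */
    return 0;
}
</code></pre></div></div>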

<p>As a result, both GCC and Clang perform the multiply:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">zero_multiply:</span>
    <span class="nf">xorpd</span>  <span class="nv">xmm1</span><span class="p">,</span> <span class="nv">xmm1</span>
    <span class="nf">mulsd</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm1</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">-ffast-math</code> option relaxes the C standard floating point rules,
permitting an optimization at the cost of conformance and
<a href="https://possiblywrong.wordpress.com/2017/09/12/floating-point-agreement-between-matlab-and-c/">consistency</a>:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">zero_multiply:</span>
    <span class="nf">xorps</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm0</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>Side note: <code class="language-plaintext highlighter-rouge">-ffast-math</code> doesn’t necessarily mean “less precise.”
Sometimes it will actually <a href="https://en.wikipedia.org/wiki/Multiply–accumulate_operation#Fused_multiply–add">improve precision</a>.</p>

<p>Here’s a modified version of the function that’s a little more
interesting. I’ve changed the argument to a <code class="language-plaintext highlighter-rouge">short</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">double</span>
<span class="nf">zero_multiply_short</span><span class="p">(</span><span class="kt">short</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">*</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s no longer possible for the argument to be one of those special
values. The <code class="language-plaintext highlighter-rouge">short</code> will be promoted to one of 65,536 possible <code class="language-plaintext highlighter-rouge">double</code>
values, each of which results in 0.0 when multiplied by 0.0. GCC misses
this optimization (<code class="language-plaintext highlighter-rouge">-Os</code>):</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">zero_multiply_short:</span>
    <span class="nf">movsx</span>     <span class="nb">edi</span><span class="p">,</span> <span class="nb">di</span>       <span class="c1">; sign-extend 16-bit argument</span>
    <span class="nf">xorps</span>     <span class="nv">xmm1</span><span class="p">,</span> <span class="nv">xmm1</span>    <span class="c1">; xmm1 = 0.0</span>
    <span class="nf">cvtsi2sd</span>  <span class="nv">xmm0</span><span class="p">,</span> <span class="nb">edi</span>     <span class="c1">; convert int to double</span>
    <span class="nf">mulsd</span>     <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm1</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>Clang also misses this optimization:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">zero_multiply_short:</span>
    <span class="nf">cvtsi2sd</span> <span class="nv">xmm1</span><span class="p">,</span> <span class="nb">edi</span>
    <span class="nf">xorpd</span>    <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm0</span>
    <span class="nf">mulsd</span>    <span class="nv">xmm0</span><span class="p">,</span> <span class="nv">xmm1</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>But hang on a minute. This is shorter by one instruction. What
happened to the sign-extension (<code class="language-plaintext highlighter-rouge">movsx</code>)? Clang is treating that
<code class="language-plaintext highlighter-rouge">short</code> argument as if it were a 32-bit value. Why do GCC and Clang
differ? Is GCC doing something unnecessary?</p>

<p>It turns out that the <a href="https://www.uclibc.org/docs/psABI-x86_64.pdf">x86-64 ABI</a> didn’t specify what happens with
the upper bits in argument registers. Are they garbage? Are they zeroed?
GCC takes the conservative position of assuming the upper bits are
arbitrary garbage. Clang takes the boldest position of assuming
arguments smaller than 32 bits have been promoted to 32 bits by the
caller. This is what the ABI specification <em>should</em> have said, but
currently it does not.</p>

<p>Fortunately GCC is also conservative when passing arguments. It promotes
arguments to 32 bits as necessary, so there are no conflicts when
linking against Clang-compiled code. However, this is not true for
Intel’s ICC compiler: <a href="https://web.archive.org/web/20180908113552/https://stackoverflow.com/a/36760539"><strong>Clang and ICC are not ABI-compatible on
x86-64</strong></a>.</p>

<p>I don’t use ICC, so that particular issue wouldn’t bite me, <em>but</em> if I
was ever writing assembly routines that called Clang-compiled code, I’d
eventually get bit by this.</p>

<h3 id="floating-point-precision">Floating point precision</h3>

<p>Without looking it up or trying it, what does this function return?
Think carefully.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span>
<span class="nf">float_compare</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">float</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">1</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">==</span> <span class="mi">1</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Confident in your answer? This is a trick question, because it can
return either 0 or 1 depending on the compiler. Boy was I confused when
this comparison returned 0 in my real world code.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc   -std=c99 -m32 cmp.c  # float_compare() == 0
$ clang -std=c99 -m32 cmp.c  # float_compare() == 1
</code></pre></div></div>

<p>So what’s going on here? The original ANSI C specification wasn’t
clear about how intermediate floating point values get rounded, and
implementations <a href="https://news.ycombinator.com/item?id=16974770">all did it differently</a>. The C99 specification
cleaned this all up and introduced <a href="https://en.wikipedia.org/wiki/C99#IEEE_754_floating_point_support"><code class="language-plaintext highlighter-rouge">FLT_EVAL_METHOD</code></a>.
Implementations can still differ, but at least you can now determine
at compile-time what the compiler would do by inspecting that macro.</p>
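
<p>For example, a minimal sketch that reports the evaluation method
using only the standard macro:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;float.h&gt;
#include &lt;stdio.h&gt;

int
main(void)
{
    /* FLT_EVAL_METHOD is a compile-time constant from float.h (C99). */
    switch (FLT_EVAL_METHOD) {
    case 0:  puts("evaluate at each type's own precision");   break;
    case 1:  puts("evaluate float and double as double");     break;
    case 2:  puts("evaluate everything as long double");      break;
    default: puts("indeterminable (implementation-defined)"); break;
    }
    return 0;
}
</code></pre></div></div>

<p>GCC with <code class="language-plaintext highlighter-rouge">-m32</code> lands in the <code class="language-plaintext highlighter-rouge">long double</code> case; on x86-64 both
compilers report case 0.</p>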

<p>Back in the late 1980s or early 1990s, when the GCC developers were
deciding how GCC should implement floating point arithmetic, the trend
at the time was to use as much precision as possible. On the x86 this
meant using its support for 80-bit extended precision floating point
arithmetic. Floating point operations are performed in <code class="language-plaintext highlighter-rouge">long double</code>
precision and truncated afterward (<code class="language-plaintext highlighter-rouge">FLT_EVAL_METHOD == 2</code>).</p>

<p>In <code class="language-plaintext highlighter-rouge">float_compare()</code> the left-hand side is truncated to a <code class="language-plaintext highlighter-rouge">float</code> by the
assignment, but the right-hand side, <em>despite being a <code class="language-plaintext highlighter-rouge">float</code> literal</em>,
is actually “1.3” at 80 bits of precision as far as GCC is concerned.
That’s pretty unintuitive!</p>
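
<p>A defensive workaround I can sketch (my own, not from the code
above): round-trip both operands through <code class="language-plaintext highlighter-rouge">float</code>-typed storage so any
excess precision is discarded before the comparison. The <code class="language-plaintext highlighter-rouge">volatile</code>
keeps the compiler from holding either value in a wider register:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;assert.h&gt;

/* Both operands pass through memory as genuine 32-bit floats, so the
 * comparison sees float precision regardless of FLT_EVAL_METHOD. */
int
float_compare_fixed(void)
{
    volatile float x = 1.3f;
    volatile float y = 1.3f;
    return x == y;
}

int
main(void)
{
    assert(float_compare_fixed() == 1);  /* same compiler, any target */
    return 0;
}
</code></pre></div></div>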

<p>The remnants of this high precision trend are still in JavaScript, where
all arithmetic is double precision (even if <a href="http://thibaultlaurens.github.io/javascript/2013/04/29/how-the-v8-engine-works/#more-example-on-how-v8-optimized-javascript-code">simulated using
integers</a>), and great pains have been taken <a href="https://blog.mozilla.org/javascript/2013/11/07/efficient-float32-arithmetic-in-javascript/">to work around</a>
the performance consequences of this. <a href="http://tirania.org/blog/archive/2018/Apr-11.html">Until recently</a>, Mono had
similar issues.</p>

<p>The trend reversed once SIMD hardware became widely available and
there were huge performance gains to be had. Multiple values could be
computed at once, side by side, at lower precision. So on x86-64, this
became the default (<code class="language-plaintext highlighter-rouge">FLT_EVAL_METHOD == 0</code>). The young Clang compiler
wasn’t around until well after this trend reversed, so it behaves
differently than the <a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=323">backwards compatible</a> GCC on the old x86.</p>

<p>I’m a little ashamed that I’m only finding out about this now. However,
by the time I was competent enough to notice and understand this issue,
I was already doing nearly all my programming on the x86-64.</p>

<h3 id="built-in-function-elimination">Built-in Function Elimination</h3>

<p>I’ve saved this one for last since it’s my favorite. Suppose we have
this little function, <code class="language-plaintext highlighter-rouge">new_image()</code>, that allocates a greyscale image
for, say, <a href="/blog/2017/11/03/">some multimedia library</a>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span>
<span class="nf">new_image</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">w</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">h</span><span class="p">,</span> <span class="kt">int</span> <span class="n">shade</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">w</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">h</span> <span class="o">&lt;=</span> <span class="n">SIZE_MAX</span> <span class="o">/</span> <span class="n">w</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// overflow?</span>
        <span class="n">p</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">w</span> <span class="o">*</span> <span class="n">h</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">p</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">memset</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">shade</span><span class="p">,</span> <span class="n">w</span> <span class="o">*</span> <span class="n">h</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It’s a static function because this would be part of some <a href="https://github.com/nothings/stb">slick
header library</a> (and, secretly, because it’s necessary for
illustrating the issue). Being a responsible citizen, the function
even <a href="/blog/2017/07/19/">checks for integer overflow</a> before allocating anything.</p>

<p>I write a unit test to make sure it detects overflow. This function
should return 0.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* expected return == 0 */</span>
<span class="kt">int</span>
<span class="nf">test_new_image_overflow</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">new_image</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">SIZE_MAX</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="k">return</span> <span class="o">!!</span><span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>So far my test passes. Good.</p>

<p>I’d also like to make sure it correctly returns NULL — or, more
specifically, that it doesn’t crash — if the allocation fails. But how
can I make <code class="language-plaintext highlighter-rouge">malloc()</code> fail? As a hack I can pass image dimensions that
I know cannot ever practically be allocated. Essentially I want to
force a <code class="language-plaintext highlighter-rouge">malloc(SIZE_MAX)</code>, i.e. allocate every available byte in my
virtual address space. For a conventional 64-bit machine, that’s 16
exbibytes of memory, and it leaves space for nothing else, including
the program itself.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* expected return == 0 */</span>
<span class="kt">int</span>
<span class="nf">test_new_image_oom</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">void</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">new_image</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">SIZE_MAX</span><span class="p">,</span> <span class="mh">0xff</span><span class="p">);</span>
    <span class="k">return</span> <span class="o">!!</span><span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>I compile with GCC, test passes. I compile with Clang and the test
fails. That is, <strong>the test somehow managed to allocate 16 exbibytes of
memory, <em>and</em> initialize it</strong>. Wat?</p>

<p>Disassembling the tests reveals what’s going on:</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">test_new_image_overflow:</span>
    <span class="nf">xor</span>  <span class="nb">eax</span><span class="p">,</span> <span class="nb">eax</span>
    <span class="nf">ret</span>

<span class="nl">test_new_image_oom:</span>
    <span class="nf">mov</span>  <span class="nb">eax</span><span class="p">,</span> <span class="mi">1</span>
    <span class="nf">ret</span>
</code></pre></div></div>

<p>The first test is actually being evaluated at compile time by the
compiler. The function being tested was inlined into the unit test
itself. This permits the compiler to collapse the whole thing down to
a single instruction. The path with <code class="language-plaintext highlighter-rouge">malloc()</code> became dead code and
was trivially eliminated.</p>

<p>In the second test, Clang correctly determined that the image buffer is
not actually being used, despite the <code class="language-plaintext highlighter-rouge">memset()</code>, so it eliminated the
allocation altogether and then <em>simulated</em> a successful allocation
despite it being absurdly large. Allocating memory is not an observable
side effect as far as the language specification is concerned, so it’s
allowed to do this. My thinking was wrong, and the compiler outsmarted
me.</p>
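
<p>One way to keep such a test honest (a sketch of a workaround, not
the library’s actual code; the <code class="language-plaintext highlighter-rouge">sink</code> name is my own) is to launder
the pointer through a volatile object. A volatile store is an
observable side effect, so the compiler must assume the allocation is
used and cannot elide it:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;

static unsigned char *
new_image(size_t w, size_t h, int shade)
{
    unsigned char *p = 0;
    if (w == 0 || h &lt;= SIZE_MAX / w) { /* overflow? */
        p = malloc(w * h);
        if (p) {
            memset(p, shade, w * h);
        }
    }
    return p;
}

/* The volatile store below makes the pointer observably escape,
 * so the malloc() can no longer be optimized away. */
static void *volatile sink;

/* expected return == 0 */
int
test_new_image_oom(void)
{
    void *p = new_image(1, SIZE_MAX, 0xff);
    sink = p;
    return !!p;
}

int
main(void)
{
    return test_new_image_oom();
}
</code></pre></div></div>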

<p>I soon realized I can take this further and trick Clang into
performing an invalid optimization, <a href="https://bugs.llvm.org/show_bug.cgi?id=37304">revealing a bug</a>. Consider
this slightly-optimized version that uses <code class="language-plaintext highlighter-rouge">calloc()</code> when the shade is
zero (black). The <code class="language-plaintext highlighter-rouge">calloc()</code> function does its own overflow check, so
<code class="language-plaintext highlighter-rouge">new_image()</code> doesn’t need to do it.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="o">*</span>
<span class="nf">new_image</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">w</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">h</span><span class="p">,</span> <span class="kt">int</span> <span class="n">shade</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">shade</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// shortcut</span>
        <span class="n">p</span> <span class="o">=</span> <span class="n">calloc</span><span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="n">h</span><span class="p">);</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">w</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">h</span> <span class="o">&lt;=</span> <span class="n">SIZE_MAX</span> <span class="o">/</span> <span class="n">w</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// overflow?</span>
        <span class="n">p</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">w</span> <span class="o">*</span> <span class="n">h</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">p</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">memset</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">shade</span><span class="p">,</span> <span class="n">w</span> <span class="o">*</span> <span class="n">h</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>With this change, my overflow unit test is now also failing. The
situation is even worse than before. The <code class="language-plaintext highlighter-rouge">calloc()</code> is being
eliminated <em>despite the overflow</em>, and replaced with a simulated
success. This time it’s actually a bug in Clang. While failing a unit
test is mostly harmless, <strong>this could introduce a vulnerability in a
real program</strong>. The OpenBSD folks are so worried about this sort of
thing that <a href="https://marc.info/?l=openbsd-cvs&amp;m=150125592126437&amp;w=2">they’ve disabled this optimization</a>.</p>

<p>Here’s a slightly-contrived example of this. Imagine a program that
maintains a table of unsigned integers, and we want to keep track of
how many times the program has accessed each table entry. The “access
counter” table is initialized to zero, but the table of values need
not be initialized, since they’ll be written before first access (or
something like that).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">table</span> <span class="p">{</span>
    <span class="kt">unsigned</span> <span class="o">*</span><span class="n">counter</span><span class="p">;</span>
    <span class="kt">unsigned</span> <span class="o">*</span><span class="n">values</span><span class="p">;</span>
<span class="p">};</span>

<span class="k">static</span> <span class="kt">int</span>
<span class="nf">table_init</span><span class="p">(</span><span class="k">struct</span> <span class="n">table</span> <span class="o">*</span><span class="n">t</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">n</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">t</span><span class="o">-&gt;</span><span class="n">counter</span> <span class="o">=</span> <span class="n">calloc</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">counter</span><span class="p">));</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">counter</span><span class="p">)</span> <span class="p">{</span>
        <span class="cm">/* Overflow already tested above */</span>
        <span class="n">t</span><span class="o">-&gt;</span><span class="n">values</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">n</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">));</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">values</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">free</span><span class="p">(</span><span class="n">t</span><span class="o">-&gt;</span><span class="n">counter</span><span class="p">);</span>
            <span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// fail</span>
        <span class="p">}</span>
        <span class="k">return</span> <span class="mi">1</span><span class="p">;</span> <span class="c1">// success</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// fail</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This function relies on the overflow test in <code class="language-plaintext highlighter-rouge">calloc()</code> for the second
<code class="language-plaintext highlighter-rouge">malloc()</code> allocation. However, this is a static function that’s
likely to get inlined, as we saw before. If the program doesn’t
actually make use of the <code class="language-plaintext highlighter-rouge">counter</code> table, and Clang is able to
statically determine this fact, it may eliminate the <code class="language-plaintext highlighter-rouge">calloc()</code>. This
would also <strong>eliminate the overflow test, introducing a
vulnerability</strong>. If an attacker can control <code class="language-plaintext highlighter-rouge">n</code>, then they can
overwrite arbitrary memory through that <code class="language-plaintext highlighter-rouge">values</code> pointer.</p>

<h3 id="the-takeaway">The takeaway</h3>

<p>Besides this surprising little bug, the main lesson for me is that I
should probably isolate unit tests from the code being tested. The
easiest solution is to put them in separate translation units and don’t
use link-time optimization (LTO). Allowing tested functions to be
inlined into the unit tests is probably a bad idea.</p>

<p>The unit test issues in my <em>real</em> program, which was <a href="https://github.com/skeeto/growable-buf">a bit more
sophisticated</a> than what was presented here, gave me artificial
intelligence vibes. It’s that situation where a computer algorithm did
something really clever and I felt it outsmarted me. It’s creepy to
consider <a href="https://wiki.lesswrong.com/wiki/Paperclip_maximizer">how far that can go</a>. I’ve gotten that even from
observing <a href="/blog/2017/04/27/">AI I’ve written myself</a>, and I know for sure no human
taught it some particularly clever trick.</p>

<p>My favorite AI story along these lines is about <a href="https://www.youtube.com/watch?v=xOCurBYI_gY">an AI that learned
how to play games on the Nintendo Entertainment System</a>. It
didn’t understand the games it was playing. Its optimization task was
simply to choose controller inputs that maximized memory values,
because that’s generally associated with doing well — higher scores,
more progress, etc. The most unexpected part came when playing Tetris.
Eventually the screen would fill up with blocks, and the AI would face
the inevitable situation of losing the game, with all that memory
being reinitialized to low values. So what did it do?</p>

<p>Just before the end it would pause the game and wait… forever.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>Two Games with Monte Carlo Tree Search</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2017/04/27/"/>
    <id>urn:uuid:b6f77cb1-01df-3714-4ba0-1859614364da</id>
    <updated>2017-04-27T21:27:50Z</updated>
    <category term="c"/><category term="ai"/><category term="game"/>
    <content type="html">
      <![CDATA[<p><em>Update 2020: A DOS build of Connect Four <a href="https://www.youtube.com/watch?v=K00BylbOQUo">was featured on GET OFF MY
LAWN</a>.</em></p>

<p><a href="https://jeffbradberry.com/posts/2015/09/intro-to-monte-carlo-tree-search/">Monte Carlo tree search</a> (MCTS) is the most impressive game
artificial intelligence I’ve ever used. At its core it simulates a
large number of games (<em>playouts</em>), starting from the current game
state, using random moves for each player. Then it simply picks the
move where it won most often. This description is sufficient to spot
one of its most valuable features: <strong>MCTS requires no knowledge of
strategy or effective play</strong>. The game’s rules — enough to simulate
the game — are all that’s needed to allow the AI to make decent moves.
Expert knowledge still makes for a stronger AI, but, for many games,
it’s unnecessary to construct a decent opponent.</p>
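<p>The playout loop can be made concrete with a small C sketch against a toy
game of my own invention (take 1 or 2 stones, last stone wins): run random
playouts behind each candidate move, then keep the move that won most often.
This is only pure Monte Carlo; MCTS adds the tree and smarter in-tree
selection on top of it.</p>

```c
#include <assert.h>

/* Toy stand-in for a real game (not from the engines below): players
 * alternately take 1 or 2 stones; whoever takes the last stone wins. */

static unsigned xorshift(unsigned *s)
{
    *s ^= *s << 13;
    *s ^= *s >> 17;
    *s ^= *s << 5;
    return *s;
}

/* Finish one game with random moves; returns the winner (0 or 1). */
static int playout(int pile, int player, unsigned *rng)
{
    for (;;) {
        int take = 1 + xorshift(rng) % 2;
        if (take >= pile) {
            return player;  /* took the last stone */
        }
        pile -= take;
        player = !player;
    }
}

/* Pure Monte Carlo: n random playouts per candidate move, then pick
 * the move that won most often for the player to move. */
static int best_move(int pile, int player, int n, unsigned *rng)
{
    int best = 1;
    double best_rate = -1;
    for (int take = 1; take <= 2 && take <= pile; take++) {
        int wins = 0;
        for (int i = 0; i < n; i++) {
            if (take == pile) {
                wins++;  /* taking everything wins immediately */
            } else if (playout(pile - take, !player, rng) == player) {
                wins++;
            }
        }
        double rate = (double)wins / n;
        if (rate > best_rate) {
            best_rate = rate;
            best = take;
        }
    }
    return best;
}
```

<p>From a pile of 4 the only winning move is to take 1, and random playouts
find it with no knowledge of the game’s strategy.</p>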

<p>A second valuable feature is that it’s easy to parallelize. Unlike
<a href="/blog/2011/08/24/">alpha-beta pruning</a>, which doesn’t mix well with parallel
searches of a Minimax tree, Monte Carlo simulations are practically
independent and can be run in parallel.</p>

<p>Finally, the third valuable feature is that the search can be stopped
at any time. The completion of any single simulation is as good a
stopping point as any. It could be due to a time limit, a memory
limit, or both. In general, the algorithm <em>converges</em> to a best move
rather than suddenly discovering it. The good moves are identified
quickly, and further simulations work to choose among them. More
simulations make for better moves, with exponentially diminishing
returns. Minimax, by contrast, risks never having explored the good
moves at all when stopped early.</p>

<p>To try out MCTS myself, I wrote two games employing it:</p>

<ul>
  <li><a href="https://github.com/skeeto/connect4"><strong>Connect Four</strong></a> [<a href="https://github.com/skeeto/connect4/releases/download/1.0/connect4.exe">.exe x64</a>, 173kB]</li>
  <li><a href="https://github.com/skeeto/yavalath"><strong>Yavalath</strong></a>      [<a href="https://github.com/skeeto/yavalath/releases/download/1.0/yavalath.exe">.exe x64</a>, 174kB]</li>
</ul>

<p>They’re both written in C, for both unix-like and Windows, and should
be <a href="/blog/2017/03/30/">easy to build</a>. <strong>I challenge you to beat them both.</strong> The
Yavalath AI is easier to beat due to having blind spots, which I’ll
discuss below. The Connect Four AI is more difficult and will likely
take a number of tries.</p>

<h3 id="connect-four">Connect Four</h3>

<p><a href="/img/mcts/connect4.png"><img src="/img/mcts/connect4-thumb.png" alt="" /></a></p>

<p>MCTS works very well with Connect Four, and only requires modest
resources: 32MB of memory to store the results of random playouts, and
500,000 game simulations. With a few tweaks, it can even be run in
DOSBox. It stops when it hits either of those limits. In theory,
increasing both would make for stronger moves, but in practice I can’t
detect any difference. It’s like <a href="https://curiosity-driven.org/pi-approximation">computing pi with Monte Carlo</a>,
where eventually it just runs out of precision to make any more
progress.</p>

<p>Based on my simplified description above, you might wonder why it needs
all that memory. Not only does MCTS need to track its win/loss ratio for
each available move from the current state, it tracks the win/loss ratio
for moves in the states behind those moves. A large chunk of the game
tree is kept in memory to track all of the playout results. This is why
MCTS needs a lot more memory than Minimax, which can discard branches
that have been searched.</p>

<p><img src="/img/mcts/tree.svg" alt="" /></p>

<p>A convenient property of this tree is that the branch taken in the
actual game can be re-used in a future search. The root of the tree
becomes the node representing the taken game state, which has already
seen a number of playouts. Even better, MCTS is weighted towards
exploring good moves over bad moves, and good moves are more likely to
be taken in the real game. In general, a significant portion of the tree
gets to be reused in a future search.</p>

<p>I’m going to skip most of the details of the algorithm itself and focus
on my implementation. Other articles do a better job at detailing the
algorithm than I could.</p>

<p>My Connect Four engine doesn’t use dynamic allocation for this tree (or
at all). Instead it manages a static buffer — an array of tree nodes,
each representing a game state. All nodes are initially chained together
into a linked list of free nodes. As the tree is built, nodes are pulled
off the free list and linked together into a tree. When the game
advances to the next state, nodes on unreachable branches are added back
to the free list.</p>

<p>If at any point the free list is empty when a new node is needed, the
current search aborts. This is the out-of-memory condition, and no more
searching can be performed.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* Connect Four is normally a 7 by 6 grid. */</span>
<span class="cp">#define CONNECT4_WIDTH  7
#define CONNECT4_HEIGHT 6
</span>
<span class="k">struct</span> <span class="n">connect4_node</span> <span class="p">{</span>
    <span class="kt">uint32_t</span> <span class="n">next</span><span class="p">[</span><span class="n">CONNECT4_WIDTH</span><span class="p">];</span>      <span class="c1">// "pointer" to next node</span>
    <span class="kt">uint32_t</span> <span class="n">playouts</span><span class="p">[</span><span class="n">CONNECT4_WIDTH</span><span class="p">];</span>  <span class="c1">// number of playouts</span>
    <span class="kt">float</span>    <span class="n">score</span><span class="p">[</span><span class="n">CONNECT4_WIDTH</span><span class="p">];</span>     <span class="c1">// pseudo win/loss ratio</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Rather than native C pointers, the structure uses 32-bit indexes into
the master array. This saves a lot of memory on 64-bit systems, and the
structure is the same size no matter the pointer size of the host. The
<code class="language-plaintext highlighter-rouge">next</code> field points to the next state for the nth move. Since 0 is a
valid index, -1 represents null (<code class="language-plaintext highlighter-rouge">CONNECT4_NULL</code>).</p>
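<p>Here is a minimal sketch of how such an index-based pool can work. It is my
own illustration rather than the engine’s actual code: free nodes chain
through their first <code class="language-plaintext highlighter-rouge">next</code> slot, and allocation pops from that chain.</p>

```c
#include <assert.h>
#include <stdint.h>

#define POOL_SIZE     1024
#define CONNECT4_NULL ((uint32_t)-1)

struct node {
    uint32_t next[7];  /* "pointers" are indexes into the pool */
};

static struct node pool[POOL_SIZE];
static uint32_t free_head = CONNECT4_NULL;

/* Chain every node into the free list through next[0]. */
static void pool_init(void)
{
    for (uint32_t i = 0; i < POOL_SIZE; i++) {
        pool[i].next[0] = i + 1 < POOL_SIZE ? i + 1 : CONNECT4_NULL;
    }
    free_head = 0;
}

/* Pop a node; CONNECT4_NULL means out of memory (abort the search). */
static uint32_t node_alloc(void)
{
    uint32_t i = free_head;
    if (i != CONNECT4_NULL) {
        free_head = pool[i].next[0];
        for (int m = 0; m < 7; m++) {
            pool[i].next[m] = CONNECT4_NULL;
        }
    }
    return i;
}

/* Return an unreachable node to the free list. */
static void node_free(uint32_t i)
{
    pool[i].next[0] = free_head;
    free_head = i;
}
```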

<p>Each column is a potential move, so there are <code class="language-plaintext highlighter-rouge">CONNECT4_WIDTH</code>
possible moves at any given state. Each move has a floating point
score and a total number of playouts through that move. In my
implementation, <strong>the search can also halt due to an overflow in a
playout counter</strong>. The search can no longer be tracked in this
representation, so it has to stop. This generally only happens when
the game is nearly over and it’s grinding away on a small number of
possibilities.</p>

<p>Note that the actual game state (piece positions) is not tracked in the
node structure. That’s because it’s implicit. We know the state of the
game at the root, and simulating the moves while descending the tree
will keep track of the board state at the current node. That’s more
memory savings.</p>

<p>The state itself is a pair of bitboards, one for each player. Each
position on the grid gets a bit on each bitboard. The bitboard is very
fast to manipulate, and win states are checked with just a handful of
bit operations. My intention was to make playouts as fast as possible.</p>
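<p>For illustration, here is a win check over one player’s bitboard. The
layout is an assumption on my part (a common Connect Four encoding, not
necessarily this engine’s): 7 bits per column with the top bit of each column
left empty as a sentinel, so one shift pair per direction detects four in a
row.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Bit index = column * 7 + row, bit 0 at the bottom of column 0.
 * Each pair of shifts folds runs in one direction: a surviving bit
 * after both folds means four in a row. */
static int connect4_win(uint64_t b)
{
    uint64_t m;
    m = b & (b >> 1);           /* vertical   */
    if (m & (m >> 2)) return 1;
    m = b & (b >> 7);           /* horizontal */
    if (m & (m >> 14)) return 1;
    m = b & (b >> 6);           /* diagonal \ */
    if (m & (m >> 12)) return 1;
    m = b & (b >> 8);           /* diagonal / */
    if (m & (m >> 16)) return 1;
    return 0;
}
```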

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">connect4_ai</span> <span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">state</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>         <span class="c1">// game state at root (bitboard)</span>
    <span class="kt">uint64_t</span> <span class="n">rng</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>           <span class="c1">// random number generator state</span>
    <span class="kt">uint32_t</span> <span class="n">nodes_available</span><span class="p">;</span>  <span class="c1">// total number of nodes available</span>
    <span class="kt">uint32_t</span> <span class="n">nodes_allocated</span><span class="p">;</span>  <span class="c1">// number of nodes in the tree</span>
    <span class="kt">uint32_t</span> <span class="n">root</span><span class="p">;</span>             <span class="c1">// "pointer" to root node</span>
    <span class="kt">uint32_t</span> <span class="n">free</span><span class="p">;</span>             <span class="c1">// "pointer" to free list</span>
    <span class="kt">int</span> <span class="n">turn</span><span class="p">;</span>                  <span class="c1">// whose turn (0 or 1) at the root?</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">nodes_available</code> and <code class="language-plaintext highlighter-rouge">nodes_allocated</code> fields are necessary
for neither correctness nor speed. They’re useful for diagnostics and debugging.</p>

<p>All the functions that operate on these two structures are
straightforward, except for <code class="language-plaintext highlighter-rouge">connect4_playout</code>, a recursive function
which implements the bulk of MCTS. Depending on the state of the node
it’s at, it does one of two things:</p>

<ul>
  <li>
    <p>If there are unexplored moves (<code class="language-plaintext highlighter-rouge">playouts == 0</code>), it randomly chooses
an unplayed move, allocates exactly one node for the state behind that
move, and simulates the rest of the game in a loop, without recursion
or allocating any more nodes.</p>
  </li>
  <li>
    <p>If all moves have been explored at least once, it uses an upper
confidence bound (UCB1) to randomly choose a move, weighted towards
both moves that are under-explored and moves which are strongest.
Striking that balance is one of the challenges. It recurses into that
next state, then updates the node with the result as it propagates
back to the root.</p>
  </li>
</ul>

<p>That’s pretty much all there is to it.</p>

<h3 id="yavalath">Yavalath</h3>

<p><a href="/img/mcts/yavalath.png"><img src="/img/mcts/yavalath-thumb.png" alt="" /></a></p>

<p><a href="http://cambolbro.com/games/yavalath/">Yavalath</a> is a <a href="http://www.genetic-programming.org/hc2012/Browne-Paper-3-Yavalath-07.pdf">board game invented by a computer
program</a>. It’s a pretty fascinating story. The depth and strategy
are disproportionate to its dead simple rules: Get four
in a row without first getting three in a row. The game revolves around
forced moves.</p>

<p>The engine is structured almost identically to the Connect Four engine.
It uses 32-bit indexes instead of pointers. The game state is a pair of
bitboards, with end-game masks <a href="/blog/2016/11/15/">computed at compile time via
metaprogramming</a>. The AI allocates the tree from a single, massive
buffer — multiple GBs in this case, dynamically scaled to the available
physical memory. And the core MCTS function is nearly identical.</p>

<p>One important difference is that identical game states — states where
the pieces on the board are the same, but the node was reached through
a different series of moves — are coalesced into a single state in the
tree. This state deduplication is done through a hash table. This
saves on memory and allows multiple different paths through the game
tree to share playouts. It comes at a cost of including the game state
in the node (so it can be identified in the hash table) and reference
counting the nodes (since they might have more than one parent).</p>
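<p>A miniature of the idea, with reference counting omitted (this is my own
sketch, not the engine’s table): hash the bitboard pair into an
open-addressing table, so a state reached by a second move order resolves to
the node already in the tree.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 64  /* power of two */

struct slot {
    uint64_t state[2];  /* the bitboard pair identifies the state */
    uint32_t node;
    int used;
};
static struct slot table[TABLE_SIZE];

static uint64_t state_hash(const uint64_t s[2])
{
    uint64_t h = 0x100;
    h ^= s[0]; h *= 1111111111111111111ULL;
    h ^= s[1]; h *= 1111111111111111111ULL;
    return h;
}

/* Return the node for a state, inserting `fresh` if it's new. */
static uint32_t intern(const uint64_t s[2], uint32_t fresh)
{
    for (uint64_t i = state_hash(s);; i++) {  /* linear probing */
        struct slot *e = &table[i % TABLE_SIZE];
        if (!e->used) {
            e->used = 1;
            memcpy(e->state, s, sizeof(e->state));
            e->node = fresh;
            return fresh;
        }
        if (!memcmp(e->state, s, sizeof(e->state))) {
            return e->node;  /* coalesce into the existing node */
        }
    }
}
```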

<p>Unfortunately the AI has blind spots, and once you learn to spot them it
becomes easy to beat consistently. It can’t spot certain kinds of forced
moves, so it always falls for the same tricks. The <em>official</em> Yavalath
AI is slightly stronger than mine, but has a similar blindness. I think
MCTS just isn’t quite a good fit for Yavalath.</p>

<p><strong>The AI’s blindness is caused by <em>shallow traps</em></strong>, a common problem
for MCTS. It’s what makes MCTS a poor fit for Chess. A shallow trap is
a branch in the game tree where the game will abruptly end in a small
number of turns. If the tree search doesn’t happen to stumble
upon a trap during its random traversal, it can’t take it into account
in its final decision. A skilled player will lead the game towards one
of these traps, and the AI will blunder along, not realizing what’s
happened until it’s too late.</p>

<p>I almost feel bad for it when this happens. If you watch the memory
usage and number of playouts, once it falls into a trap, you’ll see it
using almost no memory while performing a ton of playouts. It’s
desperately, frantically searching for a way out of the trap. But it’s
too late, little AI.</p>

<h3 id="another-tool-in-the-toolbelt">Another Tool in the Toolbelt</h3>

<p>I’m really happy to have sunk a couple weekends into playing with MCTS.
It’s not always a great fit, as seen with Yavalath, but it’s a really
neat algorithm. Now that I’ve wrapped my head around it, I’ll be ready
to use it should I run into an appropriate problem in the future.</p>

]]>
    </content>
  </entry>
  <entry>
    <title>A GPU Approach to Path Finding</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2014/06/22/"/>
    <id>urn:uuid:29de5cb3-f93a-3e6e-9adc-ff689e736877</id>
    <updated>2014-06-22T22:51:46Z</updated>
    <category term="ai"/><category term="webgl"/><category term="javascript"/><category term="gpgpu"/><category term="opengl"/>
    <content type="html">
      <![CDATA[<p>Last time <a href="/blog/2014/06/10/">I demonstrated how to run Conway’s Game of Life</a>
entirely on a graphics card. This concept can be generalized to <em>any</em>
cellular automaton, including automata with more than two states. In
this article I’m going to exploit this to solve the <a href="http://en.wikipedia.org/wiki/Shortest_path_problem">shortest path
problem</a> for two-dimensional grids entirely on a GPU. It will be
just as fast as traditional searches on a CPU.</p>

<p>The JavaScript side of things is essentially the same as before — two
textures with a fragment shader in between that steps the automaton
forward — so I won’t be repeating myself. The only parts that have
changed are the cell state encoding (to express all automaton states)
and the fragment shader (to code the new rules).</p>

<ul>
  <li><a href="https://skeeto.github.io/webgl-path-solver/">Online Demo</a>
(<a href="https://github.com/skeeto/webgl-path-solver">source</a>)</li>
</ul>

<p>Included is a pure JavaScript implementation of the cellular
automaton (State.js) that I used for debugging and experimentation,
but it doesn’t actually get used in the demo. A fragment shader
(12state.frag) encodes the full automaton rules for the GPU.</p>

<h3 id="maze-solving-cellular-automaton">Maze-solving Cellular Automaton</h3>

<p>There’s a dead simple 2-state cellular automaton that can solve any
<em>perfect</em> maze of arbitrary dimension. Each cell is either OPEN or a
WALL, only 4-connected neighbors are considered, and there’s only one
rule: if an OPEN cell has only one OPEN neighbor, it becomes a WALL.</p>
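<p>The rule is easy to model on the CPU. Here is a sketch of one synchronous
step over a char grid, using <code class="language-plaintext highlighter-rouge">.</code> for OPEN and <code class="language-plaintext highlighter-rouge">#</code> for WALL (the third,
held-open state is omitted):</p>

```c
#include <assert.h>
#include <string.h>

#define MW 7  /* maze width  */
#define MH 3  /* maze height */

/* An OPEN cell with only one OPEN 4-connected neighbor is a dead end
 * and becomes a WALL. All cells update from the same source grid, as
 * a cellular automaton requires. */
static void step(const char src[MH][MW + 1], char dst[MH][MW + 1])
{
    memcpy(dst, src, MH * (MW + 1));
    for (int y = 0; y < MH; y++) {
        for (int x = 0; x < MW; x++) {
            if (src[y][x] != '.') {
                continue;
            }
            int open = 0;
            if (y > 0      && src[y - 1][x] == '.') open++;
            if (y < MH - 1 && src[y + 1][x] == '.') open++;
            if (x > 0      && src[y][x - 1] == '.') open++;
            if (x < MW - 1 && src[y][x + 1] == '.') open++;
            if (open == 1) {
                dst[y][x] = '#';  /* dead end collapses */
            }
        }
    }
}
```

<p>On a straight corridor, each step turns both ends into walls, collapsing
one cell per step exactly as in the GIF above.</p>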

<p><img src="/img/path/simple.gif" alt="" /></p>

<p>On each step the dead ends collapse towards the solution. In the above
GIF, in order to keep the start and finish from collapsing, I’ve added
a third state (red) that holds them open. On a GPU, you’d have to do
as many draws as the length of the longest dead end.</p>

<p>A perfect maze is a maze where there is exactly one solution. This
technique doesn’t work for mazes with multiple solutions, loops, or
open spaces. The extra solutions won’t collapse into one, let alone
the shortest one.</p>

<p><img src="/img/path/simple-loop.gif" alt="" /></p>

<p>To fix this we need a more advanced cellular automaton.</p>

<h3 id="path-solving-cellular-automaton">Path-solving Cellular Automaton</h3>

<p>I came up with a 12-state cellular automaton that can not only solve
mazes, but will specifically find the shortest path. Like above, it
only considers 4-connected neighbors.</p>

<ul>
  <li>OPEN (white): passable space in the maze</li>
  <li>WALL (black): impassable space in the maze</li>
  <li>BEGIN (red): starting position</li>
  <li>END (red): goal position</li>
  <li>FLOW (green): flood fill that comes in four flavors: north, east, south, west</li>
  <li>ROUTE (blue): shortest path solution, also comes in four flavors</li>
</ul>

<p>If we wanted to consider 8-connected neighbors, everything would be
the same, but it would require 20 states (n, ne, e, se, s, sw, w, nw)
instead of 12. The rules are still pretty simple.</p>

<ul>
  <li>WALL and ROUTE cells never change state.</li>
  <li>OPEN becomes FLOW if it has any adjacent FLOW cells. It points
towards the neighboring FLOW cell (n, e, s, w).</li>
  <li>END becomes ROUTE if adjacent to a FLOW cell. It points towards the
FLOW cell (n, e, s, w). This rule is important for preventing
multiple solutions from appearing.</li>
  <li>FLOW becomes ROUTE if adjacent to a ROUTE cell that points towards
it. Combined with the above rule, it means when a FLOW cell touches
a ROUTE cell, there’s a cascade.</li>
  <li>BEGIN becomes ROUTE when adjacent to a ROUTE cell. The direction is
unimportant. This rule isn’t strictly necessary but will come in
handy later.</li>
</ul>

<p>This can be generalized for cellular grids of any arbitrary dimension,
and it could even run on a GPU for higher dimensions, limited
primarily by the number of texture uniform bindings (2D needs 1
texture binding, 3D needs 2 texture bindings, 4D needs 8 texture
bindings … I think). But if you need to find the shortest path along
a five-dimensional grid, I’d like to know why!</p>

<p>So what does it look like?</p>

<p><img src="/img/path/maze.gif" alt="" /></p>

<p>FLOW cells flood the entire maze. Branches of the maze are searched
in parallel as they’re discovered. As soon as an END cell is touched, a
ROUTE is traced backwards along the flow to the BEGIN cell. It takes
twice as many steps as the length of the shortest path.</p>

<p>Note that the FLOW cells keep flooding the maze even after the END was
found. It’s a cellular automaton, so there’s no way to communicate to
these other cells that the solution was discovered. However, when
running on a GPU this wouldn’t matter anyway. There’s no bailing out
early before all the fragment shaders have run.</p>

<p>What’s great about this is that we’re not limited to mazes whatsoever.
Here’s a path through a few connected rooms with open space.</p>

<p><img src="/img/path/flood.gif" alt="" /></p>

<h4 id="maze-types">Maze Types</h4>

<p>The worst-case solution is the longest possible shortest path. There’s
only one frontier and running the entire automaton to push it forward
by one cell is inefficient, even for a GPU.</p>

<p><img src="/img/path/spiral.gif" alt="" /></p>

<p>The way a maze is generated plays a large role in how quickly the
cellular automaton can solve it. A common maze generation algorithm
is a random depth-first search (DFS). The entire maze starts out
entirely walled in and the algorithm wanders around at random plowing
down walls, but never breaking into open space. When it comes to a
dead end, it unwinds looking for new walls to knock down. This method
tends towards long, winding paths with a low branching factor.</p>

<p>The mazes you see in the demo are Kruskal’s algorithm mazes. Walls are
knocked out at random anywhere in the maze, without breaking the
perfect maze rule. It has a much higher branching factor and makes for
a much more interesting demo.</p>

<h4 id="skipping-the-route-step">Skipping the Route Step</h4>

<p>On my computers, with a 1023x1023 Kruskal maze <del>it’s about an
order of magnitude slower</del> (see update below) than <a href="http://en.wikipedia.org/wiki/A*_search_algorithm">A*</a>
(<a href="http://ondras.github.io/rot.js/hp/">rot.js’s version</a>) for the same maze. <del>Not very
impressive!</del> I <em>believe</em> this gap will close with time, as
GPUs gain parallelism faster than CPUs gain raw speed. However, there’s
something important to consider: it’s not only solving the shortest
path between source and goal, <strong>it’s finding the shortest path between
the source and any other point</strong>. At its core it’s a <a href="http://www.redblobgames.com/pathfinding/tower-defense/">breadth-first
grid search</a>.</p>

<p><em>Update</em>: One day after writing this article I realized that
<code class="language-plaintext highlighter-rouge">glReadPixels</code> was causing a gigantic bottleneck. By only checking for
the end conditions once every 500 iterations, this method is now as
fast as A* on modern graphics cards, despite taking up to an
extra 499 iterations. <strong>In just a few more years, this technique
should be faster than A*.</strong></p>

<p>Really, there’s little use in the ROUTE step. It’s a poor fit for the GPU.
It has no use in any real application. I’m using it here mainly for
demonstration purposes. If dropped, the cellular automaton would
become 6 states: OPEN, WALL, and four flavors of FLOW. Seed the source
point with a FLOW cell (arbitrary direction) and run the automaton
until all of the OPEN cells are gone.</p>

<h3 id="detecting-end-state">Detecting End State</h3>

<p>The ROUTE cells do have a useful purpose, though. How do we know when
we’re done? We can poll the BEGIN cell to check for when it becomes a
ROUTE cell. Then we know we’ve found the solution. This doesn’t
necessarily mean all of the FLOW cells have finished propagating,
though, especially in the case of a DFS-maze.</p>

<p>In a CPU-based solution, I’d keep a counter and increment it every
time an OPEN cell changes state. If the counter doesn’t change after
an iteration, I’m done. OpenGL 4.2 introduces an <a href="http://www.opengl.org/wiki/Atomic_Counter">atomic
counter</a> that could serve this role, but this isn’t available in
OpenGL ES / WebGL. The only thing left to do is use <code class="language-plaintext highlighter-rouge">glReadPixels</code> to
pull down the entire thing and check for end state on the CPU.</p>

<p>The original 2-state automaton above also suffers from this problem.</p>

<h3 id="encoding-cell-state">Encoding Cell State</h3>

<p>Cells are stored per pixel in a GPU texture. I spent quite some time
trying to brainstorm a clever way to encode the twelve cell states
into a vec4 color. Perhaps there’s some way to <a href="/blog/2014/06/21/">exploit
blending</a> to update cell states, or make use of some other kind
of built-in pixel math. I couldn’t think of anything better than a
straightforward encoding of 0 to 11 into a single color channel (red
in my case).</p>

<div class="language-glsl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">state</span><span class="p">(</span><span class="kt">vec2</span> <span class="n">offset</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">vec2</span> <span class="n">coord</span> <span class="o">=</span> <span class="p">(</span><span class="nb">gl_FragCoord</span><span class="p">.</span><span class="n">xy</span> <span class="o">+</span> <span class="n">offset</span><span class="p">)</span> <span class="o">/</span> <span class="n">scale</span><span class="p">;</span>
    <span class="kt">vec4</span> <span class="n">color</span> <span class="o">=</span> <span class="n">texture2D</span><span class="p">(</span><span class="n">maze</span><span class="p">,</span> <span class="n">coord</span><span class="p">);</span>
    <span class="k">return</span> <span class="kt">int</span><span class="p">(</span><span class="n">color</span><span class="p">.</span><span class="n">r</span> <span class="o">*</span> <span class="mi">11</span><span class="p">.</span><span class="mi">0</span> <span class="o">+</span> <span class="mi">0</span><span class="p">.</span><span class="mi">5</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
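<p>A CPU model of this encode/decode pair shows why the + 0.5 matters: it
absorbs the worst-case 8-bit quantization error, about 0.5/255 in the red
channel, which scales to only about 0.02 after multiplying by 11.</p>

```c
#include <assert.h>

/* State k is stored in the red channel as k/11 and decoded by
 * rounding red*11 back to an integer. */
static float encode(int state)
{
    return state / 11.0f;
}

static int decode(float red)
{
    return (int)(red * 11.0f + 0.5f);
}
```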

<p>This leaves three untouched channels for other useful information. I
experimented (uncommitted) with writing distance in the green channel.
When an OPEN cell becomes a FLOW cell, it stores its adjacent FLOW
cell’s distance plus one. I imagine this could be really useful in a real
application: put your map on the GPU, run the cellular automaton a
sufficient number of times, pull the map back off (<code class="language-plaintext highlighter-rouge">glReadPixels</code>),
and for every point you know both the path and total distance to the
source point.</p>
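<p>A CPU sketch of that experiment, one automaton step per call: OPEN cells
take the smallest neighboring distance plus one, so when the sweep stops
changing, every reachable cell holds its shortest distance to the seed.</p>

```c
#include <assert.h>
#include <string.h>

#define GW 5
#define GH 5

/* -1 is OPEN, -2 is WALL, anything else is a flooded cell's distance
 * from the seed. Returns nonzero while cells are still changing. */
static int flood_step(int d[GH][GW])
{
    int next[GH][GW];
    memcpy(next, d, sizeof(next));
    int changed = 0;
    for (int y = 0; y < GH; y++) {
        for (int x = 0; x < GW; x++) {
            if (d[y][x] != -1) {
                continue;  /* only OPEN cells change */
            }
            int best = -1;
            if (y > 0      && d[y-1][x] >= 0 && (best < 0 || d[y-1][x] < best)) best = d[y-1][x];
            if (y < GH - 1 && d[y+1][x] >= 0 && (best < 0 || d[y+1][x] < best)) best = d[y+1][x];
            if (x > 0      && d[y][x-1] >= 0 && (best < 0 || d[y][x-1] < best)) best = d[y][x-1];
            if (x < GW - 1 && d[y][x+1] >= 0 && (best < 0 || d[y][x+1] < best)) best = d[y][x+1];
            if (best >= 0) {
                next[y][x] = best + 1;  /* neighbor's distance plus one */
                changed = 1;
            }
        }
    }
    memcpy(d, next, sizeof(next));
    return changed;
}
```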

<h3 id="performance">Performance</h3>

<p>As mentioned above, I ran the GPU maze-solver against A* to test its
performance. I didn’t yet try running it against Dijkstra’s algorithm
on a CPU over the entire grid (one source, many destinations). If I
had to guess, I’d bet the GPU would come out on top for grids with a
high branching factor (open spaces, etc.) so that its parallelism is
most effectively exploited, but Dijkstra’s algorithm would win in all
other cases.</p>

<p>Overall this is more of a proof of concept than a practical
application. It’s proof that we can trick OpenGL into solving mazes
for us!</p>

]]>
    </content>
  </entry>
  <entry>
    <title>Markov Chain Text Generation</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2012/09/05/"/>
    <id>urn:uuid:3f808165-be65-3f4b-f485-8df6aacccd04</id>
    <updated>2012-09-05T00:00:00Z</updated>
    <category term="emacs"/><category term="elisp"/><category term="ai"/>
    <content type="html">
      <![CDATA[<p>You may have been confused by
<a href="/blog/2012/09/04/">yesterday’s nonsense post</a>. That’s because it was
generated by a few
<a href="https://github.com/skeeto/markov-text">Elisp Markov chain functions</a>. It
was fed my entire blog and used to generate a ~1500 word post.  I
tidied it up a bit to make sure the markup was valid and parentheses
were balanced, but that’s about it.</p>

<p>The algorithm is really simple and I was quite surprised by the
quality of the output. After feeding it <em>Great Expectations</em> and <em>A
Princess of Mars</em> (easily obtainable from
<a href="http://www.gutenberg.org/">Project Gutenberg</a>) I had a good laugh at
some of the output. Some choice quotes,</p>

<blockquote>
  <p>He wiped himself again, as if he didn’t marry her by hand.</p>
</blockquote>

<blockquote>
  <p>I admit having done so, and the summer afternoon toned down into the
house.</p>
</blockquote>

<p>My favorite of yesterday’s post was this one,</p>

<blockquote>
  <p>Suppose you want to read a great story, I recommend it.</p>
</blockquote>

<p>The output also looks like some types of spam, so this may be how some
spammers generate content in order to get around spam filters.</p>

<p>To build a Markov chain from input, the program looks at
<code class="language-plaintext highlighter-rouge">markov-text-state-size</code> words (default 3) and makes note of what word
follows. Then it slides the window forward one word and repeats. To
generate text, the last <code class="language-plaintext highlighter-rouge">markov-text-state-size</code> words of output
form the state, and the next word is selected from these notes at random,
weighted by the frequency of its appearance in the input text. Smaller
state sizes generate more random output and larger state sizes
generate better-structured output. Too large and the output is the
input verbatim.</p>

<p>For example, given this sentence and a state size of <em>two</em> words,</p>

<blockquote>
  <p>Quickly, he ran and he ran until he couldn’t.</p>
</blockquote>

<p>The produced chain looks like this in alist form,</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>((("Quickly," "he") "ran")
 (("he" "ran") "and" "until")
 (("ran" "and") "he")
 (("and" "he") "ran")
 (("ran" "until") "he")
 (("until" "he") "couldn't.")
 (("he" "couldn't.")))
</code></pre></div></div>
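<p>The generator’s only nontrivial operation is the weighted choice among a
state’s successors, such as choosing between “and” and “until” for
(“he” “ran”) above. Here is a sketch of that draw in C rather than Elisp
(the counts are hypothetical, but the idea is the same):</p>

```c
#include <assert.h>

static unsigned xorshift32(unsigned *s)
{
    *s ^= *s << 13;
    *s ^= *s >> 17;
    *s ^= *s << 5;
    return *s;
}

/* Choose index i with probability counts[i] / total, matching how
 * often each successor word followed the state in the input. */
static int pick(const int counts[], int n, unsigned *rng)
{
    int total = 0;
    for (int i = 0; i < n; i++) {
        total += counts[i];
    }
    int r = (int)(xorshift32(rng) % (unsigned)total);
    for (int i = 0; i < n; i++) {
        if (r < counts[i]) {
            return i;
        }
        r -= counts[i];
    }
    return n - 1;
}
```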

<p><a href="/img/diagram/markov-chain.gv"><img src="/img/diagram/markov-chain.png" alt="" /></a></p>

<p>Because there are two options for (“he” “ran”), the generator might
loop around that state for a while like so,

<blockquote>
  <p>Quickly, he ran and he ran and he ran and he ran until he couldn’t.</p>
</blockquote>

<p>Or it might skip the section altogether,</p>

<blockquote>
  <p>Quickly, he ran until he couldn’t.</p>
</blockquote>

<p>Also notice that the punctuation is part of the word. This makes the
output more natural, automatically forming sentences. What’s more, my
program also holds onto all newlines. This breaks the output into nice
paragraphs without any extra effort. Since I wrote it in Elisp, I use
<code class="language-plaintext highlighter-rouge">fill-paragraph</code> to properly wrap the paragraphs as I generate them,
so superfluous single newlines don’t hurt anything.</p>

<p>One problem I did run into with my input text was quotes. I was using
novels so there is a lot of quoted text (character dialog). The
generated text tends to balance quotes poorly. My solution for the
moment is to strip these out along with spaces when forming
words. That’s still not ideal.</p>

<p>I’m going to play with this a bit more, using it as a tool for other
project ideas (ERC bot, etc.). I already did this by including a
<a href="http://en.wikipedia.org/wiki/Lorem_ipsum"><em>lorem ipsum</em></a> generator
alongside the <code class="language-plaintext highlighter-rouge">markov-text</code> package. The input text is Cicero’s <em>De
finibus bonorum et malorum</em>, the original source of <em>lorem
ipsum</em>. This was actually the original inspiration for this project,
after I saw <code class="language-plaintext highlighter-rouge">lorem-ipsum.el</code> on EmacsWiki and decided I could do
better.</p>
]]>
    </content>
  </entry>
  <entry>
    <title>Implemented Is Simple Data Compression</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2012/09/04/"/>
    <id>urn:uuid:2dd8265c-e498-333f-7f07-4fd93d873975</id>
    <updated>2012-09-04T00:00:00Z</updated>
    <category term="emacs"/><category term="lisp"/><category term="ai"/>
    <content type="html">
      <![CDATA[<p><em>Update</em>: This post shouldn’t make sense to anyone
(hopefully). <a href="/blog/2012/09/05/">Read the follow-up</a> for an
explanation.</p>

<hr />

<p>When a branch of my posts remains simple.</p>

<p>This is necessary when one will assume Alan is more important than
number 12. By using numbers to repeat them, but this won’t work with
any sort of thing you want to load what’s needed. This includes
reimplementing the reader as it seems you still need to specify any
video-specific parameters, <code class="language-plaintext highlighter-rouge">ppmtoy4m</code> is the whole thing is just that,
decorated with some tips on how the current space as visited, then
recurse from the client to read a great story, I recommend you use to
launch a daemon process and prints the variable information to
stdout. As an added bonus, when a second variable for accumulation and
a second argument is relevant.</p>

<p>Suppose you want to read a great story, I recommend it.</p>

<p>This servlet uses the Term::ProgressBar, if it’s any good, but it’s
funny. As anyone with cats knows, it’s not <em>too</em> stupid to call
<code class="language-plaintext highlighter-rouge">fsync()</code> to force the write to the snapshot and uninterns any new
symbols. These symbols will be added to the the second experiment.</p>

<p>At this line, you can perform a number from a couple of these and give
them back any other language that can turn out even from a large
header comment in the logs, so getting someone into my honeypot
wouldn’t take long at all. The only proof I could then
cherry-pick/pull the issues from that repository and see the
polynomial interpolation at that time, presented in order. This makes
so much of web development (I think that’s his name). I am an Emacs
person myself, which I use branches all the time, now that they can be
written.</p>

<p>We will run your build system in a web front-end to it, and made a
couple of seconds.</p>

<p>You should also be a good head start, though. The SPARC is big-endian
and the results to seed their program accordingly. You could do this
is by mounting the compromised filesystem in a list. In the
decentralized model, everyone has their own solutions in parallel when
it comes across 10 it emits 0.</p>

<p>Here’s an example of some of the fire gem activated and exploded,
causing no blindness to me. They take a look at the same level as the
printed string. You can grab my source code in response to abuse by
spammers who hide fraudulent URLs behind shortened ones. If these
services ever went down all at once, these shortened URLs would rot,
destroying many of the image, with the FFI.</p>

<p>Because I wrote a shell script that will also remove the execs and
live with nested shells because the zeros cancel out everything else?
Here is the protocol.</p>

<p>Generate a 10-byte random IV. This need not implement this.</p>

<p>Note that the shell script, and the arcfour key scheduler at least n
days.</p>

<p>However, generating a series of commits to all other encounters
nothing changes.</p>

<p>Your program should simulate this by having the user to reseed
somewhere. There’s no direct way to install it to dominate for
awhile. It is strange that Matlab itself doesn’t have any sort of
syntax highlighting. Boring! I finally ran into this image. After each
paste, make a saving throw to prevent an explosion.</p>

<p>Because Gnohkk would also suffer from the bottom are arranged around
the cats in the logs, so getting someone into my honeypot wouldn’t
take long at the link in the block. Another was going to used a
stationary magnet.</p>

<p>Our team went with this array (and replaced the current layer 5). Now,
duplicate the work was done just once by freeing the entire number, it
can perform both compression and decompression on both sides don’t pay
attention to the development loop is just an ordered list of 50 H’s
and T’s. If you implement this in the same time. This is along the
way, clone my repository right into the official website so I had to
do this for any long-blocking function that I use <code class="language-plaintext highlighter-rouge">ppmtoy4m</code> to pipe
the new frames to keep, such as n^p mod M, which this will handle
efficiently. For example, to add a new compression algorithm in terms
of brute-force attacks it requires using numbers long enough to fit
three Emacs’ windows side-by-side at 78 columns each. The leftmost one
contains my active work buffer where I do most useful things, a fresh
array every time it sees a free musical.  Unfortunately, my writing
skills are even worse. I have gotten good mileage out of a file based
on their website demonstrating how to increment the iterator. I have
to type a negative comment about zip archives and moved on. I <em>am</em>
using a constant amount of memory.</p>

<p>It turns out that everyone is free to share his source code samples,
particularly more recent entries, was that producing the relief
surface was an e-mail address, I get home from work I don’t recommend
doing this with secret Java applets.</p>

<p>There are a few weeks since I last used KOffice, so I could easily
plug it into Emacs and run the test above, I would rather not do
damage, but rather a patient human being. Getting tired of manually
synchronizing them. It was finally time to document the effort as a
single mine is destroyed, the neighboring mines will replicate a
replacement. The minefield itself could therefore hold no secrets
whatsoever. This leaves out any possibility of a rumor among a group
of people. At any given time, each person in the background. My shell
habits looked like the ones you’re seeing after <code class="language-plaintext highlighter-rouge">end-package</code>.</p>

<p>It’s really simple way to detect edges all over the weekend I came up
with some rough edges. So I got it right while IE, Opera, Safari, and
Chrome all do it again.</p>

<p>Numbers can be found inside the fake closure provided by
lexical-let. In a previous post about Lua, another about a third of my
name generation code.</p>

<p>S-expressions are handy anywhere.</p>

<p>Two months ago I was so happy when I run the program with the proper
Perl regular expression contains quotes and these will not be worth
it.</p>

<p>I can’t help but think that a knight moving according to the current
symbol table to the existing mountain of elisp code out there,
requiring a massive increase in speed when using OpenCL. In fact,
there is virtually no computation involved. So what I want to look
like SBCL. Fortunately, that’s not all!!! There is a fake service or
computer on a chess board such that it’s somewhat easier to tell when
the handler can present any contents it wants. In this case, rather
than just one, even though I don’t know what it looks good, except you
want to italicize a few bits smaller than a minute. All the other day
I will probably be ordered by their own directory. Modern applications
have moved into a directory under <code class="language-plaintext highlighter-rouge">~/.config/</code>. Your script needs to
be broken into small computation units, because Emacs lacked network
functionality until recently was the package manager, <code class="language-plaintext highlighter-rouge">package</code>, and
the Emacs Lisp Package Archive.</p>

<p>One of the info field in the list, which sounds like a .emacs file in
your program. If the slot is already taken, the symbol was in an
external system.</p>

<p>After all this, I thought I’d give it a YouTube URL and a single
password if the required artifacts, digitally signs them, and bundles
them up.</p>

<p>The demo at the same length as the variable declarations are exactly
the right magical string of, say, 31 fractions.</p>

<p>The story is really happening. Optimizing away variables that point to
it.</p>

<p>Oh, and I was just a tiny subset of the memory at once became a lot of
memory. For example, here’s my laptop’s /bin/ls, very roughly
labeled.</p>

<p>The different segments of the game area was a mistake on my rolls and
had some wires, connected to some sort of bad things this may happen
subconsciously, which is given in ImageMagick’s montage tool, which
made the final montage out of the image functions described below.</p>

<p>You can write a lexer or tokenizer without one. Because of this tool,
Samuel Stoddard, gives some in-game context to the light of day. I
just use your own program, the script in your load-path somewhere.</p>

<p>I’ve frequently thought that a Lisp-based shell would be produced by
first individually gzipping each file in first.</p>

<p>For a long ways away from a simple double-click shortcut. If you just
want to duplicate the remaining canines. Her reward for victory was a
very similar process, but without any sort of thing is
transparent. I’ve already used it with a degree in, say, a few
months. I’ve used POSIX threads, Pthreads, before, so it suits my
needs for the first two arguments from filter2, as well as some more
to see my changes, but I don’t know much about it, user AJR spoiled it
with ssh-add and it queries for your passphrase, storing it in two
obarrays at once, these shortened URLs would rot, destroying many of
its input. For example, this is what registration looks like,</p>

<p>Unfortunately, the HTML output is a Harsh Mistress. If you know that
the opposite way that the adventures and characters are riddled with
mistakes and very unbalanced. For an easier way to set up properly in
your configuration.</p>

<p>I strongly recommend that you generally want to have a master pad, K,
that you often generate very improbable series of commits.</p>

<p>To all other encounters nothing changes.</p>

<p>And that’s it! I put this line in your program. If you are subscribed
to the rescue!</p>
]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Traveling Salesman Problem by Genetic Algorithm</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2007/11/30/"/>
    <id>urn:uuid:2a90302c-08f3-3b32-d41d-7d41d26cd9fc</id>
    <updated>2007-11-30T00:00:00Z</updated>
    <category term="ai"/>
    <content type="html">
      <![CDATA[<ul>
  <li><a href="/download/genetic.tar.gz">/download/genetic.tar.gz</a> (6.19KB)</li>
</ul>

<p>Here is another project for my artificial intelligence class. I wrote
a generic <a href="http://en.wikipedia.org/wiki/Genetic_algorithm">genetic algorithm</a> class in C++ and then applied that
class to the traveling salesman problem. A genetic algorithm more
tuned to the traveling salesman problem would work better.</p>

<p>This particular implementation can use up to 16 points defined in the
weight matrix stored in travel.dat. This weight matrix can either be
defined by hand or generated using gendat.m from a list of points
stored in points.txt. A chromosome is 64 bits wide, which is 16 points
with 4 bits each. To make sure that every possible chromosome is a
valid solution, the points are selected out of a circular queue. Every
4 bits describes how far along the queue to walk before pulling out a
point. With the circular queue, the chromosome could be as short as 50
bits, but I was trying different things and 64 bits is the simplest
way to represent a solution. Here are some sample chromosomes:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1101100010100000011001010101010111000110101100111111000010111010, 9315
0000110100000001000010010101010111011000101100010110000010111010, 9339
1010001100010001010010011001010010100000101100011111000010111110, 9410
0101001000000011010010011001010011010100101011100111000111011110, 9355
0101001001000000010110100001010011010110101101000001000101000110, 9349
0101011100010011000010011011011011010101101100011111010010111110, 9311
0000111101000001000010011001010010110100101100011111100010000010, 9350
1000001111100000010110101011011011100000111101010111000010111010, 9428
0000111011000001000010010111011111111110101100010111000001000111, 9448
</code></pre></div></div>

<p>The second number is the fitness value of the chromosome (10000 - path
length). Below is a path found after 20000 iterations:</p>

<p><img src="/img/diagram/salesman.png" alt="" /></p>

<p>It takes many iterations to find a reasonable solution and it never
finds a <em>really</em> good solution. This is because each node in the
chromosome depends on every single node before it. This is terrible
for a genetic algorithm, but it’s really the best I could think of
when using a generic genetic algorithm class like this.</p>

<p>A much better method would have the genetic algorithm actually know
about the problem at hand, working with nodes rather than bits.
Breeding would make cuts on nodes. Mutations would swap single nodes.
Perhaps this can be written another time.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  <entry>
    <title>Neural Network Blackjack Game</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2007/11/13/"/>
    <id>urn:uuid:d9bb1ed9-9d44-3747-6f94-a580912e68e5</id>
    <updated>2007-11-13T00:00:00Z</updated>
    <category term="ai"/>
    <content type="html">
      <![CDATA[<p>Get the source code (C++):</p>

<ul>
  <li><a href="/download/neural.tar.gz">/download/neural.tar.gz</a> (16.27KB)</li>
</ul>

<p>This is a neural network I wrote for an artificial intelligence class
I took about a year ago. It includes a stand-alone neural network
class you can easily use in your own C++ program. Built around this
neural network is a simple version of Blackjack (hit or stand only).
You can play the neural network at Blackjack after it has finished
training, which can take up to a minute or so.</p>

<p>My implementation of the neural network is a really simple one, using
only back propagation, but it still has some neat surprises in it.
When I was working with it, I would use a simple GNU Octave script to
watch what was going on in real time.</p>

<p><img src="/img/diagram/neural.png" alt="" /></p>

<p>The x-axis is the number of iterations in tens of thousands and the
y-axis describes how often the neural network plays exactly the same
way a cheater would at the same seat. A cheater is defined as someone
who knows what card is next and can play perfectly. Note: the neural
network itself is not cheating. At the end, it agrees with the cheater
about 83% of the time. This is the script that reads the neural
network output. For those who don’t recognize the
<a href="http://en.wikipedia.org/wiki/She-Bang">she-bang</a>, save this to the file <code>plotdat</code> and
set the execution permission (<code class="language-plaintext highlighter-rouge">chmod +x plotdat</code>).</p>

<div class="language-matlab highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">#!</span><span class="p">/</span><span class="n">usr</span><span class="p">/</span><span class="n">bin</span><span class="p">/</span><span class="n">octave</span> <span class="o">-</span><span class="n">qf</span>
<span class="err">#</span> <span class="n">Usage</span><span class="p">:</span>
<span class="err">#</span> <span class="n">plotdat</span> <span class="n">filename</span> <span class="p">[</span><span class="n">loop</span><span class="p">]</span> <span class="p">[</span><span class="n">last</span> <span class="n">index</span><span class="p">]</span>

<span class="n">dat</span> <span class="o">=</span> <span class="nb">dlmread</span><span class="p">(</span><span class="n">argv</span><span class="p">{</span><span class="mi">1</span><span class="p">});</span>
<span class="n">x</span> <span class="o">=</span> <span class="nb">length</span><span class="p">(</span><span class="n">dat</span><span class="p">);</span>
<span class="n">loop</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

<span class="k">if</span> <span class="p">(</span><span class="nb">length</span><span class="p">(</span><span class="n">argv</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">1</span><span class="p">)</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">argv</span><span class="p">{</span><span class="mi">2</span><span class="p">}</span> <span class="o">==</span> <span class="s1">'loop'</span><span class="p">)</span>
    <span class="n">loop</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
  <span class="k">else</span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">argv</span><span class="p">{</span><span class="mi">2</span><span class="p">};</span>
  <span class="k">end</span>
<span class="k">end</span>

<span class="nb">plot</span><span class="p">(</span><span class="n">dat</span><span class="p">(</span><span class="mi">1</span><span class="p">:</span><span class="n">x</span><span class="p">));</span>

<span class="k">while</span><span class="p">(</span><span class="n">loop</span><span class="p">)</span>
  <span class="n">dat</span> <span class="o">=</span> <span class="nb">dlmread</span><span class="p">(</span><span class="n">argv</span><span class="p">{</span><span class="mi">1</span><span class="p">});</span>
  <span class="nb">plot</span><span class="p">(</span><span class="n">dat</span><span class="p">);</span>
  <span class="n">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
<span class="k">end</span>

<span class="nb">pause</span><span class="p">;</span>
</code></pre></div></div>

<p>When you run the program, a blank, dumb neural network is created. It
doesn’t know how to play blackjack. In fact, just as it seems with my
sister at times, it doesn’t know how to do anything at all. It’s like
a newborn baby, but without any instincts. To make this neural network
useful, it is taught to play blackjack by running it very quickly
through about one million games, all tutored by a teacher who gets to
cheat by looking ahead in the deck. This allows the teacher to always
know the proper move.</p>

<p>The training works by giving the network an input and the desired
output (determined by the cheating). The network adjusts its internal
weights depending on the error between its current output for the
input and the desired output. In doing this, the neural network picks
up on the statistical nature of its inputs and learns their
patterns.</p>

<p>When setting up the neural network in your own program, the tricky
part is determining how to encode the inputs so that patterns will be
found. In this case, I provide 3 integers to the network: its lower
bound score, its best score, and the visible opponent score. By lower
bound and best score, I am talking about Aces. All aces are treated as
1’s in the lower bound and ideal (highest without bust) values in the
best score. Scores never exceed 31, so we can encode this in 5 bits
each for a total of 15 bits. We only need 1 output bit: hit (0) or
stand (1). The network looks something like this, but with 30 hidden
layer neurons and a lot more connections.</p>

<p><img src="/img/diagram/network.png" alt="" /></p>

<p>Here is an example run of the blackjack game,</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Creating network ... created!
Training ...
Win, lose, total, c%, win%: 1536, 7640, 10000, 42.1168% , 15.36%
Win, lose, total, c%, win%: 3728, 14438, 20000, 55.8875% , 21.92%

... (lots of output for a few seconds) ...

Win, lose, total, c%, win%: 203744, 700338, 980000, 82.0455% , 20.3%
Win, lose, total, c%, win%: 205849, 707461, 990000, 82.5488% , 21.05%
Win, lose, total, c%, win%: 207981, 714515, 1000000, 82.3059% , 21.32%

Begin game:
Computer is dealt: (hole) 5
Computer is dealt: (hole) 5 4
Computer stands.
Your hand: 2 8
Hit? (y/n): y
Your hand: 2 8 3
Hit? (y/n): y
Your hand: 2 8 3 10
You bust with 23
Computer wins with 19 against your 23
Play again? (y/n) : y

Begin game:
Computer is dealt: (hole) 4
Computer is dealt: (hole) 4 10
Computer stands.
Your hand: 10 A
Hit? (y/n): n
Your hand: 10 A
You win with 21 against computer's 20
Play again? (y/n) : n

Computer wins, losses, push: 1 (50%), 1 (50%), 0 (0%)
</code></pre></div></div>

<p>If you decide to try using the neural network in your own program, be
sure to play with different sized networks to see what works best. In
my implementation, a small network would be 1 layer with 3 nodes and a
large network would be 3 or 4 layers with hundreds of nodes. Bigger
networks may not work well at all, and there is no way to know what
size is best other than trial and error.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Iterated Prisoner's Dilemma</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2007/11/06/"/>
    <id>urn:uuid:4068d624-311a-30f4-a291-63713ffbc932</id>
    <updated>2007-11-06T00:00:00Z</updated>
    <category term="ai"/><category term="video"/><category term="lisp"/>
    <content type="html">
      <![CDATA[<p><img src="/img/prison/top.gif" alt="" /></p>

<p>I was reading about the <a href="http://en.wikipedia.org/wiki/Prisoner's_dilemma">prisoner’s dilemma</a> game the other day
and was inspired to simulate it myself. It would also be a good
project to start learning Common Lisp. All of the source code is
available in its original source file here:</p>

<ul>
  <li><a href="/download/prison/prison.lisp">/download/prison/prison.lisp</a></li>
</ul>

<p>I have only tried this code in my favorite Common Lisp implementation,
<a href="http://clisp.cons.org/">CLISP</a>, as well as <a href="http://www.cons.org/cmucl/">CMUCL</a>.</p>

<p>In prisoner’s dilemma, two players acting as prisoners are given the
option of cooperating with or betraying (defecting) the other player.
Each player’s decision along with his opponent’s decision determines
the length of his prison sentence. It is bad news for the cooperating
player when the other player is defecting.</p>

<p>Prisoner’s dilemma becomes more interesting in the iterated version of
the game, where the same two players play repeatedly. This allows
players to “punish” each other for uncooperative play. Scoring
generally works like so (higher is better),</p>

<table>
<tr><td colspan="2"></td><th colspan="2">Player A</th></tr>
<tr><td colspan="2"></td><td>coop</td><td>defect</td></tr>
<tr><th rowspan="2">Player B</th><td>coop</td>
<td>(3,3)</td><td>(0,5)</td></tr>
<tr><td>defect</td><td>(5,0)</td><td>(1,1)</td></tr>
</table>

<p>The most famous, and strongest, individual strategy is tit-for-tat.
This player begins by playing cooperatively, then does whatever its
opponent did last. Here is the Common Lisp code to run a
tit-for-tat strategy,</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">tit-for-tat</span> <span class="p">()</span>
  <span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nv">x</span><span class="p">)</span>
    <span class="p">(</span><span class="k">if</span> <span class="p">(</span><span class="nb">null</span> <span class="nv">x</span><span class="p">)</span> <span class="ss">:coop</span> <span class="nv">x</span><span class="p">)))</span>
</code></pre></div></div>

<p>If you are unfamiliar with Common Lisp, the <code class="language-plaintext highlighter-rouge">lambda</code> part is returning
an anonymous function that actually plays the tit-for-tat strategy.
The <code class="language-plaintext highlighter-rouge">tit-for-tat</code> function generates a tit-for-tat player along with
its own closure. The argument to the anonymous function supplies the
opponent’s last move, which is one of the symbols <code class="language-plaintext highlighter-rouge">:coop</code> or
<code class="language-plaintext highlighter-rouge">:defect</code>. In the case of the first move, <code class="language-plaintext highlighter-rouge">nil</code> is passed. These are
some really simple strategies that ignore their arguments,</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">rand-play</span> <span class="p">()</span>
  <span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nv">x</span><span class="p">)</span>
    <span class="p">(</span><span class="k">declare</span> <span class="p">(</span><span class="k">ignore</span> <span class="nv">x</span><span class="p">))</span>
    <span class="p">(</span><span class="k">if</span> <span class="p">(</span><span class="nb">&gt;</span> <span class="p">(</span><span class="nb">random</span> <span class="mi">2</span><span class="p">)</span> <span class="mi">0</span><span class="p">)</span> <span class="ss">:coop</span> <span class="ss">:defect</span><span class="p">)))</span>

<span class="p">(</span><span class="nb">defun</span> <span class="nv">switcher-coop</span> <span class="p">()</span>
  <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nb">last</span> <span class="ss">:coop</span><span class="p">))</span>
    <span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nv">x</span><span class="p">)</span>
      <span class="p">(</span><span class="k">declare</span> <span class="p">(</span><span class="k">ignore</span> <span class="nv">x</span><span class="p">))</span>
      <span class="p">(</span><span class="k">if</span> <span class="p">(</span><span class="nb">eq</span> <span class="nb">last</span> <span class="ss">:coop</span><span class="p">)</span>
          <span class="p">(</span><span class="nb">setf</span> <span class="nb">last</span> <span class="ss">:defect</span><span class="p">)</span>
          <span class="p">(</span><span class="nb">setf</span> <span class="nb">last</span> <span class="ss">:coop</span><span class="p">)))))</span>

<span class="p">(</span><span class="nb">defun</span> <span class="nv">switcher-defect</span> <span class="p">()</span>
  <span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nb">last</span> <span class="ss">:defect</span><span class="p">))</span>
    <span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nv">x</span><span class="p">)</span>
      <span class="p">(</span><span class="k">declare</span> <span class="p">(</span><span class="k">ignore</span> <span class="nv">x</span><span class="p">))</span>
      <span class="p">(</span><span class="k">if</span> <span class="p">(</span><span class="nb">eq</span> <span class="nb">last</span> <span class="ss">:coop</span><span class="p">)</span>
          <span class="p">(</span><span class="nb">setf</span> <span class="nb">last</span> <span class="ss">:defect</span><span class="p">)</span>
          <span class="p">(</span><span class="nb">setf</span> <span class="nb">last</span> <span class="ss">:coop</span><span class="p">)))))</span>

<span class="p">(</span><span class="nb">defun</span> <span class="nv">always-coop</span> <span class="p">()</span>
  <span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nv">x</span><span class="p">)</span>
    <span class="p">(</span><span class="k">declare</span> <span class="p">(</span><span class="k">ignore</span> <span class="nv">x</span><span class="p">))</span>
    <span class="ss">:coop</span><span class="p">))</span>

<span class="p">(</span><span class="nb">defun</span> <span class="nv">always-defect</span> <span class="p">()</span>
  <span class="p">(</span><span class="k">lambda</span> <span class="p">(</span><span class="nv">x</span><span class="p">)</span>
    <span class="p">(</span><span class="k">declare</span> <span class="p">(</span><span class="k">ignore</span> <span class="nv">x</span><span class="p">))</span>
    <span class="ss">:defect</span><span class="p">))</span>
</code></pre></div></div>

<p>Patrick Grim did an interesting study about ten years ago on iterated
prisoner’s dilemma involving competing strategies in a 2-dimensional
area: <a href="http://www.sunysb.edu/philosophy/faculty/pgrim/SPATIALP.HTM">Undecidability in the Spatialized Prisoner’s Dilemma: Some
Philosophical Implications</a>. It is very interesting, but I really
wanted to play around with some different configurations myself. So
what I did was extend my iterated prisoner’s dilemma engine above to
run over a 2-dimensional grid.</p>

<p>Grim’s idea was this: place different strategies in a 2-dimensional
grid. Each strategy competes against its immediate neighbors. (The
paper doesn’t specify which kind of neighbor, 4-connected or
8-connected, so I went with 4-connected.) The scores from these
competitions are added up to make that cell’s final score. After
scoring, each cell takes on the strategy of its highest-scoring
neighbor, if any of its neighbors scored higher than itself. Repeat.</p>

<p>The paper showed some interesting results, where the tit-for-tat
strategy would sometimes dominate, and, in other cases, be quickly
wiped out, depending on starting conditions. Here was my first real
test of my simulation. Three strategies were placed randomly in a
50x50 grid: tit-for-tat, always-cooperate, and always-defect. This is
the first twenty iterations. It stabilizes after 16 iterations.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">run-random-matrix</span> <span class="mi">50</span> <span class="mi">100</span> <span class="mi">20</span> <span class="o">'</span><span class="p">(</span><span class="nv">tit-for-tat</span> <span class="nv">always-coop</span> <span class="nv">always-defect</span><span class="p">))</span>
</code></pre></div></div>

<p><img src="/img/prison/random.gif" alt="" /></p>

<p>White is always-cooperate, black is always-defect, and cyan is
tit-for-tat. Notice how the always-defect quickly exploits the
always-cooperate and dominates the first few iterations. However, as
the always-cooperate resource becomes exhausted, the tit-for-tat
cooperative strategy works together with itself, as well as the
remaining always-cooperate, to eliminate the always-defect invaders,
who have no one left to exploit. In the end, a few always-defect cells
are left in equilibrium, feeding off of always-cooperate neighbors,
who themselves have enough cooperating neighbors to hold their ground.</p>

<p>The effect can be seen more easily here. Around the outside is
tit-for-tat, in the middle is always-cooperate, and a single
always-defect cell is placed in the middle.</p>

<div class="language-cl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nv">run-matrix</span> <span class="p">(</span><span class="nv">create-three-box</span><span class="p">)</span> <span class="mi">100</span> <span class="mi">30</span><span class="p">)</span>
</code></pre></div></div>

<p><img src="/img/prison/boxes.gif" alt="" /></p>

<p>The asymmetric pattern is due to the way that ties are broken.</p>

<p>The Lisp code only spits out text, which makes it hard to follow
what’s going on. To generate these GIFs, I first used this Octave
script to convert the text into images. Just dump the lisp output to a
text file and remove the hash table dump at the end. Then run this
script on that file:</p>

<ul>
  <li><a href="/download/prison/pd_plot.m">/download/prison/pd_plot.m</a></li>
</ul>

<p>The text file input should look like this:</p>

<ul>
  <li><a href="/download/prison/example.txt">/download/prison/example.txt</a></li>
</ul>

<p><del>You will need Octave-Forge.</del></p>

<p>The script will make PNGs. You can either change the script to make
GIFs (didn’t try this myself), or use something like
<a href="http://www.imagemagick.org/">ImageMagick</a> to convert the images afterward. Then, you
compile the frames into an animated GIF using <a href="http://www.lcdf.org/gifsicle/">Gifsicle</a>.</p>

<p>See if you can come up with some different strategies and make some
special patterns for them. You may be able to observe some interesting
interactions. The image at the beginning of the article uses all of
the listed strategies in a random matrix.</p>

<p>I will continue to try out more strategies to see if I can find
something particularly interesting.</p>

]]>
    </content>
  </entry>
    
  
    
  <entry>
    <title>Chess AI Idea</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2007/10/24/"/>
    <id>urn:uuid:2e66347a-fe5d-3b70-58ea-9a9e9bd6248a</id>
    <updated>2007-10-24T00:00:00Z</updated>
    <category term="ai"/><category term="game"/>
    <content type="html">
      <![CDATA[<p>So, I had this idea of using a genetic algorithm to optimize the
parameters of a program that plays chess. Now, the genetic algorithm
wouldn’t be used at all during a game, but rather to optimize the
board evaluation parameters beforehand. I don’t know much about
writing board game AI programs, as I have only written a few of them
for fun (tic-tac-toe, connect 4, Pente). For this chess program, I am
taking a simple approach because I am more interested in seeing the
genetic algorithm at work than seeing the chess playing AI do well
against other chess AI or people.</p>

<p>The program would search the game tree using the <a href="http://en.wikipedia.org/wiki/Minimax">minimax</a>
algorithm, with some <a href="http://en.wikipedia.org/wiki/Alpha-beta_pruning">possible optimizations</a> added
afterward. Tree searching is just a matter of generating all possible
moves and looking at them. The hard part is the board evaluation
function, which evaluates a particular board’s score based on the
arrangement of the pieces. Parameters to this evaluation function
would be, for example, the piece values. The pawn would be locked in
at a value of 1, which anchors the other values and provides a base
unit to work from.</p>
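<p>To make the shape of this concrete, here is a minimal sketch. The piece letters, the <code>(piece, owner)</code> board encoding, and the callback-style game tree are all illustrative interfaces of my own, not a real engine: the evaluation is parameterized by a chromosome of piece values with the pawn anchored at 1, and a plain minimax applies it at the leaves.</p>

```python
# Sketch only: chromosome-parameterized evaluation plus plain minimax.
# The board encoding and tree callbacks are illustrative assumptions.

def make_evaluator(chromosome):
    # chromosome = (knight, bishop, rook, queen); the pawn is locked
    # at 1 as the base unit that anchors the other values
    knight, bishop, rook, queen = chromosome
    values = {'P': 1, 'N': knight, 'B': bishop, 'R': rook, 'Q': queen}

    def evaluate(board, side):
        """Material balance from `side`'s point of view, ignoring piece
        positions (board = iterable of (piece, owner) pairs)."""
        return sum(values[p] if owner == side else -values[p]
                   for p, owner in board)
    return evaluate

def minimax(node, depth, maximizing, children, score):
    """Plain minimax over an abstract game tree. `children(node)` yields
    successors (empty at terminal nodes); `score(node)` is the board
    evaluation applied at the search horizon."""
    kids = list(children(node))
    if depth == 0 or not kids:
        return score(node)
    results = [minimax(k, depth - 1, not maximizing, children, score)
               for k in kids]
    return max(results) if maximizing else min(results)
```

<p>At the horizon, <code>score</code> would be the evaluator above applied to the node’s board from the searching side’s point of view.</p>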

<p>Again, the parameters would not change during a game. We use the
genetic algorithm ahead of time to determine the parameters.</p>

<p>For the genetic algorithm, the set of parameters strung together makes
up a single chromosome. We maintain a pool of different chromosomes,
i.e. different sets of parameters, and breed these chromosomes
together to improve our parameter sets. We start out with a random
pool made of parameters that are most likely pretty terrible.</p>

<p>To evaluate the chromosomes, we need a fitness function, which
evaluates each chromosome for its level of “fitness,” deciding whether
it breeds or not. To do this we simply play the chromosome we are
evaluating against some base chromosome, which may just be parameters
chosen intuitively by the programmer. Or, the base chromosome could be
random too. Starting with a better base chromosome would be a good
head start, though. The fitness of the chromosome is how often it wins
against the base chromosome in, say, a few hundred games.</p>
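<p>As a sketch, the fitness function reduces to a win rate. <code>play_game</code> here is a hypothetical stand-in for a full match between two parameter sets, not anything from a real engine:</p>

```python
def fitness(chromosome, base, play_game, games=200):
    """Fitness = fraction of games won against the base chromosome.
    `play_game(a, b)` is a stand-in for a full chess match between two
    parameter sets; it returns 1 if `a` wins and 0 otherwise."""
    return sum(play_game(chromosome, base) for _ in range(games)) / games
```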

<p>The most fit chromosomes are bred by taking a few parameters from each
to make a new chromosome. Mutations are occasionally added in order to
keep the chromosome pool from getting stuck in a local maximum. A
mutation involves changing one or more parameters in a chromosome
slightly in some random way. Mutations are rare and will usually be
detrimental to the chromosome, quickly killing it off, but will
occasionally cause a good change that will be spread to other
chromosomes in the next generation, improving the gene pool.</p>
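<p>The breeding step might look like the following sketch. The crossover scheme, mutation rate, and survivor fraction are all assumptions on my part:</p>

```python
import random

def crossover(a, b):
    """Child takes each parameter from one parent or the other."""
    return tuple(random.choice(pair) for pair in zip(a, b))

def mutate(chromosome, rate=0.05, scale=0.5):
    """Occasionally nudge a parameter by a small random amount."""
    return tuple(v + random.uniform(-scale, scale)
                 if random.random() < rate else v
                 for v in chromosome)

def next_generation(pool, fitness, keep=0.25):
    """Breed the fittest fraction of the pool into a full new pool."""
    ranked = sorted(pool, key=fitness, reverse=True)
    parents = ranked[:max(2, int(len(pool) * keep))]
    return [mutate(crossover(*random.sample(parents, 2)))
            for _ in range(len(pool))]
```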

<p>We iterate this until either the maximum fitness level in the pool is
stuck for several iterations (we aren’t getting anywhere and mutations
aren’t helping), or the chromosomes are so good they always beat the
base chromosome, making the fitness algorithm meaningless. When this
happens, we replace the base chromosome with the best chromosome in
the pool and start over from scratch again with a random, or mostly
random, pool.</p>

<p>As you would expect, I have looked into parallelizing this process to
take advantage of a cluster. This is easy for several reasons. First,
evaluating chromosomes can be done simultaneously. No evaluation
depends on another chromosome’s evaluation. Second, the minimax game
tree search can be parallelized so that several different processes
search the game tree and give their results back to the parent
process. This works very well because the data being sent back to the
parent will be a single integer. No need to send large amounts of data
around the network.</p>
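<p>Because no evaluation depends on another, the fan-out is just a parallel map. A sketch using Python’s standard library (a single machine rather than a cluster, but the shape is the same):</p>

```python
from concurrent.futures import ProcessPoolExecutor

def evaluate_pool(pool, fitness, pmap=map):
    """Score every chromosome independently. Any parallel map drops in
    for `pmap`, e.g.:
        with ProcessPoolExecutor() as ex:
            scores = evaluate_pool(pool, fitness, ex.map)
    A cluster version would only need a distributed map instead."""
    return list(pmap(fitness, pool))
```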

<p>I spent an afternoon hacking at this, but it’s still too crude to
share yet. I got the non-parallel version of the chess engine built,
but I am
still working on the evaluation function. The genetic algorithm hasn’t
been started. The only parameters at the moment are piece values. The
board evaluation function just adds up the piece values on the board
completely ignoring their positions. This makes the computer play
extremely aggressively, capturing the opponent’s pieces whenever it
can. This makes for a somewhat interesting bloodbath where the board
goes empty after just a few moves.</p>

<p>My problem right now is finding a good way to represent piece
movements so that the representation can be reused. That is, I want to
represent movements the same way when generating my search tree,
verifying the legality of a move, and evaluating the board, so that I
don’t have to program in piece movements several times.</p>

]]>
    </content>
  </entry>
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  
    
  

</feed>
