nullprogram.com/blog/2024/11/10/
This article was discussed on Hacker News.
Over the past month I’ve been exploring the rapidly evolving world of
Large Language Models (LLM). It’s now accessible enough to run a LLM on a
Raspberry Pi smarter than the original ChatGPT (November 2022). A modest
desktop or laptop supports even smarter AI. It’s also private, offline,
unlimited, and registration-free. The technology is improving at breakneck
speed, and information is outdated in a matter of months. This article
snapshots my practical, hands-on knowledge and experiences — information I
wish I had when starting. Keep in mind that I’m a LLM layman, I have no
novel insights to share, and it’s likely I’ve misunderstood certain
aspects. In a year this article will mostly be a historical footnote,
which is simultaneously exciting and scary.
In case you’ve been living under a rock — as an under-the-rock inhabitant
myself, welcome! — LLMs are neural networks that underwent a breakthrough
in 2022 when trained for conversational “chat.” Through it, users converse
with an artificial intelligence indistinguishable from a human, which
smashes the Turing test and can be wickedly creative.
Interacting with one for the first time is unsettling, a feeling which
will last for days. When you bought your most recent home computer, you
probably did not expect to have a meaningful conversation with it.
I’ve found this experience reminiscent of the desktop computing revolution
of the 1990s, where your newly purchased computer seemed obsolete by the
time you got it home from the store. There are new developments each week,
and as a rule I ignore almost any information more than a year old. The
best way to keep up has been r/LocalLLaMa. Everything is hyped to the
stratosphere, so take claims with a grain of salt.
I’m wary of vendor lock-in, having experienced the rug pulled out from
under me by services shutting down, changing, or otherwise dropping my use
case. I want the option to continue, even if it means changing providers.
So for a couple of years I’d ignored LLMs. The “closed” models, accessible
only as a service, have the classic lock-in problem, including silent
degradation. That changed when I learned I can run models close
to the state-of-the-art on my own hardware — the exact opposite of vendor
lock-in.
This article is about running LLMs, not fine-tuning, and definitely not
training. It’s also only about text, and not vision, voice, or other
“multimodal” capabilities, which aren’t nearly so useful to me personally.
To run a LLM on your own hardware you need software and a model.
The software
I’ve exclusively used the astounding llama.cpp. Other options exist,
but for basic CPU inference — that is, generating tokens using a CPU
rather than a GPU — llama.cpp requires nothing beyond a C++ toolchain. In
particular, none of the Python fiddling that plagues much of the ecosystem. On
Windows it will be a 5MB llama-server.exe
with no runtime dependencies.
From just two files, EXE and GGUF (model), both designed to load via
memory map, you could likely still run the same LLM 25 years from
now, in exactly the same way, out-of-the-box on some future Windows OS.
Full disclosure: I’m biased because the official Windows build process is
w64devkit. What can I say? These folks have good taste! That being
said, you should only do CPU inference if GPU inference is impractical. It
works reasonably up to ~10B parameter models on a desktop or laptop, but
it’s slower. The build for my primary use case doesn’t use w64devkit because
I’m using CUDA for inference, which requires an MSVC toolchain. Just for fun, I
ported llama.cpp to Windows XP and ran a 360M model on a 2008-era
laptop. It was magical to load that old laptop with technology that, at
the time it was new, would have been worth billions of dollars.
The bottleneck for GPU inference is video RAM, or VRAM. These models are,
well, large. The more RAM you have, the larger the model and the longer
the context window. Larger models are smarter, and longer contexts let you
process more information at once. GPU inference is not worth it below
8GB of VRAM: a 4-bit quant needs roughly half a gigabyte of VRAM per billion
parameters, plus room for the context, so even ~10B models are a tight fit.
If you’re “GPU poor”, stick with CPU inference. On the plus side, it’s
simpler and easier to get started with CPU inference.
There are many utilities in llama.cpp, but this article is concerned with
just one: llama-server is the program you want to run. It’s an HTTP
server (default port 8080) with a chat UI at its root, and APIs for use
by programs, including other user interfaces. A typical invocation:
$ llama-server --flash-attn --ctx-size 0 --model MODEL.gguf
The context size is the largest number of tokens the LLM can handle at
once, input plus output. Contexts typically range from 8K to 128K tokens,
and depending on the model’s tokenizer, normal English text is ~1.6 tokens
per word as counted by wc -w. If the model supports a large context you
may run out of memory. If so, set a smaller context size, like
--ctx-size $((1<<13)) (i.e. 8K tokens).
I do not yet understand what flash attention is about, and I don’t know
why --flash-attn (-fa) is not the default (lower accuracy?), but you
should always request it because it reduces memory requirements when
active and is well worth the cost.
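Once the server is up, programs can use its API directly. Here’s a minimal
sketch, using only the Python standard library, of a chat request against
the OpenAI-style /v1 endpoint that llama-server exposes (field names follow
the OpenAI convention; adjust if your version differs):

import json
import urllib.request

# Assumes llama-server is running on the default port with a model loaded.
request = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps({
        "messages": [
            {"role": "user", "content": "Write a haiku about memory maps."},
        ],
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    body = json.load(response)
print(body["choices"][0]["message"]["content"])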
If the server started successfully, visit it (http://localhost:8080/) to
try it out. Though of course you’ll need a model first.
The models
Hugging Face (HF) is “the GitHub of LLMs.” It’s an incredible
service that has earned that title. “Small” models are around a few GBs,
large models are hundreds of GBs, and HF hosts it all for free. With a
few exceptions that do not matter in practice, you don’t even need to sign
up to download models! (I’ve been so impressed that after a few days they
got a penny-pincher like me to pay for a pro account.) That means you can
immediately download and try any of the stuff I’m about to discuss.
If you look now, you’ll wonder, “There’s a lot of stuff here, so what the
heck am I supposed to download?” That was me one month ago. For llama.cpp,
the answer is GGUF. None of the models are natively in GGUF.
Instead GGUFs are in a repository with “GGUF” in the name, usually by a
third party: one of the heroic, prolific GGUF quantizers.
(Note how nowhere does the official documentation define what “GGUF”
stands for. Get used to that. This is a technological frontier, and if the
information exists at all, it’s not in the obvious place. If you’re
considering asking your LLM about this once it’s running: Sweet summer
child, we’ll soon talk about why that doesn’t work. As far as I can tell,
“GGUF” has no authoritative definition (update: the U stands for
“Unified”, but the rest is still ambiguous).)
Since llama.cpp is named after Meta’s flagship model, their model is a
reasonable start, though it’s not my personal favorite. The latest is
Llama 3.2, but at the moment only the 1B and 3B models — that is, ~1
billion and ~3 billion parameters — work in llama.cpp. Those are a little
too small to be of much use, and your computer can likely do better if
it’s not a Raspberry Pi, even with CPU inference. Llama 3.1 8B is a better
option. (If you’ve got at least 24GB of VRAM then maybe you can even do
Llama 3.1 70B.)
If you search for Llama 3.1 8B you’ll find two options, one qualified
“instruct” and one with no qualifier. Instruct means it was trained to
follow instructions, i.e. to chat, and that’s nearly always what you want.
The other is the “base” model which can only continue a text. (Technically
the instruct model is still just completion, but we’ll get to that later.)
It would be great if base models were qualified “Base” but, for dumb path
dependency reasons, they’re usually not.
You will not find GGUF in the “Files” for the instruct model, nor can you
download the model without signing up in order to agree to the community
license. Go back to the search, add GGUF, and look for the matching GGUF
model: bartowski/Meta-Llama-3.1-8B-Instruct-GGUF. bartowski is
one of the prolific and well-regarded GGUF quantizers. Not only will this
be in the right format for llama.cpp, you won’t need to sign up.
In “Files” you will now see many GGUFs. These are different quantizations
of the same model. The original model has bfloat16 tensors, but for
merely running the model we can throw away most of that precision with
minimal damage. It will be a tiny bit dumber and less knowledgeable, but
will require substantially fewer resources. The general recommendation,
which fits my experience, is to use Q4_K_M, a 4-bit quantization. In
general, it’s better to run a 4-bit quant of a larger model than an 8-bit quant
of a smaller model. Once you’ve got the basics understood, experiment with
different quants and see what you like!
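If you’d rather script the download than click through the website, the
huggingface_hub Python package can fetch a single GGUF. A sketch for the
model discussed above (the exact filename is the quantizer’s choice; check
the repository’s “Files” listing before running):

from huggingface_hub import hf_hub_download

# Downloads one GGUF into the local Hugging Face cache and returns its path.
# The filename below is an assumption; confirm it in the repository listing.
path = hf_hub_download(
    repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
    filename="Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
)
print(path)  # hand this path to llama-server --model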
My favorite models
Models are trained for different trade-offs and differ in strengths and
weaknesses, so no model is best at everything — especially on “GPU-poor”
configurations. My desktop system has an RTX 3050 Ti with 8GB VRAM, and
its limitations have shaped my choices. I can comfortably run ~10B models,
and ~30B models just barely, which is enough to test their capabilities. For ~70B I
rely on third-party hosts. My “t/s” numbers are all on this system running
4-bit quants.
This list omits “instruct” from the model name, but assume the instruct
model unless I say otherwise. A few are bona fide open source, at least
as far as LLMs practically can be, and I’ve noted the license when that’s
the case. The rest place restrictions on both use and distribution.
-
Mistral-Nemo-2407 (12B) [Apache 2.0]
A collaboration between Mistral AI and Nvidia (“Nemo”), the
most well-rounded ~10B model I’ve used, and my default. Inference starts
at a comfortable 30 t/s. Its strengths are writing and proofreading,
and it can review code nearly as well as ~70B models. It was trained for
a context length of 128K, but its effective context length is closer to
16K — a limitation I’ve personally observed.
The “2407” is a date (July 2024) as version number, a versioning scheme
I wholeheartedly support. A date tells you about its knowledge cut-off
and tech level. It sorts well. Otherwise LLM versioning is a mess. Just
as open source is bad with naming, AI companies do not comprehend
versioning.
-
Qwen2.5-14B [Apache 2.0]
Qwen models, by Alibaba Cloud, impressively punch above their weight at
all sizes. 14B inference starts at 11 t/s, with capabilities on par with
Mistral Nemo. If I could run the 72B on my own hardware, it would probably
be my default. I’ve been trying it through Hugging Face’s inference API.
There’s a 32B model, but it’s impractical for my hardware, so I haven’t
spent much time with it.
-
Gemma-2-2B
Google’s model is popular, perhaps due to its playful demeanor. For me,
the 2B model is great for fast translation. It’s amazing that LLMs
have nearly obsoleted Google Translate, and you can run it on your home
computer. Though it’s more resource-intensive, and refuses to translate
texts it finds offensive, which sounds like a plot element from a sci-fi
story. In my translation script, I send it text marked up with HTML.
Simply asking Gemma to preserve the markup Just Works! The 9B model is
even better, but slower, and I’d use it instead of 2B for translating my
own messages into another language.
-
Phi3.5-Mini (4B) [MIT]
Microsoft’s niche is training on synthetic data. The result is a model
that does well in tests, but doesn’t work so well in practice. For me,
its strength is document evaluation. I’ve loaded the context with up to
40K-token documents — it helps that it’s a 4B model — and successfully
queried accurate summaries and data listings.
-
SmolLM2-360M [Apache 2.0]
Hugging Face doesn’t just host models; their 360M model is unusually
good for its size. It fits on my 2008-era laptop with 1G of RAM, a Celeron,
and a 32-bit operating system. It also runs well on older Raspberry Pis. It’s
creative, fast, converses competently, can write poetry, and is a fun toy
in cramped spaces.
-
Mixtral-8x7B (48B) [Apache 2.0]
Another Mistral AI model, and more of a runner-up. 48B seems too large,
but this is a Mixture of Experts (MoE) model. Inference uses only
13B parameters at a time. It’s reasonably suited to CPU inference on a
machine with at least 32G of RAM. The model retains more of its training
inputs, more like a database, but for reasons we’ll see soon, it isn’t
as useful as it might seem.
-
Llama-3.1-70B and Llama-3.1-Nemotron-70B
More models I cannot run myself, but which I access remotely. The latter
bears “Nemo” because it’s an Nvidia fine-tune. If I could run 70B models
myself, Nemotron might just be my default. I’d need to spend more time
evaluating it against Qwen2.5-72B.
Most of these models have abliterated or “uncensored” versions, in
which refusal is partially fine-tuned out at a cost of model degradation.
Refusals are annoying — such as Gemma refusing to translate texts it
dislikes — but they don’t happen often enough for me to make that trade-off. Maybe
I’m just boring. Also refusals seem to decrease with larger contexts, as
though “in for a penny, in for a pound.”
The next group are “coder” models trained for programming. In particular,
they have fill-in-the-middle (FIM) training for generating code inside
an existing program. I’ll discuss what that entails in a moment. As far as
I can tell, they’re no better at code review or other instruct-oriented
tasks. It’s the opposite: FIM training is done in the base model, with
instruct training applied later on top, so instruct works against FIM!
In other words, base model FIM outputs are markedly better, though you
lose the ability to converse with them.
There will be a section on evaluation later, but I want to note now that
LLMs produce mediocre code, even at the state-of-the-art. The rankings
here are relative to other models, not about overall capability.
-
DeepSeek-Coder-V2-Lite (16B)
A self-titled MoE model from DeepSeek. It uses 2B parameters
during inference, making it as fast as Gemma 2 2B but as smart as
Mistral Nemo, striking a great balance, especially because it
out-competes ~30B models at code generation. If I’m playing around with
FIM, this is my default choice.
-
Qwen2.5-Coder-7B [Apache 2.0]
Qwen Coder is a close second. Output is nearly as good, but slightly
slower since it’s not MoE. It’s a better choice than DeepSeek if you’re
memory-constrained. While writing this article, Alibaba Cloud released a
new Qwen2.5-Coder-7B but failed to increment the version number, which
is horribly confusing. The community has taken to calling it Qwen2.5.1.
Remember what I said about AI companies and versions? (Update: One
day after publication, 14B and 32B coder models were released. I tried both,
and neither are quite as good as DeepSeek-Coder-V2-Lite, so my rankings
are unchanged.)
-
Granite-8B-Code [Apache 2.0]
IBM’s line of models is named Granite. In general Granite models are
disappointing, except that they’re unusually good at FIM. It’s tied
in second place with Qwen2.5 7B in my experience.
I also evaluated CodeLlama, CodeGemma, Codestral, and StarCoder. Their FIM
outputs were so poor as to be effectively worthless at that task, and I
found no reason to use these models. The negative effects of instruct
training were most pronounced for CodeLlama.
The user interfaces
I pointed out llama.cpp’s built-in UI, and I’d used similar UIs with other
LLM software. As is typical, no UI is to my liking, especially in matters
of productivity, so I built my own, Illume. This command
line program converts standard input into an API query, makes the query,
and streams the response to standard output. Should be simple enough to
integrate into any extensible text editor, but I only needed it for Vim.
Vimscript is miserable, probably the second worst programming language
I’ve ever touched, so my goal was to write as little as possible.
I created Illume to scratch my own itch, to support my exploration of the
LLM ecosystem. I actively break things and add features as needed, and I
make no promises about interface stability. You probably don’t want to
use it.
Lines that begin with ! are directives interpreted by Illume, chosen
because it’s unlikely to appear in normal text. A conversation alternates
between !user and !assistant in a buffer.
!user
Write a Haiku about time travelers disguised as frogs.
!assistant
Green, leaping through time,
Frog tongues lick the future's rim,
Disguised in pond's guise.
It’s still a text editor buffer, so I can edit the assistant response,
reword my original request, etc. before continuing the conversation. For
composing fiction, I can request it to continue some text (which does not
require instruct training):
!completion
Din the Wizard stalked the dim castle
I can stop it, make changes, add my own writing, and keep going. I ought
to spend more time practicing with it. If you introduce out-of-story note
syntax, the LLM will pick up on it, and then you can use notes to guide
the LLM’s writing.
While the main target is llama.cpp, I query different APIs implemented by
different LLM software, and there are incompatibilities across APIs (a
parameter required by one API is forbidden by another), so directives must
be flexible and powerful: they can set arbitrary HTTP and JSON parameters.
Illume doesn’t try to abstract the API, but exposes it at a low level, so
effective use requires knowing the remote API. For example, the “profile”
for talking to llama.cpp looks like this:
!api http://localhost:8080/v1
!:cache_prompt true
Where cache_prompt is a llama.cpp-specific JSON parameter (set by !:). Prompt
caching is nearly always better enabled, yet for some reason it’s disabled by
default. Other APIs refuse requests with this parameter, so then I must
omit or otherwise disable it. The Hugging Face “profile” looks like this:
!api https://api-inference.huggingface.co/models/{model}/v1
!:model Qwen/Qwen2.5-72B-Instruct
!>x-use-cache false
For the sake of HF, Illume can interpolate JSON parameters into the URL.
The HF API also caches aggressively. I never want this, so I supply an
HTTP parameter (!>) to turn it off.
Unique to llama.cpp is an /infill endpoint for FIM. It requires a model
with extra metadata, trained a certain way, but this is usually not the
case. So while Illume can use /infill, I also added FIM configuration
so, after reading the model’s documentation and configuring Illume for
that model’s FIM behavior, I can do FIM completion through the normal
completion API on any FIM-trained model, even on non-llama.cpp APIs.
Fill-in-the-Middle (FIM) tokens
It’s time to discuss FIM. To get to the bottom of FIM I needed to go to
the source of truth, the original FIM paper: Efficient Training of
Language Models to Fill in the Middle. This allowed me to understand
how these models are FIM-trained, at least enough to put that training to
use. Even so, model documentation tends to be thin on FIM because they
expect you to run their code.
Ultimately an LLM can only predict the next token. So pick some special
tokens that don’t appear in inputs, use them to delimit a prefix, suffix,
and middle (PSM) — or sometimes ordered suffix-prefix-middle (SPM) — in a
large training corpus. Later, in inference, we can use those tokens to
provide a prefix and suffix, and let it “predict” the middle. Crazy, but
this actually works!
<PRE>{prefix}<SUF>{suffix}<MID>
For example, when filling the parentheses of dist = sqrt(x*x + y*y):
<PRE>dist = sqrt(<SUF>)<MID>x*x + y*y
To have the LLM fill in the parentheses, we’d stop at <MID> and let the
LLM predict from there. Note how <SUF> is essentially the cursor. By the
way, this is basically how instruct training works, but instead of prefix
and suffix, special tokens delimit instructions and conversation.
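For instance, Qwen-family models use a ChatML-style chat template, which
looks roughly like this (the exact tokens vary by model family, so treat
this as an illustration rather than a universal format):

<|im_start|>user
Write a haiku about time travelers disguised as frogs.<|im_end|>
<|im_start|>assistant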
Some LLM folks interpret the paper quite literally and use <PRE>, etc.
for their FIM tokens, although these look nothing like their other special
tokens. More thoughtful trainers picked <|fim_prefix|>, etc. Illume
accepts FIM templates, and I wrote templates for the popular models. For
example, here’s Qwen (PSM):
<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>
Mistral AI prefers square brackets, SPM, and no “middle” token:
[SUFFIX]{suffix}[PREFIX]{prefix}
With these templates I could access the FIM training in models unsupported
by llama.cpp’s /infill API.
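To see what that looks like outside of Illume, here’s a sketch of doing it
by hand in Python: assemble a Qwen-style PSM prompt and post it to the
server’s completion endpoint. The endpoint path and field names follow
llama.cpp’s completion API as I understand it, so treat them as assumptions:

import json
import urllib.request

# Qwen-style PSM template; other models use different FIM tokens.
def fim_prompt(prefix: str, suffix: str) -> str:
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

# Assumes llama-server is running a FIM-trained model on the default port.
request = urllib.request.Request(
    "http://localhost:8080/completion",
    data=json.dumps({
        "prompt": fim_prompt("dist = sqrt(", ")"),
        "n_predict": 64,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(json.load(response)["content"])  # ideally: x*x + y*y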
Besides just failing the prompt, the biggest problem I’ve had with FIM is
LLMs not knowing when to stop. For example, if I ask it to fill out this
function (i.e. assign something to r):
def norm(x: float, y: float) -> float:
    return r
(Side note: Static types, including the hints here, produce better results
from LLMs, acting as guardrails.) It’s not unusual to get something like:
def norm(x: float, y: float) -> float:
    r = sqrt(x*x + y*y)
    return r
def norm3(x: float, y: float, z: float) -> float:
    r = sqrt(x*x + y*y + z*z)
    return r
def norm4(x: float, y: float, z: float, w: float) -> float:
    r = sqrt(x*x + y*y + z*z + w*w)
    return r
Where the original return r became the return for norm4. Technically
it fits the prompt, but it’s obviously not what I want. So be ready to
mash the “stop” button when it gets out of control. The three coder models
I recommended exhibit this behavior less often. It might be more robust to
combine it with a non-LLM system that understands the code semantically
and automatically stops generation when the LLM begins generating tokens
in a higher scope. That would make more coder models viable, but this goes
beyond my own fiddling.
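For an indentation-based language like Python, even a crude, purely textual
version of that idea helps. A sketch (the function and its cutoff rule are
my own invention, not part of any tool mentioned here) that truncates
generated FIM output once it dedents out of the original scope:

def truncate_at_dedent(completion: str, indent: int) -> str:
    # Keep generated lines only while they stay at or inside the scope
    # being filled in; `indent` is the indentation, in spaces, of the
    # insertion point. The first non-blank line that dedents below it
    # means the LLM has wandered into a higher scope, so stop there.
    kept = []
    for line in completion.splitlines():
        if line.strip() and len(line) - len(line.lstrip(" ")) < indent:
            break
        kept.append(line)
    return "\n".join(kept)

Fed the generated middle from the example above with indent=4, it would
stop right where the spurious norm3 definition begins.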
Figuring out FIM and putting it into action revealed to me that FIM is
still in its early stages, and hardly anyone is generating code via FIM. I
guess everyone’s just using plain old completion?
So what are LLMs good for?
LLMs are fun, but what productive uses do they have? That’s a question
I’ve been trying to answer this past month, and it’s come up shorter than
I hoped. It might be useful to establish boundaries — tasks that LLMs
definitely cannot do.
First, LLMs are no good if correctness cannot be readily verified.
They are untrustworthy hallucinators. Often, if you’re in a position to
verify LLM output, you didn’t need it in the first place. This is why
Mixtral, with its large “database” of knowledge, isn’t so useful. It also
means it’s reckless and irresponsible to inject LLM output into search
results — just shameful.
LLM enthusiasts, who ought to know better, fall into this trap anyway and
propagate hallucinations. It makes discourse around LLMs less trustworthy
than normal, and I need to approach LLM information with extra skepticism.
Case in point: Recall how “GGUF” doesn’t have an authoritative definition.
Search for one and you’ll find an obvious hallucination that made it all
the way into official IBM documentation. I won’t repeat it hear as to not
make things worse.
Second, LLMs have goldfish-sized working memory. That is, they’re held
back by small context lengths. Some models are trained on larger contexts,
but their effective context length is usually much smaller. In
practice, an LLM can hold several book chapters worth of comprehension “in
its head” at a time. For code it’s 2k or 3k lines (code is token-dense).
That’s the most you can work with at once. Compared to a human, it’s tiny.
There are tools like retrieval-augmented generation and fine-tuning
to mitigate it… slightly.
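To make that mitigation concrete: retrieval-augmented generation just means
searching your documents for the few chunks most relevant to the question
and pasting only those into the limited context. A toy sketch of the idea
(real systems score chunks with embedding models rather than word overlap):

def relevance(chunk: str, question: str) -> int:
    # Toy relevance score: count words shared with the question.
    return len(set(chunk.lower().split()) & set(question.lower().split()))

def build_prompt(chunks: list[str], question: str, budget: int = 4) -> str:
    # Keep only the most relevant chunks so the prompt fits the context.
    best = sorted(chunks, key=lambda c: relevance(c, question), reverse=True)
    context = "\n\n".join(best[:budget])
    return f"{context}\n\nUsing only the text above, answer: {question}"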
Third, LLMs are poor programmers. At best they write code at maybe the level
of an undergraduate student who’s read a lot of documentation. That sounds
better than it is. The typical fresh graduate enters the workforce knowing
practically nothing about software engineering. Day one on the job is the
first day of their real education. In that sense, LLMs today
haven’t even begun their education.
To be fair, that LLMs work as well as they do is amazing! Thrown into the
middle of a program in my unconventional style, LLMs figure it out
and make use of the custom interfaces. (Caveat: My code and writing are in
the training data of most of these LLMs.) So the more context, the better,
within the effective context length. The challenge is getting something
useful out of an LLM in less time than writing it myself.
Writing new code is the easy part. The hard part is maintaining code,
and writing new code with that maintenance in mind. Even when an LLM
produces code that works, there’s no thought to maintenance, nor could
there be. In general, the reliability of generated code follows an inverse
square law with length, and generating more than a dozen lines at a time is
fraught. I really tried, but never saw LLM output beyond 2–3 lines of code
which I would consider acceptable.
Quality varies substantially by language. LLMs are better at Python than
C, and better at C than assembly. I suspect it’s related to the difficulty
of the language and the quality of the input. It’s trained on lots of
terrible C — the internet is loaded with it after all — and probably the
only labeled x86 assembly it’s seen is crummy beginner tutorials. Ask it
to use SDL2 and it reliably produces the common mistakes because
it’s been trained to do so.
What about boilerplate? That’s something an LLM could probably do with a
low error rate, and perhaps there’s merit to it. Though the fastest way to
deal with boilerplate is to not write it at all. Change your problem to
not require boilerplate.
Don’t take my word for it: consider how it shows up in the economics.
If AI companies could deliver the productivity gains they claim, they
wouldn’t sell AI. They’d keep it to themselves and gobble up the software
industry. Or consider the software products produced by companies on the
bleeding edge of AI. It’s still the same old, bloated web garbage everyone
else is building. (My LLM research has involved navigating their awful web
sites, and it’s made me bitter.)
In code generation, hallucinations are less concerning. You already knew
what you wanted when you asked, so you can review it, and your compiler
will help catch problems you miss (e.g. calling a hallucinated method).
However, small context and poor code generation remain roadblocks, and I
haven’t yet made this work effectively.
So then, what can I do with LLMs? A list is apt because LLMs love lists:
-
Proofreading has been most useful for me. I give it a document such as
an email or this article (~8,000 tokens), tell it to look over grammar,
call out passive voice, and so on, and suggest changes. I accept or
reject its suggestions and move on. Most suggestions will be poor, and
this very article was long enough that even ~70B models suggested
changes to hallucinated sentences. Regardless, there’s signal in the
noise, and it fits within the limitations outlined above. I’m still
trying to apply this technique (“find bugs, please”) to code review, but
so far success is elusive.
-
Writing short fiction. Hallucinations are not a problem; they’re a
feature! Context lengths are the limiting factor, though perhaps you can
stretch it by supplying chapter summaries, also written by the LLM. I’m
still exploring this. If you’re feeling lazy, tell it to offer you three
possible story branches at each turn, and you pick the most interesting.
Or even tell it to combine two of them! LLMs are clever and will figure
it out. Some genres work better than others, and concrete works better
than abstract. (I wonder if professional writers judge its writing as
poor as I judge its programming.)
-
Generative fun. Have an argument with Benjamin Franklin (note: this
probably violates the Acceptable Use Policy of some models), hang
out with a character from your favorite book, or generate a new scene of
Falstaff’s blustering antics. Talking to historical figures
has been educational: The character says something unexpected, I look it
up the old-fashioned way to see what it’s about, then learn something
new.
-
Language translation. I’ve been browsing foreign language subreddits
through Gemma-2-2B translation, and it’s been insightful. (I had no idea
German speakers were so distrustful of artificial sweeteners.)
Despite the short list of useful applications, this is the most excited
I’ve been about a new technology in years!