Ghost in the Machine
Inspiration
It's Valentine's weekend and I'm not able to spend it with my girlfriend. Therefore, I decided to make her a funny little gift: a clone of myself (based on our 100k+ messages). I thought some kind of LLM would be a good place to start, and because she doesn't have much coding experience or access to hardware, I decided to make this LLM run on the cheapest computer around: the Raspberry Pi Zero (512MB RAM, $15). I don't think you can get much smaller than this.
On a broader scale, this project explores how small we can make language models while still having them learn domain-specific knowledge, and what the bare-minimum hardware for serving them is. I believe these are important questions because they point to how we could distribute LLMs at a fraction of their cost today. This is a key step in democratizing AI technology.
What it does
I put myself in a Raspberry Pi!
You plug a Raspberry Pi containing an LLM into your computer and converse with it, not like a chatbot, but like an actual human (a rough sketch of this behavior loop follows the list):
- It will talk to you if you leave it alone for too long.
- It won't automatically respond to every message you send.
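
To make that concrete, here is a minimal sketch (not the real firmware) of how that behavior can be driven by one loop: reply to incoming messages only some of the time, and start a conversation on your own after a long silence. The helper functions (millis, message_pending, read_message, generate_reply, send_message), the timeout, and the reply probability are all assumptions for illustration.

```c
// Hypothetical sketch of the "human-like" behavior loop -- NOT the real firmware.
// The extern helpers stand in for the UART and model plumbing described later.
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define IDLE_TIMEOUT_MS (5 * 60 * 1000)  /* speak up after ~5 minutes of silence (made-up value) */

extern uint32_t millis(void);                       /* monotonic millisecond counter            */
extern bool     message_pending(void);              /* has the user sent something?             */
extern void     read_message(char *buf, int n);     /* pull the pending message off the UART    */
extern void     generate_reply(const char *prompt, char *out, int n);  /* run the model         */
extern void     send_message(const char *text);     /* push text back to the host               */

void behavior_loop(void) {
    uint32_t last_activity = millis();
    char in[512], out[512];

    for (;;) {
        if (message_pending()) {
            read_message(in, sizeof in);
            last_activity = millis();
            /* Don't reply to every message: flip a biased coin, like a real texter. */
            if (rand() % 100 < 70) {
                generate_reply(in, out, sizeof out);
                send_message(out);
            }
        } else if (millis() - last_activity > IDLE_TIMEOUT_MS) {
            /* Left alone too long: start a conversation unprompted. */
            generate_reply("", out, sizeof out);
            send_message(out);
            last_activity = millis();
        }
    }
}
```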
This is achieved by:
- Training a 15M parameter LLM from scratch on my text messages to mimic my conversations with my girlfriend (a back-of-the-envelope size sketch follows this list).
- Writing the code to run inference on it (in C) on a bare-metal (no operating system) Raspberry Pi Zero.
- Making a minimalistic messaging webapp (mostly just for demonstration purposes).
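
For a sense of what "15M parameters" means concretely, here is a back-of-the-envelope size estimate using a config struct in the style of llama2.c's run.c. The specific dimensions below are my own illustrative guesses chosen to land near 15M with a 2048-token vocabulary (see Challenges below); they are not the hyperparameters actually used, and small norm parameters are ignored.

```c
// Back-of-the-envelope size estimate for a llama2.c-style model.
// The plugged-in numbers are illustrative guesses, not the project's actual config.
#include <stdio.h>

typedef struct {
    int dim;        // transformer dimension
    int hidden_dim; // FFN hidden dimension
    int n_layers;   // number of transformer layers
    int n_heads;    // number of attention heads
    int vocab_size; // tokenizer vocabulary size
    int seq_len;    // maximum sequence length
} Config;

long long count_params(Config c) {
    long long embed = (long long)c.vocab_size * c.dim;   // token embeddings (tied with output head)
    long long attn  = 4LL * c.dim * c.dim;                // wq, wk, wv, wo per layer
    long long ffn   = 3LL * c.dim * c.hidden_dim;         // w1, w2, w3 (SwiGLU) per layer
    return embed + (long long)c.n_layers * (attn + ffn);  // ignores tiny norm parameters
}

int main(void) {
    Config c = { .dim = 384, .hidden_dim = 1024, .n_layers = 8,
                 .n_heads = 6, .vocab_size = 2048, .seq_len = 512 };
    printf("~%.1fM parameters\n", count_params(c) / 1e6);  // prints roughly 15M
    return 0;
}
```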
How we built it
1. Training the Model
- Data Preparation: Cleaning and formatting 1.5 years of Facebook Messenger messages with my girlfriend.
- Synthetic Data: Generating synthetic conversation data using Gemini-2.0-Flash to make more training examples (~20k messages total).
- Pipeline: Using the training pipeline from the llama2.c repository (see Acknowledgements) to train a custom tokenizer and the model.
- Optimization: Quantizing the model to fp8 for inference speed (a minimal quantization sketch follows this list).
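
I won't reproduce the exact fp8 scheme here, but as a rough illustration of the kind of 8-bit weight compression involved (llama2.c itself ships a group-wise int8 variant in runq.c), here is a minimal symmetric 8-bit quantize/dequantize sketch. The group size and layout are assumptions, not the project's actual code.

```c
// Minimal symmetric 8-bit quantization sketch -- illustrative only, not the exact
// fp8 scheme used. Each group of weights stores int8 values plus one float scale,
// so weights shrink roughly 4x versus fp32.
#include <math.h>
#include <stdint.h>

#define GROUP_SIZE 64

typedef struct {
    int8_t q[GROUP_SIZE]; // quantized values in [-127, 127]
    float  scale;         // per-group dequantization scale
} QGroup;

void quantize_group(const float *w, QGroup *g) {
    float max_abs = 0.0f;
    for (int i = 0; i < GROUP_SIZE; i++)
        if (fabsf(w[i]) > max_abs) max_abs = fabsf(w[i]);
    g->scale = max_abs / 127.0f;
    for (int i = 0; i < GROUP_SIZE; i++)
        g->q[i] = (int8_t)roundf(w[i] / (g->scale + 1e-12f));
}

float dequantize(const QGroup *g, int i) {
    return g->q[i] * g->scale;   // used inside the matmul inner loop at inference time
}
```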
2. Bare-Metal Inference
- Bootloader: Writing bootloader code to quickly load the model into the Pi's RAM.
- Inference: Adapting inference code from llama2.c to be able to do forward passes on the Pi.
- Communication: Creating a UART communication protocol so the model's outputs actually feel like messaging someone (a hypothetical framing sketch follows this list).
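
The protocol details aren't spelled out in this write-up, so the following is only a hypothetical sketch of what a framed UART exchange could look like: a type byte, a length byte, then the payload, with the register-level UART access hidden behind assumed uart_putc/uart_getc helpers.

```c
// Hypothetical framing for the Pi <-> host UART link -- a sketch, not the real protocol.
// Each frame is: 1 type byte, 1 length byte, then `length` payload bytes.
// uart_putc/uart_getc are assumed bare-metal helpers that poll the Pi's UART.
#include <stddef.h>
#include <stdint.h>

enum {
    MSG_USER_TEXT  = 0x01,  /* host -> Pi: a message typed by the user */
    MSG_MODEL_TEXT = 0x02,  /* Pi -> host: text generated by the model */
    MSG_ACK        = 0x03,  /* either direction: frame received OK     */
};

extern void    uart_putc(uint8_t c);
extern uint8_t uart_getc(void);

void send_frame(uint8_t type, const uint8_t *payload, uint8_t len) {
    uart_putc(type);
    uart_putc(len);
    for (uint8_t i = 0; i < len; i++) uart_putc(payload[i]);
}

uint8_t recv_frame(uint8_t *type, uint8_t *payload, uint8_t max_len) {
    *type = uart_getc();
    uint8_t len = uart_getc();
    for (uint8_t i = 0; i < len; i++) {
        uint8_t c = uart_getc();
        if (i < max_len) payload[i] = c;  /* drop bytes that don't fit the buffer */
    }
    send_frame(MSG_ACK, NULL, 0);         /* acknowledge so the other side can proceed */
    return len;
}
```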
Challenges we ran into
I can bucket the main challenges of this project into roughly three areas:
- Model Goodness: Getting the model to produce coherent English and proper turn-based dialogue.
- Speed of Inference: Getting the model to run fast enough on the Pi.
- Connecting everything together: Handling hardware/software integration.
Here are some things I found interesting for each:
- Skill Expression in Data: I experimented extensively with training 15M and 42M parameter models across many hyperparameter sweeps and felt there was not much skill expression there. HOWEVER, I found a lot of skill expression in how the dataset was made. I cleaned 1.5 years of Messenger data (80k messages); when I initially just sampled from the true messages, the model was very incoherent, and it stayed that way until I refined the cleaning process.
- Inference Constraints: This was probably the biggest hurdle. I had to throw out the idea of models larger than 15M parameters because it didn't make sense to run them on the Pi. I partially solved this by quantizing my models (to fp8) and training a custom tokenizer (vocab size 2048). One forward pass still takes ~2.2s D:
- Race Conditions: Many race conditions appeared around user messages and how the Pi is supposed to respond, especially when messages arrive while the Pi is in the middle of generating tokens. I solved this by developing a handshaking protocol that standardizes communication between the Pi and my computer (a sketch of the idea follows below).
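
As a sketch of the idea (hypothetical helpers, not the actual firmware): the Pi polls for pending input between token steps, aborts cleanly if a new user message arrives mid-generation, and only treats a reply as delivered once the host acknowledges it.

```c
// Hypothetical sketch of avoiding the "user texts while the Pi is mid-generation"
// race: poll for input between token steps and abort generation cleanly if a new
// user message arrives. All helper functions are assumed, not the real firmware.
#include <stdbool.h>

extern bool message_pending(void);       /* new user frame waiting on the UART?  */
extern int  next_token(int prev_token);  /* one ~2.2s forward pass on the Pi     */
extern void emit_token(int token);       /* stream the token up to the host      */
extern bool wait_for_ack(void);          /* host confirms it received the reply  */

/* Returns true if a full reply was generated, false if it was interrupted. */
bool generate_reply_interruptible(int start_token, int max_tokens) {
    int tok = start_token;
    for (int i = 0; i < max_tokens; i++) {
        if (message_pending())       /* a new message arrived mid-generation:     */
            return false;            /* stop now and let the main loop handle it. */
        tok = next_token(tok);
        emit_token(tok);
    }
    return wait_for_ack();           /* handshake: reply only "counts" once acked. */
}
```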
Accomplishments that we're proud of
Learning about LLMs and pulling everything together.
What we learned
This was my first time training an LLM. It was amazing to learn how developed the model training ecosystem really is, between having to write very little actual training code because PyTorch wraps everything so well, and having unbelievable insight into my training process with just a few lines of code using WandB. These tools really amazed me.
Also, it was very cool to witness firsthand how my data turned randomly initialized weights into something with structure. On one hand, I understand the "black box" sentiment more now, because I literally can't comprehend that the model learned my data so well even though I intentionally chose the simplest training pipeline possible. On the other hand, it was cool to learn that I still have some control. For instance, it was really cool to see my validation loss curve invert itself when I added some dropout.
I also learned a lot about how the Transformer actually works. One thing that turned out to be extremely significant for my project is how the KV cache is computed and saved. When turning an LLM into a chat application, you can save a lot of inference time by checkpointing your KV cache instead of recomputing it needlessly (a minimal sketch is below). There is probably a lot more hiding here in terms of unlocking faster inference.
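
Here is a minimal sketch of that checkpointing idea, assuming a llama2.c-style forward(token, pos) that writes the keys/values for position pos into a persistent cache: keep the cache across turns and only run forward passes over the tokens that are new since the last turn, so the shared conversation prefix is never re-encoded. This is illustrative, not my exact code.

```c
// Sketch of KV-cache checkpointing for a chat loop, assuming a llama2.c-style
// forward(token, pos) whose KV cache persists between calls. Only tokens that are
// new since the last turn get a forward pass. Illustrative, not the project's code.

extern float *forward(int token, int pos);  /* one transformer step; fills KV cache at pos */

static int cached_tokens[1024];  /* tokens already encoded into the KV cache        */
static int cached_pos = 0;       /* how many positions the cache currently covers   */

/* Feed the current conversation history and return logits for its last token.
 * Assumes each turn appends at least one new token to the history.            */
float *prefill(const int *history, int n) {
    /* Find how much of the history is already in the cache. */
    int start = 0;
    while (start < cached_pos && start < n && cached_tokens[start] == history[start])
        start++;

    /* If the history diverged from the cache, fall back to the divergence point. */
    cached_pos = start;

    float *logits = 0;
    for (int pos = start; pos < n; pos++) {   /* forward passes only for new tokens */
        logits = forward(history[pos], pos);
        cached_tokens[pos] = history[pos];
        cached_pos = pos + 1;
    }
    return logits;  /* at ~2.2s per pass, skipping the prefix saves minutes per turn */
}
```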
There are actually so many cool things I learned this weekend. Please come ask me about this if you want to hear more.
What's next for Ghost in the Machine
Model Training
I was unable to download my Facebook Messenger data from the past year in time to train on it. That is a significant chunk of real conversational data, so I'm eager to retrain once the export finally arrives. I also wonder how things would change if I trained on a more general conversation dataset first and then fine-tuned with only certain layers unfrozen. I'm also curious what the pitfalls of my synthetic data generation were and how far I can scale it.
Speedier Inference
There's definitely a lot to unlock here. I did not rewrite much of Karpathy's original inference code to fit my use case, and I'm aware that llama.cpp has a lot of speedup tricks I want to look at. On the Pi side, the Pi Zero does have a GPU (maybe not in the traditional sense, but still), and it may be faster at the matmuls. I feel like I still need a better understanding of how the Transformer actually works, so it will be good to step through this after TreeHacks.
Acknowledgements
- llama2.c repo - training code; this project would not have been possible without it.
- Eric Chen - guidance during model training process.
- Iris Nguyen - Idea + data :)