About the project

What inspired us

Weber State students ask the same questions every semester. "How do I register?" "Where is Tracy Hall?" "What's the prereq for CS 2420?" The answers exist on weber.edu, but finding them takes 10 minutes. Existing AI chatbots either don't know WSU exists or return hallucinated information. We wanted something that actually knows Weber State: not a wrapper around someone else's model, but a language model trained on WSU data from the ground up.

How we built it

We trained a 628 million parameter decoder-only transformer called WildcatLM from scratch. No pretrained weights. Every parameter came from gradient descent on our own data.

Architecture:

  • 24 transformer layers, 1536-dimensional hidden state
  • Grouped Query Attention: 24 query heads, 8 KV heads (cuts KV-cache memory by 3x)
  • SwiGLU activation, RMSNorm, Rotary Position Embeddings (RoPE)
  • Weight-tied embeddings, 16,384-token BPE vocabulary, 512-token context window
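To make those numbers concrete, here is a hedged sketch of that configuration expressed as code. `WildcatLMConfig` and the derived head dimension (1536 / 24 = 64) are our illustration, not the actual training config.

```python
from dataclasses import dataclass

@dataclass
class WildcatLMConfig:
    # Values taken from the list above; head_dim is derived, not quoted.
    n_layers: int = 24
    d_model: int = 1536
    n_heads: int = 24          # query heads
    n_kv_heads: int = 8        # GQA: each KV head serves 24 / 8 = 3 query heads
    vocab_size: int = 16_384
    max_seq_len: int = 512
    tie_embeddings: bool = True

    @property
    def head_dim(self) -> int:
        return self.d_model // self.n_heads  # 64

    @property
    def kv_cache_reduction(self) -> float:
        # The KV cache scales with the number of KV heads, hence the ~3x saving.
        return self.n_heads / self.n_kv_heads
```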

Training ran in 7 phases:

Phase | Data                                         | Tokens | Val Loss
1     | Text8 / Wikipedia                            | 1.1B   | -
2     | WikiText-103                                 | 252M   | 2.1
3     | Scraped weber.edu                            | 86M    | 1.4
3.5   | 86K WSU Q&A pairs (instruction tuning)       | -      | 0.48
5     | 49K scraped WSU Q&A (real student questions) | -      | 0.35
6     | 62K deduplicated Q&A, 3 epochs               | 22.5M  | 0.2808
7     | GRPO reward training                         | -      | degraded, reverted

Phases 1 and 2 ran on Google Colab with an A100, with data stored on Google Drive. Phases 4 to 6 ran on the Weber State DSRC A100 VM.

The full stack, in request order:

  • User (Next.js / Vercel)
  • Quality gate (KB lookup, domain filter, repetition check)
  • Cloudflare Tunnel
  • WildcatAI orchestrator (Rust / Axum, port 3000)
  • TF-IDF RAG (scikit-learn, 62K pairs, port 8090)
  • Titanium inference server (Rust / HuggingFace candle, bfloat16, port 8000)
  • WildcatLM 628M on an A100 40GB

The RAG system retrieves the closest WSU Q&A pair using bigram TF-IDF with 50,000 features and cosine similarity. The threshold is 0.15. That pair gets injected as a one-shot example in the same Q:/A: format the model was trained on. Mismatching the prompt format caused most of our hallucinations.
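A minimal sketch of that retrieval and prompt-injection step, assuming scikit-learn and an in-memory list of pairs. The function names and the tiny `qa_pairs` list are illustrative, `ngram_range=(1, 2)` is our reading of "bigram TF-IDF", and the real service runs as its own process behind port 8090.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# qa_pairs: (question, answer) tuples scraped from weber.edu (62K in production)
qa_pairs = [("How do I register for classes?", "Use the eWeber portal ...")]

# Unigram+bigram TF-IDF, capped at 50,000 features, as described above.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=50_000)
question_matrix = vectorizer.fit_transform([q for q, _ in qa_pairs])

def retrieve(user_question: str, threshold: float = 0.15):
    """Return the closest Q&A pair, or None if nothing clears the threshold."""
    sims = cosine_similarity(vectorizer.transform([user_question]), question_matrix)[0]
    best = sims.argmax()
    return qa_pairs[best] if sims[best] >= threshold else None

def build_prompt(user_question: str) -> str:
    """Inject the retrieved pair as a one-shot example in the Q:/A: training format."""
    example = retrieve(user_question)
    prefix = f"Q: {example[0]}\nA: {example[1]}\n\n" if example else ""
    return f"{prefix}Q: {user_question}\nA:"
```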

Phase 7 (GRPO): We implemented Group Relative Policy Optimization, the same technique used by DeepSeek-R1. We generated 4 completions per question and trained toward the ones that mentioned real WSU facts. Best training reward was +1.238. We reverted to Phase 6 because 200 steps was not enough to generalize.
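For readers unfamiliar with GRPO, here is a minimal sketch of the group-relative part: the rewards for the 4 completions of each question are normalized against their own group mean and standard deviation, and that advantage weights the policy gradient. This is an illustration under those assumptions, not our Phase 7 code, and the reward values below are made up.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: (num_questions, group_size) reward per sampled completion.
    GRPO uses the group mean/std as the baseline instead of a learned critic."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 4 completions per question, rewarded for mentioning verifiable WSU facts.
rewards = torch.tensor([[1.0, 0.0, 1.0, -1.0],
                        [0.0, 0.0, 1.0,  0.0]])
advantages = group_relative_advantages(rewards)

# Each completion's token log-probs are then scaled by its advantage
# (plus a KL penalty against the frozen Phase 6 model, omitted here):
# loss = -(advantages.detach().unsqueeze(-1) * token_logprobs).mean()
```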

Challenges

This was attempt five. We tried four different approaches before the hackathon: a pure Rust transformer, a custom autograd engine called Ferrum, and two more variations. All of them produced incoherent output. We restarted twice during the hackathon itself, once for a tokenizer mismatch and once for training divergence.

The prompt format bug caused the worst hallucinations. The model was trained on the Q: ...\nA: format, but the orchestrator was sending User: ...\nAssistant:, a format the model had never seen. This caused responses like "John F. Moench became president... John F. Moench became president..." repeating for 400 tokens. The fix is compiled and ready, but SSH to the VM is currently blocked by the university firewall.

SSH died two days before submission. Port 22 refused connections; the VM was still running but unreachable. We got around it with a Cloudflare Tunnel: cloudflared runs on the VM and exposes the Rust server through Cloudflare's network.

Weight tying caused a crash when saving GRPO checkpoints. The embedding matrix and output projection share the same underlying memory, and safetensors refuses to serialize tensors that share storage. We fixed it by cloning every parameter before saving.
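A sketch of that fix, assuming PyTorch and safetensors' `save_file`; cloning breaks the storage sharing so each tensor serializes on its own, at the cost of writing the tied matrix twice.

```python
from safetensors.torch import save_file

def save_checkpoint(model, path: str) -> None:
    # With weight-tied embeddings, the input embedding and the output projection
    # point at the same storage, which safetensors refuses to serialize.
    # Cloning every tensor gives each entry its own memory.
    state = {name: tensor.detach().clone().contiguous()
             for name, tensor in model.state_dict().items()}
    save_file(state, path)
```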

What we learned

Training 628M parameters is the easy part. Getting the full system to behave correctly (right prompt format, no repetition loops, stable serving, domain filtering) took more work than the training itself. We built a quality layer in the frontend to handle what the model cannot yet do reliably. That is what production ML at this scale actually looks like.
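As an illustration of the kind of check that quality layer performs (the real gate lives in the Next.js frontend), here is a small repetition-loop detector in Python; the n-gram size and repeat threshold are placeholders, not our production values.

```python
from collections import Counter

def looks_like_repetition_loop(text: str, ngram: int = 6, max_repeats: int = 3) -> bool:
    """Flag output where the same n-gram keeps recurring, as in the
    'John F. Moench became president...' loop described above."""
    words = text.lower().split()
    if len(words) < ngram:
        return False
    counts = Counter(tuple(words[i:i + ngram]) for i in range(len(words) - ngram + 1))
    return max(counts.values()) >= max_repeats
```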
