Mini-GPT: Building a Transformer Language Model From Scratch
A minimalist deep-learning project implementing a decoder-only Transformer language model based on the GPT architecture. The model was built entirely from scratch in PyTorch, trained on the Tiny Shakespeare dataset on a GPU runtime, and set up to run lightweight text generation entirely on a free CPU runtime.
🧠 What I Learned Doing This Project
Building a custom LLM from scratch provided deep, hands-on engineering insights into how modern foundation models operate under the hood:
1. Tokenization and Data Pipelines
- Learned how text becomes model input by designing a custom character-level tokenizer that maps each character in the vocabulary to an integer ID and encodes strings as numerical tensors.
- Implemented batch data loaders (`get_batch`) to slice raw text into parallel 2D contexts for high-throughput GPU training.
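The two steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the short `text` string stands in for the Tiny Shakespeare corpus, and the helper names `encode`, `decode`, `stoi`, and `itos` as well as the exact signature of `get_batch` are assumptions.

```python
import torch

# Stand-in corpus (assumption); the project trains on Tiny Shakespeare.
text = "To be, or not to be, that is the question."

# Character-level vocabulary: one integer ID per distinct character.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}


def encode(s):
    """Map a string to a list of integer IDs."""
    return [stoi[c] for c in s]


def decode(ids):
    """Map a list of integer IDs back to a string."""
    return "".join(itos[i] for i in ids)


data = torch.tensor(encode(text), dtype=torch.long)


def get_batch(data, batch_size, block_size):
    """Sample random windows; targets are the inputs shifted by one position."""
    starts = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in starts])
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in starts])
    return x, y


xb, yb = get_batch(data, batch_size=4, block_size=8)
print(xb.shape, yb.shape)  # both are (batch_size, block_size)
```

Each row of `xb` is one context window, and the matching row of `yb` holds the next character for every position, so a single batch supplies `batch_size * block_size` training examples.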
2. Attention Mechanics & Architecture
- Built Causal Self-Attention Layers from scratch, mastering the matrix math ($Q \times K^T$) used to calculate character affinities.
- Learned the crucial role of autoregressive masking: a lower-triangular matrix (`tril`) enforces causal visibility, preventing the model from "looking into the future" during training.
- Stacked Multi-Head Attention blocks alongside Feed-Forward Multi-Layer Perceptrons (MLPs) to map both syntax rules and historical text context.
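A single causal attention head along these lines might look like the sketch below. The class name `CausalHead`, the bias-free linear projections, and the $1/\sqrt{d_k}$ scaling are assumptions borrowed from common GPT implementations; the project's own layer may differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalHead(nn.Module):
    """One self-attention head with a lower-triangular (causal) mask."""

    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Buffer, not a parameter: it is a fixed mask and is never trained.
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)
        # Affinities Q K^T, scaled by 1/sqrt(head_size).
        scores = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        # Positions above the diagonal are the future: mask them out.
        scores = scores.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        weights = F.softmax(scores, dim=-1)
        return weights @ v, weights


torch.manual_seed(0)
head = CausalHead(n_embd=32, head_size=8, block_size=16)
x = torch.randn(2, 10, 32)
out, weights = head(x)
print(out.shape)  # (batch, time, head_size)
```

Because masked scores are set to negative infinity before the softmax, every row of `weights` still sums to 1 but places zero weight on later positions, which is what keeps training consistent with left-to-right generation.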
3. Optimization and Training Dynamics
- Configured the AdamW optimizer with a learning rate of `3e-4` to keep training of the deep network stable.
- Analyzed loss convergence, reducing cross-entropy loss from an initial 4.37 (essentially random guessing) to 1.22 over 3,000 steps.
- Monitored the gap between training and validation loss to ensure the network was genuinely generalizing rather than just memorizing data (overfitting).
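To put the starting loss in context, the sketch below computes the cross-entropy of a uniform guess and performs one AdamW update at the learning rate listed above. The vocabulary size of 65 is an assumption (it is the value commonly quoted for character-level Tiny Shakespeare), and the `nn.Linear` layer is only a hypothetical stand-in for the Transformer.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 65  # assumption: number of distinct characters in the corpus

# Baseline: all-zero logits give every character probability 1/vocab_size,
# so the cross-entropy equals ln(vocab_size).
targets = torch.randint(vocab_size, (8,))
uniform_loss = F.cross_entropy(torch.zeros(8, vocab_size), targets).item()

# One AdamW update at lr=3e-4 on a stand-in model.
model = torch.nn.Linear(16, vocab_size)  # hypothetical stand-in for the Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
inputs = torch.randn(8, 16)
loss = F.cross_entropy(model(inputs), targets)
optimizer.zero_grad(set_to_none=True)
loss.backward()
optimizer.step()

print(f"uniform baseline: {uniform_loss:.3f}, ln(V): {math.log(vocab_size):.3f}")
```

Under that assumption the uniform baseline is about 4.17. A freshly initialized network usually starts a little above the uniform value because its random logits are not perfectly flat, which would be consistent with the reported 4.37.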
4. Weights Serialization & CPU Inference Setup
- Mastered state serialization by exporting trained parameters into a reusable `.pth` checkpoint file.
- Designed a deployment flow that rebuilds the model architecture on a standard CPU runtime and loads the saved weights through an in-memory byte buffer (`io.BytesIO`), so text generation runs for free without retraining.
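The save-and-restore round trip can be sketched as below. This is an assumption-laden illustration: a small `nn.Linear` stands in for the trained model, and the bytes are produced in-process rather than coming from an uploaded file.

```python
import io

import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # hypothetical stand-in for the trained Mini-GPT

# Serialize only the parameters (the state dict), here into memory.
# Writing to a path such as "mini_gpt_shakespeare.pth" works the same way.
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
raw_bytes = buffer.getvalue()  # what an upload widget would hand back

# On the CPU side: rebuild the same architecture, then map the weights onto it.
restored = nn.Linear(4, 2)
state = torch.load(io.BytesIO(raw_bytes), map_location="cpu", weights_only=True)
restored.load_state_dict(state)
restored.eval()  # inference mode: disables dropout and similar layers
```

`map_location="cpu"` remaps tensors that were saved from a GPU, and `weights_only=True` restricts unpickling to tensors and plain containers, which is the safer choice for a file that arrives through an upload.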
⚙️ Model Configurations
| Hyperparameter | Value | Description |
|---|---|---|
| `batch_size` | `64` | Independent sequences processed in parallel |
| `block_size` | `256` | Maximum token context window for predictions |
| `n_embd` | `384` | Embedding and hidden layer dimension size |
| `n_head` | `6` | Parallel self-attention heads per layer |
| `n_layer` | `6` | Total Transformer blocks stacked sequentially |
| `max_iters` | `3000` | Total optimization steps |
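A few derived quantities follow directly from these settings. The sketch below collects them in a hypothetical `GPTConfig` dataclass (not a name from the project). The parameter estimate uses the common $12 \cdot n_{layer} \cdot n_{embd}^2$ rule of thumb, which assumes a 4x MLP expansion and ignores embeddings, biases, and layer norms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GPTConfig:  # hypothetical container for the values in the table
    batch_size: int = 64
    block_size: int = 256
    n_embd: int = 384
    n_head: int = 6
    n_layer: int = 6
    max_iters: int = 3000


cfg = GPTConfig()

# Each head works on an equal slice of the embedding dimension.
head_size = cfg.n_embd // cfg.n_head

# Tokens consumed per optimization step and over the whole run.
tokens_per_step = cfg.batch_size * cfg.block_size
tokens_total = tokens_per_step * cfg.max_iters

# Rough size of the Transformer blocks (assumes a 4x MLP expansion).
approx_block_params = 12 * cfg.n_layer * cfg.n_embd**2

print(head_size, tokens_per_step, tokens_total, approx_block_params)
# -> 64 16384 49152000 10616832
```

So each of the 6 heads attends in a 64-dimensional subspace, and the block stack comes to roughly 10.6 million parameters under the stated assumptions.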
🎮 Deployment and Running
Training and inference are decoupled: the weights are trained once on a GPU runtime and later uploaded interactively into a CPU session.
In a clean CPU session, running the application cell opens a "Choose Files" upload prompt directly in the cell output. Select the saved mini_gpt_shakespeare.pth weights file, and the model loads the parameters into the freshly built architecture and begins text inference:
--- Mini-GPT Inference Portal ---
Please upload your 'mini_gpt_shakespeare.pth' weight file below:
mini_gpt_shakespeare.pth - 100% done
Initializing network architecture blueprint...
Mapping weights directly from memory buffer...
--- Generating text on CPU ---
Yet thyself are proclaimd their with arms fair.
KING RICHARD II:
France his ranspring of Warwick, what takes Warwickman?
RICHARD: Court! as as we know the same to slaughter?
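Output like the sample above comes from sampling one character at a time. The loop below is a sketch under stated assumptions: `TinyLM` is a hypothetical stand-in that returns logits of shape `(batch, time, vocab)`, the vocabulary size of 65 is assumed, and the project's own `generate` method may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyLM(nn.Module):
    """Hypothetical stand-in: maps (B, T) token IDs to (B, T, vocab) logits."""

    def __init__(self, vocab_size):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx):
        return self.table(idx)


@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size):
    """Autoregressively extend idx by sampling from the predicted distribution."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]     # never exceed the context window
        logits = model(idx_cond)[:, -1, :]  # prediction for the next position
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return idx


model = TinyLM(vocab_size=65).eval()
start = torch.zeros((1, 1), dtype=torch.long)  # begin from token ID 0
out = generate(model, start, max_new_tokens=20, block_size=256)
print(out.shape)  # (1, 21): the start token plus 20 sampled tokens
```

Sampling with `torch.multinomial` rather than always taking the most likely character is what gives the generated text its variety; decoding `out` with the tokenizer's ID-to-character mapping turns it back into text.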
Built With
- python
- torch
- transformers