Mini-GPT: Building a Transformer Language Model From Scratch

A minimalist deep-learning project implementing a decoder-only Transformer language model based on the GPT architecture. The model was built from scratch in PyTorch, trained on the Tiny Shakespeare dataset on a GPU runtime, and set up to run lightweight text generation entirely on a free CPU runtime.


🧠 What I Learned Doing This Project

Building a custom LLM from scratch provided deep, hands-on engineering insights into how modern foundation models operate under the hood:

1. Tokenization and Data Pipelines

  • Designed a custom character-level tokenizer that maps each character in the vocabulary to an integer id, turning raw text into numerical tensors.
  • Implemented a batch data loader (get_batch) that slices the raw text into 2D batches of fixed-length contexts for high-throughput GPU training.
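The two steps above can be sketched roughly as follows. This is a minimal illustration, not the project's exact code: the short `text` string is a stand-in for the Tiny Shakespeare corpus, and the names `stoi`, `itos`, `encode`, and `decode` are assumptions (only `get_batch` is named in this write-up).

```python
import torch

# Stand-in corpus; the project itself trains on the Tiny Shakespeare text.
text = "First Citizen: Before we proceed any further, hear me speak."

# Character-level vocabulary: one integer id per unique character.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # hypothetical name: char -> id
itos = {i: ch for ch, i in stoi.items()}       # hypothetical name: id -> char

def encode(s):
    """Turn a string into a list of integer ids."""
    return [stoi[c] for c in s]

def decode(ids):
    """Turn a list of integer ids back into a string."""
    return "".join(itos[i] for i in ids)

data = torch.tensor(encode(text), dtype=torch.long)

def get_batch(data, batch_size, block_size):
    # Pick random start offsets, then stack the contexts (x) and the
    # same windows shifted one character to the right as targets (y).
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

xb, yb = get_batch(data, batch_size=4, block_size=8)
print(xb.shape, yb.shape)
```

Each row of `yb` is the matching row of `xb` shifted by one position, so every position in the batch contributes a next-character prediction target.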

2. Attention Mechanics & Architecture

  • Built causal self-attention layers from scratch, working through the matrix math ($QK^T$) used to compute affinities between characters in the context.
  • Learned the crucial role of autoregressive masking: a lower-triangular matrix (tril) restricts each position to attend only to itself and earlier positions, preventing the model from "looking into the future" during training.
  • Stacked Multi-Head Attention blocks alongside feed-forward Multi-Layer Perceptrons (MLPs), so each block first mixes information across positions and then transforms each position independently.
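A single masked attention head along these lines might look like the sketch below. The class name `Head`, the toy dimensions, and the scaling by the square root of the head size are assumptions (the scaling is the standard choice, but this write-up only mentions $QK^T$). The attention weights are returned alongside the output purely so the mask can be inspected.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One causal self-attention head (illustrative sketch)."""

    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular mask; a buffer, so it is saved with the model but not trained.
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        # Affinities Q @ K^T, scaled by 1/sqrt(head_size) (assumed scaling).
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5      # (B, T, T)
        # Hide future positions: entries above the diagonal become -inf.
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)                             # rows sum to 1
        return wei @ v, wei

torch.manual_seed(0)
head = Head(n_embd=32, head_size=8, block_size=16)
out, wei = head(torch.randn(2, 10, 32))
print(out.shape, wei.shape)
```

After the softmax, every weight above the diagonal is exactly zero, which is what keeps position t from seeing positions after t.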

3. Optimization and Training Dynamics

  • Configured the AdamW optimizer with a learning rate of 3e-4 for stable training of the deep network.
  • Tracked loss convergence, reducing cross-entropy loss from an initial 4.37 (essentially random guessing) to 1.22 over 3,000 steps.
  • Monitored the gap between training and validation loss to ensure the network was genuinely generalizing rather than just memorizing data (overfitting).
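The training and monitoring loop described above could be sketched as follows. Several pieces are assumptions made to keep the example small and fast: the synthetic repeating corpus, the 90/10 split, the bigram lookup table standing in for the full Transformer, the helper names `loss_fn` and `estimate_loss`, and the 500-step run. Only AdamW and the 3e-4 learning rate come from the project description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, block_size, batch_size = 65, 8, 32   # toy sizes, not the project's

# Synthetic stand-in corpus (a repeating id pattern), split 90/10.
data = torch.arange(5000) % vocab_size
n = int(0.9 * len(data))
splits = {"train": data[:n], "val": data[n:]}

def get_batch(split):
    d = splits[split]
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i:i + block_size] for i in ix])
    y = torch.stack([d[i + 1:i + block_size + 1] for i in ix])
    return x, y

# Placeholder model: a bigram lookup table instead of the full Transformer.
model = nn.Embedding(vocab_size, vocab_size)

def loss_fn(x, y):
    logits = model(x)                                    # (B, T, vocab_size)
    return F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))

@torch.no_grad()
def estimate_loss(eval_iters=20):
    # Average over several batches so the train/val comparison is less noisy.
    return {
        s: torch.stack([loss_fn(*get_batch(s)) for _ in range(eval_iters)]).mean().item()
        for s in splits
    }

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
before = estimate_loss()
for step in range(500):
    xb, yb = get_batch("train")
    loss = loss_fn(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
after = estimate_loss()
print("before:", before)
print("after: ", after)
```

Comparing the "train" and "val" entries over time is the overfitting check: a training loss that keeps falling while the validation loss stalls or rises suggests memorization rather than generalization.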

4. Weights Serialization & CPU Inference Setup

  • Mastered state serialization by exporting trained parameters into a reusable .pth checkpoint file.
  • Designed a deployment flow that rebuilds the model architecture on a standard CPU runtime and loads the saved weights through an in-memory byte buffer (io.BytesIO), so text can be generated immediately and for free, without retraining.
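A minimal sketch of that round trip is shown below, assuming the checkpoint holds a plain `state_dict`. The `build_model` function is a hypothetical stand-in for the real Mini-GPT class, and `raw_bytes` plays the role of the uploaded .pth file contents.

```python
import io
import torch
import torch.nn as nn

def build_model():
    # Hypothetical stand-in for the Mini-GPT architecture definition.
    return nn.Sequential(nn.Embedding(65, 32), nn.Linear(32, 65))

# Training side: serialize only the parameters (the state_dict).
trained = build_model()
buffer = io.BytesIO()
torch.save(trained.state_dict(), buffer)
raw_bytes = buffer.getvalue()        # the same bytes a .pth file would contain

# Inference side: rebuild the architecture, then load the weights from
# memory, mapping every tensor onto the CPU.
restored = build_model()
state = torch.load(io.BytesIO(raw_bytes), map_location="cpu")
restored.load_state_dict(state)
restored.eval()                      # switch off training-only behavior such as dropout
```

Passing `map_location="cpu"` matters when the checkpoint was written on a GPU runtime, since the tensors would otherwise try to load onto a device that is not available.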

⚙️ Model Configurations

Hyperparameter   Value   Description
batch_size       64      Independent sequences processed in parallel
block_size       256     Maximum token context window for predictions
n_embd           384     Embedding and hidden layer dimension size
n_head           6       Parallel self-attention heads per layer
n_layer          6       Total Transformer blocks stacked sequentially
max_iters        3000    Total optimization steps
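For reference, these settings can be collected in a small config object, which also makes a few derived quantities explicit. The `Config` dataclass and the derived names (`head_size`, `tokens_per_step`, `tokens_seen`) are illustrative assumptions. The per-head size assumes the embedding is split evenly across heads, as is conventional.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    batch_size: int = 64
    block_size: int = 256
    n_embd: int = 384
    n_head: int = 6
    n_layer: int = 6
    max_iters: int = 3000

cfg = Config()

# The embedding dimension must divide evenly across the attention heads.
assert cfg.n_embd % cfg.n_head == 0
head_size = cfg.n_embd // cfg.n_head                 # width of each attention head
tokens_per_step = cfg.batch_size * cfg.block_size    # characters processed per optimization step
tokens_seen = tokens_per_step * cfg.max_iters        # characters processed over the whole run

print(head_size, tokens_per_step, tokens_seen)
```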

🎮 Deployment and Running

Inference is decoupled from training: the trained weights are supplied through an interactive file-upload prompt.

In a fresh CPU session, running the application cell opens a "Choose Files" upload prompt directly in the cell output. Select the saved mini_gpt_shakespeare.pth weights file, and the model loads the parameters into the architecture and begins generating text:

--- Mini-GPT Inference Portal ---
Please upload your 'mini_gpt_shakespeare.pth' weight file below:

mini_gpt_shakespeare.pth - 100% done
Initializing network architecture blueprint...
Mapping weights directly from memory buffer...

--- Generating text on CPU ---

Yet thyself are proclaimd their with arms fair.

KING RICHARD II:
France his ranspring of Warwick, what takes Warwickman?

RICHARD: Court! as as we know the same to slaughter?
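Text like the sample above comes from an autoregressive sampling loop. The sketch below shows one common way to write it and is not necessarily the project's own: the `generate` function, the tiny untrained stand-in model, and the all-zeros start token are assumptions, and the real model would decode the resulting ids back to characters with its tokenizer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size):
    """Sample new token ids one at a time, feeding each back in as context."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]              # crop to the context window
        logits = model(idx_cond)                     # (B, T, vocab_size)
        probs = F.softmax(logits[:, -1, :], dim=-1)  # distribution for the next token
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)       # append and continue
    return idx

torch.manual_seed(0)
vocab_size = 65                                      # toy value for this sketch
# Untrained stand-in model that returns logits of shape (B, T, vocab_size).
model = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size)).eval()

start = torch.zeros((1, 1), dtype=torch.long)        # single start token (id 0)
out = generate(model, start, max_new_tokens=20, block_size=256)
print(out.shape)
```

Sampling with `torch.multinomial` rather than always taking the most likely token is what gives the output its variety from run to run.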
