Mini-GPT: Building a Transformer Language Model From Scratch

A minimal deep-learning project that implements a decoder-only Transformer language model based on the GPT architecture. The model was built from scratch in PyTorch, trained on the Tiny Shakespeare dataset on a GPU runtime, and set up to generate text entirely on a free CPU runtime.

🧠 What I Learned Doing This Project

Building a custom LLM from scratch provided deep, hands-on engineering insights into how modern foundation models operate under the hood:

1. Tokenization and Data Pipelines

Learned how text is translated into data by designing a custom character-level tokenizer mapping string vocabularies into numerical tensors.
Implemented batch data loaders (get_batch) to slice raw text into parallel 2D contexts for high-throughput GPU training.
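
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the project's exact code: the one-line `text` is a stand-in for the Tiny Shakespeare corpus, `encode`, `decode`, `stoi` and `itos` are illustrative names, and the signature of `get_batch` is an assumption (the writeup only names the function).

```python
import torch

# Stand-in corpus; the project itself trains on the Tiny Shakespeare text.
text = "First Citizen: Before we proceed any further, hear me speak."

# Character-level vocabulary: one integer id per unique character.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    """Map a string to a list of integer token ids."""
    return [stoi[c] for c in s]

def decode(ids):
    """Map a list of integer token ids back to a string."""
    return "".join(itos[i] for i in ids)

data = torch.tensor(encode(text), dtype=torch.long)

def get_batch(data, batch_size, block_size):
    """Sample random windows and stack them into (batch_size, block_size) tensors."""
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])
    # Targets are the inputs shifted one position to the right.
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])
    return x, y

xb, yb = get_batch(data, batch_size=4, block_size=8)
```

Each row of `xb` is an independent context, and `yb` holds the next character for every position in that row, so a single batch supplies `batch_size * block_size` training examples.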

2. Attention Mechanics & Architecture

Built causal self-attention layers from scratch, working through the matrix math ($QK^T$, the product of the queries and the transposed keys) used to compute affinities between characters.
Learned the crucial role of autoregressive masking: a lower-triangular matrix (tril) enforces causal visibility, preventing the model from "looking into the future" during training.
Stacked multi-head attention blocks alongside feed-forward multi-layer perceptrons (MLPs), so that attention gathers information from earlier context and the MLP transforms it at each position.
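
A single causal attention head along these lines might look like the sketch below. It is an illustration under assumptions rather than the project's code: the class name `Head`, the bias-free linear projections, and the $1/\sqrt{d}$ scaling of the scores follow common GPT-style implementations and are not confirmed by the writeup. The attention weights are returned alongside the output only so they can be inspected.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One causal self-attention head (illustrative sketch)."""

    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular mask: position t may attend to positions <= t only.
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        # Affinities: Q K^T, scaled by 1/sqrt(head_size) (assumed scaling).
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # (B, T, T)
        # Future positions get -inf, so softmax assigns them zero weight.
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        out = wei @ self.value(x)  # (B, T, head_size)
        return out, wei

torch.manual_seed(0)
head = Head(n_embd=32, head_size=8, block_size=16)
x = torch.randn(2, 10, 32)
out, wei = head(x)
```

Every row of `wei` sums to 1 and is zero above the diagonal, which is the "no looking into the future" property stated above. A multi-head layer would run several such heads in parallel and concatenate their outputs.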

3. Optimization and Training Dynamics

Configured the AdamW optimizer with a learning rate of 3e-4 to keep training of the deep network stable.
Tracked loss convergence, reducing cross-entropy loss from an initial 4.37 (an untrained model, roughly equivalent to random guessing) to 1.22 over 3,000 steps.
Monitored the gap between training and validation loss to check that the network was generalizing rather than memorizing the training data (overfitting).
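
To put the two loss values in context, the short calculation below compares them with the cross-entropy of a uniform guess. The vocabulary size of 65 is an assumption (the number of distinct characters commonly reported for Tiny Shakespeare); the writeup does not state it.

```python
import math

vocab_size = 65  # assumption: distinct characters in Tiny Shakespeare

# Cross-entropy (in nats) of guessing every character with equal probability.
uniform_loss = math.log(vocab_size)
print(f"uniform baseline: {uniform_loss:.3f}")

# Perplexity = exp(loss): the effective number of equally likely choices.
for loss in (4.37, 1.22):
    print(f"loss {loss:.2f} -> perplexity {math.exp(loss):.1f}")
```

Under that assumption the uniform baseline is about 4.17, so a starting loss of 4.37 is plausible for a randomly initialized network whose predictions are slightly worse than uniform. The final loss of 1.22 corresponds to a much smaller effective number of candidate characters per step.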

4. Weights Serialization & CPU Inference Setup

Learned state serialization by exporting the trained parameters into a reusable .pth checkpoint file.
Designed a deployment flow that instantiates the model architecture on a standard CPU runtime and loads the saved weights through an in-memory byte buffer (io.BytesIO), so text can be generated for free without retraining.
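
The save-then-load round trip through an in-memory buffer can be sketched as below. The tiny `build_model` network is a hypothetical stand-in for the Mini-GPT class, and `raw_bytes` plays the role of whatever bytes the upload step hands back; only `torch.save`, `torch.load`, `state_dict` and `load_state_dict` are taken as given.

```python
import io

import torch
import torch.nn as nn

def build_model():
    """Hypothetical stand-in for the Mini-GPT architecture."""
    return nn.Sequential(nn.Embedding(65, 16), nn.Linear(16, 65))

trained = build_model()

# Training side: serialize only the parameters (the state_dict).
buffer = io.BytesIO()
torch.save(trained.state_dict(), buffer)
raw_bytes = buffer.getvalue()  # stand-in for the uploaded .pth file contents

# Inference side: rebuild the architecture, then load the weights from memory.
model = build_model()
state = torch.load(io.BytesIO(raw_bytes), map_location="cpu")
model.load_state_dict(state)
model.eval()  # switch layers such as dropout to inference behavior
```

Passing `map_location="cpu"` lets a checkpoint saved on a GPU runtime load on a machine without one. Because only the `state_dict` is stored, the inference side must construct the same architecture with the same hyperparameters before loading.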

⚙️ Model Configurations

Hyperparameter	Value	Description
`batch_size`	64	Independent sequences processed in parallel
`block_size`	256	Maximum token context window for predictions
`n_embd`	384	Embedding and hidden layer dimension size
`n_head`	6	Parallel self-attention heads per layer
`n_layer`	6	Total Transformer blocks stacked sequentially
`max_iters`	3000	Total optimization steps
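
As a rough illustration of what these values imply, the sketch below collects them in a dataclass and derives a few quantities. The `Config` class, its helper names, and the 4x MLP expansion used in the weight estimate are assumptions for illustration; the estimate ignores embeddings, biases, and layer norms.

```python
from dataclasses import dataclass

@dataclass
class Config:
    batch_size: int = 64
    block_size: int = 256
    n_embd: int = 384
    n_head: int = 6
    n_layer: int = 6
    max_iters: int = 3000

    @property
    def head_size(self):
        # Each head works on an equal slice of the embedding dimension.
        assert self.n_embd % self.n_head == 0
        return self.n_embd // self.n_head

    @property
    def tokens_per_step(self):
        # Characters seen in one optimization step.
        return self.batch_size * self.block_size

    def approx_block_weights(self, mlp_ratio=4):
        # Attention: Q, K, V and output projections, n_embd^2 weights each.
        attn = 4 * self.n_embd ** 2
        # MLP: two linear layers with an assumed mlp_ratio-times-wider hidden layer.
        mlp = 2 * mlp_ratio * self.n_embd ** 2
        return self.n_layer * (attn + mlp)

cfg = Config()
print(cfg.head_size)               # 64
print(cfg.tokens_per_step)         # 16384
print(cfg.approx_block_weights())  # 10616832
```

Under these assumptions the Transformer blocks hold roughly 10.6 million weights, and each head attends in a 64-dimensional subspace.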

🎮 Deployment and Running

Inference is decoupled from training: the application cell only needs the saved weights file.

On a clean CPU session, running the application cell opens a "Choose Files" upload prompt directly in the cell output. Select the saved mini_gpt_shakespeare.pth weights file, and the model loads the parameters into the freshly built architecture and begins generating text:

--- Mini-GPT Inference Portal ---
Please upload your 'mini_gpt_shakespeare.pth' weight file below:

mini_gpt_shakespeare.pth - 100% done
Initializing network architecture blueprint...
Mapping weights directly from memory buffer...

--- Generating text on CPU ---

Yet thyself are proclaimd their with arms fair.

KING RICHARD II:
France his ranspring of Warwick, what takes Warwickman?

RICHARD: Court! as as we know the same to slaughter?

Built With

python
torch
transformers

Updates

Santhosh P started this project — Jun 08, 2026 11:42 AM EDT
