Step-by-Step Towards Building a mini-GPT

From "What if GPT-2 was built on current SOTA architectural design choices?" to "Building my own LLM"

Mahanth Yalla

M.Tech Artificial Intelligence

Indian Institute of Science, Bengaluru

Abstract

This project demystifies the Transformer architecture by building it incrementally from the ground up. Starting with a simple Bigram model, each stage adds a new component—attention, feed-forward networks, normalization—to empirically measure its impact. The goal is to move the Transformer from a "black-box" to a transparent, understandable system by answering key questions: Why is self-attention needed? Why are residuals and LayerNorm critical? How much do modern optimizations like SwiGLU and Flash Attention truly matter? This repository serves as a living research notebook, providing data-driven answers at each step of the journey.

Experimental Log & Key Insights

Stage 1 & 2: From Bigrams to Self-Attention

The initial stages established a baseline with a simple Bigram model and demonstrated the superiority of a learned, dynamic context. Even a single-head attention mechanism measurably outperformed a fixed-context model and a naive "bag-of-words" averaging approach.

| Experiment | Design Choice | Test Loss | Test Accuracy |
|---|---|---|---|
| Bigram | Baseline (Context=1) | 2.4640 | 0.2850 |
| Averaged Context | Bag-of-Words Context | 2.4619 | 0.2853 |
| Single-Head Attention | Learned, Dynamic Context | 2.4578 | 0.2859 |

Insight: A learned, dynamic context via self-attention is empirically superior to fixed or naive global context methods, even at a very small scale.
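To make the contrast concrete, below is a minimal PyTorch sketch comparing the naive "bag-of-words" averaging with a single causal self-attention head. The tensor shapes, `head_size`, and variable names are illustrative assumptions, not the project's exact code.

```python
import torch
import torch.nn.functional as F

B, T, C = 4, 8, 32                                   # batch, context length, embedding dim
head_size = 16
x = torch.randn(B, T, C)

# Naive "bag-of-words" context: every position is the uniform mean of its past.
tril = torch.tril(torch.ones(T, T))
bow_weights = tril / tril.sum(dim=1, keepdim=True)   # (T, T) fixed averaging weights
x_bow = bow_weights @ x                              # (B, T, C)

# Learned, dynamic context: weights come from query/key similarity instead.
key = torch.nn.Linear(C, head_size, bias=False)
query = torch.nn.Linear(C, head_size, bias=False)
value = torch.nn.Linear(C, head_size, bias=False)
k, q, v = key(x), query(x), value(x)                 # each (B, T, head_size)
att = (q @ k.transpose(-2, -1)) * head_size ** -0.5  # (B, T, T) scaled scores
att = att.masked_fill(tril == 0, float("-inf"))      # causal mask: no peeking at the future
att = F.softmax(att, dim=-1)
out = att @ v                                        # (B, T, head_size) context-mixed values
```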

Stage 3 & 4: Building and Stacking Blocks (Critical Failure)

A complete Transformer block (MHA + FFN) was built. However, simply stacking these blocks 4 layers deep resulted in a critical failure: the deep model performed no better than a single block, a classic symptom of unstable gradient flow (vanishing gradients) in deep networks without residual connections or normalization.

| Experiment | Design Choice | Test Loss | Test Accuracy |
|---|---|---|---|
| 1-Layer Transformer Block | MHA + FFN (ReLU) | 2.4618 | 0.2853 |
| 4-Layer Stacked Blocks | Deep network without stabilization | 2.4622 | 0.2852 |

Insight: Simply stacking Transformer blocks does not work. This failure perfectly motivates the need for stabilization techniques like residual connections and layer normalization, which are not just minor improvements but essential enablers for deep models.
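As a forward look at the fix, here is a minimal sketch of a pre-norm Transformer block: LayerNorm applied before each sub-layer and residual ("x + ...") connections around both. The hyperparameters and the use of `nn.MultiheadAttention` are stand-ins assumed for illustration, not the project's exact implementation.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.ffn = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        T = x.size(1)
        # Boolean causal mask: True marks future positions that may not be attended to.
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + a                      # residual: gradients flow straight through the stack
        x = x + self.ffn(self.ln2(x))  # second residual around the feed-forward network
        return x
```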

Stage 5 & 7: Stabilization and The Great Pivot

After adding residual connections and LayerNorm to create a stable, scalable architecture, the project underwent a fundamental pivot. The character-level model was replaced with a modern Byte-Pair Encoding (BPE) tokenizer, resulting in the single most significant performance leap.

| Experiment | Design Choice | Test Loss | Note |
|---|---|---|---|
| Baseline Transformer | Character-level Tokenizer | 11.7760 | High loss due to char-level prediction |
| Tokenizer Upgrade | BPE Tokenizer (cl100k_base) | 7.9622 | Massive loss reduction from tokenizer alone |

Insight: A good tokenizer is more important than many small architectural tweaks. Switching from character-level to BPE tokens dramatically simplified the learning task and, because attention cost scales quadratically with sequence length, delivered a massive speed-up for the same amount of text.
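A quick illustration of why the pivot pays off, using the `tiktoken` library (which provides `cl100k_base`). The sample sentence is arbitrary; the point is that the same text becomes several times fewer tokens, and attention cost grows with the square of sequence length.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Self-attention lets every token look at every previous token."

char_len = len(text)               # sequence length under a character-level tokenizer
tokens = enc.encode(text)          # BPE token ids
bpe_len = len(tokens)              # several times shorter than char_len

print(char_len, bpe_len)
print(enc.decode(tokens) == text)  # BPE round-trips the text losslessly -> True
# Rough attention-cost ratio: (char_len / bpe_len) ** 2, hence the quadratic speed-up.
```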

Stage 8: Modern Architectural Optimizations

With a solid foundation, this stage tested modern architectural upgrades from models like LLaMA. The goal was to find the optimal micro-architecture by comparing different normalizations, activations, and attention implementations.

| Component Tested | Winning Design | Test Loss | Key Benefit |
|---|---|---|---|
| Normalization | RMSNorm | 7.9585 | Faster and slightly better than LayerNorm. |
| Activation Function | SwiGLU | 7.9543 | Clear winner, best performance. |
| Attention Implementation | Flash Attention | 7.9606 | Identical performance, huge speed/memory gain. |

Insight: Modern components provide clear, measurable benefits. `SwiGLU` is a superior activation, and optimizations like `RMSNorm` and `Flash Attention` offer "free" gains in speed and efficiency with no loss in quality.
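Minimal sketches of the three upgrades, assuming LLaMA-style formulations; dimensions and module names are illustrative rather than copied from the repository. "Flash Attention" is shown via PyTorch's fused `scaled_dot_product_attention`, which computes the same math as the explicit softmax formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """LayerNorm without mean-centering or bias: rescale by the RMS of the features."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps) * self.weight

class SwiGLU(nn.Module):
    """Gated feed-forward network: SiLU(x W1) * (x W3), then project back down."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)  # gate branch
        self.w3 = nn.Linear(dim, hidden, bias=False)  # value branch
        self.w2 = nn.Linear(hidden, dim, bias=False)  # down-projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

# Fused attention kernel (PyTorch >= 2.0): same result as softmax(QK^T / sqrt(d)) V,
# but without materializing the full (T, T) attention matrix.
q = k = v = torch.randn(2, 8, 64, 32)               # (batch, heads, seq_len, head_dim)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```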

Stage 9: Final Pre-training and "mytNano"

The final stage combined all winning components into a single model ("mytNano") and focused on stable pre-training dynamics. Implementing proper weight initialization and a cosine decay learning rate schedule unlocked the final layer of performance.

| Model | Design Choice | Final Test Loss | Final Test BPC |
|---|---|---|---|
| mytNano (Final Model) | All optimizations + Training stability | 7.4214 | 10.7068 |

Insight: Architectural excellence must be paired with stable training dynamics. Proper initialization and learning rate schedules are responsible for a significant drop in final loss, demonstrating their importance in achieving SOTA results.
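A hedged sketch of the two stability ingredients named above: GPT-2-style small-normal weight initialization and a cosine-decay learning-rate schedule with linear warmup. The constants (std 0.02, warmup/total steps, learning-rate bounds) are common defaults assumed here, not values taken from the project's configuration.

```python
import math
import torch.nn as nn

def init_weights(module: nn.Module) -> None:
    # Small normal init for linear/embedding weights, zero bias (GPT-2 convention).
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)

def cosine_lr(step: int, max_lr=3e-4, min_lr=3e-5, warmup=200, total=5000) -> float:
    if step < warmup:                  # linear warmup up to max_lr
        return max_lr * (step + 1) / warmup
    if step >= total:                  # after the decay horizon, hold at the floor
        return min_lr
    progress = (step - warmup) / (total - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Usage: model.apply(init_weights), then set each optimizer param group's lr to
# cosine_lr(step) at every training step.
```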

Conclusion

This project successfully built a Transformer from the ground up, empirically validating the function and impact of each architectural component. The journey revealed several key truths: a powerful tokenizer provides the single largest performance gain, stabilization techniques like residuals and normalization are non-negotiable enablers of depth, and modern optimizations like SwiGLU, RMSNorm, and Flash Attention offer significant, "free" improvements in performance and efficiency.

The final model, "mytNano," achieved a final test loss of 7.4214, the culmination of meticulous, step-by-step additions and tuning. This demonstrates that a deep understanding of each component is crucial for building efficient and powerful language models. The work has been extended by scaling up this architecture to pre-train a 1 billion parameter model in the myT-LLM project.
