Step-by-Step Towards Building a mini-GPT

From "What if GPT-2 was built on current SOTA architectural design choices?" to "Building my own LLM"

Mahanth Yalla

M.Tech Artificial Intelligence

Indian Institute of Science, Bengaluru

Abstract

This project demystifies the Transformer architecture by building it incrementally from the ground up. Starting with a simple Bigram model, each stage adds a new component—attention, feed-forward networks, normalization—to empirically measure its impact. The goal is to move the Transformer from a "black-box" to a transparent, understandable system by answering key questions: Why is self-attention needed? Why are residuals and LayerNorm critical? How much do modern optimizations like SwiGLU and Flash Attention truly matter? This repository serves as a living research notebook, providing data-driven answers at each step of the journey.

Experimental Log & Key Insights

Stage 1 & 2: From Bigrams to Self-Attention

The initial stages established a baseline with a simple Bigram model and demonstrated the superiority of a learned, dynamic context. Even a single-head attention mechanism measurably outperformed a fixed-context model and a naive "bag-of-words" averaging approach.

| Experiment | Design Choice | Test Loss | Test Accuracy |
|---|---|---|---|
| Bigram | Baseline (Context=1) | 2.4640 | 0.2850 |
| Averaged Context | Bag-of-Words Context | 2.4619 | 0.2853 |
| Single-Head Attention | Learned, Dynamic Context | 2.4578 | 0.2859 |

Insight: A learned, dynamic context via self-attention is empirically superior to fixed or naive global context methods, even at a very small scale.
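To make the contrast concrete, below is a minimal PyTorch sketch comparing the naive "bag-of-words" averaging with a single causal self-attention head. The tensor shapes, `head_size`, and variable names are illustrative assumptions, not the project's exact code.

```python
import torch
import torch.nn.functional as F

B, T, C = 4, 8, 32                                   # batch, context length, embedding dim
head_size = 16
x = torch.randn(B, T, C)

# Naive "bag-of-words" context: every position is the uniform mean of its past.
tril = torch.tril(torch.ones(T, T))
bow_weights = tril / tril.sum(dim=1, keepdim=True)   # (T, T) fixed averaging weights
x_bow = bow_weights @ x                              # (B, T, C)

# Learned, dynamic context: weights come from query/key similarity instead.
key = torch.nn.Linear(C, head_size, bias=False)
query = torch.nn.Linear(C, head_size, bias=False)
value = torch.nn.Linear(C, head_size, bias=False)
k, q, v = key(x), query(x), value(x)                 # each (B, T, head_size)
att = (q @ k.transpose(-2, -1)) * head_size ** -0.5  # (B, T, T) scaled scores
att = att.masked_fill(tril == 0, float("-inf"))      # causal mask: no peeking at the future
att = F.softmax(att, dim=-1)
out = att @ v                                        # (B, T, head_size) context-mixed values
```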

Stage 3 & 4: Building and Stacking Blocks (Critical Failure)

A complete Transformer block (MHA + FFN) was built. However, simply stacking these blocks 4 layers deep resulted in a critical failure: the deep model performed no better than a single block, a classic symptom of unstable gradient flow (vanishing gradients) in deep networks without residual connections or normalization.

| Experiment | Design Choice | Test Loss | Test Accuracy |
|---|---|---|---|
| 1-Layer Transformer Block | MHA + FFN (ReLU) | 2.4618 | 0.2853 |
| 4-Layer Stacked Blocks | Deep network without stabilization | 2.4622 | 0.2852 |

Insight: Simply stacking Transformer blocks does not work. This failure perfectly motivates the need for stabilization techniques like residual connections and layer normalization, which are not just minor improvements but essential enablers for deep models.
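As a forward look at the fix, here is a minimal sketch of a pre-norm Transformer block: LayerNorm applied before each sub-layer and residual ("x + ...") connections around both. The hyperparameters and the use of `nn.MultiheadAttention` are stand-ins assumed for illustration, not the project's exact implementation.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.ffn = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        T = x.size(1)
        # Boolean causal mask: True marks future positions that may not be attended to.
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + a                      # residual: gradients flow straight through the stack
        x = x + self.ffn(self.ln2(x))  # second residual around the feed-forward network
        return x
```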

Stage 5 & 7: Stabilization and The Great Pivot

After adding residual connections and LayerNorm to create a stable, scalable architecture, the project underwent a fundamental pivot. The character-level model was replaced with a modern Byte-Pair Encoding (BPE) tokenizer, resulting in the single most significant performance leap.

| Experiment | Design Choice | Test Loss | Note |
|---|---|---|---|
| Baseline Transformer | Character-level Tokenizer | 11.7760 | High loss due to char-level prediction |
| Tokenizer Upgrade | BPE Tokenizer (cl100k_base) | 7.9622 | Massive loss reduction from tokenizer alone |

Insight: A good tokenizer is more important than many small architectural tweaks. Switching from character-level to BPE tokens dramatically simplified the learning task and, because attention cost scales quadratically with sequence length, delivered a massive speed-up for the same amount of text.
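A quick illustration of why the pivot pays off, using the `tiktoken` library (which provides `cl100k_base`). The sample sentence is arbitrary; the point is that the same text becomes several times fewer tokens, and attention cost grows with the square of sequence length.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Self-attention lets every token look at every previous token."

char_len = len(text)               # sequence length under a character-level tokenizer
tokens = enc.encode(text)          # BPE token ids
bpe_len = len(tokens)              # several times shorter than char_len

print(char_len, bpe_len)
print(enc.decode(tokens) == text)  # BPE round-trips the text losslessly -> True
# Rough attention-cost ratio: (char_len / bpe_len) ** 2, hence the quadratic speed-up.
```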

Stage 8: Modern Architectural Optimizations

With a solid foundation, this stage tested modern architectural upgrades from models like LLaMA. The goal was to find the optimal micro-architecture by comparing different normalizations, activations, and attention implementations.

| Component Tested | Winning Design | Test Loss | Key Benefit |
|---|---|---|---|
| Normalization | RMSNorm | 7.9585 | Faster and slightly better than LayerNorm. |
| Activation Function | SwiGLU | 7.9543 | Clear winner, best performance. |
| Attention Implementation | Flash Attention | 7.9606 | Identical performance, huge speed/memory gain. |

Insight: Modern components provide clear, measurable benefits. `SwiGLU` is a superior activation, and optimizations like `RMSNorm` and `Flash Attention` offer "free" gains in speed and efficiency with no loss in quality.
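Minimal sketches of the three upgrades, assuming LLaMA-style formulations; dimensions and module names are illustrative rather than copied from the repository. "Flash Attention" is shown via PyTorch's fused `scaled_dot_product_attention`, which computes the same math as the explicit softmax formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """LayerNorm without mean-centering or bias: rescale by the RMS of the features."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps) * self.weight

class SwiGLU(nn.Module):
    """Gated feed-forward network: SiLU(x W1) * (x W3), then project back down."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)  # gate branch
        self.w3 = nn.Linear(dim, hidden, bias=False)  # value branch
        self.w2 = nn.Linear(hidden, dim, bias=False)  # down-projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

# Fused attention kernel (PyTorch >= 2.0): same result as softmax(QK^T / sqrt(d)) V,
# but without materializing the full (T, T) attention matrix.
q = k = v = torch.randn(2, 8, 64, 32)               # (batch, heads, seq_len, head_dim)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```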

Stage 9: Final Pre-training and "mytNano"

The final stage combined all winning components into a single model ("mytNano") and focused on stable pre-training dynamics. Implementing proper weight initialization and a cosine decay learning rate schedule unlocked the final layer of performance.

| Model | Design Choice | Final Test Loss | Final Test BPC |
|---|---|---|---|
| mytNano (Final Model) | All optimizations + Training stability | 7.4214 | 10.7068 |

Insight: Architectural excellence must be paired with stable training dynamics. Proper initialization and learning rate schedules are responsible for a significant drop in final loss, demonstrating their importance in achieving SOTA results.
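A hedged sketch of the two stability ingredients named above: GPT-2-style small-normal weight initialization and a cosine-decay learning-rate schedule with linear warmup. The constants (std 0.02, warmup/total steps, learning-rate bounds) are common defaults assumed here, not values taken from the project's configuration.

```python
import math
import torch.nn as nn

def init_weights(module: nn.Module) -> None:
    # Small normal init for linear/embedding weights, zero bias (GPT-2 convention).
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)

def cosine_lr(step: int, max_lr=3e-4, min_lr=3e-5, warmup=200, total=5000) -> float:
    if step < warmup:                  # linear warmup up to max_lr
        return max_lr * (step + 1) / warmup
    if step >= total:                  # after the decay horizon, hold at the floor
        return min_lr
    progress = (step - warmup) / (total - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Usage: model.apply(init_weights), then set each optimizer param group's lr to
# cosine_lr(step) at every training step.
```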

Conclusion

This project successfully built a Transformer from the ground up, empirically validating the function and impact of each architectural component. The journey revealed several key truths: a powerful tokenizer provides the single largest performance gain, stabilization techniques like residuals and normalization are non-negotiable enablers of depth, and modern optimizations like SwiGLU, RMSNorm, and Flash Attention offer significant, "free" improvements in performance and efficiency.

The final model, "mytNano," achieved a final test loss of 7.4214, the culmination of meticulous, step-by-step additions and tuning. This demonstrates that a deep understanding of each component is crucial for building efficient and powerful language models. The work has been extended by scaling up this architecture to pre-train a 1 billion parameter model in the myT-LLM project.
