Experimental Log & Key Insights
Stage 1 & 2: From Bigrams to Self-Attention
The initial stages established a baseline with a simple Bigram model and demonstrated the superiority of a learned, dynamic context. Even a single-head attention mechanism measurably outperformed a fixed-context model and a naive "bag-of-words" averaging approach.
| Experiment | Design Choice | Test Loss | Test Accuracy |
|---|---|---|---|
| Bigram | Baseline (Context=1) | 2.4640 | 0.2850 |
| Averaged Context | Bag-of-Words Context | 2.4619 | 0.2853 |
| Single-Head Attention | Learned, Dynamic Context | 2.4578 | 0.2859 |
Insight: A learned, dynamic context via self-attention is empirically superior to fixed or naive global context methods, even at a very small scale.
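For reference, here is a minimal sketch of a single head of causal self-attention of the kind tested in this stage, assuming PyTorch; the dimensions and class name are illustrative, not the project's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadAttention(nn.Module):
    """One head of causal self-attention: each position attends to itself and earlier positions."""
    def __init__(self, n_embd: int, head_size: int, block_size: int):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular mask enforces causality (no peeking at future tokens).
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        # Scaled dot-product scores, masked to be causal, then softmax-normalized.
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        return wei @ v  # weighted sum of values: a learned, dynamic context
```

Unlike the bag-of-words average, the softmax weights here are data-dependent, which is exactly what gives the small but consistent edge in the table above.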
Stage 3 & 4: Building and Stacking Blocks (Critical Failure)
A complete Transformer block (MHA + FFN) was built. However, simply stacking these blocks four layers deep resulted in a critical failure: the deeper model performed no better than a single block and showed no signs of learning, a symptom of the vanishing gradient problem.
| Experiment | Design Choice | Test Loss | Test Accuracy |
|---|---|---|---|
| 1-Layer Transformer Block | MHA + FFN (ReLU) | 2.4618 | 0.2853 |
| 4-Layer Stacked Blocks | Deep network without stabilization | 2.4622 | 0.2852 |
Insight: Simply stacking Transformer blocks does not work. This failure directly motivates stabilization techniques like residual connections and layer normalization, which are not minor refinements but essential enablers of deep models (see the sketch below).
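The following sketch contrasts the two variants, assuming PyTorch; the `use_stabilization` flag and hyperparameters are illustrative, not the project's actual code, and the causal mask is omitted to keep the sketch short:

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """MHA + FFN (ReLU) block. Without the residual/LayerNorm path, stacks of these fail to train."""
    def __init__(self, n_embd: int, n_head: int, use_stabilization: bool = True):
        super().__init__()
        self.use_stabilization = use_stabilization
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        if self.use_stabilization:
            # Pre-norm residual form (the Stage 5 fix): the skip connections give gradients
            # a direct path back to earlier layers, making deep stacks trainable.
            h = self.ln1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
            x = x + self.ffn(self.ln2(x))
        else:
            # Stage 4 failure mode: each block's output replaces its input,
            # so gradients shrink layer by layer and a 4-layer stack stops learning.
            x = self.attn(x, x, x, need_weights=False)[0]
            x = self.ffn(x)
        return x
```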
Stage 5 & 7: Stabilization and The Great Pivot
After adding residual connections and LayerNorm to create a stable, scalable architecture, the project underwent a fundamental pivot. The character-level model was replaced with a modern Byte-Pair Encoding (BPE) tokenizer, resulting in the single most significant performance leap.
| Experiment | Design Choice | Test Loss | Note |
|---|---|---|---|
| Baseline Transformer | Character-level Tokenizer | 11.7760 | High loss due to char-level prediction |
| Tokenizer Upgrade | BPE Tokenizer (cl100k_base) | 7.9622 | Massive loss reduction from tokenizer alone |
Insight: A good tokenizer matters more than many small architectural tweaks. Switching from character-level to BPE tokens dramatically simplified the learning task, and because attention cost grows quadratically with sequence length, the much shorter token sequences also delivered a large speed-up for the attention mechanism.
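The tokenizer swap itself is only a few lines; here is a sketch using the `tiktoken` library's `cl100k_base` encoding (the surrounding data pipeline is assumed, not shown):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the BPE vocabulary referenced above

text = "Attention is all you need."
char_ids = [ord(c) for c in text]   # character-level: one id per character
bpe_ids = enc.encode(text)          # BPE: one id per subword token

print(len(char_ids), len(bpe_ids))  # BPE sequences are several times shorter
print(enc.decode(bpe_ids))          # round-trips back to the original text
```

Shorter sequences mean each forward pass covers more text for the same block size, on top of the easier prediction task.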
Stage 8: Modern Architectural Optimizations
With a solid foundation, this stage tested modern architectural upgrades from models like LLaMA. The goal was to find the optimal micro-architecture by comparing different normalizations, activations, and attention implementations.
| Component Tested | Winning Design | Test Loss | Key Benefit |
|---|---|---|---|
| Normalization | RMSNorm | 7.9585 | Faster and slightly better than LayerNorm. |
| Activation Function | SwiGLU | 7.9543 | Clear winner, best performance. |
| Attention Implementation | Flash Attention | 7.9606 | Identical performance, huge speed/memory gain. |
Insight: Modern components provide clear, measurable benefits. `SwiGLU` is a superior activation, and optimizations like `RMSNorm` and `Flash Attention` offer "free" gains in speed and efficiency with no loss in quality.
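Minimal sketches of the winning components, assuming PyTorch 2.x, whose fused `scaled_dot_product_attention` kernel dispatches to Flash Attention when the hardware supports it; dimensions and names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square norm: rescale by the RMS of the features, with no mean subtraction or bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """Gated FFN in the LLaMA style: silu(x W1) * (x W3), then project back down."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)  # gate branch
        self.w3 = nn.Linear(dim, hidden, bias=False)  # value branch
        self.w2 = nn.Linear(hidden, dim, bias=False)  # down-projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

# Flash Attention computes the same softmax(QK^T / sqrt(d)) V as the manual version,
# but in a fused kernel that never materializes the full T x T score matrix.
def causal_attention(q, k, v):
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```

Because the attention math is unchanged, the near-identical loss in the table is expected; the benefit is purely speed and memory.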
Stage 9: Final Pre-training and "mytNano"
The final stage combined all winning components into a single model ("mytNano") and focused on stable pre-training dynamics. Implementing proper weight initialization and a cosine decay learning rate schedule unlocked the final layer of performance.
| Model | Design Choice | Final Test Loss | Final Test BPC |
|---|---|---|---|
| mytNano (Final Model) | All optimizations + Training stability | 7.4214 | 10.7068 |
Insight: Architectural excellence must be paired with stable training dynamics. Proper initialization and a learning rate schedule account for a significant share of the final drop in loss, demonstrating how much they matter for getting the best results out of a fixed architecture.
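A sketch of these two training-stability pieces, assuming a GPT-style initialization and a linear-warmup-plus-cosine-decay schedule; the constants are illustrative, not the project's exact settings:

```python
import math
import torch.nn as nn

def init_weights(module: nn.Module):
    """GPT-style init: small normal weights for linear and embedding layers, zero biases."""
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)

def get_lr(step: int, max_lr: float, min_lr: float, warmup: int, max_steps: int) -> float:
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    if step >= max_steps:
        return min_lr
    progress = (step - warmup) / (max_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# usage: model.apply(init_weights); then, each step, set the optimizer's lr to get_lr(step, ...)
```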
Conclusion
This project successfully built a Transformer from the ground up, empirically validating the function and impact of each architectural component. The journey revealed several key truths: a powerful tokenizer provides the single largest performance gain, stabilization techniques like residuals and normalization are non-negotiable enablers of depth, and modern optimizations like SwiGLU, RMSNorm, and Flash Attention offer significant, "free" improvements in performance and efficiency.
The final model, "mytNano," achieved a final test loss of 7.4214, a culmination of meticulous, step-by-step additions and tuning. This demonstrates that a deep understanding of each component is crucial for building efficient and powerful language models. The work has been extended by scaling up this architecture to pre-train a 1 billion parameter model in the myT-LLM project.