llm from scratch

building an llm from scratch

archived Feb 2026

overview

this project builds every component of an llm from scratch in PyTorch.

the goal is to understand how llms actually work from first principles.

covered so far:

1. data & tokenization pipeline

text cleaning and normalization
tokenization pipeline for autoregressive training
dataset integration using hugging face corpora
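
as a rough illustration of what "tokenization pipeline for autoregressive training" usually involves, here is a minimal sketch in plain python (no PyTorch, so each step stays visible). the whitespace tokenizer, the function name `make_training_pairs`, and the `context_len` / `stride` parameters are assumptions made for the example, not the repo's actual code.

```python
def make_training_pairs(token_ids, context_len, stride):
    """Slice a token stream into (input, target) windows for next-token prediction."""
    pairs = []
    # Stop early enough that the shifted target window still fits in the stream.
    for start in range(0, len(token_ids) - context_len, stride):
        inputs = token_ids[start : start + context_len]
        # Targets are the same window shifted one token to the right.
        targets = token_ids[start + 1 : start + context_len + 1]
        pairs.append((inputs, targets))
    return pairs


# Toy whitespace "tokenizer" (a stand-in for illustration only).
text = "the cat sat on the mat"
vocab = {word: idx for idx, word in enumerate(sorted(set(text.split())))}
token_ids = [vocab[word] for word in text.split()]

pairs = make_training_pairs(token_ids, context_len=3, stride=1)
```

each target window is the input window shifted by one position, which is what lets the model learn to predict token t+1 from tokens up to t.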

2. positional encoding strategies

learned (linear) positional embeddings
rotary positional embeddings (rope)
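
to make the rope idea concrete, the sketch below rotates consecutive pairs of a vector by angles that grow with the token position. it is plain python and assumes the interleaved (even, odd) pairing and a base of 10000; other implementations pair the first and second halves of the vector instead. the helper names `rope` and `dot` are made up for the example.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each (even, odd) pair of `vec` by a position-dependent angle."""
    dim = len(vec)
    assert dim % 2 == 0, "rope needs an even dimension"
    out = []
    for i in range(0, dim, 2):
        # Lower-index pairs rotate quickly, higher-index pairs rotate slowly.
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same relative offset (2) at different absolute positions.
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 12), rope(k, 10))
```

the two scores match up to floating-point error: after the rotation, the query-key dot product depends only on the relative offset between positions, not on the absolute positions.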

3. attention mechanisms

scaled dot-product attention
multi-head attention
advanced attention variants:
  grouped query attention (gqa)
  multi-head latent attention (mla)
  sliding window attention (swa)
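
a minimal single-head sketch of causal scaled dot-product attention, with an optional lookback limit to show how a sliding window changes the mask. it is plain python over lists rather than PyTorch tensors, and the `attention` signature and `window` parameter are assumptions for illustration only.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, window=None):
    """Causal scaled dot-product attention for one head.

    With `window` set, position i only sees the last `window` positions up to i.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        first = 0 if window is None else max(0, i - window + 1)
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k)
            for j in range(first, i + 1)
        ]
        # Masked positions (too old, or in the future) get weight exactly 0.
        weights = [0.0] * first + softmax(scores) + [0.0] * (len(keys) - i - 1)
        outputs.append(
            [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]
        )
        all_weights.append(weights)
    return outputs, all_weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
full_out, full_weights = attention(x, x, x)
swa_out, swa_weights = attention(x, x, x, window=2)
```

multi-head attention runs several of these heads in parallel on different projections of the input; gqa, roughly speaking, lets groups of query heads share one set of keys and values.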

4. transformer architecture design

complete transformer block implementation
layer normalization
mlp / feed-forward networks
residual connections
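
the pieces above fit together roughly as follows. this sketch assumes a pre-norm layout (normalize, apply the sublayer, then add the residual) and, for brevity, shares one set of layer-norm parameters between both norms; a real block would keep separate parameters, and the source does not say which layout the repo uses. `attn` and `mlp` are placeholder callables.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b for v, g, b in zip(x, gamma, beta)]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def pre_norm_block(tokens, attn, mlp, gamma, beta):
    """tokens: list of vectors. attn maps a sequence to a sequence; mlp maps one vector."""
    normed = [layer_norm(t, gamma, beta) for t in tokens]
    tokens = [add(t, a) for t, a in zip(tokens, attn(normed))]  # residual around attention
    normed = [layer_norm(t, gamma, beta) for t in tokens]
    tokens = [add(t, mlp(n)) for t, n in zip(tokens, normed)]  # residual around the mlp
    return tokens


dim = 4
gamma, beta = [1.0] * dim, [0.0] * dim
tokens = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, -1.0, 2.0]]

# Placeholder sublayers that output zeros, so the block reduces to the identity.
zero_attn = lambda seq: [[0.0] * len(t) for t in seq]
zero_mlp = lambda t: [0.0] * len(t)
out = pre_norm_block(tokens, zero_attn, zero_mlp, gamma, beta)
normed = layer_norm(tokens[0], gamma, beta)
```

with zero-output sublayers the block returns its input unchanged, which shows why residual connections make deep stacks easier to train: each sublayer only has to learn a correction to the running representation.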

5. scaling & efficiency techniques

mixture-of-experts (moe) routing
kv caching for faster autoregressive inference
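
as one possible reading of "moe routing", the sketch below implements top-k routing: a router scores every expert, only the k best run, and their outputs are mixed with softmax weights. top-k with renormalized weights is an assumption here (the source does not name a routing scheme), and `top_k_route` and `moe_forward` are hypothetical names.

```python
import math


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    ranked = sorted(range(len(router_logits)), key=lambda e: router_logits[e], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[e] for e in chosen)
    exps = [math.exp(router_logits[e] - peak) for e in chosen]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(chosen, exps)]


def moe_forward(x, router_logits, experts, k=2):
    """Run only the selected experts on x and return their weighted sum."""
    out = [0.0] * len(x)
    for expert_id, weight in top_k_route(router_logits, k):
        y = experts[expert_id](x)
        out = [o + weight * v for o, v in zip(out, y)]
    return out


# Four toy "experts" that just scale their input by 1, 2, 3, and 4.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
route = top_k_route([0.1, 2.0, -1.0, 1.0], k=2)
mixed = moe_forward([1.0], [0.1, 2.0, -1.0, 1.0], experts, k=2)
```

only two of the four experts run for this token, which is the efficiency argument for moe: parameter count grows with the number of experts while per-token compute grows only with k.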

6. training & adaptation

autoregressive pretraining on large-scale corpora
classification fine-tuning
supervised instruction fine-tuning
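
pretraining and instruction fine-tuning typically share the same next-token cross-entropy objective; what changes is the data and which positions count toward the loss. the sketch below is plain python and assumes the common convention of skipping masked targets (for example prompt tokens) marked with -100, the default `ignore_index` of PyTorch's `CrossEntropyLoss`. whether the repo masks prompt tokens is not stated in the source.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum_exp - logits[target]


def next_token_loss(logits_per_position, targets, ignore_index=-100):
    """Mean cross-entropy over positions whose target is not masked out."""
    losses = [
        cross_entropy(logits, target)
        for logits, target in zip(logits_per_position, targets)
        if target != ignore_index
    ]
    return sum(losses) / len(losses)


# Vocabulary of 4 tokens, 2 positions; the second position is masked.
logits = [[0.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0]]
loss = next_token_loss(logits, targets=[2, -100])
```

with uniform logits over four tokens the loss is log(4), the value an untrained model would score on a 4-token vocabulary; the masked position contributes nothing.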