“The real understanding comes when we get our hands dirty and build these things,” says my first AI mentor, Richard Feynman.
The goal of this course is to prepare me for AI-backend optimization work (e.g., CUDA, tensor cores, memory-hierarchy tuning).
I use my own mentor prompt, generated with the help of another prompt of mine, “mentor_generator v0.24.3”, to learn how LLMs work under the hood by building one from scratch. The idea stems from the Stanford CS336 course, but this version is interactive, though it needs hallucination checks.
This is already the third mentor I have regenerated for the course. I run the current mentor in the free Qwen3-Max chat with reasoning disabled; the previous ones ran on DeepSeek.
“What I cannot create, I do not understand.” - Richard Feynman
Hello, and welcome. We are beginning “Building a Small LLM from First Principles”, a depth-first course tailored to your profile:
You are a strong intermediate practitioner (NumPy backprop, CNNs, DL libraries).
Your goal is professional readiness as an AI Architect, with emphasis on CUDA-aware design and bare-metal understanding.
You operate in two environments: a high-VRAM GPU (RTX 4090ti, 16 GB) and a CPU-only Debian 13 system—so VRAM and latency constraints are central to every design decision.
Entire Learning Plan
The curriculum is structured into five phases, designed to take you from a single neuron to a fully functional, efficient 100M-parameter LLM. Here is the roadmap.
| Phase | Focus | Hands-On | Estimated Time (1h/day) |
|---|---|---|---|
| Phase 1 | Foundational Neurons & Backprop | NumPy: single neuron, activation, gradient descent | 5–7 days |
| Phase 2 | Core Transformer Components | PyTorch: tokenization, positional encoding, attention | 10–14 days |
| Phase 3 | Optimization & Architecture | KV Cache, Flash Attention, LayerNorm, AdamW | 10–12 days |
| Phase 4 | Training & Alignment | Dataset prep, training loop, SFT, basic PPO/DPO | 12–15 days |
| Phase 5 | Architectural Review & Deployment | Quantization (INT8), CPU inference, full system audit | 7–10 days |
Total estimated commitment: ~50–60 hours, paced by mastery, not by the calendar.
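For a taste of Phase 1, here is a minimal sketch of the first hands-on exercise: a single neuron with a sigmoid activation, trained by gradient descent in NumPy. The toy data, loss, and hyperparameters are my own illustrative choices, not the course’s official exercise.

```python
import numpy as np

# One neuron: y_hat = sigmoid(w @ x + b), trained with plain gradient descent
# on mean squared error. Synthetic data; all hyperparameters are arbitrary.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 samples, 3 features
true_w = np.array([1.5, -2.0, 0.5])
y = (X @ true_w > 0).astype(float)     # synthetic binary targets

w = np.zeros(3)
b = 0.0
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(200):
    z = X @ w + b
    y_hat = sigmoid(z)
    # dL/dz for MSE: (y_hat - y) * sigmoid'(z), averaged over the batch
    dz = (y_hat - y) * y_hat * (1.0 - y_hat) * (2.0 / len(y))
    w -= lr * (X.T @ dz)               # chain rule: dL/dw = X^T dz
    b -= lr * dz.sum()

print("learned w:", w, "b:", b)
```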
Goals & Practical Skills¶
By the end, you will:
Architect a 100M-parameter LLM from scratch.
Explain every component down to memory layout and gradient flow.
Optimize for both VRAM (4090ti) and CPU (Debian 13) constraints.
Defend your design choices in an AI Architect interview with mathematical and hardware-aware reasoning.
Administrative Notes
Each session is ~1 hour.
No deadlines—progress is gated solely by verified understanding.
You must demonstrate reasoning, not just recall. Passive acknowledgment (“I get it”) is insufficient.
We proceed one small step at a time, with micro-validation.
Can I run this course on the weaker laptop?
Hardware Requirements Analysis:
| Component | RTX 4090ti | Weaker Laptop | Impact |
|---|---|---|---|
| VRAM | 16GB | Shared system RAM (8GB total) | Limits model size and batch size |
| Compute | High parallel throughput | Limited CPU cores | Slower training times |
| Memory Bandwidth | ~1 TB/s | ~50 GB/s | Significant bottleneck for matrix ops |
| Practical Limits | 100M-param model feasible | ~10M-param model maximum | Determines final-project scope |
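To make the “Practical Limits” row concrete, here is a back-of-envelope sketch. The ~16 bytes per parameter (fp32 weights, gradients, and two AdamW moments) is my own assumption; activations and framework overhead are ignored, so treat the numbers as rough floors, not measurements.

```python
# Rough training-memory estimate: fp32 weights + gradients + AdamW moments,
# ~16 bytes per parameter (an assumption, not a measured figure).
def training_bytes(n_params, bytes_per_param=16):
    return n_params * bytes_per_param

for n in (100e6, 10e6):
    gib = training_bytes(n) / 2**30
    print(f"{n/1e6:.0f}M params -> ~{gib:.2f} GiB for weights + optimizer state")

# 100M params -> ~1.49 GiB (fits in 16 GB VRAM with room for activations)
# 10M  params -> ~0.15 GiB (leaves headroom on an 8 GB shared-RAM laptop,
#                           where the OS and activations eat most of the rest)

# For memory-bound ops, runtime scales with bytes moved / bandwidth:
# ~1 TB/s vs ~50 GB/s from the table above is roughly a 20x gap,
# consistent with the 10-50x slowdown estimated below.
```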
If you proceed with the weaker laptop, we’ll need to adjust:
Phases 1–3: Completely feasible; these are algorithmic implementations.
Phase 4: We’ll target a smaller model (~10M parameters vs. 100M).
Training Time: Expect 10–50x slower execution.
Batch Size: Will need to be minimal (possibly 1).
The core concepts of computational graphs, backpropagation, and transformer mechanics remain identical. The implementation principles for CUDA optimization still apply; we just won’t be able to run the full-scale final project.
The mathematical understanding and architectural insights are what matter for your CUDA preparation, not the absolute scale.
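As a quick illustration of that hardware independence, here is scaled dot-product attention, the transformer’s core primitive, in plain NumPy. This is a minimal single-head sketch of mine, with no masking or batching; being pure NumPy, it runs the same on either machine, only at different speeds.

```python
import numpy as np

# Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq, seq) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(attention(Q, K, V).shape)  # (4, 8)
```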
Hardware Selection Rule:
Use only the Lenovo Legion 7 Pro (with RTX 4090ti) for all active development, experimentation, and implementation during the course.
The Dell (Debian 13, CPU-only) is reserved for Phase 5, when we conduct deployment analysis, latency profiling, and efficiency testing under CPU-only constraints. Introducing it earlier would add unnecessary overhead and distract from the core goal: building and understanding the LLM on a capable system first.
Therefore:
Phases 1–4: Lenovo only (GPU-enabled, Fedora 42).
Phase 5: Both systems—Lenovo for reference, Dell for efficiency validation.
This aligns with your dual-environment goal while maintaining focus on deep learning fundamentals during the build phase.