Training Techniques
Frontier training recipes — reasoning, alignment, and efficient fine-tuning
Open-source, reproducible methods that moved the frontier: GRPO-based reasoning RL (DeepSeek-R1), DPO as a replacement for RLHF's reward model + PPO, and parameter-efficient adaptation via LoRA.
Generative Adversarial Networks
The adversarial training paradigm: a generator and a discriminator trained against each other. 70,000+ citations. Its learned-evaluator idea echoes in RLHF reward models, self-play, and 2026 agent harnesses.
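A minimal sketch of the adversarial loop in PyTorch on a toy 1-D Gaussian: the discriminator learns to separate real from generated samples while the generator learns to fool it. Network sizes, optimizer settings, and the toy data are illustrative choices, not the paper's setup.

```python
import torch
import torch.nn as nn

# Generator maps noise to samples; discriminator maps samples to real/fake logits.
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    real = torch.randn(64, 1) * 0.5 + 3.0        # toy "real" data: N(3, 0.5)
    fake = G(torch.randn(64, 8))

    # Discriminator step: push real toward label 1, generated toward label 0.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: non-saturating loss, try to make D call fakes real.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print(G(torch.randn(256, 8)).mean().item())      # drifts toward ~3.0 as G matches the data
```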
Scaling Laws for Neural Language Models
Loss falls as a power law in each of parameters, data, and compute, with trends holding over seven orders of magnitude. Turned training budgets from art into calculation. Enabled the GPT-3/4 investments.
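A small sketch of the single-variable power laws, with exponents and constants approximately as reported in the paper; treat the exact numbers as illustrative rather than authoritative.

```python
# Test loss falls as a power of model size or data when the other factor is
# not the bottleneck. Constants below are roughly the paper's fitted values.

def loss_vs_params(n_params, n_c=8.8e13, alpha_n=0.076):
    """L(N) ~ (N_c / N)^alpha_N, non-embedding parameters, data not limiting."""
    return (n_c / n_params) ** alpha_n

def loss_vs_tokens(n_tokens, d_c=5.4e13, alpha_d=0.095):
    """L(D) ~ (D_c / D)^alpha_D, training tokens, model size not limiting."""
    return (d_c / n_tokens) ** alpha_d

# Each 10x in parameters buys a roughly constant multiplicative drop in loss:
for n in [1e8, 1e9, 1e10, 1e11]:
    print(f"N = {n:.0e}  predicted loss ~ {loss_vs_params(n):.2f}")
```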
LoRA
Low-rank adaptation — trains ~10,000× fewer parameters, no inference overhead, matches full fine-tune quality. The PEFT default.
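A minimal LoRA linear layer sketch in PyTorch: the pretrained weight stays frozen and only a low-rank update (alpha/r) * B @ A is trained. The rank r = 8 and alpha = 16 are illustrative defaults, not values the paper prescribes.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a pretrained nn.Linear with a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                   # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # Gaussian init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # zero init: update starts at 0
        self.scale = alpha / r

    def forward(self, x):
        # Equivalent to (W + scale * B @ A) x, so the update can be merged into
        # W after training, which is why there is no inference overhead.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(1024, 1024))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable: {trainable} vs frozen: {1024 * 1024}")   # 16,384 vs ~1.05M
```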
Chinchilla (Compute-Optimal LLMs)
Overturned Kaplan: params and tokens should scale equally (not params faster). Chinchilla 70B/1.4T beats Gopher 280B/300B at the same compute. Rule of thumb: tokens ≈ 20× params.
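A back-of-the-envelope sketch of the allocation, using the common approximations C ≈ 6ND training FLOPs and D ≈ 20N tokens at the optimum; the helper below and the exact constant are illustrative, not the paper's full parametric fit.

```python
def compute_optimal(flops_budget, tokens_per_param=20.0):
    """Split a FLOP budget C into params N and tokens D with D = 20 N and C = 6 N D."""
    # Substituting D = 20 N into C = 6 N D gives N = sqrt(C / 120).
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Roughly Chinchilla's own budget: 6 * 70e9 * 1.4e12 ~ 5.9e23 FLOPs.
n, d = compute_optimal(6 * 70e9 * 1.4e12)
print(f"params ~ {n / 1e9:.0f}B, tokens ~ {d / 1e12:.2f}T")   # ~70B params, ~1.40T tokens
```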
Direct Preference Optimization
Replaces RLHF's reward model + PPO with a single cross-entropy loss over preference pairs. The default alignment method in major open-model recipes.
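A sketch of the objective: a binary cross-entropy over preference pairs where the implicit reward is the policy-to-reference log-probability ratio. The beta = 0.1 setting and the toy log-probabilities are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Inputs are summed token log-probs of each response under policy/reference."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy pair: the policy already prefers the chosen response a little.
loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),
                torch.tensor([-11.0]), torch.tensor([-11.0]))
print(loss.item())   # ~0.60; falls as the preference margin grows
```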
DeepSeek-R1
Reasoning RL via GRPO (R1-Zero trained with pure RL, no SFT). An o1-level open-source model (AIME 2024 79.8%, MATH-500 97.3%). Reshaped the open vs closed balance.
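A sketch of GRPO's group-relative advantage: several responses are sampled per prompt, and each response's advantage is its reward normalized by the group mean and standard deviation, removing the need for a separate value network. The rewards below are toy numbers standing in for a rule-based verifier.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6):
    """rewards: (num_prompts, group_size), e.g. 1.0 if the final answer verifies, else 0.0."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# One prompt, four sampled solutions, two verified correct:
adv = group_relative_advantages(torch.tensor([[1.0, 0.0, 1.0, 0.0]]))
print(adv)   # correct samples get positive advantage, incorrect ones negative
```

These advantages then weight a PPO-style clipped policy-gradient update on the sampled tokens.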