Training Techniques
Frontier training recipes — reasoning, alignment, and efficient fine-tuning
Open-source, reproducible methods that moved the frontier: GRPO-based reasoning RL (DeepSeek-R1), DPO as a replacement for RLHF's reward model + PPO, and parameter-efficient adaptation via LoRA.
Generative Adversarial Networks
The adversarial training paradigm: a generator and a discriminator trained against each other. 70,000+ citations. Its learned-evaluator idea echoes in RLHF reward models, self-play, and 2026 agent harnesses.
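A minimal sketch of the adversarial loop in PyTorch on a toy 1-D Gaussian: the discriminator learns to separate real from generated samples while the generator learns to fool it. Network sizes, optimizer settings, and the toy data are illustrative choices, not the paper's setup.

```python
import torch
import torch.nn as nn

# Generator maps noise to samples; discriminator maps samples to real/fake logits.
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    real = torch.randn(64, 1) * 0.5 + 3.0        # toy "real" data: N(3, 0.5)
    fake = G(torch.randn(64, 8))

    # Discriminator step: push real toward label 1, generated toward label 0.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: non-saturating loss, try to make D call fakes real.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print(G(torch.randn(256, 8)).mean().item())      # drifts toward ~3.0 as G matches the data
```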
Scaling Laws for Neural Language Models
Loss falls as a power law in each of parameters, data, and compute, with trends holding over seven orders of magnitude. Turned training budgets from art into calculation. Enabled the GPT-3/4 investments.
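A small sketch of the single-variable power laws, with exponents and constants approximately as reported in the paper; treat the exact numbers as illustrative rather than authoritative.

```python
# Test loss falls as a power of model size or data when the other factor is
# not the bottleneck. Constants below are roughly the paper's fitted values.

def loss_vs_params(n_params, n_c=8.8e13, alpha_n=0.076):
    """L(N) ~ (N_c / N)^alpha_N, non-embedding parameters, data not limiting."""
    return (n_c / n_params) ** alpha_n

def loss_vs_tokens(n_tokens, d_c=5.4e13, alpha_d=0.095):
    """L(D) ~ (D_c / D)^alpha_D, training tokens, model size not limiting."""
    return (d_c / n_tokens) ** alpha_d

# Each 10x in parameters buys a roughly constant multiplicative drop in loss:
for n in [1e8, 1e9, 1e10, 1e11]:
    print(f"N = {n:.0e}  predicted loss ~ {loss_vs_params(n):.2f}")
```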
LoRA
Low-rank adaptation — trains ~10,000× fewer parameters, no inference overhead, matches full fine-tune quality. The PEFT default.
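A minimal LoRA linear layer sketch in PyTorch: the pretrained weight stays frozen and only a low-rank update (alpha/r) * B @ A is trained. The rank r = 8 and alpha = 16 are illustrative defaults, not values the paper prescribes.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a pretrained nn.Linear with a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                   # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # Gaussian init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # zero init: update starts at 0
        self.scale = alpha / r

    def forward(self, x):
        # Equivalent to (W + scale * B @ A) x, so the update can be merged into
        # W after training, which is why there is no inference overhead.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(1024, 1024))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable: {trainable} vs frozen: {1024 * 1024}")   # 16,384 vs ~1.05M
```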
Chinchilla (Compute-Optimal LLMs)
Overturned Kaplan: params and tokens should scale equally (not params faster). Chinchilla 70B/1.4T beats Gopher 280B/300B at the same compute. Rule of thumb: tokens ≈ 20× params.
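A back-of-the-envelope sketch of the allocation, using the common approximations C ≈ 6ND training FLOPs and D ≈ 20N tokens at the optimum; the helper below and the exact constant are illustrative, not the paper's full parametric fit.

```python
def compute_optimal(flops_budget, tokens_per_param=20.0):
    """Split a FLOP budget C into params N and tokens D with D = 20 N and C = 6 N D."""
    # Substituting D = 20 N into C = 6 N D gives N = sqrt(C / 120).
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Roughly Chinchilla's own budget: 6 * 70e9 * 1.4e12 ~ 5.9e23 FLOPs.
n, d = compute_optimal(6 * 70e9 * 1.4e12)
print(f"params ~ {n / 1e9:.0f}B, tokens ~ {d / 1e12:.2f}T")   # ~70B params, ~1.40T tokens
```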
Direct Preference Optimization
Replaces RLHF's reward model + PPO with a single cross-entropy loss over preference pairs. The default alignment method in major open-model recipes.
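A sketch of the objective: a binary cross-entropy over preference pairs where the implicit reward is the policy-to-reference log-probability ratio. The beta = 0.1 setting and the toy log-probabilities are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Inputs are summed token log-probs of each response under policy/reference."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy pair: the policy already prefers the chosen response a little.
loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),
                torch.tensor([-11.0]), torch.tensor([-11.0]))
print(loss.item())   # ~0.60; falls as the preference margin grows
```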
DeepSeek-R1
Reasoning RL via GRPO (R1-Zero trained with pure RL, no SFT). An o1-level open-source model (AIME 2024 79.8%, MATH-500 97.3%). Reshaped the open vs closed balance.
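A sketch of GRPO's group-relative advantage: several responses are sampled per prompt, and each response's advantage is its reward normalized by the group mean and standard deviation, removing the need for a separate value network. The rewards below are toy numbers standing in for a rule-based verifier.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6):
    """rewards: (num_prompts, group_size), e.g. 1.0 if the final answer verifies, else 0.0."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# One prompt, four sampled solutions, two verified correct:
adv = group_relative_advantages(torch.tensor([[1.0, 0.0, 1.0, 0.0]]))
print(adv)   # correct samples get positive advantage, incorrect ones negative
```

These advantages then weight a PPO-style clipped policy-gradient update on the sampled tokens.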