Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

Source: arXiv:2401.01335 | Authors: Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, Quanquan Gu | Published: 2024 | Topic: self-play fine-tuning, in which the model improves itself by distinguishing its own generations from human data


Abstract

This paper introduces Self-Play Fine-Tuning (SPIN), a method that enables large language models to improve themselves by playing a game against their own previous iterations. In this two-player framework, the main player (the new model) learns to distinguish between responses generated by the opponent player (the previous iteration) and responses written by humans. As training progresses, the opponent generates increasingly high-quality responses, raising the bar for the main player. The process converges when the model's output distribution becomes indistinguishable from the human data distribution. Remarkably, SPIN achieves performance comparable to or exceeding Direct Preference Optimization (DPO) methods that rely on GPT-4-generated preference data, all without requiring any additional human or AI annotations beyond the original supervised fine-tuning dataset.

摘要

本文介绍了自博弈微调(SPIN),一种使大语言模型通过与自身先前迭代进行博弈来自我提升的方法。在这个双人博弈框架中,主玩家(新模型)学习区分由对手玩家(先前迭代)生成的响应和人类编写的响应。随着训练的推进,对手生成越来越高质量的响应,从而提高了主玩家的门槛。当模型的输出分布与人类数据分布变得不可区分时,过程收敛。值得注意的是,SPIN 实现了与依赖 GPT-4 生成偏好数据的直接偏好优化(DPO)方法相当甚至超越的性能,而且除了原始监督微调数据集之外,不需要任何额外的人类或 AI 标注。


1. Introduction / 引言

The alignment of large language models with human preferences typically follows a two-stage pipeline: supervised fine-tuning (SFT) on instruction-response pairs, followed by reinforcement learning from human feedback (RLHF) or preference optimization. The second stage requires preference data — pairs of responses where one is labeled as better than the other. Collecting such data is expensive, and many recent methods use GPT-4 to generate synthetic preferences, introducing dependence on a proprietary model.

大语言模型与人类偏好的对齐通常遵循两阶段流程:在指令-响应对上进行监督微调(SFT),然后进行基于人类反馈的强化学习(RLHF)或偏好优化。第二阶段需要偏好数据——成对的响应,其中一个被标记为优于另一个。收集此类数据成本高昂,许多最新方法使用 GPT-4 生成合成偏好,从而引入了对专有模型的依赖。

SPIN asks a provocative question: can a model improve beyond SFT using only the same SFT dataset, with no preference labels at all? The answer turns out to be yes, through the mechanism of self-play. The key insight is that the gap between a model's own outputs and human-written responses provides an implicit preference signal. By iteratively training the model to close this gap, SPIN bootstraps alignment from a single source of human data.

SPIN 提出了一个引人深思的问题:模型能否仅使用相同的 SFT 数据集、完全不使用偏好标签就超越 SFT 的效果?答案是肯定的,通过自博弈机制实现。关键洞察在于,模型自身输出与人类编写的响应之间的差距提供了一个隐式的偏好信号。通过迭代训练模型来缩小这一差距,SPIN 从单一人类数据源引导出对齐效果。


2. The Self-Play Framework / 自博弈框架

SPIN formulates fine-tuning as a two-player game. At each iteration t, two roles are defined:

SPIN 将微调形式化为一个双人博弈。在每次迭代 t 中,定义两个角色:

The Opponent Player is the model from the previous iteration, denoted as the policy at time t. Its job is to generate responses to prompts from the training set. These generated responses serve as "fake" data in the training objective.

对手玩家 是来自上一次迭代的模型,记为时刻 t 的策略。它的任务是对训练集中的提示生成响应。这些生成的响应在训练目标中充当"伪造"数据。

The Main Player is the model being trained at iteration t+1. It must learn to assign higher probability to human-written responses and lower probability to responses generated by the opponent. The training objective is a classification-style loss that maximizes the log-likelihood gap between real (human) and fake (model-generated) responses.

主玩家 是在迭代 t+1 中被训练的模型。它必须学会为人类编写的响应分配更高的概率,为对手生成的响应分配更低的概率。训练目标是一种分类风格的损失函数,最大化真实(人类)响应和伪造(模型生成)响应之间的对数似然差距。
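To make the two roles concrete, here is a minimal sketch of how the self-play pairs for one iteration could be assembled. It assumes a Hugging Face-style causal LM and tokenizer passed in by the caller; the function name `build_spin_pairs` and the `prompt`/`response` field names are illustrative and not taken from the authors' released code.

```python
import torch

def build_spin_pairs(opponent, tokenizer, sft_dataset, max_new_tokens=512):
    """For each SFT example, pair the human response ("real") with a response
    sampled from the previous iteration's model ("fake")."""
    pairs = []
    for example in sft_dataset:  # each example: {"prompt": ..., "response": ...}
        inputs = tokenizer(example["prompt"], return_tensors="pt").to(opponent.device)
        with torch.no_grad():
            out = opponent.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
        # Keep only the newly generated tokens (the continuation after the prompt).
        y_fake = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        pairs.append({"prompt": example["prompt"],
                      "y_real": example["response"],  # human-written SFT response
                      "y_fake": y_fake})              # opponent-generated response
    return pairs
```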

The loss function is derived from an integral probability metric (IPM) objective, instantiated with the logistic loss. For each prompt x, the model sees the human response y_real and a response y_fake generated by the previous iteration, and is trained to widen the log-likelihood margin between them.

损失函数源自积分概率度量(IPM)式的目标,具体采用逻辑损失(logistic loss)形式。对于每个提示 x,模型看到人类响应 y_real 和由上一迭代生成的响应 y_fake,并通过训练来扩大两者之间的对数似然差距。
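A minimal PyTorch sketch of the per-example objective under the logistic loss. It assumes the sequence-level log-probabilities of each response under the current model (main player) and under the previous iteration (opponent) are already computed; `beta` stands in for the regularization weight (λ in the paper), and all names are illustrative.

```python
import torch.nn.functional as F

def spin_loss(logp_real_main, logp_real_opp, logp_fake_main, logp_fake_opp, beta=0.1):
    """SPIN objective with the logistic loss l(t) = log(1 + exp(-t)).

    logp_*_main: log p_theta(y | x) under the model being trained (main player)
    logp_*_opp:  log p_theta_t(y | x) under the previous iteration (opponent)
    "real" is the human-written response, "fake" the opponent-generated one.
    """
    real_margin = beta * (logp_real_main - logp_real_opp)
    fake_margin = beta * (logp_fake_main - logp_fake_opp)
    # -logsigmoid(x) = log(1 + exp(-x)); minimizing drives the real margin above
    # the fake margin, i.e. raises the likelihood of human responses relative to
    # the model's own previous-iteration generations.
    return -F.logsigmoid(real_margin - fake_margin).mean()
```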


3. Convergence Properties / 收敛特性

A remarkable theoretical property of SPIN is that it has a well-defined fixed point. The game reaches equilibrium when the model's output distribution exactly matches the human data distribution. At this point, the main player cannot distinguish between the opponent's outputs and human responses, because they are drawn from the same distribution. The loss reaches its minimum, and further iterations produce no additional improvement.

SPIN 一个显著的理论特性是它具有明确定义的不动点。当模型的输出分布恰好匹配人类数据分布时,博弈达到均衡。此时,主玩家无法区分对手的输出和人类响应,因为它们来自相同的分布。损失达到最小值,进一步的迭代不会产生额外的改进。
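As an informal numerical check of the fixed-point claim (not from the paper), the toy computation below uses a three-outcome "vocabulary" and assumes the opponent already matches the human distribution. The expected logistic loss is minimized, at log 2 ≈ 0.693, exactly when the main player also matches that distribution, so no further improvement is possible.

```python
import itertools
import math
import numpy as np

def expected_spin_loss(p_main, p_opp, p_data, beta=1.0):
    """Expected logistic SPIN loss on a finite outcome space:
    E_{y ~ p_data, y' ~ p_opp}[ log(1 + exp(-(m(y) - m(y')))) ],
    where m(y) = beta * (log p_main(y) - log p_opp(y))."""
    m = beta * (np.log(p_main) - np.log(p_opp))
    total = 0.0
    for y, y_prime in itertools.product(range(len(p_data)), repeat=2):
        total += p_data[y] * p_opp[y_prime] * math.log1p(math.exp(-(m[y] - m[y_prime])))
    return total

p_data = np.array([0.5, 0.3, 0.2])   # "human" distribution
p_opp = p_data.copy()                # opponent already at the fixed point

for p_main in [np.array([0.5, 0.3, 0.2]),    # main player matches the data
               np.array([0.6, 0.3, 0.1]),
               np.array([1/3, 1/3, 1/3])]:
    print(p_main, round(expected_spin_loss(p_main, p_opp, p_data), 4))
# The first candidate attains log 2 ~= 0.6931; the other two are larger, so at
# equilibrium the main player has no incentive to move away from p_data.
```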

In practice, convergence is observed within 3 to 4 iterations. The largest gains come from the first iteration (t=0 to t=1), with diminishing returns thereafter. This is consistent with the theoretical prediction: early iterations correct the most obvious distributional mismatches, while later iterations refine increasingly subtle differences.

在实践中,收敛在 3 到 4 次迭代内即可观察到。最大的收益来自第一次迭代(t=0 到 t=1),此后收益递减。这与理论预测一致:早期迭代修正最明显的分布不匹配,而后期迭代则精炼越来越微妙的差异。
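A schematic of the outer loop, reusing the sketches above; `finetune_to_separate` is a hypothetical placeholder for an ordinary training loop that minimizes `spin_loss` over the constructed pairs. The three-round schedule mirrors the observation that gains saturate after a few iterations.

```python
def run_spin(sft_model, tokenizer, sft_dataset, num_iterations=3):
    """Schematic SPIN driver: each round, the previous model acts as the opponent
    and the newly trained model becomes the next round's opponent."""
    opponent = sft_model  # iteration t = 0 starts from the SFT checkpoint
    for t in range(num_iterations):
        # Opponent generates synthetic ("fake") responses for every prompt.
        pairs = build_spin_pairs(opponent, tokenizer, sft_dataset)
        # Main player is trained to separate human responses from the fakes
        # (hypothetical helper standing in for a standard fine-tuning loop).
        main = finetune_to_separate(opponent, pairs)
        opponent = main  # the improved model raises the bar next round
    return opponent
```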


4. Experimental Results / 实验结果

The authors evaluate SPIN on the Zephyr model (a Mistral-7B fine-tuned variant) using the UltraChat dataset for SFT. Starting from the SFT checkpoint, they run multiple iterations of SPIN and evaluate on the HuggingFace Open LLM Leaderboard, MT-Bench, and additional benchmarks.

作者在 Zephyr 模型(Mistral-7B 的微调变体)上使用 UltraChat 数据集进行 SFT 来评估 SPIN。从 SFT 检查点开始,他们运行多次 SPIN 迭代,并在 HuggingFace Open LLM Leaderboard、MT-Bench 以及其他基准测试上进行评估。

The results show consistent improvement across iterations. On the Open LLM Leaderboard, the average score increases from 58.14 (SFT baseline) to 63.16 after iterative SPIN training, with the largest single gain coming from the first iteration. These improvements are achieved without any preference data — SPIN uses only the same SFT dataset that produced the baseline.

结果显示各次迭代均持续改善。在 Open LLM Leaderboard 上,平均分数经过多轮 SPIN 迭代后从 58.14(SFT 基线)提升至 63.16,其中第一次迭代带来的单次提升最大。这些改进完全不需要偏好数据——SPIN 仅使用产生基线的同一 SFT 数据集。

Crucially, SPIN matches or exceeds the performance of DPO trained with GPT-4-generated preference data on multiple benchmarks. This demonstrates that self-play can extract alignment signal that is comparable to external preference annotation, at zero additional data cost.

关键的是,SPIN 在多个基准测试上匹配甚至超越了使用 GPT-4 生成偏好数据训练的 DPO 的性能。这表明自博弈可以提取与外部偏好标注相当的对齐信号,且不需要额外的数据成本。


5. Why Self-Play Works / 为什么自博弈有效

The effectiveness of SPIN can be understood through the lens of distributional matching. The SFT model has learned to approximate the human response distribution, but imperfectly. The gap between the model's distribution and the human distribution contains information about which aspects of language generation the model has not yet mastered.

SPIN 的有效性可以通过分布匹配的视角来理解。SFT 模型已经学会了近似人类响应分布,但并不完美。模型分布与人类分布之间的差距包含了关于模型尚未掌握的语言生成方面的信息。

By generating responses and comparing them against human responses, the model creates a dynamic curriculum. Early iterations produce obviously different responses, providing a clear training signal. As the model improves, its outputs become more similar to human responses, and the training signal becomes more nuanced. This self-adjusting difficulty is a hallmark of self-play methods in game-playing AI, and SPIN successfully transfers this principle to language model alignment.

通过生成响应并将其与人类响应进行比较,模型创建了一个动态课程。早期迭代产生明显不同的响应,提供清晰的训练信号。随着模型的改进,其输出变得与人类响应更加相似,训练信号也变得更加细致。这种自调节难度是博弈 AI 中自博弈方法的标志,SPIN 成功地将这一原则迁移到语言模型对齐中。


6. Comparison with Other Methods / 与其他方法的比较

SPIN differs from DPO and RLHF in that it requires no preference pairs. DPO needs pairs of responses annotated as preferred or dispreferred, while RLHF requires a trained reward model. SPIN requires only a set of prompt-response pairs from human annotators, the same data used for SFT.

SPIN 与 DPO 和 RLHF 的不同之处在于它不需要偏好对。DPO 需要标注为优选或劣选的响应对,而 RLHF 需要训练好的奖励模型。SPIN 只需要来自人类标注者的提示-响应对集合,即与 SFT 使用的相同数据。
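For contrast, here is a sketch of the standard DPO loss (following Rafailov et al.); variable names are illustrative. Structurally it is the same logistic-margin objective as `spin_loss` above, but it requires an externally annotated (chosen, rejected) pair and a frozen reference model, whereas SPIN obtains both roles from the SFT data and the previous iteration.

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_chosen_ref, logp_rejected, logp_rejected_ref, beta=0.1):
    """Standard DPO loss: needs human- or GPT-4-labeled (chosen, rejected) pairs
    and a frozen reference policy."""
    chosen_margin = beta * (logp_chosen - logp_chosen_ref)
    rejected_margin = beta * (logp_rejected - logp_rejected_ref)
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# SPIN reuses this functional form with:
#   chosen          -> the human-written SFT response,
#   rejected        -> a response sampled from the previous SPIN iteration,
#   reference model -> that same previous iteration,
# so the "preference" signal comes entirely from the original SFT dataset.
```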

Compared to other self-improvement methods, SPIN is distinguished by its theoretical grounding. The convergence to the human data distribution provides a principled stopping criterion and a clear understanding of what the method optimizes. This stands in contrast to heuristic self-improvement approaches that lack convergence guarantees.

与其他自我改进方法相比,SPIN 的独特之处在于其理论基础。收敛到人类数据分布提供了有原则的停止准则,以及对方法优化目标的清晰理解。这与缺乏收敛保证的启发式自我改进方法形成对比。


7. Limitations and Future Directions / 局限性与未来方向

SPIN is bounded by the quality of the SFT dataset. Since the fixed point is the human data distribution, the model cannot improve beyond the quality of the human responses it trains on. If the SFT dataset contains errors or represents a limited range of capabilities, SPIN will converge to that limited distribution.

SPIN 受限于 SFT 数据集的质量。由于不动点是人类数据分布,模型无法超越其训练所用人类响应的质量。如果 SFT 数据集包含错误或代表有限范围的能力,SPIN 将收敛到该有限分布。

Additionally, the method requires generating a full set of responses at each iteration, which adds computational overhead. For very large datasets, this generation step can be costly. Future work could explore more efficient sampling strategies or adaptive iteration schedules.

此外,该方法需要在每次迭代中生成完整的响应集,这增加了计算开销。对于非常大的数据集,这个生成步骤可能成本高昂。未来的工作可以探索更高效的采样策略或自适应迭代调度。


8. Conclusion / 结论

SPIN demonstrates that self-play is a powerful paradigm for language model alignment. By framing fine-tuning as a game between a model and its previous iteration, SPIN extracts alignment signal from standard SFT data without any additional annotations. The method is theoretically principled, practically effective, and computationally reasonable. It shows that weak language models can be converted to strong ones through the simple mechanism of learning to distinguish their own outputs from human-written text — a compelling demonstration of self-improvement through self-awareness.

SPIN 证明了自博弈是语言模型对齐的一种强大范式。通过将微调构建为模型与其先前迭代之间的博弈,SPIN 从标准 SFT 数据中提取对齐信号,无需任何额外标注。该方法理论有据、实践有效且计算合理。它表明,弱语言模型可以通过学习区分自身输出与人类编写文本这一简单机制转变为强模型——这是通过自我意识实现自我改进的有力证明。