From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning
Source: arXiv:2308.12032 · Authors: Ming Li, Yong Zhang, Zhitao Li, Jiuhai Chen, Lichang Chen, Ning Cheng, Jianzong Wang, Tianyi Zhou, Jing Xiao · Venue: NAACL 2024 · Topic: perplexity-based self-guided data selection that uses the model itself to filter its training data
Abstract
Instruction tuning has become a critical step in aligning large language models (LLMs) with human preferences. However, the quality of instruction-tuning datasets varies enormously, and naively training on all available data often introduces noise that degrades model performance. This paper proposes a self-guided data selection method called Cherry LLM, which enables a language model to autonomously identify the most valuable training samples from a large instruction-tuning corpus. The core innovation is the Instruction-Following Difficulty (IFD) metric, a perplexity-based score that measures how much a given sample challenges the model relative to its current capabilities. By selecting only the top-scoring "cherry" samples, the authors demonstrate that training on as little as 5% of the Alpaca dataset can outperform training on the full dataset, eliminating the need for expensive external model-based curation.
1. Introduction
The explosion of instruction-tuning datasets has created a paradox: more data does not always mean better performance. Datasets like Alpaca, generated by prompting an OpenAI model (text-davinci-003 in Alpaca's case), inevitably contain low-quality, redundant, or even contradictory samples. Previous approaches to data curation rely on external strong models (such as GPT-4) to score or filter data, which is both expensive and introduces a dependency on proprietary systems. The fundamental question this paper addresses is: can a model serve as its own data curator?
The authors draw inspiration from a simple observation: if a model already "knows" how to respond to a certain instruction, training on that sample provides diminishing returns. Conversely, samples where the model struggles — where there is a large gap between what the model would produce and the target response — represent the highest-value training signal. This insight leads directly to the IFD metric.
2. The IFD Metric
The Instruction-Following Difficulty score is defined as the ratio of two perplexities. First, the authors compute the conditional perplexity of the response given the instruction, PPL(A | Q) — this measures how difficult it is for the model to generate the expected answer when it has the instruction as context. Second, they compute the perplexity of the response alone, PPL(A) — this captures the intrinsic linguistic complexity of the response text. The IFD score is the ratio IFD(Q, A) = PPL(A | Q) / PPL(A).
A high IFD score indicates that the model finds the instruction-response mapping particularly challenging — these are the "cherry" samples. A low IFD score suggests the model can already handle the instruction well, making the sample less valuable for training. Crucially, this metric is entirely self-referential: it uses only the model's own perplexity estimates and requires no external judge.
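The two perplexities combine into the IFD score in just a few lines. A minimal sketch, assuming we already have the per-token log-probabilities of the response from the model (the `logprobs_*` arguments are placeholders for whatever scoring interface produces them):

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def ifd_score(logprobs_answer_given_q, logprobs_answer_alone):
    """IFD(Q, A) = PPL(A | Q) / PPL(A).

    Scores near or above 1 mean the instruction barely helps the model
    predict the response; higher scores mark harder "cherry" samples.
    """
    return perplexity(logprobs_answer_given_q) / perplexity(logprobs_answer_alone)
```

In practice both log-probability lists would come from two forward passes of the same model, one with the instruction prepended and one without.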
3. Three-Phase Training Pipeline
The Cherry LLM method operates in three distinct phases:
Phase 1: Brief Pre-Training. The base model is fine-tuned on a small random subset of approximately 1,000 samples from the instruction-tuning dataset. This brief exposure gives the model enough instruction-following capability to produce meaningful perplexity estimates, without overfitting to any particular data distribution. The resulting model is called the "pre-experienced" model.
Phase 2: IFD Scoring and Selection. The pre-experienced model scores every sample in the full dataset using the IFD metric. Samples are ranked by their IFD scores, and only the top percentage (e.g., 5% or 10%) is retained. These high-IFD samples form the "cherry" dataset — the subset that provides the most learning signal for the model.
Phase 3: Full Training on Cherry Data. A fresh copy of the base model — starting again from the original pretrained weights, not from the pre-experienced checkpoint — is fine-tuned on the cherry dataset. This ensures that the final model benefits from a clean, high-quality training signal without any artifacts from the brief pre-training phase.
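Once Phase 2 has produced one IFD score per sample, the ranking-and-selection step reduces to a sort. The helper below is an illustrative sketch (not the authors' released code), assuming parallel lists of samples and scores:

```python
def select_cherries(samples, ifd_scores, keep_ratio=0.05):
    """Keep the top `keep_ratio` fraction of samples by IFD score.

    `samples` and `ifd_scores` are parallel lists; returns the retained
    "cherry" subset that Phase 3 trains on.
    """
    k = max(1, int(len(samples) * keep_ratio))
    ranked = sorted(range(len(samples)), key=lambda i: ifd_scores[i], reverse=True)
    return [samples[i] for i in ranked[:k]]
```

With the paper's Alpaca setting (52,000 samples, `keep_ratio=0.05`), this yields the 2,600-sample cherry subset used in Phase 3.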
4. Experimental Results
The results are striking. On the Alpaca dataset with 52,000 samples, training on just 2,600 cherry samples (5%) produces a model that outperforms the model trained on all 52,000 samples across multiple benchmarks. This holds for both LLaMA-7B and LLaMA-13B base models. The cherry-selected data consistently achieves higher scores on MT-Bench, AlpacaEval, and other instruction-following evaluation suites.
Furthermore, the authors compare their approach against random selection, perplexity-only selection, and external model-based filtering. The IFD-based selection consistently outperforms all baselines, demonstrating that the ratio formulation captures meaningful information beyond raw perplexity.
5. Analysis and Insights
A key finding is that the cherry samples are not simply the "hardest" samples in an absolute sense. Very high perplexity samples are often noisy or malformed. The IFD ratio normalizes for response complexity, filtering out samples that are difficult purely because of linguistic complexity rather than instruction-following challenge. This normalization is what makes the metric robust.
The paper also shows that the 1,000-sample pre-training phase is sufficient for reliable IFD estimation. Using more samples in the pre-training phase yields marginal improvements but increases computational cost. The method is therefore highly efficient: the total compute budget is dominated by the final training phase, which operates on a much smaller dataset.
6. Broader Implications
Cherry LLM demonstrates a fundamental principle for self-improving systems: a model's own uncertainty signal contains rich information about data quality. Rather than relying on human annotators or stronger external models, the model itself can identify what it needs to learn. This has profound implications for scaling instruction tuning — as datasets grow into the millions of samples, automated self-guided curation becomes essential.
The approach is also composable: IFD scoring can be applied iteratively, with each round of training producing a more refined model that can better identify the next set of high-value samples. This iterative refinement connects Cherry LLM to the broader theme of self-improving agents, where each cycle of self-evaluation and re-training leads to compounding gains.
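The iterative variant can be sketched as a simple loop. Here `fine_tune` and `score_ifd` are hypothetical callables standing in for the actual training and scoring steps, not part of the paper's released code:

```python
def iterative_cherry_tuning(base_model, dataset, fine_tune, score_ifd,
                            rounds=2, keep_ratio=0.05):
    """Repeatedly score with the latest model, keep the top IFD fraction,
    and retrain a fresh copy of the base model on it.

    `fine_tune(base, data)` and `score_ifd(model, sample)` are assumed
    interfaces supplied by the caller.
    """
    model = base_model
    for _ in range(rounds):
        scores = [score_ifd(model, s) for s in dataset]
        k = max(1, int(len(dataset) * keep_ratio))
        ranked = sorted(range(len(dataset)), key=lambda i: scores[i], reverse=True)
        cherries = [dataset[i] for i in ranked[:k]]
        model = fine_tune(base_model, cherries)  # fresh copy each round
    return model
```

Each round mirrors the three-phase pipeline: the model from the previous round plays the role of the pre-experienced scorer, while training always restarts from the base weights.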
7. Limitations
The method assumes that the initial base model has sufficient language understanding to produce meaningful perplexity estimates. For very small or poorly pre-trained models, the IFD scores may be unreliable. Additionally, the approach is designed for instruction-tuning datasets and may not directly transfer to other training paradigms such as RLHF or constitutional AI training.
8. Conclusion
Cherry LLM provides an elegant and practical solution to the data quality problem in instruction tuning. By leveraging the model's own perplexity signals through the IFD metric, it achieves superior performance with a fraction of the training data. The method is simple to implement, computationally efficient, and requires no external dependencies. It represents an important step toward truly self-guided learning, where models take an active role in curating their own training experience.