Efficient Long CoT Reasoning in Small Language Models

OpenAI o1, QwQ, and DeepSeek-R1 scale up the length of CoT reasoning, which significantly improves reasoning performance.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular,...

https://arxiv.org/abs/2201.11903

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Chain-of-thought prompting combined with pre-trained large language models has achieved encouraging results on complex reasoning tasks. In this paper, we propose a new decoding strategy,...

https://arxiv.org/abs/2203.11171

CoT prompting (Wei et al., 2022b; Wang et al., 2023a; Kojima et al., 2022)

New challenges for SLMs

Long CoT models also introduce new challenges for small language models (SLMs) with about 7B parameters, which often rely on distillation to learn such long CoT reasoning (Guo et al., 2025; Hugging Face, 2025).

open-r1

huggingface • Updated Jun 14, 2025

Redundant reasoning steps

Generated long CoT traces often contain many redundant reasoning steps, even for very simple questions (Chen et al., 2025; Aggarwal and Welleck, 2025; Yang et al., 2025; Zhang et al., 2025).

L1: Controlling How Long A Reasoning Model Thinks With...

Reasoning language models have shown an uncanny ability to improve performance at test-time by ``thinking longer''-that is, by generating longer chain-of-thought sequences and hence using more...

https://arxiv.org/abs/2503.04697

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

The remarkable performance of models like the OpenAI o1 can be attributed to their ability to emulate human-like long-time thinking during inference. These models employ extended chain-of-thought...

https://arxiv.org/abs/2412.21187

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via...

We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning...

https://arxiv.org/abs/2501.12948

Those redundant reasoning steps may not only add unnecessary computation at test time, but also affect reasoning performance (Sui et al., 2025; Aggarwal and Welleck, 2025; Wu et al., 2025; Marjanović et al., 2025).

In short, the redundant steps waste computation, can affect reasoning performance, and also affect the distillation process.

Stop Overthinking: A Survey on Efficient Reasoning for Large...

Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks. Recent advancements in Large Reasoning Models (LRMs), such as OpenAI o1 and DeepSeek-R1, have further...

https://arxiv.org/abs/2503.16419

Effectively Controlling Reasoning Models through Thinking Intervention

Reasoning-enhanced large language models (LLMs) explicitly generate intermediate reasoning steps prior to generating final answers, helping the model excel in complex problem-solving. In this...

https://arxiv.org/abs/2503.24370

DeepSeek-R1 Thoughtology: Let's think about LLM Reasoning

Large Reasoning Models like DeepSeek-R1 mark a fundamental shift in how LLMs approach complex problems. Instead of directly producing an answer for a given input, DeepSeek-R1 creates detailed...

https://arxiv.org/abs/2504.07128

How to address this issue

Existing approaches are mostly heuristic:

Select the minimum reasoning length that still yields the correct final answer (Chen et al., 2025), design length-based rewards for reinforcement learning (Aggarwal and Welleck, 2025; Yi and Wang, 2025; Yang et al., 2025), or use advanced prompting methods (Wu et al., 2025; Munkhbat et al., 2025; Xia et al., 2025; Han et al., 2025; Nayab et al., 2025).

These methods either rely on redesigning the reward or do not consider the target SLM's reasoning ability when selecting long CoT training data.

💡

How can high-quality CoT traces generated by large reasoning models be efficiently distilled into SLMs?

Contributions

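Approach:

Long CoT traces contain redundant parts → prune the redundant parts of the long CoT (binary cutting)

An SLM can reach the correct answer from only part of the chain of thought, and different SLMs need different segments → strengthen binary cutting with an on-policy distillation method that selects the partial segments suited to the target SLM

Fine-tune the SLM on the tailored CoT data (SFT + DPO)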

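Contributions:

Observe that the unnecessary reasoning steps in long CoT traces from large reasoning models are harmful to distillation

Propose a simple and efficient method for pruning redundant reasoning steps

Show experimentally that the method lets SLMs reason more efficiently, generating fewer redundant steps while maintaining reasoning performance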

Related Work

Redundancy in long chains of thought

The overthinking problem leads to inefficiency and can even hurt accuracy

Stop Overthinking: A Survey on Efficient Reasoning for Large...

Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks. Recent advancements in Large Reasoning Models (LRMs), such as OpenAI o1 and DeepSeek-R1, have further...

https://arxiv.org/abs/2503.16419

Effectively Controlling Reasoning Models through Thinking Intervention

Reasoning-enhanced large language models (LLMs) explicitly generate intermediate reasoning steps prior to generating final answers, helping the model excel in complex problem-solving. In this...

https://arxiv.org/abs/2503.24370

DeepSeek-R1 Thoughtology: Let's think about LLM Reasoning

Large Reasoning Models like DeepSeek-R1 mark a fundamental shift in how LLMs approach complex problems. Instead of directly producing an answer for a given input, DeepSeek-R1 creates detailed...

https://arxiv.org/abs/2504.07128

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

The remarkable performance of models like the OpenAI o1 can be attributed to their ability to emulate human-like long-time thinking during inference. These models employ extended chain-of-thought...

https://arxiv.org/abs/2412.21187

Solutions

Heuristic methods: truncate the CoT to the shortest prefix that still yields the correct answer

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

The remarkable performance of models like the OpenAI o1 can be attributed to their ability to emulate human-like long-time thinking during inference. These models employ extended chain-of-thought...

https://arxiv.org/abs/2412.21187

Reinforcement learning methods: add a length penalty to the reward function

L1: Controlling How Long A Reasoning Model Thinks With...

Reasoning language models have shown an uncanny ability to improve performance at test-time by ``thinking longer''-that is, by generating longer chain-of-thought sequences and hence using more...

https://arxiv.org/abs/2503.04697

Think When You Need: Self-Adaptive Chain-of-Thought Learning

Chain of Thought (CoT) reasoning enhances language models' performance but often leads to inefficient "overthinking" on simple problems. We identify that existing approaches directly penalizing...

https://arxiv.org/abs/2504.03234

ShorterBetter: Guiding Reasoning Models to Find Optimal Inference...

Recent models such as OpenAI o1 and DeepSeek-R1 have demonstrated strong performance on reasoning-intensive tasks by generating extended Chain-of-Thought (CoT) traces. While longer reasoning helps...

https://arxiv.org/abs/2504.21370

Alternative prompting techniques:

Concise Thoughts: Impact of Output Length on LLM Reasoning and Cost

Today's large language models (LLMs) can solve challenging question-answering tasks, and prompt engineering techniques, such as chain-of-thought (CoT), have gained attention for enhancing the...

https://arxiv.org/abs/2407.19825

Token-Budget-Aware LLM Reasoning

Reasoning is critical for large language models (LLMs) to excel in a wide range of tasks. While methods like Chain-of-Thought (CoT) reasoning and enhance LLM performance by decomposing problems...

https://arxiv.org/abs/2412.18547

TokenSkip: Controllable Chain-of-Thought Compression in LLMs

Chain-of-Thought (CoT) has been proven effective in enhancing the reasoning capabilities of large language models (LLMs). Recent advancements, such as OpenAI's o1 and DeepSeek-R1, suggest that...

https://arxiv.org/abs/2502.12067

Effectively Controlling Reasoning Models through Thinking Intervention

Reasoning-enhanced large language models (LLMs) explicitly generate intermediate reasoning steps prior to generating final answers, helping the model excel in complex problem-solving. In this...

https://arxiv.org/abs/2503.24370

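The methods above overlook that different SLMs have different reasoning abilities; this paper's method tailors the CoT data to the target SLM for distillation.

Model distillation

Knowledge distillation: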

Distilling the Knowledge in a Neural Network

A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately,...

https://arxiv.org/abs/1503.02531

A Survey on Knowledge Distillation of Large Language Models

In the era of Large Language Models (LLMs), Knowledge Distillation (KD) emerges as a pivotal methodology for transferring advanced capabilities from leading proprietary LLMs, such as GPT-4, to...

https://arxiv.org/abs/2402.13116

CoT distillation

Large Language Models Are Reasoning Teachers

Namgyu Ho, Laura Schmid, Se-Young Yun. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.

https://aclanthology.org/2023.acl-long.830/

Specializing Smaller Language Models towards Multi-Step Reasoning

The surprising ability of Large Language Models (LLMs) to perform well on complex reasoning with only few-shot chain-of-thought prompts is believed to emerge...

https://proceedings.mlr.press/v202/fu23d.html

Teaching Small Language Models to Reason

Lucie Charlotte Magister, Jonathan Mallinson, Jakub Adamek, Eric Malmi, Aliaksei Severyn. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2023.

https://aclanthology.org/2023.acl-short.151/

Democratizing Reasoning Ability: Tailored Learning from Large...

Large language models (LLMs) exhibit impressive emergent abilities in natural language processing, but their democratization is hindered due to huge computation requirements and closed-source...

https://arxiv.org/abs/2310.13332

Distilling Reasoning Capabilities into Smaller Language Models

Kumar Shridhar, Alessandro Stolfo, Mrinmaya Sachan. Findings of the Association for Computational Linguistics: ACL 2023. 2023.

https://aclanthology.org/2023.findings-acl.441/

The distillation work above overlooks the adverse effect of redundant and unnecessary reasoning on capacity-limited SLMs.

Method

Binary Cutting

Process the CoT text at the step level

Binary search plus backtracking → efficiently find the shortest prefix
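
The notes above can be sketched roughly as follows. This is a minimal illustration, not the paper's exact algorithm: the names `split_steps` and `shortest_sufficient_prefix` are hypothetical, splitting steps on blank lines is an assumption, and "backtracking" is interpreted here as backing off to a longer prefix whenever a cut turns out to be too aggressive. The search also assumes that sufficiency is roughly monotone in prefix length, which may not always hold in practice.

```python
from typing import Callable, List, Optional


def split_steps(cot: str) -> List[str]:
    """Split a CoT trace into steps (blank-line separation is an assumption)."""
    return [s.strip() for s in cot.split("\n\n") if s.strip()]


def shortest_sufficient_prefix(
    steps: List[str],
    is_sufficient: Callable[[List[str]], bool],
) -> Optional[List[str]]:
    """Binary-search the shortest prefix of `steps` that passes validation.

    `is_sufficient` is a caller-supplied validator (hypothetical interface).
    Returns None when even the full trace fails validation.
    """
    if not is_sufficient(steps):
        return None
    lo, hi = 0, len(steps)  # invariant: steps[:hi] is known to be sufficient
    while lo < hi:
        mid = (lo + hi) // 2
        if is_sufficient(steps[:mid]):
            hi = mid      # cut succeeded: try an even shorter prefix
        else:
            lo = mid + 1  # cut too aggressive: back off to a longer prefix
    return steps[:hi]
```

With n steps this needs on the order of log2(n) validator calls instead of n, which matters when each call is a forward pass of a model.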

On-Policy Validation

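Finding the shortest prefix requires a validation step; the paper uses on-policy validation for it.

Existing pruning methods (FCS, the First-Correct Solutions strategy) use a separate judge model for validation, which assumes a single authoritative criterion → this ignores the fact that different SLMs have very different reasoning biases and strengths. Instead, the target SLM itself serves as the judge:

This on-policy approach matches the pruned CoT to the ability of the target SLM being trained.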
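
One way to picture on-policy validation is a validator that asks the target SLM itself whether a truncated CoT is enough to reach the gold answer. The sketch below is only an assumption about how this could look: `generate` is a hypothetical callable wrapping the target SLM (prompt in, completion out), the prompt template is invented for illustration, and exact string match stands in for whatever answer checking the paper actually uses.

```python
from typing import Callable, List


def make_on_policy_validator(
    generate: Callable[[str], str],
    question: str,
    gold_answer: str,
) -> Callable[[List[str]], bool]:
    """Build a validator that uses the target SLM as its own judge.

    `generate` is a hypothetical wrapper around the SLM being distilled into.
    """

    def is_sufficient(prefix_steps: List[str]) -> bool:
        # Prompt layout is an illustrative assumption, not the paper's template.
        prompt = (
            f"Question: {question}\n"
            "Reasoning so far:\n" + "\n".join(prefix_steps) + "\n"
            "Final answer:"
        )
        prediction = generate(prompt)
        # Exact match after stripping whitespace; real answer checking
        # (e.g. math equivalence) would be more forgiving.
        return prediction.strip() == gold_answer.strip()

    return is_sufficient
```

Because the judge is the same model that will later be trained, a prefix is kept only if that particular SLM can finish the problem from it, so two different SLMs can end up with different pruned traces for the same question.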

❓

Why did prior work always use an additional large model as the judge model? What was the reasoning behind that choice?

Training

SFT + DPO

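Run SFT on the tailored (pruned) CoT data

Run DPO on a preference dataset in which the "good" response is the processed (pruned) CoT and the "bad" response is the original chain of thought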
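
The preference data described above could be assembled along these lines. This is a sketch under assumptions: the input field names (`question`, `pruned_cot`, `original_cot`) are hypothetical, the output uses the common prompt/chosen/rejected layout for preference data, and skipping traces that pruning left unchanged is my own choice rather than something stated in the paper.

```python
from typing import Dict, List


def build_preference_pairs(examples: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Turn (question, pruned CoT, original CoT) records into DPO pairs."""
    pairs = []
    for ex in examples:
        # A trace that pruning did not shorten carries no preference signal.
        if ex["pruned_cot"] == ex["original_cot"]:
            continue
        pairs.append(
            {
                "prompt": ex["question"],
                "chosen": ex["pruned_cot"],      # "good": pruned CoT
                "rejected": ex["original_cot"],  # "bad": original long CoT
            }
        )
    return pairs
```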

❓

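The two classes of responses differ markedly in length. Is the ability learned this way actually good, and why does it work? Does the model only learn to avoid redundant thinking?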

Experiment

Datasets

GSM8K
MATH
AIME

Models

Llama-3.1-8B-Instruct
Qwen2.5-7B-Instruct

Implementation Details

Training data: OpenR1-Math-220k and NuminaMath 1.5
3 epochs of SFT and 1 epoch of DPO
lr = 1e-6

We also noticed that single DPO training can also decrease the likelihood of the “good” response, thus we add the SFT loss with a weight of 0.3 into Eq. 3 for stable performance.
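
To make the combined objective concrete, here is a small per-example sketch. Eq. 3 is not reproduced in these notes, so the code assumes it is the standard DPO loss; `beta = 0.1` is an assumed default, the function names are hypothetical, and the SFT term is taken to be the negative log-likelihood of the "good" response. Only the 0.3 weight comes from the quoted sentence.

```python
import math


def neg_log_sigmoid(x: float) -> float:
    """Numerically stable -log(sigmoid(x))."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def dpo_with_sft_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,        # assumed value, not from the notes
    sft_weight: float = 0.3,  # weight quoted in the notes
) -> float:
    """Standard DPO loss plus a weighted SFT (NLL) term on the chosen response."""
    margin = (policy_logp_chosen - ref_logp_chosen) - (
        policy_logp_rejected - ref_logp_rejected
    )
    dpo = neg_log_sigmoid(beta * margin)
    sft = -policy_logp_chosen  # pushes the "good" likelihood up directly
    return dpo + sft_weight * sft
```

The DPO term only constrains the gap between the two responses, so it can be reduced while both likelihoods fall; the added SFT term penalizes a drop in the "good" response's likelihood, which is consistent with the stability motivation in the quote.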