<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: 张逸群</title>
    <description>The latest articles on Forem by 张逸群 (@zhangyiqun018).</description>
    <link>https://forem.com/zhangyiqun018</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2805106%2F9a9f8aae-5269-494c-8d66-072ac7b93a2a.jpeg</url>
      <title>Forem: 张逸群</title>
      <link>https://forem.com/zhangyiqun018</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/zhangyiqun018"/>
    <language>en</language>
    <item>
      <title>GRPO Pitfalls Record 2</title>
      <dc:creator>张逸群</dc:creator>
      <pubDate>Tue, 04 Feb 2025 00:44:27 +0000</pubDate>
      <link>https://forem.com/zhangyiqun018/grpocai-keng-ji-lu-2-4ee1</link>
      <guid>https://forem.com/zhangyiqun018/grpocai-keng-ji-lu-2-4ee1</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fndvy1jje53ll4ownbwym.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fndvy1jje53ll4ownbwym.png" alt="Image description" width="800" height="288"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This time I switched tasks, combining three datasets (MMLU-Pro, MELD, and MATH) under a single shared system prompt; the model is still Qwen2.5-0.5B-Instruct (small enough, and it needs no cold start).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PROMPT=(
    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. "
    "The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning "
    "process and answer are enclosed with &amp;lt;think&amp;gt; &amp;lt;/think&amp;gt; and &amp;lt;answer&amp;gt; &amp;lt;/answer&amp;gt; tags, respectively, i.e., "
    "&amp;lt;think&amp;gt; reasoning process here &amp;lt;/think&amp;gt; &amp;lt;answer&amp;gt; answer here &amp;lt;/answer&amp;gt;"
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
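As a minimal pure-Python sketch of the data mixing (with hypothetical toy rows standing in for the real MMLU-Pro, MELD, and MATH examples, which would come from `datasets.load_dataset`), merging the three sources under the shared system prompt might look like this:

```python
import random

# Shared system prompt (abbreviated here); all three tasks use the same one.
SYSTEM_PROMPT = "A conversation between User and Assistant. ..."

def to_rl_format(example):
    # Map every source dataset into one conversational schema:
    # a chat-style prompt plus the gold answer for the accuracy reward.
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["question"]},
        ],
        "answer": example["answer"],
    }

# Hypothetical toy rows standing in for the three real datasets.
mmlu_pro = [{"question": "Q1", "answer": "B"}]
meld = [{"question": "Q2", "answer": "joy"}]
math_rows = [{"question": "Q3", "answer": "42"}]

mixed = [to_rl_format(ex) for ex in mmlu_pro + meld + math_rows]
random.Random(0).shuffle(mixed)  # interleave tasks so batches are mixed
```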



&lt;p&gt;There are two reward functions: the first is &lt;code&gt;format_reward&lt;/code&gt;, which, as the name suggests, checks whether the output follows the required &lt;code&gt;&amp;lt;think&amp;gt; &amp;lt;/think&amp;gt; &amp;lt;answer&amp;gt; &amp;lt;/answer&amp;gt;&lt;/code&gt; format; the second is &lt;code&gt;accuracy_reward&lt;/code&gt;, which checks whether the output is correct.&lt;br&gt;
The training logs clearly show that format_reward jumped sharply at one point (could this be called a kind of emergence?), while accuracy_reward kept fluctuating widely with no clear upward trend. This suggests the model only raised its score by imitating the required format, without genuinely improving task performance through think-then-answer reasoning.&lt;br&gt;
I tested one problem and could see that the model is merely doing a clumsy imitation of the format, without having learned to actually think.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;think&amp;gt; 
First, we need to understand what's happening when we "double" a number. Doubling a number means multiplying it by 2, so if we have a number x, doubling it will be represented as 2x.
&amp;lt;/think&amp;gt;

&amp;lt;answer&amp;gt;Let's denote the number as x. According to the problem, doubling the number and then increasing it by 5 gives 15.

According to our formula, it would look like this: 
\[ 2x + 5 = 15 \]

Now we need to solve for x by first subtracting 5 from both sides of the equation:
\[ 2x + 5 - 5 = 15 - 5 \]
\[ 2x = 10 \]

Afterwards, we divide both sides by 2 to solve for x:
\[ \frac{2x}{2} = \frac{10}{2} \]
\[ x = 5 \]

Therefore, the number in question is 5.&amp;lt;/answer&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
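The two rewards described above can be sketched roughly as follows. This is a minimal illustration, not the exact training code; the signatures follow the completions-in, list-of-floats-out convention that trl's GRPOTrainer expects:

```python
import re

# Completion must be exactly <think>...</think> followed by <answer>...</answer>.
THINK_ANSWER_RE = re.compile(
    r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL
)

def format_reward(completions, **kwargs):
    """1.0 if a completion matches the <think>/<answer> template, else 0.0."""
    return [1.0 if THINK_ANSWER_RE.match(c.strip()) else 0.0 for c in completions]

def accuracy_reward(completions, answers, **kwargs):
    """1.0 if the text inside the <answer> tags equals the gold answer, else 0.0."""
    rewards = []
    for completion, gold in zip(completions, answers):
        m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        pred = m.group(1).strip() if m else ""
        rewards.append(1.0 if pred == gold.strip() else 0.0)
    return rewards
```

A format-only reward like this is easy to satisfy by pattern imitation, which is consistent with the behavior observed above.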



&lt;p&gt;After some analysis, there are two likely causes: (1) the model is too small; (2) the reward functions are poorly designed.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>GRPO Pitfalls Record</title>
      <dc:creator>张逸群</dc:creator>
      <pubDate>Sun, 02 Feb 2025 14:44:32 +0000</pubDate>
      <link>https://forem.com/zhangyiqun018/grpo-pitfalls-record-1130</link>
      <guid>https://forem.com/zhangyiqun018/grpo-pitfalls-record-1130</guid>
      <description>&lt;p&gt;I discovered that Hugging Face has adapted the core technology GRPO of DeepSeek-R1, so I decided to give it a try. I chose the ERC task (Emotion Recognition in Conversations) to see if a smaller model could be cold-start trained on a single task using reinforcement learning and improve task performance.&lt;/p&gt;

&lt;p&gt;First, this technology is very memory-intensive. I initially tried to train &lt;code&gt;gemma-2-2b&lt;/code&gt; and &lt;code&gt;qwen-2.5-3b-instruct&lt;/code&gt; using an A100-80G, but the memory was insufficient. &lt;/p&gt;

&lt;p&gt;After switching to &lt;code&gt;qwen-2.5-0.5b-instruct&lt;/code&gt;, the memory issue was resolved. Secondly, the inference speed is particularly slow because the same prompt needs to be repeatedly sampled during training. &lt;/p&gt;

&lt;p&gt;Fortunately, Hugging Face quickly adapted vLLM, improving efficiency. However, this brought new issues:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Using vLLM to assist GRPO training requires at least two GPUs, which actually increases the resource demand, merely shifting the inference load to a dedicated card.&lt;/li&gt;
&lt;li&gt;There was a persistent strange error &lt;code&gt;_assert_memory_footprint_increased_during_profiling&lt;/code&gt;. After checking the issues in trl, it seems that upgrading vLLM to version 0.7 is necessary to resolve it.
&lt;/li&gt;
&lt;/ol&gt;
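Under those pinned versions, a minimal trl-0.14-style training setup might look like the sketch below. The model id and hyperparameters are illustrative, `erc_reward` is a hypothetical reward function for the ERC task, and `train_dataset` is assumed to already hold chat-format prompts:

```python
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    output_dir="qwen2.5-0.5b-grpo-erc",
    num_generations=8,           # completions sampled per prompt
    max_completion_length=512,
    use_vllm=True,               # offload generation to vLLM...
    vllm_device="cuda:1",        # ...on a dedicated second GPU
    bf16=True,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=erc_reward,      # hypothetical reward for the ERC task
    args=training_args,
    train_dataset=train_dataset,  # assumed: prompts in conversational format
)
trainer.train()
```

Note how `use_vllm=True` plus `vllm_device` is exactly why at least two GPUs are needed: training occupies one card while generation runs on the other.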

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;datasets==3.0.1
trl==0.14.0
transformers==4.48.2
peft==0.14.0
accelerate==1.3.0
deepspeed==0.15.3
torch==2.5.1
vllm==0.7.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcqmy0qlhdbihnsxkorkw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcqmy0qlhdbihnsxkorkw.png" alt="wandb" width="800" height="328"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3tviix304lcqfaunzh6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3tviix304lcqfaunzh6.png" alt="gpus quote" width="800" height="318"&gt;&lt;/a&gt;&lt;/p&gt;


</description>
      <category>ai</category>
      <category>llm</category>
      <category>deepseek</category>
      <category>huggingface</category>
    </item>
  </channel>
</rss>
