Surprise! Is the GRPO Used by DeepSeek-R1 Actually Unnecessary? PPO Is Enough for Large-Scale Reinforcement Learning Training


- Paper title: Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
- Paper: https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/blob/main/ORZ_paper.pdf
- Project: https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero
- Hugging Face: https://huggingface.co/Open-Reasoner-Zero












