Credit Assignment

vllbc · Filed under Reasoning LLM · 2025-08-05

A lot of papers on credit assignment have appeared recently, so here is a list to keep track of them:

- First Return, Entropy-Eliciting Explore
- Good Learners Think Their Thinking: Generative PRM Makes Large Reasoning Model More Efficient Math Learner
- Group Sequence Policy Optimization
- Process Reinforcement through Implicit Rewards
- RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents
- VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks
- Group-in-Group Policy Optimization for LLM Agent Training
- CAPO: Towards Enhancing LLM Reasoning through Verifiable Generative Credit Assignment
- Beyond Policy Optimization: A Data Curation Flywheel for Sparse-Reward Long-Horizon Planning
- GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy
- GTPO: Trajectory-Based Policy Optimization in Large Language Models