
Credit Assignment

vllbc, filed under Reasoning LLM
2025-08-05 · about 216 characters · estimated reading time: 1 minute

Many papers on credit assignment have appeared recently, so I have collected them here.

First Return, Entropy-Eliciting Explore

Good Learners Think Their Thinking: Generative PRM Makes Large Reasoning Model More Efficient Math Learner

Group Sequence Policy Optimization

Process Reinforcement through Implicit Rewards

RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents

VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

Group-in-Group Policy Optimization for LLM Agent Training

CAPO: Towards Enhancing LLM Reasoning through Verifiable Generative Credit Assignment

Beyond Policy Optimization: A Data Curation Flywheel for Sparse-Reward Long-Horizon Planning

GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy

GTPO: Trajectory-Based Policy Optimization in Large Language Models

Updated 2025-08-05
Tags: LLM, Reasoning