Policy Gradient

What Makes a Reward Model a Good Teacher? An Optimization Perspective

Implicit Bias of Policy Gradient in Linear Quadratic Control: Extrapolation to Unseen Initial States

Vanishing Gradients in Reinforcement Finetuning of Language Models