Alignment

What Makes a Reward Model a Good Teacher? An Optimization Perspective

Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization