Each DPO training example consists of a shared prompt, a "chosen" response, and a "rejected" response. Instead of processing the shared prompt twice, we combine the prompt and the pair of responses into a ...
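The excerpt is cut off before it says what the prompt and responses are combined into, so the following is only a sketch of one possible reading, not the author's method. It assumes the three pieces are packed into a single sequence `[prompt | chosen | rejected]` with an attention mask that lets both responses see the prompt but hides the chosen response from the rejected one. The function names `packed_dpo_mask` and `packed_position_ids` are hypothetical and exist only for illustration.

```python
# Hypothetical sketch: pack [prompt | chosen | rejected] into one sequence so
# the prompt is processed once. This layout is an assumption, not taken from
# the (truncated) source text.

def packed_dpo_mask(prompt_len, chosen_len, rejected_len):
    """Return a boolean mask where mask[i][j] is True if position i may attend to j."""
    total = prompt_len + chosen_len + rejected_len
    chosen_start = prompt_len
    rejected_start = prompt_len + chosen_len
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(i + 1):  # causal: only the current and earlier positions
            # A rejected-response token must not see any chosen-response token.
            if i >= rejected_start and chosen_start <= j < rejected_start:
                continue
            mask[i][j] = True
    return mask


def packed_position_ids(prompt_len, chosen_len, rejected_len):
    """Position ids that restart after the prompt for the rejected response."""
    prompt = list(range(prompt_len))
    chosen = list(range(prompt_len, prompt_len + chosen_len))
    rejected = list(range(prompt_len, prompt_len + rejected_len))
    return prompt + chosen + rejected


if __name__ == "__main__":
    for row in packed_dpo_mask(2, 2, 2):
        print("".join("x" if allowed else "." for allowed in row))
    print(packed_position_ids(2, 2, 2))
```

Under this assumed layout each response is scored as if it directly followed the prompt, which is why the position ids for the rejected response restart at `prompt_len` rather than continuing after the chosen response.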