Awesome Direct Preference Optimization


A curated list of the papers selected in our survey A Survey of Direct Preference Optimization.

If you find a missing paper or a possible mistake in our survey, please feel free to open an issue or submit a pull request here. I would be more than glad to receive your advice. Thanks!


A Taxonomy of Direct Preference Optimization

In this survey, we introduce a novel taxonomy that categorizes existing DPO works into four key dimensions based on different components of the DPO loss: data strategy, learning framework, constraint mechanism, and model property.

This taxonomy provides a systematic framework for understanding the methodological evolution of DPO and highlights the key distinctions between different variations.
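For quick reference, the standard DPO objective from the original DPO paper is reproduced below; under this taxonomy, its ingredients roughly map onto the four dimensions: the preference pairs drawn from the dataset (data strategy), how that dataset is generated and iterated on (learning framework), the reference policy and temperature that implicitly regularize the update (constraint mechanism), and the behavior of the policy being optimized (model property).

```math
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

Here $\sigma$ is the logistic function, $\pi_\theta$ the policy being trained, $\pi_{\mathrm{ref}}$ the frozen reference policy, $(x, y_w, y_l)$ a prompt with its preferred and dispreferred responses drawn from the preference dataset $\mathcal{D}$, and $\beta$ the temperature of the implicit KL constraint.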

Basic

  • arXiv RRHF: Rank responses to align language models with human feedback without tears
  • arXiv SLiC-HF: Sequence Likelihood Calibration with Human Feedback
  • arXiv Direct Preference Optimization: Your Language Model is Secretly a Reward Model
  • arXiv Preference Ranking Optimization for Human Alignment

Data Strategy

Data Quality - Heterogeneity

  • arXiv MallowsPO: Fine-Tune Your LLM with Preference Dispersions
  • arXiv Direct Preference Optimization With Unobserved Preference Heterogeneity
  • arXiv Group Robust Preference Optimization in Reward-free RLHF
  • arXiv No Preference Left Behind: Group Distributional Preference Optimization

Data Quality - Distinguishability

  • arXiv Direct Preference Optimization with an Offset
  • arXiv Enhancing Alignment using Curriculum Learning & Ranked Preferences
  • arXiv sDPO: Don't Use Your Data All at Once
  • arXiv Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model
  • arXiv Filtered Direct Preference Optimization
  • arXiv Direct Alignment of Language Models via Quality-Aware Self-Refinement
  • arXiv Adaptive Preference Scaling for Reinforcement Learning with Human Feedback
  • arXiv $\beta$-DPO: Direct Preference Optimization with Dynamic $\beta$
  • arXiv Reward Difference Optimization for Sample Reweighting in Offline RLHF
  • arXiv Geometric-Averaged Preference Optimization for Soft Preference Labels
  • arXiv α-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs
  • arXiv Plug-and-Play Training Framework for Preference Optimization
  • OpenReview Gap-Aware Preference Optimization: Enhancing Model Alignment with Perception Margin

Data Quality - Noise

  • arXiv Provably Robust DPO: Aligning Language Models with Noisy Feedback
  • arXiv ROPO: Robust Preference Optimization for Large Language Models
  • arXiv Impact of Preference Noise on the Alignment Performance of Generative Language Models
  • arXiv Spread Preference Annotation: Direct Preference Judgment for Efficient LLM Alignment
  • arXiv Towards Robust Alignment of Language Models: Distributionally Robustifying Direct Preference Optimization
  • OpenReview Understanding Generalization of Preference Optimization Under Noisy Feedback
  • OpenReview Combating inherent noise for direct preference optimization
  • OpenReview Perplexity-aware Correction for Robust Alignment with Noisy Preferences

Preference Feedback - Point-Wise

  • arXiv ULMA: Unified Language Model Alignment with Human Demonstration and Point-wise Preference
  • arXiv KTO: Model Alignment as Prospect Theoretic Optimization
  • arXiv Noise Contrastive Alignment of Language Models with Explicit Rewards
  • arXiv Binary Classifier Optimization for Large Language Model Alignment
  • arXiv Offline Regularised Reinforcement Learning for Large Language Models Alignment
  • arXiv Distributional Preference Alignment of LLMs via Optimal Transport
  • arXiv General Preference Modeling with Preference Representations for Aligning Language Models
  • arXiv Cal-DPO: Calibrated Direct Preference Optimization for Language Model Alignment

Preference Feedback - Pair-Wise

  • arXiv Direct Preference Optimization: Your Language Model is Secretly a Reward Model
  • arXiv A General Theoretical Paradigm to Understand Learning from Human Preferences
  • arXiv Negating negatives: Alignment without human positive samples via distributional dispreference optimization
  • arXiv Negative preference optimization: From catastrophic collapse to effective unlearning
  • arXiv Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment
  • arXiv Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization
  • arXiv On Extending Direct Preference Optimization to Accommodate Ties
  • arXiv Preference Optimization as Probabilistic Inference
  • arXiv Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning
  • arXiv IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization

Preference Feedback - List-Wise

  • arXiv LiPO: Listwise Preference Optimization through Learning-to-Rank
  • arXiv Panacea: Pareto Alignment via Preference Adaptation for LLMs
  • arXiv Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts
  • arXiv LIRE: listwise reward enhancement for preference alignment
  • arXiv Ordinal Preference Optimization: Aligning Human Preferences via NDCG
  • arXiv Preference Optimization with Multi-Sample Comparisons
  • arXiv Optimizing Preference Alignment with Differentiable NDCG Ranking
  • arXiv TODO: Enhancing LLM Alignment with Ternary Preferences

Preference Granularity - Token-Level

  • arXiv Token-level Direct Preference Optimization
  • arXiv From r to Q∗: Your Language Model is Secretly a Q-Function
  • arXiv DPO Meets PPO: Reinforced Token Optimization for RLHF
  • arXiv Selective Preference Optimization via Token-Level Reward Function Estimation
  • arXiv EPO: Hierarchical LLM Agents with Environment Preference Optimization
  • arXiv TIS-DPO: Token-level Importance Sampling for Direct Preference Optimization With Estimated Weights
  • arXiv SparsePO: Controlling Preference Alignment of LLMs via Sparse Token Masks
  • arXiv Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective

Preference Granularity - Step-Level

  • arXiv Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning
  • arXiv Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs
  • arXiv Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
  • arXiv Step-Controlled DPO: Leveraging stepwise error for enhanced mathematical reasoning
  • arXiv Data-Centric Human Preference Optimization with Rationales
  • arXiv TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees
  • arXiv Improving Multi-Step Reasoning Abilities of Large Language Models with Direct Advantage Policy Optimization

Preference Granularity - Sentence-Level

  • arXiv Direct Preference Optimization: Your Language Model is Secretly a Reward Model
  • arXiv MAPO: Advancing Multilingual Reasoning through Multilingual-Alignment-as-Preference Optimization
  • arXiv Advancing LLM Reasoning Generalists with Preference Trees
  • arXiv Iterative Reasoning Preference Optimization
  • arXiv FactAlign: Long-form factuality alignment of large language models

Preference Granularity - Turn-Level

  • arXiv Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents
  • arXiv Direct Multi-Turn Preference Optimization for Language Agents
  • arXiv Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
  • arXiv Building Math Agents with Multi-Turn Iterative Preference Learning
  • arXiv SDPO: Segment-Level Direct Preference Optimization for Social Agents

Learning Framework

Paradigm - Offline

  • arXiv Direct Preference Optimization: Your Language Model is Secretly a Reward Model
  • arXiv ULMA: Unified Language Model Alignment with Human Demonstration and Point-wise Preference
  • arXiv Contrastive preference optimization: Pushing the boundaries of LLM performance in machine translation
  • arXiv Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback
  • arXiv ICDPO: Effectively Borrowing Alignment Capability of Others via In-context Direct Preference Optimization
  • arXiv Enhancing Alignment using Curriculum Learning & Ranked Preferences
  • arXiv ORPO: Monolithic Preference Optimization without Reference Model
  • arXiv sDPO: Don't Use Your Data All at Once
  • arXiv PAFT: A Parallel Training Paradigm for Effective LLM Fine-Tuning

Paradigm - Online

  • arXiv Statistical rejection sampling improves preference optimization
  • arXiv Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-constraint
  • arXiv Some things are more cringe than others: Iterative preference optimization with the pairwise cringe loss
  • arXiv Self-Rewarding Language Models
  • arXiv Direct Language Model Alignment from Online AI Feedback
  • arXiv RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models
  • arXiv Direct large language model alignment through self-rewarding contrastive prompt distillation
  • arXiv Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents
  • arXiv Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model
  • arXiv ROPO: Robust Preference Optimization for Large Language Models
  • arXiv Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study
  • arXiv Iterative Reasoning Preference Optimization
  • arXiv Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning
  • arXiv D2PO: Discriminator-Guided DPO with Response Evaluation Models
  • arXiv Understanding the performance gap between online and offline alignment algorithms
  • arXiv Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF
  • arXiv Self-Exploring Language Models: Active Preference Elicitation for Online Alignment
  • arXiv Exploratory Preference Optimization: Provably Sample-Efficient Exploration in RLHF with General Function Approximation
  • arXiv The Importance of Online Data: Understanding Preference Fine-tuning via Coverage
  • arXiv Online DPO: Online Direct Preference Optimization with Fast-Slow Chasing
  • arXiv OPTune: Efficient Online Preference Tuning
  • arXiv BPO: Supercharging Online Preference Learning by Adhering to the Proximity of Behavior LLM
  • arXiv Self-training with direct preference optimization improves chain-of-thought reasoning
  • arXiv Building Math Agents with Multi-Turn Iterative Preference Learning
  • arXiv AIPO: Improving Training Objective for Iterative Preference Optimization
  • arXiv The Crucial Role of Samplers in Online Direct Preference Optimization
  • arXiv Accelerated Preference Optimization for Large Language Model Alignment
  • arXiv SeRA: Self-Review & Alignment with Implicit Reward Margins
  • arXiv CREAM: Consistency Regularized Self-Rewarding Language Models
  • arXiv COMAL: A Convergent Meta-Algorithm for Aligning LLMs with General Preferences
  • arXiv Online Preference Alignment for Language Models via Count-based Exploration

Paradigm - Active

  • arXiv Active Preference Learning for Large Language Models
  • arXiv Reinforcement Learning from Human Feedback with Active Queries
  • arXiv Active Preference Optimization for Sample Efficient RLHF
  • OpenReview Active Preference Optimization via Maximizing Learning Capacity

Objective - Multi-Objective

  • arXiv Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization
  • arXiv Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment
  • arXiv SPO: Multi-Dimensional Preference Sequential Alignment With Implicit Reward Modeling
  • arXiv Hybrid Preference Optimization: Augmenting Direct Preference Optimization with Auxiliary Objectives
  • arXiv Simultaneous Reward Distillation and Preference Learning: Get You a Language Model Who Can Do Both
  • OpenReview MOSLIM: Align with diverse preferences in prompts through reward classification

Objective - Self-Play

  • arXiv Nash learning from human feedback
  • arXiv Self-play fine-tuning converts weak language models to strong language models
  • arXiv A minimaximalist approach to reinforcement learning from human feedback
  • arXiv Human Alignment of Large Language Models through Online Preference Optimisation
  • arXiv Direct nash optimization: Teaching language models to self-improve with general preferences
  • arXiv Self-Play Preference Optimization for Language Model Alignment
  • arXiv BoNBoN Alignment for Large Language Models and the Sweetness of Best-of-n Sampling
  • arXiv Self-Improving Robust Preference Optimization
  • arXiv Dynamic Noise Preference Optimization for LLM Self-Improvement via Synthetic Data

Constraint Mechanism

Reference - Dynamic

  • arXiv Enhancing Alignment using Curriculum Learning & Ranked Preferences
  • arXiv Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model
  • arXiv Learn Your Reference Model for Real Good Alignment
  • arXiv Building Math Agents with Multi-Turn Iterative Preference Learning

Reference - Free

  • arXiv Contrastive preference optimization: Pushing the boundaries of LLM performance in machine translation
  • arXiv ORPO: Monolithic Preference Optimization without Reference Model
  • arXiv SimPO: Simple Preference Optimization with a Reference-Free Reward
  • arXiv Understanding reference policies in direct preference optimization
  • arXiv SimPER: Simple Preference Fine-Tuning without Hyperparameters by Perplexity Optimization

Divergence - Diversity

  • arXiv Beyond Reverse KL: Generalizing Direct Preference Optimization with Diverse Divergence Constraints
  • OpenReview Diverse Preference Learning for Capabilities and Alignment

Divergence - Generalization

  • arXiv Towards Efficient Exact Optimization of Language Model Alignment
  • arXiv Generalized Preference Optimization: A Unified Approach to Offline Alignment
  • arXiv Soft Preference Optimization: Aligning Language Models to Expert Distributions
  • arXiv Improving Conversational Abilities of Quantized Large Language Models via Direct Preference Alignment
  • arXiv Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization
  • arXiv FlipGuard: Defending Preference Alignment against Update Regression with Constrained Optimization
  • arXiv Direct Preference Optimization Using Sparse Feature-level Constraints
  • arXiv DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich Paradigm for Direct Preference Optimization

Safety

  • arXiv A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
  • arXiv Stepwise Alignment for Constrained Language Model Policy Optimization
  • arXiv Enhancing LLM Safety via Constrained Direct Preference Optimization
  • arXiv Adversarial DPO: Harnessing Harmful Data for Reducing Toxicity with Minimal Impact on Coherence and Evasiveness in Dialogue Agents
  • arXiv Backtracking Improves Generation Safety
  • OpenReview SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety

Model Property

Generation - Distribution

  • arXiv Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data
  • arXiv Robust Preference Optimization through Reward Model Distillation
  • arXiv Self-Exploring Language Models: Active Preference Elicitation for Online Alignment
  • arXiv Self-Improving Robust Preference Optimization
  • arXiv Right Now, Wrong Then: Non-Stationary Direct Preference Optimization under Preference Drift
  • arXiv On the limited generalization capability of the implicit reward model induced by direct preference optimization

Generation - Length

  • arXiv RRHF: Rank responses to align language models with human feedback without tears
  • arXiv A Long Way to Go: Investigating Length Correlations in RLHF
  • arXiv Disentangling Length from Quality in Direct Preference Optimization
  • arXiv SimPO: Simple Preference Optimization with a Reference-Free Reward
  • arXiv Eliminating Biased Length Reliance of Direct Preference Optimization via Down-Sampled KL Divergence
  • arXiv Direct Multi-Turn Preference Optimization for Language Agents
  • arXiv Following length constraints in instructions
  • arXiv The Hitchhiker’s Guide to Human Alignment with *PO
  • arXiv Length Desensitization in Direct Preference Optimization
  • arXiv Understanding the Logic of Direct Preference Alignment through Logic
  • arXiv [OpenReview](https://openreview.net/forum?id=qTrEq31Shm) LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization
  • arXiv Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective

Optimization - Likelihood

  • arXiv Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive
  • arXiv Towards Analyzing and Understanding the Limitations of DPO: A Theoretical Perspective
  • arXiv From r to Q∗: Your Language Model is Secretly a Q-Function
  • arXiv Robust Preference Optimization through Reward Model Distillation
  • arXiv 3D-Properties: Identifying Challenges in DPO and Charting a Path Forward
  • arXiv Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment
  • arXiv Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization
  • arXiv Understanding Likelihood Over-optimisation in Direct Alignment Algorithms
  • arXiv A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement

Optimization - Alignment

  • arXiv Mitigating the Alignment Tax of RLHF
  • arXiv Online Merging Optimizers for Boosting Rewards and Mitigating Tax in Alignment
  • arXiv Preference Learning Algorithms Do Not Learn Preference Rankings
  • arXiv A Deep Dive into the Trade-Offs of Parameter-Efficient Preference Alignment Techniques
  • arXiv PAFT: A Parallel Training Paradigm for Effective LLM Fine-Tuning

Other Analysis

  • arXiv Reward Model Learning vs. Direct Policy Optimization: A Comparative Analysis of Learning from Human Preferences
  • arXiv Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks
  • arXiv Discovering Preference Optimization Algorithms with and for Large Language Models
  • arXiv Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback
  • arXiv RainbowPO: A Unified Framework for Combining Improvements in Preference Optimization
  • OpenReview When is RL better than DPO in RLHF? A Representation and Optimization Perspective

Other Surveys

  • JMLR A Survey of Preference-Based Reinforcement Learning Methods
  • arXiv Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation
  • arXiv Aligning Large Language Models with Human: A Survey
  • arXiv Large Language Model Alignment: A Survey
  • arXiv The Past, Present and Better Future of Feedback Learning in Large Language Models for Subjective Human Preferences and Values
  • arXiv AI Alignment: A Comprehensive Survey
  • arXiv A Survey of Reinforcement Learning from Human Feedback
  • arXiv On the Essence and Prospect: An Investigation of Alignment Approaches for Big Models
  • arXiv A Survey on Human Preference Learning for Large Language Models
  • arXiv A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More
  • arXiv Towards a Unified View of Preference Learning for Large Language Models: A Survey
  • arXiv Preference Tuning with Human Feedback on Language, Speech, and Vision Tasks: A Survey
  • arXiv A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications