Direct Preference Optimization (DPO) is All You Need?

published on 06 December 2023

Welcome back to Silatus Weekly. Today, we will explore the power behind Zephyr-7B and the new era of smaller but powerful models. In 'Textbooks Are All You Need', Phi-1.5 outperformed Llama-7B on certain benchmarks. Additionally, Mistral AI's Mistral 7B surpassed Llama 13B. Now, HuggingFaceH4/zephyr-7b-beta emerges as a leader, beating Llama-2-70b-chat and GPT-3.5 Turbo on several important benchmarks by using Direct Preference Optimization (DPO).


Fine-Tuning

Zephyr-7B is Mistral 7B fine-tuned with DPO, building on a supervised fine-tuning pass over the HuggingFaceH4/ultrachat_200k dataset. Fine-tuning in machine learning means taking a pre-trained model and training it further on a specific task, such as summarization. Fine-tuning can be done unsupervised or supervised. In supervised fine-tuning, you use a labeled dataset to adjust the pre-trained model's parameters. This is common in tasks like image classification, natural language processing, and others where specific target outputs are associated with each input in your dataset. The fine-tuning process continues training the pre-trained model on a new dataset with known labels, adjusting the weights to minimize the loss function, which measures the discrepancy between the model's predictions and the actual labels.
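To make that concrete, here is a minimal sketch of a supervised fine-tuning loop in PyTorch. The `pretrained_encoder` and the toy data are hypothetical stand-ins, not part of the Zephyr recipe; the point is just the pattern of reusing pre-trained weights and adjusting them against a labeled loss.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical setup: a pretrained encoder whose weights we reuse, plus a
# fresh classification head trained on the new labeled dataset.
pretrained_encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU())  # stands in for a real pretrained backbone
classifier_head = nn.Linear(64, 3)                                 # 3 target classes in the new task
model = nn.Sequential(pretrained_encoder, classifier_head)

# Toy labeled data: inputs paired with known target labels.
inputs = torch.randn(256, 128)
labels = torch.randint(0, 3, (256,))
loader = DataLoader(TensorDataset(inputs, labels), batch_size=32, shuffle=True)

loss_fn = nn.CrossEntropyLoss()                              # discrepancy between predictions and labels
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)   # small LR: we adjust, not retrain from scratch

for epoch in range(3):
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)  # compare predictions against the known labels
        loss.backward()              # gradients w.r.t. all fine-tuned weights
        optimizer.step()             # nudge the weights to reduce the loss
```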

Unsupervised fine-tuning doesn't rely on labeled data. Instead, it might involve adjusting a model to better capture the structure or distribution of a new dataset without explicit labels. Techniques like autoencoders, which learn to compress and then reconstruct input data, can be used for unsupervised fine-tuning. This approach is useful when you have a large amount of unlabeled data and want the model to learn more generalized features from this data.
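For contrast, here is a minimal autoencoder-style sketch of unsupervised fine-tuning, again with made-up dimensions and toy data. The training signal is reconstruction error on unlabeled inputs rather than a labeled target.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Encoder compresses the input to a small code; decoder reconstructs it.
encoder = nn.Sequential(nn.Linear(128, 32), nn.ReLU())
decoder = nn.Sequential(nn.Linear(32, 128))
autoencoder = nn.Sequential(encoder, decoder)

# Unlabeled data from the new domain: no targets, only the inputs themselves.
unlabeled = torch.randn(512, 128)
loader = DataLoader(TensorDataset(unlabeled), batch_size=64, shuffle=True)

loss_fn = nn.MSELoss()  # reconstruction error plays the role of the training signal
optimizer = torch.optim.AdamW(autoencoder.parameters(), lr=1e-4)

for epoch in range(3):
    for (x,) in loader:
        optimizer.zero_grad()
        loss = loss_fn(autoencoder(x), x)  # reconstruct the input itself
        loss.backward()
        optimizer.step()
```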


Reinforcement Learning and Reinforcement Learning with Human Feedback

Before addressing DPO, we need to tackle a few more terms (we are almost done!). Reinforcement learning (RL) is a type of machine learning concerned with how models take actions in an environment to maximize some notion of cumulative reward: the model is rewarded for good choices, and the reward tells it how good or bad its action was in terms of achieving the goal. Then we have Reinforcement Learning from Human Feedback (RLHF). In the context of LLMs, RLHF incorporates human feedback directly into the learning process. This feedback can help the model generate more appropriate, accurate, or contextually relevant text.
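A core ingredient of standard RLHF is a reward model trained on human preference pairs. The sketch below shows the common Bradley-Terry style objective on toy features; the tiny MLP and random tensors are placeholders (in practice the reward model is itself an LLM with a scalar head).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy reward model: maps a fixed-size representation of a (prompt, response)
# pair to a scalar reward score.
reward_model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-4)

# Hypothetical features for responses a human preferred vs. rejected.
chosen_features = torch.randn(32, 128)
rejected_features = torch.randn(32, 128)

# Bradley-Terry style objective: push reward(chosen) above reward(rejected).
r_chosen = reward_model(chosen_features)
r_rejected = reward_model(rejected_features)
loss = -F.logsigmoid(r_chosen - r_rejected).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```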

Direct Preference Optimization

Compared to RLHF, DPO takes a different approach. Standard RLHF methods first learn a reward model from human preference data that reflects which responses humans tend to prefer, then optimize the language model against it with RL. DPO instead directly optimizes the language model to satisfy the preferences without explicitly modeling a reward function. By eliminating the need to learn a separate reward model, sample from the policy during training, and tune complex RL hyperparameters, DPO provides a much simpler training paradigm for optimizing language models for human preferences. Despite its simplicity, DPO trains policies that satisfy human preferences as well as or better than more complex RLHF algorithms across tasks like sentiment control, summarization, and dialogue.
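Here is a minimal sketch of the DPO objective, assuming you already have the summed log-probabilities of the chosen and rejected responses under both the policy being trained and a frozen reference model. The numbers in the usage example are made up; `beta` is the DPO temperature that controls how far the policy may drift from the reference.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO objective on per-example sequence log-probabilities.

    The policy earns an implicit reward (relative to the frozen reference)
    for putting more probability on the preferred response than the rejected one.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example with made-up log-probabilities for a batch of 4 preference pairs.
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0, -9.5, -14.2, -11.0]),
    policy_rejected_logps=torch.tensor([-13.5, -10.0, -13.9, -12.5]),
    ref_chosen_logps=torch.tensor([-12.5, -9.8, -14.0, -11.3]),
    ref_rejected_logps=torch.tensor([-13.0, -10.2, -14.1, -12.0]),
)
print(loss.item())
```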


As for Zephyr 7B, the Hugging Face team directly fine-tuned the Mistral 7B model (after dSFT) on the UltraFeedback preference pairs to optimize for GPT-4's preferences using the DPO loss. DPO updates the model to assign a higher likelihood to preferred responses than to non-preferred ones.
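If you want to experiment yourself, Hugging Face's TRL library ships a `DPOTrainer`. The sketch below is illustrative only, not the exact Zephyr recipe: argument names differ across trl versions, and the real pipeline applies a chat template and other preprocessing that is glossed over here.

```python
# Illustrative only: argument names and dataset preprocessing vary across
# trl versions; this is not the exact Zephyr training recipe.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "mistralai/Mistral-7B-v0.1"  # Zephyr starts DPO from its dSFT checkpoint instead
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# UltraFeedback preference pairs: each row has a prompt, a chosen response,
# and a rejected response (chat-template handling omitted for brevity).
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

config = DPOConfig(output_dir="dpo-sketch", beta=0.1)  # beta scales the implicit KL penalty
trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,  # newer trl versions name this argument processing_class
)
trainer.train()
```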


Since Zephyr 7B, other models have entered the DPO arena. Intel fine-tuned its neural-chat-7b-v3 model with DPO and saw improvements. If you read last week's post, then you know benchmarks should be taken with a grain of salt. With that said, it's still remarkable to see improvements in model performance from novel training methods.


It's great to see the open-source community pushing the limits. We are excited to see what they have in store next year!

Silatus AI helped to write this article.
