2024: The Year of the Large Language Model

updated on 21 May 2024

2024 has been the year of the Large Language Model, from the mighty Claude 3 Opus to the tiny Phi-3 Mini. Some models bring their own special flavor to the table, while others are simply bigger, with bland benchmarks to match. Welcome back to Silatus Digest; today we are talking about the latest LLMs.

Claude 3

On March 4, 2024, Anthropic released Claude 3, the first model to "de-throne" GPT-4. Like Claude 2, Claude 3 comes with a 200k-token context window. Unlike its predecessor, however, Claude 3 comes in three sizes: Haiku, Sonnet, and Opus.

According to Anthropic, Haiku is the fastest and most cost-effective model in its intelligence category, processing a dense research paper (~10k tokens) with visuals in under three seconds.
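
To make that concrete, here is a minimal sketch of sending a paper excerpt and one of its figures to Claude 3 Haiku through Anthropic's Python SDK. The file names and prompt are illustrative placeholders, and the snippet assumes an ANTHROPIC_API_KEY is set in the environment.

```python
# Minimal sketch: send a paper excerpt plus a figure to Claude 3 Haiku.
# Assumes ANTHROPIC_API_KEY is set; file names are placeholders.
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("figure1.png", "rb") as f:
    figure_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

paper_text = open("paper_excerpt.txt").read()  # ~10k tokens of the paper body

message = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png", "data": figure_b64}},
            {"type": "text",
             "text": f"Summarize the key findings of this paper and its figure:\n\n{paper_text}"},
        ],
    }],
)
print(message.content[0].text)
```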

Opus: A Formidable Competitor to OpenAI's GPT-4

Opus stands out as a significant rival to OpenAI's GPT-4, surpassing most of its counterparts across a wide range of common evaluation benchmarks for AI systems. Its areas of excellence include:

  • Undergraduate-level expert knowledge (MMLU)
  • Graduate-level expert reasoning (GPQA)
  • Basic mathematics (GSM8K)

Demonstrating near-human levels of comprehension and fluency in complex tasks, Opus is at the forefront of advancements in general intelligence.

Jamba: A Hybrid SSM-Transformer Model

Jamba is a groundbreaking AI model developed by AI21 Labs that combines state space models (SSMs) with Transformers. Today's large language models rely almost entirely on the Transformer architecture, and the issue with Transformers is the compute they require: in 2017, training a Transformer model cost roughly $900, but by 2019 the cost to train RoBERTa Large had surged to $160,000. That increase aligns with scaling laws, which indicate that a larger model's performance improves predictably with more computational power.

However, Jamba is a hybrid of Transformer and Mamba layers. It can run inference over 140,000 tokens of context on a single A100, roughly the equivalent of a 210-page novel, and it is about three times as efficient as similar-sized Transformer models, making it a game-changer for performance on a single GPU.
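
For readers who want to try it, here is a rough sketch of loading Jamba for long-context inference with Hugging Face transformers. It assumes the publicly released ai21labs/Jamba-v0.1 checkpoint and a recent transformers version with Jamba support; the input file and generation settings are illustrative, not AI21's own recipe.

```python
# Rough sketch: long-context inference with Jamba via Hugging Face transformers.
# Checkpoint name and generation settings are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai21labs/Jamba-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to fit long contexts on one GPU
    device_map="auto",
)

long_document = open("novel.txt").read()  # placeholder: up to ~140k tokens on one A100
inputs = tokenizer(long_document + "\n\nSummarize the plot:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```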

Phi-3: Microsoft's Small Language Model

Phi-3 is a small language model (SLM) developed by Microsoft, designed for mobile devices and other compute-limited environments. It is available in several variants, including Phi-3-mini, a 3.8-billion-parameter model trained on 3.3 trillion tokens. Phi-3 models are instruction-tuned and optimized for ONNX Runtime, supporting cross-platform use. Despite their smaller size, Phi-3 models perform competitively with larger models on key benchmarks. They have been developed in accordance with Microsoft's Responsible AI Standard, ensuring safety and ethical considerations are met. Phi-3 is already being used in practical applications such as agriculture, demonstrating its value in real-world scenarios. In evaluations, Phi-3-mini is comparable to Llama 3 (8 billion parameters).
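
As a quick illustration, here is a minimal sketch of running the instruction-tuned Phi-3-mini checkpoint locally with Hugging Face transformers. The prompt and generation settings are illustrative assumptions; recent transformers releases let the text-generation pipeline accept chat-style messages directly.

```python
# Minimal sketch: run Phi-3-mini locally with the transformers pipeline.
# Prompt and settings are illustrative, not Microsoft's reference setup.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",  # 3.8B instruction-tuned variant
    device_map="auto",
    trust_remote_code=True,  # may be required on older transformers releases
)

messages = [{"role": "user", "content": "Suggest a simple crop rotation plan for a small farm."}]
out = generator(messages, max_new_tokens=200)
print(out[0]["generated_text"][-1]["content"])  # the assistant's reply
```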

Llama 3: The Open-Source King

On April 18, 2024, the Llama Lord open-sourced Llama 3. The talk of the town is that Llama 3's 8-billion-parameter instruct model outperforms every other model in its class. The 70-billion-parameter instruct model, meanwhile, competes within its own weight class, and if the claims about GPT-4's trillion-parameter size are true, it is punching above its weight. Additionally, the Llama Lord announced that a 400-billion-parameter Llama 3 will be released in the near future. If the trend for Llama models continues, Llama 3 400B will rank supreme on the LLM benchmarks.

Are Benchmarks All You Need?

The benchmarks above indicate a clear trend of LLMs catching up to GPT-4. But our Silatus readers know that benchmarks should be taken with a grain of salt: while they are useful metrics for comparing LLMs, they don't always translate into practical, real-world performance.

Training LLMs for Benchmarks

MMLU is meant to measure a model's knowledge and reasoning, but LLM providers can load their training data with extra code and math to boost MMLU scores. The real question is whether that "reasoning" transfers to real-world use cases: a model may explain a question step by step, but can it reason its way through an argument? Some models can, and some cannot, and the same caveat applies to every benchmark. Users should focus on more practical measures, such as Elo ratings and Chatbot Arena.
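
As a toy illustration of what a static benchmark score actually is, the sketch below grades a stand-in "model" on two made-up multiple-choice items the way MMLU-style evaluations do: the reported number is plain accuracy on a fixed question set, which says nothing about open-ended use.

```python
# Toy illustration of static multiple-choice scoring (MMLU-style).
# The items and the dummy model are invented for demonstration only.
items = [
    {"question": "Which gas do plants absorb during photosynthesis?",
     "choices": {"A": "Oxygen", "B": "Carbon dioxide", "C": "Nitrogen", "D": "Helium"},
     "answer": "B"},
    {"question": "What is 7 * 8?",
     "choices": {"A": "54", "B": "56", "C": "64", "D": "48"},
     "answer": "B"},
]

def dummy_model(question, choices):
    """Stand-in for an LLM call; always guesses 'B'."""
    return "B"

correct = sum(dummy_model(i["question"], i["choices"]) == i["answer"] for i in items)
print(f"accuracy: {correct / len(items):.2f}")  # one number, blind to open-ended ability
```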

Introduction to Chatbot Arena and Elo Ratings

Chatbot Arena is an innovative benchmark platform designed to evaluate large language models (LLMs) through a unique method that involves head-to-head comparisons, much like a competitive game. This platform allows users to interact with two anonymized LLMs, ask questions, and vote on which model provides the better answer. The Elo rating system, originally used in chess, is employed to calculate the relative skill levels of these models based on user votes, providing a dynamic leaderboard of LLM performance.

Elo Ratings in LLM Benchmarking

The Elo rating system is a well-established method for ranking players in competitive games and sports. It has been adapted to the context of LLMs to provide a measure of their performance relative to one another. In Chatbot Arena, the Elo ratings are computed based on user votes from numerous anonymous battles between models. This system ensures that the ratings reflect the community's collective assessment of each model's capabilities.
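
For the curious, here is a minimal sketch of the Elo update this kind of leaderboard is built on: each vote between two anonymized models nudges their ratings toward the community's judgment. The K-factor and sample battles below are illustrative choices, not Chatbot Arena's actual parameters or data.

```python
# Minimal sketch of an Elo update driven by pairwise user votes.
# K-factor, starting ratings, and battles are illustrative values.
def elo_update(rating_a, rating_b, score_a, k=32):
    """score_a is 1.0 if model A wins the vote, 0.0 if it loses, 0.5 for a tie."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1 - score_a) - (1 - expected_a))
    return rating_a, rating_b

ratings = {"model-x": 1000.0, "model-y": 1000.0}
battles = [("model-x", "model-y", 1.0),   # user preferred model-x
           ("model-x", "model-y", 0.0),   # user preferred model-y
           ("model-x", "model-y", 1.0)]   # user preferred model-x

for a, b, score_a in battles:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], score_a)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))  # leaderboard order
```

With enough votes across many model pairs, these incremental updates converge toward a stable ranking, which is what the public leaderboard reports.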

Crowdsourcing and Community Involvement

Chatbot Arena emphasizes community involvement by inviting users to contribute to the benchmarking process. By engaging with the models and voting for the answers they prefer, users play a crucial role in shaping the leaderboard. This crowdsourced approach not only enhances the benchmarking process but also ensures that the ratings reflect a wide range of perspectives and use cases.

Leaderboard Details and Transparency

The leaderboard in Chatbot Arena provides detailed information about each model, including its name, Elo score, confidence interval, number of votes, the organization behind the model, the license under which it operates, and the knowledge cutoff date. Understanding these details is crucial for interpreting the rankings accurately and recognizing the strengths and limitations of each model.
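
A small sketch of what one leaderboard row carries, using the fields described above; the model names and numbers are invented placeholders, not real Chatbot Arena entries.

```python
# Sketch of the fields a leaderboard row exposes; values are placeholders.
from dataclasses import dataclass

@dataclass
class LeaderboardEntry:
    model: str
    elo: float
    confidence_interval: tuple  # (lower, upper) bound on the Elo estimate
    votes: int
    organization: str
    license: str
    knowledge_cutoff: str

rows = [
    LeaderboardEntry("example-model-a", 1150.0, (1142.0, 1158.0), 52000, "Org A", "Proprietary", "2023-12"),
    LeaderboardEntry("example-model-b", 1105.0, (1096.0, 1114.0), 47000, "Org B", "Apache 2.0", "2023-03"),
]

# Rankings are read off by sorting on Elo; the confidence interval and vote
# count indicate how settled each position is.
for rank, row in enumerate(sorted(rows, key=lambda r: -r.elo), start=1):
    print(rank, row.model, round(row.elo), row.votes)
```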

Future Developments and Enhancements

The platform's developers plan to expand the range of models included in the benchmark, improve the algorithms and systems used to support larger numbers of models, and release periodically updated leaderboards. These enhancements aim to provide a more comprehensive and up-to-date assessment of LLMs in various contexts.

Benchmark Datasets and Manual Evaluation

While benchmark datasets serve as a standardized method for assessing LLMs, they have limitations, particularly in evaluating the safety of these models. Chatbot Arena offers an alternative by allowing manual human evaluation through direct interactions with the models. This approach provides immediate feedback on the LLMs' outputs and complements traditional benchmarking methods.

The Role of Chatbot Arena in the LLM Landscape

Chatbot Arena stands out as a resource for benchmarking LLMs. It not only provides a platform for direct comparison of models but also contributes to the broader AI community by highlighting models that excel in understanding and engaging with users. The platform's use of the Elo rating system ensures a robust and scalable method for ranking LLMs based on real-world performance.

Conclusion

Chatbot Arena represents a novel approach to LLM benchmarking, leveraging the Elo rating system and crowdsourced data to create a dynamic and community-driven leaderboard. Its emphasis on transparency, community involvement, and continuous improvement positions it as a valuable tool for understanding and comparing the capabilities of various LLMs. As the platform evolves, it promises to offer even more granular insights into the performance of LLMs across different use cases and to accommodate a growing range of models for comprehensive benchmarking.
