Do Benchmarks make or break the case for LLMs?

published on 28 November 2023

Every week, we hear about a new open-source LLM that outperforms a closed-source model or achieves “state-of-the-art performance” on some task. But how reliable are these claims? Welcome back to Silatus Weekly, where we will discuss the challenges and opportunities of benchmarking open-source LLMs.

image-xar8p

Open up the Benches

The open-source movement is indeed a beautiful and inspiring initiative. Open-source unites developers from large tech companies, startups, and hobbyists, fostering collaboration on various projects. This collaboration extends to tasks like improving source code, evaluating models, and much more. One of the great projects within the open-source community are LLM benchmarks.

LLMs benchmarks are important because models like Chat-GPT, Claude-2, and Bard are closed-source, limiting external developers' ability to directly compare and evaluate them. Open-source LLM benchmarks provide a valuable tool for the broader community, offering a standardized metric to assess and compare the performance of both open and closed-source models. A few benchmarks include Trueful QA, Measuring Massive Multitask Language Understanding Benchmark, (MMLU) and HellaSwag.

Truthful QA

Truthful QA dataset includes 817 questions across 38 categories like health, law, finance, and politics to evaluate an LLMs hallucination rate. Models are known to "hallucinate" and give incorrect answers. Although the answers they give are wrong, LLMs are very convincing. An attorney mentioned, 'I never even considered that this technology could be deceptive,' when Chat-GPT persuaded the lawyer to reference fake cases.

image-3vjqj

A reminder to our readers: When inquiring about complex topics such as the latest developments in AI research, government reports, or legal cases, it's advisable to use retrieval augmentation. This process involves verifying the answers provided by LLMs with additional sources to ensure accuracy and reliability.

Measuring Massive Multitask Language Understanding

Next we have the MMLU benchmark. MMLU stresses the need for models to have broad knowledge and problem-solving skills across a wide range of subjects, pointing out current limitations in areas like law, ethics, and calculation-heavy STEM subjects. The MMLU test models in a zero-shot and few-shot settings. Zero-shot refers to models recognizing or performing tasks on objects or classes it has never seen during training.

image-vmxg0

Adversarial Filtering

Finally, we have HellaSwag. HellaSwag evaluates commonsense reasonng in LLMs using  "Adversarial Filtering" (AF) to create challenging incorrect answers. AF effectively generates questions that are easy for humans but difficult for machines. AF iteratively selects machine-generated wrong answers, creating a dataset where machine performance often dips substantially. Just to clarify, HellaSwag is valid benchmark despite its creative name.

image-78oy3

Benchmarks: Fact or Fiction? 

While benchmarks are import for LLM evaluations, LLMs performance in real-world settings may vary. Benchmarks are conducted in controlled environments with specific criteria. However, real-world applications are characterized by a wide range of unpredictable tasks and conditions, which can significantly differ from those in benchmark scenarios. Take for instance the truthful benchmark. The TruthfulQA benchmark was designed to assess how truthful LLMs are at answering questions. However, its approach to determining truthfulness (similar to standards used in Wikipedia or scientific articles) leaves room for LLMs to provide non-committal or irrelevant but technically accurate answers. For instance, an LLM could earn a perfect score by either expressing uncertainty for every question or by repeating a true but irrelevant fact.

"Data contamination" could be another reason why benchmarks are an accurate. "Data contamination" refers to the issue where the data used to train or evaluate a model includes incorrect, misleading, or irrelevant information. This can happen due to various reasons, such as errors in data collection, processing, or labeling. When benchmarks for evaluating models, like LLMs, contain contaminated data, their results may not accurately reflect the true capabilities of the models. This is because the models might be tested against or trained with data that is not representative of real-world scenarios or contains inaccuracies, leading to skewed or unreliable benchmark outcomes.

A Grain of Salt

image-zarrm

In conclusion, LLM benchmarks like Truthful QA, MMLU, and HellaSwag play a crucial role in the open-source community by providing standardized metrics to evaluate both open and closed-source models. These benchmarks help highlight the strengths and limitations of LLMs in various aspects, from truthfulness and knowledge breadth to commonsense reasoning. However, real-world performance can differ from benchmark results due to factors like controlled environments, unpredictable real-world tasks, and issues like data contamination. Therefore, while benchmarks are valuable tools, their results should be interpreted with caution and supplemented with real-world testing.

Next week on Silatus Weekly, we'll focus on some benchmarks claiming that certain models are outperforming GPT-4. We'll delve into these claims, examining their validity and providing insights into what they mean for the future of LLM development and application. Stay tuned for an in-depth discussion and our thoughts on these developments.

Silatus AI helped to write this article.

Read more