LiveBench is an open LLM benchmark using contamination-free test data
A team from Nvidia, Abacus.ai, New York University, the University of Maryland and the University of Southern California has developed a new benchmark that addresses “serious limitations” with industry incumbents. LiveBench is a general-purpose LLM benchmark designed to keep its test data free of contamination, which occurs when benchmark questions end up in the data a model was trained on. It utilizes “frequently-updated questions from recent sources, scoring answers automatically according to objective ground-truth values, and contains a wide variety of challenging tasks, spanning math, coding, reasoning, language, instruction following, and data analysis.”
The release of LiveBench is especially notable because one of its contributors is Yann LeCun, a pioneer in the world of AI, Meta’s chief AI scientist, and someone who recently got into a spat with Elon Musk. Joining him are Abacus.ai’s Head of Research Colin White and research scientists Samuel Dooley, Manley Roberts and Arka Pal; Nvidia’s Senior Research Scientist Siddhartha Jain; and academics Ben Feuer, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Siddartha Naidu, Chinmay Hegde, Tom Goldstein, Willie Neiswanger, and Micah Goldblum.
“Like many in the community, we knew that we needed better LLM benchmarks because existing ones don’t align with our qualitative experience using LLMs,” Goldblum tells VentureBeat in an email. “This project started with the initial thought that we should build a benchmark where diverse questions are freshly generated every time we evaluate a model, making test set contamination impossible. I chatted with Colin and Samuel from Abacus.ai, and ultimately, with funding and support from Abacus.ai, built this thing out into much more than we initially imagined. We combined forces with folks at NYU, Nvidia, USC and also the University of Maryland folks who had been thinking about instruction following, and the project became a big team effort.”
LiveBench: What you need to know
“As large language models (LLMs) have risen in prominence, it has become increasingly clear that traditional machine learning benchmark frameworks are no longer sufficient to evaluate new models,” the team states in a published whitepaper (PDF). “Benchmarks are typically published on the internet, and most modern LLMs include large swaths of the internet in their training data. If the LLM has seen the questions of a benchmark during training, its performance on that benchmark will be artificially inflated, hence making many LLM benchmarks unreliable.”
The whitepaper authors claim that while benchmarks using LLM or human prompting and judging have become increasingly popular, both approaches are prone to mistakes and unconscious biases. “LLMs often favor their own answers over other LLMs, and LLMs favor more verbose answers,” they write. Human evaluators aren’t immune either. They can introduce biases around output formatting and around the tone and formality of the writing. Moreover, humans could influence how questions are generated, offering less diverse queries, favoring specific topics that don’t probe a model’s general capabilities, or simply writing poorly constructed prompts.
“Static benchmarks use the honor rule; anyone can train on the test data and say they achieved 100 percent accuracy, but the community generally doesn’t cheat too bad, so static benchmarks like ImageNet or GLUE have historically been invaluable,” Goldblum explains. “LLMs introduce a serious complication. In order to train them, we scrape large parts of the internet without human supervision, so we don’t really know the contents of their training set, which may very well contain test sets from popular benchmarks. This means that the benchmark is no longer measuring the LLM’s broad abilities but rather its memorization capacity, so we need to build yet another new benchmark, and the cycle goes on every time contamination occurs.”
To counter this, LiveBench is releasing new questions every month that can be used to minimize potential test data contamination. These queries are sourced using recently released datasets and math competitions, arXiv papers, news articles and IMDb movie synopses. Because each question has a verifiable and objective ground-truth answer, it can be scored accurately and automatically without needing LLM judges. 960 questions are now available with newer and harder inquiries being released monthly.
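To make the idea of judge-free scoring concrete, here is a minimal sketch of grading answers against a stored ground truth. It is not LiveBench’s actual scoring code: the `Question` class, the `normalize` helper and the exact-match rule are all assumptions chosen for illustration, and real tasks may well need task-specific checks.

```python
from dataclasses import dataclass


@dataclass
class Question:
    """Hypothetical container for one benchmark item."""
    prompt: str
    ground_truth: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't matter."""
    return " ".join(text.strip().lower().split())


def score_answer(model_answer: str, ground_truth: str) -> float:
    """Return 1.0 on an exact match after normalization, else 0.0 (assumed rule)."""
    return 1.0 if normalize(model_answer) == normalize(ground_truth) else 0.0


def accuracy(answers: list[str], questions: list[Question]) -> float:
    """Fraction of questions answered correctly; no LLM or human judge involved."""
    if not questions:
        return 0.0
    total = sum(score_answer(a, q.ground_truth) for a, q in zip(answers, questions))
    return total / len(questions)
```

Because the grader is a deterministic function of the answer and the ground truth, the same response always receives the same score, which is the property that LLM-judged benchmarks lack.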
Tasks and categories
An initial set of 18 tasks across the six aforementioned categories is available today. They’re tasks that use “a continuously updated information source for their questions” or are “more challenging or diverse versions of existing benchmark tasks,” such as those from AMPS, Big-Bench Hard, IFEval or bAbI. Here’s the breakdown of tasks by categories:
- Math: questions from high school math competitions from the past 12 months, as well as harder versions of AMPS questions
- Coding: code generation and a novel code completion task
- Reasoning: challenging versions of Big-Bench Hard’s Web of Lies and positional reasoning from bAbI and Zebra Puzzles
- Language Comprehension: three tasks, namely Connections word puzzles, typo removal, and movie synopsis unscrambling for recent movies featured on IMDb and Wikipedia
- Instruction Following: four tasks to paraphrase, simplify, summarize or generate stories about recent articles from The Guardian while adhering to requirements such as word limits or incorporating specific elements in the response
- Data Analysis: three tasks that use recent datasets from Kaggle and Socrata, namely table reformatting, predicting which columns can be used to join two tables, and predicting the correct type annotation of a data column
Each task differs in difficulty level, from easy to most challenging. The idea is that top models will tend to have a 30 percent to 70 percent success rate.
The benchmark’s creators say they have evaluated many “prominent closed-source models, as well as dozens of open-source models” between 500 million and 110 billion parameters in size. Citing LiveBench’s difficulty level, they claim top models have achieved less than 60 percent accuracy. For example, OpenAI’s GPT-4o, which tops the benchmark’s leaderboard, has a global average score of 53.79, followed by GPT-4 Turbo’s 53.34. Anthropic’s Claude 3 Opus is ranked third with 51.92.
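The article does not spell out how a global average is derived, but one plausible reading is that task scores are averaged within each category and the category averages are then averaged again. The sketch below works under that assumption; the function name and the sample numbers are hypothetical and are not real leaderboard data.

```python
from statistics import mean


def global_average(category_scores: dict[str, list[float]]) -> float:
    """Average task scores within each category, then average across categories.

    Assumed aggregation rule: every category counts equally, no matter
    how many tasks it contains.
    """
    per_category = [mean(scores) for scores in category_scores.values() if scores]
    return mean(per_category) if per_category else 0.0


# Illustrative numbers only, not actual LiveBench results.
example = {
    "math": [50.0, 70.0],  # category mean 60.0
    "coding": [40.0],      # category mean 40.0
}
print(global_average(example))
```

Weighting categories equally keeps a category with many tasks, such as instruction following, from dominating the headline number.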
What it means for the enterprise
Business leaders have it rough contemplating how to use AI and develop a sound strategy using the technology. Asking them to decide on the right LLMs adds unnecessary stress to the equation. Benchmarks can provide some peace of mind that models have exceptional performance—similar to product reviews. But are executives given the complete picture of what’s under the hood?
“Navigating all the different LLMs out there is a big challenge, and there’s unwritten knowledge regarding what benchmark numbers are misleading due to contamination, which LLM-judge evals are super biased, etc.,” Goldblum states. “LiveBench makes comparing models easy because you don’t have to worry about these problems. Different LLM use-cases will demand new tasks, and we see LiveBench as a framework that should inform how other scientists build out their own evals down the line.”
Comparing LiveBench to other benchmarks
Declaring you have a better evaluation standard is one thing, but how does it compare to benchmarks the AI industry has used for some time? The team looked into it, comparing LiveBench’s scores with those of prominent LLM benchmarks, namely LMSYS’s Chatbot Arena and Arena-Hard. LiveBench showed “generally similar” trends to its industry peers, though some models were “noticeably stronger on one benchmark versus the other, potentially indicating some downsides of LLM judging.”
While these benchmarks broadly agree on which models perform best, individual LLM scores differ between them, and the scores are not an apples-to-apples comparison. As the LiveBench team points out, some of the gaps can be attributed to biases in how answers are judged. For example, OpenAI’s GPT-4-0125-preview and GPT-4 Turbo-2024-04-09 performed significantly better on Arena-Hard compared to LiveBench, but this is said to be “due to the known bias from using GPT-4 itself as the LLM judge.”
When asked if LiveBench is a startup or simply a benchmark available to the masses, Dooley remarks it’s “an open-source benchmark that anyone can use and contribute to. We plan to maintain it by releasing more questions every month. Also, over the coming months, we plan on adding more categories and tasks to broaden our ability to evaluate LLMs as their abilities change and adapt. We are all big fans of open science.”
“We find that probing the capabilities of LLMs and choosing a high-performing model is a huge part of designing an LLM-focused product,” White says. “Proper benchmarks are necessary, and LiveBench is a big step forward. But moreover, having good benchmarks accelerates the process of designing good models.”
Developers can download LiveBench’s code from GitHub and its datasets on Hugging Face.