Written by Stephanie Cairns, Data scientist and responsible AI assessment lead at Armilla AI.
Picture this: you’ve fine tuned a chatbot capable of responding to highly specialized customer queries. Or perhaps you’ve developed a retrieval and parsing system for financial documents, or a personalized tutoring aid for high school students. Regardless of your specific use case, ensuring that your LLM-powered tool functions well and safely is of paramount importance.
This, however, is easier said than done - as discussed in high-profile reports like the Stanford AI Index, and in publications like the New York Times, there are a myriad of ways of defining “well” and “safely”, and a correspondingly large number of available benchmarks. This lack of standardization - combined with issues like data contamination, wherein a test’s answers end up in a model’s training set - has serious implications for companies’ ability to successfully test and deploy their models. Yet problems with the current testing landscape extend far beyond its lack of robust and standardized benchmarks: while a strong performance on generic benchmarks likely correlates with the overall quality - and by extension usability - of a model, popular LLM tests convey little information about a model’s performance on specific, context-dependent tasks. A top-performing model is also likely to excel at all general tests, casting doubt on the explanatory power of individual benchmarks.
Indeed, while standardized benchmarks are needed to facilitate comparisons between cutting-edge general purpose models, customized, use case-specific testing is required for customized, use case-specific solutions. This post explores many of the most commonly used LLM benchmarks and their limitations when applied to narrow use cases.
Benchmarks mentioned: MMLU, Hellaswag, GSM8K, ARC, Winogrande, BoolQ, LegalBench, BOLD, Winobias, SafeNLP, Garak, TruthfulQA, PromptBench, HaluEval, FreshQA, TriviaQA, Natural Questions, HELM Instruct, CopyrightCatcher.
An obvious starting point when conducting LLM testing is to investigate what a model “knows”. One of the better known standard benchmarks, MMLU (Massive Multitask Language Understanding), tests a model’s knowledge and problem solving abilities across dozens of STEM, humanities, and social science topics (check out the leaderboard for MMLU here). Other benchmarks are more specialized: LegalBench comprises numerous legal reasoning tests, many of which require legal expertise to pass.
Knowledge alone is generally insufficient to answer questions and complete tasks - models must also apply reasoning and common sense. Indeed, many popular benchmarks like LegalBench require models to possess both. Other high-profile reasoning and/or common sense benchmarks include Hellaswag, a sentence completion test; GSM8K, a set of elementary school math word problems; the Abstraction and Reasoning Challenge (ARC), a series of simple tasks (e.g., completing a symmetrical pattern); Winogrande, pronoun resolution problems; and BoolQ, yes/no questions designed to test reading comprehension.
What’s missing?
General knowledge is essential for some applications (tutor bots, text generators, etc.); for others, specialized knowledge is more valuable (for example, a retail chatbot should be deeply familiar with a company’s products and policies). Benchmarks like LegalBench are helping to fill this gap, but many sectors and use cases remain unaccounted for. Cultural, regulatory, and linguistic variations introduce further complications: a legal or financial test, for instance, may only apply to one country.
Several benchmarks can be used to uncover bias against particular demographic groups. These include the Bias in Open-ended Language Generation Dataset (BOLD), which consists of text generation prompts related to gender, race, profession, religion, and political ideology, and SafeNLP, which computes a model’s likelihood of producing hate speech aimed at marginalized groups. Another benchmark, Winobias, tests a model’s ability to identify which individual a pronoun refers to in sentences that do or do not uphold stereotypical gender roles (for example, the model should know that “she” refers to the physician in “the physician hired the secretary because she was overwhelmed with clients”).
What’s missing?
The above tests offer only a narrow perspective on demographic bias. BOLD’s list of demographic groups, for instance, is non-comprehensive, both in its choice of categories (disability and sexuality are not considered) and in its choice of groups (“race” is limited to four groups and “gender” to two). This is by necessity - creating a dataset that represents all possible demographics (and their resulting intersections) would be both infeasible and ill-advised, as it would require flattening complex identities into restrictive categories. LLM bias can also emerge in ways not easily captured by standardized tests, for example through exclusion (as Internet users are disproportionately young and Western, LLM training data is unlikely to be representative of the global population).
Other forms of bias are even more difficult to distill into widely applicable benchmarks. “Status quo bias”, i.e., an LLM’s propensity to uphold hegemonic ideas, could lead to a stifling of innovation or a suppression of marginalized viewpoints when LLMs are applied to tasks like research or legal analysis. “Cultural” or “contextual” bias can arise from training a model in one context and deploying it in another, while linguistic bias can manifest as a decreased ability to respond to queries from non-native or non-standard dialect speakers.
A successful round of bias testing must involve a careful consideration of how and against whom bias could manifest for a given use case, as well as of the possible effects of that bias. A retail chatbot, for example, should be tested on a representative dataset of customer service queries from both native and non-native English speakers.
Benchmarks can help identify model vulnerabilities, both to external threats and to internal failings. Garak (the Generative AI Red-teaming & Assessment Kit) “probes for hallucination, data leakage, prompt injection, misinformation, toxicity generation, jailbreaks, and many other weaknesses”. PromptBench evaluates a model’s response to human-like errors (typos, etc.), while TruthfulQA checks if a model replicates human misconceptions or conspiracies learned during training (e.g., that climate change is a hoax or that coughing can stop a heart attack). HaluEval consists of a dataset of ChatGPT-generated hallucinations that can be used to assess a model’s ability to detect hallucinations. Finally, FreshQA gauges a model’s propensity for generating hallucinations by testing it on questions with false premises (e.g., “In which round did Novak Djokovic lose at the 2022 Australian Open?”), as well as on questions requiring fast-changing knowledge (e.g., “What was the highest-grossing movie last week?”).
What’s missing?
The scale, nature, and relative importance of the harms that can stem from various model vulnerabilities is partly context dependent. A therapy chatbot that generates medical misinformation poses a greater threat than one that hallucinates random facts about the world. The associated risk is also heightened by the vulnerable state of the therapy bot’s user - in contrast, a writing aid that invents legal theories or a teaching tool with a poor understanding of world events (a case where hallucinating random facts would be particularly problematic) have a lower probability of causing serious harm. When conducting vulnerability tests it is crucial to determine what specific failures could translate to real harms and to curate testing sets accordingly.
Despite the clear importance of testing for misinformation - particularly domain-specific misinformation - there is also risk associated with upholding LLM benchmarks as fixed arbiters of truth. An alternative approach is to use datasets like TriviaQA or Natural Questions (NQ) to measure how well a model’s answers reflect a source text.
HELM Instruct evaluates how helpful and harmless a model’s responses are. What constitutes “helpfulness” and “harmfulness” was left up to human crowdworkers to decide, with the latter term encompassing toxicity, abuse, and violence.
What's missing?
HELM Instruct’s notion of helpfulness is more applicable to some LLM applications than to others. Crowdworkers were told to pose questions of a model or to solicit help with writing, editing, planning, or decision making. They were then asked to select the more helpful of two possible responses. HELM Instruct may therefore confer little information about a model’s helpfulness in completing unrelated tasks like translation or document retrieval. Other metrics to assess the helpfulness of an LLM interaction could include the interaction’s length, the tone of the model’s communications, and the level of user satisfaction.
As discussed above, which “bad” responses truly cause harm can be highly situational. HELM Instruct’s ‘harmfulness’ dataset was built by crowdworkers attempting to elicit toxic responses or aid in undertaking harmful actions. Most of these actions would only be tenuously relevant to any given use case - for instance, a customer service chatbot that can be convinced to offer a 100% discount (which is unlikely to appear in a standardized dataset like HELM’s) can cause more harm than one that will output instructions on concocting a poison, especially since the latter can readily be found online.
Benchmarks like CopyrightCatcher can check whether an LLM outputs copyrighted material - thereby confirming its presence in the model’s training set.
What’s missing?
Nearly all LLM training sets will contain some percentage of copyrighted text - until legal debates about the use of such material are settled in various jurisdictions, this benchmark will remain largely meaningless. Moreover, developers can easily apply similar tests themselves to prevent the issuing of copyrighted text - ChatGPT, for example, refuses to output the first paragraph of Harry Potter and the Philosopher’s Stone. It will, however, rewrite it in a Scottish brogue, a “valley girl” accent, or with Cockney rhyming slang, indicating that its training set clearly does contain the offending paragraph, something that a standard copyright test would not be able to detect. Similarly, copyright tests would fail to detect more subtle forms of plagiarism - for instance, a story that follows the exact plot and character beats (but uses different words) as an existing work.
A similar but arguably more vital test would involve assessing a model’s susceptibility to disclose personal or sensitive data - particularly data collected through user interactions - either through accidental leakage or deliberate extraction. One of Garak’s tests partially tackles this.
That’s a lot of tests to parse (you almost need a test just to determine which tests to employ!), but there are still more LLM risks that existing tests - both generic and personalized - fail to address. Some, like lack of robustness, can manifest in such a large variety of ways that creating a comprehensive suite of tests would prove challenging (benchmarks like PromptBench cover only certain types of input variation). Others, like lack of explainability, face technical barriers: how do you benchmark an attribute that no current LLM possesses? Still others would require evaluating not just a model’s outputs, but the model itself (e.g., environmental harms), the actions and policies of the developer or deployer (e.g., the exploitation of workers for data labeling, or a company’s deliberate lack of public transparency), or the model’s impact on real individuals (e.g., unhealthy attachments to chatbots among vulnerable users, or social fragmentation and distrust arising from an increase in highly personalized disinformation).
In addition to use case-specific testing, more benchmarks for different LLM applications are also needed (these can subsequently be further customized for various use cases). Ragas, for instance, was designed specifically for assessing Retrieval Augmented Generation (RAG) systems (LLM applications that combine information retrieval with text generation in order to respond to prompts). Other benchmarks described above - HaluEval, FreshQA, TriviaQA, and Natural Questions - are also well suited for evaluating RAG-based LLMs.
Ultimately, there are no shortcuts to creating safe, unbiased, and high-performing LLMs. While generalized benchmarks may signal an overall “good” performance, curated, context-relevant tests are essential for ensuring that your model genuinely embodies the above qualities. Selecting appropriate benchmarks can be challenging: a solid understanding of user expectations is required, as is deep technical, regulatory, and domain-related expertise. So, whether you’ve built a plagiarism detector or a retail RAG, your enterprise should prioritize rigorous, context-informed testing in order to reduce the risks and reap the rewards of your new and exciting LLM.