Lessons from Implementing the 'Needle in a Haystack' Benchmark

Owen Parsons - February 5, 2024


Introduction

Background

With the rapid development and constant evolution of Large Language Models (LLMs), building robust evaluation frameworks has become increasingly crucial. As these models become both more sophisticated and more ubiquitous, understanding their capabilities and limitations through systematic testing is vital.

I recently had the opportunity to work with the AI Safety Engineering Taskforce (ASET) at Arcadia Impact, where our focus was implementing benchmarks for the Inspect AI framework. Inspect AI, developed by the AI Safety Institute (AISI), represents a comprehensive approach to evaluating and understanding the behavior of language models across various dimensions of performance and safety.

I chose to implement and explore the "Needle in a Haystack" (NIAH) benchmark - a conceptually simple evaluation of long-context retrieval. The benchmark tests an LLM's ability to accurately recall and reference specific information embedded within a larger context, attempting to mimic real-world scenarios where precise information retrieval is crucial. The original implementation of this benchmark was pioneered by Greg Kamradt, who proposed the benchmark and provided an insightful walkthrough of the concept.
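
To make the setup concrete, here's a minimal sketch of the core idea, assuming a simple character-based insertion. The function name, filler text, and needle below are my own illustrative choices rather than the benchmark's actual code:

        # Illustrative sketch of the NIAH setup (not the benchmark's real code):
        # place a "needle" sentence at a chosen depth within a long "haystack"
        # of filler text, then ask the model to retrieve it.

        def build_niah_prompt(haystack: str, needle: str, depth: float, question: str) -> str:
            """Insert `needle` at `depth` (0.0 = start, 1.0 = end) of `haystack`
            and wrap the result in a retrieval question."""
            position = int(len(haystack) * depth)
            context = haystack[:position] + " " + needle + " " + haystack[position:]
            return (
                f"{context}\n\n"
                "Answer the question using only the context above.\n"
                f"Question: {question}"
            )

        # Toy usage: in the real benchmark the haystack is far longer, and both
        # the context length and the needle depth are varied systematically.
        haystack = "Filler text about unrelated topics. " * 200
        needle = "The secret passphrase is 'blue kangaroo'."
        prompt = build_niah_prompt(haystack, needle, depth=0.5,
                                   question="What is the secret passphrase?")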

While the basic premise of the benchmark might seem straightforward - and, in theory, less likely to suffer from issues of brittleness - the process of implementing the benchmark unearthed a few nuances of how language models behave and some considerations for benchmark design.

I'll use this post to capture my main observations and insights from implementing and analysing this benchmark. The aim is to explore not just the technical aspects of this specific implementation, but also the broader implications for LLM eval design. I'll discuss some of the challenges I encountered during implementation, highlight some of the limitations of the NIAH benchmark, and suggest some potential improvements.

Inspect AI

The Inspect AI framework - developed and open-sourced by the UK AI Safety Institute - is a great tool for taking a more structured approach to evaluating LLMs. It provides built-in components that support the development of evaluations, from basic prompt engineering to more complex scenarios involving tool usage and multi-turn dialogues.

Inspect is built around three core components that form the building blocks of an evaluation:

  1. Datasets: These form the key part of most evaluations, consisting of labeled samples organized in a straightforward tabular format. Each sample typically contains an `input` (the prompt we want to test) and a `target` (what we expect or hope to see). The target can be either a specific expected response or guidelines for grading the model's output with a scoring model.
  2. Solvers: These are used for processing the inputs and generating results. Solvers can be chained together to create more complex evaluation flows. The most basic solver, `generate()`, simply passes a prompt to the model and collects its response. More sophisticated solvers can implement prompt engineering techniques, manage multi-turn conversations, or provide scaffolding for agent-based evaluations (there's a sketch of a richer solver chain after the Hello World example below).
  3. Scorers: These evaluate how well the model's output matches our expectations. These can range from simple exact-match comparisons to model-based grading systems that assess responses based on specific criteria.

There's a "Hello World" example on the Inspect AI website that gives you an idea of how these components work together:


        from inspect_ai import Task, task
        from inspect_ai.dataset import Sample
        from inspect_ai.scorer import exact
        from inspect_ai.solver import generate
        
        @task
        def hello_world():
            return Task(
                dataset=[
                    Sample(
                        input="Just reply with Hello World",
                        target="Hello World",
                    )
                ],
                solver=[generate()],
                scorer=exact(),
            )
          
To run this evaluation, you first need to install Inspect:

          pip install inspect-ai
        
Then you can run the evaluation on GPT-4 with:

        inspect eval hello_world.py --model openai/gpt-4
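
To give a flavour of how the components compose beyond exact matching, here's a hedged sketch of a task that chains a system prompt with generation and uses a model-graded scorer. I'm assuming the built-in `system_message()` solver and `model_graded_fact()` scorer behave as described in the Inspect documentation; the dataset itself is just a toy example of my own:

        from inspect_ai import Task, task
        from inspect_ai.dataset import Sample
        from inspect_ai.scorer import model_graded_fact
        from inspect_ai.solver import generate, system_message

        @task
        def capital_cities():
            return Task(
                dataset=[
                    Sample(
                        input="What is the capital of France?",
                        target="Paris",
                    )
                ],
                # solvers run in sequence: set a system prompt, then generate
                solver=[
                    system_message("Answer concisely and factually."),
                    generate(),
                ],
                # a grading model checks the output against the target fact
                scorer=model_graded_fact(),
            )

This runs the same way as the Hello World example, e.g. `inspect eval capital_cities.py --model openai/gpt-4`.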
        
The Inspect AI website has a lot of other examples and documentation for how to use the framework. It's really well written and I recommend checking it out if you're interested in developing your own evaluations!