Testing AI

Testing math is easy: it’s either right or wrong. Testing spelling is easy too. But testing whether a joke is funny - now that’s tough. Let’s talk about how to test AI.

Concepts

Tracing

I started by using LangSmith, but now I use ell, which has tracing built in.
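
Here’s roughly what that looks like - a minimal sketch assuming ell’s decorator-style API (the store path, model name, and prompt are just examples; check your ell version):

import ell

# Storing invocations locally gives a browsable trace of every call:
# prompts, outputs, and versions of the prompt functions.
ell.init(store="./ell_logdir", autocommit=True)


@ell.simple(model="gpt-4o-mini")
def improv_coach(scene: str):
    """You are a funny improv coach."""  # docstring acts as the system prompt
    return f"Give feedback on this scene: {scene}"  # return value is the user prompt


if __name__ == "__main__":
    print(improv_coach("Two pirates argue about who gets the last parrot."))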

LLM as Judge

Fantastic post on using an LLM as a judge.

This is the future; promptfoo also supports this well.
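
To make the idea concrete, here’s a minimal LLM-as-judge sketch using the OpenAI Python client. The rubric, the 1-10 scale, and the model name are my assumptions, not anything standard:

from openai import OpenAI

client = OpenAI()


def judge_funny(joke: str) -> int:
    """Ask an LLM to rate a joke 1-10 against a simple rubric."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a strict comedy judge."},
            {
                "role": "user",
                "content": "Rate this joke from 1 (not funny) to 10 (hilarious). "
                f"Reply with only the number.\n\n{joke}",
            },
        ],
    )
    return int(response.choices[0].message.content.strip())


if __name__ == "__main__":
    print(judge_funny("I told my computer a joke about UDP, but I'm not sure it got it."))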

Testing across prompts

I use PromptFoo for this, but haven’t really done it yet.

Testing across models

  1. PromptFoo - For heavy lifts
  2. https://nat.dev - For ad-hoc experimentation
  3. Just keep the output of both - The cheater’s way

Here’s an example where I use multiple LLMs to generate output and just keep both.

Below are the explanations generated for my commit that does this.

Notice that Claude does a better job getting the gist of it.

– claude-3-opus-20240229 – Add support for generating commit messages from multiple LLMs concurrently

  • Allows generating commit messages from multiple LLMs (OpenAI and Anthropic Claude) in parallel to compare and choose the best one

– gpt-4-turbo-2024-04-09 – Refactor build_commit function for asynchronous operation and enhance instruction formatting

  • Shift towards asynchronous programming for handling multiple language models concurrently, improving scalability and performance.
  • Update the instruction documentation to make it more structured and clear, following a markdown format for better readability.
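
For reference, here’s a minimal sketch of this “keep both” approach - not the actual commit script, just two concurrent calls via asyncio with both outputs printed (model names and the prompt are examples):

import asyncio

from anthropic import AsyncAnthropic
from openai import AsyncOpenAI


async def ask_openai(prompt: str) -> str:
    client = AsyncOpenAI()
    resp = await client.chat.completions.create(
        model="gpt-4-turbo", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content


async def ask_claude(prompt: str) -> str:
    client = AsyncAnthropic()
    msg = await client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text


async def main():
    prompt = "Summarize this git diff as a commit message: <diff goes here>"
    # Run both models in parallel and keep both outputs for side-by-side comparison.
    openai_out, claude_out = await asyncio.gather(ask_openai(prompt), ask_claude(prompt))
    print("-- gpt-4-turbo --\n", openai_out)
    print("-- claude-3-opus --\n", claude_out)


if __name__ == "__main__":
    asyncio.run(main())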

Examples

Who is the funnier LLM

One of my AI creations is an improv coach. A key requirement of said coach is that it be funny. How the heck can I tell whether GPT-3.5, GPT-4, or Claude is the funniest?

I found a package called PromptFoo, which does most of what I want.

See the code here, and you can interact with a live file here.

Who is the better git summarizer

You can see my PromptFoo test cases here.

How do you get the test cases? I’d recommend recording a trace with LangSmith, then copying/exporting the inputs from the LangChain trace and writing them to JSON (todo: add a script for this - a hypothetical sketch follows).
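
This is what such a script might look like. It assumes the export is a JSON list of objects with an “input” field and that the prompt takes a diff variable - both are assumptions, so adjust to whatever your export actually contains:

import json

# Assumed file name and shape for the exported trace data.
with open("trace_export.json") as f:
    runs = json.load(f)

# promptfoo accepts a JSON/YAML list of test cases, each with a `vars` mapping.
tests = [{"vars": {"diff": run["input"]}} for run in runs]

with open("promptfoo_tests.json", "w") as f:
    json.dump(tests, f, indent=2)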

And the output of the run in PromptFoo.

Note: you can assess these as a human, and you can also have an LLM grade them against criteria:

tests:
  - assert:
      - type: llm-rubric
        value: ensure the diff is described well

Eval Systems

Human-based blind taste tests: Chatbot Arena

The gold standard for deciding which LLM is best is asking users to judge. Chatbot Arena does this. From their paper:

Large Language Models (LLMs) have unlocked new capabilities and applications; however, evaluating the alignment with human preferences still poses significant challenges. To address this issue, we introduce Chatbot Arena, an open platform for evaluating LLMs based on human preferences. Our methodology employs a pairwise comparison approach and leverages input from a diverse user base through crowdsourcing. The platform has been operational for several months, amassing over 240K votes. This paper describes the platform, analyzes the data we have collected so far, and explains the tried-and-true statistical methods we are using for efficient and accurate evaluation and ranking of models. We confirm that the crowdsourced questions are sufficiently diverse and discriminating and that the crowdsourced human votes are in good agreement with those of expert raters. These analyses collectively establish a robust foundation for the credibility of Chatbot Arena. Because of its unique value and openness, Chatbot Arena has emerged as one of the most referenced LLM leaderboards, widely cited by leading LLM developers and companies. Our demo is publicly available at \url{this https URL}.

Note: an Elo rating is better than a straight rank. It’s what’s used in chess ratings. TL;DR from GPT:

The Elo rating system provides a more dynamic and precise measurement of a player’s skill level compared to a strict ranking system. In a strict rank system, ranks are usually assigned based on the order of finish in competitions or through a simple win/loss record without considering the strength of the opponents. This can sometimes lead to misleading ranks when players have not played opponents of equal skill.

The Elo system, however, adjusts a player’s rating based on the expected outcome of each game, taking into account the skill levels of the opponents. This means that beating a higher-rated player will gain you more points than beating a lower-rated one, and losing to a lower-rated player will cost you more points. As a result, the Elo rating is a more accurate reflection of a player’s true skill level and provides a more nuanced understanding of how players compare to each other.
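
To make that concrete, here’s a sketch of a single Elo update using the standard chess formula (K=32 is just a common default):

# Expected score for A: E_A = 1 / (1 + 10 ** ((R_B - R_A) / 400)),
# then the update: R_A' = R_A + K * (S_A - E_A).
def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32):
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - expected_a))
    return new_a, new_b


# Beating a higher-rated opponent moves the ratings more than an expected win.
print(elo_update(1500, 1600, a_won=True))  # underdog wins: big gain for A
print(elo_update(1600, 1500, a_won=True))  # favorite wins: small gain for A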

Eval Data Sets

Building good “generic” eval data sets is hard; here are some:

  • Big Bench - a large, collaboratively built collection of hard task prompts

Testing Theory

Good books

Simplest form of testing

Before testing:

  • Come up with a list of tasks (questions) and answers

Test Time:

  • Have the system perform those tasks and write down the answer

Eval Time:

  • Check tasks against answers
  • Print score
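
Putting the three phases together, a minimal harness might look like this sketch, where ask_llm is a placeholder for whatever system you’re testing:

# Before testing: a list of tasks (questions) with known answers.
tasks = [
    ("What is 2 + 2?", "4"),
    ("Spell 'necessary'", "necessary"),
]


def ask_llm(question: str) -> str:
    raise NotImplementedError("call your model here")


def run_eval() -> float:
    # Test time: have the system perform each task and record the answer.
    results = [(expected, ask_llm(question)) for question, expected in tasks]
    # Eval time: check tasks against answers and print a score.
    score = sum(expected == actual for expected, actual in results) / len(results)
    print(f"Score: {score:.0%}")
    return score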

Wrinkle - A/B Testing

Sometimes we want to see which of two variants, A or B, does better on the same tasks (sketched below).

At test time:

  • Have both A and B attempt the task

Eval Time:

  • See who did better, A or B
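
Here’s a sketch that extends the harness above to A/B; system_a, system_b, and is_better are placeholders for your two variants and your comparison rule:

def ab_test(tasks, system_a, system_b, is_better):
    a_wins = b_wins = 0
    for question, expected in tasks:
        # Test time: have both A and B attempt the task.
        answer_a = system_a(question)
        answer_b = system_b(question)
        # Eval time: `is_better` decides the winner - exact match, a rubric, or a judge.
        if is_better(answer_a, answer_b, expected):
            a_wins += 1
        else:
            b_wins += 1
    print(f"A won {a_wins}, B won {b_wins}")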

Wrinkle - No Known Answer

Sometimes there isn’t a known answer - in that case, we can have a judge score the answers.

Eval Time:

Have a judge give a subjective score.

Judges are subjective, so we can have multiple judges and average their scores - like we do in boxing matches or work performance reviews.
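
A sketch of that averaging, where each judge is any callable returning a numeric score (the LLM judge sketch from earlier could be one of them):

from statistics import mean


def panel_score(answer: str, judges) -> float:
    # Average across judges to smooth out any single judge's bias.
    return mean(judge(answer) for judge in judges)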

Wrinkle - No clear questions

Before testing:

  • Have an expert create a list of tasks