Testing AI
Testing math is easy: it’s right or wrong. Testing spelling is easy too, but testing whether a joke is funny - now that’s tough. Let’s talk about how to test AI.
Concepts
Tracing
I started by using LangSmith, but now I use ell, which has tracing built in.
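Roughly, here’s what that looks like with ell - `ell.init` points the tracer at a local store, and every decorated call gets logged there. The model name, store path, and prompt below are just placeholders, not my real setup.

```python
import ell

# Point ell's built-in tracing/versioning at a local store;
# every decorated call below gets recorded there.
ell.init(store="./ell_logs", autocommit=True)

@ell.simple(model="gpt-4o-mini")
def improv_coach(scene: str):
    """You are a supportive improv coach."""  # docstring becomes the system prompt
    return f"Give one funny 'yes, and' suggestion for this scene: {scene}"

print(improv_coach("Two astronauts arguing about pizza toppings"))
```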
LLM as Judge
Fantastic post on using an LLM as a judge.
This is the future; PromptFoo also supports it well.
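At its core, an LLM judge is just another model call with a rubric in the prompt. A minimal hand-rolled sketch (PromptFoo’s llm-rubric assertion does this more robustly; the judge model and rubric here are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def judge(output: str, rubric: str) -> bool:
    """Ask a (hopefully stronger) model whether the output satisfies the rubric."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a strict grader. Answer PASS or FAIL only."},
            {"role": "user", "content": f"Rubric: {rubric}\n\nOutput to grade:\n{output}"},
        ],
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")

print(judge("Why did the chicken cross the road? To get to the other side.",
            "The joke is actually funny"))
```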
Testing across prompts
I use PromptFoo, but haven’t really done this.
Testing across models
- PromptFoo - For heavy lifts
- https://nat.dev - For ad-hoc experimentation
- Just keep the output of both - The cheater’s way
Here’s an example where I use multiple LLMs to generate output and just keep both.
For example, here’s the explanation of My commit that does this.
Notice that Claude does a better job getting the gist of it.
- claude-3-opus-20240229 - Add support for generating commit messages from multiple LLMs concurrently
  - Allows generating commit messages from multiple LLMs (OpenAI and Anthropic Claude) in parallel to compare and choose the best one
- gpt-4-turbo-2024-04-09 - Refactor build_commit function for asynchronous operation and enhance instruction formatting
  - Shift towards asynchronous programming for handling multiple language models concurrently, improving scalability and performance.
  - Update the instruction documentation to make it more structured and clear, following a markdown format for better readability.
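The pattern behind that commit is small: fire both models at the same diff concurrently and keep both answers. A rough sketch of the idea (not the actual script; the model names and prompt are illustrative):

```python
import asyncio
from openai import AsyncOpenAI
from anthropic import AsyncAnthropic

openai_client = AsyncOpenAI()        # assumes OPENAI_API_KEY is set
anthropic_client = AsyncAnthropic()  # assumes ANTHROPIC_API_KEY is set

PROMPT = "Write a conventional commit message for this diff:\n\n{diff}"

async def openai_commit(diff: str) -> str:
    resp = await openai_client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": PROMPT.format(diff=diff)}],
    )
    return resp.choices[0].message.content

async def claude_commit(diff: str) -> str:
    resp = await anthropic_client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=512,
        messages=[{"role": "user", "content": PROMPT.format(diff=diff)}],
    )
    return resp.content[0].text

async def build_commit(diff: str) -> dict:
    # Run both models concurrently and keep both answers for comparison.
    gpt, claude = await asyncio.gather(openai_commit(diff), claude_commit(diff))
    return {"gpt-4-turbo": gpt, "claude-3-opus": claude}

if __name__ == "__main__":
    print(asyncio.run(build_commit("diff --git a/commit.py b/commit.py ...")))
```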
Examples
Who is the funnier LLM
One of my AI creations is an improv coach. A key requirement of said coach is that it is funny. How the heck can I tell whether GPT-3.5, GPT-4, or Claude is the funniest?
I found a package called PromptFoo, which does most of what I want.
See the code here, and you can interact with a live file here.
Who is the better git summarizer
You can see my PromptFoo test cases here.
How do you get the test cases? I’d recommend recording a trace with LangSmith, then copying/exporting the cases from the LangChain trace and writing them to JSON (TODO: add a script for this).
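A rough sketch of what that export could look like, assuming the LangSmith Python SDK’s `Client.list_runs`; the project name and output shape are placeholders:

```python
import json
from langsmith import Client  # assumes a LangSmith API key is configured in the environment

client = Client()

# Pull recent root runs from a project and keep just the input/output pairs as test cases.
cases = []
for run in client.list_runs(project_name="git-summarizer", is_root=True, limit=20):
    cases.append({"vars": run.inputs, "expected": run.outputs})

with open("test_cases.json", "w") as f:
    json.dump(cases, f, indent=2, default=str)
```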
And the output of the run in PromptFoo.
Note that you can both assess these as a human and have an LLM check them against criteria:
```yaml
tests:
  - assert:
      - type: llm-rubric
        value: ensure the diff is described well
```
Eval Systems
Human-based blind taste tests: Chatbot Arena
The gold standard for which LLM is best is asking users to judge. Chatbot Arena does this; from their paper:
Large Language Models (LLMs) have unlocked new capabilities and applications; however, evaluating the alignment with human preferences still poses significant challenges. To address this issue, we introduce Chatbot Arena, an open platform for evaluating LLMs based on human preferences. Our methodology employs a pairwise comparison approach and leverages input from a diverse user base through crowdsourcing. The platform has been operational for several months, amassing over 240K votes. This paper describes the platform, analyzes the data we have collected so far, and explains the tried-and-true statistical methods we are using for efficient and accurate evaluation and ranking of models. We confirm that the crowdsourced questions are sufficiently diverse and discriminating and that the crowdsourced human votes are in good agreement with those of expert raters. These analyses collectively establish a robust foundation for the credibility of Chatbot Arena. Because of its unique value and openness, Chatbot Arena has emerged as one of the most referenced LLM leaderboards, widely cited by leading LLM developers and companies.
Note that an Elo rating is better than a straight rank; it’s what’s used for chess ratings. TL;DR from GPT:
The Elo rating system provides a more dynamic and precise measurement of a player’s skill level compared to a strict ranking system. In a strict rank system, ranks are usually assigned based on the order of finish in competitions or through a simple win/loss record without considering the strength of the opponents. This can sometimes lead to misleading ranks when players have not played opponents of equal skill.
The Elo system, however, adjusts a player’s rating based on the expected outcome of each game, taking into account the skill levels of the opponents. This means that beating a higher-rated player will gain you more points than beating a lower-rated one, and losing to a lower-rated player will cost you more points. As a result, the Elo rating is a more accurate reflection of a player’s true skill level and provides a more nuanced understanding of how players compare to each other.
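Concretely, the update rule is just a couple of lines: an expected score from the rating gap, then a K-factor times how surprising the result was.

```python
def expected_score(r_a: float, r_b: float) -> float:
    # Probability that A beats B under the Elo model.
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32) -> tuple[float, float]:
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    # Points gained or lost scale with how surprising the result was.
    return r_a + k * (s_a - e_a), r_b + k * ((1 - s_a) - (1 - e_a))

# An upset (1400 beats 1600) moves the ratings much more than an expected win would.
print(elo_update(1400, 1600, a_won=True))  # roughly (1424.3, 1575.7)
```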
Eval Data Sets
Building good “generic” eval data sets is hard; here are some:
- Big Bench - a bunch of hard question prompts
Testing Theory
Simplest form of testing
Before testing:
- Come up with a list of tasks (questions) and answers
Test Time:
- Have the system perform those tasks and write down the answer
Eval Time:
- Check tasks against answers
- Print score
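In code, that whole loop fits on a screen; `system_under_test` below is a stand-in for whatever LLM call you’re actually testing.

```python
def system_under_test(question: str) -> str:
    # Placeholder: swap in your real LLM call here.
    return "4" if "2 + 2" in question else "definately"

# Before testing: a list of tasks and known answers.
tasks = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "Spell 'definitely'.", "answer": "definitely"},
]

# Test time: have the system perform each task and write down its answer.
results = [(t, system_under_test(t["question"])) for t in tasks]

# Eval time: check tasks against answers and print a score.
score = sum(1 for t, got in results if t["answer"].lower() in got.lower())
print(f"{score}/{len(tasks)} correct")
```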
Wrinkle - A/B Testing
Sometimes we want to compare two variants, A and B, and see which one does better.
At test time:
- Have both A and B attempt the task
Eval Time:
- See who did better, A or B
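One way to run that comparison without a human in the loop is to hand both outputs to a judge model and ask it to pick a winner. A sketch (the judge model and tasks are placeholders; in practice you’d also swap the A/B positions to control for order bias):

```python
from openai import OpenAI

client = OpenAI()

def pick_winner(task: str, a: str, b: str) -> str:
    """Ask a judge model which candidate answered the task better."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                f"Task: {task}\n\nAnswer A:\n{a}\n\nAnswer B:\n{b}\n\n"
                "Which answer is better? Reply with exactly 'A' or 'B'."
            ),
        }],
    )
    return resp.choices[0].message.content.strip().upper()

wins = {"A": 0, "B": 0}
for task, out_a, out_b in [("Summarize this diff: ...", "output from A", "output from B")]:
    winner = pick_winner(task, out_a, out_b)
    if winner in wins:
        wins[winner] += 1
print(wins)
```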
Wrinkle - No Known Answer
Sometimes there isn’t a known answer - in that case, we can have a judge score the answers.
Eval Time:
- Have a judge give a subjective score
Judges are subjective, so we can have multiple judges and average their scores - like we do in boxing matches or work performance reviews.
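A sketch of that panel-of-judges idea - each judge model scores 1-10 against the same criteria and we average (the model names and criteria are placeholders):

```python
from statistics import mean
from openai import OpenAI

client = OpenAI()
JUDGE_MODELS = ["gpt-4o", "gpt-4o-mini"]  # any judge models you trust

def score(output: str, criteria: str, model: str) -> int:
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Score this output from 1-10 on: {criteria}\n\n{output}\n\nReply with just the number.",
        }],
    )
    return int(resp.choices[0].message.content.strip())

def panel_score(output: str, criteria: str) -> float:
    # Average several judges to smooth out any single judge's quirks.
    return mean(score(output, criteria, m) for m in JUDGE_MODELS)

print(panel_score("Why did the improv troupe cross the road? Yes, and...", "how funny it is"))
```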
Wrinkle - No Clear Questions
Before testing:
- Have an expert create a list of tasks