BenchLLM by V7
Introducing BenchLLM
The best way to evaluate LLM-powered apps
Use BenchLLM to evaluate your code on the fly. Build test suites for your models and generate quality reports. Choose between automated, interactive or custom evaluation strategies.
from benchllm import SemanticEvaluator, Test, Tester
from langchain.agents import AgentType, initialize_agent, load_tools
from langchain.llms import OpenAI

# Keep your code organized in the way you like
def run_agent(input: str):
    llm = OpenAI(temperature=0)
    agent = initialize_agent(
        load_tools(["serpapi", "llm-math"], llm=llm),
        llm=llm,
        agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION
    )
    return agent(input)["output"]

# Instantiate your Test objects
tests = [
    Test(
        input="When was V7 founded? Divide it by 2",
        expected=["1009", "That would be 2018 / 2 = 1009"]
    )
]

# Use a Tester object to generate predictions
tester = Tester(run_agent)
tester.add_tests(tests)
predictions = tester.run()

# Use an Evaluator object to evaluate your model
evaluator = SemanticEvaluator(model="gpt-3")
evaluator.load(predictions)
evaluator.run()
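Because prediction and evaluation are separate steps, the same predictions can be re-scored with a different judge without re-running the agent. A minimal sketch reusing only the objects defined above; "gpt-4" as the judge model is an illustrative choice, not a recommendation from the docs:

# Re-evaluate the predictions generated above with a different judge model
strict_evaluator = SemanticEvaluator(model="gpt-4")
strict_evaluator.load(predictions)
strict_evaluator.run()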
Built by AI engineers, for AI engineers
We are a team of engineers who love building AI products, and we don't believe you should have to choose between the power and flexibility of AI and predictable results. So we built the open, flexible LLM evaluation tool we always wished we had.
Powerful CLI
Run and evaluate models with simple and elegant CLI commands. Use the CLI as a testing tool in your CI/CD pipeline. Monitor model performance and detect regressions in production.
$ bench run tests/hallucinations
================================== Run Tests ==================================
tests/hallucinations/eval.py:22 ..
=============================== Evaluate Tests ================================
tests/hallucinations/eval.py:22 .F
================================== Failures ===================================
------- tests/hallucinations/eval.py:22 :: tests/hallucinations/1.yml ---------
┌─────────────┬────────────────────────────────────────────────────────────┐
│ Input │ Which country won the most medals at the Olympics 2024? │
├─────────────┼────────────────────────────────────────────────────────────┤
│ Output │ It's likely to be either USA or China. │
├─────────────┼────────────────────────────────────────────────────────────┤
│ Expected #1 │ I don't know │
├─────────────┼────────────────────────────────────────────────────────────┤
│ Expected #2 │ I don't know because the event hasn't taken place yet │
└─────────────┴────────────────────────────────────────────────────────────┘
========= 1 failed, 1 passed, in 2.4s (cached hits 1, cached misses 1) ========
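As a rough sketch of how that fits into a CI job, assuming the package is published as benchllm on PyPI and that bench exits with a non-zero status when a test fails (both of which are assumptions here, not claims from the docs):

$ pip install benchllm
$ export OPENAI_API_KEY=...   # the GPT-backed semantic evaluator needs an OpenAI key (assumption)
$ bench run tests/            # fails the job if any test fails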
Flexible API
Test your code on the fly. BenchLLM supports OpenAI, LangChain, and any other API out of the box. Use multiple evaluation strategies and visualize insightful reports.
import benchllm
from benchllm.input_types import ChatInput
import openai

def chat(messages: ChatInput):
    # Call the OpenAI chat API with the conversation from the test file
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages
    )
    return response.choices[0].message.content.strip()

# Register the function as a BenchLLM test target; the suite path points
# at the directory containing the test files
@benchllm.test(suite=".")
def run(input: ChatInput):
    return chat(input)
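With the snippet above saved alongside its test files, the suite can then be run through the same CLI; pointing bench run at "." mirrors the suite="." argument, though the exact test-discovery rules are an assumption here:

$ bench run .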
Easy evaluation for your LLM apps
Define your tests intuitively
Define your tests intuitively in JSON or YAML format (see the sketch after this list).
Organize tests
Organize your tests into suites that can be easily versioned.
Support for OpenAI
Support for OpenAI, LangChain, or any other API out of the box.
Automation
Automate your evaluations in a CI/CD pipeline.
Generate reports
Generate evaluation reports and share them with your team.
Monitor model performance
Monitor model performance and detect regressions in production.
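As a sketch of what one of those YAML tests could look like, mirroring the input/expected fields of the Test object in the first snippet (the exact key names are an assumption, not taken from the docs):

input: When was V7 founded? Divide it by 2
expected:
  - "1009"
  - "That would be 2018 / 2 = 1009"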
Start evaluating today
Built and maintained with ♥ by V7
Share your feedback, ideas and contributions with Simon Edwardsson or Andrea Azzini