Introducing BenchLLM

The best way to evaluate LLM-powered apps

Use BenchLLM to evaluate your code on the fly. Build test suites for your models and generate quality reports. Choose between automated, interactive or custom evaluation strategies.

from benchllm import SemanticEvaluator, Test, Tester
from langchain.agents import AgentType, initialize_agent, load_tools
from langchain.llms import OpenAI


# Keep your code organized in the way you like
def run_agent(input: str):
    llm = OpenAI(temperature=0)
    agent = initialize_agent(
        load_tools(["serpapi", "llm-math"], llm=llm),
        llm=llm,
        agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION
    )
    return agent(input)["output"]


# Instantiate your Test objects
tests = [
    Test(
        input="When was V7 founded? Divide it by 2",
        expected=["1009", "That would be 2018 / 2 = 1009"]
    )
]


# Use a Tester object to generate predictions
tester = Tester(run_agent)
tester.add_tests(tests)
predictions = tester.run()


# Use an Evaluator object to evaluate your model
evaluator = SemanticEvaluator(model="gpt-3")
evaluator.load(predictions)
evaluator.run()

Built by AI engineers for AI engineers

We are a team of engineers who love building AI products and tools for other engineers. We don't want to choose between the power and flexibility of AI and predictable results, so we built the open, flexible LLM evaluation tool we always wished we had.

Powerful CLI

Run and evaluate models with simple and elegant CLI commands. Use the CLI as a testing tool for your CI/CD pipeline. Monitor model performance and detect regressions in production.

$ bench run tests/hallucinations

================================== Run Tests ==================================

tests/hallucinations/eval.py:22 ..

=============================== Evaluate Tests ================================

tests/hallucinations/eval.py:22 .F

================================== Failures ===================================

------- tests/hallucinations/eval.py:22 :: tests/hallucinations/1.yml ---------

┌─────────────┬────────────────────────────────────────────────────────────┐

│ Input │ Which country won the most medals at the Olympics 2024? │

├─────────────┼────────────────────────────────────────────────────────────┤

│ Output │ It's likely to be either USA or China. │

├─────────────┼────────────────────────────────────────────────────────────┤

│ Expected #1 │ I don't know │

├─────────────┼────────────────────────────────────────────────────────────┤

│ Expected #2 │ I don't know because the event hasn't taken place yet │

└─────────────┴────────────────────────────────────────────────────────────┘

========= 1 failed, 1 passed, in 2.4s (cached hits 1, cached misses 1) ========
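
Each test case in the run above is just a small YAML file. As a rough sketch, assuming the field names mirror the Test(input=..., expected=[...]) API shown earlier, tests/hallucinations/1.yml could look something like this:

# Sketch of tests/hallucinations/1.yml, with keys assumed from the Python API above
input: Which country won the most medals at the Olympics 2024?
expected:
  - I don't know
  - I don't know because the event hasn't taken place yet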

Flexible API

Test your code on the fly. BenchLLM supports OpenAI, Langchain, and any other API out of the box. Use multiple evaluation strategies and visualize insightful reports.

import benchllm
from benchllm.input_types import ChatInput

import openai


def chat(messages: ChatInput):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages
    )
    return response.choices[0].message.content.strip()


@benchllm.test(suite=".")
def run(input: ChatInput):
    return chat(input)
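
Functions decorated with @benchllm.test are discovered by the bench CLI, so a suite like the one above runs with the same bench run command shown earlier, pairing the decorated function with the YAML test files in its suite directory.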


Easy evaluation for your LLM apps

Define your tests intuitively

Define your tests intuitively in JSON or YAML format.

Organize tests

Organize your tests into suites that can be easily versioned.

Support for OpenAI

Support for OpenAI, Langchain, or any other API out of the box.

Automation

Automate your evaluations in a CI/CD pipeline (see the sketch below).

Generate reports

Generate evaluation reports and share them with your team.

Monitor model performance

Monitor model performance and detect regressions in production.
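
To make the CI/CD idea concrete, here is a minimal sketch of a GitHub Actions job that installs BenchLLM from PyPI and runs a test suite on every push. The workflow file name, Python version, suite path, and secret name are illustrative assumptions rather than part of BenchLLM:

# .github/workflows/benchllm.yml -- illustrative sketch, not an official template
name: LLM evaluation
on: [push]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - run: pip install benchllm
      # The model under test reads its API key from the environment (secret name assumed)
      - name: Run BenchLLM suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: bench run tests/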

Start evaluating today

Built and maintained with ♥ by V7

Share your feedback, ideas and contributions with Simon Edwardsson or Andrea Azzini
