DeepEval
Evaluate your LLM applications using DeepEval and Clarifai
DeepEval is an open-source evaluation framework for LLM applications. It provides a rich set of metrics to evaluate hallucination, answer relevancy, contextual precision, faithfulness, toxicity, and other qualities that matter in production LLM systems.
By integrating Clarifai with DeepEval, you can use Clarifai-hosted models as both the models being evaluated and as judge models that score evaluation metrics — giving you a fully self-contained evaluation pipeline on the Clarifai platform.
Prerequisites
Before getting started, make sure you have the following:
- A Clarifai account with a valid Personal Access Token (PAT)
- Python 3.8 or later
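The wrapper later in this guide reads the PAT from the `CLARIFAI_PAT` environment variable. As a minimal sketch (the helper name is illustrative, not part of any SDK), you can fail fast when the variable is missing:

```python
import os


def get_clarifai_pat(env_var: str = "CLARIFAI_PAT") -> str:
    """Read the Clarifai PAT from the environment, raising a clear error if unset."""
    pat = os.environ.get(env_var)
    if not pat:
        raise RuntimeError(f"Set the {env_var} environment variable to your Clarifai PAT.")
    return pat
```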
Installation
Install the required packages:

- `deepeval` — the DeepEval framework, which provides evaluation metrics and test case utilities.
- `openai` — the OpenAI client library, used for the OpenAI-compatible connection to Clarifai.

Install both with a single command:

```bash
pip install deepeval openai
```
Create a Clarifai LLM Wrapper
DeepEval requires a custom model class that conforms to its `DeepEvalBaseLLM` interface. The following wrapper connects to any Clarifai-hosted model through Clarifai's OpenAI-compatible endpoint:
```python
import asyncio

from deepeval.models.base_model import DeepEvalBaseLLM
from openai import OpenAI


class ClarifaiLLM(DeepEvalBaseLLM):
    """DeepEval model wrapper for any Clarifai-hosted LLM."""

    def __init__(self, model_url, pat):
        self.model_url = model_url
        self.client = OpenAI(
            base_url="https://api.clarifai.com/v2/ext/openai/v1",
            api_key=pat,
        )

    def load_model(self):
        return self.client

    def generate(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model_url,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=1024,
        )
        return response.choices[0].message.content

    def generate_raw_response(self, prompt: str, **kwargs):
        # DeepEval expects a (response, cost) tuple; cost tracking is not used here.
        return self.generate(prompt), 0

    async def a_generate(self, prompt: str) -> str:
        # Run the synchronous client call in an executor to avoid blocking the event loop.
        loop = asyncio.get_event_loop()
        return await loop.run_in_executor(None, self.generate, prompt)

    async def a_generate_raw_response(self, prompt: str, **kwargs):
        result = await self.a_generate(prompt)
        return result, 0

    def get_model_name(self):
        return f"Clarifai: {self.model_url}"
```
You can use any model from the Clarifai Community as the judge model by changing the `model_url` parameter to any publicly accessible Clarifai model URL.
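Clarifai Community model URLs follow the pattern `https://clarifai.com/{user}/{app}/models/{model-id}`, as in the judge URL used later in this guide. As a hypothetical convenience (not part of DeepEval or the Clarifai SDK), a small helper can sanity-check a URL before handing it to the wrapper:

```python
from urllib.parse import urlparse


def split_clarifai_model_url(model_url: str) -> tuple:
    """Split a Clarifai model URL into (user, app, model_id); raise on malformed input."""
    parsed = urlparse(model_url)
    segments = [s for s in parsed.path.split("/") if s]
    # Expect exactly: /{user}/{app}/models/{model_id} on the clarifai.com host.
    if parsed.netloc != "clarifai.com" or len(segments) != 4 or segments[2] != "models":
        raise ValueError(f"Not a Clarifai model URL: {model_url}")
    return segments[0], segments[1], segments[3]
```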
Run an Evaluation
Once the wrapper is in place, you can define evaluation metrics and run them against test cases. The example below uses `GEval`, a flexible criteria-based metric, to evaluate the factual correctness of a model's response:

```python
import asyncio
import os

from deepeval.metrics import GEval
from deepeval.models.base_model import DeepEvalBaseLLM
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from openai import OpenAI


class ClarifaiLLM(DeepEvalBaseLLM):
    """DeepEval model wrapper for any Clarifai-hosted LLM."""

    def __init__(self, model_url, pat):
        self.model_url = model_url
        self.client = OpenAI(
            base_url="https://api.clarifai.com/v2/ext/openai/v1",
            api_key=pat,
        )

    def load_model(self):
        return self.client

    def generate(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model_url,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=1024,
        )
        return response.choices[0].message.content

    def generate_raw_response(self, prompt: str, **kwargs):
        # DeepEval expects a (response, cost) tuple; cost tracking is not used here.
        return self.generate(prompt), 0

    async def a_generate(self, prompt: str) -> str:
        loop = asyncio.get_event_loop()
        return await loop.run_in_executor(None, self.generate, prompt)

    async def a_generate_raw_response(self, prompt: str, **kwargs):
        result = await self.a_generate(prompt)
        return result, 0

    def get_model_name(self):
        return f"Clarifai: {self.model_url}"


# Initialize Clarifai as the judge model
clarifai_llm = ClarifaiLLM(
    model_url="https://clarifai.com/openai/chat-completion/models/gpt-oss-120b",
    pat=os.environ["CLARIFAI_PAT"],
)

# Define a correctness metric
metric = GEval(
    model=clarifai_llm,
    name="Correctness",
    criteria="Determine whether the actual output is factually correct.",
    evaluation_steps=[
        "Check whether facts in 'actual output' contradict 'expected output'",
        "Penalize omission of important details",
    ],
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)

# Create a test case
test_case = LLMTestCase(
    input="Who ran up the tree?",
    actual_output="The cat ran up the tree.",
    expected_output="The cat.",
)

# Run the evaluation
metric.measure(test_case)
print(f"Score: {metric.score}")
print(f"Reason: {metric.reason}")
```
Example Output
```text
Score: 0.8
Reason: The actual output correctly identifies the cat, so it does not contradict the expected answer. It adds extra information ('ran up the tree') which is not required, but no important details are omitted, resulting in a high but not perfect score.
```
`metric.score` is a float between 0 and 1, and `metric.reason` is a natural-language explanation of the score, produced by the judge model.
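In CI pipelines, a raw score is typically reduced to pass/fail against a minimum threshold (DeepEval metrics support this natively; the helper below is a standalone sketch of the same idea, with an illustrative function name):

```python
def passes_threshold(score: float, threshold: float = 0.5) -> bool:
    """Return True when a judge score meets the minimum acceptable threshold."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"Score must be in [0, 1], got {score}")
    return score >= threshold
```

With the example output above, `passes_threshold(0.8)` would report the test case as passing at the default threshold.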
Supported Metrics
The following DeepEval metrics are compatible with Clarifai models as the judge:
| Metric | What It Measures |
|---|---|
| Answer Relevancy | How relevant the generated answer is to the input question |
| Faithfulness | Whether the response is grounded in the provided context without hallucinating |
| Contextual Precision | Whether the retrieved context ranks relevant items higher than irrelevant ones |
| Contextual Recall | How much of the expected output is supported by the retrieved context |
| Toxicity | Presence of harmful, offensive, or inappropriate content in responses |
| Bias | Whether the response reflects unfair bias toward a group or viewpoint |
| GEval | A customizable metric scored against user-defined criteria and evaluation steps |
For a complete walkthrough including knowledge bases and RAG evaluation, visit the Clarifai Examples Repository.