
DeepEval

Evaluate your LLM applications using DeepEval and Clarifai


DeepEval is an open-source evaluation framework for LLM applications. It provides a rich set of metrics to evaluate hallucination, answer relevancy, contextual precision, faithfulness, toxicity, and other qualities that matter in production LLM systems.

By integrating Clarifai with DeepEval, you can use Clarifai-hosted models as both the models being evaluated and as judge models that score evaluation metrics — giving you a fully self-contained evaluation pipeline on the Clarifai platform.

Prerequisites

Before getting started, make sure you have the following:

  • A Clarifai account and a Personal Access Token (PAT); the examples below read it from the CLARIFAI_PAT environment variable.
  • Python installed, with pip available for installing packages.
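The examples in this guide read your PAT from the CLARIFAI_PAT environment variable. One way to set it for the current shell session (the token value below is a placeholder):

```shell
# Set your Clarifai Personal Access Token for the current shell session.
# Replace the placeholder with your actual PAT from the Clarifai portal.
export CLARIFAI_PAT="your-personal-access-token"
```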

Installation

Install the following packages:

  • deepeval — The DeepEval framework, which provides evaluation metrics and test case utilities.
  • openai — The OpenAI client library used for the OpenAI-compatible connection to Clarifai.

You can install both with a single command:

 pip install deepeval openai 

Create a Clarifai LLM Wrapper

DeepEval requires a custom model class that conforms to its DeepEvalBaseLLM interface. The following wrapper connects to any Clarifai-hosted model via the OpenAI-compatible endpoint:

import asyncio
import os
from deepeval.models.base_model import DeepEvalBaseLLM
from openai import OpenAI

class ClarifaiLLM(DeepEvalBaseLLM):
    """DeepEval-compatible wrapper around a Clarifai-hosted model."""

    def __init__(self, model_url, pat):
        self.model_url = model_url
        # Clarifai exposes an OpenAI-compatible API, so the standard OpenAI
        # client works; authenticate with your Personal Access Token (PAT).
        self.client = OpenAI(
            base_url="https://api.clarifai.com/v2/ext/openai/v1",
            api_key=pat
        )

    def load_model(self):
        return self.client

    def generate(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model_url,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=1024
        )
        return response.choices[0].message.content

    def generate_raw_response(self, prompt: str, **kwargs):
        # DeepEval expects a (response, cost) pair; cost tracking is unused here.
        return self.generate(prompt), 0

    async def a_generate(self, prompt: str) -> str:
        # Run the synchronous generate() in a thread pool for async callers.
        loop = asyncio.get_event_loop()
        return await loop.run_in_executor(None, self.generate, prompt)

    async def a_generate_raw_response(self, prompt: str, **kwargs):
        result = await self.a_generate(prompt)
        return result, 0

    def get_model_name(self):
        return f"Clarifai: {self.model_url}"
tip

You can use any model from the Clarifai Community as the judge model by changing the model_url parameter to any publicly accessible Clarifai model URL.
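Clarifai model URLs follow the pattern https://clarifai.com/{user_id}/{app_id}/models/{model_id}. A small helper like the following (a hypothetical convenience for illustration, not part of DeepEval or Clarifai's SDK) can sanity-check a URL before you pass it to the wrapper:

```python
from urllib.parse import urlparse

def is_clarifai_model_url(url: str) -> bool:
    """Return True if `url` looks like a Clarifai model URL:
    https://clarifai.com/{user_id}/{app_id}/models/{model_id}
    """
    parsed = urlparse(url)
    if parsed.scheme != "https" or parsed.netloc != "clarifai.com":
        return False
    parts = [p for p in parsed.path.split("/") if p]
    # Expect exactly: user_id / app_id / "models" / model_id
    return len(parts) == 4 and parts[2] == "models"

# The judge model URL used elsewhere in this guide:
print(is_clarifai_model_url(
    "https://clarifai.com/openai/chat-completion/models/gpt-oss-120b"
))  # True
```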

Run an Evaluation

Once the wrapper is in place, you can define evaluation metrics and run them against test cases. The example below uses GEval — a flexible, criteria-based metric — to evaluate the factual correctness of a model's response:

import asyncio
import os
from deepeval.models.base_model import DeepEvalBaseLLM
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval
from openai import OpenAI

class ClarifaiLLM(DeepEvalBaseLLM):
    def __init__(self, model_url, pat):
        self.model_url = model_url
        self.client = OpenAI(
            base_url="https://api.clarifai.com/v2/ext/openai/v1",
            api_key=pat
        )

    def load_model(self):
        return self.client

    def generate(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model_url,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=1024
        )
        return response.choices[0].message.content

    def generate_raw_response(self, prompt: str, **kwargs):
        return self.generate(prompt), 0

    async def a_generate(self, prompt: str) -> str:
        loop = asyncio.get_event_loop()
        return await loop.run_in_executor(None, self.generate, prompt)

    async def a_generate_raw_response(self, prompt: str, **kwargs):
        result = await self.a_generate(prompt)
        return result, 0

    def get_model_name(self):
        return f"Clarifai: {self.model_url}"

# Initialize Clarifai as the judge model
clarifai_llm = ClarifaiLLM(
    model_url="https://clarifai.com/openai/chat-completion/models/gpt-oss-120b",
    pat=os.environ["CLARIFAI_PAT"]
)

# Define a correctness metric
metric = GEval(
    model=clarifai_llm,
    name="Correctness",
    criteria="Determine whether the actual output is factually correct.",
    evaluation_steps=[
        "Check whether facts in 'actual output' contradict 'expected output'",
        "Penalize omission of important details",
    ],
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)

# Create a test case
test_case = LLMTestCase(
    input="Who ran up the tree?",
    actual_output="The cat ran up the tree.",
    expected_output="The cat."
)

# Run the evaluation
metric.measure(test_case)
print(f"Score: {metric.score}")
print(f"Reason: {metric.reason}")
Example Output
Score: 0.8
Reason: The actual output correctly identifies the cat, so it does not contradict the expected answer. It adds extra information ('ran up the tree') which is not required, but no important details are omitted, resulting in a high but not perfect score.

The metric.score is a float between 0 and 1, and metric.reason provides a natural-language explanation of the score produced by the judge model.
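DeepEval metrics also accept a threshold argument, and metric.is_successful() reports whether the score cleared it. The underlying pass/fail decision amounts to a simple comparison, sketched here as plain Python so you can reason about it without calling a judge model (the 0.7 threshold is an arbitrary example value):

```python
def passes(score: float, threshold: float = 0.5) -> bool:
    """Sketch of the pass/fail decision: a test case succeeds when the
    judge's score meets or exceeds the metric's threshold."""
    return score >= threshold

# The example output above scored 0.8; with a threshold of 0.7 it passes.
print(passes(0.8, threshold=0.7))  # True
print(passes(0.8, threshold=0.9))  # False
```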

Supported Metrics

The following DeepEval metrics are compatible with Clarifai models as the judge:

| Metric | What It Measures |
| --- | --- |
| Answer Relevancy | How relevant the generated answer is to the input question |
| Faithfulness | Whether the response is grounded in the provided context without hallucinating |
| Contextual Precision | Whether the retrieved context ranks relevant items higher than irrelevant ones |
| Contextual Recall | How much of the expected output is supported by the retrieved context |
| Toxicity | Presence of harmful, offensive, or inappropriate content in responses |
| Bias | Whether the response reflects unfair bias toward a group or viewpoint |
| GEval | A customizable metric scored against user-defined criteria and evaluation steps |
info

For a complete walkthrough including knowledge bases and RAG evaluation, visit the Clarifai Examples Repository.