LLM Evaluation

Learn how to evaluate your fine-tuned LLMs

Fine-tuning large language models (LLMs) is a powerful strategy that lets you take a pre-trained language model and further train it on a specific dataset or task to adapt it to that particular domain or application.

After specializing the model for a specific task, it’s important to evaluate its performance and assess how effective it is in real-world scenarios. By running an LLM evaluation, you can gauge how well the model has adapted to the target task or domain.

At Clarifai, we provide the LLM Eval module to help you evaluate the strengths and weaknesses of your LLMs against standardized benchmarks alongside custom criteria.

How the LLM Eval Module Works

The LLM Eval module evaluates the performance of language models by comparing the predicted string to a reference string or an input.

A predicted string refers to the output generated by the fine-tuned model based on a given input text, such as a question or a prompt. A reference or context string, on the other hand, is the ground truth or the correct output for the input. It could be a human-written answer, translation, summary, or any other benchmark considered the "correct" response.

The module compares the predicted string with the reference string to measure how accurately the model performs.

For example, in a question-answering task, if the input is a question, the predicted string would be an answer generated by the fine-tuned model in response to that question, and the reference string would be a human-generated answer that’s considered correct or highly accurate for the given question.

Different metrics, such as Exact Match or F1 Score, would then be used to assess how closely the predicted answer aligns with the reference answer.


The LLM Eval module allows you to evaluate across 100+ tasks covering diverse use cases like retrieval-augmented generation (RAG), classification, casual chat, content summarization, and more.

Evaluation Templates

You can choose from a variety of templates for evaluating your fine-tuned large language model using the LLM Eval module.

1. LLM-as-a-Judge

The LLM-as-a-Judge template uses a strong LLM to evaluate the outputs of another LLM. It involves leveraging the capabilities of an AI model to perform judgment-based tasks on another AI model’s work.

This template employs a selected LLM to perform string evaluation on a model’s predicted response based on an input question and a ground truth — as explained earlier. Typically, the LLM works as a judge and determines the quality of the model’s predicted output against the ideal or expected output.

After the judgment process, the evaluation results are categorized into different classes, and each class is given a score. LangChain’s CriteriaEvalChain is used to compute the classes.

These classes include:

  • Relevance – Is the submission referring to a real quote from the text?
  • Depth – Does the submission illustrate depth of thought?
  • Creativity – Does the submission illustrate novelty or unique ideas?
  • Correctness – Is the submission correct, accurate, and factual?
  • Helpfulness – Is the submission helpful, insightful, and appropriate?

Each of the specified classes will be given a score between 0 and 1, where 1 represents the highest level of confidence or agreement with the judgment provided by the LLM-as-a-judge. For example, if Relevance is scored at 0.80, it implies that the LLM-as-a-judge is 80% confident that the predicted response is relevant to the specified scenario.

Assign user-defined weights

You can also assign user-defined weights to each class. This lets you measure customized business metrics for specific use cases. For example, for RAG-related evaluation cases, where reading comprehension and instruction following are desired, you may want to give zero weight to Creativity and more weight to Correctness, Helpfulness, and Relevance.
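The weighting idea can be sketched in a few lines of Python. This is only an illustration: the page does not say how the module combines weighted class scores, so the weighted average below, the function name `weighted_judge_score`, and the sample scores and weights are all assumptions.

```python
from typing import Dict


def weighted_judge_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Return the weighted average of per-class judge scores (each in [0, 1])."""
    total_weight = sum(weights.get(name, 0.0) for name in scores)
    if total_weight <= 0:
        raise ValueError("At least one class needs a positive weight.")
    weighted_sum = sum(score * weights.get(name, 0.0) for name, score in scores.items())
    return weighted_sum / total_weight


# Hypothetical judge scores for a single RAG response.
judge_scores = {
    "relevance": 0.8,
    "depth": 0.5,
    "creativity": 0.2,
    "correctness": 0.9,
    "helpfulness": 0.7,
}

# RAG-style weighting: Creativity is ignored, while Correctness,
# Helpfulness, and Relevance count twice as much as Depth.
rag_weights = {
    "relevance": 2.0,
    "depth": 1.0,
    "creativity": 0.0,
    "correctness": 2.0,
    "helpfulness": 2.0,
}

rag_score = weighted_judge_score(judge_scores, rag_weights)
print(f"{rag_score:.3f}")  # prints 0.757
```

With these weights, the low Creativity score no longer drags down the overall result, which is the effect you would want for a RAG use case.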

2. TruthfulQA

The TruthfulQA template evaluates a model’s performance based on the TruthfulQA benchmark. The benchmark measures whether models reproduce common human falsehoods.

With this template, you can evaluate whether a model is truthful in generating answers to questions. A model that performs well avoids generating false responses learned from mimicking human texts. A model that performs poorly generates false answers that imitate popular misconceptions, which could deceive people.

Specifically, this template employs the zero-shot generative task methodology within the TruthfulQA framework to compute standard metrics that evaluate the quality of generated responses or answers.


In the zero-shot approach, the model is not provided with specific training examples or labeled data for the task at hand. Instead, it is expected to generate responses based on its understanding of the prompt or question without any prior training on similar examples.

The metrics used include:

  • BLEU (Bilingual Evaluation Understudy) — It measures the similarity between a machine-generated text and a reference human translation.
  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation) — It is essentially a set of metrics for evaluating automatic summarization of texts as well as machine translations.

These metrics provide valuable insights into the model's fluency, coherence, and ability to capture the essence of the input prompt or question. Both BLEU and ROUGE scores range from 0 to 1, with higher values indicating better performance.

You can also set up weights for each score to adjust their relative importance in the evaluation process. This allows you to customize the evaluation criteria based on specific priorities or preferences.

For example, you may assign a higher weight to BLEU if you prioritize precision (how much of the generated text also appears in the reference), while assigning a higher weight to ROUGE if you prioritize recall (how much of the reference is covered by the generated text).
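To make the difference between the two metric families concrete, here is a deliberately simplified sketch. It is not the full BLEU or ROUGE definition (there is no brevity penalty, no combination of several n-gram orders, and no ROUGE-L), and it is not the module's implementation. It only computes the clipped n-gram precision that BLEU builds on and the n-gram recall behind ROUGE-N. The function names and sample strings are illustrative assumptions.

```python
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(prediction: str, reference: str, n: int = 1) -> Tuple[float, float]:
    """Return (BLEU-style clipped precision, ROUGE-N-style recall) for n-grams."""
    pred_counts = ngrams(prediction.lower().split(), n)
    ref_counts = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram ("clipping").
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    return precision, recall


# Hypothetical prediction and reference strings.
precision, recall = ngram_overlap(
    prediction="a cat is an animal",
    reference="a cat is a small animal",
)
print(f"precision={precision:.2f} recall={recall:.2f}")  # prints precision=0.80 recall=0.67
```

Four of the five predicted words appear in the reference (precision 0.80), but only four of the six reference words are covered (recall 0.67), which is why the two scores can diverge for the same pair of strings.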

3. General

The General template is a standardized framework that evaluates the performance of language models by scoring the fine-tuned model’s response against the ground truth (reference) using common natural language processing (NLP) metrics.

Some of these metrics include:

  • F1 – This is a combined metric that considers both precision (proportion of correctly identified positive cases) and recall (proportion of actual positive cases identified) of the model's response. It measures the model's ability to accurately identify relevant information.
  • Exact Match – It measures the percentage of model responses that exactly match the ground truth responses.

You can also set up weights for the metrics to adjust their relative importance in the evaluation process. This allows you to customize the evaluation criteria based on specific priorities or preferences.

For example, you may assign higher weights to the F1 score if you prioritize precision and recall equally, while assigning lower weights to BLEU and ROUGE if they are less relevant to your task.
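As a rough illustration of these two metrics for text answers, the sketch below computes Exact Match and a token-level F1 of the kind commonly used for question answering. The normalization (lowercasing and whitespace splitting) and the function names are assumptions made for this example; the module's own implementation may differ.

```python
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    common = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical prediction and reference.
prediction = "A cat is a pet"
reference = "A cat is an animal"

em = exact_match(prediction, reference)
f1 = token_f1(prediction, reference)
print(f"exact_match={em:.1f} f1={f1:.2f}")  # prints exact_match=0.0 f1=0.60
```

Exact Match is strict and gives no credit here, while F1 rewards the three shared tokens. Weighting the two lets you decide how much partial credit should count.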

4. Custom

You can also create your own custom template for performing evaluations tailored to your specific needs and objectives. Custom templates offer flexibility in defining evaluation criteria, metrics, and workflows according to the unique requirements of your task or domain.

For example, you can use the lm-evaluation-harness framework to create your own custom template.

Inference Parameters

The LLM Eval module allows you to adjust inference parameters to limit or influence the model response, providing greater control over the output generated by the model.

These parameters can shape various aspects of the model's behavior, such as response length, style, complexity, or specificity. Some of the inference parameters you can specify are temperature, max_tokens, and top-k. You can learn more about them here.

Prompt Template

The LLM Eval module allows you to evaluate your fine-tuned model using prompt templates.

A prompt template serves as a pre-configured piece of text used to instruct an LLM. It acts as a structured query or input that guides the model in generating the desired response.

After running the evaluation, the prompt templates will be ranked, allowing you to select the best-performing prompt-model combinations to use for creating workflows.
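A ranking of this kind can be sketched as follows. The page does not state which aggregate the module ranks on, so averaging per-sample scores is an assumption here, and the template names and scores are made up for illustration.

```python
from statistics import mean
from typing import Dict, List, Tuple


def rank_templates(results: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Sort prompt templates by their mean per-sample score, best first."""
    averaged = {name: mean(scores) for name, scores in results.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-sample scores for three prompt templates.
results = {
    "template_a": [0.6, 0.8],
    "template_b": [0.9, 0.7, 0.8],
    "template_c": [0.5],
}

ranking = rank_templates(results)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```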

How to Fine-Tune a Model

These are the steps you need to follow to create and fine-tune your LLM.

Step 1: Prepare training data

Let’s start by preparing the training data in a format that Clarifai accepts.

Let’s say that you want to use the LLM-as-a-Judge or the General template to evaluate the performance of a fine-tuned model on a question-answering dataset.

You’d present it with a set of input questions wherein the LLM-as-a-Judge would gauge the model’s predictions against the predefined answers — as explained earlier. As such, you’d require a dataset with at least two fields: question and answer.

However, the Clarifai platform only permits single-field datasets for such cases. Therefore, you need to encode both fields within that single text field so that the module can extract them at evaluation time.

You can use either of the following approaches to achieve that:

  • JSON format — Structure the dataset in JSON format that includes separate 'question' and 'answer' fields.

  • Prompt template format — Distinguish between questions and answers within the text by using special characters or markers. For example, you might structure the text as follows: ###Instruct: What is a cat? ###Response: A cat is an animal. Here, the question and answer are separated by '###Response'. By identifying and utilizing such a marker (referred to as the split word), the module can extract the two desired fields for evaluation.
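The two approaches above can be sketched as a small parser. This is a hypothetical helper, not the module's actual extraction code: the function name `parse_sample`, the choice to keep the ###Instruct: prefix in the question, and the stripping of the colon after the split word are assumptions made for illustration.

```python
import json
from typing import Dict, Optional


def parse_sample(text: str, split_word: Optional[str] = None) -> Dict[str, str]:
    """Extract 'question' and 'answer' from a single-field text sample."""
    if split_word:
        # Prompt template format: everything before the split word is the
        # question, everything after it is the answer.
        question, separator, answer = text.partition(split_word)
        if not separator:
            raise ValueError(f"Split word {split_word!r} not found in sample.")
        return {
            "question": question.strip(),
            "answer": answer.strip().lstrip(":").strip(),
        }
    # JSON format: the sample is a JSON object with both fields.
    record = json.loads(text)
    return {"question": record["question"], "answer": record["answer"]}


template_sample = parse_sample(
    "###Instruct: What is a cat? ###Response: A cat is an animal.",
    split_word="###Response",
)
json_sample = parse_sample('{"question": "What is a cat?", "answer": "A cat is an animal."}')

print(template_sample)
print(json_sample)
```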


If you're using a CSV file, you need to ensure that all your data is kept within a single column.

For this example, we’ll use the following dataset of five samples structured in JSON format.

{
  "question": "What are the names of some famous actors that started their careers on Broadway?",
  "answer": "Some famous actors that started their careers on Broadway include: \n1. Hugh Jackman \n2. Meryl Streep \n3. Denzel Washington \n4. Julia Roberts \n5. Christopher Walken \n6. Anthony Rapp \n7. Audra McDonald \n8. Nathan Lane \n9. Sarah Jessica Parker \n10. Lin-Manuel Miranda"
}
{
  "question": "What is some cool music from the 1920s?",
  "answer": "Some cool music from the 1920s includes jazz classics like Duke Ellington's \"Mood Indigo,\" Louis Armstrong's \"West End Blues,\" and Bessie Smith's \"Down Hearted Blues.\" Other popular tunes from the era include Fats Waller's \"Ain't Misbehavin,\" George Gershwin's \"Rhapsody in Blue,\" and Irving Berlin's \"Puttin' On the Ritz.\""
}
{
  "question": "How did US states get their names?",
  "answer": "US states get their names from a variety of sources, including Native American tribes, Spanish explorers, British colonists, and even presidents. For example, the state of Alabama was named after the Native American tribe that lived in the area, while the state of Florida gets its name from the Spanish explorer, Ponce de Leon, who explored the area in the 1500s. Other states are named after English kings (like Virginia, named after England's \"Virgin Queen,\" Queen Elizabeth I) or presidents (like Washington, named after George Washington)."
}
{
  "question": "How do I wrap a present neatly?",
  "answer": "1. Start by gathering the supplies you will need to wrap the present, such as wrapping paper, tape, scissors, and ribbon.\n\n2. Cut the wrapping paper to the size of the gift, ensuring that there is enough paper to cover the entire gift.\n\n3. Place the gift in the center of the wrapping paper, and fold the sides in, tucking the excess paper at the bottom.\n\n4. Secure the folded paper with tape.\n\n5. Cut a length of ribbon and tie it around the gift.\n\n6. Trim the excess ribbon with scissors.\n\n7. Add a decorative bow or other embellishment to the top of the gift."
}
{
  "question": "Hi, my sister and her girlfriends want me to play kickball with them. Can you explain how the game is played, so they don't take advantage of me?",
  "answer": "Kickball is a game similar to baseball, but with a large rubber ball instead of a bat and a ball. The game is usually played with two teams of six players each. Each team has three bases and a home plate. The players on the kicking team line up at home plate and take turns kicking the ball. The object of the game is to score runs by running around all three bases and back to home plate without being tagged out by the defense. The team with the most runs at the end of the game is the winner."
}

For evaluating a production-ready model, we recommend using a dataset containing between 100 and 2,000 inputs, depending on your specific needs. This ensures a sufficiently large and diverse sample size to accurately assess the model's performance across various scenarios and edge cases.

TruthfulQA template format

If you want to use the TruthfulQA template to evaluate your model’s performance, you need to prepare your data in the JSON format that the TruthfulQA benchmark uses.

Step 2: Create an app

Click here to learn how to create an application on the Clarifai portal.


When creating the application, choose the Text/Document option as the primary input type. Then, in the collapsible Advanced Settings field, select Universal as the base workflow.

Step 3: Add inputs to app

Select the Inputs option on your app’s collapsible left sidebar, and use the inputs uploader pop-up window to upload the text data you prepared to a dataset within your application.

Step 4: Choose a model type

Choose the Models option on your app’s collapsible left sidebar. Click the Add Model button on the upper-right corner of the page. On the window that pops up, select the Build a Custom Model option and click the Continue button.

You’ll be redirected to a page where you can choose the type of model you want to create and fine-tune.

Select the Text Generator option.

Step 5: Create the model

The ensuing page allows you to create and fine-tune a text-to-text model for generation or conversion purposes.

  • Model Id – Provide an ID for your model.
  • Dataset – Select the dataset you want to use to fine-tune the model, along with its version.
  • Invalid data_tolerance_percent – Optionally, set a tolerance threshold (0 to 100) for the percentage of invalid inputs during training. If this threshold is exceeded, training stops with an error.
  • Template – Select the pre-configured model template you want to use to train on your data. For this example, we’ll go with the recommended template: HF_Llama_2_7b_chat_GPTQ_lora.
  • Training settings – Optionally, configure the training settings to enhance the performance of your model.
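The invalid-data tolerance setting in the list above amounts to a simple percentage check. The sketch below shows the arithmetic only. The function name, the error type, and the reading of "exceeded" as strictly greater than the threshold are assumptions, not the platform's actual behavior.

```python
def check_invalid_tolerance(num_invalid: int, num_total: int, tolerance_percent: float) -> float:
    """Return the percentage of invalid inputs, or raise if it exceeds the tolerance."""
    if not 0 <= tolerance_percent <= 100:
        raise ValueError("tolerance_percent must be between 0 and 100.")
    if num_total <= 0:
        raise ValueError("num_total must be positive.")
    invalid_percent = 100.0 * num_invalid / num_total
    if invalid_percent > tolerance_percent:
        # Assumed behavior: training stops with an error once the threshold is exceeded.
        raise RuntimeError(
            f"{invalid_percent:.1f}% of inputs are invalid, above the "
            f"{tolerance_percent:.1f}% tolerance."
        )
    return invalid_percent


# 3 invalid inputs out of 100 stays within a 5% tolerance.
print(check_invalid_tolerance(num_invalid=3, num_total=100, tolerance_percent=5))  # prints 3.0
```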

For this example, we'll change the num_train_epochs parameter, located in the Trainer config option, to 50. Increasing the number of epochs from 1 to 50 means that each input text will be processed 50 times during training.

Click here to learn more about the hyperparameters that each template supports.

Step 6: Train the model

Finally, click the Train button.

How to Evaluate a Fine-Tuned Model

After successfully training your language model, you may want to test its performance before using it in a production environment.

On the model’s versions table:

  • Select the version whose performance you want to evaluate;
  • Select the evaluation dataset you want to use;
  • Click the Evaluate button.

You’ll be redirected to the LLM Eval module page, allowing you to evaluate the performance of your model version.

These are the steps you need to follow to evaluate your fine-tuned LLM.

Step 1: Select a dataset

If a holdout dataset hasn't been chosen yet, select one for evaluating your model's performance.

Step 2: Choose a template

Choose an evaluation template from the left sidebar. For this example, we’ll choose the llm-as-a-judge template and select the Llama2-chat-70B model as the judge to use.

Note that you can also add your own model found in the Clarifai platform by providing its publicly accessible URL.

Step 3: Customize weights

Optionally, you can set up weights for the metrics to adjust their relative importance in the evaluation process, as explained earlier.

Step 4: Add inference parameters

You can input your inference parameters using the format of comma-separated keyword arguments, for example: max_new_tokens=512, return_full_text=False.
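To show what this format encodes, the sketch below parses such a string into a Python dictionary. It is an illustrative parser rather than the module's own: the function name is hypothetical, and splitting on commas means it would not handle values that themselves contain commas.

```python
import ast
from typing import Any, Dict


def parse_inference_params(raw: str) -> Dict[str, Any]:
    """Parse 'key=value, key=value' into a dict with Python-typed values."""
    params: Dict[str, Any] = {}
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue
        key, separator, value = item.partition("=")
        if not separator:
            raise ValueError(f"Expected key=value, got {item!r}.")
        try:
            # Interpret numbers, booleans, and quoted strings as Python literals.
            params[key.strip()] = ast.literal_eval(value.strip())
        except (ValueError, SyntaxError):
            # Fall back to the raw string for unquoted text values.
            params[key.strip()] = value.strip()
    return params


params = parse_inference_params("max_new_tokens=512, return_full_text=False")
print(params)  # prints {'max_new_tokens': 512, 'return_full_text': False}
```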

Step 5: Add prompt templates

You can add up to five prompt templates to use for evaluating your model. Ensure that each of them follows prompter rules.


Your prompt template should include at least one instance of the placeholder {{question}}. When you input your text data at inference time, all occurrences of {{question}} within the template will be replaced with the provided text.
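The substitution described above behaves like a plain string replacement. The following minimal sketch uses a hypothetical function name and a made-up template; the module may perform additional validation.

```python
def render_prompt(template: str, question: str) -> str:
    """Replace every {{question}} placeholder with the provided text."""
    placeholder = "{{question}}"
    if placeholder not in template:
        raise ValueError("The template must contain at least one {{question}} placeholder.")
    return template.replace(placeholder, question)


# Hypothetical prompt template.
template = "###Instruct: {{question}} ###Response:"
prompt = render_prompt(template, "What is a cat?")
print(prompt)  # prints ###Instruct: What is a cat? ###Response:
```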

Step 6: Additional options

Optionally, you could:

  • Provide a regex code. For instance, to extract the first word following the phrase ### Response:, you may add ### Response: (\w+). Note that (\w+) captures a single word only; to capture the rest of the line, a pattern such as ### Response: (.*) would be needed. You may leave the field empty if you prefer not to apply any filtering to your results. The filtered outputs will be displayed in the filtered_prediction column of the evaluation results table.

  • If your dataset is in the prompt template format, add a split word. Leave it empty if your dataset is in JSON format.
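The regex option in the list above can be tried out locally with Python's re module before you enter a pattern. The helper below is a sketch under assumptions: the function name is hypothetical, and returning an empty string when nothing matches is a guess about the behavior, not something this page specifies.

```python
import re
from typing import Optional


def filter_prediction(prediction: str, pattern: Optional[str]) -> str:
    """Apply an optional regex to a raw prediction and return the filtered text."""
    if not pattern:
        return prediction  # Empty field: no filtering is applied.
    match = re.search(pattern, prediction)
    if match is None:
        return ""  # Assumed behavior when the pattern does not match.
    # Prefer the first capture group if the pattern defines one.
    return match.group(1) if match.re.groups else match.group(0)


raw_output = "### Instruct: What is a cat? ### Response: Animal, obviously."

first_word = filter_prediction(raw_output, r"### Response: (\w+)")
rest_of_line = filter_prediction(raw_output, r"### Response: (.*)")

print(first_word)    # prints Animal
print(rest_of_line)  # prints Animal, obviously.
```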

Step 7: Evaluate

Click the Evaluate button to begin the evaluation process.

Step 8: View results

After running an evaluation, you can view the results under the Evaluation Result section. You can also create workflows with your prompt templates.

The results will include the average value of your chosen metrics, as well as the individual values of each metric. Additionally, a detailed table will display extra information, such as the provided data for evaluation, model predicted output, filtered prediction, metric values, and more.

Note that the results will persist on the page, and will be populated anytime you select the previously evaluated holdout dataset.

That’s it!