Evaluating Classification Models
Learn how to interpret the evaluation results of classification models
As mentioned in the previous Evaluating Models section, once the evaluation process is complete, the Calculate button will become a View Results button.
You can click the View Results button to see and interpret the evaluation results of your classification models.
You will then be redirected to the Evaluation results page, where you can analyze the outcomes of the evaluation process.
On the page, you will generally see results that represent either:
- The average across K-splits; or
- The test set of a single split, which is approximately 1/K of your original training set.
Note that a single split will be capped at 1,000 inputs.
Let's talk about how to interpret the evaluation results.
Evaluation Highlights
At the top of the page, you will find details about the evaluated model, including:
- Version of the model evaluated
- Evaluation date
- Number of concepts used for evaluation
- App name
- Dataset name
- Dataset version ID
The Evaluation highlights section provides key numerical metrics that offer an overview of the model's prediction performance, allowing you to assess the model's effectiveness at a glance.
ROC/AUC (Accuracy Score)
The ROC/AUC is a key metric for assessing the prediction performance of a classification model. ROC stands for Receiver Operating Characteristic, and AUC stands for Area Under the Curve. The AUC is defined as the macro average of the areas under the ROC curve for every concept, averaged across K-splits.
This metric provides a concise summary of how well each model discriminates between classes or concepts, making it easier to compare models based on their ROC characteristics. It allows you to quickly identify models that excel in distinguishing between concepts, aiding in the selection of those with superior discrimination capabilities and better performance.
A higher ROC/AUC score generally indicates better discrimination between classes or concepts. A score of 1 represents a perfect model, while a score of 0.5 represents a model with no discrimination ability, essentially performing no better than random chance. Generally, a score above 0.9 is considered excellent.
It's important to note that the ROC/AUC is not dependent on the specified prediction threshold.
The ROC curve is generated by plotting the True Positive Rate (y-axis) against the False Positive Rate (x-axis) as you vary the threshold for assigning observations to a given class. The AUC is the area under this curve and is arguably the best way to summarize a model’s performance in a single number.
The AUC can be thought of as representing the probability that a classifier will rank a randomly chosen positive observation higher than a randomly chosen negative observation. This makes it a useful metric even for datasets with highly unbalanced classes.
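To make this concrete, here is a minimal sketch of how a macro-averaged ROC/AUC could be computed for a multi-concept classifier using scikit-learn. This is an illustration only, not the platform's internal implementation; the `y_true` and `y_scores` arrays are hypothetical stand-ins for your test labels and predicted probabilities.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical test data: one row per input, one column per concept.
# y_true holds the binary labels; y_scores holds the predicted probabilities.
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0]])
y_scores = np.array([[0.90, 0.05, 0.05],
                     [0.20, 0.70, 0.10],
                     [0.10, 0.30, 0.60],
                     [0.60, 0.30, 0.10]])

# Macro average: compute the AUC for each concept separately, then average them.
macro_auc = roc_auc_score(y_true, y_scores, average="macro")
print(f"Macro-averaged ROC/AUC: {macro_auc:.3f}")
```

Because the score is built from the full ROC curve of each concept, it does not change when you adjust the prediction threshold slider.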
We discourage users from making a final assessment of a classification model's accuracy based solely on the ROC/AUC score.
Total Labeled
The total number of inputs that were originally labeled as the concept in the test set. Note that the Total Labeled value is not dependent on the prediction threshold.
It is calculated as: True Positives (correct predictions) + False Negatives (missed predictions).
Total Predicted
The total number of inputs that were predicted as the concept in the test set. This means these inputs were predicted as a concept with a probability greater than the prediction threshold value.
It is calculated as: True Positives (correct predictions) + False Positives (incorrect predictions).
True Positives
The number of inputs that were correctly predicted as the concept they were actually labeled with. Also known as “hits.” For example, these are the images that were labeled as “dog” and were predicted as “dog.”
False Negatives
The number of inputs that were not predicted as the concept they were actually labeled with. Also known as “misses.” For example, these are the images that were labeled as “dog” but were not predicted as “dog.”
False Positives
The number of inputs that were predicted as the concept but were not labeled as the concept. Also known as “false alarms.” For example, these are the images that were predicted as “dog” but were not labeled as “dog.”
Here is a table that summarizes the above concepts:

| | Predicted as the concept | Not predicted as the concept |
|---|---|---|
| Labeled as the concept | True Positive (hit) | False Negative (miss) |
| Not labeled as the concept | False Positive (false alarm) | True Negative (correct rejection) |
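To show how these counts relate to each other and to the prediction threshold, here is a small sketch for a single concept. The `labels` and `probs` arrays and the 0.5 threshold are hypothetical; the tallies simply follow the definitions above.

```python
import numpy as np

# Hypothetical per-input data for one concept (e.g., "dog"):
# labels: 1 = the input was labeled as the concept, 0 = it was not.
labels = np.array([1, 1, 0, 1, 0, 0, 1, 0])
# probs: the model's predicted probability that each input is the concept.
probs = np.array([0.92, 0.40, 0.75, 0.81, 0.10, 0.30, 0.66, 0.55])

threshold = 0.5                         # default prediction threshold
preds = (probs > threshold).astype(int)

true_positives = int(np.sum((preds == 1) & (labels == 1)))   # hits
false_negatives = int(np.sum((preds == 0) & (labels == 1)))  # misses
false_positives = int(np.sum((preds == 1) & (labels == 0)))  # false alarms

total_labeled = true_positives + false_negatives    # does not depend on the threshold
total_predicted = true_positives + false_positives  # depends on the threshold

print(true_positives, false_negatives, false_positives)  # 3 1 2
print(total_labeled, total_predicted)                    # 4 5
```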
Recall
Recall rate, also known as sensitivity or true positive rate, measures a model's ability to correctly identify all actual positive cases. It is the proportion of actual positive cases that were correctly predicted.
It is calculated as: True Positives / (True Positives + False Negatives).
Precision
Precision rate, also known as positive predictive value, measures the accuracy of the positive predictions of a model. It is the proportion of the predicted positive cases that are actually positive.
It is calculated as: True Positives / (True Positives + False Positives).
F1
The F1 score provides an overall assessment of a model's ability to balance precision and recall. It is the harmonic mean of precision and recall, offering a balanced measure of a model's performance concerning both false positives and false negatives.
It is calculated as: 2 * (Precision * Recall) / (Precision + Recall).
The F1 score ranges from 0 to 1, with 1 being the best possible score. A high F1 score implies that the model has both high precision and high recall, meaning it correctly identifies most positive instances in the dataset while minimizing false positives.
You can use the F1 score to determine how well your model aligns with the desired trade-off between false positives and false negatives, depending on the specific application context.
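The sketch below computes Recall, Precision, and F1 from hypothetical labels and thresholded predictions using scikit-learn; the values mirror the counting example above and are not tied to the platform's implementation.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical labels and thresholded predictions (same values as the sketch above).
labels = np.array([1, 1, 0, 1, 0, 0, 1, 0])
preds = np.array([1, 0, 1, 1, 0, 0, 1, 1])

recall = recall_score(labels, preds)        # TP / (TP + FN) = 3 / 4 = 0.75
precision = precision_score(labels, preds)  # TP / (TP + FP) = 3 / 5 = 0.60
f1 = f1_score(labels, preds)                # 2 * P * R / (P + R) ≈ 0.67

print(f"Recall: {recall:.2f}  Precision: {precision:.2f}  F1: {f1:.2f}")
```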
Prediction Threshold
The probability threshold determines the model's predictions. The default threshold is 0.5. An input is classified as a concept (e.g., "dog") only if its prediction probability exceeds the threshold. You can adjust the threshold slider based on how strict you want the classification to be.
For example, if the threshold is set to 0.5 and the model predicts that the probability of an image being a "dog" is 0.7, the image will be classified as a "dog." However, if you increase the threshold to 0.8, the same image will not be classified as a "dog" since 0.7 is below the new threshold. This adjustment can help reduce False Positives but might increase False Negatives.
All binary prediction metrics — such as True Positives, False Negatives, False Positives, Total Predicted, Recall Rate, and Precision Rate — depend on this threshold. Adjusting the threshold slider will change the values of these metrics.
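The sketch below illustrates this effect with hypothetical data: raising the threshold from 0.5 to 0.8 drops the borderline predictions, which removes a False Positive but also creates a False Negative.

```python
import numpy as np

# Hypothetical labels and predicted probabilities for the concept "dog".
labels = np.array([1, 1, 0, 1, 0])
probs = np.array([0.70, 0.85, 0.65, 0.95, 0.30])

for threshold in (0.5, 0.8):
    preds = probs > threshold
    tp = int(np.sum(preds & (labels == 1)))   # correctly predicted "dog"
    fp = int(np.sum(preds & (labels == 0)))   # predicted "dog" but not labeled "dog"
    fn = int(np.sum(~preds & (labels == 1)))  # labeled "dog" but not predicted
    print(f"threshold={threshold}: TP={tp} FP={fp} FN={fn} Total Predicted={tp + fp}")

# threshold=0.5: TP=3 FP=1 FN=0 Total Predicted=4
# threshold=0.8: TP=2 FP=0 FN=1 Total Predicted=2
```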
Choosing a Prediction Threshold
A threshold is the “sweet spot” score that aligns with your prediction objective for recall and/or precision. In practice, there are multiple ways to define “accuracy” in machine learning, and the threshold is the lever you use to express which definition matters most for your use case.
Setting your classification threshold for predicting out-of-sample data is ultimately a business decision: you must decide whether to minimize the False Positive Rate or maximize the True Positive Rate.
- If your model predicts concepts that lead to high-stakes decisions, such as diagnosing a disease or ensuring safety through moderation, a few false positives may be acceptable (better safe than sorry). In this case, you would prioritize a high recall rate so that as few positive cases as possible are missed.
- If your model predicts concepts that lead to suggestions or flexible outcomes, where the occasional miss is tolerable but irrelevant results are distracting, you would prioritize a high precision rate.
In both scenarios, it is crucial to train and test your model with data that accurately reflects its intended use case.
After determining the goal of your model (high precision or high recall), you can use unseen test data to evaluate how well your model meets the established standards.
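One common way to turn that goal into a concrete number is to sweep candidate thresholds over held-out data and keep the lowest threshold that meets your target. The sketch below assumes a hypothetical minimum-precision requirement of 0.90 and hypothetical `y_true`/`y_score` arrays; for a recall-first application you would filter on recall instead.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical held-out labels and predicted probabilities for one concept.
y_true = np.array([1, 1, 1, 0, 1, 0, 1, 0, 0, 0])
y_score = np.array([0.95, 0.90, 0.85, 0.80, 0.75, 0.45, 0.40, 0.35, 0.30, 0.10])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

target_precision = 0.90  # hypothetical business requirement
# precision and recall have one more entry than thresholds; trim them to align.
candidates = [(t, p, r)
              for t, p, r in zip(thresholds, precision[:-1], recall[:-1])
              if p >= target_precision]

if candidates:
    # The lowest qualifying threshold keeps recall as high as possible.
    threshold, p, r = min(candidates)
    print(f"Chosen threshold: {threshold:.2f} (precision {p:.2f}, recall {r:.2f})")
else:
    print("No threshold meets the precision target on this data.")
```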
Evaluation Summary
The Evaluation Summary table provides a comprehensive overview of the numerical evaluation results for each concept. For every concept, you can review the metrics that indicate its performance.
This table enables you to gauge the overall prediction performance of your model and identify the high-performing and low-performing concepts.
Confusion Matrix
A confusion matrix is a table used to visualize the performance of a model. It summarizes the model's predictions on a set of data by comparing the actual labels with the predicted labels, helping to identify where the model gets things right or wrong.
The confusion matrix is typically presented in a tabular format, with the Y-axis representing the Actual Concepts and the X-axis representing the Predicted Concepts. Each cell displays the average prediction probability for the predicted concept (column) across the group of inputs labeled with the actual concept (row).
- Diagonal cells: Show the average probability for true positives (correctly predicted inputs).
- Off-diagonal cells: Show the average probability for false positives, false negatives, and other misclassifications.
Each row represents a subset of the test set labeled as a specific concept (e.g., "dog"). As you move across a row, each cell represents the average prediction probability for each concept (indicated by the column name) for all inputs in that subset across all K-splits.
The confusion matrix helps you assess the effectiveness of a model for a particular task. It provides insights into:
- Accuracy: Overall, how often is the model correct?
- Misclassification Rate: Overall, how often is the model incorrect?
- True Positive Rate (Recall): When the actual label is positive, how often does the model predict positive?
- False Positive Rate: When the actual label is negative, how often does the model predict positive?
- Specificity: When the actual label is negative, how often does the model predict negative?
- Precision: When the model predicts positive, how often is it correct?
- Prevalence: How often does the positive condition actually occur in the sample?
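For a single concept treated as a binary problem, all of these quantities can be read off a 2x2 confusion matrix of raw counts. The sketch below uses hypothetical labels and thresholded predictions to derive each metric; note that the matrix shown on the Evaluation results page displays average prediction probabilities rather than these counts.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical binary labels and thresholded predictions for one concept.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
total = tn + fp + fn + tp

accuracy = (tp + tn) / total                # how often the model is correct
misclassification_rate = (fp + fn) / total  # how often it is incorrect
true_positive_rate = tp / (tp + fn)         # recall: positives predicted positive
false_positive_rate = fp / (fp + tn)        # negatives predicted positive
specificity = tn / (tn + fp)                # negatives predicted negative
precision = tp / (tp + fp)                  # predicted positives that are correct
prevalence = (tp + fn) / total              # how often positives actually occur

print(f"accuracy={accuracy:.2f}  misclassification={misclassification_rate:.2f}")
print(f"TPR={true_positive_rate:.2f}  FPR={false_positive_rate:.2f}  "
      f"specificity={specificity:.2f}  precision={precision:.2f}  "
      f"prevalence={prevalence:.2f}")
```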
Selection Details
The Selection Details table allows you to analyze your inputs by showing an image of the input alongside the prediction probabilities for each specific concept. After selecting an actual concept used for labeling your input and a concept predicted during the evaluation process, the table displays relevant information to facilitate analysis.
To identify and correct biases in your model's predictions, you can home in on the False Positives and False Negatives. This can help you detect any biases in how the model is predicting and correct any inputs that are mislabeled.
Note that the prediction probabilities in this table may differ from your actual model's probabilities. This is because all evaluation results are based on the new model built for evaluation purposes during the cross-validation process.
Precision-Recall Curve
A Precision-Recall Curve (PR curve) is a graphical representation that assesses the performance of a binary classification model across various thresholds. It plots the trade-off between precision (positive predictive value) and recall (sensitivity) of each concept used to train your model.
Typically, it provides insights into how well a model balances precision and recall across different decision thresholds, offering a nuanced view of its performance beyond traditional metrics like accuracy.
The PR curve is created by plotting precision on the y-axis and recall on the x-axis for different thresholds used by the model. A higher area under the PR curve (AUC-PR) indicates better performance of the model, especially when dealing with imbalanced datasets where one class (typically the negative class) is much more frequent than the other.
When a model has high recall but low precision, it correctly identifies most of the positive samples but also generates many false positives (i.e., it incorrectly classifies many negative samples as positive). On the other hand, when a model has high precision but low recall, it accurately classifies samples as positive but may miss many actual positive samples, resulting in fewer positive detections overall.
Because both precision and recall are important, the precision-recall curve is used to illustrate the trade-off between these metrics at different threshold levels. This curve helps in selecting the optimal threshold to balance and maximize both precision and recall, thereby improving the overall performance of the model.
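As an illustration, the sketch below computes the PR curve for one concept with scikit-learn, using hypothetical `y_true` and `y_score` arrays; average precision is used here as a common single-number summary of the area under the PR curve.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

# Hypothetical labels and predicted probabilities for one concept.
y_true = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.90, 0.85, 0.80, 0.70, 0.55, 0.50, 0.40, 0.30, 0.25, 0.10])

# Precision and recall at each candidate threshold, plus a single-number summary.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
auc_pr = average_precision_score(y_true, y_score)

for t, p, r in zip(thresholds, precision[:-1], recall[:-1]):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
print(f"Area under the PR curve (average precision): {auc_pr:.3f}")
```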
Click here to learn more about how to evaluate an image classification model.