Model Evaluation FAQs
Frequently asked questions on model evaluation
How can I improve concept performance?
Concepts that perform well tend to be the ones that are annotated in images photographed in a consistent and unique way.
Concepts tend to perform poorly when:
- a) the training data has inconsistent compositions;
- b) the photos require outside context (relationships to people in portraits, etc.); and/or,
- c) the subject matter is subtle.
Keep in mind the model has no concept of language; so, in essence, “what you see is what you get.”
Let’s take a case of a false positive prediction made by a model in the process of training to recognize wedding imagery.
Here is an example of an image of a married couple, which had a false positive prediction for a person holding a bouquet of flowers, even though there is no bouquet in the photo.
What’s going on here?
A photo’s composition and the combination of elements therein could confuse a model.
All the images below were labeled with the ‘Bouquet_Floral_Holding’ concept.
In this very rare instance, the image in question has:
- A veiled bride
- The bride & groom kissing/their heads close together
- Greenery over their heads
- Large, recognizable flowers
The model sees the combination of all these individual things in lots of photos labeled ‘Bouquet_Floral_Holding’; and thus, that is the top result.
One way to fix this is to narrow the training data for ‘Bouquet_Floral_Holding’ to images in which the bouquet is the focal point, rather than any instance of the bouquet being held.
This way, the model can focus on the anchoring theme/object within the dataset more easily.
What is the ROC AUC score, and how does it relate to prediction accuracy?
The table above is available on the model evaluation page of Clarifai’s legacy Explorer UI.
The ROC AUC (Concept Accuracy Score) is the concept’s prediction performance score, defined by the area under the Receiver Operating Characteristic curve. This score gives us an idea of how well we have separated our different classes, or concepts.
The ROC curve is generated by plotting the True Positive Rate (y-axis) against the False Positive Rate (x-axis) as you vary the threshold for assigning observations to a given class. The AUC, or Area Under the Curve traced by these points, is (arguably) the best way to summarize a model’s performance in a single number.
You can think of AUC as representing the probability that a classifier will rank a randomly chosen positive observation higher than a randomly chosen negative observation, and thus it is a useful metric even for datasets with highly unbalanced classes.
A score of 1 represents a perfect model. A score of 0.5 represents a model that is no better than random guessing; such a model is not suitable for predictions and should be re-trained.
Note that the ROC AUC is not dependent on the prediction threshold.
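The ranking interpretation above can be sketched directly in code. This is a minimal illustration, not how the platform computes the score: the function name `roc_auc` and the sample labels and scores are hypothetical, and ties between a positive and a negative are assumed to count as half a correct pair.

```python
from itertools import product


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for pos, neg in product(positives, negatives):
        if pos > neg:
            wins += 1.0
        elif pos == neg:
            wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical data for one concept: 1 = labeled with the concept, 0 = not.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)  # 8 of the 9 pairs are ranked correctly
```

No threshold appears anywhere in the calculation, which is why the score does not depend on the prediction threshold.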
How do we read a concept-by-concept matrix?
The table above is available on the model evaluation page of Clarifai’s legacy Explorer UI.
A concept-by-concept matrix is a graphic flattening of data to show what has been labeled for a concept. This tool is another way of visualizing the performance of a model.
It allows us to review where we see true positives, or correctly predicted inputs (the diagonal cells). Simply put, this is an excellent tool for telling us where our model gets things right or wrong.
Each row represents the subset of the test set that was actually labeled as a concept, e.g., “dog.” As you go across the row, each cell shows the number of times those images were predicted as each concept, noted by the column name.
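As a rough sketch of how such a matrix could be tallied, the snippet below counts (actual, predicted) pairs for a single-label test set. The helper name `concept_matrix` and the sample data are assumptions for illustration only.

```python
from collections import Counter


def concept_matrix(actual, predicted, concepts):
    """Rows are actual concepts, columns are predicted concepts."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in concepts] for a in concepts]


# Hypothetical test set with one label and one top prediction per image.
concepts = ["cat", "dog"]
actual = ["dog", "dog", "dog", "cat", "cat"]
predicted = ["dog", "dog", "cat", "cat", "dog"]

matrix = concept_matrix(actual, predicted, concepts)
# Row "cat": 1 predicted as cat, 1 predicted as dog.
# Row "dog": 1 predicted as cat, 2 predicted as dog.
```

Reading across the "dog" row, two of the three images labeled “dog” land on the diagonal and one lands in the “cat” column.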
Along with AUC, what other insights can a confusion matrix provide?
- Accuracy—Overall, how often is the model correct?
- Misclassification Rate—Overall, how often is it wrong?
- True Positive Rate—When it's actually yes, how often does it predict yes?
- False Positive Rate—When it's actually no, how often does it predict yes?
- Specificity—When it's actually no, how often does it predict no?
- Precision—When it predicts yes, how often is it correct?
- Prevalence—How often does the yes condition actually occur in our sample?
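The measures listed above can all be derived from the four cells of a two-class confusion matrix. The sketch below assumes a single concept treated as yes/no; the function name `binary_metrics` and the counts are hypothetical, and it assumes every denominator is non-zero.

```python
def binary_metrics(tp, fp, fn, tn):
    """Summary rates from true/false positive and negative counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "misclassification_rate": (fp + fn) / total,
        "true_positive_rate": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "specificity": tn / (fp + tn),
        "precision": tp / (tp + fp),
        "prevalence": (tp + fn) / total,
    }


# Hypothetical counts for one concept on a 100-image test set.
metrics = binary_metrics(tp=40, fp=10, fn=10, tn=40)
```

Note that accuracy and misclassification rate always sum to 1, as do specificity and false positive rate.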
The diagonal cells represent True Positives, i.e., correctly predicted inputs. You’d want this number to be as close to the Total Labeled as possible.
Depending on how your model was trained, the off-diagonal cells could include both correct and incorrect predictions. When concepts are not mutually exclusive, you can label an image with more than one concept.
For example, an image labeled as both “hamburger” and “sandwich” would be counted in both the “hamburger” row and the “sandwich” row. If the model correctly predicts this image to be both “hamburger” and “sandwich,” then this input will be counted in both on and off-diagonal cells.
The table above is available on the model evaluation page of Clarifai’s legacy Explorer UI.
This is a sample confusion matrix for a model. The Y-axis Actual Concepts are plotted against the X-axis Predicted Concepts.
The cells display average prediction probability for a certain concept, and for a group of images that were labeled as a certain concept.
The diagonal cells are the average probability for true positives, and any cells off the diagonal contain the average probability for non-true positives. From this confusion matrix, we can see that each concept is distinct from one another, with a few areas of overlap, or clustering.
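One way to picture how such cells could be filled in is sketched below: for each actual concept, average the predicted probability of every concept over the images labeled with it. The function name `mean_probability_matrix`, the data layout, and the numbers are assumptions for illustration, not the platform's implementation.

```python
def mean_probability_matrix(labeled, probabilities, concepts):
    """Average predicted probability per (actual, predicted) concept pair.

    labeled: one set of concepts per image (multi-label is allowed).
    probabilities: one dict per image mapping concept -> predicted probability.
    """
    matrix = []
    for actual in concepts:
        rows = [probs for labels, probs in zip(labeled, probabilities)
                if actual in labels]
        matrix.append([
            sum(probs[predicted] for probs in rows) / len(rows) if rows else None
            for predicted in concepts
        ])
    return matrix


# Hypothetical predictions for three images.
concepts = ["Baked", "Gratin"]
labeled = [{"Baked"}, {"Baked"}, {"Gratin"}]
probabilities = [
    {"Baked": 0.9, "Gratin": 0.5},
    {"Baked": 0.7, "Gratin": 0.3},
    {"Baked": 0.6, "Gratin": 0.8},
]
avg = mean_probability_matrix(labeled, probabilities, concepts)
```

In this made-up data the off-diagonal averages are fairly high relative to the diagonal, which is the kind of overlap that shows up as a cluster.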
Concepts that co-occur, or are similar, may appear as a cluster on the matrix.
In the above sample matrix for a food model trained to predict the cooking preparation for foods, we see a cluster around ‘Gratin’ and ‘Baked’. Without the context of what these concepts mean or how the training data appears, we can assume that the training data looks very similar, and it is (‘Gratin’ is a variant of baked preparation, only with cheese).
And so, clusters aren’t necessarily a bad thing; and sometimes they are absolutely correct if the subject matter is naturally similar.
At Clarifai, we can work with you to determine if the cluster or area of confusion is appropriate for the final model.
Here is an example of a noisy matrix:
Hair Volume - Version I | Hair Volume - Version II |
---|---|
- 1-2 concept model (left): Hair Volume is graded as it appears from root to tip, either voluminous, or not voluminous.
- 1-4 concept model (right): Hair Volume is tiered depending on where it exists from the root, ends of the hair, and hair length. Level 4 is the most distinct because it is dependent on the shortest hair length.
The matrix on the right is an example of a noisy matrix, compared to the binary model on the left, which has a more clearly defined diagonal line.
Concepts trained with data that is too similar, too subtle to differentiate, or labeled across different concepts, will muddle the distinction between concepts, and will not form a clean diagonal line in a confusion matrix.
In this case, the model on the left would be considered the ideal concept distinction.
How can I improve a model by drilling down to “problematic cells” in a confusion matrix?
What is the importance of recall and precision rate?
Recall rate refers to the proportion of the images labeled as the concept that were predicted as the concept. It is calculated as True Positives divided by Total Labeled. Also known as “sensitivity” or “true positive rate.”
Precision rate refers to the proportion of the images predicted as a concept that had been actually labeled as the concept. It is calculated as True Positives divided by Total Predicted. Also known as “positive predictive value.”
You can think of precision and recall in the context of what we want to calibrate our model towards. Precision and recall usually trade off against each other: raising the prediction threshold tends to increase precision and lower recall, and vice versa. Ultimately, the acceptable ratio of false positives to false negatives is up to the client, according to their goal.
We’re asking one of the following of our model:
- That the guesses are correct, while missing some concepts (high precision);
Or,
- That most instances of a concept are predicted as that concept, while having some wrong predictions (high recall).
Example:
Precision = tp ÷ (tp + fp)
I guess for X, and my guess is correct, although I may miss another X.
Or,
Recall = tp ÷ (tp + fn)
I guess all the X as X, but occasionally predict other subjects that are not X as X.
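The two formulas can be tried out on a small example. In the sketch below, "Total Labeled" and "Total Predicted" are modeled as sets of image IDs; the function name `precision_recall` and the IDs are hypothetical.

```python
def precision_recall(labeled_ids, predicted_ids):
    """Precision and recall for one concept from sets of image IDs."""
    labeled, predicted = set(labeled_ids), set(predicted_ids)
    tp = len(labeled & predicted)  # labeled as X and predicted as X
    precision = tp / len(predicted) if predicted else 0.0  # tp / (tp + fp)
    recall = tp / len(labeled) if labeled else 0.0         # tp / (tp + fn)
    return precision, recall


# Hypothetical concept X: four images are labeled X, three are predicted X.
labeled_x = {1, 2, 3, 4}
predicted_x = {1, 2, 5}
precision, recall = precision_recall(labeled_x, predicted_x)
```

Here two of the three guesses are correct (precision of 2/3), but only two of the four labeled images are found (recall of 1/2).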
How do we choose a prediction threshold?
A threshold is the cutoff score at or above which a prediction is counted as positive. The “sweet spot” depends on whether the objective of your prediction favors recall or precision. In practice, there are multiple ways to define “accuracy” when it comes to machine learning, and the threshold is the number we use to express our preferences.
You might be wondering how you should set your classification threshold, once you are ready to use it to predict out-of-sample data. This is more of a business decision, in that you have to decide whether you would rather minimize your false positive rate or maximize your true positive rate.
If our model is used to predict concepts that lead to a high-stakes decision, like a diagnosis of a disease or moderation for safety, we might consider a few false positives as forgivable (better to be safe than sorry!). In this case, we might want high recall, so that as few true cases as possible are missed. High precision matters more when acting on a wrong prediction is costly.
If our model is used to predict concepts that lead to a suggestion or flexible outcome, we might want high recall so that the model can allow for exploration.
In either scenario, we will want to ensure our model is trained and tested with data that best reflects its use case.
Once we have determined the goal of our model (high precision or high recall), we can use test data that our model has never seen before to evaluate how well our model predicts according to the standards we have set.
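As a minimal sketch of how a threshold might be picked on held-out data, the snippet below measures precision and recall at a given threshold and then returns the lowest candidate threshold that meets a target precision. The function names, the candidate thresholds, the target, and the test data are all assumptions for illustration.

```python
def precision_recall_at(labels, scores, threshold):
    """Precision and recall when scores >= threshold count as positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and not p)
    precision = tp / (tp + fp) if tp + fp else 1.0  # no positive predictions
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def lowest_threshold_for_precision(labels, scores, target, candidates):
    """Lowest candidate threshold meeting the target precision, else None."""
    for t in sorted(candidates):
        precision, _ = precision_recall_at(labels, scores, t)
        if precision >= target:
            return t
    return None


# Hypothetical held-out test set for one concept.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.85, 0.6, 0.4, 0.7, 0.5, 0.3, 0.1]

at_half = precision_recall_at(labels, scores, 0.5)   # precision 0.6, recall 0.75
at_high = precision_recall_at(labels, scores, 0.8)   # precision 1.0, recall 0.5
chosen = lowest_threshold_for_precision(labels, scores, 0.9, [0.5, 0.6, 0.7, 0.8])
```

Choosing the lowest threshold that satisfies the precision target keeps recall as high as possible among the candidates, since recall can only fall as the threshold rises.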
Once a model is trained and evaluated, how do we determine its accuracy?
The goal of any model is to get it to see the world as you see it.
In multi-class classification, accuracy is determined by the number of correct predictions divided by the total number of examples.
In binary classification, or for mutually exclusive classes, accuracy is determined by the number of true positives added to the number of true negatives, divided by the total number of examples.
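Both definitions are short enough to write out. The sketch below uses hypothetical function names and made-up data; it is only meant to show the two calculations side by side.

```python
def multiclass_accuracy(actual, predicted):
    """Correct predictions divided by the total number of examples."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


def binary_accuracy(tp, tn, fp, fn):
    """(True positives + true negatives) divided by all examples."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical multi-class test set: 3 of 4 predictions match the labels.
actual = ["cat", "dog", "bird", "dog"]
predicted = ["cat", "dog", "dog", "dog"]
multi_acc = multiclass_accuracy(actual, predicted)

# Hypothetical binary counts on a 100-example test set.
binary_acc = binary_accuracy(tp=40, tn=50, fp=5, fn=5)
```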
Once we have established the goal we are working towards with the ground truth, we begin to assess the model’s predictions. What counts as accurate enough is subjective; most clients simply want to know that their models will perform to their standards once they are in the real world.
We begin by running a test set of images through the model and reading their precision and recall scores. The test set of images should be:
- a) inputs that the model has not been trained with; and,
- b) the same kind of data we would expect to see in the model’s particular use case.
Once we have our precision or recall scores, we will compare these to the model’s recall or precision thresholds of 0.5 and 0.8, respectively.