
The Ultimate Guide to AI Metrics and Evaluation - Image Classification

Let us take you through how to evaluate your AI model's performance, and choose the most appropriate metrics. In this article, we focus primarily on image classification machine learning models.


At LabelFlow, we strive to build the GitHub for visual data by developing the most streamlined image labeling tool and dataset marketplace for machine learning models. Marginal gains can be found in fine-tuning a model’s architecture and parameters, but a number of mature tools and frameworks, such as PyTorch, already provide a great basis without starting from scratch. It has recently become clear that the key to improving AI performance lies in improving the quality of the training datasets.

But this raises the question: How is the performance of an AI model defined, and how can it be evaluated?

Before we get started, as there are so many potential use cases and applications for AI, each with its specific metrics, it’s important that we narrow the scope, and focus on a single use case. In this article, we’ll discuss how to define and evaluate the performance of image classification AI models.

What is Image Classification?

Image Classification can be defined as a fundamental task that attempts to understand an entire image as a whole. The goal is to classify the image by assigning it to a specific label. Typically, Image Classification refers to images in which only one object appears and is analyzed. This of course contrasts with object detection or image segmentation, which both combine classification and localization tasks.

This example should be straightforward to illustrate, and it comes with a relatively generic set of metrics that can be applied to different use cases. This article will serve as a basis for future content around other AI applications and their associated metrics.

How to evaluate a model?

Before diving into the evaluation metrics themselves, let’s discuss how to accurately estimate any evaluation metric on a given model. As defined in Statistical Methods for Machine Learning by Jason Brownlee, cross-validation is primarily used in applied machine learning to estimate the performance of a machine learning model on unseen data. That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.

The procedure is often referred to as k-fold cross-validation, where k refers to the number of groups a given data sample is to be split into. With low values of k, each model is trained on less data, and the estimate of the generalization error is biased upward. Conversely, with high values of k, the models are trained on very similar data and are therefore highly correlated with each other; the estimate of the generalization error then has a high variance. A typical value for k is 5 or 10.

The general procedure is as follows:

  1. Shuffle the dataset randomly.
  2. Split the dataset into k even groups.
  3. For each unique group:
     a) Take the group as a holdout or test data set.
     b) Take the remaining groups as a training data set.
     c) Fit a model on the training set and evaluate it on the test set.
     d) Retain the evaluation score and discard the model.
  4. Summarize the skill of the model using the sample of model evaluation scores.

Importantly, each observation in the data sample is assigned to an individual group and stays in that group for the duration of the procedure. This means that each sample is used in the hold-out set once and used to train the model k-1 times.
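For illustration, here is a minimal sketch of the procedure in plain Python. The `train_and_score` callback is a hypothetical stand-in for whatever training-and-scoring routine your framework provides; it receives the train and test index lists and returns one evaluation score.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k roughly even folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    fold_size, remainder = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = fold_size + (1 if i < remainder else 0)
        folds.append(indices[start:start + size])
        start += size
    return folds

def cross_validate(n_samples, k, train_and_score):
    """For each fold: hold it out as the test set, train on the rest,
    retain the score, discard the model. Returns the list of k scores."""
    folds = k_fold_indices(n_samples, k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return scores
```

In practice a library such as scikit-learn provides this out of the box; the sketch only makes the shuffle/split/hold-out loop explicit.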

Now that we’ve provided a little context, let’s move on to a more concrete example.

Evaluation Metrics

True and False Positives and Negatives

It is important to be up to speed with the nomenclature associated with AI model performance to fully understand the most relevant metrics. First of all, let’s talk about True Positives and True Negatives, as well as False Positives and False Negatives. Let’s take quite a topical example to illustrate these different ideas.


It’s as simple as that:

  • TP: Person with coronavirus tested positive
  • FP: Person without coronavirus tested positive
  • FN: Person with coronavirus tested negative
  • TN: Person without coronavirus tested negative
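These four counts can be tallied in a few lines of Python. The labels below are invented for illustration, using 1 for "has the virus" / "tested positive" and 0 otherwise:

```python
def confusion_counts(y_true, y_pred):
    """Tally TP, FP, FN and TN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

# Made-up example: 1 = has the virus / tested positive, 0 = otherwise
has_virus  = [1, 1, 1, 0, 0, 0]
tested_pos = [1, 0, 1, 1, 0, 0]
# confusion_counts(has_virus, tested_pos) → (2, 1, 1, 2)
```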

Precision and Recall

Once these ideas have been understood, we can move on to probably the most common metrics in machine learning: Precision and Recall. Precision evaluates the quality of the selection, by analyzing the pool of items selected by the AI model to find the number of relevant selections made. This ultimately leads to a ratio of relevant selected items over the total number of selected items.
Recall, on the other hand, analyzes the number of relevant items selected from the pool of total relevant items, which equates to a ratio of the relevant selected items over the total number of relevant items that were available to be selected.

This idea is illustrated below.


Image classification models take an image as an input, and the resulting outputs are probabilities for each of the considered classes. The class corresponding to the highest probability is retained; mathematically, this corresponds to the “argmax” of the array of probabilities.

From the corresponding probability - also named the confidence score - you can decide whether this inferred class is relevant or not using a confidence score threshold. This threshold often requires careful fine-tuning depending on the use case and class.
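As a rough sketch, the argmax-plus-threshold logic could look like this in plain Python (the class names and threshold value are illustrative):

```python
def classify(probabilities, class_names, threshold=0.5):
    """Pick the class with the highest probability (argmax); keep the
    prediction only if its confidence score clears the threshold."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    confidence = probabilities[best]
    if confidence < threshold:
        return None, confidence  # prediction rejected as low-confidence
    return class_names[best], confidence

# classify([0.1, 0.7, 0.2], ["cat", "dog", "bird"]) → ("dog", 0.7)
```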

The simplest way of evaluating a machine learning model’s performance is to compare the model’s predictions with the ground truth. The first step is to make an account of the predictions or inferences by splitting them into 4 categories, as mentioned previously: True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).

  • Each of the model’s predictions can be classified into True Positives or False Positives.
  • Every ground truth that was missed can be labeled as a False Negative.
  • Images left unlabeled by the model that have no ground truth can therefore be classified as True Negatives.

Quantifying these values will allow you to calculate the Precision and Recall of your model.
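In code, both ratios follow directly from the four counts; a minimal Python version (returning 0.0 when the denominator is empty is one common convention):

```python
def precision(tp, fp):
    """Relevant selected items / all selected items."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Relevant selected items / all relevant items."""
    return tp / (tp + fn) if tp + fn else 0.0
```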

Confusion Matrices

In order to get a better understanding of your model, you can organize the results into a confusion matrix, which sheds light on which classes are most often confused with each other by the model. A confusion matrix, also referred to as an error matrix, is the perfect tool to visualize the results of an image classification machine learning model. Let’s take the example of a machine learning model tasked with classifying images containing cats and dogs. Technically speaking, the model is a binary classifier that needs to decide whether the input picture is a cat or not: a picture of a cat is considered a positive sample, while a picture of a dog is considered a negative sample.


The 12-image dataset consists of 8 cat images and 4 images of dogs.

The confusion matrix detailing this model’s performance is as follows:


This table tells us that: Of the 8 cat pictures, the model predicted that 2 were dogs, and of the 4 dog pictures, the model inferred 1 was a cat. The correct predictions are highlighted in bold, in the diagonal of the table.
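The same matrix can be reconstructed in a few lines of Python; the `actual` and `predicted` lists below are made up to match the counts in the example:

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, classes):
    """Rows = actual class, columns = predicted class."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(actual, pred)] for pred in classes] for actual in classes]

# The 12-image cat/dog example from the text:
actual    = ["cat"] * 8 + ["dog"] * 4
predicted = ["cat"] * 6 + ["dog"] * 2 + ["cat"] * 1 + ["dog"] * 3
matrix = confusion_matrix(actual, predicted, ["cat", "dog"])
# matrix → [[6, 2], [1, 3]]
```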


Sensitivity & Specificity

These values mathematically describe the accuracy of a model in reporting the presence or absence of a specific object or class, in comparison to the ground truth or ‘Gold Standard’.

  • Sensitivity represents the True Positive Rate, which is calculated as follows:

Sensitivity = TP rate = True Positives / (True Positives + False Negatives)

  • Specificity represents the True Negative Rate, which is calculated as follows:

Specificity = TN rate = True Negatives / (True Negatives + False Positives)
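Both rates are one-liners given the four counts; a quick Python sketch, evaluated on the cat/dog counts from the confusion matrix above (6 TP, 2 FN, 3 TN, 1 FP):

```python
def sensitivity(tp, fn):
    """True Positive Rate: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True Negative Rate: TN / (TN + FP)."""
    return tn / (tn + fp)

# sensitivity(6, 2) → 0.75, specificity(3, 1) → 0.75
```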

There’s always a tradeoff between these two values, which can be adjusted by modifying the confidence score threshold.

Depending on the use case, either having a high sensitivity or specificity can be desirable. Let’s take the case of a medical test for diagnosing a condition. Sensitivity, in this case, refers to the model’s ability to detect ill patients who have the condition. For a test with high sensitivity, a negative test is useful for ruling out the disease. Negative results for models with high sensitivities are reliable, as a test with 100% sensitivity would recognize all patients suffering from the condition by giving a positive result.

However, a positive result in a high sensitivity model is not necessarily useful. A faulty model that always gives a positive reading would have a sensitivity of 100% when applied to diseased patients. Sensitivity does not take false positives into account, rendering it useless for “ruling-in” the disease.


On the other hand, a positive result with a high specificity model is useful for ruling in disease. A positive result represents a high probability of the presence of the condition, as a test with 100% specificity would recognize all patients without the disease by testing negative. Therefore, a positive result would definitely rule-in the presence of the disease.

Conversely, a negative result from a model with a high specificity is not useful for ruling out the disease. Let’s once again use the example of a faulty model, this time one that always returns a negative result: said model would have a specificity of 100%, as specificity doesn’t consider false negatives, rendering it useless for ruling out disease.



In terms of sensitivity and specificity, the confusion matrix is as follows:


This type of matrix can be used to further understand your model.

There are a number of possible ways to evaluate the performance of such a model. A lot of metrics are calculated based on the Precision and Recall values of the model, as explained earlier on.

F-score & Informedness

The model’s F-score is a measure of its accuracy; mathematically, it is the harmonic mean of the Precision and Recall values. Its highest possible value is 1.0, representing perfect Precision and Recall, whereas its lowest possible value is 0. Plain accuracy, by contrast, can be biased towards over-represented classes, which can render a machine learning model completely useless. The classic example here is a dummy cancer classification model that always predicts “no cancer” and therefore achieves very high accuracy, due to the rare nature of the disease: it is statistically correct in the vast majority of cases but completely useless. Hence why it is crucial to be mindful of the model’s objectives, and to select the appropriate metrics and evaluation methods for your model.

Youden’s J statistic, generalized as informedness, can alleviate certain biases by also taking False Positives and False Negatives into account, and is calculated as follows:

J = Sensitivity + Specificity - 1

The expanded formula is:

J = [TP/(TP+FN)] + [TN/(TN+FP)] - 1
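A direct translation of the expanded formula into Python, reusing the cat/dog counts from earlier as an illustration:

```python
def youdens_j(tp, fn, tn, fp):
    """Youden's J = Sensitivity + Specificity - 1.
    0 for a no-skill model, 1 for a perfect one."""
    return tp / (tp + fn) + tn / (tn + fp) - 1

# youdens_j(6, 2, 3, 1) → 0.5
```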

These values should provide a good understanding of the overall performance of the model, whilst limiting certain inherent biases.

Predicting Probabilities

It can often be more flexible to predict the probabilities of each class, instead of predicting the class directly. This allows for the capability to choose and calibrate the confidence threshold, as mentioned previously.

The confidence threshold can be fine-tuned to suit the required behavior of the model, for the specific use case. Indeed, for an AI model tasked with classifying medical images with cancerous cells, a False Positive error is a much safer outcome than a False Negative - The confidence threshold will therefore be calibrated accordingly.

A common way to compare models that predict probabilities for a two-class problem, such as image classification, is to use a ROC curve.

Data visualization

ROC Curves

The Receiver Operating Characteristic (ROC) curve is a plot of the False Positive rate (x-axis) against the True Positive rate, or sensitivity (y-axis), for a number of different threshold values between 0.0 and 1.0. In simple terms, it plots the false alarm rate against the hit rate.

The False Positive rate is the complement of the Specificity, and can be calculated as follows:

FP rate = False Positives / (False Positives + True Negatives) = 1 - Specificity

The ROC curve is a useful tool for a few different reasons:

  • The curves of different models can be compared directly, either overall or at specific thresholds.
  • The areas under the curve can be used as a summary of the model skill.

The shape of the curve also contains a lot of information regarding the False Positive Rate and the False Negative Rate. To make this clear, as explained once again by Jason Brownlee at Machine Learning Mastery:

  • Smaller values on the x-axis of the plot indicate lower false positives and higher true negatives.
  • Larger values on the y-axis of the plot indicate higher true positives and lower false negatives.
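To make the construction concrete, here is a small pure-Python sketch that sweeps a list of thresholds and returns the (FPR, TPR) points of a ROC curve; the toy labels and scores are invented for illustration:

```python
def roc_points(y_true, scores, thresholds):
    """For each threshold, compute (FPR, TPR), predicting positive
    whenever the confidence score meets the threshold."""
    points = []
    for thr in thresholds:
        preds = [1 if s >= thr else 0 for s in scores]
        tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
        fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
        fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
        tn = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 0)
        fpr = fp / (fp + tn) if fp + tn else 0.0
        tpr = tp / (tp + fn) if tp + fn else 0.0
        points.append((fpr, tpr))
    return points

# roc_points([1, 1, 0, 0], [0.9, 0.6, 0.4, 0.2], [0.0, 0.5, 1.0])
# → [(1.0, 1.0), (0.0, 1.0), (0.0, 0.0)]
```

Libraries such as scikit-learn compute this directly (e.g. `sklearn.metrics.roc_curve`); the loop above only makes the threshold sweep explicit.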


Precision-Recall Curves

Before moving on to the curve, here’s a quick reminder of what these values represent:

Precision = True Positives / (True Positives + False Positives)

Recall = True Positives / (True Positives + False Negatives)

Precision-recall curves can be very useful in cases where there is an imbalance in the representation of two classes. Let’s take the example of an image classification model, analyzing images of wind turbine blades and tasked with classifying images that contain cracks. Thankfully, blade cracks (class 1) remain a rare occurrence and are severely underrepresented in the dataset, when compared to images without cracks (class 0).

Naturally, we’re less interested in the model’s ability to classify “No crack” images, or True Negatives in this case. As can be seen in the formulae above, Precision and Recall don’t take True Negatives into account, they are only concerned with the True Positive prediction of the minority class, or in this case, the “Blade Crack” class.

A Precision-Recall curve plots the Precision (y-axis) versus the Recall (x-axis) for different thresholds, similarly to the ROC curve.

Unlike the ROC curve, the baseline for a P-R curve is determined by the ratio of Positives to Negatives. This baseline is the no-skill classifier: a model that cannot discriminate between the two classes, plotted as a horizontal line whose precision equals the proportion of positive samples. For a balanced dataset with an equal number of images with and without blade cracks, the no-skill classifier would be a horizontal line at a precision of y = 0.5.
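A minimal sketch of the same threshold sweep for a P-R curve, in plain Python (labels 1 = "Blade Crack", 0 = "No crack"; returning a precision of 1.0 when nothing is predicted positive is one common convention, not the only one):

```python
def pr_points(y_true, scores, thresholds):
    """For each threshold, compute (Recall, Precision), predicting
    positive whenever the confidence score meets the threshold."""
    points = []
    for thr in thresholds:
        preds = [1 if s >= thr else 0 for s in scores]
        tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
        prec = tp / (tp + fp) if tp + fp else 1.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        points.append((rec, prec))
    return points

def no_skill_precision(y_true):
    """The no-skill baseline: the proportion of positive samples."""
    return sum(y_true) / len(y_true)
```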


To summarize:

  • ROC curves should be used when there are similar or equal numbers of observations for each class.
  • Precision-Recall curves should be used when there is a significant class imbalance.


Over the course of this article, we’ve explained how you can quantifiably and qualitatively evaluate your machine learning model. The appropriate metrics will of course vary depending on the data type, the task, and the required outputs.

This article details metrics that can be used to evaluate image classification machine learning models. There are a number of high-level ideas that can also be applied to the vast majority of machine learning applications, the main takeaways being:

  • Choose the appropriate metrics based on your model’s specific application.
  • Beware of biases from class under- or over-representation; biases can also stem from specific metrics that don’t provide the whole picture.
  • Data visualization tools can help fine-tune your model’s parameters, and typically your classes’ confidence thresholds. Once again, be careful to choose the most appropriate curve for your model.

For more insights and practical examples, we recommend you check out Machine Learning Mastery.


© 2021 LabelFlow, All rights reserved.