How To Interpret Training Graphs to Understand and Improve Model Performance

With sufficiently detailed data and the knowledge to analyze them, users can reduce redundant experimentation and pinpoint improvements for model setup.

Akanksha Chokshi

What are Training Graphs and Why are They Important?

Once a model has finished training on its data, it is important to evaluate its performance to understand whether it is a good fit for our dataset and has learned enough to make generalised predictions on unseen data. Training Graphs provide a visual representation of how a model’s metrics, such as loss, recall and precision, change over time, allowing us to see how well our model is learning from our data. By analysing these graphs, we can get valuable information about the model quality as well as improvements we might need to make for better predictions.

Nexus’s Training Graphs feature provides you with the relevant information you need to better understand the performance of your model. Once your project has finished training, navigate to the Trainings tab under the Project Overview section in order to see a list of all the Training Sessions for that project. Select the session you want to view the training graphs for, and a dashboard should open up.

Under the Metrics tab of this dashboard, you can view several graphs (losses and evaluation metrics) that will help you evaluate the performance of your model. You can also use the Advanced Evaluation tab to view individual predictions for each image, and change the “smoothing” of the graphs on the Metrics tab under Settings to see a less noisy version of the training graphs.
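To give a feel for what a smoothing setting does to a curve, here is a minimal sketch using an exponential moving average. This is an assumption for illustration only: the article does not say which smoothing method Nexus applies, and the function name `smooth` and its `weight` parameter are hypothetical.

```python
def smooth(values, weight=0.6):
    """Exponential moving average of a logged metric.

    ASSUMPTION: this is a generic smoothing technique, not necessarily
    the one used by the dashboard. `weight` is in [0, 1); higher values
    give a smoother (but more lagged) curve.
    """
    if not 0.0 <= weight < 1.0:
        raise ValueError("weight must be in [0, 1)")
    smoothed = []
    last = None
    for value in values:
        # The first point is kept as-is; later points blend with the history.
        last = value if last is None else weight * last + (1.0 - weight) * value
        smoothed.append(last)
    return smoothed


noisy_loss = [1.0, 0.4, 0.9, 0.3, 0.7, 0.2]
print(smooth(noisy_loss, weight=0.5))
```

A weight of 0 returns the raw curve unchanged, which is a quick way to confirm that smoothing only changes how the data is displayed, not the underlying values.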

In case you have any questions regarding this process, feel free to consult our developer docs or reach out to us on Slack.

Interpreting Training Graphs on Nexus

There are two kinds of training graphs displayed on Nexus - Losses and Evaluation Metrics (Recall and Precision). In this article, we shall delve deeper into how we can interpret these two kinds of graphs, what information they can provide about our model and how they can indicate potential improvements that our model might need. However, before we jump into the technicality of these graphs, we shall briefly go through the concept of model evaluation and how it works.

Before a model starts training on a dataset, the dataset is split into smaller subsets - training and test. The training dataset is the larger subset that the model learns from, adjusting its weights at every step. The test set is a smaller, previously unseen subset that the trained model attempts to make predictions on. The model’s performance on the test set shows whether the model has learned well from the training set and can generalise what it has learned to make predictions on new data. Hence, in all the graphs on our dashboard, you will see a line called “eval”. This simply represents the model’s performance on the test dataset.

Before training your model, Nexus provides you with the option to adjust the train-test split, the proportion of the dataset that would be split across these two categories. A train-test split of 0.3, for example, means that 70% of the dataset is used for training and 30% is used for testing. As we shall explore in this article, this split proportion might need to be adjusted based on the model performance.
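The split described above can be sketched in a few lines. This is an illustrative stand-in, not how Nexus performs the split internally; the function name, the `seed` parameter and the use of a random shuffle are assumptions.

```python
import random


def train_test_split(items, test_fraction=0.3, seed=0):
    """Shuffle `items` and hold out `test_fraction` of them for testing.

    Hypothetical helper for illustration; returns (train, test).
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train, test = train_test_split(range(100), test_fraction=0.3)
print(len(train), len(test))  # 70 30
```

Every item lands in exactly one of the two subsets, so the model is never evaluated on an image it trained on.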

We shall now explore the interpretation and significance of Loss Curves and then move on to Evaluation Metrics (Recall and Precision).

Loss Curves

What are loss curves?

Loss curves capture a model’s change in performance over time or rather across the number of steps the model has run. They help us understand whether our model is fitting our data well (not overfitting or underfitting) as well as diagnose whether the datasets and/or number of training steps are representative enough.

Loss is simply a measure of the difference between the actual values and the values predicted by the model. On our dashboard, we can see three different types of loss curves - classification, localisation and regularisation. Total loss is the sum of these three loss curves. These losses are represented for both the training dataset and the test dataset (eval).
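As a small worked example of how the total loss curve relates to its components, the sketch below adds three logged curves step by step. The function name and the sample values are made up for illustration.

```python
def total_loss_curve(classification, localisation, regularisation):
    """Sum three per-step loss curves into a total loss curve."""
    if not (len(classification) == len(localisation) == len(regularisation)):
        raise ValueError("all three curves must have the same number of steps")
    return [c + l + r for c, l, r in zip(classification, localisation, regularisation)]


# Illustrative values only, one entry per logged step.
cls_loss = [0.5, 0.25]
loc_loss = [0.25, 0.125]
reg_loss = [0.125, 0.125]
print(total_loss_curve(cls_loss, loc_loss, reg_loss))  # [0.875, 0.5]
```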

What do these different losses represent?

Most object detection models have two key types of losses that help evaluate a model: classification and localisation. Classification loss is associated with labelling the object correctly and compares the predicted object labels to the actual object labels. Localisation loss is associated with determining the “bounding box”, the area where the object is located, and so compares the predicted bounding boxes to the actual bounding boxes. Regularisation loss, by contrast, comes from regularisation (an attempt to keep the model simple by penalising large weights) and is not usually an indicator of the performance of the model.

What can the loss curves tell us about the model?

For a model to be the right fit, both the training loss and the test loss should decline and eventually plateau, indicating the model has finished the process of learning. The two curves (training and eval) should also be relatively close to each other. If that is the case, we can be reasonably confident that our model has been able to learn well from our dataset. This may look something like the image below:

However, there are two cases where a model might not be the right fit for the dataset: when it underfits and when it overfits. We shall now explore how our dashboard can help you identify when this is the case and what the best solution could be.

A model that underfits is one that is unable to learn from the training dataset. In case of an underfitting model, the test loss line is relatively flat or significantly higher than the training loss. This means that the dataset is currently too complex for the model to learn from. A few solutions include: increasing the size of the dataset, reducing the batch size of the model, improving quality of feature labels, increasing model complexity or reducing regularisation. The following graph represents a model that may be underfitting:

Another indicator of underfitting could be when the test loss does not plateau but continues to decrease until the end of training. This usually means that the number of training steps is insufficient for the model to finish learning. A solution to this situation is to increase the number of training steps for the model.

A model that overfits is one that is specialised to the training set and unable to generalise to the test set. In case of overfitting, the training loss does not plateau but keeps decreasing or the test loss decreases and then starts increasing again. The image below shows slight indicators of overfitting:

A few solutions to overfitting include: reducing the learning rate, reducing the model complexity, increasing the batch size of the model, increasing regularisation and adding dropout layers to the model.
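The visual checks described in this section can be approximated with a rough heuristic. The sketch below is only a starting point: the function name, the labels it returns and all three tolerance values are hypothetical and would need tuning for real curves. It also assumes every logged loss is positive.

```python
def diagnose(train_loss, eval_loss, flat_tol=0.05, rise_tol=0.05, tail_tol=0.05):
    """Rough, heuristic reading of a pair of loss curves.

    ASSUMPTIONS: losses are positive, logged at the same steps, and the
    tolerances (relative changes) are illustrative rather than recommended.
    """
    if len(train_loss) < 2 or len(eval_loss) < 2:
        raise ValueError("need at least two logged steps per curve")

    best_eval = min(eval_loss)
    eval_rise = (eval_loss[-1] - best_eval) / best_eval        # rebound after the minimum
    eval_drop = (eval_loss[0] - best_eval) / eval_loss[0]      # total improvement
    tail_drop = (eval_loss[-2] - eval_loss[-1]) / eval_loss[-2]  # improvement at the very end
    train_still_best = train_loss[-1] <= min(train_loss)

    if eval_rise > rise_tol and train_still_best:
        return "possible overfitting"          # eval loss turned upward while training loss kept falling
    if eval_drop < flat_tol:
        return "possible underfitting"         # eval loss is essentially flat
    if tail_drop > tail_tol:
        return "still improving"               # no plateau yet: consider more training steps
    return "reasonable fit"


print(diagnose([1.0, 0.6, 0.4, 0.35, 0.34], [1.1, 0.7, 0.5, 0.42, 0.41]))
print(diagnose([1.0, 0.5, 0.3, 0.2, 0.1], [1.1, 0.7, 0.6, 0.7, 0.9]))
print(diagnose([1.0, 0.99, 0.98, 0.98, 0.97], [1.2, 1.2, 1.19, 1.19, 1.19]))
print(diagnose([0.9, 0.7, 0.5, 0.4, 0.28], [1.0, 0.8, 0.6, 0.45, 0.3]))
```

A heuristic like this is best used to flag sessions worth a closer look; the graphs themselves remain the source of truth.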

Some of these solutions are too complex for the average current user to implement themselves, and we are constantly working to add in these features in future iterations. Our goal is to empower our users to be able to easily and seamlessly be in control of the end-to-end process involving their data, and increased customizability and flexibility is what we are always working towards.

One such feature we provide is the Neural Training Insights section, where you can also view the model statistics (training steps, batch size, learning rate and model resolution). This is located right at the bottom of the Project Overview tab. It helps you visualise total loss, classification loss and localisation loss across these settings and can be a concise overview of your model statistics and performance. Batch size refers to the number of training examples the model uses in one training step (iteration), not in a whole epoch. A larger batch size gives smoother gradient estimates and usually trains faster, but the resulting model often generalises less well and may be more prone to overfitting. A smaller batch size adds noise to each update, which has a regularising effect and can help generalisation, but it trains more slowly and may be prone to underfitting. Based on your already evaluated model’s performance (overfit or underfit), you may choose to increase or decrease your training steps or batch size and rerun your model.
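Because training steps and batch size interact, it can help to work out how much data a given configuration actually sees. The sketch below assumes one training step processes one batch; the function names and the numbers are illustrative.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Number of steps needed to pass over the training set once."""
    if num_examples <= 0 or batch_size <= 0:
        raise ValueError("num_examples and batch_size must be positive")
    return math.ceil(num_examples / batch_size)


def epochs_covered(training_steps, num_examples, batch_size):
    """Approximate number of passes over the data for a given step budget."""
    return training_steps * batch_size / num_examples


print(steps_per_epoch(1000, 32))        # 32
print(steps_per_epoch(1000, 8))         # 125
print(epochs_covered(3000, 1000, 8))    # 24.0
```

Halving the batch size while keeping the number of training steps fixed halves the number of passes over the data, which is worth keeping in mind when comparing two training sessions.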

What do the loss curves tell us about our data itself?

Apart from overfitting and underfitting, the model may also fail to learn well from the data if the training or test datasets are not representative enough - either the dataset itself is too small or the split between the training and test sets needs to be adjusted. A large gap between the training and test curves (despite them both declining) indicates that the training dataset is not representative and needs to be larger. Given the declining losses, the model seems to be capable of learning from our training dataset. However, the fact that test losses are still much higher indicates that the training dataset is missing key information that the model needs to learn from. Increasing the size of the training dataset by adjusting the train-test split might be a good solution to explore here.

On the other hand, a very noisy test loss line indicates that the test dataset is not representative enough and needs to be larger so that it can better represent the performance of the model. A test loss lower than the training loss indicates that the test dataset is too simple or there are duplicate observations between the training and test sets. In this case, increasing the size of the test dataset by adjusting the train-test split might be a good idea.

As we can see, the graph above has a noisy test loss line, indicating that its test dataset should likely be bigger.
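Both symptoms discussed in this section, a large gap between the curves and a noisy test loss line, can be put into rough numbers. The sketch below is one possible way to do so; the function names, the `window` parameter and the choice of statistic are assumptions rather than anything the dashboard reports.

```python
from statistics import mean, pstdev


def generalisation_gap(train_loss, eval_loss, window=3):
    """Average eval loss minus average training loss over the last `window` steps.

    A large positive value suggests the training set may not be representative;
    a negative value means the eval loss sits below the training loss.
    """
    return mean(eval_loss[-window:]) - mean(train_loss[-window:])


def noisiness(curve):
    """Standard deviation of step-to-step changes (0 for a perfectly steady trend)."""
    if len(curve) < 2:
        raise ValueError("need at least two points")
    diffs = [b - a for a, b in zip(curve, curve[1:])]
    return pstdev(diffs)


steady_eval = [1.0, 0.8, 0.6, 0.4]
jumpy_eval = [1.0, 0.4, 0.9, 0.3]
print(noisiness(steady_eval) < noisiness(jumpy_eval))  # True
print(generalisation_gap([0.5, 0.5, 0.5], [1.0, 1.0, 1.0]))  # 0.5
```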

What do differences between the different types of loss curves tell us about our model?

Some models may perform well when it comes to classification loss compared to localisation loss, and vice versa. This could also be a good indicator of where the model is doing well and where it needs further improvement. If your model has a localisation loss that is significantly better than your classification loss, it may be identifying the right locations for an object but may struggle to distinguish what it actually is, perhaps due to very similar feature labels or inconsistent annotations.

On the other hand, if your model has a classification loss that is significantly better than your localisation loss, it may be identifying the object correctly but struggling to place the bounding box accurately within the frame. In this case, revisiting your annotations and ensuring that they are tight and specific to the given object might help.

Defining Evaluation Metrics

Apart from loss, there are two other key metrics that can be used to evaluate the performance of a model - recall and precision (mean average precision). We shall now discuss these two metrics and the different ways they are represented on our dashboard.

Average Recall

Average Recall, also known as probability of detection, is the proportion of times the model correctly identified an object out of the number of times the object was actually present. The higher the Recall score is, the better the model’s performance is when it comes to sensitivity or the true positive rate.

The highest possible Recall score is 1, which means that the model has identified the object every single time it was present, while the lowest possible Recall score is 0, which means that the model has failed to identify the object every single time.
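As a worked example, recall for a single class can be computed from two counts: objects the model found (true positives) and objects it missed (false negatives). The function name and the counts below are illustrative.

```python
def recall(true_positives, false_negatives):
    """Fraction of actual objects that the model detected."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("recall is undefined when there are no actual objects")
    return true_positives / total


print(recall(true_positives=8, false_negatives=2))   # 0.8
print(recall(true_positives=10, false_negatives=0))  # 1.0
print(recall(true_positives=0, false_negatives=10))  # 0.0
```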

What if a model doesn’t find an object with its single most confident prediction? We would still like to know how close it actually was to the truth - whether the object was found among the model’s top few predictions or whether the boundaries of the object it detected were close to the actual object. To further such analysis, our dashboard shows Average Recall evaluated at different values (1, 10 and 100) as well as at different sizes (Small, Medium and Large).

These metric names match the COCO evaluation convention, in which the value is the maximum number of detections considered per image, ranked by confidence. Under that convention, Average Recall at 1 counts an object as found only if the single highest-confidence detection in the image matches it. Average Recall at 10 considers the 10 highest-confidence detections per image, and Average Recall at 100 considers the top 100. Recall generally rises as this value increases, since more detections are given the chance to match an actual object.

Additionally, the dashboard shows Average Recall at 100 for different sizes - Small, Medium and Large. Here, the size refers to the area of the bounding box around the object. Recall tends to be higher for larger bounding boxes, since a predicted bounding box is more likely to overlap sufficiently with the actual bounding box containing the object.
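To make the size categories concrete, the sketch below buckets a bounding box by its pixel area. The thresholds (32 x 32 and 96 x 96 pixels) are the ones used by the COCO evaluation convention; whether Nexus uses the same cut-offs is an assumption, and the function name is hypothetical.

```python
def size_bucket(width, height, small_max=32 ** 2, medium_max=96 ** 2):
    """Classify a box as small, medium or large by pixel area.

    ASSUMPTION: default thresholds follow the COCO convention and may not
    match the dashboard's own definition of Small, Medium and Large.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    area = width * height
    if area < small_max:
        return "small"
    if area < medium_max:
        return "medium"
    return "large"


print(size_bucket(20, 20))    # small  (area 400)
print(size_bucket(50, 50))    # medium (area 2500)
print(size_bucket(200, 100))  # large  (area 20000)
```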

Thus, understanding the Average Recall values across different ranges and sizes can help us understand how close our model is to the truth even when it might not be making the correct predictions. Through this, we can gain a deeper sense of confidence in the model’s sensitivity or its ability to detect an object when it is present within the image.

Average Precision

Average Precision refers to the proportion of times the model is correct out of all the times it has identified a certain object. It is an indicator of how far the model’s detections can be trusted, and it also ranges from 0 to 1. An Average Precision of 1 indicates that every time the model has detected a certain object, it has been correct in doing so. An Average Precision of 0 indicates that every time the model has made a prediction, it has been incorrect. Similar to Recall, the size of the bounding box (Small, Medium, Large) applies here too. By the same logic, precision tends to be higher for larger bounding boxes, since predictions are more likely to be accurate within a larger area range.

The dashboard also shows Average Precision at different IOUs. IOU refers to Intersection over Union: the area where the predicted bounding box and the actual bounding box overlap, divided by the total area covered by the two boxes together. An IOU threshold of 0.5 means this ratio must be at least 0.5 for a prediction to be classified as a correct identification, while a threshold of 0.75 means the ratio must be at least 0.75. The lower the IOU threshold, the higher the precision would be, since the criterion for considering a correct classification is less strict.
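The effect of the IOU threshold is easiest to see with numbers. The sketch below computes IOU for axis-aligned boxes and then a simplified precision at a single threshold. It assumes boxes are given as (x_min, y_min, x_max, y_max) and that each prediction has already been paired with an actual box of the same label; a full Average Precision calculation also ranks predictions by confidence, which is left out here. All names are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def precision_at(pairs, threshold):
    """Share of (predicted, actual) box pairs whose IOU meets the threshold.

    Simplified stand-in for precision at one IOU threshold, not full AP.
    """
    if not pairs:
        raise ValueError("need at least one prediction")
    correct = sum(1 for predicted, actual in pairs if iou(predicted, actual) >= threshold)
    return correct / len(pairs)


actual = (0, 0, 10, 10)
predictions = [(0, 0, 10, 10), (0, 0, 10, 8), (0, 0, 10, 6), (5, 5, 15, 15)]
pairs = [(p, actual) for p in predictions]

print(precision_at(pairs, 0.5))   # 0.75
print(precision_at(pairs, 0.75))  # 0.5
```

Note that a box shifted sideways by half its width, such as (5, 0, 15, 10) against (0, 0, 10, 10), shares half of its own area with the actual box but only reaches an IOU of one third, because the union grows as the boxes move apart.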

Thus, through understanding Average Precision, we can gain a better understanding of the model’s confidence when it detects a particular object. Based on the nature of the dataset, we might need to be more or less strict about how we choose to evaluate precision, which is when the different sizes and IOUs can help us identify the precision metric that is relevant to us.

Practical Impact of Training Analysis

Analyzing data from the training process is hugely important to the developmental process in a machine learning pipeline, and certainly, there is a level of expertise that can only be developed through experience and contextual knowledge of the use case. However, objectively analyzing these graphs can reveal details about the progress and capabilities of the model that will allow users to better understand how the current model performance differs from the ideal case.

At Datature, we are committed to facilitating a seamless end-to-end machine learning experience for first-timers and advanced developers alike. Through our Training Graphs dashboard, we hope to make the process of training and evaluating your model more accessible, understandable and intuitive for the average user.

Our Developer's Roadmap

Our dashboard provides users with a concise set of tools which they can use to evaluate their model’s performance after training. These metrics (loss, recall and precision) can provide us valuable information about our model and help us diagnose quick fixes and improvements when needed. To that end, we plan to introduce more features that provide users the ability to further customise their training process so that they can be more in control of their models and their performance, thus providing a more in-depth toolkit to solve issues they are observing in the training data.

Want to Get Started?

If you have any questions about interpreting, analysing or evaluating your model, feel free to join our Community Slack to post them, or contact us directly.

For more detailed information about the process of uploading, annotating and training your data, as well as other tools that are offered on Nexus, or answers to any common questions you might have, read more about Nexus on our Developer Portal.
