Introduction to Dataset Distribution Analysis
With Nexus, you are empowered with an end-to-end no-code experience for training a machine learning model for your use-case. From convenient image annotation tools such as Intellibrush, to a drag-and-drop training workflow, to real-time training metrics and visualisations, and finally a seamless artifact export procedure, there is no need to interact with complicated code.
To further improve the quality of your experience, we are excited to introduce new advanced features alongside our pre-existing Dataset Statistics. You can now assess the quality of your dataset before training your model with the improved Aggregation Statistics. Through our new statistics interface, you can visualise the robustness of your dataset and gain more insight into how to improve it further. The new visualisations include on-demand heat-maps that track label areas and centroids, as well as image and label dimensions.
Let’s walk through a few examples to highlight the importance of analyzing dataset distributions and the augmentation techniques that you could implement to improve your dataset before training your model.
Why is it Important to Analyze Your Dataset Distribution?
With insights into dataset distribution, you gain a general overview of the quality of your dataset and its impact on model performance. You can then make preemptive changes prior to starting model training, which can save a lot of time compared to discovering issues only after your model has been trained. To enable this efficiency, your dataset analysis should address the following points:
What are the Fundamental Dataset Statistics to Watch Out For?
Keeping an up-to-date count of the amount of data in your dataset, as well as your class distribution, is important because it lets you immediately identify critical factors that could impact your model performance, such as insufficient images and underrepresented classes.
With the existing Datature Statistics feature on Nexus, you can instantly zero-in on the quantity of your dataset, and verify the uniformity of your label classes.
In this example, the dataset contains only 20 images. With such a small dataset, the model is very likely to overfit the training data. An up-to-date count of the data in the dataset makes it easy to recognise the need to significantly increase the number of images, so as to improve the overall performance of any future models trained on this dataset.
An uneven distribution of label classes can be detrimental to a model during training. A model learns less well from an under-represented class than from a well-represented one, simply because it is exposed to the under-represented class less frequently. In this example, the labels within the dataset are not evenly distributed: two classes are clearly under-represented. With the tag distribution chart, you can identify which classes need more labels and build a better-balanced dataset.
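As a sketch of the kind of check the tag distribution chart automates, the snippet below counts images and per-class labels. The annotation format and the "below the mean" threshold are illustrative assumptions for this example, not the Nexus API:

```python
from collections import Counter

# Hypothetical annotations: one (image_id, class_name) pair per label.
annotations = [
    ("img_01", "car"), ("img_01", "car"), ("img_02", "pedestrian"),
    ("img_03", "car"), ("img_03", "cyclist"), ("img_04", "car"),
]

# Count distinct images and labels per class.
num_images = len({image_id for image_id, _ in annotations})
tag_distribution = Counter(cls for _, cls in annotations)

# Flag classes whose label count falls below the mean label count.
mean_count = sum(tag_distribution.values()) / len(tag_distribution)
underrepresented = [c for c, n in tag_distribution.items() if n < mean_count]

print(num_images)         # 4
print(tag_distribution)   # Counter({'car': 4, 'pedestrian': 1, 'cyclist': 1})
print(underrepresented)   # ['pedestrian', 'cyclist']
```

Running a check like this on every dataset update surfaces both of the issues above (too few images, imbalanced classes) before any training time is spent.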
Why is it Important to Analyze Label Distributions?
It is important to spread your ground truth labels throughout different regions of your images. This forces the model to work on the entire image instead of just specific regions. For example, if your labels are all located at the bottom right of the image, the model will zero-in on the bottom right of subsequent images when performing detections. Unless your use case dictates this behaviour, it could negatively impact the performance of the model when objects appear at the top left of the image in production.
The Aggregation Statistics feature provides two newly released heat-maps showing the distribution of the label areas and their centroids. This enables you to observe the areas that are represented within the entire dataset.
The brighter a point in the heat-map, the greater the number of occurrences at the location it represents.
For the Area heat-map, a brighter spot indicates a greater proportion of labels being located at that specific region of the image. For the Centroids heat-map, a brighter spot likewise signifies a greater number of label centroids existing at that particular spot in the image.
These heat-maps allow you to easily identify whether the spatial distribution of your labels lacks uniformity, and immediately pinpoint the specific regions of your images where you need to add more labels.
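Conceptually, a centroid heat-map is just a 2D histogram of box midpoints. The sketch below, using hypothetical boxes in normalised (x_min, y_min, x_max, y_max) coordinates, bins the centroids onto a coarse grid:

```python
import numpy as np

# Hypothetical boxes in normalised (x_min, y_min, x_max, y_max) coordinates.
boxes = np.array([
    [0.1, 0.1, 0.3, 0.4],
    [0.6, 0.5, 0.9, 0.8],
    [0.2, 0.2, 0.4, 0.5],
])

# Centroids are the box midpoints.
cx = (boxes[:, 0] + boxes[:, 2]) / 2
cy = (boxes[:, 1] + boxes[:, 3]) / 2

# Bin centroids onto a coarse 4x4 grid; cells with higher counts
# correspond to the brighter spots in the heat-map.
heatmap, _, _ = np.histogram2d(cy, cx, bins=4, range=[[0, 1], [0, 1]])
```

The same idea extends to the Area heat-map by accumulating each box's footprint instead of a single midpoint per box.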
The dataset example above is ideal because both the label area and label centroid heat-maps are evenly distributed.
Here, we have a dataset whose labels are too concentrated in the center. Models trained on this dataset may fail to detect objects located at the corners of the image. Augmentations such as Center Crop and Random Resized Crop can be utilised to improve model performance by increasing the ratio of annotation area to image area. In addition, the Shift Scale Rotate augmentation can help by increasing the number of annotation centroids at the edges of the dataset images.
In this example, the labels within the dataset are concentrated at the middle-left and center of their respective images. Augmentations that counteract this are Horizontal Flip and Shift Scale Rotate. These will likewise lead to more uniformly distributed labels within the dataset, resulting in better model performance.
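To see why a horizontal flip helps, here is a minimal sketch (not the Nexus implementation) of how flipping mirrors normalised boxes, moving centroids clustered on the left half of the image over to the right:

```python
import numpy as np

# Hypothetical normalised boxes clustered in the left half of the image.
boxes = np.array([
    [0.05, 0.4, 0.25, 0.6],
    [0.10, 0.3, 0.30, 0.5],
])

def hflip_boxes(b):
    """Mirror boxes about the vertical centre line: x' = 1 - x."""
    flipped = b.copy()
    flipped[:, 0] = 1.0 - b[:, 2]  # new x_min comes from the old x_max
    flipped[:, 2] = 1.0 - b[:, 0]  # new x_max comes from the old x_min
    return flipped

flipped = hflip_boxes(boxes)
# Centroid x of the first box moves from 0.15 to 0.85.
```

Applying the flip with some probability during training means the model sees labels on both sides of the image, evening out the centroid heat-map.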
How Do Asset and Label Dimensions Affect Model Training?
Most models take in square images as input. If the input image is not square, it will be resized into a square before being fed into the model. Because of this, it is very important to keep track of the aspect ratios of both the images and the bounding boxes at the same time.
For example, when image assets are too tall and their respective annotations are too wide (or vice versa), the resizing prior to training will severely squash the annotations.
In the worst-case scenario, the bounding box becomes so compressed that it is impossible for the model to learn anything useful from the label.
Essentially, if your image is very wide, you generally want to avoid annotations that are very tall; likewise, if the image is very tall, it is best to avoid wide annotations.
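The squashing effect is easy to quantify. Assuming a hypothetical 640x1280 portrait image resized to a 640x640 square, each axis is scaled independently, so a wide box becomes even wider relative to its height:

```python
# Hypothetical 640x1280 (width x height) portrait image resized to 640x640.
img_w, img_h = 640, 1280
target = 640

# A wide box spanning 400x100 pixels in the original image (4:1 aspect ratio).
box_w, box_h = 400, 100

# Each axis is scaled independently by the non-uniform resize.
sx, sy = target / img_w, target / img_h   # 1.0 and 0.5

new_w, new_h = box_w * sx, box_h * sy     # 400.0 and 50.0

# The box aspect ratio doubles from 4:1 to 8:1 -- the label is squashed.
print(new_w / new_h)   # 8.0
```

The taller the image relative to its width (or vice versa), the more extreme this distortion becomes for boxes oriented the other way.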
With the introduction of the Aggregation Statistics feature, Nexus now provides two more new heat-maps portraying the image and label dimensions respectively. The Annotation Dimensions heat-map shows the concentration of bounding box sizes, giving you a clear view of how your bounding box dimensions are distributed. This can inform other types of positional augmentations that help the model detect objects from more perspectives.
The Asset Dimensions heat-map shows the concentration of asset dimensions in the same way. This can reveal whether further preprocessing is needed before uploading the assets to the platform for annotating and training.
With these tools, you can immediately identify the potential hazards mentioned above.
The dataset example above is ideal, as the heat-maps show that neither the asset nor the annotation dimensions are too tall or too wide.
From this example, you can instantly tell that some of your assets are too tall while some of your annotations are too wide. This may lead to the problems discussed earlier. A possible solution would be to implement the Random Crop augmentation to crop the image to a square first. By doing so, there is no subsequent change in the image's aspect ratio when it is used for training, overcoming the problem of squashed annotations.
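One way to avoid the aspect-ratio change is to crop a square before resizing. The sketch below takes the largest centred square from a hypothetical tall image (Random Crop would pick the offset randomly rather than centering it):

```python
import numpy as np

# Hypothetical tall image, shape (height, width, channels).
img = np.zeros((1280, 640, 3), dtype=np.uint8)

h, w = img.shape[:2]
side = min(h, w)        # 640: largest square that fits
top = (h - side) // 2   # 320: centre the crop vertically
left = (w - side) // 2  # 0
square = img[top:top + side, left:left + side]

# The crop is already square, so resizing it to the model's input
# size no longer distorts the annotations.
```

The trade-off is that labels outside the cropped region are lost, which is why a random (rather than fixed) crop offset is typically used during training.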
The Impact of Dataset Distribution Analysis in Practice
While research in the field of computer vision continues to push for improvements in model robustness through changes in model architecture, it is undeniable that the most efficacious best practices come from the improved treatment and preprocessing of training data.
For example, convolutional neural networks, the fundamental architecture underlying most computer vision models, have been shown to exploit absolute spatial location in their predictions, demonstrating that they are not fully translation invariant. Therefore, if the fundamental structures that make up convolutional neural networks can learn to exploit information beyond the visual features of objects, we should look to improve the robustness of our datasets to better challenge and train our models.
However, without the ability to analyze dataset distributions, you cannot determine the most effective operations for improving dataset quality, and making arbitrary changes can lengthen model training time without necessarily yielding any improvement. By analyzing dataset distributions with tools such as the Aggregation Statistics on Nexus, you can get a clearer sense of your dataset's deficiencies and take precise, effective actions that improve its efficacy in training settings. In practice, data scientists would always prefer to collect naturally well-balanced and well-distributed datasets. In the innumerable cases where this is not possible, however, tools like data augmentation, ranging from simple changes like vertical flips and random crops to more advanced state-of-the-art operations like the simple copy-paste training method, can have a tremendous impact on model training performance and robustness during model deployment.
As the world of machine learning evolves to become more data-centric, we should likewise turn our attention to the most accessible and impactful component of the machine learning pipeline: the dataset.
Additional Capabilities That You Could Explore
Overall, these heat-maps serve as an easy yet detailed visual reference for understanding nuances in your dataset that can have real effects on model training and performance.
Now that you’ve ensured that your dataset is optimised, it is time to start training your model! Perhaps a YOLOX model could be a good starting point for you. With just a few clicks, you will be able to train and monitor a model of your own. Sound appealing? Check out how you can train a YOLOX model (without code) to kick off your model training experience!
Our Developer’s Roadmap
At Datature, we are committed to facilitating a seamless end-to-end machine learning experience for first-timers and advanced developers alike. To that end, we plan to introduce an option for users to upload videos for training. Soon, we will be releasing more amazing features such as video-tracking annotators, so do stay tuned!
Want to Get Started?
If you have questions, feel free to join our Community Slack to post your questions or contact us about how the Aggregation Statistics Interface fits in with your usage.
For more detailed information about the Aggregation Statistics Interface and other tools that are offered on Nexus, or answers to any common questions you might have, read more about Nexus on our Developer Portal.