What is Small Object Detection?
Small object detection is a challenging computer vision task focused on techniques that improve the detection of relatively small features within larger images. In typical inference pipelines, input images are resized directly to the model's predefined input size, which is usually capped at around 2000 by 2000 pixels. In practice, however, large images such as DICOM or NIfTI files in medical imaging or GeoTIFF files in satellite imagery can easily measure tens of thousands of pixels per side. With satellite imagery, for instance, individual images cover such a large geographic region that important objects occupy only a tiny pixel region relative to the overall image dimensions. If passed through the typical inference pipeline, these small objects would be shrunk to so few pixels that they would be rendered undetectable. Special techniques are therefore necessary to improve detection accuracy for these objects.
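To get a feel for the scale of the problem, a quick back-of-the-envelope calculation shows how naive resizing shrinks an object. The numbers below are illustrative, not taken from any particular dataset:

```python
def resized_object_size(image_dim: int, object_dim: int, model_input_dim: int) -> float:
    """Side length (in pixels) an object occupies after naive whole-image resizing."""
    scale = model_input_dim / image_dim
    return object_dim * scale

# A 40 px vehicle in a 10,000 px-wide satellite image, resized to a 640 px model input:
print(resized_object_size(10_000, 40, 640))  # 2.56 px per side, effectively invisible
```

At under 3 pixels per side, the object carries almost no visual signal for a detector, which is exactly the failure mode slicing is designed to avoid.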
What is Slicing Aided Hyper Inference (SAHI)?
Slicing Aided Hyper Inference (SAHI) was put forward by Akyon et al. in 2021 as a variant of sliding window techniques that improves model performance for small object detection. The strategy is fairly simple, but it produces empirical improvements in validation metrics. Image slicing for object detection has been used before, as early as 2013, in "Real-time moving object detection algorithm on high-resolution videos using GPUs" by Kumar et al., which utilized slicing to accelerate inference. SAHI's contribution largely comes from using the sliced images as the basis for model fine-tuning, combined with filtering, preprocessing, and postprocessing techniques that improve the quality of both the slicing and the model predictions.
As a baseline, we will demonstrate the performance of a regularly trained model with the normal inference pipeline on an aerial vehicle detection dataset annotated with bounding boxes, using a prediction confidence threshold of 0.8.
The general approach of sliding window techniques like SAHI is to slice large images into equal subsections and perform model inference on each of them. This allows relatively small objects in the original image to appear relatively large within their local crops, giving the model more visual detail to work with. On its own, this inference technique does increase a model's ability to detect small objects. However, it can also lead to hallucination, in which background or otherwise irrelevant features are detected as relevant objects with high confidence, because the model is pushed to locate small objects everywhere. Additionally, slicing reduces the global context available to the model: it cannot access contextual information surrounding objects, and it has more difficulty detecting large objects that do not fit within a single slice. As shown in the example image below, the sliding window inference pipeline produces more detections, but they are not necessarily accurate and certainly do not represent strong performance.
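The core slicing step can be sketched in a few lines. The helper below computes the top-left offsets of the slices along one axis, clamping the final slice to the image border (a common convention, and the one assumed here):

```python
def slice_origins(size: int, slice_size: int, overlap_ratio: float) -> list[int]:
    """Top-left offsets of sliding-window slices along one image axis."""
    stride = max(1, int(slice_size * (1 - overlap_ratio)))
    origins = []
    pos = 0
    while True:
        if pos + slice_size >= size:
            # Clamp the last slice so it ends exactly at the image border.
            origins.append(max(0, size - slice_size))
            break
        origins.append(pos)
        pos += stride
    return origins

# A 1000 px axis with 320 px slices and 20% overlap gives a 256 px stride:
print(slice_origins(1000, 320, 0.2))  # [0, 256, 512, 680]
```

Taking the cross product of the offsets for both axes yields the full grid of crops to run inference on, which is also where the multiplicative increase in inference volume comes from.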
To enhance the model’s innate capabilities, the same slicing concept can be applied as a training preprocessing step, in which the original training images are sliced and their annotations are correspondingly sliced and adjusted to respect the new crops. These crops are treated as individual training images, allowing the model to learn to perform detections on smaller crops that are representative of the inference technique being used. After training a YOLOv8-Small 320x320 model with training crop sizes of 320x320, we get significantly improved results as shown below, with almost all objects predicted well even at the same confidence threshold of 0.8.
SAHI provides a few hyperparameters: custom crop sizes to better align with model input sizes, and overlap proportions to increase the likelihood that whole annotations are contained in each crop. It also provides filters to discard slices that contain no annotations, and to discard annotations that are only partially visible in a slice due to the cut. The library additionally optimizes these transformations and improves inference speed, reducing the latency incurred by generating more images.
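These hyperparameters map directly onto the open-source `sahi` Python package. The sketch below follows the API as documented in recent `sahi` releases; the weight and image paths are placeholders, so treat it as a starting point rather than a drop-in script:

```python
from sahi import AutoDetectionModel
from sahi.predict import get_sliced_prediction

# Load a fine-tuned detector (placeholder path, YOLOv8 used as an example backend).
detection_model = AutoDetectionModel.from_pretrained(
    model_type="yolov8",
    model_path="weights.pt",
    confidence_threshold=0.8,
)

# Run sliced inference: 320x320 crops with 20% overlap on both axes.
result = get_sliced_prediction(
    "satellite.jpg",
    detection_model,
    slice_height=320,
    slice_width=320,
    overlap_height_ratio=0.2,
    overlap_width_ratio=0.2,
)
print(len(result.object_prediction_list))  # number of merged detections
```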
Using Sliding Window Techniques with Datature’s Nexus
Datature’s Nexus provides a simple platform for all computer vision Machine Learning Operations (MLOps) steps, from the storage and management of training data to model training and deployment. Sliding Window is natively supported for object detection and instance segmentation models on the Nexus platform, and your datasets can be sliced for training without any additional effort on your part.
Uploading to Nexus
To use Sliding Window natively on Nexus, simply upload your original images; the option will be available at the workflow stage. You can upload images to Nexus via the Dataset page or through Datature's Python SDK.
Build Your Training Workflow
Before we start model training, we have to design a training workflow. To use Sliding Window natively, we can select the Sliding Window Setup as shown below. We provide all the essential hyperparameters to determine the most appropriate setup for your dataset.
The sliding window crop allows you to choose the dimensions of each crop, and each dimension option lines up with our model input size options. In this case, we opt for 320x320 pixel crops. There are two overlap ratios, one for width and one for height, which let neighboring crops overlap. With overlap, an object cut off at one slice's border is more likely to appear whole in an adjacent slice; cut-off annotations can be confusing for a model to learn from. Finally, the annotation completeness threshold applies an Intersection-over-Union (IoU) comparison between each sliced annotation in a crop and its original annotation. If it falls below the threshold, the annotation is ignored in that sliced image.
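As a rough sketch of how such a completeness check can work, the helper below computes the fraction of a box's area that survives the crop. This is a simplification for illustration, not Nexus's exact formula:

```python
def completeness(box: tuple, crop: tuple) -> float:
    """Fraction of an (x1, y1, x2, y2) box's area that falls inside the crop."""
    bx1, by1, bx2, by2 = box
    cx1, cy1, cx2, cy2 = crop
    iw = max(0, min(bx2, cx2) - max(bx1, cx1))
    ih = max(0, min(by2, cy2) - max(by1, cy1))
    box_area = (bx2 - bx1) * (by2 - by1)
    return (iw * ih) / box_area if box_area else 0.0

# A 60x50 box with only a 20 px-wide strip inside the 320x320 crop:
print(completeness((300, 100, 360, 150), (0, 0, 320, 320)))  # ~0.33
```

With a completeness threshold of, say, 0.7, this annotation would be dropped from the slice, since only about a third of the object is visible.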
To preview how the images in your dataset will be sliced with the chosen settings, click the Preview Slices button in the bottom sidebar. You can also view annotations rendered on the original image by toggling the View Annotations button. The annotations kept for training depend on the selected annotation completeness threshold, since the sliding window removes any annotation that falls below it.
The image here displays all the sliced images, clearly showing the individual slices that were processed and their corresponding annotations.
Training Your Model
To train a model on Nexus, we can select the Run Training button on the bottom right and choose the corresponding training settings, such as hardware, checkpoint strategies, and advanced evaluation. Alternatively, we can simply trigger an existing training workflow. Workflows have to be created on the Nexus platform directly, but new trainings can be triggered through the Datature Python SDK. We can then poll the process periodically to check whether the training is complete.
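The polling step can be as simple as the loop below. The status callable and the status strings are hypothetical stand-ins for whatever the Nexus SDK actually returns, not the SDK's real API:

```python
import time

def wait_for_training(get_status, poll_seconds: int = 60) -> str:
    """Poll a status callable until the training run reaches a terminal state.
    `get_status` stands in for a Nexus SDK status call (hypothetical here)."""
    while True:
        status = get_status()
        if status in ("Finished", "Errored", "Cancelled"):
            return status
        time.sleep(poll_seconds)  # avoid hammering the API between checks

# With a stub that reports completion immediately:
print(wait_for_training(lambda: "Finished", poll_seconds=0))  # Finished
```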
Once training has begun, you will be brought to a live training dashboard where you can monitor the training loss, validation loss, and validation metrics as the run progresses. This is part of how we can compare the performance of a model trained with Sliding Window against one trained without it.
Nexus also provides Advanced Evaluation, in which sample validation images are selected and their model predictions are displayed across various evaluation epochs.
Once the training is complete, we can export the artifact and subsequently deploy it. For the purposes of this demo, we can opt to export it locally. If the artifact was created from a training workflow with Sliding Window activated, Datature’s Nexus provides a .zip file that contains a Python script that supports sliding window inference using all the same settings provided in the training workflow, as well as instructions on how to set up your Python environment to run the code locally.
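Under the hood, any sliding window inference script has to translate per-slice detections back into original image coordinates and de-duplicate objects found in overlapping slices. A simplified sketch of that merge step using greedy non-maximum suppression (the exported script's actual logic may differ):

```python
def merge_slice_predictions(per_slice: list, iou_thresh: float = 0.5) -> list:
    """Map per-slice boxes back to full-image coordinates, then apply greedy NMS.
    `per_slice` is a list of (origin_xy, [(x1, y1, x2, y2, score), ...]) pairs."""
    boxes = []
    for (ox, oy), preds in per_slice:
        for x1, y1, x2, y2, s in preds:
            boxes.append((x1 + ox, y1 + oy, x2 + ox, y2 + oy, s))

    def iou(a, b):
        iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
        ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = iw * ih
        union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return inter / union if union else 0.0

    kept = []
    for box in sorted(boxes, key=lambda b: -b[4]):  # highest confidence first
        if all(iou(box, k) < iou_thresh for k in kept):
            kept.append(box)
    return kept

# The same vehicle detected in two overlapping slices collapses to one box:
slices = [((0, 0), [(300, 100, 320, 150, 0.9)]),
          ((256, 0), [(44, 100, 64, 150, 0.85)])]
print(merge_slice_predictions(slices))  # [(300, 100, 320, 150, 0.9)]
```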
As we can see, Sliding Window has a significant impact on model performance, allowing the model to predict far more small objects in these large images by letting it focus on specific crops of an image at a time. The trade-off is a multiplicative increase in inference volume, and typically more false positives in exchange for fewer false negatives. If these trade-offs are acceptable for your use case, Sliding Window is certainly a viable strategy for effectively analyzing images with small objects.
Working With Other Data
The Sliding Window method also works well with medical images, such as the HuBMAP + HPA - Hacking the Human Body dataset, to improve small object detection and segmentation. There are many industries and use cases in which models have to identify small, densely packed objects. In areas such as computational pathology and medical scan analysis, we believe Sliding Window is a beneficial technique to evaluate for potential accuracy improvements.
Try It On Your Own Data
To get started with your own Sliding Window project on Nexus, all you have to do is sign up for a Free Tier account, upload your images and annotations, and train the model right away. If you prefer to use the SDK instead, you can go to this link for the Jupyter Notebook, which details the code needed to follow each step.
Want to Get Started?
If you have questions, feel free to join our Community Slack to post them, or contact us about how Sliding Window fits your use case.
For more detailed information about the Sliding Window functionality, customization options, or answers to any common questions you might have, read more about the process on our Developer Portal.