YOLOv9 - A Comprehensive Guide and Custom Dataset Fine-Tuning

YOLOv9 is the latest advancement in the YOLO series for real-time object detection, introducing novel techniques such as Programmable Gradient Information (PGI) and Generalized Efficient Layer Aggregation Network (GELAN) to address information bottlenecks and enhance detection accuracy and efficiency. In this post, we examine some of the key advantages of YOLOv9.

Wei Loon Cheng

What is YOLO?

YOLO was introduced in 2015 by Joseph Redmon et al. in the paper “You Only Look Once: Unified, Real-Time Object Detection”. The architecture was significant for deep-learning-based object detection because it looks at the image only once: it collapses the traditional two-stage process of first predicting object locations and then classifying them into a single stage that predicts class labels and bounding boxes together.

The YOLO model series is renowned in the computer vision scene as a powerful family of object detection models, thanks to its task flexibility and model compactness combined with state-of-the-art performance. This has allowed YOLO models to be integrated into many industry verticals and has made them accessible to a broad range of machine learning practitioners.

A secondary reason for its continued success has been the transition from the Darknet implementations of versions 1 to 4 to the more commonly used PyTorch framework with YOLOv5, YOLOv7, and YOLOv8. Given the stronger research community around PyTorch, the YOLO model series received significant development attention and rapid improvements. YOLOv9 is the result of many development iterations since the series' inception, and it continues to challenge state-of-the-art model architectures in the object detection space.

YOLOv9 Overview

YOLOv9 builds on the success of YOLOv7, introduced in 2022. Both were developed by Chien-Yao Wang et al. YOLOv7 focused heavily on architectural and training-process optimizations, known as a trainable bag-of-freebies, which accept a higher training cost in exchange for better object detection accuracy without increasing the inference cost. However, it did not tackle the loss of input information caused by the successive downscaling operations in the feedforward pass, a phenomenon known as the information bottleneck.
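
The information bottleneck can be written down in terms of mutual information. As a sketch, assume $X$ is the input, $f_\theta$ and $g_\phi$ are two consecutive transformation stages of the network (symbols introduced here for illustration), and $I$ denotes mutual information. The data processing inequality then gives:

```latex
I(X, X) \;\ge\; I\bigl(X, f_\theta(X)\bigr) \;\ge\; I\bigl(X, g_\phi(f_\theta(X))\bigr)
```

Each additional stage can only preserve or lose information about $X$, never add to it, which is why a deep stack of downscaling layers can discard detail that the training objective later needs.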

While existing methods such as reversible architectures and masked modelling have been shown to alleviate information bottlenecks, they appear to lose efficacy on more compact model architectures, which have been a hallmark of real-time object detectors like the YOLO model series.
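
To make the idea of a reversible architecture concrete, here is a minimal sketch of an additive coupling block in the style of reversible residual networks. It is an illustration of the general technique only, not the mechanism YOLOv9 uses; the functions `f` and `g` are arbitrary stand-ins for sub-networks, and all names are hypothetical.

```python
def f(values):
    # Stand-in for a sub-network; deliberately not invertible on its own.
    return [(3 * v * v + 1) % 7 for v in values]


def g(values):
    # Second stand-in sub-network, also not invertible on its own.
    return [abs(v) // 2 for v in values]


def forward(x1, x2):
    """Additive coupling: each half is shifted by a function of the other."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def inverse(y1, y2):
    """Undo the coupling by subtracting the same shifts in reverse order."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2


x1, x2 = [4, -2, 7], [1, 0, -5]
y1, y2 = forward(x1, x2)
recovered = inverse(y1, y2)
print(recovered == (x1, x2))
```

Because the input can be reconstructed exactly from the output, no information about the input is lost by the block, even though `f` and `g` themselves discard information.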

YOLOv9 introduces two novel techniques that not only address the issue of information bottleneck but also further push the boundaries of improving object detection accuracy and efficiency.

Programmable Gradient Information

YOLOv9 aims to address information bottlenecks through an auxiliary supervision framework known as Programmable Gradient Information (PGI). PGI is designed as a training aid: it improves the reliability of gradient backpropagation through connections to earlier layers, but routes them through a removable branch so that the additional computation can be dropped at inference time, preserving model compactness and inference speed. To strengthen these connections, it uses multi-level auxiliary information with an integration network that aggregates gradients from multiple convolutional stages before propagating them. PGI consists of three key components:

PGI Architecture in YOLOv9.
  • Main Branch: The main branch is primarily used for the inference process. Since the other components of PGI are not required for the inference stage, YOLOv9 ensures that no additional inference cost is incurred.

  • Auxiliary Reversible Branch: An auxiliary reversible branch is introduced to ensure reliable gradient generation and parameter updates in the network. This branch serves to maintain complete information by leveraging reversible architecture. However, integrating it directly with the main branch incurs significant inference costs, prompting the design of an auxiliary reversible branch. By incorporating this branch into the deep supervision framework, the main branch can receive reliable gradient information, aiding in the extraction of pertinent features for the target task. This enables its application across both shallow and deep networks while preserving inference capabilities by removing the auxiliary branch during inference.

  • Multi-Level Auxiliary Information: Enhances deep supervision by inserting an integration network between the feature pyramid layers and the main branch, allowing the main branch to receive aggregated gradient information from the different prediction heads. This approach mitigates the issue of deep feature pyramids losing important information needed for target object prediction, ensuring that the main branch retains complete information for learning predictions across various targets.
# YOLOv9 head
   # multi-level auxiliary branch  
   # elan-spp block
   [9, 1, SPPELAN, [512, 256]],  # 29

   # up-concat merge
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 7], 1, Concat, [1]],  # cat backbone P4

   # csp-elan block
   [-1, 1, RepNCSPELAN4, [512, 512, 256, 2]],  # 32

   # up-concat merge
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 5], 1, Concat, [1]],  # cat backbone P3

   # csp-elan block
   [-1, 1, RepNCSPELAN4, [256, 256, 128, 2]],  # 35
   # main branch  
   # elan-spp block
   [28, 1, SPPELAN, [512, 256]],  # 36

   # up-concat merge
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 25], 1, Concat, [1]],  # cat backbone P4

   # csp-elan block
   [-1, 1, RepNCSPELAN4, [512, 512, 256, 2]],  # 39

   # up-concat merge
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 22], 1, Concat, [1]],  # cat backbone P3

   # csp-elan block
   [-1, 1, RepNCSPELAN4, [256, 256, 128, 2]],  # 42 (P3/8-small)

   # avg-conv-down merge
   [-1, 1, ADown, [256]],
   [[-1, 39], 1, Concat, [1]],  # cat head P4

   # csp-elan block
   [-1, 1, RepNCSPELAN4, [512, 512, 256, 2]],  # 45 (P4/16-medium)

   # avg-conv-down merge
   [-1, 1, ADown, [512]],
   [[-1, 36], 1, Concat, [1]],  # cat head P5

   # csp-elan block
   [-1, 1, RepNCSPELAN4, [512, 1024, 512, 2]],  # 48 (P5/32-large)

   # detect
   [[35, 32, 29, 42, 45, 48], 1, DualDDetect, [nc]],  # DualDDetect(A3, A4, A5, P3, P4, P5)

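The code snippet above shows how the YOLOv9 head is split into the main branch and the multi-level auxiliary branch. Structurally, the auxiliary branch is a direct subset of the main branch: it duplicates the main branch's first blocks (an SPPELAN block followed by two up-concat and RepNCSPELAN4 stages) but reads from different backbone layers. These duplicated blocks supply auxiliary gradient information to the network during training, and `DualDDetect` receives the outputs of both branches.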
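
One way to see why the auxiliary branch can be removed at inference is to trace which head layers each set of outputs depends on. The sketch below copies only the `from` field of every head layer from the config above and walks the dependencies backwards. It assumes the usual YOLOv5-style convention that a negative `from` value is relative to the current layer and a non-negative value is an absolute layer index; the names `HEAD`, `resolve_sources`, and `head_layers_needed` are hypothetical helpers, not part of the repository.

```python
HEAD_START = 29  # index of the first head layer in the config above

# `from` field and module name for head layers 29-48 (detect layer left out).
HEAD = [
    (9, "SPPELAN"),         # 29
    (-1, "nn.Upsample"),    # 30
    ([-1, 7], "Concat"),    # 31
    (-1, "RepNCSPELAN4"),   # 32
    (-1, "nn.Upsample"),    # 33
    ([-1, 5], "Concat"),    # 34
    (-1, "RepNCSPELAN4"),   # 35
    (28, "SPPELAN"),        # 36
    (-1, "nn.Upsample"),    # 37
    ([-1, 25], "Concat"),   # 38
    (-1, "RepNCSPELAN4"),   # 39
    (-1, "nn.Upsample"),    # 40
    ([-1, 22], "Concat"),   # 41
    (-1, "RepNCSPELAN4"),   # 42
    (-1, "ADown"),          # 43
    ([-1, 39], "Concat"),   # 44
    (-1, "RepNCSPELAN4"),   # 45
    (-1, "ADown"),          # 46
    ([-1, 36], "Concat"),   # 47
    (-1, "RepNCSPELAN4"),   # 48
]


def resolve_sources(from_field, index):
    """Turn a `from` field into absolute layer indices (negative = relative)."""
    sources = from_field if isinstance(from_field, list) else [from_field]
    return [index + s if s < 0 else s for s in sources]


def head_layers_needed(outputs):
    """Collect every head layer that the given output layers depend on."""
    needed, stack = set(), list(outputs)
    while stack:
        index = stack.pop()
        if index < HEAD_START or index in needed:
            continue  # backbone layers lie outside the snippet
        needed.add(index)
        stack.extend(resolve_sources(HEAD[index - HEAD_START][0], index))
    return sorted(needed)


main_layers = head_layers_needed([42, 45, 48])  # P3, P4, P5 outputs
aux_layers = head_layers_needed([35, 32, 29])   # A3, A4, A5 outputs
print("main:", main_layers)
print("aux: ", aux_layers)
print("shared head layers:", sorted(set(main_layers) & set(aux_layers)))
```

Within the head, the main outputs depend only on layers 36 to 48 and the auxiliary outputs only on layers 29 to 35, so dropping the auxiliary layers leaves the main predictions untouched.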

Generalized Efficient Layer Aggregation Network (GELAN)

YOLOv9 also continues to uphold the trademark real-time inference support that the YOLO architecture family is well-known for through the introduction of a Generalized Efficient Layer Aggregation Network (GELAN), which combines important features from CSPNet and ELAN.

GELAN Architecture in YOLOv9.

CSPNet is known for its effective gradient path planning, enhancing feature extraction. ELAN, on the other hand, prioritizes inference speed by employing stacked convolutional layers. GELAN integrates these strengths to create a versatile architecture that emphasizes lightweight design, fast inference, and accuracy. It extends ELAN's capabilities by enabling the stacking of any computational blocks beyond convolutional layers, allowing the inference optimizations to be applied across all layers.

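##### GELAN #####
import torch
import torch.nn as nn

# Conv, SP and RepNCSP are building blocks defined in the YOLOv9 repository
# (not shown here).


class SPPELAN(nn.Module):
    # spp-elan
    def __init__(self, c1, c2, c3):  # ch_in, ch_out, hidden channels
        super().__init__()  # must run before submodules are assigned
        self.c = c3
        self.cv1 = Conv(c1, c3, 1, 1)    # 1x1 conv down to c3 channels
        self.cv2 = SP(5)                 # three chained pooling blocks
        self.cv3 = SP(5)
        self.cv4 = SP(5)
        self.cv5 = Conv(4*c3, c2, 1, 1)  # fuse the four concatenated c3-channel maps

    def forward(self, x):
        y = [self.cv1(x)]
        y.extend(m(y[-1]) for m in [self.cv2, self.cv3, self.cv4])
        return self.cv5(torch.cat(y, 1))


class RepNCSPELAN4(nn.Module):
    # csp-elan
    def __init__(self, c1, c2, c3, c4, c5=1):  # ch_in, ch_out, hidden channels, branch channels, RepNCSP repeats
        super().__init__()  # must run before submodules are assigned
        self.c = c3//2
        self.cv1 = Conv(c1, c3, 1, 1)
        self.cv2 = nn.Sequential(RepNCSP(c3//2, c4, c5), Conv(c4, c4, 3, 1))
        self.cv3 = nn.Sequential(RepNCSP(c4, c4, c5), Conv(c4, c4, 3, 1))
        self.cv4 = Conv(c3+(2*c4), c2, 1, 1)  # fuse both halves of cv1 plus the two branch outputs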

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, 1))
        y.extend((m(y[-1])) for m in [self.cv2, self.cv3])
        return self.cv4(torch.cat(y, 1))

    def forward_split(self, x):
        y = list(self.cv1(x).split((self.c, self.c), 1))
        y.extend(m(y[-1]) for m in [self.cv2, self.cv3])
        return self.cv4(torch.cat(y, 1))


The original `ELAN` module only allowed convolutional layers to be stacked. The `SPPELAN` module in the code snippet above stacks pooling layers instead: each `SP` block wraps a `MaxPool2d` layer.
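
The channel bookkeeping in both modules can be checked without running the network. The sketch below reproduces only the arithmetic implied by the constructors above: how many channels each branch contributes and how wide the concatenated tensor is before the final 1x1 convolution. It assumes that the first value in a config row such as `[512, 256]` is the output width `c2` and that the input width is inferred from the previous layer; the helper names are hypothetical.

```python
def sppelan_widths(c2, c3):
    """Channel widths inside SPPELAN for output width c2 and hidden width c3."""
    # cv1 emits c3 channels; each pooling block keeps the channel count.
    branches = [c3, c3, c3, c3]
    return {"branches": branches, "concat": sum(branches), "out": c2}


def repncspelan4_widths(c2, c3, c4):
    """Channel widths inside RepNCSPELAN4 (c3 is assumed to be even)."""
    # cv1 emits c3 channels, split in two halves; cv2 and cv3 emit c4 each.
    branches = [c3 // 2, c3 // 2, c4, c4]
    return {"branches": branches, "concat": sum(branches), "out": c2}


spp = sppelan_widths(c2=512, c3=256)                 # config args [512, 256]
elan = repncspelan4_widths(c2=512, c3=512, c4=256)   # config args [512, 512, 256, 2]

print(spp)
print(elan)
# The concatenated widths match the input widths of cv5 and cv4 respectively.
print(spp["concat"] == 4 * 256, elan["concat"] == 512 + 2 * 256)
```

In both cases every intermediate result is kept and concatenated, which is the layer aggregation that gives ELAN-style blocks their name.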

YOLOv9 vs YOLOv8 vs YOLOv7

Object detection with YOLOv9 boasts stark improvements across different metrics as compared to previous state-of-the-art models. Despite having significantly fewer parameters as compared to the largest architectural variants of YOLOv7 and YOLOv8, YOLOv9 still manages to outperform them in terms of accuracy. Furthermore, YOLOv9 maintains almost the same computational complexity as that of its direct predecessor, YOLOv7, and avoids the additional complexity that YOLOv8 incurs.

Source: YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information

YOLOv9 even outperforms other state-of-the-art real-time object detectors outside of the YOLO family, such as RT-DETR, RTMDet, and PP-YOLOE, when trained on the COCO dataset. These models had the advantage of leveraging ImageNet pre-trained weights. YOLOv9 was still able to secure an edge over them despite being trained from scratch, demonstrating a strong ability to learn robust features rapidly. This suggests that training YOLOv9 on custom datasets could push its already impressive metrics even further.

YOLOv9 Performance Analysis with Other Real-Time Object Detection Architectures.

Example YOLOv9 Inference for Crowd Detection

While many of the performance tests and evaluations were done on high-quality images, we wanted to see how YOLOv9 would perform on real-world data. We fed the model a completely unseen, medium-quality video depicting a crowd of people in a shopping mall. The pretrained YOLOv9-E with an input resolution of 640x640 was able to detect most instances of people in the scene without any further fine-tuning, and it handled occlusions reasonably well.

YOLOv9 can also be combined with multi-object tracking (MOT) algorithms such as ByteTrack. In this example, we assigned unique IDs to each person and backpack and tracked them as they moved around the frame. This opens up important use cases in crowd management, such as counting the number of people in a particular location, and in surveillance, such as tracking the movement of suspicious persons or identifying suspicious bags. Check out our article to learn more about how to leverage ByteTrack with YOLOv9.

Tracking people and backpacks in a crowded shopping mall with YOLOv9 and ByteTrack.
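
Once every detection carries a track ID, counting becomes a matter of collecting distinct IDs per class. The sketch below assumes a hypothetical output format in which each frame is a list of `(track_id, class_name)` tuples; a real tracker will return richer records, so treat the data layout and the function name as illustrative only.

```python
from collections import defaultdict


def count_unique_tracks(frames):
    """Count distinct track IDs per class across a sequence of frames."""
    ids_by_class = defaultdict(set)
    for tracks in frames:
        for track_id, class_name in tracks:
            ids_by_class[class_name].add(track_id)
    return {name: len(ids) for name, ids in ids_by_class.items()}


# Three frames of made-up tracker output.
frames = [
    [(1, "person"), (2, "person"), (7, "backpack")],
    [(1, "person"), (2, "person"), (3, "person"), (7, "backpack")],
    [(2, "person"), (3, "person")],
]

counts = count_unique_tracks(frames)
print(counts)
```

Counting by ID rather than by detection avoids counting the same person again in every frame they appear in.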

YOLOv9 Applications

YOLOv9's real-time object detection support can be utilized for a variety of real-world applications, and is particularly suited for fast-paced environments, such as:

  • Autonomous Vehicles: YOLOv9 can be used in self-driving cars for detecting pedestrians, other vehicles, traffic signs, and obstacles on the road in real-time, enabling the vehicle to make decisions based on its surroundings.
  • Surveillance Systems: YOLOv9 can be employed in surveillance cameras to monitor public spaces, airports, train stations, and other areas by detecting suspicious activities, unauthorized intrusions, or abandoned objects.
  • Retail Analytics: In retail, YOLOv9 can be used for tracking customer movements, analyzing store foot traffic, and monitoring inventory levels by identifying products on shelves. It can also aid in retail security by detecting shoplifting incidents.

How to Fine-Tune YOLOv9 on Your Own Dataset?

To fine-tune YOLOv9 on your own custom dataset, you will first need to clone the YOLOv9 repository and install the required Python packages. We recommend that you use a virtual environment for this, such as conda or virtualenvwrapper.

git clone https://github.com/WongKinYiu/yolov9.git
cd yolov9
pip install -r requirements.txt

If you prefer to develop in a Docker container, follow the Docker setup instructions below:

# create the docker container; you can increase the shared memory size if you have more available.
nvidia-docker run --name yolov9 -it -v your_coco_path/:/coco/ -v your_code_path/:/yolov9 --shm-size=64g nvcr.io/nvidia/pytorch:21.11-py3

# apt install required packages
apt update
apt install -y zip htop screen libgl1-mesa-glx

# pip install required packages
pip install seaborn thop

# go to code folder
cd /yolov9

Then, you will need to prepare your dataset similar to the COCO dataset file structure. If you need to reference this structure, you can run the following script which will download the train, validation and test splits of the COCO dataset together with the labels.

bash scripts/get_coco.sh
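
If your annotations are in COCO JSON format, the boxes need converting before training. The sketch below assumes the loader expects YOLOv5-style text labels, with one line per object in the form `class x_center y_center width height` and all coordinates normalised to the image size. That expectation is an assumption to verify against the downloaded COCO labels, and the function names are hypothetical.

```python
def coco_to_yolo(bbox, img_w, img_h):
    """Convert a COCO box [x_min, y_min, width, height] in pixels to
    normalised (x_center, y_center, width, height)."""
    x_min, y_min, width, height = bbox
    return (
        (x_min + width / 2) / img_w,
        (y_min + height / 2) / img_h,
        width / img_w,
        height / img_h,
    )


def label_line(class_id, bbox, img_w, img_h):
    """Format one object as a line of a YOLO-style label file."""
    xc, yc, w, h = coco_to_yolo(bbox, img_w, img_h)
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"


# A 320x240 box whose top-left corner is at (160, 120) in a 640x480 image.
line = label_line(0, [160, 120, 320, 240], img_w=640, img_h=480)
print(line)  # 0 0.500000 0.500000 0.500000 0.500000
```

Each image then gets a text file of such lines, one per annotated object.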

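To run training, we recommend provisioning a GPU due to the computational requirements. The commands below are the repository's reference commands for training on COCO from scratch (`--weights ''`). To fine-tune on your own data, point `--data` at your dataset's YAML file and pass the path to a pretrained checkpoint to `--weights`.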

# train yolov9 models
python train_dual.py --workers 8 --device 0 --batch 16 --data data/coco.yaml --img 640 --cfg models/detect/yolov9-c.yaml --weights '' --name yolov9-c --hyp hyp.scratch-high.yaml --min-items 0 --epochs 500 --close-mosaic 15

# train gelan models
# python train.py --workers 8 --device 0 --batch 32 --data data/coco.yaml --img 640 --cfg models/detect/gelan-c.yaml --weights '' --name gelan-c --hyp hyp.scratch-high.yaml --min-items 0 --epochs 500 --close-mosaic 15
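
When adapting the training command to a custom dataset, it can help to build it programmatically so that only the dataset, weights, and run name change. The sketch below reuses exactly the flags shown above; the paths `data/my_dataset.yaml` and `weights/yolov9-c.pt`, the run name, and the helper function are hypothetical placeholders that you would replace with your own.

```python
import shlex


def build_train_command(data_yaml, weights, name, epochs=500, batch=16, device="0"):
    """Assemble the train_dual.py command line as a list of arguments."""
    return [
        "python", "train_dual.py",
        "--workers", "8",
        "--device", device,
        "--batch", str(batch),
        "--data", data_yaml,
        "--img", "640",
        "--cfg", "models/detect/yolov9-c.yaml",
        "--weights", weights,
        "--name", name,
        "--hyp", "hyp.scratch-high.yaml",
        "--min-items", "0",
        "--epochs", str(epochs),
        "--close-mosaic", "15",
    ]


cmd = build_train_command(
    data_yaml="data/my_dataset.yaml",   # hypothetical dataset config
    weights="weights/yolov9-c.pt",      # hypothetical pretrained checkpoint path
    name="yolov9-c-finetune",           # hypothetical run name
    epochs=100,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root, or the printed string can be pasted into a shell.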

If you wish to leverage multiple GPUs, simply modify the command to utilize `torch.distributed`:

# train yolov9 models
python -m torch.distributed.launch --nproc_per_node 8 --master_port 9527 train_dual.py --workers 8 --device 0,1,2,3,4,5,6,7 --sync-bn --batch 128 --data data/coco.yaml --img 640 --cfg models/detect/yolov9-c.yaml --weights '' --name yolov9-c --hyp hyp.scratch-high.yaml --min-items 0 --epochs 500 --close-mosaic 15

# train gelan models
# python -m torch.distributed.launch --nproc_per_node 4 --master_port 9527 train.py --workers 8 --device 0,1,2,3 --sync-bn --batch 128 --data data/coco.yaml --img 640 --cfg models/detect/gelan-c.yaml --weights '' --name gelan-c --hyp hyp.scratch-high.yaml --min-items 0 --epochs 500 --close-mosaic 15
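
When scaling to several GPUs, keep an eye on the batch size each device ends up with. The sketch below assumes the YOLOv5-style behaviour in which `--batch` is the total batch size across all GPUs and must divide evenly by the number of processes; this is an assumption about the training script and worth confirming in its argument help. The helper name is hypothetical.

```python
def per_gpu_batch(total_batch, num_gpus):
    """Split a total batch size evenly across GPUs, or fail loudly."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    if total_batch % num_gpus != 0:
        raise ValueError(
            f"batch size {total_batch} is not a multiple of {num_gpus} GPUs"
        )
    return total_batch // num_gpus


print(per_gpu_batch(128, 8))  # 16, as in the 8-GPU yolov9 command
print(per_gpu_batch(128, 4))  # 32, as in the 4-GPU gelan command
```

Under that assumption, the multi-GPU commands above give each device the same batch size as the corresponding single-GPU command.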

Once the model is trained, you can run the evaluation script to evaluate your model:

# evaluate converted yolov9 models
python val.py --data data/coco.yaml --img 640 --batch 32 --conf 0.001 --iou 0.7 --device 0 --weights './yolov9-c-converted.pt' --save-json --name yolov9_c_c_640_val

# evaluate yolov9 models
# python val_dual.py --data data/coco.yaml --img 640 --batch 32 --conf 0.001 --iou 0.7 --device 0 --weights './yolov9-c.pt' --save-json --name yolov9_c_640_val

# evaluate gelan models
# python val.py --data data/coco.yaml --img 640 --batch 32 --conf 0.001 --iou 0.7 --device 0 --weights './gelan-c.pt' --save-json --name gelan_c_640_val
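
The `--conf` and `--iou` values in the evaluation commands are thresholds for filtering predictions: a minimum confidence score and an IoU limit for non-maximum suppression (NMS). The sketch below is a plain-Python illustration of greedy NMS under those two thresholds. It ignores class labels and is not the repository's implementation; the function names and sample boxes are made up for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, conf_thres=0.001, iou_thres=0.7):
    """Greedy NMS over (box, score) pairs; returns kept pairs, best first."""
    candidates = sorted(
        (d for d in detections if d[1] >= conf_thres),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for box, score in candidates:
        # Keep a box only if it does not overlap a better box too strongly.
        if all(iou(box, kept_box) <= iou_thres for kept_box, _ in kept):
            kept.append((box, score))
    return kept


detections = [
    ((0, 0, 100, 100), 0.9),
    ((5, 5, 100, 100), 0.8),       # overlaps the first box with IoU 0.9025
    ((200, 200, 300, 300), 0.6),
    ((0, 0, 10, 10), 0.0005),      # below the confidence threshold
]
kept = nms(detections)
print([score for _, score in kept])
```

A very low confidence threshold such as 0.001 keeps almost every prediction, which suits evaluation because the precision-recall curve is computed over the full range of scores.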

Alternatively, if you wish to run inference on your test data, you can check out the code snippets here, or check out this Hugging Face Space for a convenient dashboard to quickly assess and visualize the inference performance on a couple of images. The original GitHub repository also links to related YOLOv9 tutorials contributed by the community, covering topics such as running YOLOv9 with TensorFlow, ONNX, and TensorRT. To read the published paper, you can go to this link here.

If you have questions, feel free to join our Community Slack to post your questions or contact us to train your own YOLOv9 Object Detection Model on Datature Nexus. 

For more detailed information about the model functionality, customization options, or answers to any common questions you might have, read more on our Developer Portal.
