Implementing Object Tracking for Computer Vision - Complete Guide + Code

July 27, 2022

Computer vision is a field of computer science and artificial intelligence that enables machines to ‘see’ and process images like humans do. The idea of getting machines to locate and recognize objects in images has been around for decades, but the field’s big breakthrough came in 2012, when AlexNet, a deep convolutional neural network, achieved a top-5 error rate of 15.3% in an image classification competition, more than 10 percentage points lower than the second best algorithm at the time. Only then did people start to realize that using computer vision to solve real-world problems was no longer a distant dream.

Today, computer vision can do a lot more than classifying objects in images. It is being adopted in different industries to support a wide variety of applications. For instance, object detection can be used to detect tumors in medical images, identify defective products on an assembly line, estimate the number of people in CCTV images for crowd control, and many more. 

Object tracking is another computer vision technique that has attracted a great deal of attention in recent years. It tracks an object or multiple objects in a sequence of images or videos, both spatially and temporally. It has great potential in making our lives more convenient. Some notable use cases include autonomous cars, where successful tracking of objects in the car’s vicinity is crucial to ensuring a safe and smooth drive, and “Just Walk Out” stores, where customers can simply walk into a store, take what they need, and walk out.

In this article, we will dive deeper into object tracking. In addition to going over the fundamentals of object tracking, we will implement three multiple object tracking algorithms (IoU, Norfair, and DeepSORT) and compare their performance. All code and notebooks are available to our readers. We look forward to seeing you apply object tracking techniques, and to supporting you as you build an object tracker for your own project.

Table of Contents 

1. What is Object Tracking?

         Human Visual System and Computer Vision

         A Multi-step Process

2. What are the different types of Object Tracking?

         Single Object Tracking (SOT)

         Multiple Object Tracking (MOT)

3. Why is Object Tracking important?

4. What are some use cases of Object Tracking?

5. What are some Object Tracking algorithms?

6. How to Implement an Object Tracker?

What is Object Tracking?

Human Visual System and Computer Vision

Imagine yourself sitting on a bench in a park and noticing a person walking past you with a dog. The act of focusing your eyes on the dog as it moves from one side of your field of vision to the other is an example of object tracking in real life.

In computing, object tracking is the process by which an algorithm (as opposed to the human visual system mentioned above) detects and predicts the positions of a target in a sequence of images or videos. It is an important task in computer vision – a field of computer science and artificial intelligence that enables computers to ‘see’ and process images like humans do.

A Multi-step Process 

 While visually following an object over time and space, e.g. tracking a ball during a basketball game, may seem like an effortless task to many of us, it is in fact a complex process that involves multiple subprocesses. For instance, before we can start tracking the movement of the ball, we first need to find the basketball on a court full of players and other objects. 

In computer vision, this is referred to as object detection. Object detection locates the presence of objects in an image with bounding boxes (localization), and indicates the types of the objects located (classification). It forms the basis of many other computer vision tasks, including object tracking.

Thus, we can think of Object-Detection-and-Tracking as a multi-step process – 


  1. Object Detection : As mentioned, object detection detects, localizes, and classifies objects in a frame. There are many object detection algorithms out there, and you may already have some experience building one. Some popular examples are Region-Based Convolutional Neural Networks (R-CNN), Fast R-CNN, Faster R-CNN, Mask R-CNN, and YOLO (You Only Look Once).
  1. Unique ID Assignment : Practically speaking, there is usually more than one object to track in a real-world object tracking task. Thus, following the detection in the initial frame, each object is assigned a unique ID that is used throughout the sequence of images or videos.
  1. Motion Tracking : Lastly, the tracker will estimate the positions of each of the unique objects in the remaining images or frames to obtain the trajectories of each individual re-identified object.
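
The three steps above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `detect` is a hypothetical stand-in that returns hard-coded boxes instead of running a real detector, and step 3 simply links each detection to the nearest known object, which is far cruder than the trackers discussed later in this article.

```python
from itertools import count
from math import hypot

# Hypothetical detector output: frame index -> list of (x1, y1, x2, y2) boxes.
FAKE_DETECTIONS = {
    0: [(10, 10, 30, 50), (100, 20, 120, 60)],
    1: [(14, 10, 34, 50), (96, 20, 116, 60)],
    2: [(18, 11, 38, 51), (92, 21, 112, 61)],
}


def detect(frame_index):
    """Step 1: object detection (stubbed with hard-coded boxes)."""
    return FAKE_DETECTIONS[frame_index]


def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def center_gap(box_a, box_b):
    (ax, ay), (bx, by) = center(box_a), center(box_b)
    return hypot(ax - bx, ay - by)


def track(num_frames):
    next_id = count(1)
    # Step 2: every object detected in the first frame gets a unique ID.
    last_box = {next(next_id): box for box in detect(0)}
    trajectories = {tid: [box] for tid, box in last_box.items()}
    # Step 3: link each later detection to the closest known object.
    for frame_index in range(1, num_frames):
        for box in detect(frame_index):
            tid = min(last_box, key=lambda t: center_gap(last_box[t], box))
            last_box[tid] = box
            trajectories[tid].append(box)
    return trajectories


trajectories = track(3)
```

Each entry of `trajectories` is the list of boxes one ID occupied over time. Note that this naive linking never creates new IDs and can attach two detections to the same object, which is exactly the kind of problem the matching strategies below are designed to handle.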

Figure extracted from Multiple object tracking with context awareness by Laura Leal-Taixé

What are the different types of Object Tracking?

There are two main types of object tracking, Single Object Tracking (SOT) and Multiple Object Tracking (MOT).

  1. Single Object Tracking : SOT tracks a single object in continuous video frames, such as the ball during a basketball game. Typically, the initial position of the target object in the first frame is given to the tracker, which then estimates or predicts the position of the object in the remaining frames. There are traditional computer vision based trackers such as CSRT and KCF, as well as deep learning based trackers such as GOTURN and SiamRPN.
  1. Multiple Object Tracking : MOT tracks multiple objects of interest in the video simultaneously, such as the players in a basketball game. Besides multiple objects of the same class, it can also track objects of different classes, for instance the pedestrians, vehicles, and road signs around a self-driving car. Some MOT trackers are DeepSORT, JDE, and CenterTrack.

    MOT typically uses methods that extract features to re-identify detected objects in a later video frame. 

Why is Object Tracking important?

We have already covered that object tracking can be thought of as a multi-step process supported by object detection, but how different is it from object detection and when do we need a tracker?

Put simply, there are many problems that can be solved with object detection alone, but there are also instances where object detection alone is insufficient.

Object detection is great for questions about single moments, for example the total number of people in a store at noon, whether or not a tennis ball is out, and how congested a road is at 5:30 pm. However, it does not provide any information about an object’s movement before and after that moment, e.g. the route a customer took while shopping, the trajectory of the tennis ball before it landed outside the line, and whether or not a car was speeding.

“What if we have multiple images or a video of an object? Can’t we simply perform object detection on all images and frames? And how is that different from tracking?” you might ask. The major difference is that in object tracking, each object is assigned a unique ID that links the same object across different frames. Without this, the algorithm will simply treat the same object in different frames as distinct items.

Therefore, object tracking is important as it allows continuous observation of a target object across time, providing much richer information than object detection can.
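
To make the difference concrete, here is a small sketch of what the unique ID buys us. The rows in `tracked` are made-up tracker output (frame index, track ID, and box center in pixels), not results from our experiments. Because every row carries an ID, we can group positions into one trajectory per object and measure how far each object travelled, which per-frame detections alone cannot tell us.

```python
from collections import defaultdict
from math import hypot

# Hypothetical tracker output: (frame_index, track_id, center_x, center_y).
tracked = [
    (0, 1, 0.0, 0.0), (0, 2, 50.0, 50.0),
    (1, 1, 3.0, 4.0), (1, 2, 50.0, 50.0),
    (2, 1, 6.0, 8.0), (2, 2, 50.0, 56.0),
]


def build_trajectories(rows):
    """Group positions by track ID, ordered by frame index."""
    paths = defaultdict(list)
    for frame_index, track_id, x, y in sorted(rows):
        paths[track_id].append((x, y))
    return dict(paths)


def path_length(points):
    """Total distance travelled along a trajectory, in pixels."""
    return sum(
        hypot(x2 - x1, y2 - y1)
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


object_paths = build_trajectories(tracked)
lengths = {tid: path_length(points) for tid, points in object_paths.items()}
```

With a known frame rate and pixel-to-meter scale, the same trajectories could be turned into speeds, which is how a question like "was this car speeding?" becomes answerable.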

Detecting Pedestrians with Portal

What are some use cases of Object Tracking?

Now, let’s explore some use cases of object tracking in greater detail.

  1. Sports Analytics : Computer vision, especially tracking, is heavily utilized in real-time sports analytics to track humans (e.g. players, drivers, referees) and moving objects (e.g. balls, pucks, cars) during a competition. State-of-the-art systems can provide large amounts of information such as team formation, athletes’ poses and movements, and event statistics instantaneously.
  1. Surveillance : Other than identifying events of interest such as abnormal traffic and illegal activities, tracking has in recent years been applied to crowd monitoring, to check that people are following social distancing rules and wearing masks in public places.
  1. Retail : More and more companies are embracing computer vision technologies to study consumer behavior and improve customer experience. For example, merchants can learn where customers spend most of their time in the store and consider putting new products there for exposure, or learn which areas have less traffic and add signs to direct customers there.

What are some Object Tracking algorithms?

Now that we have a basic understanding of object tracking and how it works, let’s look at some algorithms to see how it is carried out computationally.

IoU (Intersection over Union) 

This method relies entirely on the detection results rather than the image itself. Intersection over union (IoU) measures how much two bounding boxes overlap, here a detection in the current frame and the last box of an existing track. When the IoU reaches the threshold, the two boxes are considered to belong to the same track. Since this method relies solely on IoU, it assumes that every object is detected in every frame, or that any "gap" in between is small and the distance between two detections is not too large, i.e. the video frame rate is high. The IoU of two boxes a and b is calculated as:

$$\mathrm{IoU}(a, b) = \frac{\mathrm{Area}(a) \cap \mathrm{Area}(b)}{\mathrm{Area}(a) \cup \mathrm{Area}(b)}$$
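
As a minimal sketch of this idea, the code below computes IoU for axis-aligned boxes and uses it in a greedy tracker. The `(x1, y1, x2, y2)` box format, the `IouTracker` class, and its default threshold are our own assumptions for illustration; this is not the code from our notebooks.

```python
from itertools import count


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


class IouTracker:
    """Greedy tracker: a detection extends the track whose last box it overlaps most."""

    def __init__(self, iou_threshold=0.5):
        self.iou_threshold = iou_threshold
        self.tracks = {}  # track_id -> last box
        self._ids = count(1)

    def update(self, detections):
        """Return a track ID for each detection of the current frame, in order."""
        unmatched = dict(self.tracks)
        updated = {}
        assigned = []
        for det in detections:
            best = max(unmatched, key=lambda t: iou(unmatched[t], det), default=None)
            if best is not None and iou(unmatched[best], det) >= self.iou_threshold:
                del unmatched[best]  # each track can be extended only once per frame
            else:
                best = next(self._ids)  # no sufficient overlap: start a new track
            updated[best] = det
            assigned.append(best)
        self.tracks = updated  # tracks without a match are dropped immediately
        return assigned


tracker = IouTracker(iou_threshold=0.3)
ids_per_frame = [
    tracker.update([(0, 0, 10, 10)]),
    tracker.update([(2, 0, 12, 10)]),
    tracker.update([]),                 # the detector misses the object once
    tracker.update([(4, 0, 14, 10)]),
]
```

In this toy run the object keeps ID 1 while it is detected in consecutive frames, but a single missed frame ends the track and the object comes back as ID 2. That is the weakness described above: without detections in (almost) every frame, a pure IoU tracker fragments its tracks.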


DeepSORT

DeepSORT mainly uses the Kalman filter and the Hungarian algorithm for object tracking. The Kalman filter takes each track from the previous frame and predicts its state in the current frame. The Hungarian algorithm then associates these predicted track boxes with the detection boxes in the current frame, matching them by minimizing the total cost in a cost matrix.
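
To show what "matching through a cost matrix" means, here is a small sketch. Two simplifications are ours and should be read as assumptions: the cost is `1 - IoU` between a predicted track box and a detection box (DeepSORT's real cost also involves motion and appearance terms), and the optimal assignment is found by exhaustive search over permutations rather than by the Hungarian algorithm itself. The Hungarian algorithm returns the same minimum-cost assignment in polynomial time; in practice a library routine such as `scipy.optimize.linear_sum_assignment` is commonly used.

```python
from itertools import permutations


def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def cost_matrix(track_boxes, detection_boxes):
    """cost[i][j] = 1 - IoU between predicted track i and detection j."""
    return [[1.0 - iou(t, d) for d in detection_boxes] for t in track_boxes]


def best_assignment(cost):
    """Exhaustive minimum-cost one-to-one assignment for a small square matrix.

    Returns a tuple where entry i is the detection index assigned to track i.
    """
    n = len(cost)
    return min(
        permutations(range(n)),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(n)),
    )


predicted_tracks = [(0, 0, 10, 10), (20, 0, 30, 10)]
detections = [(21, 0, 31, 10), (1, 0, 11, 10)]  # arrive in a different order
assignment = best_assignment(cost_matrix(predicted_tracks, detections))
```

Here `assignment[i]` is the index of the detection matched to track `i`, so the first track is paired with the second detection and vice versa. Exhaustive search grows factorially with the number of objects, which is why real trackers rely on the Hungarian algorithm instead.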


The “Deep” in DeepSORT means that it uses a simple CNN to extract the appearance features of the detected objects. After each frame is detected and tracked, the appearance features of the objects are extracted and saved. In this way, an object can be matched to its earlier track even after it has gone undetected for many frames, which helps with the occluded object problem.
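
The sketch below illustrates re-identification from saved appearance features. Everything in it is an assumption made for illustration: the three-dimensional vectors stand in for CNN embeddings (which are much longer in practice), and the `AppearanceGallery` class and its `max_distance` value are hypothetical names and numbers, not DeepSORT's actual implementation.

```python
from math import sqrt


def cosine_distance(u, v):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm


class AppearanceGallery:
    """Stores appearance features per track and re-identifies new detections."""

    def __init__(self, max_distance=0.2):
        self.max_distance = max_distance
        self.features = {}  # track_id -> list of saved feature vectors

    def add(self, track_id, feature):
        self.features.setdefault(track_id, []).append(feature)

    def match(self, feature):
        """Return the track whose saved features are closest, or None if all are too far."""
        best_id, best_dist = None, self.max_distance
        for track_id, saved in self.features.items():
            dist = min(cosine_distance(feature, f) for f in saved)
            if dist <= best_dist:
                best_id, best_dist = track_id, dist
        return best_id


gallery = AppearanceGallery()
gallery.add(1, [1.0, 0.0, 0.0])  # features saved while the objects were visible
gallery.add(2, [0.0, 1.0, 0.0])

reappeared = gallery.match([0.9, 0.1, 0.0])  # looks like track 1
stranger = gallery.match([0.0, 0.0, 1.0])    # looks like neither track
```

Because the comparison uses appearance rather than position, it does not matter how many frames passed or how far the object moved while it was hidden.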


ByteTrack

ByteTrack is an MOT algorithm similar to DeepSORT, but it relies heavily on the performance of the detection model. Building on the mechanism of DeepSORT, ByteTrack requires the detector to pass all detection boxes into the matching stage, regardless of their scores. Detection boxes with high scores go through both feature matching and IoU matching, while those with low scores go through IoU matching only. This change keeps trajectories more coherent: a box whose score drops because of slight occlusion can still be matched instead of being discarded and fragmenting the trajectory.
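
A simplified sketch of this two-stage association is shown below. It is not the ByteTrack implementation: for brevity it uses IoU only (no appearance features) and greedy matching rather than the Hungarian algorithm, and the function names and thresholds are hypothetical.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks, boxes, iou_threshold):
    """Pair track IDs with box indices, best overlaps first. Returns {track_id: index}."""
    pairs = sorted(
        ((iou(tbox, box), tid, j)
         for tid, tbox in tracks.items()
         for j, box in enumerate(boxes)),
        reverse=True,
    )
    matches, used_boxes = {}, set()
    for score, tid, j in pairs:
        if score < iou_threshold:
            break  # remaining pairs overlap even less
        if tid not in matches and j not in used_boxes:
            matches[tid] = j
            used_boxes.add(j)
    return matches


def two_stage_associate(tracks, detections, score_threshold=0.5, iou_threshold=0.3):
    """tracks: {track_id: box}; detections: list of (box, score). Returns {track_id: box}."""
    high = [box for box, score in detections if score >= score_threshold]
    low = [box for box, score in detections if score < score_threshold]
    # Stage 1: high-score boxes are matched against all tracks.
    first = greedy_match(tracks, high, iou_threshold)
    result = {tid: high[j] for tid, j in first.items()}
    # Stage 2: low-score boxes may only rescue tracks left unmatched in stage 1.
    remaining = {tid: box for tid, box in tracks.items() if tid not in first}
    second = greedy_match(remaining, low, iou_threshold)
    result.update({tid: low[j] for tid, j in second.items()})
    return result


tracks = {1: (0, 0, 10, 10), 2: (50, 0, 60, 10)}
detections = [((1, 0, 11, 10), 0.9), ((51, 0, 61, 10), 0.2)]  # second one is partly occluded
matched = two_stage_associate(tracks, detections)
```

A tracker that discarded every box below the score threshold would lose track 2 in this frame; the second stage keeps it alive.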



StrongSORT

To make fragmented trajectories coherent, StrongSORT adds a track-to-track matching step on top of DeepSORT: if the convolutional features of two different tracks match closely enough, they are judged to be the same track. In addition, StrongSORT applies Gaussian-smoothed interpolation (GSI) to compensate for missing detections. Because GSI is based on Gaussian process regression, it does not ignore motion information when filling gaps and can achieve more accurate positioning. However, the running time of this algorithm is longer than that of pure DeepSORT.
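
To illustrate what "compensating for missing detections" means, the sketch below fills the gaps in a track with plain linear interpolation. To be clear, this is the simple baseline and not GSI: the Gaussian process regression step that StrongSORT uses is not shown here. The `fill_gaps` function and the dictionary-based track format are our own assumptions.

```python
def fill_gaps(track):
    """Fill missing frames of a track by linear interpolation.

    track: {frame_index: (x1, y1, x2, y2)}, possibly with missing frames.
    Returns a new dict covering every frame between the first and last observation.
    """
    frames = sorted(track)
    filled = dict(track)
    for start, end in zip(frames, frames[1:]):
        for frame in range(start + 1, end):
            t = (frame - start) / (end - start)  # 0 at `start`, 1 at `end`
            filled[frame] = tuple(
                a + t * (b - a) for a, b in zip(track[start], track[end])
            )
    return filled


observed = {0: (0, 0, 10, 10), 4: (8, 0, 18, 10)}  # frames 1-3 were not detected
completed = fill_gaps(observed)
```

Linear interpolation assumes the object moved in a straight line at constant speed during the gap, which is the kind of limitation a smoother, motion-aware interpolation aims to address.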



BoT-SORT

Based on DeepSORT, BoT-SORT modifies the state vector and other matrix parameters in the Kalman filter (KF) so that the predicted box better matches the target. Since the Kalman filter uses a uniform (constant-velocity) motion model, BoT-SORT adds camera motion compensation (CMC), which keeps the predicted box from leading or lagging the target when it moves at a non-uniform speed. However, at high video resolutions, CMC greatly increases the running time.
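
For readers who have not met the Kalman filter before, here is a deliberately small sketch of a constant-velocity filter in one dimension. Real trackers apply the same predict and update steps to a longer state vector describing the whole box; the class name, noise values, and one-dimensional state here are assumptions chosen to keep the arithmetic readable, and camera motion compensation is not shown.

```python
class ConstantVelocityKalman1D:
    """Kalman filter over the state (position, velocity) with position-only measurements."""

    def __init__(self, position, velocity, process_noise=0.01, measurement_noise=1.0):
        self.x = float(position)
        self.v = float(velocity)
        self.P = [[1.0, 0.0], [0.0, 1.0]]  # state covariance (uncertainty)
        self.q = process_noise
        self.r = measurement_noise

    def predict(self, dt=1.0):
        """Move the state forward assuming the velocity stays constant."""
        self.x += dt * self.v
        (p00, p01), (p10, p11) = self.P
        # P <- F P F^T + Q with F = [[1, dt], [0, 1]] and Q = q * I
        self.P = [
            [p00 + dt * (p10 + p01) + dt * dt * p11 + self.q, p01 + dt * p11],
            [p10 + dt * p11, p11 + self.q],
        ]
        return self.x

    def update(self, measured_position):
        """Correct the prediction with a detection of the position."""
        (p00, p01), (p10, p11) = self.P
        residual = measured_position - self.x
        s = p00 + self.r                # innovation variance
        k0, k1 = p00 / s, p10 / s       # Kalman gain for position and velocity
        self.x += k0 * residual
        self.v += k1 * residual
        self.P = [
            [(1 - k0) * p00, (1 - k0) * p01],
            [p10 - k1 * p00, p11 - k1 * p01],
        ]
        return self.x


kf = ConstantVelocityKalman1D(position=0.0, velocity=2.0)
first_prediction = kf.predict()    # no detection in this frame
second_prediction = kf.predict()   # still none: uncertainty keeps growing
uncertainty_before = kf.P[0][0]
corrected = kf.update(5.0)         # the object is detected further ahead than predicted
```

The prediction lags behind the detection because the target sped up, which the constant-velocity model cannot anticipate. The update step pulls both the position and the velocity estimate toward the measurement and shrinks the uncertainty.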


Norfair Library

Norfair is an object tracking library based on the DeepSORT algorithm. However, Norfair provides users with a high degree of customization, such as the choice of distance function. The distance function we set here calculates the distance between the center points of the two boxes. Norfair also uses the Kalman filter, but it matches with its own distance function instead of the Hungarian algorithm. Since Norfair does not use deep embeddings like pure DeepSORT does, it cannot re-match occluded objects as well, but it is faster than pure DeepSORT.
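
The center-point distance described above can be written in plain Python as follows. This is a sketch of the computation only: the `(x1, y1, x2, y2)` box format and the function names are our assumptions, and adapting the function to the signature Norfair expects is not shown here.

```python
from math import hypot


def box_center(box):
    """Center point of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def center_distance(box_a, box_b):
    """Euclidean distance in pixels between the centers of two boxes."""
    (ax, ay), (bx, by) = box_center(box_a), box_center(box_b)
    return hypot(ax - bx, ay - by)


distance = center_distance((0, 0, 10, 10), (3, 4, 13, 14))
```

Unlike IoU, this distance stays informative when two boxes no longer overlap at all, which is one reason a center-based distance can tolerate longer gaps between detections.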

Result Comparison

To compare the performances of the three tracking algorithms on a multiple object tracking task, we prepared a dataset and quickly built a basic object detection model using a no-code platform. We then used this model to initialize the three tracking models in our implementation. Now, let’s look at the results together.

Given the same detection results, the three algorithms give quite different tracking performance.

In terms of speed, IoU and Norfair take a similar amount of time, with IoU slightly faster. DeepSORT takes almost twice as long as the other two.

In terms of tracking quality, however, IoU can only hold a track when the object is detected continuously at a high frame rate, as with the person on the right. For the person on the left, intermittent detections result in constant track switching. Norfair tolerates longer gaps between detections than IoU does, so small gaps do not affect tracking and mismatches occur only after long gaps. DeepSORT performed the best: it produced a few mismatched frames, but these were quickly corrected, and it works well for re-matching targets that have not been seen for a long time.

IoU tracks for left person: 10

Norfair tracks for left person: 3

DeepSORT tracks for left person: 2 (matched back to original one later)
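
Counts like these can be derived from the ID a tracker assigns to one person in each frame. The sketch below shows one way to do it; the `count_tracks` helper is hypothetical, and the example sequence is made up for illustration rather than taken from our results.

```python
def count_tracks(ids_per_frame):
    """Summarize the IDs given to one person over time.

    ids_per_frame: the track ID assigned in each frame, or None when undetected.
    Returns (number of distinct IDs, number of ID switches).
    """
    seen = [tid for tid in ids_per_frame if tid is not None]
    distinct = len(set(seen))
    switches = sum(1 for prev, cur in zip(seen, seen[1:]) if prev != cur)
    return distinct, switches


# Made-up example: the person is briefly given ID 2, then matched back to ID 1.
distinct_ids, id_switches = count_tracks([1, 1, None, 2, 2, 1])
```

Fewer distinct IDs and fewer switches for the same person indicate a more stable tracker.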

Build Your Own Object Tracker

These methods have their pros and cons, and the one that worked best for our project might not be the best for your project. So, we have made all our Jupyter Notebooks with our object detection and tracking code available -

Try running them for your projects and let us know which algorithm works best for yours!

Object Tracking Resources

We have tested a few of the algorithms and methods listed above and have uploaded them to our repositories, where they can be tested out. Here are the download links:

How to Use the Notebooks
  1. You can choose any of the notebook folders to recreate the tracking. For the purposes of the demonstration, we use models trained and exported from Datature - Learn More About Building Your Own Face Mask Model Here
  1. Install the requisite dependencies in the Jupyter Notebooks. This should be pretty straightforward for IoU and Norfair. For DeepSORT, we have slightly modified the code from this repository to ensure compatibility with TF2.0, and have therefore included it in a directory at /Tracking/DeepSORT.
  1. Change the constants in the following cell to appropriate values. In DeepSORT’s case there are a few parameters, such as the video, the model path, and the thresholds. The complete list for DeepSORT is:
  • video_path: path to your input video
  • model_path: path to your downloaded model from Datature
  • output_vid_trk_path: path and your expected name for output video
  • size: size of image to load into prediction model
  • threshold: confidence threshold
  • output_format: format for your output video

  1. Run the predictions in all Jupyter Notebooks to observe the results in the `/output/` folder, or render them in your Jupyter Notebook!
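
As a rough picture of what the constants cell might look like once filled in, see the sketch below. Every value is a placeholder we made up for illustration (including the assumption that `size` is a width and height pair), so please use the actual cell in the notebook as the reference. The `check_settings` helper is hypothetical and only catches obviously invalid values before a long run.

```python
from pathlib import Path

# Placeholder values for illustration only; they are not the notebooks' defaults.
settings = {
    "video_path": Path("input/video.mp4"),
    "model_path": Path("models/saved_model"),
    "output_vid_trk_path": Path("output/tracked_video.mp4"),
    "size": (320, 320),     # assumed to be (width, height)
    "threshold": 0.5,       # confidence threshold between 0 and 1
    "output_format": "mp4",
}


def check_settings(values):
    """Return a list of problems; an empty list means the values look sane."""
    problems = []
    if not 0.0 <= values["threshold"] <= 1.0:
        problems.append("threshold must be between 0 and 1")
    if any(side <= 0 for side in values["size"]):
        problems.append("size must contain positive numbers")
    return problems


problems = check_settings(settings)
```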

If you need help or have questions on how to use these notebooks to run object tracking for your project, feel free to reach out to us at our Datature Developer Community under #askdevs!


Algorithms based on:






A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional Neural Networks,” Communications of the ACM, vol. 60, no. 6, pp. 84–90, 2017.
