A Comprehensive Guide to Object Tracking Algorithms in 2025

Comprehensive comparison of the latest advanced object tracking methods including ByteTrack, SambaMOTR, CAMELTrack, Cutie, and DAM4SAM. Analysis covers tracking-by-detection vs detection-by-tracking paradigms, performance metrics, computational efficiency, and real-world applications in autonomous driving, surveillance, and video analytics.

Leonard So
Editor

Object tracking is a fundamental computer vision task with practical applications spanning autonomous driving, surveillance, movement analytics, and other real-time use cases. Recent advances have created two distinct tracking paradigms:

  1. Tracking-by-Detection (TbD): Traditional approaches that detect objects independently in each frame and associate them across frames
  2. Detection-by-Tracking (DbT): End-to-end approaches that jointly learn detection and tracking

Tracking by detection allows for a more compartmentalized setup, making it easy to swap out object detection or segmentation models to improve performance, or to swap out tracking algorithms, with minimal friction. However, because the modules are designed separately, they lack the learned cohesion that could yield better tracking performance.

Detection by tracking reuses features extracted from individual frames to optimize for tracking performance rather than per-frame detection alone. Because the objective is oriented toward tracking, the performance ceiling is higher. However, since the architecture is integrated by construction, individual modules cannot easily be swapped out, and the models are harder to train given the increased complexity of the task.

Additionally, object tracking outputs can generally be split into two types:

  1. Bounding Box Detection: These models take in and return bounding boxes, and their compact, fixed-size data format makes them more amenable to statistical methods such as Kalman filters (see the sketch after this list).
  2. Video Segmentation: Segmentation models are more expensive and tackle a more complex task, as they must produce a pixel-level classification for each object. However, they provide more detailed object outlines, capturing valuable shape information that is often lost in bounding-box-based detection models.
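
To make the contrast concrete, below is a minimal sketch of how a constant-velocity Kalman filter can propagate a bounding box between frames, which is practical precisely because the state is a small fixed-size vector. The state layout and noise magnitudes here are illustrative assumptions, not taken from any particular tracker.

```python
import numpy as np

class BoxKalman:
    """Constant-velocity Kalman filter over a bounding box state
    [cx, cy, w, h, vx, vy, vw, vh]. Noise values are illustrative."""

    def __init__(self, box):
        cx, cy, w, h = box
        self.x = np.array([cx, cy, w, h, 0, 0, 0, 0], dtype=float)
        self.P = np.eye(8)                  # state covariance
        self.F = np.eye(8)
        self.F[:4, 4:] = np.eye(4)          # position += velocity each step
        self.H = np.eye(4, 8)               # we observe only the box itself
        self.Q = np.eye(8) * 1e-2           # process noise (assumed)
        self.R = np.eye(4) * 1e-1           # measurement noise (assumed)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:4]                   # predicted box, used for matching

    def update(self, box):
        z = np.asarray(box, dtype=float)
        y = z - self.H @ self.x             # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(8) - K @ self.H) @ self.P
```

An equivalent filter over segmentation masks has no such compact state, which is one reason mask-based trackers lean on learned memory instead.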

This article evaluates five tracking methods representing different approaches and areas of application, with one of the most popular and most easily applied methods, ByteTrack, serving as the baseline for comparison:

  • ByteTrack: A pioneering TbD method that leverages simple heuristics with a then-novel approach to low-confidence detections
  • SambaMOTR: A sequence modeling approach for multi-object tracking
  • CAMELTrack: A context-aware learned association module
  • Cutie: An object-level approach to video object segmentation
  • DAM4SAM: A distractor-aware approach to video segmentation tracking

Metrics

Higher Order Tracking Accuracy (HOTA)

HOTA = √(DetA × AssA), averaged over IoU thresholds α ranging from 0.05 to 0.95 in steps of 0.05

HOTA provides a balanced assessment of both detection and association performance by combining DetA and AssA across multiple IoU thresholds ranging from 0.05 to 0.95. The core formula HOTA = √(DetA × AssA) creates a balanced evaluation approach. HOTA's strength lies in its ability to balance detection quality and identity consistency while using multiple IoU thresholds for more comprehensive evaluation, making it the current standard for benchmarking tracking algorithms. The metric does have drawbacks, including being less suitable for online tracking since it requires future data for complete evaluation, and it doesn't penalize fragmented tracking results.
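
As a worked example, once a matching step has produced per-threshold DetA and AssA values, the final score is just a geometric mean averaged over thresholds. The function below assumes those per-threshold values come from an existing evaluation step.

```python
import numpy as np

def hota(det_a_per_alpha, ass_a_per_alpha):
    """HOTA = mean over thresholds alpha of sqrt(DetA_a * AssA_a).

    Both inputs are sequences indexed by alpha in {0.05, ..., 0.95},
    assumed to be produced by a separate detection/association
    matching step."""
    det = np.asarray(det_a_per_alpha, dtype=float)
    ass = np.asarray(ass_a_per_alpha, dtype=float)
    return float(np.mean(np.sqrt(det * ass)))
```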

Association Accuracy (AssA)

AssA = (1 / |TP|) × Σ over true positives c of |TPA(c)| / (|TPA(c)| + |FNA(c)| + |FPA(c)|), where TPA, FNA, and FPA are the true positive, false negative, and false positive associations for the track containing c

AssA evaluates how accurately a tracker maintains object identities across frames, focusing on the temporal consistency of ID assignments and measuring how well the tracker links detections of the same object over time. This metric examines whether a tracker can successfully follow the same object through multiple frames while maintaining the same identity label. AssA becomes particularly important in scenarios where objects might be temporarily occluded, move in complex patterns, or interact with other objects. A tracker with high AssA demonstrates strong capability in maintaining consistent identity assignments, even when detection conditions become challenging.

Detection Accuracy (DetA)

DetA = |TP| / (|TP| + |FN| + |FP|)

DetA measures how well a tracker localizes objects in each frame, typically using Intersection over Union (IoU) thresholds to quantify the spatial accuracy of detections. This metric focuses purely on whether the tracker can accurately detect and localize objects regardless of their identity assignments. DetA essentially evaluates the quality of the bounding boxes or detection regions produced by the tracker, making it fundamental for understanding the spatial precision of tracking systems. When DetA performs well, it indicates that the tracker successfully identifies where objects are located in each frame, even if it might struggle with maintaining consistent identities over time.
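
In code, DetA at a single threshold reduces to counting matches; the IoU helper below (boxes in (x1, y1, x2, y2) format) is the overlap measure used to decide whether a detection counts as a true positive. This is a minimal sketch of the quantities involved, not a full evaluation harness.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def det_a(tp, fn, fp):
    """DetA at one IoU threshold: TP / (TP + FN + FP)."""
    return tp / (tp + fn + fp)
```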

Identification F1 Score (IDF1)

IDF1 = 2 × IDTP / (2 × IDTP + IDFP + IDFN)

IDF1 focuses on how long a tracker correctly identifies an object across the entire video, using global assignment rather than frame-by-frame analysis. The formula IDF1 = 2 × IDTP / (2 × IDTP + IDFP + IDFN) balances identification precision and recall. This metric excels at measuring long-term tracking consistency, balances precision and recall of identity predictions effectively, and remains less affected by scene crowd density compared to MOTA. However, IDF1 can oversimplify complex scenarios, such as when players switch positions in crowded situations like a football corner kick. The metric may actually decrease when detection improves and still relies on a fixed IoU threshold.
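
The formula itself is a one-liner; what distinguishes IDF1 is that IDTP, IDFP, and IDFN are counted after a single global assignment between ground-truth and predicted identities over the whole video.

```python
def idf1(idtp, idfp, idfn):
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN).

    The counts come from one video-level bipartite matching between
    ground-truth and predicted identities, not per-frame matching."""
    return 2 * idtp / (2 * idtp + idfp + idfn)

print(idf1(90, 10, 10))  # 0.9
```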

Multiple Object Tracking Accuracy (MOTA)

MOTA = 1 − (FN + FP + IDSW) / GT, where GT is the total number of ground-truth objects across all frames

MOTA incorporates identity switches into object detection metrics, penalizing situations where a ground truth object gets assigned to different track predictions over time. The formula considers false negatives, false positives, and identity switches relative to ground truth objects. While MOTA offers simplicity and straightforward interpretation by considering identity tracking beyond basic detection, it has notable limitations. The metric only examines the previous frame for identity switches, can be dominated by false positives and negatives in crowded scenes, and uses a fixed IoU threshold that doesn't reflect varying detection accuracy levels.
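
A direct translation of the formula makes one quirk visible: because the error terms are not bounded by GT, MOTA can go negative for very poor trackers.

```python
def mota(fn, fp, idsw, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, where GT is the total number
    of ground-truth object instances summed over all frames."""
    return 1.0 - (fn + fp + idsw) / num_gt

print(mota(fn=120, fp=80, idsw=15, num_gt=1000))  # 0.785
```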

Expected Average Overlap (EAO)

EAO, the primary measure of earlier VOT challenges, estimates the average overlap a tracker is expected to achieve on a typical-length sequence, with frames after a tracking failure contributing zero overlap, thereby folding accuracy and robustness into a single score. The more recent VOTS evaluation protocol replaces it with the measures described below.

VOTS Performance Measures

Quality (Q)

Q = (1 / N_sequences) × Σ over sequences of [ Σ IoU(target, frame) / (N_targets × N_frames) ], i.e. the sequence-normalized average overlap over all target-frame combinations

Quality serves as the primary performance measure and represents the sequence-normalized average overlap across all targets and frames. It is calculated as the area under the tracking quality plot and mathematically equals the sum of intersection-over-union (IoU) values for all target-frame combinations, normalized by the total number of sequences, targets, and frames. This measure captures the overall tracking performance by considering both successful localization when targets are present and correct absence prediction when targets are not visible.

Robustness

Robustness = (number of frames with IoU > 0) / (number of frames where the target is visible)

Robustness quantifies the tracker's reliability in detecting visible targets and is defined as the percentage of frames with IoU greater than 0 when the target is visible, essentially functioning as a recall measure. It calculates the ratio of successfully tracked frames to the total number of frames where the target is present and visible. Robustness indicates how consistently the tracker can maintain detection of targets throughout the sequence, with higher values representing more reliable tracking performance. Together with accuracy, robustness forms the basis for AR plots that summarize tracker performance on frames with visible targets.

Accuracy

Accuracy = Σ IoU over successfully tracked frames (IoU > 0 and target visible) / (number of such frames)

Accuracy measures the localization precision and is defined as the sequence-normalized average overlap computed only over successfully tracked frames. Unlike quality, accuracy focuses exclusively on frames where the tracker successfully detects the target (IoU greater than 0) and the target is actually visible in the sequence. This metric evaluates how precisely the tracker localizes targets when it does detect them, providing insight into the spatial accuracy of the tracking system.
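
Stated in code, the three measures are just different reductions over per-frame IoU values. The sketch below covers a single target in a single sequence and assumes the VOTS convention that a correctly predicted absence counts as overlap 1; it is a schematic of the definitions above, not the official toolkit.

```python
def vots_measures(frames):
    """frames: list of (visible, predicted_present, iou) tuples,
    one per frame, for one target in one sequence."""
    scores, visible_ious = [], []
    for visible, predicted_present, iou in frames:
        if visible:
            scores.append(iou)
            visible_ious.append(iou)
        else:
            # Correct absence prediction scores 1, a hallucinated box scores 0.
            scores.append(0.0 if predicted_present else 1.0)

    quality = sum(scores) / len(scores)
    tracked = [v for v in visible_ious if v > 0]
    robustness = len(tracked) / len(visible_ious) if visible_ious else 0.0
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return quality, robustness, accuracy
```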

ByteTrack: The Baseline Approach

ByteTrack, introduced by Zhang et al., represents a significant advancement in the tracking-by-detection paradigm with its key innovation lying in its association mechanism. The method uses the YOLOX detector for high-quality object detection and introduces a revolutionary approach by associating nearly every detection box rather than just high-confidence ones. This is achieved through a two-stage association process where high-score detections are first matched to existing tracklets, followed by matching low-score detections to remaining unmatched tracklets. The similarity metric relies on IoU-based matching, making it computationally efficient.
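
The two-stage association is simple enough to sketch directly. The version below reuses the `iou` helper from the DetA sketch above and uses Hungarian matching from SciPy; the thresholds and bookkeeping are illustrative simplifications of the published method.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def byte_associate(tracks, detections, high_thresh=0.6, iou_gate=0.3):
    """One frame of BYTE-style two-stage association (simplified).

    tracks: objects exposing .predict() -> (x1, y1, x2, y2)
    detections: list of (box, score) pairs from the detector
    Returns (matched pairs, lost tracks, unmatched high-score detections);
    the latter are the usual candidates for starting new tracks.
    """
    high = [d for d in detections if d[1] >= high_thresh]
    low = [d for d in detections if d[1] < high_thresh]

    def match(tracks, dets):
        if not tracks or not dets:
            return [], list(tracks), list(dets)
        cost = np.array([[1.0 - iou(t.predict(), d[0]) for d in dets]
                         for t in tracks])
        rows, cols = linear_sum_assignment(cost)
        pairs = [(r, c) for r, c in zip(rows, cols)
                 if cost[r, c] <= 1.0 - iou_gate]   # gate out weak overlaps
        matched_r = {r for r, _ in pairs}
        matched_c = {c for _, c in pairs}
        u_tracks = [t for i, t in enumerate(tracks) if i not in matched_r]
        u_dets = [d for j, d in enumerate(dets) if j not in matched_c]
        return [(tracks[r], dets[c]) for r, c in pairs], u_tracks, u_dets

    # Stage 1: high-score detections against all existing tracks.
    matched, unmatched_tracks, unmatched_high = match(tracks, high)
    # Stage 2: low-score detections against the tracks left over.
    matched_low, lost_tracks, _ = match(unmatched_tracks, low)
    return matched + matched_low, lost_tracks, unmatched_high
```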

The strengths of ByteTrack are notable in several areas. It operates at extremely high efficiency, achieving 30 FPS on a V100 GPU while maintaining a simple yet effective association strategy. The method demonstrates strong performance on MOT17 with 80.3 MOTA and 77.3 IDF1, and effectively handles occlusions by recovering objects through low-confidence detections that would typically be discarded by other methods.

However, ByteTrack has several limitations that subsequent methods have attempted to address. The approach relies primarily on spatial consistency through IoU matching, which limits its ability to maintain long-term identity preservation. It lacks learned representations for objects, making it vulnerable in scenarios with complex motion patterns or visually similar objects. Performance decreases significantly in challenging scenarios such as dance performances or crowded sports scenes where appearance and motion cues become unreliable.

ByteTrack established a strong baseline by demonstrating that effective use of low-confidence detections can significantly improve tracking performance. However, it lacks sophisticated mechanisms for dealing with visually similar objects or complex motion patterns, which has motivated the development of more advanced approaches.

Bounding Box Tracking Methods

SambaMOTR: Synchronized State Space Modeling

SambaMOTR, developed by Segu et al., introduces state space models for tracking multiple objects with complex motion patterns, representing a significant departure from traditional association-based approaches. The method is built upon Mamba's selective state space models (SSM) and synchronizes multiple SSMs to model interdependencies between object trajectories. This synchronization allows the system to capture coordinated motion patterns commonly found in dance performances, team sports, and animal group behaviors. The approach uses autoregressive prediction of track queries and introduces MaskObs, a technique for handling uncertain observations during occlusions or challenging scenarios.

SambaMOTR combines a transformer-based object detector with the set-of-sequences processing model, Samba, by leveraging the object detector's encoder to extract image features from each frame, then concatenating this information with detection and track queries from previous frames to track objects. The Samba model updates the track memory for existing objects and instantiates new hidden states for new objects. If the state of a detection is uncertain, whether due to occlusion or internal model uncertainty, the query is masked.
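
In outline, one tracking step follows the flow below. Module names, signatures, and the output structure are illustrative placeholders standing in for the paper's components, not the actual SambaMOTR API.

```python
import torch

def sambamotr_step(frame, track_queries, hidden_states,
                   encoder, decoder, samba, conf_thresh=0.5):
    """One illustrative SambaMOTR-style step (names are placeholders).

    encoder/decoder: DETR-style detector modules
    samba: synchronized set-of-sequences state space model
    """
    feats = encoder(frame)                         # per-frame image features
    detect_queries = decoder.new_object_queries()  # fresh slots for new objects
    queries = torch.cat([track_queries, detect_queries], dim=0)
    outputs = decoder(feats, queries)              # boxes, scores, embeddings

    # MaskObs-style handling: queries with uncertain detections (e.g. under
    # occlusion) are masked, so their hidden states update from history and
    # from interactions with confidently tracked neighbors instead.
    observed = outputs.scores > conf_thresh
    track_queries, hidden_states = samba(
        outputs.embeddings, hidden_states, mask=~observed)
    return outputs.boxes, track_queries, hidden_states
```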

Compared to ByteTrack, SambaMOTR takes a fundamentally different approach to tracking. While ByteTrack uses a detection-association strategy, SambaMOTR employs sequence modeling that explicitly captures complex, interdependent motion patterns rather than relying solely on IoU overlap. The memory representation in SambaMOTR maintains long-range dependencies in a principled manner through state space models, contrasting with ByteTrack's simpler frame-to-frame association. Performance improvements are substantial, with SambaMOTR achieving significantly better results on complex scenarios such as DanceTrack, with a 3.8 HOTA and 5.2 AssA improvement over its closest competitor, MeMOTR, and a 7.7 HOTA and 8.9 AssA improvement over ByteTrack. The efficiency trade-off is reasonable, operating at 16 FPS compared to ByteTrack's 30 FPS while still maintaining real-time performance.

The strengths of SambaMOTR are particularly evident in challenging tracking scenarios. The method excels on datasets with complex motion patterns such as dance performances, sports activities, and bird flocks, where traditional methods struggle. It effectively models interdependencies between objects, allowing it to predict motion patterns based on group behavior. The MaskObs mechanism provides robust handling of occlusions by masking uncertain observations while maintaining state updates through historical information and interactions with confidently tracked objects. The approach scales efficiently to long videos with linear-time complexity, making it suitable for extended tracking scenarios.

However, SambaMOTR has limitations that restrict its applicability in certain deployment scenarios. The more complex architecture requires more training data than simpler methods like ByteTrack. The computational overhead, while manageable, results in slower performance than the baseline method. Additionally, the approach may struggle with highly similar objects that lack distinctive motion patterns, as it relies heavily on motion modeling and temporal consistency.

CAMELTrack: Context-Aware Multi-cue Exploitation

CAMELTrack, introduced by Somers et al., represents a novel approach that learns association strategies directly from data rather than using hand-crafted rules, addressing a fundamental limitation in traditional tracking-by-detection methods. The core innovation lies in the Context-Aware Multi-cue ExpLoitation (CAMEL) module, which consists of two transformer-based components. The Temporal Encoders (TE) aggregate detection cues over time into robust tracklet-level representations, while the Group-Aware Feature Fusion Encoder (GAFFE) integrates multiple cues including appearance, motion, and pose information into unified discriminative embeddings. The system employs an association-centric training scheme that generates challenging tracking scenarios through data augmentation and cross-video sampling, while maintaining a modular design that leverages off-the-shelf models for detection, re-identification, and pose estimation.
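
Conceptually, CAMEL composes one temporal encoder per cue with a fusion encoder over the resulting cue tokens. The sketch below shows that data flow with hypothetical module shapes and names; the released implementation differs in detail.

```python
import torch
import torch.nn as nn

class CamelSketch(nn.Module):
    """Illustrative CAMEL-style association head (placeholder design).

    A temporal encoder per cue turns each tracklet's detection history
    into one embedding; a fusion encoder (standing in for GAFFE) merges
    the per-cue tokens into a single discriminative embedding."""

    def __init__(self, dim=256, cues=("appearance", "motion", "pose")):
        super().__init__()
        def encoder():
            layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=2)
        self.temporal = nn.ModuleDict({c: encoder() for c in cues})
        self.fusion = encoder()

    def forward(self, cue_histories):
        """cue_histories: dict cue -> (num_tracklets, history_len, dim)."""
        # Temporal encoding: summarize each tracklet's history per cue.
        per_cue = [self.temporal[c](h)[:, -1]
                   for c, h in cue_histories.items()]
        tokens = torch.stack(per_cue, dim=1)   # (tracklets, num_cues, dim)
        # Fusion: let cues attend to each other, then pool.
        fused = self.fusion(tokens).mean(dim=1)
        return fused                           # one embedding per tracklet
```

Association then reduces to a similarity matrix between tracklet and detection embeddings, solvable with the same Hungarian matching used by simpler trackers.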

The comparison between CAMELTrack and ByteTrack reveals fundamental differences in their approaches to object association. CAMELTrack employs learned association strategies versus ByteTrack's hand-crafted rules, enabling dynamic, context-aware feature fusion rather than static similarity measures. The temporal context exploitation in CAMELTrack is significantly richer, maintaining tracklet histories and learning to weight different cues based on their reliability in specific scenarios. While ByteTrack primarily relies on spatial information through IoU matching, CAMELTrack effectively integrates appearance features from re-identification models, motion predictions, and pose keypoints when available. The performance improvements are substantial, with CAMELTrack achieving 18.4 HOTA improvement on DanceTrack and 18.3 HOTA improvement on SportsMOT compared to ByteTrack, demonstrating the effectiveness of learned association strategies.

CAMELTrack's strengths are evident across multiple dimensions of tracking performance. The method achieves state-of-the-art performance on multiple benchmarks, particularly excelling in challenging tracking scenarios with similar-looking objects, frequent occlusions, and complex motion patterns. The training efficiency is remarkable, requiring under an hour on a single GPU compared to end-to-end methods that typically need days on multiple GPUs. The modular design allows for easy integration of specialized models and adaptation to different domains by incorporating domain-specific cues. Despite the sophisticated architecture, the method maintains reasonable inference speed at 13 FPS, making it suitable for near real-time applications.

The limitations of CAMELTrack primarily stem from its increased complexity compared to simpler baseline methods. The computational requirements are higher than ByteTrack, both in terms of memory usage and processing time. The transformer-based architecture requires careful design and hyperparameter tuning to achieve optimal performance. Additionally, the method's performance is inherently dependent on the quality of the feature extractors used for appearance, motion, and pose estimation, potentially limiting its effectiveness when these components produce poor features.

Metrics Comparison Table

Instance Segmentation Tracking Methods

Cutie: Object-Level Memory Reading

Cutie, developed by Cheng et al., focuses on video object segmentation with a novel object-level memory reading approach that addresses the limitations of pixel-level matching in challenging scenarios. The method employs a two-stage memory processing architecture that first performs pixel-level memory reading for initial feature readout, followed by object-level memory reading through a query-based object transformer. This design enables the system to maintain both high-resolution spatial information and high-level object representations. The architecture incorporates a foreground-background masked attention mechanism that cleanly separates target object semantics from background distractors, and maintains a compact object memory that summarizes target object features over time for enhanced discrimination.
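
The two-stage readout can be reduced to two attention operations: every frame location first reads from the pixel-level memory, then a small set of object queries reads from that readout. The sketch below strips away the foreground-background masked attention and the full object transformer for brevity; shapes and names are illustrative.

```python
import torch

def cutie_readout_sketch(frame_feats, pixel_memory, object_memory):
    """Illustrative Cutie-style two-stage memory reading.

    frame_feats:   (HW, dim) features of the current frame
    pixel_memory:  (M, dim)  pixel-level features from past frames
    object_memory: (Nq, dim) compact object-level query summary
    """
    scale = frame_feats.shape[-1] ** 0.5

    # Stage 1: pixel-level reading -- each frame location attends over
    # the stored pixel memory for an initial high-resolution readout.
    attn = torch.softmax(frame_feats @ pixel_memory.T / scale, dim=-1)
    readout = attn @ pixel_memory                       # (HW, dim)

    # Stage 2: object-level reading -- object queries attend to the
    # readout, restructuring it around the target object.
    attn_obj = torch.softmax(object_memory @ readout.T / scale, dim=-1)
    refined = attn_obj @ readout                        # (Nq, dim)
    return readout, refined
```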

When compared to bounding box tracking methods, Cutie operates in a fundamentally different domain by providing pixel-level segmentation masks rather than bounding boxes, enabling more precise object localization and shape representation. The memory structure is more sophisticated than traditional approaches, incorporating both object-level and pixel-level memory components that serve different purposes in the tracking pipeline. The attention mechanism is specifically designed for foreground-background separation, contrasting with general association mechanisms used in bounding box methods. Cutie is designed specifically for object segmentation tracking rather than general multi-object tracking, focusing on maintaining accurate segmentation of pre-specified objects rather than discovering and tracking multiple objects simultaneously.

The strengths of Cutie are particularly evident in challenging tracking scenarios with significant distractors and occlusions. The object-level reasoning capability enables robust handling of scenarios where pixel-level matching fails due to visual ambiguity. The architecture maintains high-resolution features throughout the processing pipeline, ensuring accurate segmentation boundaries even in challenging conditions. The efficiency is notable, with the small model variant achieving 45.5 FPS while maintaining strong performance. The method demonstrates significant improvements on challenging datasets, achieving 8.7 J&F improvement over XMem on the MOSE dataset, which contains heavy occlusions and crowded environments.

Although, as we will see in the next section, Cutie's metrics are outperformed by DAM4SAM's SAM2.1++, it should be noted that Cutie has a fairly modular architecture, allowing it to integrate with other frame segmentation tools, as well as to be attached as an additional module to other memory-based video segmentation algorithms without a large computational overhead.

DAM4SAM: A Distractor-Aware Memory (DAM) for Visual Object Tracking with SAM2

SAM2 (Segment Anything Model 2) represents a significant advancement in video object segmentation, extending the original SAM model from static image segmentation to temporal video understanding. Developed by Meta AI, SAM2 introduced a unified architecture capable of handling both image and video segmentation tasks through a sophisticated memory mechanism. The model consists of four main components: a ViT-based image encoder, a prompt encoder for interactive inputs, a memory bank for temporal consistency, and a mask decoder for output generation.

The memory bank is particularly crucial for SAM2's tracking capabilities, storing the encoded initialization frame with user-provided segmentation along with six recent frames containing generated segmentation masks. SAM2 applies temporal encodings to recent frames while keeping the initialization frame unencoded to preserve its role as a supervised prior. The model uses pixel-wise attention mechanisms to transfer labels from memory frames to the current frame, producing memory-conditioned features that are subsequently decoded into segmentation masks. SAM2's memory management follows a first-in-first-out protocol for recent frames while always maintaining the initialization frame, enabling the model to maintain object identity across extended video sequences. This foundation provided the basis for DAM4SAM, which builds upon SAM2's architecture while introducing more sophisticated memory management strategies specifically designed to handle challenging tracking scenarios with distractors.
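
The memory protocol described above is straightforward to express: a pinned initialization entry plus a fixed-size FIFO of recent frames. The class below is a schematic of that bookkeeping under the stated six-slot assumption, not SAM2's actual data structures.

```python
from collections import deque

class Sam2StyleMemoryBank:
    """Schematic of the described SAM2 memory protocol: the initialization
    frame (with its user-provided mask) is always retained, while encoded
    recent frames live in a fixed-size FIFO."""

    def __init__(self, num_recent=6):
        self.init_entry = None                  # never evicted, no temporal encoding
        self.recent = deque(maxlen=num_recent)  # FIFO: oldest evicted automatically

    def set_init(self, features, mask):
        self.init_entry = (features, mask)

    def add_recent(self, features, mask, temporal_pos):
        self.recent.append((features, mask, temporal_pos))

    def entries(self):
        """All memory frames the current frame should attend over."""
        return [self.init_entry] + list(self.recent)
```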

DAM4SAM, introduced by Videnovic et al., extends the SAM2 foundation model with a novel distractor-aware memory model specifically designed for video object tracking in the presence of challenging visual distractors. The key innovation lies in functionally dividing the memory structure into two components: Recent Appearance Memory (RAM) for maintaining segmentation accuracy through recent target appearances, and Distractor Resolving Memory (DRM) for ensuring tracking robustness and enabling reliable re-detection. The system introduces a novel DRM updating mechanism based on SAM2 output introspection, where the divergence between the primary output mask and alternative predictions indicates the presence of potential distractors worthy of memory storage.
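
The update rule can be paraphrased as: refresh the recent-appearance memory every frame, but add a frame to the distractor-resolving memory only when the model's alternative mask hypotheses diverge from its chosen output. The divergence test below (mask IoU between the primary and alternative predictions against a threshold) is an illustrative stand-in for the paper's introspection criterion.

```python
import numpy as np

def mask_iou(a, b):
    """IoU of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 1.0

def dam4sam_style_update(ram, drm, frame_entry, primary_mask, alt_masks,
                         divergence_thresh=0.7):
    """Illustrative DAM4SAM-style memory update (simplified).

    ram: recent appearance memory, e.g. a FIFO deque (segmentation accuracy)
    drm: distractor-resolving memory, grown only on detected ambiguity
    alt_masks: the model's alternative mask hypotheses for this frame
    """
    ram.append(frame_entry)  # RAM is refreshed every frame, FIFO-style

    # Introspection: low agreement between the primary output and any
    # alternative hypothesis signals a potential distractor worth storing.
    diverged = any(mask_iou(primary_mask, m) < divergence_thresh
                   for m in alt_masks)
    if diverged:
        drm.append(frame_entry)  # keep the frame to resolve the distractor later
    return diverged
```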

The comparison between DAM4SAM and other tracking methods reveals several distinctive characteristics. The memory structure represents a functional division of responsibilities rather than the unified approaches employed by other methods, with each memory component serving a specific purpose in the tracking pipeline. The distractor handling approach is explicitly focused on distinguishing targets from visually similar objects, contrasting with more general robustness mechanisms in other methods. DAM4SAM extends a pre-trained foundation model rather than developing a specialized architecture from scratch, leveraging the extensive pre-training of SAM2 while adding targeted improvements. The update mechanism uses introspection of the model's own predictions to determine when to store frames in distractor-resolving memory, a novel approach compared to regular temporal update strategies.

The strengths of DAM4SAM are demonstrated through significant performance improvements on challenging benchmarks. The method achieves state-of-the-art performance on segmentation tracking benchmarks, with notable improvements on the DiDi dataset over competing methods: a 21% improvement in quality over Cutie (which scores 0.607), and a quality of 0.711 against 0.722 for S3-Track, a highly complex algorithm optimized for VOT2024. The approach effectively handles challenging distractors that typically cause tracking failures in other methods, as evidenced by substantial improvements in robustness metrics. One of the most appealing aspects is that the method requires no additional training, leveraging the pre-trained SAM2 foundation model while introducing algorithmic improvements through better memory management. The approach is elegant and conceptually simple compared to more complex architectural modifications, making it accessible and interpretable.

The limitations of DAM4SAM primarily concern computational efficiency and model dependencies. There is a moderate speed reduction compared to the baseline SAM2.1, with the method running approximately 20% slower due to the additional memory management overhead. The method is inherently tied to the SAM2 foundation model, limiting its applicability to scenarios where SAM2 is not suitable or available. Like other segmentation-based tracking methods, it requires initialization with segmentation masks rather than simple bounding boxes, which may limit its applicability in certain tracking scenarios.

Metrics Comparison Table

Overall Analysis

Association Strategies

The five methods represent a clear evolution in object association strategies, demonstrating the field's progression from simple heuristics toward sophisticated learned approaches. ByteTrack employs traditional Kalman filter prediction combined with IoU matching, with the key innovation being the effective utilization of low-confidence detections through a two-stage association process. SambaMOTR introduces synchronized state space models with selective memory updating, enabling the modeling of complex interdependencies between multiple objects and their motion patterns. CAMELTrack represents a significant leap toward learned association through its transformer-based architecture that performs contextual association by dynamically weighting multiple cues based on their reliability in specific scenarios. Cutie implements object-level query-based association combined with pixel-level memory, enabling high-level reasoning about object identity while maintaining detailed spatial information. DAM4SAM focuses on distractor-aware memory management with functionally divided memory components, each serving specific purposes in maintaining tracking accuracy and robustness.

This progression demonstrates the field's clear movement from hand-crafted association rules toward learned, context-aware approaches that can adapt to challenging scenarios. The evolution shows increasing sophistication in handling complex tracking scenarios, with each method addressing specific limitations of simpler approaches while introducing new capabilities for robust object tracking.

Memory Representation

The memory structures employed by these methods reveal significant variation in approaches to temporal information storage and utilization. ByteTrack uses simple trajectory-based memory with first-in-first-out updates, maintaining minimal historical information focused primarily on recent object positions and basic motion prediction. SambaMOTR employs hidden state memory that captures long-range dependencies while modeling interactions between tracklets, enabling the system to understand coordinated motion patterns and complex temporal relationships. CAMELTrack maintains feature banks with temporal encoding and group-aware fusion, allowing for rich historical representation that can be dynamically accessed based on current tracking context. Cutie implements a dual memory system combining pixel-level features for detailed spatial information with object-level summaries for high-level identity maintenance. DAM4SAM introduces functionally divided memory with separate components for recent appearance maintenance and distractor resolution, each updated according to different criteria and serving distinct purposes in the tracking pipeline.

The trend across these methods is toward more sophisticated memory structures that can handle long-term dependencies, complex interrelationships between objects, and the varying reliability of different types of information over time. This evolution reflects the increasing understanding that effective tracking requires not just current frame information but sophisticated utilization of historical context.

Computational Efficiency

The computational efficiency analysis reveals interesting trade-offs between performance and speed across the different approaches. ByteTrack remains the efficiency champion at 30 FPS due to its algorithmic simplicity and reliance on basic IoU calculations without complex feature processing. DAM4SAM operates at a moderate 11 FPS, with the efficiency reduction primarily due to the additional memory management overhead rather than fundamental architectural complexity. CAMELTrack achieves 13 FPS despite its transformer-based architecture, demonstrating that learned association can be implemented efficiently through careful architectural design. SambaMOTR operates at 16 FPS, which is reasonable given its complex state space modeling and synchronization mechanisms. Cutie achieves impressive efficiency with its small model variant running at 45.5 FPS, though this comes with some performance trade-offs compared to larger variants.

The efficiency analysis reveals that while more sophisticated methods generally require more computational resources, careful architectural design can maintain reasonable performance levels. The choice between methods often depends on the specific application requirements, with real-time applications potentially favoring simpler approaches while offline or high-accuracy applications benefiting from more sophisticated methods.

Performance on Challenging Scenarios

The performance comparison on challenging datasets with complex motion, similar appearances, and frequent occlusions reveals significant differences in method capabilities. CAMELTrack achieves state-of-the-art performance on DanceTrack with 69.3 HOTA, demonstrating the effectiveness of learned multi-cue association in scenarios with similar-looking dancers and complex choreographed movements. SambaMOTR shows strong performance on DanceTrack with 67.2 HOTA, indicating that synchronized state space modeling effectively captures coordinated motion patterns. DAM4SAM achieves the best performance on MOSE with 68.3 J&F, highlighting the effectiveness of distractor-aware memory management in crowded and occluded scenarios. Cutie demonstrates strong performance on MOSE with competitive J&F scores, showing that object-level reasoning effectively handles challenging segmentation scenarios.

In stark contrast, ByteTrack achieves only 47.7 HOTA on DanceTrack, illustrating the limitations of simple association strategies in complex scenarios. This performance gap demonstrates that while ByteTrack established an effective baseline for standard tracking scenarios, more sophisticated approaches are necessary for challenging real-world applications involving complex motion patterns, visual similarity, and frequent occlusions.

The performance analysis reveals that different methods excel in specific types of challenging scenarios, with no single approach dominating across all conditions. This suggests that method selection should be based on the specific characteristics of the target application and the types of challenges expected in the tracking scenarios.

Conclusion

This meta-analysis reveals several key trends and insights in contemporary object tracking research. The field is experiencing a clear transition from heuristic-based approaches toward learned, data-driven methods that can adapt to complex scenarios.

Future research directions will likely continue advancing learned association strategies while addressing the computational efficiency requirements of real-time applications. The integration of foundation models represents a promising direction that leverages large-scale pre-training while enabling targeted improvements for specific tracking challenges. The development of adaptive systems that can select appropriate tracking strategies based on scenario characteristics, and the exploration of hybrid approaches that combine the strengths of different paradigms, represent additional promising directions for future investigation.
