Accelerating Video Annotation with Video Interpolation/Video Tracking

With video interpolation and tracking, annotations made on a few frames are propagated to the remaining frames, reducing the tedium of annotating frame by frame.

Leonard So

Annotating Video for Computer Vision

As the machine learning community continues to learn how much training data shapes a model’s performance, it has become even more essential to build large, well-labelled datasets to give your computer vision project the best chance of succeeding.

Unfortunately, as with many machine learning efforts, quality training data is difficult to come by. In particular, computer vision requires time- and effort-intensive manual annotation to produce quality training data for its models. Tasks such as object detection or segmentation require detailed and precise annotations in order to achieve strong accuracy.

Video is an integral source of data and inspiration in the computer vision field. Video is an easy way to collect a lot of image data rapidly, by splitting videos into their individual frames. Additionally, video is the format of choice in many real-world contexts, such as live streaming from CCTV cameras, the analysis of footage for object tracking, and a multitude of other applications. However, by the nature of video data, nearby frames tend to be quite similar to each other, and annotations made on one frame are visually similar to the ones that would have to be made on other frames. Thus, manual frame-by-frame annotation is not only labor intensive, but frustratingly repetitive as well.

As such, it’s critical that computer vision tools and platforms be able to facilitate the use of video data in their MLOps pipeline, and provide annotation tools that can streamline and accelerate the annotation process in such a way that leverages the visually similar nature of frames.

What is Video Interpolation?

Video annotation interpolation techniques are designed precisely to exploit the similarity of visual features between frames, efficiently constructing annotations from just a couple of manual annotations. Overall, our tools were designed to provide annotation suggestions in other frames based on a user’s manual annotation. Additionally, as we understand that users and use cases require different levels of annotation accuracy, the tools are designed to let users improve the quality of the predicted annotations by adding further annotations.

Broadly speaking, video interpolation techniques can be split into computer vision model-based and model-free approaches. Model-free approaches use the polygon coordinates of the manual annotations to construct a mathematical interpolation for the polygons in the frames between the start and end frames. Model-based approaches use machine learning based computer vision models to extract features within the manual annotations and search for similar features in the other frames to produce annotations automatically. Model-free approaches are generally computationally cheap, while model-based approaches carry some computational overhead but are much more capable of analyzing visual features for better predictions.

Model-free interpolation can be considered a practical application of polygon morphing, a topic that is very common in graphics. Because interpolation in our case should produce polygons that naturally represent how an object’s shape changes in view over time, our goal with the interpolation tool was to reduce visual anomalies and frequent, large changes in polygon shape throughout the interpolation.
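As a rough illustration of the model-free idea, the sketch below blends the vertex coordinates of two keyframe polygons linearly. It assumes both polygons have the same number of vertices listed in matching order, a correspondence that a real polygon morphing method has to establish first. The function names are hypothetical and are not part of Nexus.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]


def interpolate_polygon(start: Polygon, end: Polygon, t: float) -> Polygon:
    """Linearly blend two polygons vertex by vertex, with t in [0, 1].

    Assumption: vertex i of `start` corresponds to vertex i of `end`.
    """
    if len(start) != len(end):
        raise ValueError("polygons must have the same number of vertices")
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [
        ((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
        for (x0, y0), (x1, y1) in zip(start, end)
    ]


def interpolate_between_keyframes(
    start_frame: int, start: Polygon, end_frame: int, end: Polygon
) -> Dict[int, Polygon]:
    """Return {frame: polygon} for every frame strictly between two keyframes."""
    span = end_frame - start_frame
    if span <= 0:
        raise ValueError("end_frame must come after start_frame")
    return {
        frame: interpolate_polygon(start, end, (frame - start_frame) / span)
        for frame in range(start_frame + 1, end_frame)
    }


# Usage: a square annotated at frame 0 and again, shifted, at frame 4.
square_at_0 = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
square_at_4 = [(4.0, 8.0), (6.0, 8.0), (6.0, 10.0), (4.0, 10.0)]
in_between = interpolate_between_keyframes(0, square_at_0, 4, square_at_4)
```

Blending coordinates this way is cheap, but it only follows an object that moves and deforms roughly linearly between the two keyframes.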

What is Video Tracking?

Model-free interpolation has difficulty matching non-linear, atypical movements, so an AI-assisted tool is a much stronger alternative: it leverages visual features rather than relying on polygon coordinate values.

Our AI-assisted tool is a video tracker that takes an initial annotation on one frame and uses a computer vision model to match the features of that annotation in other frames and reconstruct annotation masks around them. Visual features can evolve throughout the video, so users are able to re-annotate other frames. These additional ground-truth labels provide more features, which are used in conjunction with each other to improve the annotations in the remaining frames. Notably, the tool is semantic in nature, so multiple polygons can be associated with the same class. When users are annotating hundreds of frames, it can be difficult to tell which frames should be corrected to improve the predicted annotation quality. As such, the tool also suggests keyframes for correction, namely the frames that the model evaluates as the most lossy. The annotation process for hundreds of frames is therefore reduced to annotating a few frames and making corrections where needed or suggested.
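The keyframe suggestion step can be pictured with the sketch below. It assumes the tracker exposes one loss value per frame, which is a hypothetical interface invented for illustration; the post does not describe how Nexus computes or stores this score. The sketch simply picks the highest-loss frames that the user has not already annotated.

```python
from typing import Dict, List, Set


def suggest_keyframes(
    frame_loss: Dict[int, float], annotated: Set[int], k: int = 3
) -> List[int]:
    """Return up to k frame indices with the highest loss, in frame order.

    `frame_loss` maps a frame index to a hypothetical per-frame loss
    (higher means the prediction matches the frame less well).
    Frames in `annotated` already hold ground truth and are skipped.
    """
    candidates = [frame for frame in frame_loss if frame not in annotated]
    # Highest loss first; ties are broken by the earlier frame.
    candidates.sort(key=lambda frame: (-frame_loss[frame], frame))
    return sorted(candidates[:k])


# Usage: frame 0 was annotated by hand, so it is never suggested.
losses = {0: 0.0, 1: 0.2, 2: 0.9, 3: 0.4, 4: 0.7}
suggestions = suggest_keyframes(losses, annotated={0}, k=2)
```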

How Does Video Interpolation Work on Nexus?

To access video interpolation tools, you will have to be on the Annotator page looking at a video asset. The video bar below will have a toggle on the top of it, showing Interpolation Mode. When the toggle is switched on, users’ manual annotations will operate as keyframes that will help make predictive annotations on other frames. As the Nexus platform supports instance and panoptic segmentation, ground-truth annotations and their corresponding instance segmentation annotations will retain their “object”-ness, in that they will contain the same object labels.

Example of Linear Interpolation on Nexus

Linear Interpolation on Nexus

As described above, linear interpolation requires two or more manual annotations for interpolation to occur. Once Interpolation Mode is toggled on, an annotation made with a specified class will create a keyframe at the frame you are annotating. On the video bar, you will see a diamond shape in the row dedicated to your object, indicating that a keyframe has been made. You can then navigate to another frame and annotate the same object there. Once two keyframe annotations are made, a dashed line will appear between the two keyframes on the video bar, indicating that interpolation has occurred. To improve or extend the interpolation, you can continue to add keyframe annotations. To place a keyframe annotation between two existing keyframes, you can delete the interpolated annotation and replace it with your own ground-truth annotation. The interpolation algorithm will automatically rerun using the two nearer keyframes. Once you are happy with the interpolated annotations, you can select the green Confirm button to commit them as proper annotations.
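The effect of adding a keyframe in the middle can be sketched as a lookup of the two keyframes that bracket a given frame. This is an illustrative sketch only: the function name and the sorted list of keyframe indices are assumptions, not the Nexus implementation.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple


def bracketing_keyframes(
    keyframes: List[int], frame: int
) -> Optional[Tuple[int, int]]:
    """Return the nearest keyframes on either side of `frame`, or None.

    `keyframes` must be sorted in ascending order. None is returned when
    `frame` is itself a keyframe or lies outside the keyframed range,
    since there is nothing to interpolate in either case.
    """
    i = bisect_left(keyframes, frame)
    if i == 0 or i == len(keyframes) or keyframes[i] == frame:
        return None
    return keyframes[i - 1], keyframes[i]


# Usage: with keyframes at 10 and 50, frame 20 is interpolated from (10, 50).
before = bracketing_keyframes([10, 50], 20)
# After the user adds a keyframe at 30, frame 20 uses the nearer pair (10, 30).
after = bracketing_keyframes([10, 30, 50], 20)
```

Only the frames between the new keyframe and its immediate neighbours change their bracketing pair, so the rest of the interpolation is unaffected.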

Video Tracking on Nexus

Example of Video Tracking on Nexus

Video tracking on Nexus can be started by manually annotating just one frame. Once Tracking Mode is toggled on, users can make as many annotations as they want on a frame of their choosing. Once they confirm these annotations for tracking, they select the range of frames on which the objects should be tracked. When tracking begins, a progress bar will appear and the Annotator will be locked. Once tracking has completed, predicted annotations will appear for the selected frames. Users can then navigate between the frames and select a frame in which the annotations are the most incorrect. To reduce the tedium of finding a bad frame, the tool also provides keyframe suggestions in the form of a red bar overlaying a frame on the video bar, which indicates that the tool believes the predicted annotations do not match the visual features on that frame very well. Once users have found a frame they wish to re-annotate, they can delete the predicted annotations and replace them with their own ground-truth annotations, and the tracker will refine the predicted annotations in the other frames. Users can repeat this as many times as they want. Once satisfied, they can confirm all the annotations.

Our Developer’s Roadmap

With our commitment to ensuring that the annotation process is less of a roadblock in the MLOps process, we are continuing to introduce new tools and improve on existing features in order to improve accuracy and reduce time taken. We will be leveraging new state-of-the-art model architectures and algorithms in the future to improve our IntelliBrush capabilities as well as introduce model-assisted automatic mask segmentation without the need for Nexus trained models to be hosted for inference.

Want to Get Started?

If you have questions, feel free to join our Community Slack to post your questions or contact us about how video interpolation or video tracking fits in with your usage. 

For more detailed information about the capabilities of Video Interpolation and Video Tracking, or answers to any common questions you might have, read more on our Developer Portal.
