Introducing MoViNet for Video Classification

Recognize actions in video by harnessing temporal dynamics with the MoViNet architecture, now available on Datature Nexus.

Leonard So

What Is Video Classification?

Video classification is the task of assigning class labels to a sequence of consecutive video frames, which can range from a few frames within a video to the entire video. Whereas image classification labels individual frames in isolation, video classification faces the more computationally expensive challenge of modeling how content changes across frames. The task therefore requires not only recognizing the objects, actions, or scenes within each frame but also understanding the overarching theme or content of the video as a whole. Common deep learning approaches include convolutional neural networks (CNNs), recurrent neural networks (RNNs), and attention mechanisms, which capture both frame-level details and the overall narrative or context of the video.
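To see why frame-level predictions alone are not enough, consider a minimal sketch in which an image classifier scores each frame and the scores are averaged over time. The scores and class names below are made up for illustration, and `average_frame_scores` is a hypothetical helper, not part of any library.

```python
from statistics import fmean

def average_frame_scores(frame_scores):
    """Average per-frame class scores over time (a frame-level baseline)."""
    num_classes = len(frame_scores[0])
    return [
        fmean([frame[c] for frame in frame_scores])
        for c in range(num_classes)
    ]

# Hypothetical per-frame scores for the classes ("door open", "door closed"),
# as an image classifier might produce for a clip of a door being opened.
opening = [[0.1, 0.9], [0.3, 0.7], [0.7, 0.3], [0.9, 0.1]]
# The same frames in reverse order: a door being closed.
closing = opening[::-1]

print(average_frame_scores(opening))
print(average_frame_scores(closing))
```

Both clips receive the same averaged scores, because averaging discards frame order. Telling "opening" from "closing" requires a model that reasons over the sequence, which is what video classification architectures are designed to do.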

What Are Popular Applications for Video Classification?

  • Surveillance and Security: Video surveillance systems leverage video classification to detect and recognize objects, activities, and anomalies in real time, enhancing security in public spaces, buildings, and borders.
  • Automated Content Tagging & Recommendation: Media companies and broadcasters use video classification to automatically tag and organize vast libraries of video content, enabling efficient content management and retrieval. This can also be used to recommend personalized content to users based on their viewing history, preferences, and behavior patterns.
  • Healthcare Monitoring: Video classification algorithms can analyze medical imaging data, such as X-rays and MRI scans, to assist healthcare professionals in diagnosing diseases and monitoring their progression, such as the spread of cancer.
  • Autonomous Vehicles: In autonomous driving systems, video classification helps vehicles perceive and understand the surrounding environment by identifying and classifying dangerous situations, such as a car veering out of its designated lane.
  • Sports Analytics: Sports organizations and teams can use video classification to analyze player movements, tactics, and performance metrics, aiding in strategic decision-making, player development, and opponent scouting.

What Is MoViNet?

MoViNet, introduced in MoViNets: Mobile Video Networks for Efficient Video Recognition (Kondratyuk et al., 2021), is a family of convolutional neural network architectures developed by Google researchers for efficient video recognition, particularly on mobile devices. Its design prioritizes computational efficiency while maintaining high accuracy, which makes it well suited to tasks like real-time video analysis on smartphones and tablets.

MoViNet Architecture (source).

MoViNet has three key features that set it apart from other video classification networks. Typical 3D convolutional neural networks scale poorly, because their memory and compute costs grow with the frame rate and length of the video, and they are slow at inference. First, the authors use neural architecture search to design a diverse family of efficient 3D CNN architectures tailored for mobile platforms. Second, they introduce a stream buffering technique that decouples memory use from video clip duration. The network processes a video in consecutive subclips and caches feature maps from each subclip, so that temporal operations on the next subclip can draw on past context without the entire video being computed at once. This allows MoViNets to process streaming video of any length with a minimal memory footprint, and it enables the real-time, online inference that many of the applications above require. Finally, they introduce a model ensembling technique that improves accuracy without compromising efficiency. Two models with the same architecture are trained independently at half the frame rate of the original video, with one model's input offset by one frame, so that the two models see different frames of the same video. Averaging their predictions brings the benefits of ensemble learning with more stable results, while keeping the total computational cost comparable to that of a single model running at the full frame rate.
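The stream buffer idea can be illustrated with a toy example. The sketch below is not MoViNet's implementation: it applies a single causal 1D convolution to scalar per-frame features, whereas the real network buffers multi-channel feature maps inside its 3D convolutions. The function `causal_temporal_conv` and its parameters are hypothetical names chosen for illustration. It shows that carrying the last few inputs from one clip to the next gives the same output as processing the whole video at once.

```python
def causal_temporal_conv(frames, kernel, buffer=None):
    """Causal 1D convolution over time with an optional stream buffer.

    frames: list of per-frame feature values (scalars, for simplicity)
    kernel: list of k weights; kernel[-1] multiplies the current frame
    buffer: the last k-1 inputs of the previous clip (zeros at stream start)

    Returns (outputs, new_buffer) so the caller can carry state forward.
    """
    k = len(kernel)
    if buffer is None:
        buffer = [0.0] * (k - 1)
    padded = list(buffer) + list(frames)
    outputs = [
        sum(w * x for w, x in zip(kernel, padded[t:t + k]))
        for t in range(len(frames))
    ]
    # Keep only the most recent k-1 inputs; memory does not grow with length.
    new_buffer = padded[len(padded) - (k - 1):]
    return outputs, new_buffer

video = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
kernel = [0.25, 0.5, 1.0]

# Reference: process the whole video in one pass.
full_outputs, _ = causal_temporal_conv(video, kernel)

# Streaming: process clips of 3 frames, carrying the buffer between clips.
streamed_outputs, buffer = [], None
for start in range(0, len(video), 3):
    clip = video[start:start + 3]
    clip_outputs, buffer = causal_temporal_conv(clip, kernel, buffer)
    streamed_outputs.extend(clip_outputs)

print(streamed_outputs == full_outputs)
```

The buffer holds only `k - 1` values regardless of how long the stream runs, which is the property that lets a streaming model handle video of arbitrary length.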

How to Train MoViNet on Datature Nexus?

MoViNet can be easily trained with Datature’s Nexus. To do so, you can follow the simple steps below.

Create Your Classification Project

To get started, log in or create an account with Datature Nexus. In your workspace, you can create a project with the project type of Classification and specify your data type as Videos or Images and Videos.

Creating Your Video Classification Project on Nexus.

Onboard and Label Your Video Data

In your classification project, you can upload your video data. For classification labels, you can import existing annotations using the CSV classification file format, or label the video frames manually with the Nexus Annotator. The classification annotator can assign frames to the labels you have created or to the background class. To speed up annotation across many video frames, you can use our Video Interpolation tool, with which you typically need to label only about 10% of the frames in a video.

Using Interpolation For Efficient Video Annotation on Nexus.

How to Fine-Tune MoViNet on Your Custom Data?

To fine-tune a MoViNet model, create a training workflow that uses your annotated data. On Nexus, you can start from weights pre-trained on the Kinetics-600 dataset (Carreira et al., 2018) or continue from a previously trained artifact of the same model type. Datature offers the A0 to A5 architectures, with input resolutions ranging from 172x172 to 320x320.
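For reference, the sketch below lists the input resolution of each variant as reported for the original MoViNet models. Treat the table as an assumption and confirm the exact options in the Nexus workflow. The helper `pixels_per_clip` is hypothetical and only illustrates how quickly the input grows from A0 to A5.

```python
# Input side length in pixels per variant, as reported for the original
# MoViNet models. Assumption: Nexus uses the same values.
MOVINET_RESOLUTION = {
    "A0": 172, "A1": 172, "A2": 224, "A3": 256, "A4": 290, "A5": 320,
}

def pixels_per_clip(variant, num_frames):
    """Number of pixels per channel in one clip of `num_frames` frames."""
    side = MOVINET_RESOLUTION[variant]
    return num_frames * side * side

ratio = pixels_per_clip("A5", 8) / pixels_per_clip("A0", 8)
print(f"A5 processes about {ratio:.1f}x as many pixels per clip as A0")
```

Input size is only one part of the cost, since the larger variants are also deeper and wider, but it is a quick way to see why smaller variants are a sensible starting point for latency-sensitive deployments.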

Building Training Workflow with Different MoViNet Architectures on Nexus.

Nexus also exposes several hyperparameters for tuning MoViNet training: batch size, frame size (the number of frames processed at once), frame stride, discard threshold, and training steps.
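As a rough illustration of how frame size and frame stride interact, the sketch below computes which frames of a video would make up one clip. It assumes that frame stride means the gap between consecutive sampled frames, which may differ from how Nexus defines it. The function `clip_frame_indices` is a hypothetical helper.

```python
def clip_frame_indices(num_video_frames, frame_size, frame_stride, start=0):
    """Indices of the frames that make up one clip.

    frame_size:   number of frames in the clip
    frame_stride: gap between consecutive sampled frames (1 = every frame)
    """
    last = start + (frame_size - 1) * frame_stride
    if last >= num_video_frames:
        raise ValueError("clip does not fit in the video")
    return list(range(start, last + 1, frame_stride))

print(clip_frame_indices(300, frame_size=8, frame_stride=4))
# → [0, 4, 8, 12, 16, 20, 24, 28]
```

Under this reading, a larger stride lets the same number of frames cover a longer stretch of video, at the cost of missing fast motion between sampled frames.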

Customizing Model Hyperparameters on Nexus.

With these hyperparameters, you can even train the ensemble described in the original paper by training two models and aggregating their predictions at inference for better performance.
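One way to picture the aggregated inference step is sketched below. It assumes that each model receives every other frame of the clip, with the second model's frames offset by one, and that the two sets of logits are averaged before the softmax. The two toy models are stand-ins that return made-up logits. In practice they would be your two trained MoViNet artifacts.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_predict(frames, model_a, model_b):
    """Average the logits of two models fed interleaved halves of a clip."""
    logits_a = model_a(frames[0::2])  # even-indexed frames
    logits_b = model_b(frames[1::2])  # odd-indexed frames (offset by one)
    mean_logits = [(a + b) / 2 for a, b in zip(logits_a, logits_b)]
    return softmax(mean_logits)

# Hypothetical stand-ins for two trained models: each maps a list of
# frames (plain numbers here) to logits for two classes.
def toy_model_a(frames):
    return [0.1 * sum(frames), 1.0]

def toy_model_b(frames):
    return [0.2 * sum(frames), 0.5]

clip = [0, 1, 2, 3, 4, 5, 6, 7]
probs = ensemble_predict(clip, toy_model_a, toy_model_b)
print(probs)
```

Because each model sees only half of the frames, running both costs roughly as much as running one model on every frame.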

Once the training workflow has been set up, you can select Run Training and train the model with your choice of hardware and checkpoint settings.

Customizing Training Settings on Nexus.

To monitor your training and model performance, you can view and analyze the metric curves in real time on the Trainings page, as well as visualize predictions through our Advanced Evaluation and Confusion Matrix tools.

Visualizing Real-Time Training Metric Curves on Nexus.

How to Deploy Your Trained MoViNet Model for Inference?

With your trained artifacts, you can quickly deploy your model on the cloud on either CPU or GPU. MoViNet is deployed for streaming video: stream buffers keep inference computationally efficient and retain temporal context as successive clips are passed through, which improves accuracy.

Configuring a Hosted Model Deployment on Nexus.

The deployment can be called through an API route using our provided script, or tested directly on the platform with local data using the Test Your API button.

Testing the Model Deployment with Unseen Videos on Nexus.

Try It On Your Own Data

You can easily try this out on your own video data by following the steps above with your own Datature Nexus account. With a Free tier account, you can complete all of the steps within the account quota limits, with no credit card or payment required.

What’s Next?

You can also check whether image classification or video classification is better suited to your use case by training an image classification model for comparison. For simpler use cases, an image model can offer faster inference at deployment. To learn more about training image classification models, you can read this article.

Our Developer’s Roadmap

Alongside video classification, we will be developing support for more video and 3D data inputs. Users can look out for model training support for 3D medical models and other video-related models for action recognition and action classification tasks. As always, user feedback is welcome: if there are particular models in the 3D or temporal space that you feel should be on the platform, please reach out and let us know!
