A Comprehensive Guide to Neural Network Model Pruning

Model pruning is a technique to remove unimportant parameters from neural networks, enhancing efficiency without significantly compromising performance. It balances model accuracy with size reduction, ideal for deployment in constrained environments or real-time applications.

Marcus Neo

What is Model Pruning?

Model pruning refers to the act of removing unimportant parameters from a deep learning neural network model to reduce the model size and enable more efficient model inference. Generally, only the weights are pruned and the biases are left untouched, since pruning biases tends to have much more significant downsides.

Visualization of How Weights Are Zeroed During Unstructured Pruning

As these parameters are being removed, there may be resultant degradation of the model’s inference performance, hence it should be performed with care. In the subsequent sections, we will explore the types of pruning, as well as pruning strategies, before discussing the optimal pruning strategy for your specific use-case.

Why is Model Pruning Important?

It is well documented that neural networks tend to have more parameters than they need to generalize well and make accurate predictions. “The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks” (Frankle and Carbin, 2019) demonstrates that dense networks contain sparse subnetworks that, when trained in isolation, can match the accuracy of the full network. Model pruning is therefore an intuitive approach to model compression: it removes weights that contribute little to the output, much like how the brain weakens rarely used connections between neurons to emphasize important pathways.

With this in mind, pruning is generally more surgical in compressing and streamlining models than methods such as quantization, which reduces the precision of all model weights uniformly. As such, model pruning is ideal for practitioners who want to reduce overall model size while maintaining, or even improving, model accuracy.

How Can Model Pruning Be Achieved?

There are two main approaches to pruning neural networks, train-time pruning and post-training pruning, distinguished by when the pruning decisions are made relative to training. Both aim to reduce the size and computational complexity of the network.

A Comparison of Different Pruning Approaches. Source: When to Prune? A Policy towards Early Structural Pruning

Train-Time Pruning

Train-time pruning involves integrating the pruning process directly into the training phase of the neural network. During training, the model is trained in a way that encourages sparsity or removes less important connections or neurons as part of the optimization process. This means that the pruning decisions are made simultaneously with the weight updates during the training iterations.

Train-time pruning can be implemented with sparsity-inducing regularization, most commonly an L1 penalty that drives weights toward exactly zero (an L2 penalty shrinks weights but rarely zeroes them), or by incorporating pruning masks into the optimization process.
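To make the idea concrete, here is a minimal sketch of train-time pruning on a toy linear model, combining an L1 penalty with a magnitude-based mask that is tightened as training progresses. The function names, the linear sparsity ramp, and the hyperparameters are illustrative assumptions rather than a prescribed recipe, and plain Python lists are used only to keep the example dependency-free.

```python
import random


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def magnitude_mask(weights, sparsity):
    """Build a 0/1 mask that drops the `sparsity` fraction of smallest-magnitude weights."""
    k = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(order[:k])
    return [0.0 if i in dropped else 1.0 for i in range(len(weights))]


def train_with_pruning(xs, ys, steps=200, lr=0.1, l1=0.01, final_sparsity=0.5):
    """Fit y ~ w . x by gradient descent with an L1 penalty and a gradually tightened mask."""
    n = len(xs[0])
    w = [0.1] * n
    mask = [1.0] * n
    for step in range(steps):
        # Gradient of the mean squared error with respect to each weight.
        grad = [0.0] * n
        for x, y in zip(xs, ys):
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j in range(n):
                grad[j] += 2.0 * err * x[j] / len(xs)
        # The L1 term adds a constant pull toward zero on every non-zero weight.
        w = [wi - lr * (g + l1 * sign(wi)) for wi, g in zip(w, grad)]
        # Every 20 steps, raise the sparsity target (linear ramp over the first half).
        if step % 20 == 0:
            target = final_sparsity * min(1.0, step / (steps / 2))
            mask = magnitude_mask(w, target)
        # Masked weights are held at zero until the mask is next recomputed.
        w = [wi * mi for wi, mi in zip(w, mask)]
    return w, mask


# Toy data (an assumption for illustration): only features 0 and 2 matter.
rng = random.Random(0)
xs = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(64)]
ys = [3.0 * x[0] - 2.0 * x[2] for x in xs]

weights, mask = train_with_pruning(xs, ys)
print(mask)
```

In this sketch the pruning decision and the weight update happen inside the same loop, which is what distinguishes train-time pruning from pruning a finished model. Real training pipelines would apply the same pattern per layer and with a framework's own optimizer.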

Post-Training Pruning

Post-training pruning, as the name suggests, prunes a model after it has been fully trained, without any consideration of pruning during training. Once the model has been trained to convergence, pruning techniques are applied to identify and remove less important connections, neurons, or entire structures. This is typically done as a separate step after training is complete.

Key Differences

Both train-time pruning and post-training pruning offer significant benefits to model compression and optimization. However, each comes with its own set of downsides, as summarized in the table below.

Table of Trade-offs of Different Pruning Approaches

A good starting point is typically to implement post-training pruning, since you can immediately prune any existing model without having to add complexity to your training pipeline and re-train the model from scratch. If the accuracy degradation caused by pruning remains unacceptable even after fine-tuning your model, you can consider train-time pruning instead.

Types of Post-Training Pruning

There are two main types of pruning: structured and unstructured pruning. Unstructured pruning is generally focused on removing individual model weight parameters, while structured pruning deals with cutting out entire weight structures.

Unstructured Pruning

Unstructured pruning is the simpler, more naive approach to pruning, which makes it an accessible method with a low barrier to entry. It typically applies a minimum threshold to the raw weights or their activations to decide whether each individual parameter should be pruned. If a parameter fails to meet the threshold, it is zeroed out. Because unstructured pruning zeroes individual weights inside the weight matrices without changing their shapes, all of the calculations performed before pruning are still performed afterwards, so there is minimal latency improvement. On the other hand, it can help denoise model weights for more consistent inference, and it reduces model size because the zeroed weights compress well under lossless compression. Unlike structured pruning, which requires contextual information and adaptation to the architecture, unstructured pruning can generally be used out of the box without too much risk. “Post-training deep neural network pruning via layer-wise calibration” (Lazarevich et al., 2021) demonstrates the effectiveness of this simple, data-free paradigm, reducing model weights by more than 50% with less than a 1% drop in accuracy.
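As a rough illustration of threshold-based unstructured pruning, the sketch below zeroes every weight whose magnitude falls under a fixed threshold. The weight values, the threshold of 0.05, and the function names are made up for the example.

```python
def prune_unstructured(matrix, threshold):
    """Zero every weight whose absolute value is below `threshold`."""
    return [[0.0 if abs(w) < threshold else w for w in row] for row in matrix]


def sparsity(matrix):
    """Fraction of entries in the matrix that are exactly zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for w in row if w == 0.0)
    return zeros / total


# Illustrative 2x3 weight matrix (values are arbitrary).
weights = [
    [0.80, -0.02, 0.35],
    [0.01, -0.60, 0.04],
]

pruned = prune_unstructured(weights, threshold=0.05)
print(pruned)            # same 2x3 shape, small weights replaced by 0.0
print(sparsity(pruned))  # fraction of zeroed weights
```

Note that the pruned matrix keeps its original shape, so a dense matrix multiplication still performs the same number of multiply-adds. This is why the benefit shows up mainly in compressed file size rather than latency.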

Structured Pruning

Structured pruning is a more ambitious, architecturally minded approach to pruning. By removing entire structured groups of weights, it reduces the number of calculations that have to be made in the forward pass through the model’s graph. This yields real improvements in both inference speed and model size. “DepGraph: Towards Any Structural Pruning” (Fang et al., 2023) demonstrates strong capabilities across a broad range of architectures, maintaining accuracy while roughly halving inference time. Given this more ambitious goal, structured pruning methods have to be more precise and intentional, because removing an entire group of weights affects every node in the graph that depends on it. This requires more adaptive underlying processes, and it can have catastrophic consequences for performance if applied arbitrarily.

Visualization of How Layers Are Removed in Structured Pruning
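The dependency issue can be sketched on a tiny two-layer network: removing a hidden neuron means deleting its row in the first weight matrix, its bias, and the matching column in the next layer. The L2-norm ranking criterion, the function names, and the weight values below are assumptions chosen for illustration. They are not the DepGraph algorithm.

```python
import math


def prune_neurons(w1, b1, w2, keep_ratio):
    """Drop the hidden neurons with the smallest L2 norm of incoming weights.

    w1: one row of incoming weights per hidden neuron
    b1: one bias per hidden neuron
    w2: one row per output, one column per hidden neuron
    """
    hidden = len(w1)
    n_keep = max(1, int(round(hidden * keep_ratio)))
    norms = [math.sqrt(sum(v * v for v in row)) for row in w1]
    ranked = sorted(range(hidden), key=lambda i: norms[i], reverse=True)
    keep = sorted(ranked[:n_keep])
    new_w1 = [w1[i] for i in keep]                   # remove rows
    new_b1 = [b1[i] for i in keep]                   # remove matching biases
    new_w2 = [[row[i] for i in keep] for row in w2]  # remove matching columns downstream
    return new_w1, new_b1, new_w2, keep


def multiply_adds(w1, w2):
    """Multiply-add count of the two dense layers for one input."""
    return sum(len(r) for r in w1) + sum(len(r) for r in w2)


# Illustrative network: 2 inputs -> 4 hidden neurons -> 1 output.
w1 = [[0.9, -0.7], [0.01, 0.02], [-0.5, 0.4], [0.03, -0.01]]
b1 = [0.1, 0.0, -0.2, 0.05]
w2 = [[1.0, 0.3, -0.8, 0.2]]

new_w1, new_b1, new_w2, kept = prune_neurons(w1, b1, w2, keep_ratio=0.5)
print(kept, multiply_adds(w1, w2), multiply_adds(new_w1, new_w2))
```

Unlike zeroing individual weights, the matrices here actually shrink, so the forward pass does less work. Forgetting the downstream column removal would leave the layers with mismatched shapes, which is the kind of inter-layer dependency that structured pruning tools have to track.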

Post-Training Pruning Scopes

For each of these main pruning types, there are also different pruning scopes: local pruning and global pruning. This section provides a deeper dive into what these scopes encompass, and the key differences between them.

Visualization of How Weights Are Chosen to be Zeroed Based on Minimum Magnitude Threshold in Unstructured Pruning

Local Pruning

Local pruning applies the pruning criterion within each layer of the neural network independently. It removes the less important weights, connections, or neurons of a layer based on criteria such as low weight magnitude, low importance in the context of that specific layer, or minimal contribution to the model's performance. Local pruning is often applied iteratively, with weights or connections pruned one at a time or in small groups. Examples of local pruning methods include weight magnitude pruning, unit magnitude-based pruning, or connection sensitivity-based pruning.

Global Pruning

Global pruning, on the other hand, pools candidates from the entire model and prunes them simultaneously against a single budget. It considers the overall importance of weights, neurons, or layers across the whole network rather than within individual layers, so some layers may end up pruned far more heavily than others. Global pruning often involves more sophisticated techniques that take into account the interactions and dependencies between different parts of the network. Examples of global pruning methods include iterative magnitude pruning (where weights across the entire network are ranked and pruned simultaneously), optimal brain damage, or optimal brain surgeon algorithms.

Key Differences

Table of Trade-offs Between Local and Global Pruning Scopes

Both local and global pruning have their merits. Global pruning generally has more context and makes more impactful pruning decisions, but it can leave specific layers disproportionately pruned. Local pruning has less context, so its results may be less effective, but it is a more measured approach.
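The difference between the two scopes can be seen on a small, contrived example with magnitude-based unstructured pruning. The layer names and weight values are invented so that one layer has much smaller weights than the other, and the helper functions are illustrative.

```python
def prune_local(layers, amount):
    """Zero the smallest `amount` fraction of weights in each layer independently."""
    pruned = {}
    for name, weights in layers.items():
        k = int(len(weights) * amount)
        order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
        drop = set(order[:k])
        pruned[name] = [0.0 if i in drop else w for i, w in enumerate(weights)]
    return pruned


def prune_global(layers, amount):
    """Zero the smallest `amount` fraction of weights ranked across all layers."""
    entries = [(abs(w), name, i) for name, ws in layers.items() for i, w in enumerate(ws)]
    k = int(len(entries) * amount)
    drop = {(name, i) for _, name, i in sorted(entries)[:k]}
    return {
        name: [0.0 if (name, i) in drop else w for i, w in enumerate(ws)]
        for name, ws in layers.items()
    }


# Hypothetical layers: conv2 happens to have much smaller weights than conv1.
layers = {
    "conv1": [0.9, -0.8, 0.7, -0.6],
    "conv2": [0.05, -0.04, 0.03, 0.5],
}

local = prune_local(layers, amount=0.5)
global_ = prune_global(layers, amount=0.5)
print(local)
print(global_)
```

With a 50% budget, local pruning removes half of each layer, while global pruning spends the whole budget on `conv2` and wipes it out entirely. This is an extreme case, but it shows how a single network-wide ranking can starve a layer whose weights are small in absolute terms.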

Effects of Pruning

The effects of pruning vary across deep learning models. We did a comparative study by pruning several models in our range of model offerings on Nexus at varying percentages. These models include DeepLabV3 MobileNetV3, UNet ResNet50, YOLOX Large, and several variants across the YOLOv8 family of models, all exported in ONNX. We leveraged unstructured global pruning to showcase how one of the simpler pruning approaches would affect these models.

Model Compression

Model pruning results in significant reductions in the model file size. The compressed model file size decreases linearly as the pruning amount increases. This is expected, as the zeroed-out weights occupy a negligible amount of space once the model file is compressed. Note that pruning is performed on a best-effort basis. For example, a pruning percentage of 90% means that 90% of the weights that are eligible for pruning will be zeroed. Since not all weights are eligible, and other nodes such as activations and biases are excluded from the process, the resulting reduction may fall short of 90% of the original model size, as evidenced by models such as UNet ResNet50 and YOLOX Large in the graph below.

Graph Showing the Effects of Various Pruning Ratios on Compressed Model Size Across Different Model Architectures
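The reason zeroed weights save space only after compression can be reproduced with a small experiment. The sketch below prunes a list of random stand-in weights at several ratios, packs them as 32-bit floats, and compresses the bytes with zlib. The random weights and the use of zlib are assumptions for illustration and do not reflect how any particular export format stores a model.

```python
import random
import struct
import zlib


def prune_smallest(weights, amount):
    """Zero the `amount` fraction of weights with the smallest magnitude."""
    k = int(len(weights) * amount)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def raw_size(weights):
    """Size in bytes when every weight is stored as a 32-bit float."""
    return len(struct.pack(f"<{len(weights)}f", *weights))


def compressed_size(weights):
    """Size in bytes after lossless compression of the packed floats."""
    return len(zlib.compress(struct.pack(f"<{len(weights)}f", *weights), 9))


# Random stand-in weights, not taken from a real model.
rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(20000)]

sizes = {}
for amount in (0.0, 0.3, 0.5, 0.9):
    pruned = prune_smallest(weights, amount)
    sizes[amount] = compressed_size(pruned)
    print(amount, raw_size(pruned), sizes[amount])
```

The uncompressed size stays the same at every ratio because a zero still occupies four bytes, while the compressed size shrinks as more weights are zeroed. The exact numbers depend on the compressor and the weight distribution.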

The storage savings brought about by pruning are crucial for edge devices with limited storage capacity, such as drones and system-on-chip cameras. Larger model architectures that are typically associated with GPU deployments can potentially be integrated onto these edge devices for more accessibility and on-premises inference. It is also beneficial for other deployment environments that may require multiple models to be loaded on the same hardware.

Inference Speed

Model pruning can also reduce inference time, to the extent that the runtime can skip zeroed weights so that they no longer contribute to the computational cost of the model. In the graph below, the time taken for the model to perform inference on each image generally decreases as the pruning ratio increases. While this may not be the case for all models, most seem to follow this trend.

Graph Showing the Effects of Various Pruning Ratios on Model Inference Time Across Different Model Architectures

Accelerated inference speeds are critical in real-time applications with dynamic movements, such as fast-moving objects along a conveyor belt for product inspection, or crowd and traffic management in busy areas. Larger model architectures that undergo pruning can replace smaller existing models with similar inference speeds, but with potentially better accuracy.

Inference Performance

While pruning is generally beneficial for model compression and faster inference speeds, too much pruning can be detrimental. Though pruning aims to zero out unimportant weights, these weights may still contribute slightly to the model’s decision-making process. Higher pruning ratios may also inadvertently prune important weights. Both effects can degrade the model’s accuracy.

Based on the graph below, some models managed to retain their high performance despite a majority of their weights being zeroed out (e.g. Semantic Segmentation Models like DeepLabV3 MobileNetV3 and UNet ResNet50). However, there are still other models that can be greatly affected by high amounts of pruning (e.g. YOLOv8x, YOLOv8s-seg).

Graph Showing the Effects of Various Pruning Ratios on Model Inference Performance Across Different Model Architectures

* Model inference performance is measured using mAP@0.5 IoU for object detection, keypoint detection and instance segmentation models, while accuracy is used for semantic segmentation and classification models. Both values range from 0 to 1, where a higher value generally indicates better model performance.

Overall, there are significant memory and inference time savings associated with greater amounts of model pruning. Despite degraded inference performance being a potential consideration, choosing the right pruning ratios can help to mitigate this.

When Should You Prune Your Model?

Pruning models is particularly beneficial in deployment scenarios where computational resources are constrained or efficiency is critical.

  • Edge Devices: Deploying models on edge devices such as smartphones, IoT devices, or embedded systems often requires lightweight models due to limited computational resources, memory, and power constraints. Pruning can significantly reduce the model size and computational complexity, making it feasible to deploy on such devices with little loss in performance.

  • Real-Time Applications: In applications where low latency is crucial, such as real-time video analysis, autonomous vehicles, or speech recognition, pruning can help reduce the inference time of the model. By removing redundant parameters or connections, the pruned model requires fewer computations, leading to faster inference with little compromise in accuracy.

  • Cloud Services: Even in cloud-based deployment scenarios, where computational resources may be more abundant, pruning can still be beneficial for cost savings and scalability. Smaller models require fewer resources to deploy and maintain, leading to reduced infrastructure costs and improved scalability, especially in scenarios with high demand or elastic workloads.

  • Mobile Applications: Mobile applications often have limited storage space and processing power, making it challenging to deploy large models. Pruning allows developers to create more lightweight models that can be integrated into mobile apps without significantly impacting performance or user experience.

  • Embedded Systems: In scenarios where models are deployed on embedded systems for tasks such as industrial automation, robotics, or sensor data analysis, pruning can help optimize resource utilization and improve energy efficiency. This is critical for prolonging the battery life of battery-powered devices and reducing energy consumption in resource-constrained environments.

  • Bandwidth-Constrained Environments: In deployment scenarios where bandwidth is limited, such as remote locations or IoT deployments with intermittent connectivity, smaller models resulting from pruning require less data transmission during deployment and inference, leading to faster and more reliable communication.

How Much Should You Prune Your Model?

As with model training, there are good general principles, but the best approach is to establish standardized benchmarks or baselines and then experiment with various settings to determine what works best for you.

From the above charts, we can observe that model inference performance degrades steeply beyond the “safe zone” of 30% to 50% of parameters pruned. As such, a suggested starting point could be an initial pruning ratio of 30%.
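One way to turn this into a repeatable experiment is to sweep a list of pruning ratios and keep the largest one whose validation metric stays within a tolerance of the unpruned baseline. The sketch below assumes you supply an `evaluate(ratio)` function that prunes, exports, and scores your model. The `toy_evaluate` stand-in and its numbers are placeholders, not measurements.

```python
def pick_pruning_ratio(evaluate, ratios, max_drop):
    """Return the largest ratio whose metric stays within `max_drop` of the baseline.

    evaluate: callable mapping a pruning ratio to a validation metric (higher is better)
    ratios:   candidate pruning ratios to try
    max_drop: largest acceptable absolute drop from the unpruned metric
    """
    baseline = evaluate(0.0)
    best = 0.0
    for ratio in sorted(ratios):
        if baseline - evaluate(ratio) <= max_drop:
            best = ratio
        else:
            break  # stop at the first ratio that exceeds the allowed drop
    return best


def toy_evaluate(ratio):
    # Placeholder curve, NOT measured data: flat up to 50%, then falling off.
    return 0.80 if ratio <= 0.5 else 0.80 - 2.0 * (ratio - 0.5)


chosen = pick_pruning_ratio(toy_evaluate, [0.1, 0.3, 0.5, 0.7, 0.9], max_drop=0.01)
print(chosen)
```

Stopping at the first failing ratio assumes the metric does not recover at higher ratios, which matches the general trend in the charts but is worth verifying for your own model.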

Depending on your use-case, you may then subsequently perform the following:

High Performance Batch Jobs

For batched tasks requiring more precise predictions, you may reduce the pruning percentage further to improve prediction accuracy. Do note that this will increase the model file size and possibly the inference time. It is therefore important to find the level that trims as many weights as possible without any loss in validation metrics.

High Speed Inference

If your model must fit within a certain fixed size (e.g. 25 MB) to satisfy the requirements for deployment on an edge device, or if you want to reduce the time taken for a single inference, you can opt for the minimum level of pruning that achieves that target. Bear in mind that the accuracy of the model may be affected as the model size and inference time are reduced.
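For a fixed size budget, the search runs in the opposite direction: start from the lightest pruning and stop at the first ratio that fits. The sketch below assumes a `size_at(ratio)` function that returns the exported file size in bytes for a given pruning ratio. The `placeholder_size` model and its 40 MB starting size are invented for the example.

```python
def min_ratio_for_budget(size_at, budget_bytes, ratios):
    """Return the smallest pruning ratio whose exported size fits the budget, or None."""
    for ratio in sorted(ratios):
        if size_at(ratio) <= budget_bytes:
            return ratio
    return None


def placeholder_size(ratio):
    # Placeholder size model, NOT measured: a 40 MB file shrinking linearly with ratio.
    return 40_000_000 * (1.0 - 0.8 * ratio)


ratios = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
chosen = min_ratio_for_budget(placeholder_size, 25_000_000, ratios)
print(chosen)
```

Choosing the smallest ratio that meets the budget keeps as many weights as possible, which limits the accuracy cost. A result of `None` signals that pruning alone cannot reach the target and another technique may be needed.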

Pruning Models on Datature Nexus

Datature offers post-training model pruning as an advanced export option after training your model on Nexus. We focus on post-training pruning as an initial option to provide pruning compatibility for all new and existing models trained on Nexus. The model is pruned with magnitude-based unstructured pruning, and can be exported in any of the export formats that the model is compatible with (e.g. PyTorch, TensorFlow, ONNX, TFLite, CoreML).

To experience the model pruning feature on Datature Nexus, you will first need to train a model on Nexus. To learn how you can quickly get started and train your very first model, check out our five-minute tutorial, or explore how we trained and visualized a face detection model with Nexus.

Once that’s done, navigate to the Artifacts page where you can view the model checkpoints that have been saved. When you have chosen the model checkpoint you wish to export, click on the three dots (...) -> Export Artifact. The Artifacts Exports and Conversion card will show up, and a list of all available export formats will be shown.

Generating Advanced Export with Quantization and Pruning Options in the Artifacts Page on Nexus

To generate a pruned model, click on View Advanced Exports under the export format of your choice. You can select the percentage of weights to prune depending on your tradeoff requirements between model compression and inference accuracy. The pruning process may take up to 5 minutes depending on how large your model architecture is and the pruning percentage chosen.

Downloading Successfully Pruned Models

Once pruning has been completed, click on the Download Advanced Export button to save your model to your local filesystem. Alternatively, you can use our Python SDK to convert and download your model.

Validating Your Pruned Models

We can inspect the convolutional layers of a sample pruned ONNX model with a pruning ratio of 90% in Netron to verify that a large majority of the weights have been zeroed out. Furthermore, Datature provides evaluation scripts to validate pruned models’ performances on specific hardware architectures.

Visualization on Netron of Zeroed Weights In Model Convolutional Layers After Pruning
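The same check can be done programmatically by counting exact zeros per tensor. The sketch below works on a plain dictionary of flattened tensors with hypothetical names and values. Loading the tensors from an actual exported model (for example, from the initializers of an ONNX graph) is left out and would depend on your tooling.

```python
def sparsity_report(tensors):
    """Map each tensor name to the fraction of its entries that are exactly zero."""
    report = {}
    for name, values in tensors.items():
        flat = list(values)
        report[name] = sum(1 for v in flat if v == 0) / len(flat)
    return report


def overall_sparsity(tensors):
    """Fraction of zero entries across all tensors combined."""
    total = sum(len(list(v)) for v in tensors.values())
    zeros = sum(1 for v in tensors.values() for x in v if x == 0)
    return zeros / total


# Hypothetical flattened tensors standing in for a pruned layer.
tensors = {
    "conv1.weight": [0.0, 0.0, 0.0, 0.7],
    "conv1.bias": [0.1, -0.2],
}

report = sparsity_report(tensors)
print(report)
print(overall_sparsity(tensors))
```

In this example the weight tensor is 75% zeros while the bias is untouched, so the overall sparsity is lower than the per-layer figure. This mirrors why a 90% pruning ratio does not translate to exactly 90% zeros across the whole file.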

What About Quantization?

Model pruning typically goes hand-in-hand with model quantization, as both methods are known to be effective in reducing the model’s memory footprint and accelerating inference. Model pruning can be performed before the weights are quantized, and the effects of the two memory reduction techniques will stack. In other words, the compressed file size of a pruned and quantized model will be even smaller than that of a model with only one of the two techniques applied.

Table Showing the File Size Reduction When Pruning and Quantization Are Applied
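A small experiment shows why the two techniques stack. The sketch below prunes random stand-in weights, applies a simple symmetric per-tensor int8 quantization, and compares zlib-compressed sizes. The quantization scheme, the random weights, and the compressor are assumptions for illustration and are not necessarily what Nexus uses.

```python
import random
import struct
import zlib


def prune_smallest(weights, amount):
    """Zero the `amount` fraction of weights with the smallest magnitude."""
    k = int(len(weights) * amount)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_int8(weights):
    """Symmetric per-tensor quantization: q = round(w / scale), scale = max|w| / 127."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    return [int(round(w / scale)) for w in weights], scale


def compressed_float32(weights):
    """Compressed size in bytes of the weights stored as 32-bit floats."""
    return len(zlib.compress(struct.pack(f"<{len(weights)}f", *weights), 9))


def compressed_int8(codes):
    """Compressed size in bytes of the quantized codes stored as signed bytes."""
    return len(zlib.compress(struct.pack(f"<{len(codes)}b", *codes), 9))


# Random stand-in weights, not taken from a real model.
rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(20000)]
pruned = prune_smallest(weights, 0.5)

dense_codes, _ = quantize_int8(weights)
pruned_codes, _ = quantize_int8(pruned)

size_pruned_only = compressed_float32(pruned)
size_quantized_only = compressed_int8(dense_codes)
size_both = compressed_int8(pruned_codes)
print(size_pruned_only, size_quantized_only, size_both)
```

Because symmetric quantization maps 0.0 to the integer 0, every pruned weight stays zero after quantization, so the compressor benefits from both the narrower data type and the runs of zeros.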

Check out our in-depth article to learn more about post-training model quantization and how to quantize your models on Nexus.

Try It On Your Own Data

To get started with Model Pruning on Nexus, all you have to do is sign up for a Free Tier account, upload your images and annotations, train the model, and export it right away. Pruning is supported for classification, object detection, instance segmentation, and keypoint detection use cases.

What’s Next?

Model Pruning provides a simple and convenient way for users to compress their models and improve compatibility with edge devices, but this is just one step into the realm of model optimization. Datature is always looking to expand its capabilities to support other pruning modes such as Structural Pruning, as well as Train-time Pruning. You can also combine Pruning with our other model compression offerings, such as Post-Training Quantization.

Want to Get Started?

If you have questions, feel free to join our Community Slack to post your questions or contact us about how Model Pruning fits in with your usage. 

For more detailed information about the Model Pruning functionality, customization options, or answers to any common questions you might have, read more about the process on our Developer Portal.
