What is Model Quantization?
Quantization is the process of reducing the number of bits used to represent a number. In machine learning models, these numbers are the values stored in tensors such as weights and biases, which are typically 32-bit floating-point values (FLOAT32). Model quantization converts them into a lower-precision format, such as 16-bit floating-point values (FLOAT16, or half precision) or even 8-bit integers (INT8). Only the forward pass is supported for quantized operators.
Why is Model Quantization Important?
Lowering the precision and number of bits used to represent each value in a tensor provides a few significant benefits, typically to enhance compatibility with edge devices and hardware accelerators:
- Reduced Memory Footprint: Each weight and activation is now compressed, leading to a smaller memory footprint. This is particularly important for deploying models on edge devices with limited storage capacity.
- Faster Inference: Reduced precision speeds up arithmetic operations such as the matrix multiplications and convolutions in the model. This is crucial for achieving real-time inference on video streams and live camera feeds, or even for deploying heavyweight models on compute-limited edge devices.
- Energy Efficiency: Reduced computation translates directly to lower energy consumption, which extends the operating duration of battery-powered edge devices such as legged robots and drones.
How Can Model Quantization Be Achieved?
There are various methods of model quantization, which fall under two broad categories: Quantization-Aware Training (or Training-Time Quantization) and Post-Training Model Quantization.
Quantization-Aware Training (QAT)
QAT is a technique that is employed during model training to prepare the model for quantization. It bridges the gap between standard training with high-precision float values and the eventual reduced precision during deployment by introducing the quantization effects during the model training itself. Quantization errors are modeled in both the forward and backward passes using fake-quantization modules. This helps the model to learn representations that are more robust to the eventual reduction in precision.
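As a rough illustration of what a fake-quantization module does, the sketch below quantizes a value and immediately dequantizes it in the forward pass, so the value stays a float but carries the rounding and clipping error. For the backward pass it assumes a straight-through estimator (a common choice, though not necessarily what any particular framework uses). The function names, the scale of 0.05, and the sample weights are all hypothetical.

```python
def fake_quantize(x, scale, qmin=-128, qmax=127):
    """Forward pass: quantize, clamp, then dequantize back to a float."""
    q = round(x / scale)            # Python rounds ties to the nearest even integer
    q = max(qmin, min(qmax, q))     # clip to the signed 8-bit range
    return q * scale


def fake_quantize_grad(x, scale, upstream_grad, qmin=-128, qmax=127):
    """Backward pass (assumed straight-through estimator).

    The gradient passes through unchanged when x lies inside the
    representable range, and is zeroed when x was clipped.
    """
    inside = qmin * scale <= x <= qmax * scale
    return upstream_grad if inside else 0.0


scale = 0.05                        # hypothetical scale for this tensor
weights = [0.12, -0.48, 3.1, 9.0]   # 9.0 falls outside the range and is clipped

simulated = [fake_quantize(w, scale) for w in weights]
grads = [fake_quantize_grad(w, scale, 1.0) for w in weights]

for w, s, g in zip(weights, simulated, grads):
    print(f"w={w:6.2f}  fake-quantized={s:7.3f}  grad={g}")
```

Because the loss is computed on the fake-quantized values, the optimizer sees the effect of the reduced precision during training and can move the weights toward values that survive rounding.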
However, QAT is a non-trivial technique that increases training complexity by a large margin. The introduction of additional operations and adjustments to the loss function during the training process requires careful implementation and tuning. The model must be modified to simulate the effects of quantization accurately, which involves incorporating quantization-aware layers and ensuring compatibility with the chosen quantization scheme. This heightened complexity can make the training process more computationally intensive and may require additional time and computational resources. Furthermore, the need for thorough validation and fine-tuning to mitigate potential accuracy loss adds another layer of intricacy to the training pipeline.
Post-Training Model Quantization (PTQ)
PTQ applies the reduced precision after the initial training of a model. In contrast to QAT, PTQ does not consider the effects of model quantization during the training process but focuses on compressing the pre-trained model for deployment on resource-constrained devices. It is generally considered less complex compared to QAT since it doesn't require adjustments to the training process.
However, PTQ can also result in a loss of model accuracy, since some of the information encoded in the floating-point values can be lost when they are rounded to a lower precision. Hence, achieving the right balance between model size reduction and retained accuracy may still involve careful calibration and evaluation, particularly in applications where maintaining high precision is critical.
Which Model Quantization Format Should You Choose?
There are a variety of model quantization parameters that can be customized, such as precision values and quantization methods. Each method has its own merits, and is typically optimized for specific deployment environments and requirements.
FLOAT16 quantization reduces model size by up to half, since all weights become half of their original size. Since the weight values are still represented with adequate precision, you should observe a minimal loss in accuracy. FLOAT16 quantization is useful in GPU deployment scenarios, since GPUs can operate directly on that level of precision, resulting in faster execution than the original FLOAT32 computations.
However, there will be minimal inference speed gain when deploying on a CPU, since most CPUs are not designed to operate on FLOAT16 data. In this scenario, the weights are dequantized back to FLOAT32 at inference time.
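To get a feel for how little precision is lost at FLOAT16, the following sketch round-trips a few values through IEEE 754 half precision using Python's standard `struct` module (format code `e`). The sample values are hypothetical and the helper name is ours.

```python
import struct


def to_float16_and_back(x):
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


weights = [0.1, -1.2345678, 3.14159265, 1000.123]  # hypothetical weight values
for w in weights:
    w16 = to_float16_and_back(w)
    print(f"{w:>12.7f} -> {w16:>12.7f}  (abs error {abs(w - w16):.2e})")

# Storage cost per value, in bytes.
bytes_fp32 = struct.calcsize("<f")
bytes_fp16 = struct.calcsize("<e")
print(f"FLOAT32: {bytes_fp32} bytes, FLOAT16: {bytes_fp16} bytes")
```

Small values keep several significant digits, while larger magnitudes lose more absolute precision because half precision has only a 10-bit mantissa.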
In INT8 quantization, the range of floating-point values of a tensor is mapped to a quantized (integer) range, specifically [-128, 127] for signed 8-bit integers. To generate this mapping, it is necessary to first identify the range of values, so that the maximum and minimum observed values can be mapped directly to the maximum and minimum values of the quantized range respectively, with all floating-point values in between linearly interpolated. The mapping is generally approximated using the formulae below (the definition can vary slightly across different model frameworks such as PyTorch, TFLite, ONNX, and CoreML).
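One widely used form of this mapping is the affine scheme written out here. It is given as an assumption rather than the exact definition used by any one framework. Here $x$ is a floating-point value, $q$ its quantized integer, $\hat{x}$ the dequantized approximation, $[x_{\min}, x_{\max}]$ the observed floating-point range, and $[q_{\min}, q_{\max}]$ the integer range.

$$
\text{scale} = \frac{x_{\max} - x_{\min}}{q_{\max} - q_{\min}}, \qquad
\text{zero\_point} = \operatorname{round}\!\left(q_{\min} - \frac{x_{\min}}{\text{scale}}\right)
$$

$$
q = \operatorname{clamp}\!\left(\operatorname{round}\!\left(\frac{x}{\text{scale}}\right) + \text{zero\_point},\; q_{\min},\; q_{\max}\right), \qquad
\hat{x} = \text{scale} \cdot \left(q - \text{zero\_point}\right)
$$

With these definitions, $x_{\min}$ maps to $q_{\min}$ and $x_{\max}$ maps to $q_{\max}$, up to rounding.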
The `scale` value is a constant factor that scales all floating-point values down to within the [-128, 127] quantized range. It is also used in the dequantization function to retrieve the original floating-point values from the quantized values (with some slight rounding errors). It is typically applied on a per-channel basis rather than a per-tensor basis, meaning that each output channel of a tensor has its own scale value, rather than a single scale value being shared by the whole tensor. This can potentially reduce the overall quantization error.
The `zero_point` value provides a range offset, which can be used to eliminate negative values if the chosen data type is UINT8 instead of its signed counterpart. For example, a zero_point of 128 shifts the quantized range to [0, 255], with negative floating-point numbers mapped within [0, 127] and non-negative floating-point numbers mapped within [128, 255]. Typical implementations of INT8 quantization leverage symmetric quantization, which constrains `zero_point` to be equal to 0 and therefore keeps the quantized range signed, at [-128, 127]. This simplifies both the quantization and dequantization operations.
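The roles of `scale` and `zero_point` can be sketched in a few lines of plain Python. This is a minimal per-tensor illustration under the common affine formulation, not the exact definition used by any specific framework; the function names and sample values are hypothetical. The symmetric branch uses the largest magnitude to set the scale, which in effect uses only [-127, 127] of the signed range.

```python
def compute_qparams(x_min, x_max, q_min, q_max, symmetric=False):
    """Return (scale, zero_point) mapping [x_min, x_max] onto [q_min, q_max]."""
    if symmetric:
        # zero_point is fixed at 0; the largest magnitude maps to q_max.
        max_abs = max(abs(x_min), abs(x_max))
        return max_abs / q_max, 0
    scale = (x_max - x_min) / (q_max - q_min)
    zero_point = round(q_min - x_min / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, q_min, q_max):
    q = round(x / scale) + zero_point
    return max(q_min, min(q_max, q))    # clamp to the integer range


def dequantize(q, scale, zero_point):
    return scale * (q - zero_point)


values = [-2.0, -0.5, 0.0, 1.5, 6.0]    # hypothetical tensor values

# Asymmetric UINT8: the offset shifts everything into [0, 255].
scale_u8, zp_u8 = compute_qparams(min(values), max(values), 0, 255)
q_u8 = [quantize(v, scale_u8, zp_u8, 0, 255) for v in values]

# Symmetric INT8: zero_point is 0, so 0.0 maps exactly to 0.
scale_i8, zp_i8 = compute_qparams(min(values), max(values), -128, 127, symmetric=True)
q_i8 = [quantize(v, scale_i8, zp_i8, -128, 127) for v in values]
restored = [dequantize(q, scale_i8, zp_i8) for q in q_i8]

print("UINT8:", q_u8, "zero_point =", zp_u8)
print("INT8: ", q_i8, "zero_point =", zp_i8)
print("restored:", [round(r, 4) for r in restored])
```

Note that the symmetric variant wastes part of the integer range when the data is skewed (here the negative side only reaches -2.0), which is the trade-off for the simpler arithmetic.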
INT8 quantization is typically used for deploying on edge devices, since the 8-bit precision reduces model size, computation, and inference latency much more than FLOAT16. However, this method does come with more drawbacks, including potentially larger accuracy drops and added quantization complexity.
Weight-Only Quantization
In weight-only quantization, model weights are statically quantized from FLOAT32 to INT8 at conversion time, while other floating-point tensors such as activations, inputs, and outputs remain in FLOAT32. Though the inference latency can be significantly reduced due to the integer weight computations, additional quantize and dequantize nodes still need to be included to be compatible with the floating-point tensors.
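A minimal sketch of this idea for a single linear layer is shown below. It assumes symmetric per-tensor INT8 weights and models the dequantize node as a multiplication by the scale before a floating-point dot product; real runtimes may fuse or reorder these steps. All names and values are hypothetical.

```python
def quantize_weights(weights):
    """Symmetric per-tensor INT8 quantization of a weight matrix (list of rows)."""
    max_abs = max(abs(w) for row in weights for w in row)
    scale = max_abs / 127
    q = [[max(-128, min(127, round(w / scale))) for w in row] for row in weights]
    return q, scale


def linear_weight_only(x, q_weights, scale):
    """Dequantize the stored INT8 weights, then compute the layer in floating point."""
    out = []
    for q_row in q_weights:
        w_row = [q * scale for q in q_row]               # dequantize node
        out.append(sum(xi * wi for xi, wi in zip(x, w_row)))
    return out


weights = [[0.20, -0.50, 0.10], [1.00, 0.30, -0.70]]     # hypothetical FLOAT32 weights
x = [1.0, 2.0, 3.0]                                      # activations stay in float

q_weights, w_scale = quantize_weights(weights)           # done once, at conversion time
y_quant = linear_weight_only(x, q_weights, w_scale)
y_float = [sum(xi * wi for xi, wi in zip(x, row)) for row in weights]

print("INT8 weights:", q_weights)
print("float output:", y_float)
print("quantized output:", y_quant)
```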
Dynamic Range Quantization
Dynamic range quantization is a recommended starting point for integer quantization as it is a simpler pipeline and requires fewer steps than full-integer quantization, while providing latency improvements close to the latter method. Similar to weight-only quantization, the model weights are statically quantized from FLOAT32 to INT8 at conversion time, while other tensors, such as the activations, are left in FLOAT32. The key difference between dynamic range quantization and weight-only quantization is that the activations are dynamically quantized to INT8 during runtime (hence the name), since the feed-forward pass during inference provides a way to estimate the range of values for the activations, which is something that cannot be directly estimated during the conversion process. This means that computations involving both weights and activations are performed in INT8, which further boosts the inference speed as compared to weight-only quantization. However, the activations themselves are still stored in FLOAT32, which means the model size is not fully compressed. Furthermore, the input and output tensors are still required to be stored and computed in FLOAT32, hence the model still loses out on some optimizations by having to perform some dequantization steps.
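The difference from weight-only quantization can be sketched as follows: the activation range is measured on the fly for each input, the multiply-accumulate runs on integers, and the result is rescaled to a float at the end. This is a simplified per-tensor, symmetric illustration with hypothetical names and values, not a description of how any specific runtime implements it.

```python
def quantize_symmetric(values):
    """Symmetric INT8 quantization of a flat list; returns (int_values, scale)."""
    max_abs = max(abs(v) for v in values) or 1.0     # avoid a zero scale for all-zero input
    scale = max_abs / 127
    return [max(-128, min(127, round(v / scale))) for v in values], scale


def linear_dynamic(x, q_weights, w_scale):
    """Quantize activations at runtime, accumulate in integers, rescale to float."""
    q_x, x_scale = quantize_symmetric(x)             # range estimated from this input
    out = []
    for q_row in q_weights:
        acc = sum(qx * qw for qx, qw in zip(q_x, q_row))   # integer multiply-accumulate
        out.append(acc * x_scale * w_scale)                # dequantize the accumulator
    return out


# Weights are quantized once, at conversion time.
weights = [[0.20, -0.50, 0.10], [1.00, 0.30, -0.70]]
flat = [w for row in weights for w in row]
q_flat, w_scale = quantize_symmetric(flat)
q_weights = [q_flat[0:3], q_flat[3:6]]

x = [1.0, 2.0, 3.0]                                  # FLOAT32 activations entering the layer
y_dynamic = linear_dynamic(x, q_weights, w_scale)
y_float = [sum(xi * wi for xi, wi in zip(x, row)) for row in weights]

print("float output:  ", y_float)
print("dynamic output:", y_dynamic)
```

The output is still a float, which reflects the point above that inputs and outputs remain in FLOAT32 under this scheme.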
Full-Integer Quantization (Static Quantization)
Certain integer-only devices such as 8-bit microcontrollers and the Coral Edge TPU can only support models that have purely integer values. Full-integer quantization helps to provide compatibility with these devices by converting ALL floating-point tensors, including inputs and outputs, into integers, but at the added cost of complexity. This is because the range of values of variable tensors, such as inputs, outputs, and activations, needs to be determined by running computations with input data, and cannot be statically derived during the model quantization process. Hence, the model needs to run a few feed-forward inference cycles using a representative dataset for calibration (typically a small subset of around 100-500 samples of the validation set). By observing the values that the variable tensors take on through multiple computations, we can get a fairly decent estimate on what these minimum and maximum values might be to be used for the quantized range mapping.
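The calibration step can be pictured as an observer that records the smallest and largest activation seen while the representative dataset is fed through the model. The sketch below is a toy version: `MinMaxObserver`, `toy_layer`, and the randomly generated dataset are hypothetical stand-ins, and min/max tracking is only one of several possible calibration strategies.

```python
import random


class MinMaxObserver:
    """Tracks the running minimum and maximum of every value it is shown."""

    def __init__(self):
        self.x_min = float("inf")
        self.x_max = float("-inf")

    def observe(self, values):
        self.x_min = min(self.x_min, min(values))
        self.x_max = max(self.x_max, max(values))

    def qparams(self, q_min=-128, q_max=127):
        # Include 0.0 in the range so that zero is exactly representable.
        lo, hi = min(self.x_min, 0.0), max(self.x_max, 0.0)
        if hi == lo:
            return 1.0, 0                    # degenerate case: nothing but zeros observed
        scale = (hi - lo) / (q_max - q_min)
        zero_point = round(q_min - lo / scale)
        return scale, zero_point


def toy_layer(sample):
    """Hypothetical stand-in for a real layer: a linear map followed by ReLU."""
    return [max(0.0, 0.8 * v + 0.1) for v in sample]


random.seed(0)
# Stand-in for a representative dataset of 200 samples with 16 features each.
representative_dataset = [
    [random.uniform(-1.0, 1.0) for _ in range(16)] for _ in range(200)
]

observer = MinMaxObserver()
for sample in representative_dataset:
    observer.observe(toy_layer(sample))      # feed-forward pass, recording activations

act_scale, act_zero_point = observer.qparams()
print(f"observed range: [{observer.x_min:.4f}, {observer.x_max:.4f}]")
print(f"scale={act_scale:.6f}, zero_point={act_zero_point}")
```

The resulting `scale` and `zero_point` are then fixed in the exported model, which is why the representative dataset should resemble the data seen in deployment.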
Since all floating-point tensors are converted to integers, the largest size and inference optimizations can usually be observed using this method. Theoretically, the model size should be reduced by 50% compared to FLOAT16 storage precision since the total number of bits is halved; in reality, the exact compression ratio will be less than 2 since extra memory is required to be allocated to store the per-channel scale values as previously mentioned.
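The arithmetic behind that compression ratio can be worked through with a small calculation. The parameter and channel counts below are hypothetical, and the sketch assumes one FLOAT32 scale value per output channel and ignores any other metadata stored in the model file.

```python
def model_size_bytes(num_weights, bits_per_weight, num_channels=0, bytes_per_scale=4):
    """Weight storage plus optional per-channel scale values, in bytes."""
    return num_weights * bits_per_weight // 8 + num_channels * bytes_per_scale


num_weights = 3_000_000      # hypothetical parameter count
num_channels = 10_000        # hypothetical total number of output channels

fp32 = model_size_bytes(num_weights, 32)
fp16 = model_size_bytes(num_weights, 16)
int8 = model_size_bytes(num_weights, 8, num_channels)

ratio_vs_fp16 = fp16 / int8
ratio_vs_fp32 = fp32 / int8

print(f"FLOAT32: {fp32 / 1e6:.2f} MB")
print(f"FLOAT16: {fp16 / 1e6:.2f} MB")
print(f"INT8:    {int8 / 1e6:.2f} MB (including per-channel scales)")
print(f"INT8 compression vs FLOAT16: {ratio_vs_fp16:.3f}x, vs FLOAT32: {ratio_vs_fp32:.3f}x")
```

With these numbers the scale values add only about 1% to the INT8 model, so the ratio stays close to, but below, the theoretical 2x relative to FLOAT16.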
The table below summarizes some of the pros and cons of each method, which should hopefully help you to decide which is most suitable based on your requirements and deployment setup.
How Can You Quantize Models on Datature Nexus?
Nexus currently offers post-training dynamic quantization for both FLOAT16 and INT8 for YOLOv8 models. The two supported model frameworks, TFLite and CoreML, are optimized for edge devices such as microcontrollers and iOS devices respectively.
To begin your model quantization journey, train a model on Nexus. Once that’s done, navigate to the Artifacts page where you can view the model checkpoints that have been saved. When you have chosen the model checkpoint you wish to export, click on the three dots (...) -> Export Artifact. You should see export options for TFLite and CoreML, together with other formats such as TensorFlow, PyTorch, and ONNX.
If you wish to export the original model in FLOAT32 precision, simply click on the Generate button next to your preferred model format. If a quantized model is desired, click on the View Advanced Exports button to bring up the advanced export options menu. Under the Quantization section, you can select the model quantization precision, FLOAT16 or INT8. Finally, click on Generate Advanced Export to export your quantized model.
Once your model has successfully been quantized and exported (this may take up to 15 minutes), click on the Download Advanced Export button to save your model to your local filesystem. Alternatively, you can use our Python SDK to convert and download your model.
Validating Your Quantized Models
We can inspect the convolutional layers of a sample quantized CoreML model on Netron to verify that the weights are in INT8 precision.
Furthermore, Datature provides test scripts to validate the performance of quantized models on specific hardware architectures. From our own benchmarking tests on YOLOv8 Nano for object detection and classification tasks, a common pattern observed is a reduction in model size for both FLOAT16 and INT8, as well as a slight degradation in inference accuracy for INT8 as shown in the table below. Inference speeds are omitted as they are highly dependent on the hardware architecture used (for example, running an INT8 model on a CPU is almost twice as slow as running a FLOAT32 model, since CPUs are not designed to run INT8 operations, and extra computations have to be performed to quantize and dequantize between the two precisions at every step).
The key savings lie in the compression of the model size by almost 75% when quantized to INT8. This not only enables larger models to fit onto storage-constrained devices, but also greatly reduces overhead when loading the model into device memory, all with a negligible dip in prediction accuracy.
Try It On Your Own Data
To get started with model quantization on Nexus, all you have to do is sign up for a Free Tier account, upload your images and annotations, train the model, and export it right away. Model quantization is supported for classification, object detection, instance segmentation, and keypoint detection use cases across our range of model architectures. To view the specific model frameworks that support model quantization, check out our Developer Docs.
Post-training dynamic quantization provides a simple and convenient way for users to compress their models and improve compatibility with edge devices, but this is just a first step into the realm of model optimization. Datature is always looking to expand its capabilities to support other model quantization modes such as Static Quantization and Quantization-Aware Training, as well as other optimization techniques like model pruning.
Want to Get Started?
If you have questions, feel free to join our Community Slack to post your questions or contact us about how Model Quantization fits in with your usage.
For more detailed information about the Model Quantization functionality, customization options, or answers to any common questions you might have, read more about the process on our Developer Portal.