Demystifying Quantization

Do you need high bit-precision to run neural networks? More specifically, do you need a high bit-precision processor to run neural network inference at the same level of fidelity at which you trained the model? These beliefs are popular, but they are myths.

A large part of the confusion comes from the black-box nature of AI (i.e., its lack of explainability or interpretability), which feeds the notion that neural networks must operate at significant bit-precision in order to work as designed. After all, how can a neural network even work if you remove bits of information from it?

The purpose of this article is to provide some insight into the frameworks and tooling available to ML/AI developers for deploying edge AI solutions.

The Essence of Quantization

Quantization is a key element of efficient deep neural network processing. In its most basic form, it consists of algorithms that find the best mapping between floating-point and integer representations. An 8-bit quantization promises a 4x reduction in model size, and in fact, simply going from a 32-bit floating-point representation (FP32) to an 8-bit integer representation (INT8) yields that file-size reduction automatically.
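To make the 4x figure concrete, here is a minimal sketch in Python/NumPy. The tensor size is hypothetical and the snippet only compares storage footprints; it does not perform any actual quantization.

```python
# Minimal sketch: storage footprint of the same number of weights in FP32 vs INT8.
# The weight count is illustrative, not taken from any particular model.
import numpy as np

fp32_weights = np.random.randn(1_000_000).astype(np.float32)  # 32 bits per weight
int8_weights = np.zeros_like(fp32_weights, dtype=np.int8)     # 8 bits per weight

print(fp32_weights.nbytes)                        # 4,000,000 bytes
print(int8_weights.nbytes)                        # 1,000,000 bytes
print(fp32_weights.nbytes / int8_weights.nbytes)  # 4.0 -> the "automatic" 4x reduction
```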

Quantization is fundamentally an exercise in trading off dynamic range against precision. FP32 offers plenty of both, and that range and precision must be mapped onto a much smaller INT8 space. A good quantization algorithm minimizes the accuracy degradation the neural network suffers when moving from FP32 to INT8.
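The following NumPy sketch illustrates the tradeoff under a simple symmetric mapping (an assumption made here for brevity): the wider the dynamic range the INT8 grid must cover, the coarser each quantization step becomes.

```python
# Minimal sketch of the range/precision tradeoff for an INT8 grid.
import numpy as np

def int8_step_size(x):
    # One INT8 step in float units when covering [-max|x|, +max|x|] with 256 levels.
    return 2 * np.abs(x).max() / 255.0

weights = np.random.randn(4096).astype(np.float32)   # typical range roughly [-4, 4]
with_outlier = np.append(weights, np.float32(40.0))  # a single large value widens the range

print(int8_step_size(weights))        # small step -> fine precision
print(int8_step_size(with_outlier))   # ~10x larger step -> coarser precision everywhere
```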

Eliminating Misconceptions

There is no single quantization algorithm that works well for all possible cases, because the optimal solution depends on the neural network architecture and the training data. In a recent presentation on “Adaptive AI for a Smarter Edge” hosted by tinyML.org, I present a number of algorithms, including symmetric, asymmetric, and logarithmic. Each algorithm has benefits related to how it maintains accuracy when trading off range and precision; the sketch below contrasts the three mappings.

Symmetric, asymmetric, and logarithmic quantization mappings
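Here is a minimal NumPy sketch of the three mappings. The function names and the simple min/max calibration are illustrative; production tools use more careful calibration, clipping, and per-channel ranges.

```python
# Minimal sketch contrasting symmetric, asymmetric, and logarithmic quantization.
import numpy as np

def symmetric_int8(x):
    # Symmetric: zero maps to zero; scale set by the largest magnitude.
    scale = np.abs(x).max() / 127.0
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8), scale

def asymmetric_int8(x):
    # Asymmetric: a zero-point shifts the grid so the full [min, max] range is used.
    scale = (x.max() - x.min()) / 255.0
    zero_point = np.round(-x.min() / scale) - 128
    q = np.clip(np.round(x / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point

def log2_quant(x, bits=4):
    # Logarithmic: store sign plus a power-of-two exponent, so multiplies become shifts.
    exp = np.clip(np.round(np.log2(np.abs(x) + 1e-12)),
                  -2 ** (bits - 1), 2 ** (bits - 1) - 1)
    return np.sign(x), exp.astype(np.int8)   # reconstruct as sign * 2**exp

x = np.random.randn(16).astype(np.float32)
print(symmetric_int8(x)[0])
print(asymmetric_int8(x)[0])
sign, exp = log2_quant(x)
print(sign * np.exp2(exp))   # power-of-two approximation of x
```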

Not All Quantization Algorithms Are Equal

When a tool provides post-training quantization (PTQ) support, don’t take it for granted that it will deliver the results you want. For example, MobileNet is a class of neural networks for object classification that is notoriously difficult to quantize with asymmetric quantization. Newer approaches have been developed, such as channel-based quantization, but at a cost in inference speed.
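To make “channel-based” concrete, here is a minimal NumPy sketch comparing per-tensor and per-channel scales for a convolution weight tensor. The shapes and the injected wide channel are hypothetical; the point is that one badly behaved channel can inflate a single per-tensor scale for everyone.

```python
# Minimal sketch: per-tensor vs. per-channel (channel-based) quantization scales.
import numpy as np

w = np.random.randn(32, 3, 3, 3).astype(np.float32)  # (out_channels, in_channels, kH, kW)
w[7] *= 20.0                                          # one channel with a much wider range

# Per-tensor: one scale for the whole tensor; the wide channel inflates the step size.
per_tensor_scale = np.abs(w).max() / 127.0

# Per-channel: one scale per output channel; narrow channels keep fine precision.
per_channel_scale = np.abs(w).reshape(32, -1).max(axis=1) / 127.0

print(per_tensor_scale)        # dominated by channel 7
print(per_channel_scale[:8])   # most channels get a much smaller step
```

The extra bookkeeping (one scale per channel instead of one per tensor) is part of why this approach can cost inference speed.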

Similarly, quantization-aware training (QAT) techniques are not all the same. For example, Google’s recently announced QAT support includes basic algorithms, but there are advances in QAT that can better guide training convergence and optimization. A very naïve QAT approach is equivalent to subjecting the neural network to “periodic amnesia” as you train and quantize iteratively. This greatly increases training time, and in the worst case the model will not converge or generalize.
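For context, the sketch below shows the “fake quantization” step commonly used in QAT: the forward pass sees quantized weights, while gradients flow through as if the rounding were the identity (a straight-through estimator). This is a generic illustration, not Google’s or Latent AI’s specific implementation.

```python
# Minimal sketch of fake quantization as used in quantization-aware training.
import numpy as np

def fake_quantize(w, bits=8):
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    w_q = np.clip(np.round(w / scale), -qmax, qmax) * scale  # quantize, then dequantize
    # Forward pass uses w_q; backward treats d(w_q)/dw as 1 (straight-through estimator).
    return w_q

w = np.random.randn(64).astype(np.float32)
w_q = fake_quantize(w)
print("quantization noise seen during training:", np.abs(w - w_q).max())
```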

How Latent AI Helps

Latent AI provides tools with state-of-the-art PTQ and QAT algorithms. We have shown that our PTQ algorithm can maintain accuracy and inference speed for the most difficult neural networks, such as MobileNet. We have also shown high QAT accuracy on MobileNet_SSD networks using knowledge distillation.

Latent AI Quantization Aware Training (logarithmic power-of-two at an effective 4-bit precision)

Quantization allows product managers and developers to dream up new use cases that were previously unimaginable with cloud-based support.

What would you like to build?  Contact us at [email protected] with your use case so that we can help you create the future you envision.

Have a few minutes?  Take our Designing for TinyML Survey and help us and the TinyML community with our research to better understand your needs and requirements.

Photo Credits:  Latent AI, Inc.

