Improving Quantization Performance in TVM
This article examines ways to enhance quantization in deep learning models using TVM.
― 5 min read
Quantization is a method used in deep learning to make models smaller and faster. It works by changing how numbers are stored in a model: instead of the usual 32-bit floating-point values, the model uses 8-bit integers. This reduces the amount of memory needed and speeds up the calculations without hurting accuracy too much.
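As a rough illustration of the idea (not TVM's internal implementation), the sketch below maps a float32 tensor onto int8 values with a single scale; the function names are made up for this example.

    import numpy as np

    def quantize_int8(x):
        # Map a float32 tensor onto the int8 range [-128, 127] with one scale.
        scale = np.abs(x).max() / 127.0
        q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        # Recover an approximate float32 tensor from the int8 values.
        return q.astype(np.float32) * scale

    x = np.random.randn(4, 4).astype(np.float32)
    q, s = quantize_int8(x)
    print(x.nbytes, q.nbytes)                   # 64 bytes vs 16 bytes
    print(np.abs(x - dequantize(q, s)).max())   # small rounding error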
What is TVM?
TVM, or Tensor Virtual Machine, is an open-source deep learning compiler that runs models on many types of hardware. It is designed to help machine learning developers run their models efficiently on various devices. TVM optimizes performance at two main levels: the first works on the graph, restructuring how data flows through the model, while the second improves how individual operators are scheduled and how they use memory.
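A minimal sketch of how a model is typically brought into TVM and compiled with its Python API; the model file name, input name, and shapes below are assumptions for illustration, not details from the paper.

    import onnx
    import tvm
    from tvm import relay
    from tvm.contrib import graph_executor

    # Import a model into Relay, TVM's graph-level intermediate representation.
    onnx_model = onnx.load("resnet18.onnx")       # hypothetical model file
    shape_dict = {"data": (1, 3, 224, 224)}       # assumed input name and shape
    mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

    # Graph-level passes run inside relay.build; operator schedules are then
    # lowered for the chosen target.
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target="llvm", params=params)

    dev = tvm.cpu()
    module = graph_executor.GraphModule(lib["default"](dev))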
The Challenge with Quantization in TVM
Even though many studies have shown that quantization reduces both the time and memory needed to run a model, we found that quantized models in TVM often do not perform as well as expected. With 8-bit quantization, people typically expect a model to run about twice as fast as with 32-bit numbers. In TVM, however, the quantized version could actually run slower, taking nearly twice as long as the full-precision version.
In our research, we looked closely at why this was happening and found ways to make 8-bit quantization work better in TVM. We focused on two different types of tasks: those where calculations are the main limit (computation-bound) and those where memory use is the main limit (memory-bound).
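TVM exposes this 8-bit path through its relay.quantize module. Below is a minimal sketch of applying it to the Relay module from the compilation example above; the calibration settings shown are illustrative defaults, not the exact configuration used in this work.

    from tvm import relay

    # Quantize the Relay module to 8-bit: weights and activations become int8,
    # while accumulation stays in a wider integer type.
    with relay.quantize.qconfig(calibrate_mode="global_scale", global_scale=8.0):
        qmod = relay.quantize.quantize(mod, params)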
Types of Tasks
Computation-Bound Tasks
Computation-bound tasks are those in which the main limit is the amount of computing power needed. They usually involve heavy calculations, such as matrix multiplications or convolutions. Many machine learning workloads fall into this category because making predictions or training a model requires significant computational effort. For instance, when running a model with a batch size of one, quantization can help because 8-bit arithmetic demands fewer resources than the standard 32-bit operations.
Some hardware is better at performing lower-precision calculations than higher ones. This means that calculations using 8-bit numbers can often be done faster than those with 32-bit numbers.
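The usual pattern in low-precision kernels, sketched here with NumPy rather than actual TVM schedules, is to multiply 8-bit operands and accumulate the results in 32-bit integers so the dot products cannot overflow.

    import numpy as np

    a = np.random.randint(-128, 128, size=(64, 64), dtype=np.int8)
    b = np.random.randint(-128, 128, size=(64, 64), dtype=np.int8)

    # Widen to int32 before the matmul so accumulation cannot overflow;
    # hardware int8 dot-product instructions do this widening implicitly.
    c = a.astype(np.int32) @ b.astype(np.int32)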
Memory-Bound Tasks
Memory-bound tasks are different because they are limited by how much data can be moved to and from memory quickly. Large inputs can cause delays as the system struggles to transfer data between memory and the processor. This issue tends to appear more with larger batch sizes (e.g., 8, 64, or 256) since these require more memory for input data and computations.
By reducing the size of the numbers from 32 bits to 8 bits with quantization, we can save a lot of memory. This smaller size means less data needs to move back and forth between memory and the processor, which can help improve speed.
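A back-of-the-envelope calculation makes the bandwidth argument concrete; the activation shape below is just an example, not a measurement from the paper.

    # A batch of 256 feature maps of shape 64 x 56 x 56:
    elements = 256 * 64 * 56 * 56
    fp32_bytes = elements * 4      # ~205 MB to move in float32
    int8_bytes = elements * 1      # ~51 MB in int8, a 4x cut in traffic
    print(fp32_bytes / 1e6, int8_bytes / 1e6)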
Setting Up Experiments
To see how quantization affects performance, we looked at ResNet18 compiled by TVM. We ran this model on a system with an 8-core CPU and ample memory. In our experiments, we tested different numeric precisions and data layouts to see how they impacted performance. Each test ran the model repeatedly and averaged the measured inference time.
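Continuing the compilation sketch above, TVM's runtime provides a time_evaluator helper that reruns the compiled graph and averages the latency, which matches the repeated-run methodology described here; the repeat counts below are illustrative.

    import numpy as np
    import tvm

    dev = tvm.cpu()
    module.set_input("data", np.random.randn(1, 3, 224, 224).astype("float32"))

    # time_evaluator runs the compiled graph many times and averages the latency.
    timer = module.module.time_evaluator("run", dev, number=10, repeat=30)
    print(timer().mean * 1000, "ms")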
Fixing Performance Issues
During our testing, we found that quantization was making the model run slower than it should. After examining the setup, we identified a bug that was causing the quantized model to underperform. Once we fixed this bug, the quantized model started to show better performance.
We also discovered that TVM has different executors for running models. The graph executor is suited to static models with a fixed set of operations, while the virtual machine (VM) executor is meant for dynamic models whose shapes or control flow can change. For our experiments, we switched to the graph executor, which allowed us to optimize the quantized model better. After this change, we saw a significant improvement in performance.
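A sketch of selecting each executor in TVM's Python API, applied to the quantized module from the earlier example; the target and options are assumptions for illustration.

    import tvm
    from tvm import relay
    from tvm.contrib import graph_executor
    from tvm.runtime import vm as vm_rt

    dev = tvm.cpu()

    # Graph executor: the whole graph and all shapes are fixed ahead of time.
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(qmod, target="llvm")   # params were folded in by quantize
    static_module = graph_executor.GraphModule(lib["default"](dev))

    # VM executor: supports dynamic shapes and control flow, at some runtime cost.
    with tvm.transform.PassContext(opt_level=3):
        vm_exec = relay.vm.compile(qmod, target="llvm")
    dynamic_module = vm_rt.VirtualMachine(vm_exec, dev)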
Analyzing Computation-Bound Performance
With the bug fixed and the right executor in place, we looked into further improving performance for computation-bound tasks. We focused on optimizing the convolutions in our model, as they require a lot of calculations.
However, we learned that not all optimization strategies work together well. Different settings in TVM lead to different performance outcomes because some strategies are already fine-tuned for certain tasks. The improvements vary based on how well the specific setup and schedules fit the tasks being run.
For example, spatial packing is a technique that speeds up memory access by changing how data is laid out in memory. The goal is to arrange the data so the hardware can reach it with contiguous, vectorizable loads, which can lead to a significant increase in computation speed.
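Conceptually, spatial packing rearranges a tensor so that a small block of channels sits contiguously in memory (as in TVM's NCHW[x]c layouts). Below is a NumPy sketch of the reshuffle, with an arbitrarily chosen block size for illustration.

    import numpy as np

    def pack_nchw_to_nchwc(x, block=4):
        # Split the channel axis into (C // block, block) and move the small
        # block innermost, so each vector load touches contiguous memory.
        n, c, h, w = x.shape
        assert c % block == 0
        return x.reshape(n, c // block, block, h, w).transpose(0, 1, 3, 4, 2)

    x = np.random.randn(1, 16, 8, 8).astype(np.float32)    # NCHW
    packed = pack_nchw_to_nchwc(x)                          # NCHW4c-style layout
    print(packed.shape)                                     # (1, 4, 8, 8, 4)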
Analyzing Memory-Bound Performance
In addition to the performance benefits from better computations, quantization also helps with memory use. By using 8-bit integers instead of 32-bit floating-point numbers, we can reduce how much memory the model requires and how often it needs to fetch data.
We noticed that with larger batch sizes, the advantages of reduced memory bandwidth became even clearer. Saving the intermediate results in a higher-precision format still maintained the performance gains of using 8-bit quantization, ensuring that we didn’t lose precision during calculations.
Conclusion
Quantization can be a powerful tool for enhancing the efficiency of deep learning models, especially when implemented correctly in a system like TVM. By understanding the strengths and weaknesses of computation and memory-bound tasks, we can better apply quantization to achieve significant performance improvements.
Through careful tuning and fixing issues within the model, we can turn quantization into an asset rather than a liability. This work opens up avenues for further optimizations and sets the stage for using these powerful techniques in real-world applications.
Title: Analyzing Quantization in TVM
Abstract: There have been many papers in the academic literature on quantizing weight tensors in deep learning models to reduce inference latency and memory footprint. TVM also has the ability to quantize weights and support low-bit computations. Although quantization is typically expected to improve inference time, in TVM the performance of 8-bit quantization does not meet expectations. When applying 8-bit quantization to a deep learning model, it is usually expected to achieve around 50% of the full-precision inference time. However, in this particular case, not only does the quantized version fail to achieve the desired performance boost, but it actually performs worse, resulting in an inference time that is about 2 times as slow as the non-quantized version. In this project, we thoroughly investigate the reasons behind the underperformance and assess the compatibility and optimization opportunities of 8-bit quantization in TVM. We discuss the optimization of two different types of tasks: computation-bound and memory-bound, and provide a detailed comparison of various optimization techniques in TVM. Through the identification of performance issues, we have successfully improved quantization by addressing a bug in graph building. Furthermore, we analyze multiple optimization strategies to achieve the optimal quantization result. The best experiment achieves 163.88% improvement compared with the TVM compiled baseline in inference time for the compute-bound task and 194.98% for the memory-bound task.
Authors: Mingfei Guo
Last Update: 2023-08-19 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2308.10905
Source PDF: https://arxiv.org/pdf/2308.10905
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.