NITRO: A Game Changer for LLMs on NPUs
NITRO bridges the gap for running LLMs on NPUs, enhancing performance and efficiency.
Anthony Fei, Mohamed S. Abdelfattah
― 8 min read
Table of Contents
- What is NITRO?
- What’s Special About Meteor Lake?
- The Challenge with LLMs
- The Role of OpenVINO
- A Peek Inside OpenVINO
- How NITRO Works
- Rewriting PyTorch Models
- Extending the KV-Caches
- Moving Rotary Embeddings
- Efficient Model Conversion
- The Importance of Naming
- The Model Directory Structure
- Putting It All Together for Inference
- Benchmarking Performance
- Error Handling and Challenges
- NITRO vs. Other Libraries
- Looking Ahead
- Conclusion
- Original Source
- Reference Links
Large Language Models (LLMs) are the superstars of the tech world these days. They power everything from chatbots to research tools, and if you've ever chatted with a virtual assistant, you've met one. One of the exciting areas in technology right now is building hardware tailored to these models, and one such piece of hardware is the Neural Processing Unit (NPU).
In 2023, Intel released the Intel Core Ultra processor, codenamed Meteor Lake. This processor features three main components: a central processing unit (CPU), a graphics processing unit (GPU), and an NPU. However, there's a catch: the software available for the NPU doesn't support the dynamic needs of LLMs right out of the box. So, researchers have been looking for a way to make this work better. That's where NITRO comes in.
What is NITRO?
NITRO is a framework designed to help LLMs run on NPUs. It's built using Python and works alongside Intel's OpenVINO framework. Think of it as the friendly helper that makes sure LLMs can generate text and hold conversations efficiently on this specialized hardware.
What’s Special About Meteor Lake?
The Meteor Lake processor is like a Swiss Army knife, featuring several tiles that each serve different functions. These tiles include areas for computing, graphics, system control, and input/output handling. Now, if you're picturing a bustling city with different districts, you're not too far off!
Among these tiles, the NPU stands out. It specializes in running AI tasks, and it does so with low power consumption. To illustrate, the NPU can handle a staggering number of operations per second, which is impressive for a small device. This makes it well-suited for tasks like running LLMs. However, it also faces some challenges, primarily because it can only run static models, ones whose tensor shapes are fixed ahead of time. Imagine trying to put together a puzzle where some pieces keep changing shape while you try to fit them in!
The Challenge with LLMs
LLMs operate dynamically; think of them like a recipe where you keep adding ingredients as you cook. They constantly create new data during the text generation process. Unfortunately, the static model requirement of most NPUs doesn't jibe well with this ingredient-adding process.
So, researchers have been scratching their heads, trying to find a way to make these dynamic models work on the hardware that supports them. It's similar to trying to fit a square peg into a round hole—frustrating and often impossible.
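To see the mismatch concretely, here is a minimal sketch of greedy autoregressive decoding in plain PyTorch (a generic illustration, not NITRO's code): the input fed to the model grows by one token on every step, which is exactly the kind of shape change a static-only compiler cannot accept.

```python
import torch

# Minimal sketch of greedy autoregressive decoding.
# `model` is any causal LM mapping token ids -> logits; names are illustrative.
def generate(model, input_ids: torch.Tensor, max_new_tokens: int = 32):
    for _ in range(max_new_tokens):
        logits = model(input_ids)              # shape: (1, seq_len, vocab_size)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        # The input grows by one token each step: a dynamic shape.
        input_ids = torch.cat([input_ids, next_token], dim=-1)
    return input_ids
```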
The Role of OpenVINO
Intel's OpenVINO is a toolkit that helps developers deploy machine learning models on various Intel devices, including CPUs, GPUs, and NPUs. It represents models in its own Intermediate Representation (IR) format. However, when it comes to supporting LLMs, OpenVINO has some limitations.
The models it runs on the NPU must be static, meaning every tensor in the model needs a defined, fixed shape. When you think about the transformer architecture that LLMs use, this creates a real difficulty: during text generation the sequence a transformer processes grows with every new token, and the static requirement leaves no room for that growth.
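As a rough illustration of what "static" means in practice (the module and shapes below are placeholders, not anything from NITRO), the sketch converts a PyTorch module to OpenVINO IR with a fixed example input and compiles it for the NPU; the shapes baked in at conversion time are the only shapes the compiled model will accept.

```python
import torch
import openvino as ov

# Placeholder module standing in for a transformer block.
torch_model = torch.nn.Linear(128, 128).eval()

# Convert with a fixed example input; the resulting IR has static shapes.
example = torch.zeros(1, 128)
ov_model = ov.convert_model(torch_model, example_input=example)

# Compiling for the NPU requires every tensor shape to be fully defined.
core = ov.Core()
compiled = core.compile_model(ov_model, device_name="NPU")

# Run one inference with an input of exactly the baked-in shape.
request = compiled.create_infer_request()
result = request.infer({0: example.numpy()})
```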
A Peek Inside OpenVINO
OpenVINO is made up of specific model files that detail how the model operates. Each model is similar to a blueprint, with nodes representing various operations (like moving data around) and edges that connect them. While this structure serves many machine learning applications well, it’s less than ideal for LLMs due to their dynamic nature.
In simpler terms, if OpenVINO were a classroom, each node would represent a kid waiting for their turn to speak. But since LLMs keep adding new ‘students’ (i.e., data) every second, the setup is a bit chaotic.
How NITRO Works
Now let’s dive into how NITRO works to bridge this gap. The framework has a few guiding principles to make it effective. First, it aims to stay true to OpenVINO’s design, meaning it lets OpenVINO do most of the heavy lifting while NITRO steps in for tasks that require additional help.
Second, the framework is designed to be flexible. With so many ideas buzzing around in research, it’s important that NITRO can adapt and handle various tasks. Finally, keeping everything easy to understand and maintain is a priority. After all, no one wants to deal with a tangled mess of code that requires a degree to decipher.
Rewriting PyTorch Models
To make LLMs work well with NPUs, researchers often rewrite existing models. Imagine taking a classic novel and adapting it into an easy-to-read comic book. That’s what’s happening here. By reworking these models, they can be converted into a format that is compatible with OpenVINO.
One change involves simplifying the inputs to the models. Many existing implementations use complex input structures, and that complexity can make the conversion fail. By streamlining everything into plain tensors, it becomes much easier to move from a PyTorch model to the OpenVINO IR format.
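As a sketch of the kind of simplification meant here (illustrative code, not NITRO's actual rewrite), the wrapper below exposes a flat, tensor-only forward signature so that tracing and conversion have nothing dynamic to trip over.

```python
import torch
from torch import nn

class DecoderForExport(nn.Module):
    """Illustrative wrapper with a flat, tensor-only signature for tracing."""

    def __init__(self, decoder: nn.Module):
        super().__init__()
        self.decoder = decoder

    def forward(
        self,
        input_ids: torch.Tensor,       # (1, seq_len), int64
        attention_mask: torch.Tensor,  # (1, seq_len), int64
        position_ids: torch.Tensor,    # (1, seq_len), int64
    ) -> torch.Tensor:
        # No optional keyword arguments and no nested cache objects:
        # every input and output is a plain tensor with a known shape.
        return self.decoder(input_ids, attention_mask, position_ids)
```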
Extending the KV-Caches
The key-value (KV) cache in LLMs, which stores the attention keys and values of previous tokens so they don't have to be recomputed, normally grows with every generated token, and that's a problem when shapes must stay static. NITRO solves this by extending the caches so there is always pre-allocated space available. It's a bit like reserving a few extra chairs at a dinner party: you never know when more guests might drop by!
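As a rough illustration of the idea (not NITRO's exact implementation), the cache can be allocated once at a fixed maximum length and written into in place, so its shape never changes:

```python
import torch

MAX_SEQ = 128             # illustrative fixed capacity
N_HEADS, HEAD_DIM = 8, 64

# Pre-allocate key/value buffers with fixed shapes so the graph stays static.
k_cache = torch.zeros(1, N_HEADS, MAX_SEQ, HEAD_DIM)
v_cache = torch.zeros(1, N_HEADS, MAX_SEQ, HEAD_DIM)

def update_cache(k_cache, v_cache, new_k, new_v, pos):
    # Write the new token's keys/values into the reserved slots instead of
    # concatenating, which would change the tensor's shape every step.
    length = new_k.shape[2]
    k_cache[:, :, pos : pos + length] = new_k
    v_cache[:, :, pos : pos + length] = new_v
    return k_cache, v_cache
```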
Moving Rotary Embeddings
Another change involves rotary embeddings, the positional encoding technique that tells the model where each token sits in the sequence. NITRO moves this computation into the model itself instead of handling it separately. This adjustment helps streamline the process and keeps everything more organized.
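For reference, here is a compact sketch of standard rotary position embedding (RoPE) math in PyTorch; it shows the kind of computation being folded into the model, not NITRO's specific code.

```python
import torch

def apply_rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0):
    """Apply rotary embeddings to x of shape (batch, heads, seq, head_dim)."""
    head_dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    angles = positions.to(torch.float32)[:, None] * inv_freq[None, :]  # (seq, head_dim/2)
    cos = torch.cat([angles.cos(), angles.cos()], dim=-1)              # (seq, head_dim)
    sin = torch.cat([angles.sin(), angles.sin()], dim=-1)
    x1, x2 = x[..., : head_dim // 2], x[..., head_dim // 2 :]
    rotate_half = torch.cat([-x2, x1], dim=-1)
    return x * cos + rotate_half * sin
```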
Efficient Model Conversion
Once the models are rewritten and properly set up, they’re ready for conversion into OpenVINO IR. But there’s a catch: larger models can quickly exceed memory limits, like piling too many books on a shelf. To combat this, researchers use a method called “chunking.”
This technique involves breaking the model into smaller pieces, which can be processed one at a time instead of trying to handle everything at once. This is an efficient way to manage resources and ensures successful transitions from PyTorch models to OpenVINO.
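Here is a rough sketch of the chunking idea (the function and file names are placeholders, not NITRO's actual interface): convert a few layers at a time and write each piece to disk before moving on, so peak memory stays bounded.

```python
import torch
import openvino as ov

def convert_in_chunks(layers, example_hidden, out_dir, chunk_size=4):
    """Convert a list of transformer layers to OpenVINO IR in small groups."""
    for i in range(0, len(layers), chunk_size):
        chunk = torch.nn.Sequential(*layers[i : i + chunk_size]).eval()
        ov_chunk = ov.convert_model(chunk, example_input=example_hidden)
        # Saving each chunk immediately lets it be released from memory
        # before the next group is converted.
        ov.save_model(ov_chunk, f"{out_dir}/chunk_{i // chunk_size}.xml")
```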
The Importance of Naming
As models are being converted, naming is crucial. Just like having a well-organized filing cabinet, having clear names for each piece of the model makes everything easier to track. When nodes have descriptive names, it simplifies the process of finding and managing data throughout the model's operation.
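As a small illustration using the OpenVINO Python API (reusing the `ov_model` from the earlier conversion sketch; the "kv_cache" naming convention here is hypothetical), descriptive node names make it easy to locate specific tensors inside the converted graph:

```python
# Find nodes whose friendly names mark them as cache tensors.
kv_nodes = [
    op for op in ov_model.get_ops()
    if "kv_cache" in op.get_friendly_name()
]
for node in kv_nodes:
    print(node.get_friendly_name(), node.get_output_partial_shape(0))
```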
The Model Directory Structure
After conversion, each model is organized in a neat directory structure. This organization is essential to ensure that everything is easily accessible and well-defined. If you’ve ever tried to find your way around a messy closet, you’ll appreciate the value of a tidy setup!
Putting It All Together for Inference
Once everything is in place, NITRO sets up a standard pipeline for generating text. This is the part where it’s like a well-oiled machine, taking in inputs and producing coherent text output. The framework abstracts the complexity so that developers don’t have to worry about the nitty-gritty details.
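To make that concrete, here is a minimal sketch of what a static-shape generation loop can look like. The input names, the fixed sequence length, and the `compiled` callable are all assumptions for illustration; NITRO's actual pipeline may differ.

```python
import numpy as np

MAX_SEQ = 128  # fixed length baked into the static model (illustrative)

def static_generate(compiled, token_ids, max_new_tokens=32):
    """Greedy decoding against a static-shape model.

    `compiled` is assumed to map fixed-size (1, MAX_SEQ) id and mask buffers
    to logits of shape (1, MAX_SEQ, vocab_size).
    """
    ids = np.zeros((1, MAX_SEQ), dtype=np.int64)
    mask = np.zeros((1, MAX_SEQ), dtype=np.int64)
    n = len(token_ids)
    ids[0, :n] = token_ids
    mask[0, :n] = 1
    for _ in range(max_new_tokens):
        if n >= MAX_SEQ:
            break  # the static buffer is full
        logits = compiled({"input_ids": ids, "attention_mask": mask})["logits"]
        next_id = int(logits[0, n - 1].argmax())
        ids[0, n] = next_id
        mask[0, n] = 1
        n += 1
    return ids[0, :n]
```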
Benchmarking Performance
Researchers have been busy testing how well these models run on the NPU compared to other hardware like CPUs and GPUs. The authors set up a laptop equipped with the Meteor Lake processor and ran a series of tests, tracking how quickly different models generate text.
While the GPU might be the champion in raw speed, the NPU shows a lot of promise, especially for medium-sized models. The results reveal that while the NPU is generally slower than the GPU, it still has advantages in energy efficiency. It’s like choosing between a flashy sports car and a reliable, fuel-efficient sedan—it depends on what you value more!
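The usual metric for such comparisons is tokens per second; here is a minimal, generic way one might time it (not the paper's benchmarking harness):

```python
import time

def tokens_per_second(generate_fn, prompt_ids, new_tokens=128):
    """Time a generation call and report decoded tokens per second."""
    start = time.perf_counter()
    generate_fn(prompt_ids, max_new_tokens=new_tokens)
    elapsed = time.perf_counter() - start
    return new_tokens / elapsed
```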
Error Handling and Challenges
Despite all the progress, there are hiccups along the way. When testing various configurations, the results don't always match expectations. In particular, some weight-compression settings cause problems, and errors pop up under certain combinations of options.
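For context, weight compression in the OpenVINO ecosystem is typically done with NNCF; the sketch below compresses a converted model's weights to 4-bit (the file path is illustrative, and whether a given mode runs cleanly on the NPU is exactly the kind of configuration that can fail).

```python
import nncf
import openvino as ov

core = ov.Core()
ov_model = core.read_model("model.xml")  # illustrative path

# Compress weights to 4-bit; some mode/device combinations may be unsupported.
compressed = nncf.compress_weights(ov_model, mode=nncf.CompressWeightsMode.INT4_SYM)
ov.save_model(compressed, "model_int4.xml")
```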
But fear not! This is part of the journey in technology development. Just like a chef sometimes has to tweak their recipe, researchers must adjust their methods to overcome these challenges.
NITRO vs. Other Libraries
When NITRO is compared with other NPU acceleration libraries, such as Intel's NPU Acceleration Library, the reported results show it delivering noticeably faster inference times.
However, there remain areas where further development can help enhance overall efficiency and performance.
Looking Ahead
While NITRO has made great strides in enabling LLMs to run on NPUs, there’s still room for improvement. Future work might focus on refining rotary embeddings further or developing new methods to streamline the entire inference process.
The ultimate goal remains to make NPUs a go-to option for running LLMs, especially given their potential for energy efficiency. Being power-conscious is more important now than ever before, and NPUs might just be the best candidate to meet that requirement.
Conclusion
In the grand scheme of technology, developers face constant challenges in keeping pace with advances in LLMs and hardware. The ongoing work with frameworks like NITRO shows promise for future integration and optimization. As research continues and improvements are made, the hope is that we’ll see a world where energy-efficient devices can handle the heavy lifting of advanced AI without breaking a sweat.
So, while the journey has its bumps, the road ahead looks bright for NPUs, LLMs, and the tech community as a whole. After all, they say necessity is the mother of invention, and with ever-growing demands for smarter systems, we can expect exciting innovations right around the corner!
Original Source
Title: NITRO: LLM Inference on Intel Laptop NPUs
Abstract: Large Language Models (LLMs) have become essential tools in natural language processing, finding large usage in chatbots such as ChatGPT and Gemini, and are a central area of research. A particular area of interest includes designing hardware specialized for these AI applications, with one such example being the neural processing unit (NPU). In 2023, Intel released the Intel Core Ultra processor with codename Meteor Lake, featuring a CPU, GPU, and NPU system-on-chip. However, official software support for the NPU through Intel's OpenVINO framework is limited to static model inference. The dynamic nature of autoregressive token generation in LLMs is therefore not supported out of the box. To address this shortcoming, we present NITRO (NPU Inference for Transformers Optimization), a Python-based framework built on top of OpenVINO to support text and chat generation on NPUs. In this paper, we discuss in detail the key modifications made to the transformer architecture to enable inference, some performance benchmarks, and future steps towards improving the package. The code repository for NITRO can be found here: https://github.com/abdelfattah-lab/nitro.
Authors: Anthony Fei, Mohamed S. Abdelfattah
Last Update: 2024-12-15 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2412.11053
Source PDF: https://arxiv.org/pdf/2412.11053
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.
Reference Links
- https://github.com/goodfeli/dlbook_notation
- https://github.com/abdelfattah-lab/nitro
- https://docs.openvino.ai/2024/index.html
- https://huggingface.co/docs/optimum/en/intel/index
- https://github.com/intel/intel-npu-acceleration-library
- https://github.com/intel/intel-npu-acceleration-library/blob/main/src/bindings.cpp
- https://github.com/meta-llama/llama-models/blob/main/models/llama3/reference_impl/model.py
- https://github.com/abdelfattah-lab/nitro/tree/main/nitro/pytorch_model
- https://github.com/intel/linux-npu-driver/releases
- https://github.com/openvinotoolkit/nncf
- https://docs.openvino.ai/2024/get-started/install-openvino/install-openvino-genai.html