New System for Running Large Language Models on Smartphones
A breakthrough system allows fast LLM operations on smartphones, enhancing user privacy.
This article describes a new system designed to run large language models (LLMs) quickly on smartphones. These models can be very large, often exceeding the memory available on a phone. The system cleverly uses the different types of computing resources available on the phone to handle the model's demands.
Key Features of the System
The system includes several important features. It breaks complex calculations down into smaller parts, allowing it to use the phone's varied computing resources more effectively. It has a specialized engine that adapts how it works to the model being run. Additionally, it caches frequently used data to speed up operations and minimize delays caused by reading from memory or storage.
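To make the decomposition idea concrete, here is a minimal Python sketch of splitting a weight matrix into small neuron clusters that can be scheduled independently. The cluster size and the NumPy-based representation are illustrative assumptions, not the paper's actual data structures.

    import numpy as np

    def split_into_neuron_clusters(weight_matrix, cluster_size=8):
        """Split the rows (output neurons) of a weight matrix into small clusters.

        Each cluster can then be scheduled independently, e.g. on different
        processors, cached on its own, or skipped if its neurons are inactive.
        """
        clusters = []
        for start in range(0, weight_matrix.shape[0], cluster_size):
            clusters.append(weight_matrix[start:start + cluster_size])
        return clusters

    def apply_clusters(clusters, x):
        """Multiply the input by each cluster separately and stitch the results."""
        return np.concatenate([c @ x for c in clusters])

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        W = rng.standard_normal((32, 16))   # 32 output neurons, 16 inputs
        x = rng.standard_normal(16)

        clusters = split_into_neuron_clusters(W, cluster_size=8)
        y = apply_clusters(clusters, x)
        assert np.allclose(y, W @ x)        # same result as the full matrix product

Because each cluster is just a slice of the original matrix, the stitched-together result is identical to the full matrix product; the benefit comes from being able to schedule, cache, or skip individual clusters.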
With this design, the system supports a wide range of language models on different smartphones and can run up to 27.8 times faster than other leading frameworks currently available. Remarkably, it is the first system able to run a model named TurboSparse-Mixtral-47B on a smartphone, generating text at 11.68 tokens per second.
The Rise of Large Language Models
Large language models have changed how we interact with technology. These models can understand and generate human-like text, making them useful for many tasks. However, the most sophisticated models need powerful computers in data centers, equipped with advanced graphics processing units (GPUs) and large amounts of memory.
As smartphones become more capable, researchers are looking for ways to run these models directly on phones. Doing so would allow the phone to act as a smart assistant, using personal data without needing to send it to the cloud, which helps protect user privacy.
Challenges of Running LLMs on Smartphones
Despite their advantages, smartphones face big challenges in running LLMs. They typically have less processing power and memory than high-end computers. Attempts to use smaller models often sacrifice capability. For example, Google's Gemini Nano model is scaled down to fit a phone's memory, but it does not perform as well as larger models.
Other methods help lower the memory and computing needs of LLMs. One such approach was designed for personal computers, but it struggles with the more limited hardware of smartphones: because mobile storage is slower and less efficient, it often becomes a bottleneck whenever the system needs to read data, causing delays in processing.
Introducing the New System
The new system is designed to run large models on smartphones even when they exceed memory limits. It is built on top of previous work that focused on efficiently using limited resources. By recognizing that not all parts of a large model need to be active at once, the system can work with only a selected group of neurons, which are the building blocks of the model.
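As a rough illustration of working with only a selected group of neurons, the sketch below computes a layer's output using just the rows whose activations are predicted to be non-zero. The threshold-based predictor and NumPy layout are simplifications chosen for illustration; the actual system relies on sparsely activated models and learned predictors.

    import numpy as np

    def predict_active_neurons(x, predictor_weights, threshold=0.0):
        """Toy activation predictor: mark a neuron as active if a cheap score
        exceeds a threshold. Real systems use small learned predictors."""
        scores = predictor_weights @ x
        return np.nonzero(scores > threshold)[0]

    def sparse_layer_forward(W, x, active_idx):
        """Compute only the rows of W for predicted-active neurons; all other
        outputs are treated as zero (ReLU-style sparsity)."""
        y = np.zeros(W.shape[0])
        y[active_idx] = np.maximum(W[active_idx] @ x, 0.0)
        return y

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        W = rng.standard_normal((64, 32))
        x = rng.standard_normal(32)
        active = predict_active_neurons(x, W)   # here the predictor reuses W itself
        print(f"{len(active)} of {W.shape[0]} neurons predicted active")
        print(sparse_layer_forward(W, x, active)[:5])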
The system's ability to adapt to the unique hardware of smartphones means it can optimize the speed of generating responses. It achieves this by using different processing strategies depending on the current stage, whether it is preparing the input or actually generating the response.
Memory and Storage Solutions
One of the big challenges is the limited memory available on smartphones. To cope with this, the system uses memory effectively by caching frequently used data. It also introduces a technique that allows for a better balance between reading data from memory and performing calculations. This means it can minimize the amount of time spent waiting for data to load, thus speeding up the overall process.
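A minimal sketch of the caching idea follows: keep recently used neuron clusters in memory and fall back to slow storage only on a miss. The simple LRU policy and the load_from_storage callback are assumptions for illustration; the paper describes a segmented cache tailored to LLM access patterns.

    from collections import OrderedDict

    class NeuronCache:
        """Tiny LRU cache for neuron-cluster weights kept in memory."""

        def __init__(self, capacity, load_from_storage):
            self.capacity = capacity
            self.load_from_storage = load_from_storage  # slow path, e.g. flash reads
            self.entries = OrderedDict()
            self.hits = 0
            self.misses = 0

        def get(self, cluster_id):
            if cluster_id in self.entries:
                self.entries.move_to_end(cluster_id)    # mark as recently used
                self.hits += 1
                return self.entries[cluster_id]
            self.misses += 1
            weights = self.load_from_storage(cluster_id)
            self.entries[cluster_id] = weights
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)        # evict least recently used
            return weights

    if __name__ == "__main__":
        cache = NeuronCache(capacity=2, load_from_storage=lambda cid: f"weights-{cid}")
        for cid in [0, 1, 0, 2, 0, 1]:
            cache.get(cid)
        print(f"hits={cache.hits} misses={cache.misses}")

Since every miss means a read from slow mobile storage, even a modest hit rate translates directly into fewer I/O stalls during generation.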
The system relies on carefully planned reading and processing strategies that take into account how the smartphone's memory and storage interact. This planning happens automatically the first time a new model is run on a smartphone. By analyzing both the model and the hardware's capabilities, the system creates a detailed plan that optimizes performance.
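One way to picture this offline planning step is sketched below: profile the device once, then emit a plan that fixes where each stage runs and how much memory the cache may use. All of the profile fields, plan fields, and numbers here are hypothetical placeholders, not values or interfaces from the paper.

    from dataclasses import dataclass

    @dataclass
    class HardwareProfile:
        npu_gflops: float       # measured dense-compute throughput
        cpu_gflops: float       # measured sparse-compute throughput
        flash_read_mb_s: float  # measured storage bandwidth
        free_memory_mb: int     # memory the runtime may use

    @dataclass
    class ExecutionPlan:
        prefill_device: str
        decode_device: str
        neuron_cache_mb: int

    def make_plan(hw: HardwareProfile, model_size_mb: int) -> ExecutionPlan:
        """Toy planner: prefer the NPU for the batch-friendly prefill stage,
        the CPU for sparse decoding, and give the cache whatever memory remains
        after reserving a safety margin."""
        cache_mb = max(0, min(model_size_mb, hw.free_memory_mb - 512))
        return ExecutionPlan(
            prefill_device="npu" if hw.npu_gflops > hw.cpu_gflops else "cpu",
            decode_device="cpu",
            neuron_cache_mb=cache_mb,
        )

    if __name__ == "__main__":
        hw = HardwareProfile(npu_gflops=2000, cpu_gflops=300,
                             flash_read_mb_s=1500, free_memory_mb=8192)
        print(make_plan(hw, model_size_mb=24000))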
How the New System Works
The new framework handles two key stages: prefill and decoding. During the prefill stage, the entire input is processed at once, while the decoding stage generates one token at a time, each based on the previous one. Each stage has its own computational profile, and the system optimizes for each one individually.
In the prefill phase, the system uses the full capabilities of the smartphone's processing units, since this phase can process large batches of data efficiently. In contrast, the decoding phase handles a small amount of data at each step and must do so quickly, so it leans on the parts of the hardware better suited to small, sparse workloads.
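Below is a schematic of how the two stages might be routed to different processors, in the spirit of the paper's dense-prefill-on-NPU and sparse-decode-on-CPU split. The run_on_npu and run_on_cpu functions are placeholder stubs, not real device APIs.

    def run_on_npu(token_ids):
        """Placeholder for a dense, batch-friendly computation on the NPU."""
        return f"npu-processed {len(token_ids)} tokens"

    def run_on_cpu(token_id):
        """Placeholder for a sparse, latency-oriented computation on the CPU."""
        return token_id + 1   # pretend the "next token" is just id + 1

    def prefill(prompt_token_ids):
        # Prefill sees the whole prompt at once, so large dense batches pay off.
        return run_on_npu(prompt_token_ids)

    def decode(last_token_id, max_new_tokens=5):
        # Decoding emits one token at a time; each step depends on the previous one.
        generated = []
        token = last_token_id
        for _ in range(max_new_tokens):
            token = run_on_cpu(token)
            generated.append(token)
        return generated

    if __name__ == "__main__":
        prompt = [101, 7592, 2088, 102]
        print(prefill(prompt))
        print(decode(prompt[-1]))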
Performance Evaluation
The system was tested on two smartphone models, the OnePlus 12 and the OnePlus Ace 2, which have different processing capabilities. It supports a variety of LLMs ranging from 7 billion to 47 billion parameters. The results show consistent speedups over existing frameworks, demonstrating that it can operate effectively on mobile hardware.
In particular, even when the smartphones had enough memory to hold a model entirely, the system significantly reduced the amount of memory needed while still providing fast inference. For instance, with smaller models it achieved nearly a 40% reduction in memory usage while matching the performance of other competitive systems.
Real-World Task Performance
The system's performance was also tested on real-world tasks such as multi-turn dialogue, code generation, and math problem solving. It consistently showed robust decoding speeds across these tasks. Even when the memory was limited, it performed better than other systems, proving its effectiveness in handling practical applications.
Conclusion
This new framework represents a significant step forward in the ability to run large language models on smartphones. By adapting to the unique characteristics of mobile hardware and intelligently managing computations and data storage, it can offer impressive performance while respecting device limitations. As it continues to evolve, the system promises to unlock even greater capabilities for personal devices in understanding and generating human-like text, paving the way for a more intelligent and responsive mobile experience.
Title: PowerInfer-2: Fast Large Language Model Inference on a Smartphone
Abstract: Large language models (LLMs) on smartphones enable real-time AI assistance and privacy-preserving, offline operation. However, resource constraints of smartphones limit current deployments to small language models (SLMs), significantly compromising their capabilities. This paper introduces PowerInfer-2, a smartphone-based framework that enables fast inference for LLMs exceeding the memory capacity. The key insight is decomposing matrix operations into neuron clusters as the basic processing unit, which enables flexible scheduling and efficient I/O-computation pipelining. PowerInfer-2 leverages this neuron-cluster-based design in both computation and storage. For computation, neuron clusters with dense activations are processed on NPU, while sparse clusters use CPU. The storage engine provides a fine-grained pipeline mechanism that coordinates cluster-level computation and I/O operations, enhanced by a segmented neuron cache to reduce I/O activities. PowerInfer-2 achieves up to a 27.8x speed increase compared to state-of-the-art frameworks. PowerInfer-2 is the first system to serve a 47B LLM on a smartphone, achieving 11.68 tokens/s. Notably, these performance improvements preserve model quality with negligible accuracy degradation.
Authors: Zhenliang Xue, Yixin Song, Zeyu Mi, Le Chen, Yubin Xia, Haibo Chen
Last Update: 2024-12-12 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2406.06282
Source PDF: https://arxiv.org/pdf/2406.06282
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.