New System for Running Large Language Models on Smartphones
A breakthrough system allows fast LLM operations on smartphones, enhancing user privacy.
This article describes a new system designed to run large language models (LLMs) quickly on smartphones. These models can be very large, often exceeding the memory available on a phone. The system cleverly uses the different types of computing resources available on the phone to handle the model's demands.
Key Features of the System
The system includes several important features. It breaks complex calculations down into smaller parts, allowing it to use the phone's varied computing resources more effectively. It has a specialized engine that adapts how it works to the model being run. Additionally, it caches frequently used data to speed up operations and minimize delays caused by reading from memory or storage.
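To make the decomposition idea concrete, here is a minimal Python sketch of splitting a weight matrix into small neuron clusters that can be scheduled independently. The cluster size and the NumPy-based representation are illustrative assumptions, not the paper's actual data structures.

    import numpy as np

    def split_into_neuron_clusters(weight_matrix, cluster_size=8):
        """Split the rows (output neurons) of a weight matrix into small clusters.

        Each cluster can then be scheduled independently, e.g. on different
        processors, cached on its own, or skipped if its neurons are inactive.
        """
        clusters = []
        for start in range(0, weight_matrix.shape[0], cluster_size):
            clusters.append(weight_matrix[start:start + cluster_size])
        return clusters

    def apply_clusters(clusters, x):
        """Multiply the input by each cluster separately and stitch the results."""
        return np.concatenate([c @ x for c in clusters])

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        W = rng.standard_normal((32, 16))   # 32 output neurons, 16 inputs
        x = rng.standard_normal(16)

        clusters = split_into_neuron_clusters(W, cluster_size=8)
        y = apply_clusters(clusters, x)
        assert np.allclose(y, W @ x)        # same result as the full matrix product

Because each cluster is just a slice of the original matrix, the stitched-together result is identical to the full matrix product; the benefit comes from being able to schedule, cache, or skip individual clusters.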
With this design, the system supports a wide range of language models on different smartphones and can run up to 27.8 times faster than other leading frameworks currently available. Remarkably, it is the first system able to run a model named TurboSparse-Mixtral-47B on a smartphone, generating text at 11.68 tokens per second.
The Rise of Large Language Models
Large language models have changed how we interact with technology. These models can understand and generate human-like text, making them useful for many tasks. However, the most sophisticated models need powerful computers in data centers, equipped with advanced graphics processing units (GPUs) and large amounts of memory.
As smartphones become more capable, researchers are looking for ways to run these models directly on phones. Doing so would allow the phone to act as a smart assistant, using personal data without needing to send it to the cloud, which helps protect user privacy.
Challenges of Running LLMs on Smartphones
Despite their advantages, smartphones face big challenges in running LLMs. They typically have less processing power and memory than high-end computers. Attempts to use smaller models often sacrifice capability. For example, Google's Gemini Nano model is scaled down to fit a phone's memory, but it does not perform as well as larger models.
Other methods help lower the memory and computing needs of LLMs. One such approach was designed for personal computers, but it struggles with the more limited hardware of smartphones: because mobile storage is slower and less efficient, it often becomes a bottleneck whenever the system needs to read data, causing delays in processing.
Introducing the New System
The new system is designed to run large models on smartphones even when they exceed memory limits. It is built on top of previous work that focused on efficiently using limited resources. By recognizing that not all parts of a large model need to be active at once, the system can work with only a selected group of neurons, which are the building blocks of the model.
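As a rough illustration of working with only a selected group of neurons, the sketch below computes a layer's output using just the rows whose activations are predicted to be non-zero. The threshold-based predictor and NumPy layout are simplifications chosen for illustration; the actual system relies on sparsely activated models and learned predictors.

    import numpy as np

    def predict_active_neurons(x, predictor_weights, threshold=0.0):
        """Toy activation predictor: mark a neuron as active if a cheap score
        exceeds a threshold. Real systems use small learned predictors."""
        scores = predictor_weights @ x
        return np.nonzero(scores > threshold)[0]

    def sparse_layer_forward(W, x, active_idx):
        """Compute only the rows of W for predicted-active neurons; all other
        outputs are treated as zero (ReLU-style sparsity)."""
        y = np.zeros(W.shape[0])
        y[active_idx] = np.maximum(W[active_idx] @ x, 0.0)
        return y

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        W = rng.standard_normal((64, 32))
        x = rng.standard_normal(32)
        active = predict_active_neurons(x, W)   # here the predictor reuses W itself
        print(f"{len(active)} of {W.shape[0]} neurons predicted active")
        print(sparse_layer_forward(W, x, active)[:5])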
The system's ability to adapt to the unique hardware of smartphones means it can optimize the speed of generating responses. It achieves this by using different processing strategies depending on the current stage, whether it is preparing the input or actually generating the response.
Memory and Storage Solutions
One of the big challenges is the limited memory available on smartphones. To cope with this, the system uses memory effectively by caching frequently used data. It also introduces a technique that allows for a better balance between reading data from memory and performing calculations. This means it can minimize the amount of time spent waiting for data to load, thus speeding up the overall process.
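A minimal sketch of the caching idea follows: keep recently used neuron clusters in memory and fall back to slow storage only on a miss. The simple LRU policy and the load_from_storage callback are assumptions for illustration; the paper describes a segmented cache tailored to LLM access patterns.

    from collections import OrderedDict

    class NeuronCache:
        """Tiny LRU cache for neuron-cluster weights kept in memory."""

        def __init__(self, capacity, load_from_storage):
            self.capacity = capacity
            self.load_from_storage = load_from_storage  # slow path, e.g. flash reads
            self.entries = OrderedDict()
            self.hits = 0
            self.misses = 0

        def get(self, cluster_id):
            if cluster_id in self.entries:
                self.entries.move_to_end(cluster_id)    # mark as recently used
                self.hits += 1
                return self.entries[cluster_id]
            self.misses += 1
            weights = self.load_from_storage(cluster_id)
            self.entries[cluster_id] = weights
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)        # evict least recently used
            return weights

    if __name__ == "__main__":
        cache = NeuronCache(capacity=2, load_from_storage=lambda cid: f"weights-{cid}")
        for cid in [0, 1, 0, 2, 0, 1]:
            cache.get(cid)
        print(f"hits={cache.hits} misses={cache.misses}")

Since every miss means a read from slow mobile storage, even a modest hit rate translates directly into fewer I/O stalls during generation.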
The system relies on carefully planned reading and processing strategies that take into account how the smartphone's memory and storage interact. This planning happens automatically the first time a new model is run on a smartphone. By analyzing both the model and the hardware's capabilities, the system creates a detailed plan that optimizes performance.
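One way to picture this offline planning step is sketched below: profile the device once, then emit a plan that fixes where each stage runs and how much memory the cache may use. All of the profile fields, plan fields, and numbers here are hypothetical placeholders, not values or interfaces from the paper.

    from dataclasses import dataclass

    @dataclass
    class HardwareProfile:
        npu_gflops: float       # measured dense-compute throughput
        cpu_gflops: float       # measured sparse-compute throughput
        flash_read_mb_s: float  # measured storage bandwidth
        free_memory_mb: int     # memory the runtime may use

    @dataclass
    class ExecutionPlan:
        prefill_device: str
        decode_device: str
        neuron_cache_mb: int

    def make_plan(hw: HardwareProfile, model_size_mb: int) -> ExecutionPlan:
        """Toy planner: prefer the NPU for the batch-friendly prefill stage,
        the CPU for sparse decoding, and give the cache whatever memory remains
        after reserving a safety margin."""
        cache_mb = max(0, min(model_size_mb, hw.free_memory_mb - 512))
        return ExecutionPlan(
            prefill_device="npu" if hw.npu_gflops > hw.cpu_gflops else "cpu",
            decode_device="cpu",
            neuron_cache_mb=cache_mb,
        )

    if __name__ == "__main__":
        hw = HardwareProfile(npu_gflops=2000, cpu_gflops=300,
                             flash_read_mb_s=1500, free_memory_mb=8192)
        print(make_plan(hw, model_size_mb=24000))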
How the New System Works
The new framework handles two key stages: prefill and decoding. During the prefill stage, the entire input is processed at once, while the decoding stage generates one token at a time, each based on the previous one. Each stage has its own computational profile, and the system optimizes for each one individually.
In the prefill phase, the system uses the full capabilities of the smartphone's processing units, since this phase can process large batches of data efficiently. In contrast, the decoding phase handles a small amount of data at each step and must do so quickly, so it leans on the parts of the hardware better suited to small, sparse workloads.
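Below is a schematic of how the two stages might be routed to different processors, in the spirit of the paper's dense-prefill-on-NPU and sparse-decode-on-CPU split. The run_on_npu and run_on_cpu functions are placeholder stubs, not real device APIs.

    def run_on_npu(token_ids):
        """Placeholder for a dense, batch-friendly computation on the NPU."""
        return f"npu-processed {len(token_ids)} tokens"

    def run_on_cpu(token_id):
        """Placeholder for a sparse, latency-oriented computation on the CPU."""
        return token_id + 1   # pretend the "next token" is just id + 1

    def prefill(prompt_token_ids):
        # Prefill sees the whole prompt at once, so large dense batches pay off.
        return run_on_npu(prompt_token_ids)

    def decode(last_token_id, max_new_tokens=5):
        # Decoding emits one token at a time; each step depends on the previous one.
        generated = []
        token = last_token_id
        for _ in range(max_new_tokens):
            token = run_on_cpu(token)
            generated.append(token)
        return generated

    if __name__ == "__main__":
        prompt = [101, 7592, 2088, 102]
        print(prefill(prompt))
        print(decode(prompt[-1]))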
Performance Evaluation
The system was tested on two smartphone models, the OnePlus 12 and the OnePlus Ace 2, which have different processing capabilities. It supports a variety of LLMs ranging from 7 billion to 47 billion parameters. The results show consistent speedups over existing frameworks, demonstrating that it can operate effectively on mobile hardware.
In particular, even when the smartphones had enough memory to hold a model entirely, the system significantly reduced the amount of memory needed while still providing fast inference. For instance, with smaller models it achieved nearly a 40% reduction in memory usage while matching the performance of other competitive systems.
Real-World Task Performance
The system's performance was also tested on real-world tasks such as multi-turn dialogue, code generation, and math problem solving. It consistently showed robust decoding speeds across these tasks. Even when the memory was limited, it performed better than other systems, proving its effectiveness in handling practical applications.
Conclusion
This new framework represents a significant step forward in the ability to run large language models on smartphones. By adapting to the unique characteristics of mobile hardware and intelligently managing computations and data storage, it can offer impressive performance while respecting device limitations. As it continues to evolve, the system promises to unlock even greater capabilities for personal devices in understanding and generating human-like text, paving the way for a more intelligent and responsive mobile experience.
Title: PowerInfer-2: Fast Large Language Model Inference on a Smartphone
Abstract: Large language models (LLMs) on smartphones enable real-time AI assistance and privacy-preserving, offline operation. However, resource constraints of smartphones limit current deployments to small language models (SLMs), significantly compromising their capabilities. This paper introduces PowerInfer-2, a smartphone-based framework that enables fast inference for LLMs exceeding the memory capacity. The key insight is decomposing matrix operations into neuron clusters as the basic processing unit, which enables flexible scheduling and efficient I/O-computation pipelining. PowerInfer-2 leverages this neuron-cluster-based design in both computation and storage. For computation, neuron clusters with dense activations are processed on NPU, while sparse clusters use CPU. The storage engine provides a fine-grained pipeline mechanism that coordinates cluster-level computation and I/O operations, enhanced by a segmented neuron cache to reduce I/O activities. PowerInfer-2 achieves up to a 27.8x speed increase compared to state-of-the-art frameworks. PowerInfer-2 is the first system to serve a 47B LLM on a smartphone, achieving 11.68 tokens/s. Notably, these performance improvements preserve model quality with negligible accuracy degradation.
Authors: Zhenliang Xue, Yixin Song, Zeyu Mi, Le Chen, Yubin Xia, Haibo Chen
Last Update: 2024-12-12 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2406.06282
Source PDF: https://arxiv.org/pdf/2406.06282
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.