For years, the promise of truly intelligent, conversational AI has felt out of reach. We’ve marveled at the abilities of ChatGPT, Gemini, and other large language models (LLMs) – composing poems, writing code, translating languages – but these feats have always relied on the vast processing power of cloud GPUs. Now, a quiet revolution is brewing, aiming to bring these incredible capabilities directly to the device in your pocket: an LLM on your smartphone.
This shift isn’t just about convenience; it’s about privacy, efficiency, and unlocking a new world of personalized AI experiences.
However, shrinking these massive LLMs to fit onto a device with limited memory and battery life presents a unique set of challenges. To understand this complex landscape, I spoke with Aleksei Naumov, Lead AI Research Engineer at Terra Quantum and a leading figure in the field of LLM compression.
Indeed, Naumov recently presented a paper on the subject, 'TQCompressor: Improving Tensor Decomposition Methods in Neural Networks via Permutations', at the IEEE International Conference on Multimedia Information Processing and Retrieval (IEEE MIPR 2024), where researchers, scientists, and industry professionals gather to present and discuss the latest advances in multimedia technology. The work has been hailed as a significant innovation in neural network compression.
“The main challenge is, of course, the limited main memory (DRAM) available on smartphones,” Naumov said. “Most models cannot fit into the memory of a smartphone, making it impossible to run them.”
He points to Meta’s Llama 3.1-8B model as a prime example.
“It requires approximately 15 GB of memory,” Naumov said. “However, the iPhone 16 only has 8 GB of DRAM, and the Google Pixel 9 Pro offers 16 GB. Furthermore, to operate these models efficiently, one actually needs even more memory – around 24 GB, which is offered by devices like the NVIDIA RTX 4090 GPU, starting at $1800.”
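The 15 GB figure is easy to sanity-check: it is roughly the parameter count multiplied by the bytes each parameter occupies. Here is a back-of-the-envelope sketch in Python, counting only the weights (activations and the KV cache would add more on top):

```python
# Back-of-the-envelope estimate of weight memory for an 8B-parameter model.
params = 8e9          # 8 billion parameters
bytes_per_param = 2   # FP16 stores each parameter in 2 bytes

weight_gb = params * bytes_per_param / 1e9
print(f"~{weight_gb:.0f} GB of weights in FP16")  # ~16 GB, in line with the ~15 GB quoted
```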
This memory constraint isn’t just about storage; it directly impacts a phone’s battery life.
“The more memory a model requires, the faster it drains the battery,” Naumov said. “An 8-billion parameter LLM consumes about 0.8 joules per token. A fully charged iPhone, with approximately 50 kJ of energy, could only sustain this model for about two hours at a rate of 10 tokens per second, with every 64 tokens consuming around 0.2% of the battery.”
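The arithmetic behind that estimate is straightforward to reproduce. The sketch below simply plugs in the figures Naumov quotes (roughly 50 kJ of battery energy, 0.8 J per token, 10 tokens per second):

```python
# Reproducing the quoted battery arithmetic: energy per token versus battery
# capacity gives an upper bound on continuous generation time.
battery_j = 50_000        # ~50 kJ in a fully charged iPhone (quoted figure)
energy_per_token_j = 0.8  # ~0.8 J per token for an 8B-parameter model (quoted)
tokens_per_second = 10

total_tokens = battery_j / energy_per_token_j       # ~62,500 tokens per charge
hours = total_tokens / tokens_per_second / 3600     # ~1.7 hours of continuous output
print(f"{total_tokens:,.0f} tokens, ~{hours:.1f} hours of continuous generation")
```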
So, how do we overcome these hurdles? Naumov highlights the importance of model compression techniques.
“To address this, we need to reduce model sizes,” Naumov said. “There are two primary approaches: reducing the number of parameters or decreasing the memory each parameter requires.”
He outlines strategies like distillation, pruning, and matrix decomposition to reduce the number of parameters and quantization to decrease each parameter’s memory footprint.
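To make the parameter-reduction side concrete, here is a minimal, illustrative PyTorch sketch of matrix decomposition. It is not Naumov's TQCompressor method, just the basic idea: a large linear layer is replaced by two smaller ones via a truncated SVD, cutting its parameter count at the cost of some approximation error.

```python
import torch
import torch.nn as nn

def low_rank_decompose(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Approximate a linear layer's weight W with a rank-r factorization W ~ U_r @ V_r."""
    W = layer.weight.data                           # shape: (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]                    # (out_features, rank)
    V_r = Vh[:rank, :]                              # (rank, in_features)

    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data = V_r.contiguous()
    second.weight.data = U_r.contiguous()
    if layer.bias is not None:
        second.bias.data = layer.bias.data
    return nn.Sequential(first, second)

layer = nn.Linear(4096, 4096)
compressed = low_rank_decompose(layer, rank=256)
orig_params = sum(p.numel() for p in layer.parameters())
new_params = sum(p.numel() for p in compressed.parameters())
print(f"{orig_params:,} -> {new_params:,} parameters")   # roughly an 8x reduction here
```

Judging by its title, TQCompressor refines this family of decomposition approaches with permutation-based improvements, but the sketch above is only the textbook low-rank version.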
“By storing model parameters in INT8 instead of FP16, we can reduce memory consumption by about 50%,” Naumov said.
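In practice, that switch is often a one-line change when loading a model. Here is a hedged sketch using Hugging Face transformers with bitsandbytes; the model id is just an example, 8-bit loading currently expects a CUDA-capable GPU, and actual savings vary by model.

```python
# Minimal sketch of 8-bit weight loading with transformers + bitsandbytes.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.2-3B-Instruct"      # example; substitute any causal LM
quant_config = BitsAndBytesConfig(load_in_8bit=True)  # store weights in INT8 instead of FP16

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",                             # place layers on available accelerators
)
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.1f} GB")
```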
While Google’s Pixel devices, with the TPUs built into their custom Tensor chips, seem like an ideal platform for running LLMs, Naumov cautions that they don’t solve the fundamental problem of memory limitations.
“While the Tensor Processing Units (TPUs) used in Google Pixel devices do offer improved performance when running AI models, which can lead to faster processing speeds or lower battery consumption, they do not resolve the fundamental issue of the sheer memory requirements of modern LLMs, which typically exceed smartphone memory capacities,” Naumov said.
The drive to bring LLMs to smartphones goes beyond mere technical ambition. It’s about reimagining our relationship with AI and addressing the limitations of cloud-based solutions.
“Leading models like ChatGPT-4 have over a trillion parameters,” Naumov said. “If we imagine a future where people depend heavily on LLMs for tasks like conversational interfaces or recommendation systems, it could mean about 5% of users’ daily time is spent interacting with these models. In this scenario, running GPT-4 would require deploying roughly 100 million H100 GPUs. The computational scale alone, not accounting for communication and data transmission overheads, would be equivalent to operating around 160 companies the size of Meta. This level of energy consumption and associated carbon emissions would pose significant environmental challenges.”
The vision is clear: a future where AI is seamlessly integrated into our everyday lives, providing personalized assistance without compromising privacy or draining our phone batteries.
“I foresee that many LLM applications currently relying on cloud computing will transition to local processing on users’ devices,” Naumov said. “This shift will be driven by further model downsizing and improvements in smartphone computational resources and efficiency.”
He paints a picture of a future in which LLM capabilities become as commonplace and intuitive as auto-correct is today. That transition could unlock many exciting possibilities. With local LLMs, imagine enhanced privacy, where your sensitive data never leaves your device.
Picture ubiquitous AI with LLM capabilities integrated into virtually every app, from messaging and email to productivity tools. Think of the convenience of offline functionality, allowing you to access AI assistance even without an internet connection. Envision personalized experiences where LLMs learn your preferences and habits to provide truly tailored support.
For developers eager to explore this frontier, Naumov offers some practical advice.
“First, I recommend selecting a model that best fits the intended application,” Naumov said. “Hugging Face is an excellent resource for this. Look for recent models with 1-3 billion parameters, as these are the only ones currently feasible for smartphones. Additionally, try to find quantized versions of these models on Hugging Face. The AI community typically publishes quantized versions of popular models there.”
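As a concrete starting point, a community-quantized build can be pulled down with the huggingface_hub library. The repository and file names below are examples to replace with whatever model fits your application.

```python
# Fetch an example community-quantized GGUF build from Hugging Face.
from huggingface_hub import hf_hub_download

gguf_path = hf_hub_download(
    repo_id="bartowski/Llama-3.2-3B-Instruct-GGUF",   # example quantized repo
    filename="Llama-3.2-3B-Instruct-Q4_K_M.gguf",     # example 4-bit quantized weights
)
print(gguf_path)
```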
He also suggests exploring tools like llama.cpp and bitsandbytes for model quantization and inference.
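A minimal inference sketch with the llama-cpp-python bindings for llama.cpp might look like the following; the model path and parameters are illustrative and should be tuned for the target device.

```python
# Run a quantized GGUF model locally via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-3.2-3B-Instruct-Q4_K_M.gguf",  # quantized GGUF from the previous step
    n_ctx=2048,      # context window
    n_threads=4,     # CPU threads; tune for the phone or laptop you target
)
output = llm("Summarize why on-device LLMs matter for privacy.", max_tokens=128)
print(output["choices"][0]["text"])
```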
The journey to bring LLMs to smartphones is still in its early stages, but the potential is undeniable. As researchers like Aleksei Naumov continue to push the boundaries of what’s possible, we’re on the cusp of a new era in mobile AI, one where our smartphones become truly intelligent companions, capable of understanding and responding to our needs in ways we’ve only begun to imagine.