Like the human brain, many modern artificial intelligence (AI) models feature multiple information-processing layers, each containing millions or even billions of parameters. These tiny internal settings are what enable large language models (LLMs) such as ChatGPT and Claude to produce complex text outputs. However, this scale comes at a cost: massive models need massive amounts of computing memory and power, making them difficult to deploy on everyday hardware such as smartphones, medical devices or industrial sensors.
According to Kaixin Xu, a Senior Research Engineer at the A*STAR Institute for Infocomm Research (A*STAR I2R), many of the parameters learned during a model’s training aren’t actually critical to its overall performance. Hence, to streamline a model, researchers often ‘prune’ it, identifying and removing redundant parameters while maintaining its capabilities.
Conventional pruning generally relies on post-training methods that apply simple rules, such as deleting the parameters with the smallest weights in each layer. While such methods are easy to implement, Xu noted that they often fail to account for how changes in one layer might negatively impact the rest of the model.
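As a rough illustration, that conventional per-layer magnitude rule can be sketched in a few lines of Python. The function below is a toy example written for this article, not code from the study; it simply zeroes out the smallest-magnitude weights of one layer in isolation, which is exactly the blind spot Xu describes.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights in a single layer.

    A toy version of the simple per-layer rule described above: each layer
    is pruned on its own, with no regard for the rest of the model.
    """
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)               # number of weights to remove
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    mask = np.abs(weights) > threshold            # keep only larger weights
    return weights * mask

# Example: prune half of a small, randomly initialised layer by magnitude alone.
layer = np.random.randn(4, 4)
pruned = magnitude_prune(layer, sparsity=0.5)
```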
In recent work, Xu and A*STAR I2R colleagues, including Principal Scientist Min Wu and former Principal Scientist Xiaoli Li, worked with collaborators at Nanyang Technological University, Singapore, to address two optimisation challenges in model pruning: finding the link between specific parameters and overall model performance, and minimising the ‘cost’ of the pruning process itself.
Drawing inspiration from rate-distortion theory—a concept in signal processing traditionally used to find the best balance between data compression and quality in image and sound files—Xu and his team developed a holistic pruning method that focused on how much the model’s final output changes after pruning.
“Instead of guessing which layers can tolerate more pruning, we directly measure how pruning affects the model’s final output, and then choose pruning levels across all layers together in a coordinated way,” Xu explained.
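To make that idea concrete, the sketch below is a heavily simplified, output-aware allocation written for illustration only; it is not the team’s published algorithm. It assumes hypothetical `model` and `prune_layer` helpers, a fixed menu of candidate sparsity levels, and roughly additive per-layer distortions, then greedily raises sparsity wherever the final output is disturbed the least until an overall budget is met.

```python
import numpy as np

def output_distortion(model, prune_layer, calib_x, layer_idx, sparsity):
    """How far the final output moves when a single layer is pruned.

    `model` is a callable returning the network's output for `calib_x`;
    `prune_layer(model, i, s)` returns a copy with layer i pruned to
    sparsity s. Both are hypothetical placeholders, not the published method.
    """
    baseline = model(calib_x)
    pruned = prune_layer(model, layer_idx, sparsity)
    return float(np.mean((baseline - pruned(calib_x)) ** 2))

def allocate_sparsity(model, prune_layer, calib_x, n_layers, budget,
                      candidates=(0.0, 0.25, 0.5, 0.75)):
    """Coordinated, output-aware choice of pruning levels across all layers.

    Greedily raises sparsity in whichever layer currently adds the least
    extra output distortion, until the average sparsity reaches `budget`.
    Treats per-layer distortions as roughly additive, which is a simplification.
    """
    levels = [0] * n_layers  # index into `candidates` for each layer
    # Distortion for every (layer, candidate sparsity) pair, measured once.
    table = [[output_distortion(model, prune_layer, calib_x, i, s)
              for s in candidates] for i in range(n_layers)]
    while sum(candidates[l] for l in levels) / n_layers < budget:
        # Marginal distortion of bumping each layer one level higher.
        step = [table[i][levels[i] + 1] - table[i][levels[i]]
                if levels[i] + 1 < len(candidates) else np.inf
                for i in range(n_layers)]
        best = int(np.argmin(step))
        if np.isinf(step[best]):
            break  # every layer is already at its maximum candidate level
        levels[best] += 1
    return {i: candidates[levels[i]] for i in range(n_layers)}
```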
Putting their method to the test, the team found that the underlying optimisation problem could be solved through two proposed algorithms, one of which runs very quickly even without graphics processing units (GPUs): specialised and increasingly costly hardware components for AI. This efficiency could be critical in deployment-constrained environments where memory and power consumption are the primary bottlenecks.
“Our method works during the post-training stage and can be guided by practical targets such as computation cost,” Xu said. “This makes it easier to adapt large, high-quality models for deployment without needing specialised hardware or complex training pipelines.”
The researchers were also heartened by their success in adopting aspects of classical information theory for their pruning approach, given the relative complexity of today’s deep neural networks (DNNs). “This encourages future research to borrow more ideas from the field, instead of relying mainly on trial-and-error pruning rules,” Xu added.
Looking ahead, Xu and the team are exploring how they could apply their pruning approach to LLMs and their visual counterparts, while also developing hardware-side optimisations to unlock their algorithms’ full potential.
The A*STAR-affiliated researchers contributing to this research are from the A*STAR Institute for Infocomm Research (A*STAR I2R).