Yilong Li
Email / GitHub / Google Scholar / LinkedIn
PRISM: Portable Resource-efficient Inference System for Multimodal Models on a Tiny Device
Yilong Li, Shuai Zhang, Hao Zhang, Jingyu Liu, Pan Hu, Jayaram Raghuram, Suman Banerjee
Large Multimodal Models (LMMs) have revolutionized a wide range of tasks, but deploying them on resource-constrained devices remains challenging due to limited computational power and battery life. Existing frameworks suffer from poor performance and lack GPU support on mobile and small embedded devices, and they often neglect battery efficiency. To address these challenges, we present the first attempt to explore the full potential of low-cost, tiny hardware through a software-hardware co-design that enables efficient on-device inference of LMMs supporting voice and vision modalities. By combining optimizations across hardware, software, and models, the system maximizes efficiency on extremely resource-limited devices. We invest substantial effort in hardware design, system-level implementation, and model compression, particularly in optimizing mobile GPU drivers for low-bit quantized models. We also develop a power-efficient workload offloading mechanism that dynamically allocates computation to the GPU or NPU based on resource usage and power consumption. With these techniques, a tiny device can run large language models (LLMs) and multimodal models within its constrained hardware resources by offloading workloads directly to the on-device GPU or NPU according to power and memory usage, bypassing CPU RAM; this improves inference performance and significantly reduces power consumption. Through optimized hardware integration and workload scheduling, the system supports multimodal interactions spanning language, voice, and sensor data processing, all within a single unit. The device operates independently on minimal power for extended periods, maintaining high energy efficiency and preserving privacy by processing data locally without internet connectivity. Beyond enhancing data security and privacy, running LLMs locally on such a compact device represents a major step toward making sophisticated multimodal inference accessible on the smallest and most energy-efficient platforms.
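As a rough sketch of the kind of power- and memory-aware offloading policy described above (the `choose_accelerator` function, `DeviceStatus` fields, and thresholds below are hypothetical placeholders, not the paper's actual interface or scheduler), a dispatch decision for one inference request might look like:

```python
# Hypothetical sketch: route an inference workload to the on-device GPU or NPU
# based on current power draw and free accelerator memory, falling back to the
# CPU otherwise. Thresholds and field names are illustrative, not from PRISM.
from dataclasses import dataclass

@dataclass
class DeviceStatus:
    power_mw: float      # current power draw in milliwatts
    free_mem_mb: float   # free accelerator memory in megabytes

def choose_accelerator(workload_mem_mb: float,
                       gpu: DeviceStatus,
                       npu: DeviceStatus,
                       power_budget_mw: float = 5000.0) -> str:
    """Pick the execution target for one inference request (illustrative only)."""
    candidates = []
    if npu.free_mem_mb >= workload_mem_mb and npu.power_mw < power_budget_mw:
        candidates.append(("npu", npu.power_mw))
    if gpu.free_mem_mb >= workload_mem_mb and gpu.power_mw < power_budget_mw:
        candidates.append(("gpu", gpu.power_mw))
    if not candidates:
        return "cpu"  # fall back when neither accelerator fits the budget
    # prefer whichever accelerator is currently drawing less power
    return min(candidates, key=lambda c: c[1])[0]

# Example: a 300 MB quantized workload, NPU lightly loaded, GPU near its budget
print(choose_accelerator(300,
                         gpu=DeviceStatus(power_mw=4800, free_mem_mb=512),
                         npu=DeviceStatus(power_mw=1200, free_mem_mb=1024)))
```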
Design and source code modified from Jon Barron's website.