Yilong Li
Email / GitHub / Google Scholar / LinkedIn
Tiny but Mighty: A Software-Hardware Co-Design Approach for Efficient Multimodal Inference on Battery-Powered Small Devices
Yilong Li, Shuai Zhang, Hao Zhang, Jingyu Liu, Pan Hu, Xinmiao Xiong, Suman Banerjee
Large Multimodal Models (LMMs) are inherently modular, consisting of vision and audio encoders, projectors, and large language models. Yet, they are almost always executed monolithically, which underutilizes the heterogeneous accelerators (NPUs, GPUs, DSPs) in modern SoCs and leads to high end-to-end latency. In this paper, we present NANOMIND, a hardware–software co-design inference framework for LMMs that breaks large models into modular bricks (vision, language, audio, etc.) and maps each to its ideal accelerator. The key insight is that large models can be decomposed into modular components and scheduled on the most appropriate compute units. NANOMIND performs module-level dynamic offloading across accelerators on unified-memory SoCs. By combining customized hardware design, system-level scheduling, and optimized low-bit computation kernels, we demonstrate a compact, battery-powered device capable of running LMMs entirely on-device. The prototype functions as a self-contained intelligent assistant that requires no network connectivity, while achieving higher throughput and superior power efficiency under strict resource constraints. Our design bypasses CPU bottlenecks and reduces redundant memory usage through token-aware buffer management and module-level coordination. Compared to existing implementations, NANOMIND reduces energy consumption by 42.3% and GPU memory usage by 11.2%, enabling a battery-powered device to run LLaVA-OneVision-Qwen2-0.5B with a camera for up to 20.8 hours.
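The core idea of the abstract, decomposing an LMM into bricks and dispatching each brick to the compute unit it runs best on, can be pictured as a small scheduling table. The Python sketch below is purely illustrative: the module names, accelerator labels, and the dispatch logic are hypothetical and are not NANOMIND's actual API or scheduler.

```python
# Illustrative sketch only: module names, accelerator labels, and the fallback
# policy are hypothetical and do not reflect NANOMIND's implementation.

from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Module:
    name: str            # e.g. "vision_encoder", "projector", "llm_decoder"
    preferred_unit: str  # accelerator this brick is assumed to run best on


# A decomposed LMM: each brick is scheduled independently rather than
# executing the whole model monolithically on one unit.
PIPELINE: List[Module] = [
    Module("vision_encoder", "NPU"),
    Module("audio_encoder", "DSP"),
    Module("projector", "GPU"),
    Module("llm_decoder", "GPU"),
]


def dispatch(module: Module, available: Dict[str, bool]) -> str:
    """Place the module on its preferred accelerator if free, else fall back to CPU."""
    if available.get(module.preferred_unit, False):
        return module.preferred_unit
    return "CPU"


def run_pipeline(available: Dict[str, bool]) -> None:
    # On a unified-memory SoC all units share the same buffers, so only the
    # compute placement changes from module to module.
    for module in PIPELINE:
        unit = dispatch(module, available)
        print(f"{module.name:>15} -> {unit}")


if __name__ == "__main__":
    run_pipeline({"NPU": True, "GPU": True, "DSP": True})
```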
Design and source code modified from Jon Barron's website.