Yilong Li

Ph.D. in Computer Sciences

University of Wisconsin-Madison, advised by Prof. Suman Banerjee

I am a systems researcher working on on-device AI, human sensing, and agentic memory for resource-constrained devices. My research builds practical systems that let small devices perceive, reason, and remember over long horizons under real-world hardware and privacy constraints.

My recent work extends efficient multimodal inference with reinforcement-learning-driven memory systems. In StoreAgent, I study how an LLM can learn a memory policy that decides online what to write, how to structure it, what context to retain, and how to recall memory packages for a downstream task solver.
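As a rough illustration of those write/retain/recall decisions, the sketch below implements them as fixed heuristics in a tiny memory store. All names, thresholds, and the eviction rule are illustrative placeholders; in StoreAgent the policy is learned, not hand-coded.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    """Illustrative memory store with hand-coded write/retain/recall rules."""
    entries: list = field(default_factory=list)  # (salience, observation) pairs
    capacity: int = 4

    def write(self, observation: str, salience: float):
        # "What to write": keep only observations above a salience threshold.
        if salience >= 0.5:
            self.entries.append((salience, observation))
            # "What to retain": evict the least salient entries over capacity.
            self.entries.sort(reverse=True)
            del self.entries[self.capacity:]

    def recall(self, k: int = 2):
        # "How to recall": package the top-k entries for a downstream solver.
        return [obs for _, obs in self.entries[:k]]
```

A learned policy replaces each of these fixed rules with a decision conditioned on the task and interaction history.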

Across my work, I build the full stack from hardware prototypes and embedded runtimes to memory managers, retrieval-facing memory stores, algorithms, and model fine-tuning. My systems span on-device AI, biometric and motion sensing, and personalized cognitive assistance, with publications in MobiCom, ICLR, NSDI, and SenSys.

News / Updates

2026.03

CRANE open-sourced for direct Apple Neural Engine inference

CRANE, our compiled runtime for the Apple Neural Engine, is now open-sourced. Built on reverse-engineered private APIs, it provides direct Python control of ANE, compiles MIL programs with baked weights, executes fused transformer blocks on ANE hardware, and caches kernels for repeated inference without requiring Core ML.

Ongoing Research

My current work centers on efficient multimodal AI systems, wireless sensing platforms, and wearable intelligence that can run reliably under tight compute, memory, and battery constraints.

Theme 01

Efficient Multimodal Inference on Edge Devices

Building cross-accelerator systems for vision-language models on small, battery-powered platforms with low-bit quantization, memory-aware scheduling, and hardware-software co-design.

Theme 02

Wireless Sensing Systems for Human-Centered Applications

Designing UWB, mmWave, and distributed MIMO sensing systems for vital sign monitoring, mobile sensing, and robust operation in real-world environments.

Theme 03

Wearable and Context-Aware Assistive AI

Exploring multimodal assistants that combine perception, on-device inference, and contextual understanding for wearable and accessibility applications.

Virgile: A Multimodal Visual Memory Assistant with Persistent Object and Face Recognition on Edge Devices

Current Project

A wearable earpiece with a camera and IMU sensor, powered by our TinyLLM hardware platform, performs fully on-device multimodal inference without requiring internet connectivity. The device runs a visual instruction model (LLaVA-OneVision) to help visually impaired or elderly users locate objects and navigate their surroundings, for example by identifying road signs or nearby landmarks. Through hardware-software co-design, the system delivers real-time, local natural language interaction. Current challenges include improving the device's positioning and reasoning capabilities, particularly its object localization and contextual understanding, to increase accuracy and reliability.

Split to Fit: Cross-Accelerator Hybrid Quantization for Efficient Video Understanding on Edge Systems

Current Project

We present a system for efficient vision-language model (VLM) inference on mobile SoCs with unified memory, exemplified by deployment on the RK3588. Our design decouples VLM execution across heterogeneous accelerators: an 8-bit vision encoder runs on the NPU, while a 4-bit language model runs on the GPU. The two modules communicate via shared DRAM buffers, avoiding PCIe overhead. To reduce token and compute load, we introduce two lightweight modules: Spatial Embedding Reduction, which compresses ViT outputs without modifying the encoder, and Temporal Attention Pooling, which fuses multi-frame embeddings to preserve temporal information at reduced frame rates. Together, these enable high-throughput inference from 60 fps input to 15 fps language output under tight memory and power constraints. Our implementation on the RK3588 achieves efficient, real-time VLM inference within sub-1 GB memory, offering a practical solution for deploying multimodal intelligence at the edge.
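To convey the idea behind Temporal Attention Pooling, the sketch below fuses per-frame embeddings into a single vector using one attention query. This is a minimal stand-in under simple assumptions (a single random query, NumPy instead of the deployed runtime), not the system's actual module.

```python
import numpy as np

def temporal_attention_pool(frame_embeddings: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Fuse (T, D) per-frame embeddings into one (D,) vector via attention.

    frame_embeddings: one embedding per video frame.
    query: a (D,) pooling query (learned in practice; arbitrary here).
    """
    d = frame_embeddings.shape[1]
    # Scaled dot-product scores between the query and each frame embedding.
    scores = frame_embeddings @ query / np.sqrt(d)
    # Softmax over the time axis (numerically stable form).
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Attention-weighted average of the frame embeddings.
    return weights @ frame_embeddings
```

Because the output is a convex combination of the frame embeddings, downstream language-model layers see a single token's worth of visual context per pooling window instead of one per frame, which is where the frame-rate reduction comes from.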

Selected Publications

Recent work on efficient multimodal inference, mobile AI benchmarking, and wireless sensing systems.

Scalable Biometric Sensing in the Wild through Distributed MIMO Radars

Yilong Li, Ramanujan K Sheshadri, Karthik Sundaresan, Eugene Chai, Yijing Zeng, Jayaram Raghuram, Suman Banerjee
MobiCom 2025

Radar-based techniques for detecting vital signs have shown promise for continuous contactless vital sign sensing and healthcare applications. However, real-world indoor environments face significant challenges for ex...