Dnotitia’s STAR-KV Tackles Long-Context AI Memory Bottlenecks
Selected as a Spotlight paper at ICML 2026, Dnotitia’s new STAR-KV framework achieves up to 20x compression of the KV cache. By combining low-rank approximation with mixed-precision quantization, the method addresses the memory constraints that currently limit long-context AI performance and inference speed.
The research, a collaboration between Dnotitia and UC San Diego’s VVIP Lab, focuses on the KV cache, the GPU-resident store of attention keys and values that spares a large language model from recomputing previously processed tokens. As AI agents ingest ever larger amounts of external data, this cache has become a primary bottleneck. Using a LLaMA-3.1-8B model with a 128K-token context, the team found that the KV cache consumes roughly 81% of total GPU memory, a finding that underscores the need for compression.
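For a sense of scale, the size of an uncompressed KV cache can be estimated from a model's attention shape. The sketch below is a back-of-the-envelope calculation, not taken from the paper. It assumes the publicly documented LLaMA-3.1-8B configuration (32 layers, 8 key/value heads, head dimension 128), 16-bit cache entries, and roughly 8.03 billion 16-bit weights. The batch size is left as a free parameter because the article does not state the serving setup behind the 81% figure, and the function names are illustrative.

```python
# Back-of-the-envelope KV cache sizing.
# ASSUMPTION: model-shape values come from the public LLaMA-3.1-8B
# configuration, not from the STAR-KV paper.

GIB = 1024 ** 3


def kv_cache_bytes(tokens, batch=1, layers=32, kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for `tokens` tokens per sequence."""
    # Factor 2 covers one key vector and one value vector per head per layer.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * tokens * batch


def kv_share(tokens, batch=1, params=8.03e9, bytes_per_param=2):
    """Fraction of (weights + KV cache) memory taken by the KV cache.

    Ignores activations and framework overhead, so it is only a rough guide.
    """
    kv = kv_cache_bytes(tokens, batch)
    weights = params * bytes_per_param
    return kv / (kv + weights)


if __name__ == "__main__":
    context = 128 * 1024  # 128K tokens
    for batch in (1, 2, 4):
        full = kv_cache_bytes(context, batch)
        print(f"batch={batch}: KV cache {full / GIB:.1f} GiB, "
              f"share {kv_share(context, batch):.0%}, "
              f"at 20x compression {full / 20 / GIB:.2f} GiB")
```

Under these assumptions, a single 128K-token sequence needs 16 GiB of cache, about as much as the weights themselves, and the share climbs past 80% at four concurrent sequences. The article does not say what configuration produced its 81% figure. At a 20x ratio, the 16 GiB cache would shrink to 0.8 GiB.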
STAR-KV uses custom GPU kernels to accelerate attention computation by up to 6.9x and overall generation throughput by 3.1x. Beyond the memory savings, the approach maintains higher accuracy than existing compression methods. The paper was accepted into the competitive ICML 2026 program, placing it among an elite 2.2% of submissions, and Dnotitia has already released the source code on GitHub. CEO MK Chung said the company intends to integrate the work into open-source inference frameworks such as vLLM, with the aim of lowering the operating costs of long-context AI services.
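The headline compression ratio comes from stacking the two techniques the framework combines, low-rank approximation and mixed-precision quantization. As a rough illustration only, the sketch below counts the bits needed when each token's cached vector is replaced by a smaller set of low-rank coefficients, some stored at higher precision than others. The article does not describe STAR-KV's actual rank selection or bit allocation, so every dimension, rank, and bit width here is hypothetical, and quantization metadata such as scales is ignored.

```python
# Storage accounting for low-rank + mixed-precision compression.
# ASSUMPTION: generic illustration, not STAR-KV's algorithm; all numbers
# below are hypothetical.

def compression_ratio(dim, rank_bits, tokens, base_bits=16):
    """Ratio of original to compressed storage for one cached matrix.

    dim       : width of each original per-token vector
    rank_bits : list of (coefficient_count, bits_each) groups kept per token
    tokens    : number of cached tokens sharing one projection basis
    base_bits : precision of the uncompressed cache and of the basis
    """
    rank = sum(count for count, _ in rank_bits)
    original = tokens * dim * base_bits
    coefficients = tokens * sum(count * bits for count, bits in rank_bits)
    # The rank x dim projection basis is stored once, unquantized.
    basis = rank * dim * base_bits
    return original / (coefficients + basis)


if __name__ == "__main__":
    tokens = 128 * 1024
    dim = 1024  # e.g. 8 KV heads x 128 dims, treated as one vector

    # Rank truncation alone: keep 256 of 1024 dimensions at 16 bits.
    print(compression_ratio(dim, [(256, 16)], tokens))
    # Same rank, every coefficient at 4 bits.
    print(compression_ratio(dim, [(256, 4)], tokens))
    # Mixed precision: 64 coefficients at 8 bits, 192 at 2 bits.
    print(compression_ratio(dim, [(64, 8), (192, 2)], tokens))
```

The factors multiply: a roughly 4x reduction from rank truncation combined with sub-8-bit storage already yields a double-digit ratio in this toy accounting. It says nothing about accuracy, which is where the choice of rank and bit allocation matters.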