AURA: Solving the KV Cache Problem for Continuous Embodied AI
AURA introduces action-gated memory to prevent VRAM bloat in robots, allowing long-term policies to run indefinitely without crashing or hallucinating.
Research
Papers that actually matter
28 articles in this section.
Explore how Adaptive Runtime Termination (ART) reduces memory bandwidth bottlenecks to improve token throughput during long-context LLM inference.
BitsMoE uses spectral energy to guide non-uniform bit allocation, potentially allowing massive MoE models to fit on consumer GPUs.
EAGLE 3.1 addresses attention drift to provide more consistent and predictable throughput for LLM inference via speculative decoding.
Together AI’s OSCAR system uses attention-aware rotation to compress KV caches to 2-bit precision, significantly expanding context windows on consumer GPUs.
A ByteDance study suggests that training multimodal models via question-answering outperforms transcription-heavy methods for analyzing long, complex documents.
An analysis of recurrent depth and Sparse MoE as a way to trade memory efficiency for gradient stability in transformer architectures.
A new study explores using multi-pass verification to recover accuracy lost in 2-bit and 3-bit quantized models, though critics argue it’s a workaround.
An OpenAI model has disproven a geometry conjecture, highlighting the shift from human intuition to high-speed automated counterexample search in mathematics.
Two AI assistants are accelerating drug retargeting by filtering medical literature, though physical lab validation remains the primary bottleneck in drug discovery.