Manifest AI's Brumby Model Replaces Attention with Power Retention
When the transformer architecture first emerged in 2017 through Google's landmark paper 'Attention Is All You Need,' it fundamentally reshaped artificial intelligence, establishing attention mechanisms as the bedrock of modern large language models. From OpenAI's GPT series to Anthropic's Claude and Meta's Llama, every major AI system has relied on this mathematical operation, which lets models dynamically weigh the importance of different parts of their input.

Yet eight years into this architectural dominance, attention is revealing a fundamental limitation: its quadratic computational scaling makes processing long contexts increasingly prohibitive. The bottleneck has become particularly problematic as AI ambitions expand to reasoning across massive codebases, lengthy documents, and even hour-long video streams, where attention's memory demands grow quadratically with sequence length.

Enter Manifest AI's Brumby-14B-Base, introduced on October 28, 2025, which replaces attention entirely with a novel mechanism called Power Retention. This 14-billion-parameter model, built as a retrained variant of Qwen3-14B-Base, achieves performance parity with established transformers while operating at constant computational cost per token: the per-token resource requirements are the same whether it processes 1,000 tokens or 1,000,000.

The core innovation lies in Power Retention's recurrent architecture, which maintains a memory matrix updated at each time step through learned gating signals, compressing historical information into fixed-size latent states rather than performing exhaustive pairwise comparisons across the sequence. The approach combines the expressive power of transformers with the efficiency of recurrent neural networks, using tensor powers of the input to capture higher-order dependencies while avoiding attention's computational explosion (a simplified sketch of one such update step appears below).

The economic implications make this breakthrough particularly compelling: Manifest AI trained Brumby for just 60 hours on 32 Nvidia H100 GPUs at a total cost of approximately $4,000, less than 2% of conventional training expenses for models of comparable scale. However, as founder Jacob Buckman clarified, this efficiency depends critically on leveraging existing transformer weights; training from scratch would require substantially more investment.

The retraining process involved approximately 3,000 steps to recalibrate Qwen3's original weights to function within the new Power Retention architecture, a process analogous to teaching a virtuoso pianist to play guitar: the fundamental musical understanding remains, but the physical execution requires adaptation. Benchmark results demonstrate Brumby's capabilities: while it slightly trails transformers on knowledge-intensive tasks like MMLU-Pro, it matches or exceeds them in mathematical reasoning (scoring 0.62 on MATH versus Qwen3's 0.54) and long-context reasoning, precisely where attention architectures typically struggle.

The hardware efficiency gains are equally notable. Manifest's custom CUDA framework, Vidrial, achieves 80-85% hardware utilization compared to FlashAttention2's 70-75% or Mamba's 50-60%, while delivering hundred-fold speedups on extended contexts thanks to its linear-complexity operations.
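To make the recurrence described above more concrete, here is a minimal sketch of a power-retention-style update step. It is not Manifest AI's actual formulation: the projection matrices, the sigmoid gate, and the elementwise approximation of the tensor power are illustrative assumptions, chosen only to show how a fixed-size memory matrix can absorb each token's information at constant per-token cost.

```python
import numpy as np

def retention_step(state, x, W_k, W_v, W_g, p=2):
    """One toy power-retention-style update (illustrative, not Manifest AI's).

    state : (d, d) fixed-size memory matrix carried across time steps
    x     : (d,) embedding of the current token
    p     : degree of the tensor power applied to the key features
            (approximated here with an elementwise power)
    """
    k = W_k @ x                              # key features for this token
    v = W_v @ x                              # value features for this token
    g = 1.0 / (1.0 + np.exp(-(W_g @ x)))     # learned gate in (0, 1)

    k_p = k ** p                             # crude stand-in for a degree-p tensor power
    state = g[:, None] * state + np.outer(k_p, v)   # decay old memory, write new info
    y = state.T @ k_p                        # read out against the current key
    return state, y

# Per-token work is O(d^2) no matter how long the sequence is, which is the
# "constant cost per token" property described above.
d = 64
rng = np.random.default_rng(0)
W_k, W_v, W_g = (0.05 * rng.standard_normal((d, d)) for _ in range(3))
state = np.zeros((d, d))
for x in rng.standard_normal((1_000, d)):    # works identically for 1,000,000 tokens
    state, y = retention_step(state, x, W_k, W_v, W_g)
```

The point of the sketch is simply that the memory never grows with sequence length, whereas attention must store and compare against every previous token.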
This architectural shift carries significant implications for AI's future trajectory, potentially democratizing large-scale model development by enabling smaller research teams to repurpose existing transformer checkpoints without prohibitive compute budgets. Buckman projects that even 700-billion-parameter models could be retrained for $10,000-20,000 using this approach, fundamentally altering the economic landscape of AI research. The technical implementation appears deliberately accessible: companies can reportedly integrate Power Retention by installing a retention package and modifying one line of architecture code before resuming training (a rough illustration of such a swap appears below).

Yet the announcement sparked vigorous debate within research communities, with some critics arguing that the '$4,000 foundation model' framing misleadingly implied training from scratch rather than retraining existing weights. The controversy highlights the tension between technical accuracy and communicative impact in AI discourse, while underscoring the field's ongoing struggle to define what counts as genuine architectural innovation versus incremental improvement.

Beyond the immediate engineering implications, Brumby represents a philosophical shift toward modeling intelligent processes rather than mere artifacts of intelligence, aligning with Manifest's stated mission to 'train a neural network to model all human output.' While the transformer era certainly isn't over, Brumby demonstrates that viable alternatives are emerging: architectures that preserve transformer capabilities while transcending their fundamental limitations.

As the AI community grapples with scaling walls and economic constraints, Power Retention offers a compelling path forward: maintaining performance while radically improving efficiency, potentially reigniting architectural diversity after years of transformer monoculture. The true test will come as researchers independently validate these results and explore Power Retention's scaling laws, but Brumby represents the most credible challenge to attention's dominance since the transformer revolution began.
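For readers curious what the 'one line change' claim could look like in practice, here is a hypothetical PyTorch illustration. The RetentionMixer below is a toy gated recurrence standing in for whatever module Manifest's retention package actually exposes; the class name, constructor, and update rule are assumptions, and only the single swapped line inside DecoderBlock reflects the integration path the company describes.

```python
import torch
import torch.nn as nn

class RetentionMixer(nn.Module):
    """Toy stand-in for a power-retention token mixer (not Manifest AI's module)."""
    def __init__(self, d_model):
        super().__init__()
        self.k = nn.Linear(d_model, d_model, bias=False)
        self.v = nn.Linear(d_model, d_model, bias=False)
        self.g = nn.Linear(d_model, d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):                     # x: (batch, seq, d_model)
        b, t, d = x.shape
        k, v, g = self.k(x), self.v(x), torch.sigmoid(self.g(x))
        state = x.new_zeros(b, d, d)          # fixed-size memory per sequence
        ys = []
        for i in range(t):                    # constant work per token
            state = g[:, i, :, None] * state + k[:, i, :, None] * v[:, i, None, :]
            ys.append(torch.einsum("bdh,bd->bh", state, k[:, i]))
        return self.out(torch.stack(ys, dim=1))

class DecoderBlock(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        # Before: self.mixer = nn.MultiheadAttention(d_model, 8, batch_first=True)
        self.mixer = RetentionMixer(d_model)  # the single swapped line
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))
        return x + self.mlp(self.norm2(x))

block = DecoderBlock()
out = block(torch.randn(2, 32, 256))          # (batch, seq, d_model)
```

In this picture, the rest of the block (norms, MLP, residual connections) is untouched; only the token-mixing layer changes, which is why the retrained weights can carry over and then be recalibrated with a relatively short continuation of training.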
#Power Retention
#Brumby-14B
#Transformer Alternative
#Qwen3
#Long Context
#featured