New AI Model Replaces Attention with Power Retention
When the transformer architecture emerged in 2017 with Google's seminal paper 'Attention Is All You Need,' it fundamentally reshaped the artificial intelligence landscape, establishing the paradigm behind every major large language model since, from OpenAI's GPT series to Anthropic's Claude, Google's Gemini, and Meta's Llama. At the heart of this revolution lay the attention mechanism, a mathematical operation that lets a model dynamically weigh the importance of different parts of its input, effectively allowing it to 'focus' on relevant information across extensive contexts.

This architectural choice gave transformers unprecedented flexibility and power, catalyzing rapid advances in natural language understanding and generation. Eight years later, however, the very mechanism that propelled AI's golden age is revealing significant limitations. Attention, while powerful, is computationally expensive: its cost scales quadratically with context length, so processing a sequence twice as long demands roughly four times the compute and memory. This has become an unsustainable bottleneck as models increasingly aim to reason over lengthy documents, expansive codebases, or video streams spanning hours, making attention the architecture's Achilles' heel and spurring the search for more efficient alternatives.

On October 28, 2025, the relatively obscure AI startup Manifest AI introduced a radical departure from this status quo with Brumby-14B-Base. The model is a retrained variant of Qwen3-14B-Base, a leading open-source transformer, but its novelty lies in the complete removal of attention layers. In their place, Brumby uses a mechanism dubbed Power Retention: a recurrent, hardware-efficient architecture designed to store and update information over arbitrarily long contexts without the rapidly growing memory footprint characteristic of attention. Retrained at a remarkably low cost of approximately $4,000, the 14-billion-parameter model performs on par with established transformers such as Qwen3-14B and GLM-4.5-Air, achieving near-state-of-the-art accuracy across a range of reasoning and comprehension benchmarks and challenging the long-held assumption that attention is indispensable for high-performance AI.

The core innovation of Manifest AI's approach is the Power Retention layer. In a traditional transformer, each token produces queries, keys, and values, and a matrix operation computes pairwise similarities across the entire sequence, the step that gives attention its contextual awareness but also its heavy computational burden. Power Retention keeps the same inputs but replaces this global similarity computation with a recurrent state update. Each layer maintains a memory matrix S that is updated at every time step using the incoming keys, values, and a learned gating signal. The design resembles a recurrent neural network more than a transformer: historical information is continuously compressed into a fixed-size latent state rather than recomputed over the full context. As a result, the computational cost per token stays constant regardless of context length; whether the model is processing 1,000 or 1,000,000 tokens, the per-step resource demands do not grow, a profound departure from transformer dynamics.
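To make the contrast concrete, here is a minimal NumPy sketch of a gated recurrent state update of the kind described above. It is an illustrative simplification, not Manifest AI's actual Power Retention formulation (which additionally uses tensor powers of the input); the shapes, the constant gate value, and the function name are assumptions chosen for clarity.

```python
import numpy as np

def retention_step(S, q, k, v, g):
    """One recurrent update of a fixed-size memory matrix S.

    S: (d_k, d_v) memory matrix carried across time steps
    q, k: (d_k,) query and key for the current token
    v: (d_v,) value for the current token
    g: scalar gate in (0, 1), standing in for a learned gating signal
    Cost per token is O(d_k * d_v), independent of how many tokens came before.
    """
    S = g * S + np.outer(k, v)   # decay old memory, write the new key/value pair
    y = q @ S                    # read out: mix stored values by query/key similarity
    return S, y

# Toy run over a sequence: the state never grows with sequence length.
d_k, d_v, seq_len = 8, 8, 1000
rng = np.random.default_rng(0)
S = np.zeros((d_k, d_v))
for _ in range(seq_len):
    q, k, v = rng.normal(size=(3, d_k))
    g = 0.97                     # hypothetical fixed gate; a real model would learn this
    S, y = retention_step(S, q, k, v, g)
print(S.shape)  # (8, 8): still a fixed-size state after 1,000 tokens
```

Contrast this with attention, where the read-out at step t requires comparing the new query against all t stored keys, so both compute and memory grow with the sequence.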
Power Retention also preserves the expressive capacity that made attention successful. Because it involves tensor powers of the input, it can capture higher-order dependencies between past and present tokens, in principle retaining long-range information indefinitely while combining the efficiency of an RNN with the expressiveness of a transformer.

The training process for Brumby-14B underscores its economic viability. Manifest AI trained the model for just 60 hours on 32 Nvidia H100 GPUs, at a cost of around $4,000, less than 2% of the typical expense of training a conventional model of similar scale from scratch. As Jacob Buckman, founder of Manifest AI, clarified, this cost efficiency depends on reusing pre-existing transformer weights; training Brumby from scratch would not be feasible at that price. The retraining approach fundamentally alters Qwen3's architecture by excising its attention layers and substituting power retention layers, rewiring the model's internal processing while preserving its accumulated knowledge.

Initially, this architectural change caused Brumby to 'forget' some of its capabilities, since the original weights were tuned for attention-based dynamics. Over roughly 3,000 additional training steps, likened to a world-class pianist learning to play guitar, the model recalibrated its weights to the power retention framework, rapidly recovering its performance and matching the original Qwen3's accuracy on benchmarks. This swift adaptation shows that attention-free systems can inherit and adapt transformer capabilities with minimal retraining investment.

Benchmark evaluations show Brumby-14B-Base performing at or near parity with transformer baselines. It trails slightly on knowledge-intensive tasks such as MMLU-Pro but matches or surpasses its counterparts on mathematical reasoning (e.g., GSM8K) and long-context reasoning, areas where attention architectures typically struggle, suggesting that retention-based recurrent systems may have inherent advantages for extended logical or temporal dependencies.

Hardware efficiency is another standout benefit: Power Retention's local matrix operations give inference linear complexity in sequence length. Manifest AI's custom CUDA framework, Vidrial, reportedly achieves hardware utilization of 80–85%, ahead of FlashAttention2's 70–75% and the 50–60% of Mamba, another post-transformer architecture that uses state-space mechanisms for linear-time processing. This efficiency, combined with reduced FLOPs and memory usage on long contexts, underpins claimed speedups of up to 100 times over attention-based methods, though production-scale validation is still pending.

The $4,000 training cost for a 14-billion-parameter model signals a potential shift in foundation model economics, one that could democratize large-scale AI experimentation by letting smaller teams retrain existing checkpoints affordably. Buckman projected that retraining even 700-billion-parameter models might cost only $10,000 to $20,000, far below current transformer budgets. Integration is streamlined as well: developers can reportedly convert a transformer into a Power Retention model by installing a retention package, modifying one line of architecture code, and resuming training, with performance recovery occurring within a few GPU-hours.
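As a sanity check on the reported numbers, the short script below works out the GPU-hour price implied by the article's figures and contrasts quadratic attention scaling with constant-per-token retention. The roughly $2-per-H100-hour rate it recovers is an inference from those figures, not a quoted price.

```python
# Back-of-envelope check on the reported training cost and on why
# per-token cost matters at long context lengths.

gpu_count, hours, total_cost = 32, 60, 4000
gpu_hours = gpu_count * hours                  # 1,920 H100-hours
print(f"Implied rate: ${total_cost / gpu_hours:.2f} per GPU-hour")  # about $2.08

# Relative cost of processing one more token at position n:
# attention compares the new token against all n previous tokens,
# while a retention-style recurrence touches only its fixed-size state.
for n in (1_000, 100_000, 1_000_000):
    attention_ops = n   # grows with position -> quadratic over a full sequence
    retention_ops = 1   # constant per token -> linear over a full sequence
    print(f"context {n:>9,}: attention/retention per-token ratio = {attention_ops / retention_ops:,.0f}x")
```

On this accounting, the headline $4,000 is a retraining budget layered on top of Qwen3's original pretraining, which is exactly the caveat Buckman himself raises.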
The kernels are Triton-based and compatible with both NVIDIA and AMD accelerators, with ongoing efforts to integrate them into inference engines such as vLLM. Distributed inference and context-parallel training are reportedly more straightforward with this recurrent-state architecture. Manifest AI's broader mission, as Buckman outlined it, is to model all human output by focusing on intelligent processes rather than mere artifacts, a goal he argues requires architectural reinvention.

The Brumby release ignited debate on social media, with critics such as Meta researcher Ariel questioning the '$4,000 foundation model' framing as misleading because of the weight reuse, while Buckman defended it as part of a transparent discussion of retraining efficiency.

Ultimately, Brumby-14B-Base is not just an engineering feat but a proof of concept that challenges transformer hegemony, suggesting that performance parity is achievable at drastically lower computational cost and that the long-context bottleneck can be overcome without exotic hardware. That could democratize AI development and rejuvenate architectural diversity after years of transformer monoculture. As Buckman noted, the transformer era persists, but this advance marks a significant stride toward more efficient and scalable AI, one that could catalyze a new wave of innovation in model design.
#featured
#Power Retention
#transformer alternative
#long context
#AI efficiency
#Brumby-14B
#Qwen3