For almost a decade, every frontier language model has carried the same hidden tax. The transformer architecture that powers GPT, Claude, and Gemini compares every token against every other token. Double the input and you roughly quadruple the attention work. That quadratic cost is why long documents get truncated, why "1 million token" context windows are expensive to actually use, and a large part of why retrieval-augmented generation exists at all.
In May 2026, a Miami startup called Subquadratic walked out of stealth with 29 million dollars and a claim that lands like a thunderclap: a frontier-class model, called SubQ, that does not pay the quadratic tax. A 12 million token context window. Roughly 50 times lower cost than leading models. If validated, it would be the most significant architectural departure since the transformer itself arrived in 2017.
Whether or not SubQ holds up under independent scrutiny, the direction it points toward matters for anyone building AI products. This guide explains the quadratic problem in plain terms, what subquadratic architectures actually change, and how to think about the shift without betting your roadmap on a single startup's benchmark.
The quadratic wall, explained simply
Attention is the mechanism that lets a model decide which earlier words matter when predicting the next one. In a standard transformer, attention is computed between every pair of tokens. For a sequence of length N, that is roughly N times N operations.
The numbers get brutal fast. Going from 1,000 to 10,000 tokens is a 10x increase in length but a 100x increase in attention compute. Push toward a million tokens and the memory and latency costs become the dominant expense of running the model. This is why most "long context" claims come with fine print: the window exists, but filling it is slow and expensive, and accuracy often degrades in the middle of very long inputs.
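To make the pairwise cost concrete, here is a minimal sketch of naive single-head attention in plain Python that counts how many query-key scores it computes. It is an illustration only: the function names and toy data are made up for this example, it is non-causal (every token sees every other token), and production kernels are organized very differently even though the number of pairs is the same.

```python
import math


def full_attention(queries, keys, values):
    """Naive single-head attention over lists of float vectors.

    Returns (outputs, n_scores), where n_scores counts query-key dot products.
    """
    dim = len(queries[0])
    outputs, n_scores = [], 0
    for q in queries:
        # One score per key: this inner loop is what makes the cost N * N.
        scores = []
        for k in keys:
            scores.append(sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim))
            n_scores += 1
        # Numerically stable softmax over the scores.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) / total
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs, n_scores


def toy_sequence(n, dim=4):
    """Deterministic toy vectors (hypothetical data, no random seed needed)."""
    return [[math.sin(i * (j + 1)) for j in range(dim)] for i in range(n)]


for n in (10, 100):
    x = toy_sequence(n)
    _, n_scores = full_attention(x, x, x)
    print(n, "tokens ->", n_scores, "scores")
```

A 10x longer toy sequence produces 100x more scores, and a real model repeats that work for every attention head in every layer.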
Every workaround you already use is a response to this wall. Chunking documents, embedding-based retrieval, summarization pipelines, sliding windows: all of them are clever ways to avoid feeding the model too much at once.
What "subquadratic" actually means
A subquadratic architecture changes how compute grows with length. Instead of scaling with N times N, the work scales closer to N, or N times the logarithm of N. The practical promise is simple: ten times more context should cost something closer to ten times more, not a hundred times more.
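The difference between these growth rates is easy to check with a few lines of arithmetic. The sketch below compares how cost grows under each scaling law when the input gets ten times longer. The `growth_factor` helper is hypothetical and ignores constant factors, which matter a great deal in practice.

```python
import math

# Idealized cost models; real systems add constant factors and other terms.
COST_LAWS = {
    "quadratic": lambda n: n * n,
    "n_log_n": lambda n: n * math.log2(n),
    "linear": lambda n: n,
}


def growth_factor(n_before, n_after, law):
    """How many times more work n_after tokens cost than n_before tokens."""
    cost = COST_LAWS[law]
    return cost(n_after) / cost(n_before)


for law in COST_LAWS:
    factor = growth_factor(10_000, 100_000, law)
    print(f"{law:>9}: 10x more tokens -> {factor:.1f}x more work")
```

Going from 10,000 to 100,000 tokens costs 100x more under the quadratic law, 12.5x under N log N, and 10x under the linear law.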
There is no single recipe. The research landscape in 2026 includes several families:
- State space models such as Mamba, which compress history into a fixed-size running state at linear cost. Fast and memory-light, but historically weaker at pulling an exact fact from an arbitrary position.
- Linear attention variants like RWKV, Gated Linear Attention, and DeltaNet, which reformulate attention so it never builds the full pairwise matrix.
- Hybrid models that interleave a few exact-attention layers with many cheap recurrent layers, balancing precise recall against efficiency.
- Sparse selection approaches, which compute attention over only a chosen subset of positions. This is the camp SubQ sits in.
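To show what "never builds the full pairwise matrix" can look like, here is a minimal sketch of the simplest kernelized form of causal linear attention. It keeps a fixed-size running state and is checked against the same math computed the quadratic way. This is a textbook-style illustration, not the formulation used by RWKV, Gated Linear Attention, or any specific model, which add gating, decay, and other changes. The `elu(x) + 1` feature map is one common choice and is an assumption here.

```python
import math


def phi(vec):
    """Positive feature map elu(x) + 1 (an assumed, commonly used choice)."""
    return [x + 1.0 if x > 0 else math.exp(x) for x in vec]


def linear_attention(queries, keys, values):
    """Causal linear attention: one pass, state size independent of length."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) * v^T
    norm = [0.0] * d_k                         # running sum of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        fk = phi(k)
        for a in range(d_k):
            norm[a] += fk[a]
            for b in range(d_v):
                state[a][b] += fk[a] * v[b]
        fq = phi(q)
        denom = sum(fq[a] * norm[a] for a in range(d_k))
        outputs.append([sum(fq[a] * state[a][b] for a in range(d_k)) / denom
                        for b in range(d_v)])
    return outputs


def pairwise_reference(queries, keys, values):
    """The same result computed with explicit pairwise weights (quadratic)."""
    outputs = []
    for t, q in enumerate(queries):
        fq = phi(q)
        weights = [sum(a * b for a, b in zip(fq, phi(keys[i])))
                   for i in range(t + 1)]
        total = sum(weights)
        outputs.append([sum(w * values[i][j] for i, w in enumerate(weights)) / total
                        for j in range(len(values[0]))])
    return outputs


# Hypothetical toy sequence: 8 tokens with 3 features each.
seq = [[math.sin(i + j) for j in range(3)] for i in range(8)]
fast = linear_attention(seq, seq, seq)
slow = pairwise_reference(seq, seq, seq)
print(max(abs(a - b) for f, s in zip(fast, slow) for a, b in zip(f, s)))
```

The running state has the same size whether the sequence holds eight tokens or eight million. That is the source of the linear cost, and also of the recall weakness: everything the model remembers has to fit in that state.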
SubQ's mechanism, which it calls Subquadratic Sparse Attention, uses content-dependent selection. Rather than comparing a query token against all positions, the model first selects which positions actually matter, then computes exact attention only over that shortlist. The company reports attention compute dropping by nearly 1,000x versus a standard transformer at 12 million tokens, and a roughly 52x speedup over FlashAttention at 1 million tokens.
The interesting design choice is that this is not pure compression. By keeping exact attention over selected positions, the approach tries to preserve the precise retrieval that state space models struggle with, while still skipping the vast majority of irrelevant comparisons.
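The select-then-attend idea can be sketched in a few lines. The example below is a generic top-k illustration, not SubQ's actual mechanism, whose details are not described here. The function name and toy data are hypothetical. Note the important caveat in the comments: this toy selector still scores every key, so it is not subquadratic itself. A real system needs a cheaper way to build the shortlist.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def select_then_attend(query, keys, values, k):
    """Pick the k best-matching positions, then run exact softmax attention
    over that shortlist only. Returns (shortlist, output_vector)."""
    # Selection step. CAVEAT: ranking every key costs O(N) per query, so this
    # toy selector is NOT subquadratic overall. It only shows the two-stage
    # structure; a real selector must avoid the full scan.
    ranked = sorted(range(len(keys)), key=lambda i: dot(query, keys[i]),
                    reverse=True)
    shortlist = sorted(ranked[:k])

    # Exact attention restricted to the shortlisted positions.
    scale = math.sqrt(len(query))
    scores = [dot(query, keys[i]) / scale for i in shortlist]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    output = [sum(w * values[i][j] for w, i in zip(weights, shortlist)) / total
              for j in range(len(values[0]))]
    return shortlist, output


# Hypothetical toy data: position 2 matches the query far better than the rest.
keys = [[1.0, 0.0], [0.0, 1.0], [5.0, 0.0], [0.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
positions, out = select_then_attend([1.0, 0.0], keys, values, k=2)
print(positions, out)
```

Because the attention over the shortlist is exact, the output is dominated by the value at the best-matching position rather than by a compressed summary of the whole sequence.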
Why long context beats retrieval for some jobs
If a model can hold 12 million tokens cheaply, a lot of today's architecture becomes optional. Consider what that span covers: an entire codebase, years of a customer's support history, a full contract set, or a quarter's worth of internal documentation, all in the prompt at once.
The advantage over retrieval-augmented generation is that nothing gets pre-chunked and nothing gets missed at the boundary. Retrieval can only surface what its similarity search happens to rank highly. A model reasoning over the whole corpus can connect a clause on page 3 to a footnote on page 900 without anyone having to anticipate that link. For tasks like cross-referencing legal documents, auditing large logs, or reasoning across a sprawling codebase, that is a genuine capability difference, not just a cost saving.
This does not kill retrieval. For knowledge bases measured in billions of tokens, you still need a retrieval layer to narrow things down. But the dividing line moves. Workloads that needed a vector database and a chunking pipeline yesterday may fit in a single context window tomorrow.
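The moving dividing line can be expressed as a simple sizing check. The sketch below is a back-of-the-envelope helper, not a recommendation engine. `choose_strategy` is a hypothetical name, the default of 4 characters per token is a rough rule of thumb for English text, and the 80 percent headroom is an arbitrary placeholder that leaves room for instructions and output.

```python
def choose_strategy(corpus_chars, context_window_tokens,
                    chars_per_token=4.0, headroom=0.8):
    """Return "full-context" if the corpus likely fits the window, else "retrieval".

    chars_per_token and headroom are assumptions; measure with your real
    tokenizer before relying on the answer.
    """
    estimated_tokens = corpus_chars / chars_per_token
    budget = context_window_tokens * headroom
    return "full-context" if estimated_tokens <= budget else "retrieval"


# A 4 MB contract set against a 12 million token window.
print(choose_strategy(4_000_000, 12_000_000))
# A knowledge base of roughly 10 billion tokens against the same window.
print(choose_strategy(40_000_000_000, 12_000_000))
```

Fitting in the window is only the first test. Whether the model stays accurate and affordable at that length is a separate question.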
The skeptic's checklist
The AI community split within hours of SubQ's announcement, and the skepticism is healthy. Architectural claims have a long history of looking spectacular in a launch post and ordinary under independent testing. Before you re-plan anything, weigh these:
- Independent benchmarks. A headline figure like 92 percent accuracy means little until someone outside the company reproduces it on public, contamination-free tests. Ask specifically about retrieval at depth, not just average scores.
- Recall at the extremes. Many efficient architectures are excellent at 100,000 tokens and quietly fall apart at 10 million. Demand needle-in-a-haystack results across the full advertised window.
- Quality, not just speed. Lower cost is easy if quality drops. The real question is whether subquadratic models match transformer reasoning, not whether they are faster.
- Ecosystem maturity. Transformers have years of tooling, fine-tuning recipes, and serving infrastructure. A new architecture starts that journey from scratch.
The honest position is that subquadratic attention is one of the most promising research directions in years, and that no single product launch settles it. Treat SubQ as a signal of where the field is heading, not a finished tool to migrate onto this quarter.
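The "recall at the extremes" check from the list above is simple enough to run yourself. Here is a minimal sketch of a needle-in-a-haystack harness. Everything in it is hypothetical: the filler text, the needle, and especially `forgetful_model`, which is a stand-in that only reads the tail of the prompt so the example runs without any real API. In practice you would pass in a function that calls the model under test.

```python
FILLER = "The quick brown fox jumps over the lazy dog. "
NEEDLE = "The vault code is 7391. "
QUESTION = "What is the vault code?"


def build_prompt(n_sentences, depth):
    """Place NEEDLE at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    position = round(depth * n_sentences)
    sentences = [FILLER] * n_sentences
    sentences.insert(position, NEEDLE)
    return "".join(sentences) + "\n" + QUESTION


def recall_grid(ask_model, lengths, depths):
    """Return {(length, depth): bool} for whether the needle was retrieved."""
    return {(n, d): "7391" in ask_model(build_prompt(n, d))
            for n in lengths for d in depths}


def forgetful_model(prompt):
    """Stand-in model that only 'sees' the last 2,000 characters. Not a real API."""
    window = prompt[-2000:]
    return "7391" if "7391" in window else "I don't know"


grid = recall_grid(forgetful_model, lengths=[10, 1000], depths=[0.0, 0.5, 1.0])
for (n, d), found in sorted(grid.items()):
    print(f"{n:>5} sentences, depth {d:.1f}: {'found' if found else 'MISSED'}")
```

The stand-in passes every short test and misses the needle at the start and middle of the long one. An average score would hide that failure pattern, which is why the grid reports each length and depth separately.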
What this means for your business
You do not need to adopt an experimental architecture to benefit from this shift. The practical moves are about staying flexible:
- Abstract your model layer. If your application talks to a model through a gateway or a thin interface rather than hard-coded calls, swapping in a cheaper long-context model later is a configuration change, not a rewrite.
- Revisit problems you abandoned for cost reasons. Some use cases were shelved because feeding enough context was too expensive. Keep a list. As context gets cheaper, that list becomes a backlog of newly viable features.
- Do not over-engineer retrieval prematurely. If your corpus already fits comfortably in current long-context windows, an elaborate chunking and re-ranking pipeline may be solving a problem you do not have yet.
- Watch cost per useful token, not headline window size. A 12 million token window is only valuable if using it is affordable and accurate. Measure the price of the context you actually consume.
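Two of the moves above, abstracting the model layer and watching cost per useful token, can be sketched together. This is one possible shape under stated assumptions: `TextModel`, `EchoModel`, `MODEL_REGISTRY`, and `get_model` are hypothetical names, `EchoModel` is a placeholder rather than a real provider client, and the price and "useful fraction" inputs are numbers you would have to measure or estimate yourself.

```python
from dataclasses import dataclass
from typing import Protocol


class TextModel(Protocol):
    """The only surface the application depends on."""

    def complete(self, prompt: str) -> str: ...


@dataclass
class EchoModel:
    """Placeholder backend; a real one would wrap a provider SDK call."""

    name: str

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt[:40]}"


# Swapping models becomes a config change instead of a rewrite.
MODEL_REGISTRY = {
    "current": EchoModel("current"),
    "long-context": EchoModel("long-context"),
}


def get_model(config: dict) -> TextModel:
    return MODEL_REGISTRY[config["model"]]


def cost_per_useful_token(input_tokens, price_per_million, useful_fraction):
    """Price paid per input token that actually contributed to the answer.

    useful_fraction is an estimate you supply (for example, the share of the
    prompt that a later audit found relevant); it is not something the model
    reports.
    """
    if not 0 < useful_fraction <= 1:
        raise ValueError("useful_fraction must be in (0, 1]")
    total_cost = input_tokens / 1_000_000 * price_per_million
    return total_cost / (input_tokens * useful_fraction)


model = get_model({"model": "long-context"})
print(model.complete("Summarize the attached contract set."))
# Placeholder numbers: 1M input tokens, 2.00 per million, half of it relevant.
print(cost_per_useful_token(1_000_000, 2.0, 0.5))
```

If only a small share of a huge prompt is relevant, the effective price per useful token climbs quickly, even when the headline price per million tokens looks low.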
The bottom line
The transformer's quadratic cost has quietly shaped every AI product decision for years, from why we chunk documents to how much a long prompt costs. Subquadratic architectures, whether SubQ specifically or one of its rivals, aim to remove that constraint, and a 12 million token window changes what is worth attempting.
The right posture is informed patience. Build systems that can adopt a better model when it proves itself, keep a backlog of context-hungry ideas ready, and judge the new wave on independent evidence rather than launch-day numbers. The quadratic wall has defined the limits of practical AI for nine years. It is finally being tested, and that is worth watching closely.
At Noqta, we help businesses architect AI systems that stay adaptable as the underlying models evolve. If you want to build on a foundation that survives the next architectural shift, let's talk.